Identifying Trends in SQL with Linear Regression

One of the best ways to learn how a statistical model really works is to code the underlying math for it yourself. Today, we’re going to do that with simple linear regression.

Data Smart Cover

In the book Data Smart, John Foreman introduces a bunch of awesome methodologies by walking you through how to build them in Excel…

Of course, doing regression in SQL also has (some) practical use as well! For example, suppose you wanted to identify which city in a database of temperature records had the biggest warming trend in the last month. This method would send you on your way without having to bring your data into an external tool. Nifty!

The math

Simple linear regression is basically the process of finding the equation of a line (slope and intercept) that is the best fit for a series of data. We only really need to calculate two values in order to make this happen – B0 (our intercept) and B1 (our slope). Turns out, the formulas for these are pretty simple – thanks, Wikipedia!

Regression Equations

That’s all there is to it! If we can apply these formulas, we can do linear regression in SQL!

Note that, in this case, the X with a line over the top, or “X Bar,” refers to the average X value, while X subscript i refers to each individual X value. The same applies for Y. Also note that the derivatives used for calculating these equations are extraordinarily simple and are explained in this intro to Simple Linear Regression, if you’re interested in learning more. I highly recommend taking a look.

The code

Putting this into practice in SQL is pretty simple. The code below consists of three nested queries, which will enable us to calculate slope and intercept based on a set of X and Y values. The innermost query uses a windowing function to associate the value of X Bar and Y Bar with each observation. The middle query calculates B1 (the slope) using the formula listed above, and maintains the values of X Bar and Y Bar since they’re needed for the calculation of B0. The outer query simply applies the formula for B0 (the intercept) and returns both slope and intercept. That’s it!

Of course, if you wanted to group your data and apply these calculations to each group individually, modifying the parameters of the windowing functions and introducing a couple of “group by” clauses would be all that’s necessary. Super simple stuff.

Testing it out

So, does it work? Well, let’s try it out on a dummy data set I put together…

If I run a regression on this data, I get an intercept of -.14 and a slope of 1.03. Which just happens to be exactly what Excel tells me!

Regression Graph from Excel

Conclusion

Of course, this doesn’t give you the full suite of tools you’d usually get in a statistical package. For example, you don’t get statistical significance estimates, multiple regression features, R^2 calculations, etc., etc. Still, it’s a handy tool to have in your back pocket. And the process of working through the math is valuable for its own sake.

If you’re interested in seeing other algorithms implemented in SQL, check out my other post on implementing Naive Bayes in SQL.

14 Responses

  1. Paul January 14, 2016 / 11:57 pm

    This is a great article and a huge help! How could I modify the formula to duplicate what Excel does when the ‘Constant is Zero’ is checked? Thanks in advanced

    • daynebatten January 15, 2016 / 9:10 am

      The results of some quick Googling tell me the formula is this:

      Regression with no intercept.

      That can be implemented even more easily in SQL…

      select
      cast(sum(x * y) as float) / cast(sum(x ^ 2) as float) as slope
      from
      the_data;

      • Dason March 28, 2017 / 10:41 am

        That’s for regression through the origin which isn’t quite the same. That would be if you wanted to force the intercept to be 0 so that your regression line ALWAYS goes through the point (0,0).

        • daynebatten March 28, 2017 / 10:49 am

          How is it not quite the same? Admittedly, I’m not familiar with regression in Excel so perhaps that checkbox does something wildly different from what it sounds like it does, but if you set the constant to 0 in a regression equation, the line will always go through (0, 0).

          y = mx + b without the b is just y = mx. And m * 0 = 0.

  2. Horacio Diez September 7, 2016 / 5:21 pm

    Thanks for share. This is a really great job.

  3. Nigel Applewhite December 1, 2016 / 2:09 am

    i can’t get this to work – can you help! its not recognised x bar y bar and the rows function

    I am trying to identify trends in SPC – seven increasing values or seven decreasing values

    • daynebatten December 15, 2016 / 3:10 pm

      What error message are you getting? What RDBMS?

  4. Ivan March 23, 2017 / 11:05 am

    How about regression with N variables.
    Like Y = k1 * x1 + k2 * x2 + k3 * x3 …… + y0
    —————————-
    Have anybody example?

    • daynebatten March 24, 2017 / 9:48 am

      Unfortunately, multivariate regression requires more complex matrix operations that can’t be easily accomplished in straight SQL. I’m sure it could be done (the language is Turing-complete, after all), but I’m not sure you would want to…

        • daynebatten March 29, 2017 / 6:22 am

          Most of those answers appear to also be doing simple linear regression with only a slope and intercept.

          It does look like Colin Campbell’s implementation can do multivariate regression (after a very quick scan), but it’s using procedural elements of Transact-SQL in addition to vanilla SQL. If you’re in a Microsoft SQL Server environment, that could be what you need, though!

  5. "GW May 4, 2017 / 9:30 am

    How do we do (in SQL) forecasting?
    Add Rows to the Source Data with X values, manually calculating the Y value, then plotting?
    NOT in Excel or SSRS
    thanks

    • daynebatten May 11, 2017 / 1:35 pm

      Yeah, sounds like you pretty much have it down. You could have two tables – one of known X and Y values, one with known X values for which you want to forecast Y values. Run the regression on the first table. Then simply plug the results into select [beta_0] + [beta_1] * x from table_2. Boom! Predictions!

Leave a Reply

Your email address will not be published. Required fields are marked *