One of the best ways to learn how a statistical model *really* works is to code the underlying math for it yourself. Today, we’re going to do that with simple linear regression.

Of course, doing regression in SQL also has (some) practical use as well! For example, suppose you wanted to identify which city in a database of temperature records had the biggest warming trend in the last month. This method would send you on your way without having to bring your data into an external tool. Nifty!

## The math

Simple linear regression is basically the process of finding the equation of a line (slope and intercept) that is the best fit for a series of data. We only really need to calculate two values in order to make this happen – B0 (our intercept) and B1 (our slope). Turns out, the formulas for these are pretty simple – thanks, Wikipedia!

That’s all there is to it! If we can apply these formulas, we can do linear regression in SQL!

*Note that, in this case, the X with a line over the top, or “X Bar,” refers to the average X value, while X subscript i refers to each individual X value. The same applies for Y. Also note that the derivatives used for calculating these equations are extraordinarily simple and are explained in this intro to Simple Linear Regression, if you’re interested in learning more. I highly recommend taking a look.*

## The code

Putting this into practice in SQL is pretty simple. The code below consists of three nested queries, which will enable us to calculate slope and intercept based on a set of X and Y values. The innermost query uses a windowing function to associate the value of X Bar and Y Bar with each observation. The middle query calculates B1 (the slope) using the formula listed above, and maintains the values of X Bar and Y Bar since they’re needed for the calculation of B0. The outer query simply applies the formula for B0 (the intercept) and returns both slope and intercept. That’s it!

Of course, if you wanted to group your data and apply these calculations to each group individually, modifying the parameters of the windowing functions and introducing a couple of “group by” clauses would be all that’s necessary. Super simple stuff.

## Testing it out

So, does it work? Well, let’s try it out on a dummy data set I put together…

If I run a regression on this data, I get an intercept of -.14 and a slope of 1.03. Which just happens to be exactly what Excel tells me!

## Conclusion

Of course, this doesn’t give you the full suite of tools you’d usually get in a statistical package. For example, you don’t get statistical significance estimates, multiple regression features, R^2 calculations, etc., etc. Still, it’s a handy tool to have in your back pocket. And the process of working through the math is valuable for its own sake.

If you’re interested in seeing other algorithms implemented in SQL, check out my other post on implementing Naive Bayes in SQL.

PaulJanuary 14, 2016 / 11:57 pmThis is a great article and a huge help! How could I modify the formula to duplicate what Excel does when the ‘Constant is Zero’ is checked? Thanks in advanced

daynebattenJanuary 15, 2016 / 9:10 amThe results of some quick Googling tell me the formula is this:

That can be implemented even more easily in SQL…

select

cast(sum(x * y) as float) / cast(sum(x ^ 2) as float) as slope

from

the_data;

DasonMarch 28, 2017 / 10:41 amThat’s for regression through the origin which isn’t quite the same. That would be if you wanted to force the intercept to be 0 so that your regression line ALWAYS goes through the point (0,0).

daynebattenMarch 28, 2017 / 10:49 amHow is it not quite the same? Admittedly, I’m not familiar with regression in Excel so perhaps that checkbox does something wildly different from what it sounds like it does, but if you set the constant to 0 in a regression equation, the line will always go through (0, 0).

y = mx + b without the b is just y = mx. And m * 0 = 0.

Horacio DiezSeptember 7, 2016 / 5:21 pmThanks for share. This is a really great job.

Nigel ApplewhiteDecember 1, 2016 / 2:09 ami can’t get this to work – can you help! its not recognised x bar y bar and the rows function

I am trying to identify trends in SPC – seven increasing values or seven decreasing values

daynebattenDecember 15, 2016 / 3:10 pmWhat error message are you getting? What RDBMS?

IvanMarch 23, 2017 / 11:05 amHow about regression with N variables.

Like Y = k1 * x1 + k2 * x2 + k3 * x3 …… + y0

—————————-

Have anybody example?

daynebattenMarch 24, 2017 / 9:48 amUnfortunately, multivariate regression requires more complex matrix operations that can’t be easily accomplished in straight SQL. I’m sure it could be done (the language is Turing-complete, after all), but I’m not sure you would want to…

IvanMarch 29, 2017 / 3:56 amI found this:

http://stackoverflow.com/questions/2536895/are-there-any-linear-regression-function-in-sql-server

daynebattenMarch 29, 2017 / 6:22 amMost of those answers appear to also be doing simple linear regression with only a slope and intercept.

It does look like Colin Campbell’s implementation can do multivariate regression (after a very quick scan), but it’s using procedural elements of Transact-SQL in addition to vanilla SQL. If you’re in a Microsoft SQL Server environment, that could be what you need, though!

"GWMay 4, 2017 / 9:30 amHow do we do (in SQL) forecasting?

Add Rows to the Source Data with X values, manually calculating the Y value, then plotting?

NOT in Excel or SSRS

thanks

daynebattenMay 11, 2017 / 1:35 pmYeah, sounds like you pretty much have it down. You could have two tables – one of known X and Y values, one with known X values for which you want to forecast Y values. Run the regression on the first table. Then simply plug the results into select [beta_0] + [beta_1] * x from table_2. Boom! Predictions!