One of the best ways to learn how a statistical model really works is to code the underlying math for it yourself. Today, we’re going to do that with simple linear regression.
Of course, doing regression in SQL also has (some) practical use as well! For example, suppose you wanted to identify which city in a database of temperature records had the biggest warming trend in the last month. This method would send you on your way without having to bring your data into an external tool. Nifty!
Simple linear regression is basically the process of finding the equation of a line (slope and intercept) that is the best fit for a series of data. We only really need to calculate two values in order to make this happen – B0 (our intercept) and B1 (our slope). Turns out, the formulas for these are pretty simple – thanks, Wikipedia!
That’s all there is to it! If we can apply these formulas, we can do linear regression in SQL!
Note that, in this case, the X with a line over the top, or “X Bar,” refers to the average X value, while X subscript i refers to each individual X value. The same applies for Y. Also note that the derivatives used for calculating these equations are extraordinarily simple and are explained in this intro to Simple Linear Regression, if you’re interested in learning more. I highly recommend taking a look.
Putting this into practice in SQL is pretty simple. The code below consists of three nested queries, which will enable us to calculate slope and intercept based on a set of X and Y values. The innermost query uses a windowing function to associate the value of X Bar and Y Bar with each observation. The middle query calculates B1 (the slope) using the formula listed above, and maintains the values of X Bar and Y Bar since they’re needed for the calculation of B0. The outer query simply applies the formula for B0 (the intercept) and returns both slope and intercept. That’s it!
Of course, if you wanted to group your data and apply these calculations to each group individually, modifying the parameters of the windowing functions and introducing a couple of “group by” clauses would be all that’s necessary. Super simple stuff.
Testing it out
So, does it work? Well, let’s try it out on a dummy data set I put together…
If I run a regression on this data, I get an intercept of -.14 and a slope of 1.03. Which just happens to be exactly what Excel tells me!
Of course, this doesn’t give you the full suite of tools you’d usually get in a statistical package. For example, you don’t get statistical significance estimates, multiple regression features, R^2 calculations, etc., etc. Still, it’s a handy tool to have in your back pocket. And the process of working through the math is valuable for its own sake.
If you’re interested in seeing other algorithms implemented in SQL, check out my other post on implementing Naive Bayes in SQL.