Last week, I looked into options for increasing the performance of an Amazon Redshift cluster that was, at the time, running on 10 large dense compute nodes (dc1.large). While investigating, I noticed that a cluster of 32 dc1.large nodes (the maximum for that node type) had the same number of CPUs, the same amount of storage, and a comparable amount of RAM to a cluster of 2 dc1.8xlarge nodes (the minimum for that node type), while offering significant cost savings for anything less than a 3-year term. This got me wondering... if you want a cluster that's around that size, which is the better bet for performance?
We'll get the answer out of the way early for the tl;dr crowd.
The results of testing out three diverse queries on various-sized Redshift clusters.
The cluster of 32 dc1.large nodes provided substantially (13%, on average) faster performance for all but the shortest query in my test batch. Given the superior performance and lower cost, it was the obvious choice for a cluster of this size. Whenever more performance is called for, it will make sense to switch to a cluster of 3 dc1.8xlarge nodes...
My company has a subscription-based business model, which means we spend a lot of time analyzing customer churn. We wanted to include Kaplan-Meier survival curves in some of our executive dashboards, but neither our database (Redshift) nor any of our commonly used dashboarding tools (Tableau, Periscope, etc.) provided the necessary functionality. We could, of course, have pulled data out of the warehouse, analyzed it in R or Python, and pushed it back up, but that's pretty complicated. So we went looking for a better solution.
Based on all available evidence, the survival curve for aliens singing disco music hits 0 at about 43 seconds (N = 1).
As you likely guessed from the title of this post, that better solution involved writing our own code for calculating a Kaplan-Meier estimator in SQL. In this post, I'll be walking through our strategy step-by-step, including the SQL code for calculating the estimators and making that code reusable in Periscope. Let's do this!
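For readers who want the arithmetic before the SQL, here is a minimal sketch of the Kaplan-Meier estimator in Python. It is not the code from the post: the function name `kaplan_meier` and its two inputs (a list of durations and a list of flags saying whether the event was actually observed) are assumptions made for illustration. At each distinct event time, the running survival probability is multiplied by one minus (events at that time / subjects still at risk).

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival_probability), ...] at each distinct event time.

    durations: time each subject was followed (hypothetical input format)
    observed:  True if the event (e.g. churn) happened at that time,
               False if the subject was censored (still active when last seen)
    """
    # Only times at which at least one event occurred change the curve.
    event_times = sorted({t for t, e in zip(durations, observed) if e})

    survival = 1.0
    curve = []
    for t in event_times:
        # Subjects still being followed just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Events that happened exactly at time t.
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve
```

Censored subjects never trigger a drop in the curve, but they do count toward the at-risk total until the time they leave, which is what distinguishes this from a naive "fraction remaining" calculation.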
Pretty much any language commonly used for data analysis (R, SAS, Python) can calculate the distance between two geographic coordinates with relative ease. But always having to pull your data out of your data warehouse any time you want to do some basic geographic analysis can be frustrating - sometimes it's nice to keep simple queries all in one system. If you've got a spatially enabled version of Postgres or SQL Server, you're in business. But if not, you'll need to roll your own SQL solution.
Because the earth is (roughly) a sphere, the shortest route between two points follows a "Great Circle," which may appear curved on flat maps...
In today's post, we're going to write our own code in vanilla SQL to calculate the distance between two latitude and longitude coordinates.
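As a reference point for the math involved, here is a minimal sketch of a great-circle distance calculation using the haversine formula, written in Python rather than SQL. The choice of haversine, the function name `haversine_km`, and the mean Earth radius of 6371 km are assumptions for illustration, not necessarily what the post itself uses.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius; the Earth is not a perfect sphere


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in decimal degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)

    # Haversine of the central angle between the two points.
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)

    # Clamp to 1 to guard against floating-point overshoot near antipodal points.
    central_angle = 2 * math.asin(min(1.0, math.sqrt(a)))
    return EARTH_RADIUS_KM * central_angle
```

The same expression uses only sine, cosine, arcsine, and square root, all of which are available as functions in most SQL dialects.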
One of the best ways to learn how a statistical model really works is to code the underlying math for it yourself. Today, we’re going to do that with simple linear regression.
In the book Data Smart, John Foreman introduces a bunch of awesome methodologies by walking you through how to build them in Excel…
Of course, doing regression in SQL has (some) practical use as well! For example, suppose you wanted to identify which city in a database of temperature records had the biggest warming trend in the last month. This method would send you on your way without having to bring your data into an external tool. Nifty!
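The underlying math for simple linear regression is small enough to fit in a few lines. The sketch below is in Python, not SQL, and the function name `fit_line` is hypothetical. It uses the standard least-squares formulas: the slope is the sum of cross-deviations divided by the sum of squared x-deviations, and the intercept follows from the means.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n

    # Sum of cross-deviations and sum of squared x-deviations.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)

    slope = s_xy / s_xx  # raises ZeroDivisionError if all x values are identical
    intercept = y_bar - slope * x_bar
    return slope, intercept
```

Everything here is a sum, a count, or an average, which is why the calculation translates naturally to aggregate functions over a grouped query, such as one group per city in the temperature example.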
Naive Bayes is an extraordinarily versatile algorithm for categorizing things… it can separate fraudulent from non-fraudulent credit card purchases, spam from legitimate email, or dinosaurs from fictional monsters. Its most frequent application is in “bag of words” models… statistical models that classify a block of text based on the words that appear within it.
This is not a dinosaur…
In this post, we’ll explore the basics of Naive Bayes classification, and we’ll illustrate using a relatively common problem – assigning genders to a group of people when all you have is their names. And, just for kicks and giggles, we’re going to do it in SQL. Because nobody trains models in SQL.
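To make the "bag of words" idea concrete, here is a minimal Python sketch of a multinomial Naive Bayes text classifier with add-one (Laplace) smoothing. It is a generic illustration, not the name-to-gender model from the post or its SQL; the function names `train` and `classify` and the `(label, text)` input format are assumptions.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """Count classes and per-class words from a list of (label, text) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in docs:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def classify(text, model):
    """Return the label with the highest log posterior for the given text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # Log prior: how common the class is in the training data.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # words never seen in training carry no information
            # Add-one smoothing keeps unseen (class, word) pairs from zeroing the score.
            score += math.log((word_counts[label][word] + 1)
                              / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

The "naive" part is the assumption that words are independent given the class, which is what lets the per-word log probabilities simply be added together.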