My company has a subscription-based business model, which means we spend a lot of time analyzing customer churn. We wanted to include Kaplan-Meier survival curves in some of our executive dashboards, but neither our database (Redshift) nor any of our commonly used dashboarding tools (Tableau, Periscope, etc.) provided the necessary functionality. We could, of course, have pulled data out of the warehouse, analyzed it in R or Python, and pushed it back up, but that's pretty complicated. So we went looking for a better solution.

As you likely guessed from the title of this post, that better solution involved writing our own code for calculating a Kaplan-Meier estimator in SQL. In this post, I'll be walking through our strategy step-by-step, including the SQL code for calculating the estimators and making that code reusable in Periscope. Let's do this!

## Basic Math

As I've mentioned in a previous survival analysis post, Kaplan-Meier survival curves are a great way to illustrate customer churn. They essentially give you a graph of the percentage chance that a customer will be around X time periods (we'll say days) after they sign up for your service. Something like this:

Calculating these curves is surprisingly simple, with the right data. You need to know the time period (we'll say date) that you started observing a customer (likely when they signed up for your service). You need to know the time period when you stopped observing your customer (because they churned, or because it's today and you don't know the future). And you need to know whether that customer has churned or not. Easy.

The next step is to calculate the percentage chance that a customer survives a given day. This is easy. You just take the number of customers alive at the end of the day, divide by the number of customers alive at the beginning of the day and you get the percentage that survive the day.

Then, to calculate the probability that somebody makes it to day X without churning, you simply take a cumulative product of their probability of surviving all the days up to X. If you're not quite with me, that's OK. It should be clear as we get to the code below.

## Generating Numbers in SQL

One thing we'll need to do for this calculation is enumerate days. We might want to know survival chances for, say, days 0 to 999. But that requires pulling numbers out of thin air in SQL, which isn't something we do often. A table of integers and some self-joins can help with that. I'll illustrate:

If you run the code above, you'll get every number from 0 to 999. Awesome!

## Cumulative Product in SQL

Though it may exist in some flavors of SQL, Redshift doesn't provide a cumulative product window function (or aggregate function, but that's beside the point). So, we're going to need to do some evil floating point bit-level hacking. Seriously, though, we'll just do some simple math. As you may remember from high school, logarithms have some cool properties! For example...

This makes it relatively trivial to calculate a cumulative product in SQL. Provided all of your values are positive, you can calculate the cumulative sum of the logged values (something that *is* possible in SQL), then simply exponentiate to un-log (is that a word?) your resulting value. For a cumulative product of survival probabilities, it looks something like this:

At this point, we should have all of our pieces together. Let's build some survival curves!

## Kaplan-Meier Survival Curves in SQL

Here it is... the moment you've all been waiting for. We're going to calculate survival curves in SQL. I won't spill too much ink here, but I've commented my code pretty heavily, so it should be easy to follow. (Please note that you will need to have an integer table set up for this to work - see above.)

That's all there is to it!

## Periscope Integration

This part is actually extremely simple... you can, of course, use the SQL query in Periscope directly. But if you want to make the feature re-usable, you'll want to set it up as a SQL snippet. In the Periscope UI, create a new SQL snipped, name it something appropriate (like "kaplan_meier") and set it up with all of the code from "day_shift as" on down...

Whenever you want to create a survival curve, simply set up your CTE for your customer data, then just type "[kaplan_meier]". Periscope will pop in the rest of your code, execute it, and you'll get a beautiful survival curve, like this one I built with dummy data:

It's truly that easy!

## Conclusion

For our purposes, this has proved to be a great solution for doing some basic survival analysis. Obviously, you don't get the statistical rigor of, say Cox Regression, but the survival curves calculated by the SQL algorithm match the values calculated by the R survival package perfectly... and our Redshift cluster beats the snot out of my laptop at generating them quickly on large data sets. And you can't beat having survival curves directly in your dashboarding software without having a bunch of complicated ETL steps involved.

Of course, as always, there's improvements that could be made here. You could set it up to have a grouping variable so you could compare curves. You could hook up Periscope filters so that you could drill down into different sections of your data. You could allow for the time period under consideration to be changed. Go nuts and let me know what you come up with!

Ruben KogelJune 21, 2017 / 10:58 pmcouple of suggestions to make your SQL code simpler..

• instead of `cast(XX as float)` you could write XX :: float

• also no need to cast both numerators and denominators as float, Redshift will execute the computation using the most precise type of the two numbers – e.g. if you divide an integer by a float or conversely, Redshift will express the output as a float

• instead of using the function `dateadd` you can simply add an integer to a date – e.g. ‘2017-06-20’ +7 is interpreted as ‘2017-06-27’

daynebattenJune 22, 2017 / 7:52 amThanks for the suggestions. I use cast() rather than :: due to compatibility with other databases. As much as possible, I prefer to write SQL code that could execute in any environment.

The other two are definitely matters of personal preference. I choose to write it that way to make sure it’s explicit what’s going on. I do a lot of things like this… for example, if I want to do a cartesian product join, I’ll join on 1 = 1. It’s unnecessary, sure, but it tells me “yes, I really did want to do a cartesian product join here.”