If your company operates on any type of Software as a Service or subscription model, you understand the importance of customer churn to your bottom line. When a customer leaves, you lose not only a recurring source of revenue, but also the marketing dollars you paid out to bring them in. As such, small changes in customer churn can easily bankrupt a profitable business, or turn a slow-mover into a powerhouse.
If you’re ready to get a handle on customer churn in your business, you’re ready to start doing some survival analysis. These statistical methods, which have been applied for decades in medicine and engineering, come in handy any time you’re interested in understanding how long something (customers, patients, car parts) survives and what actions can help it survive longer.
And the best part? The methods involved are mathematically simple, easy to understand and interpret, and widely available in free tools. You don’t need a PhD in stats to do this!
Let’s frame the survival analysis idea using an illustrative example. We’ll be using this example (and associated dummy datasets) throughout this series of posts on survival analysis and churn.
Suppose you work at NetLixx, an online startup which maintains a library of guitar tabs for popular rock hits. Guitar enthusiasts can pay $5 a month for a subscription to your library, which lets them display the tabs on their computer, tablet, or phone while they rock out. After a year of hard work, you’ve got a working site, an extensive database of music, and a couple thousand customers.
But you’re also concerned. You’ve had a lot of people sign up for your service, but many seem to be quitting in only a couple of months. You want to know how long your customers are likely to stay with you, and whether customers with a certain demographic profile tend to churn more slowly.
You could, of course, try some basic statistics, but you’ll quickly find yourself stuck between a rock and a hard place.
- The rock – You need to follow customers for a few months (ideally a year) to get any meaningful information, especially since this is a monthly subscription.
- The hard place – Your business has only been around for a year! A good majority of your customers don’t have 5 or 6 months (let alone a year) of data to follow up on.
So, what’s an analyst to do? Well, luckily, this is a case where you can have your cake and eat it too. How? With Kaplan-Meier estimators, for starters!
Although the math behind Kaplan-Meier estimators is extremely simple (Wikipedia link, for those interested), we won’t go into it here. Instead, suffice it to say that Kaplan-Meier estimators estimate survival probabilities over a given period of time for “right-censored” data. “Right-censored” just means that some of the observations in the data weren’t observed for as long as the period the researcher is interested in analyzing. (For example, we want to look at a year of churn, but some of our customers signed up a month ago.) At each time interval, the estimator uses every customer who is still being observed at that point to estimate what share of customers are still “surviving” at that time.
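For readers who do want a quick peek under the hood, here is a minimal sketch of the calculation on a tiny made-up dataset (the numbers are invented purely for illustration, and are not NetLixx data). At each time where someone churns, the running survival estimate is multiplied by the share of at-risk customers who did not churn. Censored customers count as "at risk" right up until their follow-up ends.

```r
library(survival)

# Toy data, made up for illustration: follow-up in months,
# status 1 = churned, 0 = still subscribed (right-censored)
time   <- c(1, 2, 2, 3, 4, 5, 5, 6)
status <- c(1, 1, 0, 1, 0, 1, 0, 0)

# Distinct times at which at least one customer churned
event_times <- sort(unique(time[status == 1]))

# Customers still under observation just before each churn time
at_risk <- sapply(event_times, function(t) sum(time >= t))

# Customers who churned at each churn time
churned <- sapply(event_times, function(t) sum(time == t & status == 1))

# Kaplan-Meier (product-limit) estimate of the surviving share
surv <- cumprod(1 - churned / at_risk)
by_hand <- data.frame(event_times, at_risk, churned, surv)
print(by_hand)

# The survival package should give the same numbers
fit <- survfit(Surv(time, status) ~ 1)
print(summary(fit, times = event_times)$surv)
```

Notice that the customer censored at month 2 still counts toward the at-risk group at month 2, then quietly drops out of later calculations. That is how every customer contributes whatever data they have.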
To do simple survival analysis using these estimators, all you need is a table of customers with a binary value indicating whether they’ve churned, and a “follow-up time.” The follow-up time is calculated in one of two ways. If the customer churned, it’s the number of days (or weeks, months, whatever) between the day they subscribed and the day they unsubscribed. Otherwise, it’s just the number of days between the day they subscribed and today (or the day the data was pulled).
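If your raw data is a list of subscribe and unsubscribe dates, building that table might look something like the sketch below. The data frame, its column names, and the dates are all hypothetical stand-ins, so swap in whatever your own subscription log uses.

```r
# Hypothetical subscription log; column names and dates are assumptions
subs <- data.frame(
  customer_id  = 1:3,
  subscribed   = as.Date(c("2013-01-10", "2013-02-01", "2014-01-19")),
  unsubscribed = as.Date(c("2013-03-11", NA, NA))  # NA = still subscribed
)
pull_date <- as.Date("2014-01-27")  # the day the data was pulled

# Churn binary: 1 if the customer has unsubscribed, 0 otherwise
subs$churned <- as.integer(!is.na(subs$unsubscribed))

# Observation ends at the unsubscribe date for churners,
# and at the pull date for everyone else
end_date <- subs$unsubscribed
end_date[is.na(end_date)] <- pull_date

# Follow-up time in days (subtracting two Dates gives a difference in days)
subs$followup_time <- as.numeric(end_date - subs$subscribed)

print(subs[, c("customer_id", "followup_time", "churned")])
```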
For this post, we’ll be using a simple CSV file of NetLixx data as an example. (Download the NetLixx data here.) The data includes follow-up time, a churn binary, and a gender indicator. The first few observations are displayed below. Note how the second customer has a follow-up time of 360, while the third has a follow-up time of 8, even though neither has churned. This means customer 2 signed up 360 days ago, but customer 3 signed up only 8 days ago. Neither has left us yet!
Alright! Let’s plot some data! For this analysis, we’ll be using R and the “survival” package, since both are free tools, and they work great for basic survival analysis.
Please note… teaching R is beyond the scope of this post, but there are plenty of resources online – both serious and pirate-themed. Or you can just learn the way I do and muck around with code samples and Google until it makes sense…
Here’s some simple R code that uses the survival package to fit Kaplan-Meier estimators and plot a simple survival curve. (If you’re new to this… don’t forget to install the package with install.packages("survival").)
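In case the code doesn't display for you, here is a minimal sketch of what this step can look like. To keep it runnable on its own, it simulates a stand-in dataset rather than reading the NetLixx CSV. The column names (followup_time, churned, female) and the churn rates used in the simulation are my assumptions for illustration, not the real file's contents.

```r
library(survival)

# Simulated stand-in for the NetLixx data. Column names and churn rates
# are assumptions for illustration only, not the real file's contents.
set.seed(42)
n          <- 2000
female     <- rbinom(n, 1, 0.4)
signup_age <- runif(n, 1, 365)  # days since each customer signed up
churn_day  <- rexp(n, rate = ifelse(female == 1, 0.0005, 0.0012))

netlixx <- data.frame(
  followup_time = pmin(churn_day, signup_age),            # observed follow-up
  churned       = as.integer(churn_day <= signup_age),    # 1 = churned
  female        = female
)

# Fit one Kaplan-Meier curve for all customers ("~ 1" means no grouping)
fit <- survfit(Surv(followup_time, churned) ~ 1, data = netlixx)

# Plot the estimated survival curve with its 95% confidence interval
plot(fit, conf.int = TRUE,
     xlab = "Days since subscribing",
     ylab = "Proportion of customers surviving",
     main = "Kaplan-Meier estimate (simulated data)")

# Read estimated survival off the curve at a few follow-up times
print(summary(fit, times = c(90, 180, 360)))
```

With your own data, you would replace the simulation with a read.csv() call on your file and use your own column names inside Surv().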
And here’s the resulting plot! That line in the middle represents the best estimate of the percent of customers surviving at each time interval. The dashed lines represent a 95% confidence interval. The confidence interval spreads out as we get closer to 365 days, since we have fewer and fewer customers with that much data to work with.
Looking at this graph, we know that we can expect 75% of customers (give or take) to make it through their first year with us… not bad!
So, there you have it… the basic Kaplan-Meier estimator.
Looking for Trends
Of course, knowing how fast our customers churn is all well and good, but what we’re really interested in is understanding and analyzing churn… We want to know what makes a customer more likely to churn, and what causes them to stick around.
One easy way to do that is to create different Kaplan-Meier survival curves for each subset of subscribers you want to look at. The statistical significance of the differences can be tested in many ways, including the Log-Rank test, which we’ll apply below. The Log-Rank test simply evaluates whether the underlying population survival curves for the two sampled groups are likely to be the same. The p-value is the probability of seeing a difference at least as large as the one in our sample if the underlying curves really were the same, so a small p-value (I’ll use p < .05 as the cutoff for statistical significance) is good!
Let’s go ahead and try this out, using the gender variable I mentioned earlier!
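Here is a minimal sketch of that comparison. As before, it runs on simulated stand-in data so it works on its own. The column names and the built-in difference between men and women are assumptions for illustration, not the real NetLixx file.

```r
library(survival)

# Simulated stand-in for the NetLixx data. Column names and churn rates
# are assumptions for illustration only, not the real file's contents.
set.seed(42)
n          <- 2000
female     <- rbinom(n, 1, 0.4)
signup_age <- runif(n, 1, 365)  # days since each customer signed up
churn_day  <- rexp(n, rate = ifelse(female == 1, 0.0005, 0.0012))

netlixx <- data.frame(
  followup_time = pmin(churn_day, signup_age),
  churned       = as.integer(churn_day <= signup_age),
  female        = female
)

# One Kaplan-Meier curve per gender (strata are ordered female = 0, then 1)
fit_gender <- survfit(Surv(followup_time, churned) ~ female, data = netlixx)

plot(fit_gender, lty = 1:2,
     xlab = "Days since subscribing",
     ylab = "Proportion of customers surviving")
legend("bottomleft", legend = c("Male", "Female"), lty = 1:2)

# Log-Rank test for a difference between the two curves
logrank <- survdiff(Surv(followup_time, churned) ~ female, data = netlixx)
print(logrank)

# p-value from the chi-squared statistic (degrees of freedom = groups - 1)
p_value <- pchisq(logrank$chisq, df = length(logrank$n) - 1, lower.tail = FALSE)
print(p_value)
```

The same formula, with the grouping variable on the right-hand side, goes into both survfit() and survdiff(), so swapping in a different customer attribute is a one-word change.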
Here are our results on a graph…
And here are the results of the Log-Rank test. (Amazing how nicely your analysis comes out when you make dummy data!)
Now we’ve got some business insight! Those crazy rocker chicks are sticking with us much longer than their male counterparts. Maybe we should do a survey to see why men don’t like our business as much. Or, perhaps, we should start marketing towards women so we can attract loyal customers. NetLixx will never be the same.
Of course, this is all very basic analysis, and old hat for many readers. That’s OK. We’ll be getting a lot more complex in future posts. Stay tuned for tips on:
- Cox Regression analysis to model churn on multiple explanatory variables
- Restricted Mean Survival Time analysis to understand how churn impacts revenue
- Pseudo-Observation creation, so we can do vanilla stats on restricted survival times (and revenue!)