If your company operates on any type of Software as a Service or subscription model, you understand the importance of customer churn to your bottom line. When a customer leaves, you lose not only a recurring source of revenue, but also the marketing dollars you paid out to bring them in. As such, small changes in customer churn can easily bankrupt a profitable business, or turn a slow-mover into a powerhouse.

If you’re ready to get a handle on customer churn in your business, you’re ready to start doing some survival analysis. These statistical methods, which have been applied for decades in medicine and engineering, come in handy any time you’re interested in understanding how long something (customers, patients, car parts) survives and what actions can help it survive longer.

And the best part? The methods involved are mathematically simple, easy to understand and interpret, and widely available in free tools. You don’t need a PhD in stats to do this!

## The Problem

Let’s frame the survival analysis idea using an illustrative example. We’ll be using this example (and associated dummy datasets) throughout this series of posts on survival analysis and churn.

Suppose you work at NetLixx, an online startup which maintains a library of guitar tabs for popular rock hits. Guitar enthusiasts can pay $5 a month for a subscription to your library, which lets them display the tabs on their computer, tablet, or phone while they rock out. After a year of hard work, you’ve got a working site, an extensive database of music, and a couple thousand customers.

But you’re also concerned. You’ve had a lot of people sign up for your service, but many seem to be quitting in only a couple of months. You want to know how long your customers are likely to stay with you, and whether customers with a certain demographic profile tend to churn more slowly.

You could, of course, try some basic statistics, but you’ll quickly find yourself stuck between a rock and a hard place.

- The rock – You need to follow customers for a few months (ideally a year) to get any meaningful information, especially since this is a monthly subscription.
- The hard place – Your business has only been around for a year! A good majority of your customers don’t have 5 or 6 months (let alone a year) of data to follow up on.

So, what’s an analyst to do? Well, luckily, this is a case where you can have your cake and eat it too. How? With Kaplan-Meier estimators, for starters!

## Kaplan-Meier Estimators

Although the math behind Kaplan-Meier estimators is extremely simple (Wikipedia link, for those interested), we won’t go into it here. Instead, suffice it to say that Kaplan-Meier estimators give you survival probability estimates over a given period of time for “right-censored” data. “Right-censored” just means that some of the observations in the data weren’t observed for as long as the period the researcher is interested in analyzing. (For example, we want to look at a year of churn, but some of our customers signed up a month ago.) Kaplan-Meier estimators reliably incorporate all available data at each individual time interval to estimate how many observations are still “surviving” at that time.
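For intuition, the product-limit idea can be sketched in a few lines of base R. (This is a toy illustration with made-up numbers, not the post’s actual code; the real analysis below uses the survival package.) At each time where churn happens, we multiply the running survival estimate by the fraction of still-at-risk customers who survived that moment:

```r
# Toy illustration of the Kaplan-Meier product-limit idea
time  <- c(5, 8, 12, 20, 25)   # follow-up times in days
event <- c(1, 0, 1, 1, 0)      # 1 = churned, 0 = still subscribed (censored)

km <- function(time, event) {
  out <- data.frame(t = sort(unique(time[event == 1])), surv = NA)
  s <- 1
  for (i in seq_len(nrow(out))) {
    t <- out$t[i]
    at_risk <- sum(time >= t)              # customers still being followed at t
    churned <- sum(time == t & event == 1) # churn events at exactly t
    s <- s * (1 - churned / at_risk)       # product-limit update
    out$surv[i] <- s
  }
  out
}

km(time, event)  # survival estimate at each churn time
```

Notice how the censored customers (the ones who haven’t churned yet) still count toward the at-risk pool for as long as we observed them. That’s the whole trick.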

To do simple survival analysis using these estimators, all you need is a table of customers with a binary value indicating whether they’ve churned, and a “follow-up time.” The follow-up time can take on one of two values. If the customer churned, it’s the number of days (or weeks, months, whatever) between the day they subscribed and the day they unsubscribed. Otherwise, it’s just the number of days between the day they subscribed and today (or the day the data was pulled).
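Building that table is straightforward. Here’s a sketch in R (the dates and column names are illustrative, not the actual NetLixx data):

```r
# Sketch: deriving follow-up time and a churn indicator from raw dates
pull_date <- as.Date('2015-01-18')  # the day the data was pulled

customers <- data.frame(
  signup_date = as.Date(c('2014-01-15', '2014-01-23', '2015-01-10')),
  churn_date  = as.Date(c('2014-03-20', NA, NA))  # NA = still subscribed
)

customers$churned <- as.integer(!is.na(customers$churn_date))

# Follow-up time: signup to churn if they left, signup to today otherwise
customers$time <- with(customers, ifelse(
  churned == 1,
  as.numeric(churn_date - signup_date),
  as.numeric(pull_date - signup_date)
))

customers[, c('time', 'churned')]
```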

For this post, we’ll be using a simple CSV file of NetLixx data as an example. (Download the NetLixx data here.) The data includes follow-up time, a churn binary, and a gender indicator. The first few observations are displayed below. Note how the second customer has a follow-up time of 360, while the third has a follow-up time of 8, even though neither has churned. This means customer 2 signed up 360 days ago, but customer 3 signed up only 8 days ago. Neither has left us yet!

Alright! Let’s plot some data! For this analysis, we’ll be using R and the “survival” package, since both are free tools, and they work great for basic survival analysis.

*Please note… teaching R is beyond the scope of this post, but there’s plenty of resources online – both serious and pirate-themed. Or you can just learn the way I do and muck around with code samples and Google until it makes sense…*

Here’s some simple R code that uses the survival package to fit Kaplan-Meier estimators and plot a simple survival curve. (If you’re new to this… don’t forget to install the package with ‘install.packages(“survival”).’)
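The original gist isn’t embedded here, so the following is a reconstruction sketch. The plot call matches the one quoted in the comments below; the data-loading step is a simulated stand-in so the snippet runs on its own (with the real file, you’d use something like `read.csv('net_lixx.csv')` instead):

```r
library(survival)  # ships with standard R installations

# Simulated stand-in for the NetLixx CSV (columns: time, churned, female)
set.seed(1)
n <- 2000
observed <- sample(1:365, n, replace = TRUE)     # days since each customer signed up
to_churn <- round(rexp(n, rate = 1 / 1500)) + 1  # hypothetical days until churn
churn_data <- data.frame(
  time    = pmin(observed, to_churn),            # follow-up ends at churn or "today"
  churned = as.integer(to_churn <= observed),
  female  = rbinom(n, 1, 0.5)
)

# Surv() pairs each follow-up time with its event indicator;
# survfit(~ 1) fits a single Kaplan-Meier curve to the whole sample
fit <- survfit(Surv(time, churned) ~ 1, data = churn_data)

plot(fit, lty = 1, mark.time = FALSE, ylim = c(.75, 1),
     xlab = 'Days since Subscribing', ylab = 'Percent Surviving')
title(main = 'NetLixx Survival Curve')
```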

And here’s the resulting plot! That line in the middle represents the best estimate of the percent of customers surviving at each time interval. The dashed lines represent a 95% confidence interval. The confidence interval spreads out as we get closer to 365 days, since we have fewer and fewer customers with that much data to work with.

Looking at this graph, we know that we can expect 75% of customers (give or take) to make it through their first year with us… not bad!

So, there you have it… the basic Kaplan-Meier estimator.

## Looking for Trends

Of course, knowing how fast our customers churn is all well and good, but what we’re really interested in is understanding and analyzing churn… We want to know what makes a customer more likely to churn, and what causes them to stick around.

One easy way to do that is to create different Kaplan-Meier survival curves for each subset of subscribers you want to look at. The statistical significance of the differences can be tested in many ways, including the Log-Rank test, which we’ll apply below. The Log-Rank test simply evaluates whether the underlying population survival curves for the two sampled groups are likely to be the same. The p-value is essentially the probability of seeing differences at least this large between the sample curves if the underlying curves really were the same, so statistical significance (I’ll use p < .05) is good!

Let’s go ahead and try this out, using the gender variable I mentioned earlier!
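Again, the original gist isn’t embedded here, so this is a reconstruction sketch: the `legend()` call matches the one quoted in the comments below, and the data is simulated stand-in data built so that women churn more slowly (the real numbers live in the downloadable CSV):

```r
library(survival)

# Simulated stand-in for the NetLixx data, with slower churn among women
set.seed(1)
n <- 2000
female   <- rbinom(n, 1, 0.5)
observed <- sample(1:365, n, replace = TRUE)
to_churn <- round(rexp(n, rate = ifelse(female == 1, 1/2500, 1/900))) + 1

churn_data <- data.frame(
  time    = pmin(observed, to_churn),
  churned = as.integer(to_churn <= observed),
  female  = female
)

# One Kaplan-Meier curve per gender
fit <- survfit(Surv(time, churned) ~ female, data = churn_data)
plot(fit, lty = 1:2, mark.time = FALSE, ylim = c(.6, 1),
     xlab = 'Days since Subscribing', ylab = 'Percent Surviving')
legend(20, .8, c('Male', 'Female'), lty = 1:2, bty = 'n', ncol = 2)
title(main = 'NetLixx Survival by Gender')

# Log-Rank test: could the underlying curves be the same?
survdiff(Surv(time, churned) ~ female, data = churn_data)
```

A note on `legend(20, .8, …)`, since it trips people up: the first two arguments are just plot coordinates. The legend gets drawn at x = 20 days, y = 0.80 on the survival axis; they have nothing to do with the data itself.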

Here’s our results on a graph…

And here’s the results of the Log-Rank test. (Amazing how nice your analysis comes out when you make dummy data!)

Now we’ve got some business insight! Those crazy rocker chicks are sticking with us much longer than their male counterparts. Maybe we should do a survey to see why men don’t like our business as much. Or, perhaps, we should start marketing towards women so we can attract loyal customers. NetLixx will never be the same.

## Conclusion

Of course, this is all very basic analysis, and old hat for many readers. That’s OK. We’ll be getting a lot more complex in future posts. Stay tuned for tips on:

- Cox Regression analysis to model churn on multiple explanatory variables
- Restricted Mean Survival Time analysis to understand how churn impacts revenue
- Pseudo-Observation creation, so we can do vanilla stats on restricted survival times (and revenue!)

## Comments

Kuldeep (June 5, 2017 / 6:44 am):

Hi Dayne

I followed all the command for survival analysis for my data.

Got stuck after this command 'plot(fit, lty = 1, mark.time = FALSE, ylim=c(.75,1), xlab = 'Days since Subscribing', ylab = 'Percent Surviving')

title(main = ‘NetLixx Survival Curve’)

Error I get is: Error: unexpected ',' in "plot(fit, lty = 1, mark.time = FALSE, ylim=(.75,"

Not sure why it happened – please provide some guidance

daynebatten (June 5, 2017 / 7:39 am):

The error makes it look like there might have been a weird line break or something in there where the full plot call got cut off, but I’m not really sure without being able to see your screen.

Shouldn’t be too hard to figure out, honestly…

Emily (June 15, 2017 / 7:29 am):

^ Looks like you are missing the "c" in ylim=c(0.75,1)

Kuldeep (June 5, 2017 / 8:33 am):

Hi Dayne

Thanks for the quick reply.

Faced another challenge in ‘legend(20, .8, c(‘Male’, ‘Female’), lty=1:2, bty = ‘n’, ncol = 2)’

Error message is: “legend” is missing, with no default

Regards

kuldeep

daynebatten (June 5, 2017 / 8:42 am):

Sorry… again, it’s hard to diagnose without seeing your screen. Have you copied the code directly from the GitHub Gist (using the "view raw" link)?

Kuldeep (June 5, 2017 / 11:54 am):

Hi Dayne

Am using my data – agree it is difficult to diagnose without looking at the data structure.

What is not clear is legend(20, .8)… what does 20 stand for, and also .8 – what pieces of data do these represent? For the rest, I followed your recipe of trial and error and Googling to look for the answer.

Thanks in advance

Regards

rkresnadi (June 9, 2017 / 8:38 am):

Hi Dayne,

Nice writing. I understand that this survival analysis can provide high-level information about the probability of surviving over time. What I am trying to do is predict from the model: what is the probability of survival for a specific customer, and how long would this person stay? Is this something that can be done?

Thanks in advance.

daynebatten (June 9, 2017 / 2:02 pm):

You could definitely do that. There are lots of ways to go about it… probably the most sophisticated one I’ve worked with is called a ‘WTTE-RNN’ model. You can read about it here:

http://daynebatten.com/2017/02/recurrent-neural-networks-churn/

rkresnadi (June 12, 2017 / 4:17 pm):

Thanks Dayne, it seems like I need to learn more about your WTTE-RNN – I am still not quite familiar with it yet. However, it may take a while for my brain to shift to that topic, while I am still getting used to basic survival analysis.

If I used WTTE-RNN for predicting, and used the R survival package to build the probability chart like in the article above, I am afraid that the results may not complement each other and my users/audience will get confused. Can I use this R survival package to predict instead? So, predict from the survfit model?

Or, would you suggest shifting to WTTE-RNN from the get-go?

John Cage (July 21, 2018 / 5:53 am):

I would love to see this answered too. It’s easy to build and analyze a model, but how do I test it on new data and use it for predictions?

daynebatten (August 10, 2018 / 12:17 pm):

Your prediction from a survfit model is literally just the value of the relevant survival curve. So, if the Kaplan-Meier estimator says 30% of observations are still "alive" at time period 100, then you’d expect future similar individuals to have a 70% chance of experiencing the event of interest by day 100.

If you want to do more complicated predictions involving lots of independent variables, you probably want to use Cox regression and predict.coxph.

These resources should help:

http://daynebatten.com/2015/02/customer-churn-cox-regression/

https://cran.r-project.org/web/packages/survival/survival.pdf
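A minimal sketch of reading that prediction off a fitted curve with `summary()` (toy numbers, not the NetLixx data):

```r
library(survival)

# Toy data: the fitted Kaplan-Meier curve itself is the prediction
time  <- c(5, 8, 12, 20, 25, 30, 42, 50, 60, 75)
event <- c(1, 0, 1,  1,  0,  1,  0,  1,  1,  0)
fit <- survfit(Surv(time, event) ~ 1)

# Estimated probability a similar new customer is still "alive" at day 40
s40 <- summary(fit, times = 40)$surv
s40
```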

Rachit (June 28, 2017 / 10:51 am):

Thanks Dayne, this is very useful. I’m starting to model churn for my employer’s subscription business. Could you please point me towards some good online resources to learn more about the survival package, and perhaps churn modelling in general?

Thanks a ton!

daynebatten (July 6, 2017 / 7:32 am):

Sure… I’ve written a bunch of blog posts about survival analysis… You can find them here, though note that they’re listed in reverse chronological (and logical) order, so you may want to start with the oldest posts. http://daynebatten.com/category/survival-analysis/

Also, definitely read the manual and the vignettes for the survival package. https://cran.r-project.org/web/packages/survival/index.html

Max (February 22, 2018 / 6:08 am):

Hi,

I am interested in using your data for a project, but I would like to know where it comes from first, and I can’t find NetLixx on the internet. Could you help me with this?

Thanks!

daynebatten (February 26, 2018 / 8:19 am):

NetLixx is a fictional company I made up for the purposes of this series… All the data is fake. Feel free to use it if you’d like.