Last week, we discussed using Kaplan-Meier estimators, survival curves, and the log-rank test to start analyzing customer churn data. We plotted survival curves for a customer base, then bifurcated them by gender, and confirmed that the difference between the gender curves was statistically significant. Of course, these methods can only get us so far… What if you want to use multiple variables to predict churn? Will you create 5, 12, 80 survival curves and try to spot differences between them? Or, what if you want to predict churn based on a continuous variable like age? Surely you wouldn’t want to create a curve for each individual age. That’s preposterous!

Luckily, statisticians (once again, primarily in the medical and engineering fields) are way ahead of us here. A technique called Cox regression lets us do everything we just mentioned in a statistically sound and user-friendly fashion. In fact, because the technique is powerful, rigorous, and easy to interpret, Cox regression has largely become the “gold standard” for statistical survival analysis. Sound cool? Let’s get started.

## Cox Regression

What is Cox regression, you ask? If we were writing analogies on the old SAT, we might put it this way: “Cox regression is to the log-rank test as linear regression is to the t-test.” Instead of just bifurcating some data and looking for differences, it lets us create a best-fit model that predicts a dependent variable’s value based on numerous explanatory variables. More specifically, we’re creating a regression model that predicts how changes in the independent variables multiply (or divide) the baseline “hazard rate” (or churn rate, in our case). Our results tell us things like “the hazard rate for men is 1.2 times the hazard rate for women,” which can also be expressed as “the hazard rate for men is 20% higher.” Once again, the math behind the model isn’t too bad, but I won’t cover it here.

So, once again, let’s take a look at some data on NetLixx, a fictional company that sells monthly access to a library of guitar tabs for popular songs. Since we’ll be doing regression across multiple variables, we’ll use a slightly more complicated data set than we used last week. This time, the data will incorporate a user’s gender, their age, and whether or not they signed up with a “first month free” coupon. And, just like last week, we’ll know whether or not they’ve already churned, as well as a follow-up time (churn date minus start date for churners, today’s date minus start date for non-churners). You can get the data in CSV format here. And you can check out a quick preview below…

So, how do we run a Cox regression? If you’ve been following this series, you know we’ve been using the freely available survival package for R to do our survival analysis so far. And it just so happens that it’s great for doing Cox regression as well.
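For readers who want something runnable, here is a minimal sketch of what such a fit can look like. The NetLixx CSV isn't reproduced here, so the block simulates a stand-in data set. The effect sizes, the one-year observation window, and the `churned` column name are assumptions made for illustration. The `female`, `age`, `coupon`, and `followtime` names and the `netlixx_cox` data frame follow the names used in the comment thread below.

```r
library(survival)

# Simulated stand-in for the NetLixx data (not the real CSV).
set.seed(42)
n <- 1000
netlixx_cox <- data.frame(
  female = rbinom(n, 1, 0.5),
  age    = round(runif(n, 18, 65)),
  coupon = rbinom(n, 1, 0.3)
)

# Assumed true effects: coupons raise the churn hazard, age lowers it slightly.
rate      <- 0.005 * exp(0.6 * netlixx_cox$coupon - 0.006 * netlixx_cox$age)
churn_day <- rexp(n, rate)

# Customers still active after 365 days are right-censored.
netlixx_cox$followtime <- pmin(churn_day, 365)
netlixx_cox$churned    <- as.integer(churn_day <= 365)

# Surv() pairs the follow-up time with the churn indicator.
fit <- coxph(Surv(followtime, churned) ~ female + age + coupon,
             data = netlixx_cox)
summary(fit)

# Hazard ratios (the exp(coef) column) on their own.
exp(coef(fit))
```

Because the data are simulated, the numbers this prints will not match the results discussed in the post.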

And here are our results!

First of all, we can see that the coefficient for the female variable is not statistically significant. (If this were a real analysis, you’d probably remove it and run the model again, but this is a blog post with fake data…) But the other two variables did produce significant results! To interpret the magnitude of the effects, we’re most interested in that exp(coef) column. This gives us the multiplicative relationship between the two hazard (churn) rates. According to these results, coupon users churn at 1.8 times the rate of non-coupon users (that is, 80% faster). Wow! The age results are a bit more complicated… The coefficient means that a 1-year increase in customer age multiplies the hazard rate by 0.994, which is a 0.6% reduction. So, we get a slight reduction in churn for every additional year of a customer’s age. Not bad! Let’s target those old guys!
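The arithmetic behind reading the exp(coef) column can be sketched in a few lines of R. The two hazard ratios are the ones quoted above. The `pct_change` helper and the 10-year comparison are illustrative additions, not part of the original analysis.

```r
# Hazard ratios quoted in the post (the exp(coef) column).
hr_coupon <- 1.8    # coupon users vs. non-coupon users
hr_age    <- 0.994  # per additional year of age

# Convert a hazard ratio into a percent change in the churn hazard.
pct_change <- function(hr) (hr - 1) * 100
pct_change(hr_coupon)  # about 80: an 80% higher churn rate
pct_change(hr_age)     # about -0.6: each extra year lowers the hazard by 0.6%

# Hazard ratios multiply, so a 10-year age gap compounds the yearly effect.
hr_age^10              # about 0.94, i.e. roughly a 6% lower hazard

# On the coefficient (log-hazard) scale, the same effects add instead.
exp(10 * log(hr_age))  # same value as hr_age^10
```

The same compounding is why a tiny per-year ratio can still add up to a noticeable difference between a 20-year-old and a 60-year-old customer.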

## Validating Assumptions

Of course, as with regular regression, Cox regression is built on some assumptions and, if your data violates those assumptions, your statistics will be all wrong. And, unfortunately, Cox regression comes with one *particularly* big assumption – proportional hazards.

What does “proportional hazards” mean? Well, remember how we were talking about a multiplicative relationship between the baseline hazard rate and the hazard rate for a particular group? Like “coupon users churn 1.8 times faster than non-coupon users”? Yeah, well, Cox regression assumes that each of those multipliers stays constant over time. In other words, the model assumes the churn rate for coupon users is *always* 1.8 times that of non-coupon users.

But what if that’s not true? What if coupon users churn a lot faster in their first month (a lot of them take their free month, then quit), but then churn at regular rates later? In fact, if we look at the data, we see that’s exactly what happens.

Looks like we might have “non-proportional hazards,” which means that our Cox regression results are potentially faulty. Dang it. But are we sure? Can we test for this? It turns out, we can test for non-proportional hazards. And we can test the whole model at the same time, so we get results for all of our variables, as well as the model as a whole. Once again, that’s a lot better than making graphs and taking guesses! Here’s how you run the test… yup, it’s that easy.
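As a hedged sketch of the test, the survival package provides `cox.zph()`, which takes a fitted `coxph` object. The data below are simulated under an assumed churn process (coupon users churn four times as fast in their first 30 days and at the normal rate afterwards), so the block runs on its own. It is not the NetLixx data, and its numbers will differ from the post's.

```r
library(survival)

set.seed(1)
n <- 2000
d <- data.frame(
  coupon = rbinom(n, 1, 0.5),
  age    = round(runif(n, 18, 65))
)

# Assumed process: a shared base daily hazard, multiplied by 4 for coupon
# users during their first 30 days only (a non-proportional effect).
base   <- 0.004
early  <- base * ifelse(d$coupon == 1, 4, 1)
cumhaz <- rexp(n)  # each customer's cumulative hazard at the moment of churn
churn_day <- ifelse(cumhaz < early * 30,
                    cumhaz / early,
                    30 + (cumhaz - early * 30) / base)

d$followtime <- pmin(churn_day, 365)
d$churned    <- as.integer(churn_day <= 365)

fit <- coxph(Surv(followtime, churned) ~ age + coupon, data = d)

# One row per covariate, plus a GLOBAL row for the model as a whole.
ph_test <- cox.zph(fit)
print(ph_test)
```

Calling `plot(ph_test)` draws the scaled Schoenfeld residuals against time, which can help show where in a customer's lifetime the effect changes.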

And here are the results!

Since we’re testing for non-proportional hazards, statistical significance here is a bad thing. And, as we expected, we’ve got some pretty heavy-duty significance on that coupon variable, which is also invalidating the assumptions for the global model. If we want to make sure we have useful statistics, we’re going to have to remove the coupon variable from the model. Unfortunate – but that’s the way things go sometimes. If you don’t want to report garbage, you have to throw out the garbage.

## Conclusion

Even though we had to drop the coupon variable, we still learned several important things from our Cox regression experiment. First, as people get older, they churn less. Second, there doesn’t seem to be a relationship between gender and churn (at least using this dummy data set). And, finally, our investigation into the non-proportionality of the coupon variable told us that coupon users tend to leave quickly in the first month, but they’re pretty much normal users after that. That’s still useful knowledge!

Of course, real data is rarely this clean, and often models will violate the proportional hazards assumption in ways that aren’t immediately obvious. (That’s why there’s a test for it.) In the next couple of weeks, we’ll discuss some statistical methods that we can use to analyze our data even when it violates the proportional hazards assumption… and we’ll link churn to revenue in the process. These new methods might not quite live up to the “gold standard” of Cox regression for survival analysis, but they’ll still generate extremely useful business insight. Check back soon!

jimmy, July 7, 2015 / 10:56 pm

Hi, I have some questions… What does the variable time mean? How do you define it?

thanks~

daynebatten, July 8, 2015 / 8:15 am

Hi Jimmy,

Thanks for following up. That’s a great question.

The time variable is basically useless. It was an intermediate randomized variable I created while generating demo data for this post. After I got the demo data finalized, I should have removed it from the data set. Feel free to ignore it…

Please let me know if you have any other questions.

Thanks,

Dayne

jimmy, July 9, 2015 / 4:27 am

So… is the variable followtime in NetLixx equal to the variable time in NetLixxCox?

Q1: If a customer has many bills on different days, do you use the first bill and the last bill to compute the start and the follow-up time for churners?

Q2: How do you define whether a customer has churned or not?

Q3: If you define a customer as churned after no trade behavior for over 6 months, what about customers who have recent trade records but also a gap of more than 6 months with no trades in the middle? How can I handle that?

Sorry, I apologize for my poor English and so many questions.

thanks~

daynebatten, July 10, 2015 / 8:55 am

Yeah… followtime is the follow-up time from either sign-up to churn, or sign-up to today (for those that haven’t churned). On to your other questions…

Q1: I think it depends on the time frame you’re looking at. If you’ve got multiple transactions per user per day and you’re interested in minute-by-minute churn, you can monitor follow-up time by the minute (or second, or whatever). Or, you could simply aggregate those interactions into a daily “yes/no” variable.

Q2: In the dummy example of NetLixx, I would define churning as cancelling their account.

Q3: I have two thoughts on this. The first is that customer churn is primarily a metric associated with subscription businesses – somebody is paying a regular monthly (or whatever) fee, and they “churn” when they cancel their service. It sounds like your business may not follow this model and, if so, churn may not be an appropriate metric to study.

Second, if you want to determine churn by non-activity, just arbitrarily pick a cutoff (6 months is as good as any) and say a person has churned if they haven’t been active for that long or longer. If they come back after a 6-month period of inactivity, I would treat them as a new customer for research purposes – just as I would suggest NetLixx treat a re-subscriber as a new customer if they signed back up after cancelling…
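A minimal sketch of that inactivity rule in base R, under stated assumptions: the `orders` table, the analysis date, and the 182-day cutoff are all hypothetical, and dating a churn at the customer's last order (rather than at the end of the inactivity window) is one possible choice among several. A gap of at least `cutoff` days starts a new "spell," which is then treated as a new customer.

```r
# Hypothetical order log: one row per transaction.
orders <- data.frame(
  customer   = c("a", "a", "a", "b", "b", "c"),
  order_date = as.Date(c("2014-01-05", "2014-02-20", "2015-06-20",
                         "2014-02-10", "2014-03-15", "2015-05-01"))
)
today  <- as.Date("2015-07-10")  # analysis date (assumed)
cutoff <- 182                    # roughly six months of inactivity, in days

# Days since the previous order by the same customer (0 for a first order).
orders <- orders[order(orders$customer, orders$order_date), ]
day    <- as.numeric(orders$order_date)
gap    <- ave(day, orders$customer, FUN = function(d) c(0, diff(d)))

# A gap of `cutoff` days or more starts a new spell for that customer.
spell_no <- ave(as.integer(gap >= cutoff), orders$customer, FUN = cumsum) + 1
spell_id <- paste(orders$customer, spell_no, sep = "_")

first_day <- tapply(day, spell_id, min)
last_day  <- tapply(day, spell_id, max)
churned   <- (as.numeric(today) - last_day) >= cutoff

# One row per spell: churn flag plus follow-up time in days.
spells <- data.frame(
  spell      = names(first_day),
  churned    = as.integer(churned),
  followtime = as.numeric(ifelse(churned,
                                 last_day - first_day,
                                 as.numeric(today) - first_day))
)
print(spells)
```

Here customer "a" ends up with two spells: an early one that counts as churned, and a recent one that is still active and therefore censored.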

jimmy, July 12, 2015 / 9:15 pm

Yeah, thanks for your answers. I want to analyze customer trade data with survival analysis, but the trade data is not regular, as you said, so I am confused: can I use survival analysis to analyze the data? Is it a good fit?

And some other questions I discovered…

Q4: I recently found an interesting result: the data cannot include zero values of the variable time, or you cannot explain the result with Kaplan-Meier estimators.

Q5: In the Cox model, I saw that coupon and GLOBAL were significant. Does that mean they are important factors affecting customer churn? Or do they show significant differences in churn (just like churn for females and males may be different)?

Q6: This is the deepest question: in R, how can I calculate the hazard rate for a certain customer who survives 100 days and then stops (churns) on day 101?

Q7: I pull the data from SQL, and there is a lot of repeated data, e.g. the same customer has different trade records, and SQL treats these as different customers in different rows. How can I solve this?

Thanks a lot, I don’t know how to thank you for your help…

daynebatten, July 13, 2015 / 6:47 am

Q4: Theoretically, time can be 0. Your Kaplan-Meier curve would just have a starting value less than 100%. Don’t know how R handles it, though.

Q5: I think you’re looking at the proportional hazards test. Those results were showing that the coupon variable is violating the assumptions of the cox regression model.

Q6: You could look at the difference between the value of the KM curve between day 100 and day 101, then simply do the math.

Q7: I can’t really offer advice without seeing your data. Honestly, that’s a question that’s a good candidate for Stack Overflow.

-Dayne

jimmy, July 13, 2015 / 10:14 am

About…

Q4: Yes, R can draw a Kaplan-Meier curve with the time variable equal to zero, but this is weird. I can’t explain why the curve at time zero is less than 100%, that’s it…

Q5: I confused the different results, sorry. So in coxph(formula = survival ~ female + age + coupon, data = netlixx_cox), where the p for coupon is 4.3e-13, does it show a significant effect on churn, or a significant difference between coupon owners and non-coupon owners?

Q7: Sorry, the data is secret… but I can make an example to show my question.

Q9: May I use survival analysis to study my data? My data are customer orders (irregular)… This should be the most important question.

Q10: If a variable violates the assumption of the Cox model, what can I do about it (remove it, find a new feature, or whatever)?

thanks

jjreddick, October 23, 2015 / 6:38 pm

Hello Dayne,

Can this model be extended to obtain probability of a customer churning, say in the next 30 days? Looks like the “predictSurvProb” function on R’s PEC package may help. Any thoughts on how one would go about that?

daynebatten, October 26, 2015 / 9:03 am

Yes, you can do this with survfit.coxph. That will let you get a survival curve from a Cox results object and the values of the covariates associated with the user in question. The probability that the user churns in the next 30 days is then the drop in the survival curve between day X and day X + 30, divided by the curve’s value at day X (since we already know the customer has survived that long), where X is the customer’s current tenure in days.
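A rough sketch of that calculation, assuming simulated data and a hypothetical customer (a 30-year-old woman who has been subscribed for 100 days). Calling `survfit()` on a `coxph` fit dispatches to `survfit.coxph`. Dividing the drop in the curve by its value at day X conditions on the customer still being active at day X.

```r
library(survival)

# Simulated stand-in data; the effect sizes are assumptions.
set.seed(7)
n <- 1000
d <- data.frame(female = rbinom(n, 1, 0.5),
                age    = round(runif(n, 18, 65)))
churn_day    <- rexp(n, 0.004 * exp(-0.006 * d$age))
d$followtime <- pmin(churn_day, 365)
d$churned    <- as.integer(churn_day <= 365)

fit <- coxph(Surv(followtime, churned) ~ female + age, data = d)

# Predicted survival curve for one hypothetical customer's covariates.
customer <- data.frame(female = 1, age = 30)
curve    <- survfit(fit, newdata = customer)

# Survival probabilities at the current tenure X and at X + 30 days.
x <- 100
s <- summary(curve, times = c(x, x + 30))$surv

# Chance of churning within 30 days, given the customer is active at day X.
p_churn_30 <- (s[1] - s[2]) / s[1]
p_churn_30
```

Without the division, `s[1] - s[2]` is the chance that a brand-new customer churns between day X and day X + 30, which is a different question.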

Of course, with simple cox regression, this only works for covariates that are stable throughout a customer’s lifetime (such as gender). But, if you need to know how likely it is that somebody churns in the next 30 days, you’re likely interested in the effects of recent events (like a recent support request). For these purposes, you’ll probably want to extend the cox model to use time-dependent covariates. There’s a pretty good summary of time-dependent covariates for cox regression and how to implement them in R available here.

Hope that helps!

Frank Sauvage, October 26, 2015 / 10:31 am

Very nice lot of articles about the churn analysis subject! Crystal clear so far and very helpful!

Thank you very much for your time for this share.

jjreddick, February 8, 2016 / 2:01 pm

Hello:

I am a bit confused on how data is split between train/test for survival models.

If I want to predict survival aka “time to churn” for active customers then ideally I would use only churned customer data for train/test (ie., use only uncensored data) and then predict survival for un-churned customers (censored data).

I know the idea of survival analysis is to leverage uncensored data for training but how would we then be able to predict “time to event” on the same uncensored data, given it’s already used for model creation?

I hope I made my question clear.

daynebatten, February 8, 2016 / 2:40 pm

Excellent question. A few thoughts:

1) You’re coming at this from a machine learning perspective, whereas most of the classical survival analysis methodologies (including cox regression) have been developed more for the purposes of traditional statistical analysis. So, the techniques are usually used for describing the relationships between variables, not for making predictions.

2) If you want to use a data set consisting entirely of customers that have churned, that’s certainly doable… but you would have to only use customers from long enough ago that all of them have churned. (Simply dropping the non-churning folks out instead of censoring them would bias the results towards early churn, which is why survival analysis methods for right-censored data were invented in the first place.) For the vast majority of businesses, that’s not going to be anything close to possible – they’ll all have a handful of customers that have been around for decades.

3) If you are interested in prediction, you might look into Random Survival Forests. Essentially, these work like regular random forests, except that splits are determined by selecting the variable that maximizes the survival differences between the two cohorts. This is probably what you’re going for, and there are R implementations, though I’ve never used them (yet).

4) In terms of actually building a train/test set for something like Random Survival Forests, you definitely don’t want to train and test on the same data, just like with any other model. Instead, simply randomly assign customers to train and test sets and go from there.
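The bias described in point 2 can be seen in a small simulation. Everything here is an assumption chosen for illustration: lifetimes are exponential with a 400-day mean (so the true median is about 277 days), and customers signed up at random times over the past two years.

```r
library(survival)

set.seed(3)
n <- 5000
lifetime     <- rexp(n, 1 / 400)   # true time to churn, in days
observed_for <- runif(n, 0, 730)   # days since each customer signed up

followtime <- pmin(lifetime, observed_for)
churned    <- as.integer(lifetime <= observed_for)

# Naive approach: keep only the customers who have already churned.
naive_median <- median(followtime[churned == 1])

# Kaplan-Meier keeps censored customers in the risk set while they are observed.
km        <- survfit(Surv(followtime, churned) ~ 1)
km_median <- unname(summary(km)$table["median"])

c(naive = naive_median, kaplan_meier = km_median)
```

The churned-only median comes out well below the Kaplan-Meier estimate, because long-lived customers are exactly the ones most likely to still be active and therefore dropped.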

Hope that’s helpful!

jjreddick, February 8, 2016 / 3:00 pm

Indeed I was coming at it from a machine learning perspective, and have been looking at Random Survival Forests.

Just to make sure I understood your point #4 correctly, if I randomly use some active customers in the training set then I’d have to ignore them for test/predictions, which would mean I don’t have a prediction on those customers. Correct?

In order to reduce the number of active customers left out without a prediction, can I re-run the random forest model multiple times – each time randomly selecting a different set of active customers?

daynebatten, February 9, 2016 / 7:30 am

Oh, I’m following now. You’re worried about what will happen if you include currently active customers in your training data and then you want to predict churn likelihood for those currently active customers… That’s an interesting question.

I actually don’t have a super-defensible answer on that either way, and I haven’t seen anybody else mention the problem (though I plan to keep looking). My gut tells me it’s unlikely to be a super serious issue for two reasons:

1) You’re training on the past life of currently active customers, but then predicting their current/future churn probabilities. So you’re not *quite* training and testing on the same data.

2) The concern with training and testing on the same data is that you may produce a model that doesn’t generalize well. But, in this case, you’re less concerned about the model generalizing, because the vast majority of today’s customers are the same as yesterday’s customers.

As I said, I’m not super sure on this, and I’m willing to be proved wrong. But that’s where my gut goes.

jjreddick, February 9, 2016 / 4:30 pm

Makes sense – thank you so much for the quick feedback.

Nommel, November 4, 2017 / 2:31 pm

Hi Dane Batten,

Thank you for providing us with these awesome tutorials. I took a survival class, but I had a difficult time grasping the conceptual idea, but following your tutorials improve my understanding of the topic. Can you recommend some good resources to learn survival analysis?

Thanks

daynebatten, November 14, 2017 / 7:18 am

Hi Nommel. I don’t necessarily have a book or something to send you to, but there’s a lot of good information linked in these posts already. In particular, I’d check out some of the basic Wikipedia articles, as well as any R vignettes you can find on the subject. I’d probably focus on learning one methodology at a time. It may get really confusing to get halfway through figuring out how one methodology works and then start reading about another one. Hope that helps!