Last week, we discussed using Kaplan-Meier estimators, survival curves, and the log-rank test to start analyzing customer churn data. We plotted survival curves for a customer base, then bifurcated them by gender, and confirmed that the difference between the gender curves was statistically significant. Of course, these methods can only get us so far… What if you want to use multiple variables to predict churn? Will you create 5, 12, 80 survival curves and try to spot differences between them? Or, what if you want to predict churn based on a continuous variable like age? Surely you wouldn’t want to create a curve for each individual age. That’s preposterous!
Luckily, statisticians (once again, primarily in the medical and engineering fields) are way ahead of us here. A technique called Cox regression lets us do everything we just mentioned in a statistically rigorous and user-friendly fashion. In fact, because the technique is powerful, rigorous, and easy to interpret, Cox regression has largely become the “gold standard” for statistical survival analysis. Sound cool? Let’s get started.
What is Cox regression, you ask? If we were writing analogies on the old SAT, we might put it this way: “Cox regression is to the log-rank test as linear regression is to the t-test.” Instead of just bifurcating some data and looking for differences, it lets us create a best-fit model that predicts a dependent variable’s value based on numerous explanatory variables. More specifically, we’re creating a regression model that predicts how changes in the independent variables multiply (or divide) the baseline “hazard rate” (or churn rate, in our case). Our results tell us things like “the hazard rate for men is 1.2 times the hazard rate for women,” which can also be expressed as “the hazard rate for men is 20% higher.” Once again, the math behind the model isn’t too bad, but I won’t cover it here.
So, once again, let’s take a look at some data on NetLixx, a fictional company that sells monthly access to a library of guitar tabs for popular songs. Since we’re running a regression across multiple variables, we’ll use a slightly more complicated data set than we used last week. This time, the data will incorporate a user’s gender, their age, and whether or not they signed up with a “first month free” coupon. And, just like last week, we’ll know whether or not they’ve already churned, as well as a follow-up time (churn date minus start date for churners, today’s date minus start date for non-churners). You can get the data in CSV format here. And you can check out a quick preview below…
So, how do we run a Cox regression? If you’ve been following this series, you know we’ve been using the freely available survival package for R to do our survival analysis so far. And it just so happens that it’s great for doing Cox regression as well.
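The call itself is short. Here’s a minimal sketch, not the exact code behind this post: because the NetLixx CSV isn’t reproduced here, it simulates a stand-in data set. The column names (`followtime`, `churned`, `female`, `age`, `coupon`) and the effect sizes baked into the simulation are assumptions made up for illustration.

```r
library(survival)

# Simulated stand-in for the NetLixx data (hypothetical column names and effects).
set.seed(42)
n <- 1000
netlixx <- data.frame(
  female = rbinom(n, 1, 0.5),
  age    = round(runif(n, 18, 65)),
  coupon = rbinom(n, 1, 0.5)
)

# Monthly churn hazard: coupon users churn faster, older users slightly slower.
rate <- 0.05 * exp(log(1.8) * netlixx$coupon - 0.006 * (netlixx$age - 40))
time_to_churn <- rexp(n, rate)

# Stop observing everyone after 24 months; anyone still subscribed is censored.
netlixx$followtime <- pmin(time_to_churn, 24)
netlixx$churned    <- as.integer(time_to_churn <= 24)

# Fit the Cox model: Surv() pairs each follow-up time with its churn indicator.
fit <- coxph(Surv(followtime, churned) ~ female + age + coupon, data = netlixx)
summary(fit)
```

With the real data, you’d swap the simulation for a `read.csv()` call and use whatever column names the file actually has.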
And here are our results!
First of all, we can see that the coefficient for the female variable is not statistically significant. (If this were a real analysis, you’d probably remove it and run the model again, but this is a blog post with fake data…) But the other two variables did produce significant results! In order to interpret the magnitude of the effects, we’re most interested in that exp(coef) column. This gives us the multiplicative relationship between the two hazard (churn) rates. According to these results, the hazard rate for coupon users is 1.8 times the hazard rate for non-coupon users, so they churn 80% faster. Wow! The age results are a bit more complicated… The coefficient means that a 1-year increase in customer age multiplies the hazard rate by 0.994, which is a reduction of about 0.6%. So, we get a slight reduction in churn for every additional year of a customer’s age. Not bad! Let’s target those old guys!
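If the arithmetic behind those interpretations feels slippery, here’s a quick sketch that works through it using the two hazard ratios quoted above (1.8 and 0.994). The variable names are just illustrative.

```r
# Hazard ratios quoted in the text (the exp(coef) column).
hr_coupon <- 1.8
hr_age    <- 0.994

# Percent change in the hazard rate implied by each ratio.
pct_coupon <- (hr_coupon - 1) * 100   # coupon vs. no coupon
pct_age    <- (hr_age - 1) * 100      # per additional year of age

# Hazard ratios compound multiplicatively, so a 10-year age gap is hr_age^10.
hr_age_10yr <- hr_age^10

# The raw coefficients are the natural logs of the hazard ratios.
coef_coupon <- log(hr_coupon)

print(c(pct_coupon = pct_coupon, pct_age = pct_age,
        hr_age_10yr = hr_age_10yr, coef_coupon = coef_coupon))
```

So a customer who is ten years older has a hazard rate roughly 0.94 times that of the younger customer, or about 6% lower. The per-year effect looks tiny, but it adds up across a wide age range.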
Of course, as with regular regression, Cox regression is built on some assumptions and, if your data violates those assumptions, your statistics will be all wrong. And, unfortunately, Cox regression comes with one particularly big assumption: proportional hazards.
What does “proportional hazards” mean? Well, remember how we were talking about a multiplicative relationship between the baseline hazard rate and the hazard rate for a particular group? Like “coupon users churn 1.8 times faster than non-coupon users”? Yeah, well, Cox regression assumes that each of those multipliers stays constant over time. In other words, the model assumes the churn rate for coupon users is always 1.8 times the churn rate for non-coupon users, whether they signed up last week or two years ago.
But what if that’s not true? What if coupon users churn a lot faster in their first month (a lot of them take their free month, then quit), but then churn at regular rates later? In fact, if we look at the data, we see that’s exactly what happens.
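One way to eyeball this is to compare Kaplan-Meier estimates for the two groups at a few points in time. The sketch below does that on simulated stand-in data, not the real NetLixx file: the column names and the assumption that 30% of coupon users quit during their free month are made up for illustration.

```r
library(survival)

# Simulated stand-in data (hypothetical): coupon users get an extra burst of
# first-month churn, then behave like everyone else.
set.seed(42)
n <- 1000
coupon <- rbinom(n, 1, 0.5)
time_to_churn <- rexp(n, 0.05)
quits_after_free_month <- coupon == 1 & runif(n) < 0.3
time_to_churn[quits_after_free_month] <- runif(sum(quits_after_free_month), 0, 1)

# Censor everyone still subscribed at 24 months.
netlixx <- data.frame(
  coupon     = coupon,
  followtime = pmin(time_to_churn, 24),
  churned    = as.integer(time_to_churn <= 24)
)

# Kaplan-Meier curves split by coupon use.
km <- survfit(Surv(followtime, churned) ~ coupon, data = netlixx)

# Estimated survival at 1, 6, and 12 months for each group.
summary(km, times = c(1, 6, 12))
```

Calling `plot(km, lty = 1:2)` draws the two curves. If the coupon curve drops sharply in the first month and then runs roughly parallel to the other one, the gap between the groups is not a constant multiple over time.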
Looks like we might have “non-proportional hazards,” which means that our Cox regression results are potentially faulty. Dang it. But are we sure? Can we test for this? It turns out, we can test for non-proportional hazards. And we can test the whole model at the same time, so we get results for all of our variables, as well as the model as a whole. Once again, that’s a lot better than making graphs and taking guesses! Here’s how you run the test… yup, it’s that easy.
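In the survival package, the test is the `cox.zph()` function, which takes a fitted `coxph` model. Here’s a minimal sketch on simulated stand-in data rather than the real NetLixx file; the column names and the simulated first-month churn burst for coupon users are assumptions for illustration.

```r
library(survival)

# Simulated stand-in data (hypothetical column names and effects).
set.seed(42)
n <- 1000
netlixx <- data.frame(
  female = rbinom(n, 1, 0.5),
  age    = round(runif(n, 18, 65)),
  coupon = rbinom(n, 1, 0.5)
)

# Baseline churn depends slightly on age; 30% of coupon users quit in month one.
time_to_churn <- rexp(n, 0.05 * exp(-0.006 * (netlixx$age - 40)))
quits <- netlixx$coupon == 1 & runif(n) < 0.3
time_to_churn[quits] <- runif(sum(quits), 0, 1)

# Censor everyone still subscribed at 24 months.
netlixx$followtime <- pmin(time_to_churn, 24)
netlixx$churned    <- as.integer(time_to_churn <= 24)

fit <- coxph(Surv(followtime, churned) ~ female + age + coupon, data = netlixx)

# Test the proportional hazards assumption: one row per variable, plus GLOBAL.
zph <- cox.zph(fit)
print(zph)
```

A small p-value on a row flags that variable as violating proportional hazards. `plot(zph)` shows the scaled Schoenfeld residuals over time if you want to see where the effect drifts.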
And here are the results!
Since we’re testing for non-proportional hazards, statistical significance here is a bad thing. And, as we expected, we’ve got some pretty heavy-duty significance on that coupon variable, which is also invalidating the assumptions for the global model. If we want to make sure we have useful statistics, we’re going to have to remove the coupon variable from the model. Unfortunate – but that’s the way things go sometimes. If you don’t want to report garbage, you have to throw out the garbage.
Even though we had to drop the coupon variable, we still learned several important things from our Cox regression experiment. First, as people get older, they churn less. Second, there doesn’t seem to be a relationship between gender and churn (at least using this dummy data set). And, finally, our investigation into the non-proportionality of the coupon variable told us that coupon users tend to leave quickly in the first month, but they’re pretty much normal users after that. That’s still useful knowledge!
Of course, real data is rarely this clean, and often models will violate the proportional hazards assumption in ways that aren’t immediately obvious. (That’s why there’s a test for it.) In the next couple of weeks, we’ll discuss some statistical methods that we can use to analyze our data even when it violates the proportional hazards assumption… and we’ll link churn to revenue in the process. These new methods might not quite live up to the “gold standard” of Cox regression for survival analysis, but they’ll still generate extremely useful business insight. Check back soon!