Analyzing Customer Churn – Competing Risks

Every survival analysis method I've talked about so far in this series has had one thing in common: we've only looked at one event in a customer lifetime (churn). In many cases, that's a perfectly fine way to go about things... we want our customers to stick with us, so churn is the event of interest. So why would we ever need to think about competing risks?

[Image: Sharknado]

You know, competing risks. Will you die by tornado, or by shark?

There's actually a critical assumption undergirding most survival analysis methods for right-censored data - that censoring is non-informative, meaning censored individuals have the same likelihood of experiencing the event of interest as individuals who never got censored. If this assumption gets violated, things like Kaplan-Meier estimators can become wildly inaccurate. (If you need a refresher on Kaplan-Meier curves and other concepts, take a look at my earlier post on basic survival analysis.)

The problem

I'll give two examples of why this may be a problem. Suppose you were interested in analyzing churn for your basic service plan, and you had a significant number of users that had upgraded from your basic plan to your premium plan. You don't want to exclude the upgraders from your analysis, since a significant portion of their lifetime was spent on the basic plan. Of course, including them has problems too, since a significant portion of their lifetime wasn't on your basic plan.

So you decide to include them, but to end their "lifetime" at their upgrade date. But wait, do they churn at their upgrade date? Surely not - that would artificially inflate churn! So they must get censored at their upgrade date? That's not ideal either. Users that upgrade are likely very happy with your service. If your premium plan didn't exist, they may very well have stuck with your basic plan longer than the average user. So, your censored users would be less likely to churn than your non-censored users, and your fundamental assumptions would be violated.

[Figure: wrong competing risks methods]

If you count folks that upgrade as churning, you get an obviously wrong curve. But, even in this dummy data set I randomly generated, you also get different results if you simply censor those who upgrade.

This is even easier to understand if we consider the reverse... Suppose you were interested in studying the cumulative incidence of customers upgrading from your basic plan to your premium plan. (A cumulative incidence curve is just 100% minus the survival curve. So, you can say things like "10% of customers have upgraded by day 365" rather than "90% of customers have not upgraded by day 365.")

A very similar dilemma would apply in this case. Certainly, you wouldn't want to count churners as upgraders - you'd wildly inflate your upgrade rates. So do you count them as censored? Absolutely not! The average user who hasn't churned has an average chance of upgrading. A user who has churned has a 0% chance of upgrading! So, if you censored individuals at their churn date, you'd effectively be assuming that they would have gone on to upgrade at the standard rate, and you'd still be massively overstating your upgrade rate.
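To get a feel for the size of the problem, here's a minimal sketch on simulated data. This is not the dummy data set from this post: the rates, the 730-day cutoff, and the column names are all assumptions made up for illustration. It compares the naive estimate (1 minus a Kaplan-Meier curve with churners censored) against a competing risks estimate from survfit.

```r
library(survival)

# Hypothetical stand-in data: latent churn and upgrade times, observed for 730 days.
set.seed(42)
n <- 1000
churn_time   <- rexp(n, rate = 1 / 400)
upgrade_time <- rexp(n, rate = 1 / 900)
cutoff       <- 730
d <- data.frame(time = pmin(churn_time, upgrade_time, cutoff))
d$event <- as.integer(d$time < cutoff)               # 0 = still on basic plan at cutoff
d$type  <- ifelse(upgrade_time < churn_time, 2L, 1L) # 1 = churn, 2 = upgrade

# Naive approach: censor churners, then take 1 - Kaplan-Meier.
km <- survfit(Surv(time, event == 1 & type == 2) ~ 1, data = d)
naive_upgrade <- 1 - summary(km, times = 365)$surv

# Competing risks approach: churn and upgrade are separate states.
d$status <- factor(d$event * d$type, levels = 0:2,
                   labels = c("censored", "churn", "upgrade"))
cr <- survfit(Surv(time, status, type = "mstate") ~ 1, data = d)
row_365 <- max(which(cr$time <= 365))                # last observed time up to day 365
cr_upgrade <- cr$pstate[row_365, cr$states == "upgrade"]

# Share of customers estimated to have upgraded by day 365, under each approach.
c(naive = naive_upgrade, competing_risks = cr_upgrade)
```

The naive number should come out higher, because censoring churners treats them as if they could still upgrade later.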

The solution

Thankfully, statisticians have solved this problem using "competing risks" survival models. (I won't dive into the math in this post, but the NIH has a pretty easy-to-follow explanation of competing risks math.) These models essentially let you study more than one event, and learn about the probability of each event occurring.

So, let's get started coding this up in R! If you'd like to follow along, download my dummy competing risks data. It's simply time-to-event data, like usual, except that there are 2 event types indicated by the "type" variable - we'll say 1 for churn and 2 for upgrades.

We'll start with basic setup... load the survival library, and load the survival data set.
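Here's a rough sketch of that setup. Since the downloadable file isn't reproduced here, the code simulates a stand-in data frame with the same three columns described above (time, event, type). The simulation rates and the 730-day cutoff are assumptions for illustration only; with the real file you'd read it in with read.csv() instead.

```r
library(survival)

# Stand-in for the post's dummy data: one row per customer.
set.seed(42)
n <- 1000
churn_time   <- rexp(n, rate = 1 / 400)  # latent days until churn (assumed rate)
upgrade_time <- rexp(n, rate = 1 / 900)  # latent days until upgrade (assumed rate)
cutoff       <- 730                      # end of the observation window (assumed)

survival_data <- data.frame(time = pmin(churn_time, upgrade_time, cutoff))
survival_data$event <- as.integer(survival_data$time < cutoff)  # 1 = an event happened
# 1 = churn, 2 = upgrade; the value is meaningless on rows where event is 0
survival_data$type  <- ifelse(upgrade_time < churn_time, 2L, 1L)

head(survival_data)
```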

Next, like usual, we're going to create a survival variable in the data frame, except that this time we'll specify a formula that includes both the event indicator and the type indicator, and we'll specify that this should be an "mstate" (for "multistate") survival object.
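A minimal sketch of that step might look like the following. The data frame is the same simulated stand-in as before (repeated so the snippet runs on its own), and the state labels are my own choice. Multiplying event by type gives 0 for censored rows, 1 for churn, and 2 for upgrade; wrapping that in a factor whose first level is the censoring code is the form the survival package documents for multi-state data.

```r
library(survival)

# Simulated stand-in data (assumed rates and cutoff, not the post's file).
set.seed(42)
n <- 1000
churn_time   <- rexp(n, rate = 1 / 400)
upgrade_time <- rexp(n, rate = 1 / 900)
cutoff       <- 730
survival_data <- data.frame(time = pmin(churn_time, upgrade_time, cutoff))
survival_data$event <- as.integer(survival_data$time < cutoff)
survival_data$type  <- ifelse(upgrade_time < churn_time, 2L, 1L)

# Combine the two indicators: 0 = censored, 1 = churn, 2 = upgrade.
status <- factor(survival_data$event * survival_data$type,
                 levels = 0:2,
                 labels = c("censored", "churn", "upgrade"))

# Build the multi-state survival object and store it in the data frame.
survival_data$survival <- Surv(survival_data$time, status, type = "mstate")

survival_data$survival[1:6]
```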

That's pretty much all there is to it! You can now fit a survival curve and plot it just as you would for a regular survival curve!
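Putting the pieces together, a sketch of the fit and plot could look like this. As before, the data are a simulated stand-in with assumed rates, and the colors, axis labels, and legend are just illustrative choices.

```r
library(survival)

# Simulated stand-in data (assumed rates and cutoff, not the post's file).
set.seed(42)
n <- 1000
churn_time   <- rexp(n, rate = 1 / 400)
upgrade_time <- rexp(n, rate = 1 / 900)
cutoff       <- 730
survival_data <- data.frame(time = pmin(churn_time, upgrade_time, cutoff))
survival_data$event <- as.integer(survival_data$time < cutoff)
survival_data$type  <- ifelse(upgrade_time < churn_time, 2L, 1L)
status <- factor(survival_data$event * survival_data$type, levels = 0:2,
                 labels = c("censored", "churn", "upgrade"))
survival_data$survival <- Surv(survival_data$time, status, type = "mstate")

# Fit the competing risks curves, with no covariates.
fit <- survfit(survival ~ 1, data = survival_data)

# One curve per event type, drawn in state order (churn, then upgrade).
plot(fit, col = c("red", "blue"),
     xlab = "Days since signup", ylab = "Cumulative incidence")
legend("topleft", legend = c("churn", "upgrade"),
       col = c("red", "blue"), lty = 1)
```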

If you run this code, you'll get a plot that looks something like this:

[Figure: competing risks graph]

Accurate estimates of the true rate of upgrade and churn, with no assumptions violated.

You'll notice that this looks a little bit different from the survival analysis plots you're used to seeing... lines are going up and to the right, rather than down and to the right. That's because the plot function defaults to plotting "cumulative incidence" curves for competing risks data. As discussed above, this is simply the complement of a survival curve - the share of people experiencing an event rather than not experiencing an event.

Conclusion

That's it! You've now added another tool to your survival analysis arsenal. Of course, there are extensions of this basic version of competing risks analysis... competing risks regression, for example, is a way of looking at the effect of independent variables on survival rates, while accounting for competing risks. There's a lot out there, but it's beyond the scope of this post (though maybe there will be another later).
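For the curious, here's a hedged sketch of what competing risks regression can look like. It uses one option from the survival package, the Fine-Gray model via finegray() followed by a weighted coxph(). The data are simulated, and the "discount" covariate is entirely hypothetical: I've made discounted customers churn at half the usual rate so there's an effect to find.

```r
library(survival)

# Simulated data with one hypothetical covariate.
set.seed(42)
n <- 2000
discount     <- rbinom(n, 1, 0.5)  # 1 = signed up on a discount (made up)
churn_time   <- rexp(n, rate = ifelse(discount == 1, 0.5, 1) / 400)
upgrade_time <- rexp(n, rate = 1 / 900)
cutoff       <- 730
time <- pmin(churn_time, upgrade_time, cutoff)
code <- ifelse(time < cutoff, ifelse(churn_time < upgrade_time, 1, 2), 0)
d <- data.frame(time, discount,
                status = factor(code, levels = 0:2,
                                labels = c("censored", "churn", "upgrade")))

# finegray() expands the data into weighted intervals for the chosen event type.
fg_data <- finegray(Surv(time, status) ~ discount, data = d, etype = "churn")

# A weighted Cox model on the expanded data gives the Fine-Gray estimates.
fg_fit <- coxph(Surv(fgstart, fgstop, fgstatus) ~ discount,
                weights = fgwt, data = fg_data)

coef(fg_fit)
```

A negative coefficient here would mean the discounted group accumulates churn more slowly, once upgrades are accounted for as a competing event.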

Please follow up if you notice any problems or have any questions!

2 Responses

  1. Roland December 13, 2016 / 4:32 am

    Hi, I just found your blog today and wow, it’s great! (Why have you stopped posting?)
    Now for the questions:

    1, In this competing risk model does it mean that the upgraders have no probability of churning after they have upgraded? How could you extend the model?

    2, My background is in economics and while I have the basics of many things like probability theory, calculus, econometrics, etc; I am still a bit lost. How could I improve my knowledge and skills? I started working with R a year ago and do data analysis as well.

    On a sidenote: I found your churn posts especially interesting as at my previous company they used vanilla logistic regression for the same problem. The predictions were not usable as real probabilities but we used them to rank the customers.

    Why do you prefer survival analysis to logistic regression? (Also, I read a nice article of comparing linear regression and logistic regression demonstrating that while some assumptions are violated it still gives an empirically fine result
    http://folk.uio.no/stvoh1/Q%26Q%20Linear%20vs%20logisitic%20regression.pdf).

    Sorry, that was a lot 🙂
    Thank you for your work, I even whitelisted your domain on adblock and would definitely click on the donate button if there was one!

    Cheers,
    Roland

    • daynebatten December 15, 2016 / 3:23 pm

      Thanks for the kind words. I had a son about 6 months ago, so I’ve been a little preoccupied. Maybe I’ll come back to blogging sometime soon.

      1) It doesn’t mean that the upgraders have no risk of churning post-upgrade, it just means that we’re not thinking about that as part of the current model. The current model is for churn on the original plan, and upgrading can muck with our understanding of that.

      2) Good question. I actually came from the social sciences (sociology, then public administration) and am self-taught on most of this stuff. I know it’s not an easy answer, but for me, what worked was basically being curious. I’d find an interesting data problem that was beyond my current capacity to solve, figure out what sort of things I’d have to learn to tackle it, and go out and try to learn those things (usually for free, via Google, Stack Exchange, etc.). I’ve come a long way, and I’m still learning this way every day.

      As far as logistic regression goes, there’s really no reason you can’t use it for an application like this. But how you handle time would be very different… For example, Cox regression really naturally handles situations where, say, churn is higher across the board during certain periods of customer lifetime. You’d have to include dummies for that in a logit model. There would also be interesting challenges related to the censoring of the data that you’d have to think through. Ultimately, I’d say Cox regression is likely a better choice if you want to get a holistic understanding of when users are churning and why. Logit models may be more helpful if you’re more interested in predicting which users are likely to churn soon…

      Hope those are helpful thoughts!
