My previous series of guides on survival analysis and customer churn has become by far the most popular content on this blog, so I'm coming back around to introduce some more advanced techniques...
When you're using Cox regression to model customer churn, you're often interested in the effects of variables that change throughout a customer's lifetime. For instance, you might want to know how many times a customer has contacted support, how many times they've logged in during the last 30 days, or what web browser(s) they use. If you have, say, 3 years of historical customer data and you set up a Cox regression on that data using covariate values that apply to customers right now, you'll essentially be regressing customers' churn hazards from months or years ago on their current characteristics. Your model will be allowing the future to predict the past. Not terribly defensible.
In this post, we'll walk through how to set up a Cox regression using "time-dependent covariates," which lets us model historical hazard rates against the covariate values that applied at the time.
Setting up the data
Much like any statistical project, the hardest part of Cox regression with time-dependent covariates is setting up the data. In traditional survival analysis, you usually have one record per subject (in our case, a customer), which simply includes the customer's age (either at present, or on the day she churned) and a dummy variable indicating whether the customer churned or was censored. If any covariates (say, gender) are going to be added to the survival model, they're simply added to the single record for each subject. Easy.
Time-varying covariates make this a little bit more complicated. To use a time-varying covariate, you must divide a customer's lifetime into "chunks" where the various values of the covariates apply. For example, check out this snippet of data below that includes survival data, plus an indicator showing whether a customer has contacted support:
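The original snippet isn't reproduced here, so below is a hypothetical reconstruction consistent with the description that follows; the column names (`id`, `start`, `end`, `contacted_support`, `churned`) are my assumptions about the layout.

```r
# Hypothetical reconstruction of the data snippet; column names are assumptions.
# Each row is one "chunk" of a customer's life over which the covariate is constant.
churn_data <- data.frame(
  id                = c(1000, 1001, 1001),
  start             = c(0,    0,    649),
  end               = c(1000, 649,  655),
  contacted_support = c(0,    0,    1),
  churned           = c(0,    0,    1)
)
churn_data
```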
Instead of simply an end time and a churn indicator, we now have an additional start time variable. Using the start time and end time, we can now break a customer's lifetime into pieces. For example, in the data above, customer 1000 has been around for 1000 days, has never contacted support, and hasn't churned yet. Customer 1001 first contacted support on day 649 (and therefore hadn't contacted support on days 0-648), then churned on day 655.
Now, getting your data structured this way may not seem too difficult and, for one variable, it's not that bad. But there are several complicating factors, which I discuss below.
For now, on to the modeling! If you'd like to work with the full set of dummy data used for this post, you can grab it here.
Doing some analysis!
Once you have your data set up, running the actual Cox regression looks pretty much like running any other Cox regression.
If you run this code on my dummy data, you'll get something that looks like this...
These results indicate that customers who have contacted support churn at 1.89 times the rate of those who haven't - see the exp(coef) for contacted_support. That's a highly statistically significant result. You'll also notice that the proportional hazards test rejects. That's a red flag that the assumptions of the Cox regression are being violated. If this were a real-world project, we'd probably want to go back and tweak some things. But, this is just a blog post, so we'll move on for demonstration purposes!
In basic survival analysis, we set up lots of cool little plots showing the survival curves for folks in different cohorts. But that's a little bit of a problem here... While a male customer will remain in the male cohort for the entirety of his customer life (barring extremely rare events), a customer could go from the "hasn't contacted support" to "has contacted support" camp at any time. So, we can't just plot the differences between the cohorts for their entire customer lives.
However, what we can do is plot survival curves for somebody who contacted support on an arbitrarily selected day. To do this, we'll actually be passing the results of our Cox regression, along with some fake data on theoretical customers, into R's survfit function.
Let's start by setting up some fake data, with one imaginary customer who never contacts support, and another one that contacts support on day 500 of their customer life. We'll do it the long way to make sure it's all clear:
Now, to plot that data, we're simply going to pass it into survfit along with our cox results, and plot as we usually would!
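The original data-setup and plotting code isn't shown here, so below is a self-contained sketch of both steps; it re-creates a small invented fit so the snippet runs on its own, and all of the names are assumptions:

```r
library(survival)

# Invented mini dataset and fit so the sketch runs on its own.
d <- data.frame(
  id                = c(1, 2, 2, 3, 4, 4, 5, 6),
  start             = c(0, 0, 300, 0, 0, 100, 0, 0),
  end               = c(900, 300, 420, 500, 100, 250, 700, 800),
  contacted_support = c(0, 0, 1, 0, 0, 1, 0, 0),
  churned           = c(0, 0, 1, 1, 0, 1, 0, 1)
)
fit <- coxph(Surv(start, end, churned) ~ contacted_support, data = d)

# Two imaginary customers, the long way:
#   "never"  -- one chunk, covariate stays 0 for the whole follow-up window
#   "day500" -- covariate is 0 on (0, 500], then flips to 1 afterwards
fake <- data.frame(
  id                = c("never", "day500", "day500"),
  start             = c(0,   0,   500),
  end               = c(900, 500, 900),
  contacted_support = c(0,   0,   1),
  churned           = 0
)

# The id argument stitches each customer's chunks back into a single curve.
sf <- survfit(fit, newdata = fake, id = id)
plot(sf, col = c("blue", "red"),
     xlab = "Days as a customer", ylab = "Survival probability")
legend("bottomleft",
       legend = c("Never contacted support", "Contacted support on day 500"),
       col = c("blue", "red"), lty = 1)
```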
If you run this, you'll get a pretty nice visual representation of the differences in survival chances for a customer that never contacted support, and one that contacted support at day 500. Awesome!
Congratulations! That's pretty much all there is to it. You're now doing Cox regression with time-varying covariates!
Really setting up the data
Earlier in the post, I mentioned that setting up data for this type of model can be a hassle, and I'd like to circle back to that for a bit. Consider these complicating factors:
- If you're going to use historical data in this type of model, you actually have to preserve historical data. You need to structure your data warehouse so that you know not only what information applies to your customers right now, but also what information applied to them at every point in their life. Even for a relatively small customer base, storing this type of data will cause your data warehouse to grow very quickly.
- You may have to divide a customer's life into a lot of "chunks." In the earlier example, our only covariate was a dummy for whether the customer had contacted support, which meant we had a maximum of 2 chunks per customer. But what if we had 10 covariates... that could take on several values... and could change multiple times in a customer's life? Every change to any covariate adds another cut point, so the number of chunks each lifetime has to be divided into balloons quickly, and the code to build them would soon get out of hand.
There are many ways to deal with these types of issues, but here are a few techniques I've used in my work at Republic Wireless that you may want to consider:
- Track critical customer data every day. For example, if you have different service plans, track what service plan somebody is on each and every day of their customer life. It will come in handy when you want to do survival analysis.
- But don't necessarily feel like you need to store a row for every day. If you've got a million customers and you're storing one record per day, you'll be storing over a billion values before you know it... Instead, store data in "chunks" as well, and update the chunks each day. For example, you might know that "Bob" has been on your "Platinum Plan" from 2014-10-01 to today. Instead of adding a record tomorrow, you can simply have your daily ETL update the "end date" of that Platinum Plan record to show the new date. If Bob ever changes plans, then you can add a new record.
- When you do your analysis, consider using 1-day chunk sizes. (Yes, I know this seems to contradict what I just said.) It may be easier to simply create one record per subject per day than it is to go through all the crazy combinatorics required to appropriately size the chunks for each individual when multiple variables get involved.
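One way to get fixed-size chunks without hand-rolled combinatorics is the survival package's survSplit function, which cuts each lifetime at the points you specify. A sketch with a single invented five-day customer (column names are assumptions):

```r
library(survival)

# One invented customer who churned on day 5.
d <- data.frame(id = 1, day = 5, churned = 1)

# Cut the lifetime at the end of each day, producing one (start, end] row per day;
# per-day covariate values could then be merged on by (id, day).
daily <- survSplit(Surv(day, churned) ~ ., data = d, cut = 1:4,
                   start = "start", end = "end")
daily   # five one-day rows; churned is 1 only on the final (4, 5] row
```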
- Sample individuals, not chunks. If you've got a million customers, you probably don't need to use them all to do your survival analysis. Especially as each customer takes on several rows to cover different time periods, the data can start bogging R down very quickly. A randomly selected sample of, say, 10,000 customers could be just what the doctor ordered.
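The distinction above matters in code: you sample the customer ids, then keep every chunk row belonging to the sampled ids. A sketch on a toy data frame (the names `df`, `id`, and the sample size are assumptions):

```r
# Toy data frame: 100 customers with 1-3 chunk rows each (invented).
set.seed(42)
df <- data.frame(id = rep(1:100, times = sample(1:3, 100, replace = TRUE)))

# Sample customer ids, then keep *all* chunk rows for the sampled customers --
# sampling raw rows instead would tear individual histories apart.
keep    <- sample(unique(df$id), size = 25)
df_samp <- df[df$id %in% keep, ]

length(unique(df_samp$id))   # 25 customers, each with a complete history
```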
- Finally, look into the "tmerge" function in R's survival package. It can take two separate historical data sets on individuals and combine them together, creating the necessary time chunks automatically. I prefer doing most of my data setup in SQL in our data warehouse (since it's much higher-performing than my workstation), but if you like doing things in R, this is a good way to go.
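Here's a sketch of that tmerge workflow on invented data (the names are assumptions; the pattern follows the survival package's time-dependent covariates vignette):

```r
library(survival)

# One row per customer: follow-up time and whether they churned (invented).
base <- data.frame(id = 1:3, futime = c(1000, 655, 420), churn = c(0, 1, 1))

# One row per support contact: customer 2 contacted support on day 649.
support <- data.frame(id = 2, contact_day = 649)

# The first call establishes each customer's time range and the churn event;
# the second folds in a 0/1 time-dependent covariate, splitting customer 2's
# lifetime at day 649 automatically.
td <- tmerge(base, base, id = id, churned = event(futime, churn))
td <- tmerge(td, support, id = id, contacted_support = tdc(contact_day))
td[, c("id", "tstart", "tstop", "contacted_support", "churned")]
```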
Feel free to follow up with any questions or comments you may have. I'd especially be interested in others' suggestions for working with the tmerge function or otherwise preparing the data! Perhaps I'll do a full post on tmerge/SQL data preparation at a later date...