Every survival analysis method I've talked about so far in this series has had one thing in common: we've only looked at one event in a customer lifetime (churn). In many cases, that's a perfectly fine way to go about things... we want our customers to stick with us, so churn is the event of interest. So why would we ever need to think about competing risks?
You know, competing risks. Will you die by tornado, or by shark?
There's actually a critical assumption undergirding most survival analysis methods for right-censored data - that censored individuals have the same likelihood of experiencing the event of interest as individuals that never got censored. If this assumption ever gets violated, things like Kaplan-Meier estimators can become wildly inaccurate. (If you need a refresher on Kaplan-Meier curves and other concepts, take a look at my earlier post on basic survival analysis.)
If you've ever done churn analysis using cox regression with time-dependent covariates, you know that the hardest part of doing that type of research is building your base data set. You have to divide each customer's lifetime into "chunks" where the changing values of a host of different predictor variables apply. I've coded this in SQL before, and it gets ugly. Fast.
Thankfully, R has a solution for this that makes life significantly more simple. It's the "tmerge" function, and it's in the survival package. Today, we'll walk through how to use it.
There are numerous strategies for dealing with non-proportional hazards in cox regression analysis. You can stratify your data based on a problematic variable. You can chuck the cox model and create "pseudo-observations" to analyze the gains (or losses) in lifetime within a certain period associated with changes in a variable. If age is a problem (unlikely for customer churn, but it happens a lot in medical contexts), you can use age rather than time in the cohort as your time scale. The list goes on.
But this is statistics! We're supposed to be modeling things!
Statistics. Almost as cool as Sparta.
Well, as it turns out, it's actually possible to directly model how the effects of a variable change with time, so that you can not only handle the proportional hazards problem, but also get a reliable estimate of how hazard ratios for a given variable change with time. The way to do this is actually incredibly simple... we introduce an interaction term between the variable of interest and time. Let's get started!
My previous series of guides on survival analysis and customer churn has become by far the most popular content on this blog, so I'm coming back around to introduce some more advanced techniques...
When you're using cox regression to model customer churn, you're often interested in the effects of variables that change throughout a customer's lifetime. For instance, you might be interested in knowing how many times that customer has contacted support, how many times they've logged in during the last 30 days, or what web browser(s) they use. If you have, say, 3 years of historical customer data and you set up a cox regression on that data using covariate values that are applicable to customers right now, you'll essentially be regressing customer's churn hazards from months or years ago on their current characteristics. Your model will be allowing the future to predict the past. Not terribly defensible.
In the classic double-slit experiment, past events are seemingly affected by current conditions. But unless you're a quantum physicist or Marty McFly, you're probably not going to see causality working this way.
In this post, we'll walk through how to set up a cox regression using "time-dependent covariates," which will allow us to model historical hazard rates on variables whose values were applicable at the time.
In a past post on analyzing churn in the subscription or Software as a Service business, I talked about two different ways to quantify the dollar cost of churn. You could use 1 / churn as an estimation of mean customer lifetime (though this simple method makes a lot of assumptions). Or, you could use “pseudo-observations” to calculate the dollar value of certain groups of customers during a particular time period (which doesn’t let you quantify the full lifetime value of a customer).
But what if there was another way? What if we took our Kaplan-Meier best estimate of our churn curve, fit a linear model to that model, and then projected it out?
A model within a model, if you will. Churnception.
Well, as it turns out, we’d get a reasonable estimation of our lifetime churn curve, which would let us estimate average customer lifetime, and customer lifetime value. Let’s get started.
Imagine for a moment that you’ve pulled together the mother of all churn data sets. You’ve got customer lifetime data, demographic data, and usage information. You know how many support tickets your customers have submitted, what those tickets were about, and whether they were happy with the customer service they received. You know what some of your customers had for breakfast this morning. OK, maybe not the breakfast thing. But it’s a lot of data.
Excited about your work, you pop all of this into a cox regression model, and the proportional hazards test blows up. Majorly. You take most of the variables out of the model, and you’re left with some analysis that doesn’t violate any key assumptions, but that also doesn’t tell you much of anything. What do you do?
You, sir, are stuck! Or are you?
One of the easiest ways to tackle these challenges is to create “pseudo-observations” from your survival data. These pseudo-observations can be plugged into regular statistical models that don’t have a proportional hazards assumption. It’s a great way out of a tight spot.