There are numerous strategies for dealing with non-proportional hazards in cox regression analysis. You can stratify your data based on a problematic variable. You can chuck the cox model and create "pseudo-observations" to analyze the gains (or losses) in lifetime within a certain period associated with changes in a variable. If age is a problem (unlikely for customer churn, but it happens a lot in medical contexts), you can use age rather than time in the cohort as your time scale. The list goes on.
But this is statistics! We're supposed to be modeling things!
Well, as it turns out, it's actually possible to directly model how the effects of a variable change with time, so that you can not only handle the proportional hazards problem, but also get a reliable estimate of how hazard ratios for a given variable change with time. The way to do this is actually incredibly simple... we introduce an interaction term between the variable of interest and time. Let's get started!
Exploring non-proportional hazards
If you'd like to follow along, you can go ahead and download the CSV of dummy customer data I used for this post. It has only 3 variables - time, a churn binary, and a male binary. It's intentionally structured so that hazards between the genders are non-proportional. Men churn more slowly, and the effect of being male gets significantly stronger as time goes on.
We'll start by going about our usual business, doing a simple cox regression of survival on gender.
Whoops! Have a look at the results of that proportional hazard test!
The statistical test rejects massively, demonstrating that we've got non-proportional hazards. We can actually see this graphically by plotting the results of the proportional hazards test.
This code plots scaled versions of "Schoenfeld residuals" for our model. Named after David Schoenfeld, the Harvard Medical School professor who first proposed them in 1982, these residuals represent the difference between the real and expected values of predictor variables for an individual that experienced the event of interest (in our case, churning). For example, if a male (male = 1) churned, and our expected value was leaning female (say, male = .45), then the Schoenfeld residual would be 1 - .45 = .55. As you might have guessed since we're dealing with actual and expected values of regressors, this results in one set of residuals for every predictor variable.
The plot of these (standardized) residuals from the R code above looks something like this:
In this case, if hazards for each gender were proportional, we'd expect these residuals to be completely uncorrelated with time. However, there's obviously a correlation! As time goes on, the residuals are getting more negative, implying that individuals experiencing the event are becoming more female than expected (numbers close to 0 - numbers close to 1 = negative). And, as I discussed above, this is what we should expect - the effect of being male gets stronger as customers get older.
But you already know you have the proportional hazards problem. You want to know how to fix it!
Introducing time-dependent coefficients
So, as I mentioned above, we're simply going to add an interaction with time to the model - in this case between male and time. You might be tempted to do something like the code below. Don't. I can't emphasize enough how wrong this is.
The problem with this is that the interaction term is an interaction of male and the time the customer churned or was censored not an interaction between male and the changing value of time applicable on each individual day of the customer's life. For example, on day 1, the value of the interaction term needs to be male * 1, not male * 723 (or whatever).
Luckily, the survival package provides for "time transform" functionality that makes this really, really easy to set up. So, the proper way to do things is something like this:
The time-transform function we created here simply multiplies the value passed to it by time. When we apply it to male in the formula, we end up with a simple interaction between male and time. Easy. (It's worth noting that it's most common to use x * log(t) rather than x * t as the time transform function, but that failed miserably with this dummy data.) Our results look something like this:
Unfortunately, this is slightly harder to interpret than your simple cox regression. Instead of just looking at the exp(coef) column to see the increase in the hazard associated with maleness, we have to do a little math with the raw coefficients. At time T, the hazard associated with maleness is given by the formula exp(-.47 - .0009 * T).
For convenience, we can build a function to give us this value for arbitrary time periods, then graph it in R.
This shows exactly what you'd expect! At time 0, males churn about 63% as fast as females. By time 1000, they're churning only 26% as fast. This is exactly the relationship I built into the data, and which we observed in the Schoenfeld residuals above.
So, there you go! We've successfully modeled the (changing) effects of maleness, and taken care of the proportional hazards issue.
That's it! A statistically rigorous way to model non-proportional hazards and get some meaningful, interpretable results. If you've got any questions or comments, please follow up! Always happy to entertain questions and good ideas.
(Also, this is totally off-topic, but has anybody else noticed how the "this is Sparta" meme doesn't use a screen grab from the actual "this is Sparta" scene? Frustrating.)