There are numerous strategies for dealing with non-proportional hazards in Cox regression analysis. You can stratify your data on the problematic variable. You can chuck the Cox model and create "pseudo-observations" to analyze the gains (or losses) in lifetime within a certain period associated with changes in a variable. If age is the problem (unlikely for customer churn, but common in medical contexts), you can use age rather than time in the cohort as your time scale. The list goes on.

But this is statistics! We're supposed to be modeling things!

Well, as it turns out, it's actually possible to directly model how the effects of a variable change with time, so that you can not only handle the proportional hazards problem, but also get a reliable estimate of how hazard ratios for a given variable change with time. The way to do this is actually incredibly simple... we introduce an interaction term between the variable of interest and time. Let's get started!

## Exploring non-proportional hazards

If you'd like to follow along, you can go ahead and download the CSV of dummy customer data I used for this post. It has only 3 variables - time, a churn binary, and a male binary. It's intentionally structured so that hazards between the genders are non-proportional. Men churn more slowly, and the effect of being male gets significantly stronger as time goes on.

We'll start by going about our usual business, fitting a simple Cox regression of survival on gender.
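Since the post's CSV isn't bundled with this page, here's a minimal sketch of that first fit and the proportional hazards test. The simulated `churn_df` data frame is my own stand-in (an assumption): it has the same three columns (`time`, `churn`, `male`) and is built so the effect of being male strengthens over time.

```r
library(survival)

# Simulated stand-in for the post's CSV (an assumption; swap in read.csv()
# on the real file if you downloaded it). Males get a decreasing-hazard
# Weibull so the effect of being male strengthens over time.
set.seed(42)
n    <- 500
male <- rbinom(n, 1, 0.5)
raw  <- ifelse(male == 1,
               rweibull(n, shape = 0.6, scale = 400),  # males: churn slows down
               rexp(n, rate = 1 / 200))                # females: constant hazard
churn_df <- data.frame(time  = pmin(raw, 1000),        # censor at t = 1000
                       churn = as.integer(raw <= 1000),
                       male  = male)

# Simple Cox regression of survival on gender
fit <- coxph(Surv(time, churn) ~ male, data = churn_df)
summary(fit)

# Test the proportional hazards assumption
cox.zph(fit)
```

On the real data, you'd just `read.csv()` the downloaded file into `churn_df` and run the same two calls.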

Whoops! Have a look at the results of that proportional hazards test!

The statistical test rejects massively, demonstrating that we've got non-proportional hazards. We can actually see this graphically by plotting the results of the proportional hazards test.
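A sketch of that diagnostic plot, again on simulated stand-in data (the data-generating code is my assumption, not the post's CSV):

```r
library(survival)

# Simulated stand-in data with non-proportional hazards (assumption)
set.seed(42)
n    <- 500
male <- rbinom(n, 1, 0.5)
raw  <- ifelse(male == 1, rweibull(n, shape = 0.6, scale = 400), rexp(n, rate = 1 / 200))
churn_df <- data.frame(time = pmin(raw, 1000), churn = as.integer(raw <= 1000), male = male)

fit <- coxph(Surv(time, churn) ~ male, data = churn_df)

# Plot scaled Schoenfeld residuals against time; a flat band around zero
# is what proportional hazards would look like
zph <- cox.zph(fit)
plot(zph)
abline(h = 0, lty = 2)
```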

This code plots scaled versions of "Schoenfeld residuals" for our model. Named after David Schoenfeld, the Harvard Medical School professor who first proposed them in 1982, these residuals represent the difference between the observed and expected values of a predictor variable for each individual who experienced the event of interest (in our case, churning). For example, if a male (male = 1) churned, and our expected value was leaning female (say, male = .45), then the Schoenfeld residual would be 1 - .45 = .55. As you might have guessed, since we're dealing with actual and expected values of regressors, this results in one set of residuals for every predictor variable.
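If you want the residuals themselves rather than the plot, `residuals()` will hand them over. A sketch on the same kind of simulated stand-in data (an assumption, not the post's CSV):

```r
library(survival)

# Simulated stand-in data (assumption)
set.seed(42)
n    <- 500
male <- rbinom(n, 1, 0.5)
raw  <- ifelse(male == 1, rweibull(n, shape = 0.6, scale = 400), rexp(n, rate = 1 / 200))
churn_df <- data.frame(time = pmin(raw, 1000), churn = as.integer(raw <= 1000), male = male)

fit <- coxph(Surv(time, churn) ~ male, data = churn_df)

# One Schoenfeld residual per *event* (churn), named by event time;
# with several predictors you'd get one column per predictor
sch <- residuals(fit, type = "schoenfeld")
head(sch)
```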

The plot of these (standardized) residuals from the R code above looks something like this:

In this case, if hazards for each gender were proportional, we'd expect these residuals to be completely uncorrelated with time. However, there's obviously a correlation! As time goes on, the residuals are getting more negative, implying that the individuals experiencing the event are more female than expected (numbers close to 0 minus numbers close to 1 = negative). And, as I discussed above, this is what we should expect - the effect of being male gets stronger as customers get older.

But you already know you have the proportional hazards problem. You want to know how to fix it!

## Introducing time-dependent coefficients

So, as I mentioned above, we're simply going to add an interaction with time to the model - in this case between male and time. You might be tempted to do something like the code below. Don't. I can't emphasize enough how wrong this is.
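For concreteness, the tempting-but-wrong model looks like this (sketched on simulated stand-in data, since the post's CSV isn't bundled here):

```r
library(survival)

# Simulated stand-in data (assumption)
set.seed(42)
n    <- 500
male <- rbinom(n, 1, 0.5)
raw  <- ifelse(male == 1, rweibull(n, shape = 0.6, scale = 400), rexp(n, rate = 1 / 200))
churn_df <- data.frame(time = pmin(raw, 1000), churn = as.integer(raw <= 1000), male = male)

# WRONG: `time` here is each customer's *final* follow-up time, so the
# "interaction" is with a fixed number, not with time as it passes
bad_fit <- coxph(Surv(time, churn) ~ male + male:time, data = churn_df)
```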

The problem with this is that the interaction term is an interaction of male and *the time the customer churned or was censored* not an interaction between male and *the changing value of time applicable on each individual day of the customer's life*. For example, on day 1, the value of the interaction term needs to be male * 1, not male * 723 (or whatever).

Luckily, the survival package provides "time transform" functionality that makes this really, really easy to set up. So, the proper way to do things is something like this:
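A sketch of the time-transform version, again on simulated stand-in data (an assumption; on the real CSV you'd fit the same formula):

```r
library(survival)

# Simulated stand-in data (assumption)
set.seed(42)
n    <- 500
male <- rbinom(n, 1, 0.5)
raw  <- ifelse(male == 1, rweibull(n, shape = 0.6, scale = 400), rexp(n, rate = 1 / 200))
churn_df <- data.frame(time = pmin(raw, 1000), churn = as.integer(raw <= 1000), male = male)

# tt() hands our function the covariate values and the *current* time at
# every event time, so x * t really is an interaction of male with time
fit_tt <- coxph(Surv(time, churn) ~ male + tt(male), data = churn_df,
                tt = function(x, t, ...) x * t)
summary(fit_tt)
```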

The time-transform function we created here simply multiplies the value passed to it by time. When we apply it to male in the formula, we end up with a simple interaction between male and time. Easy. (It's worth noting that it's most common to use x * log(t) rather than x * t as the time transform function, but that failed miserably with this dummy data.) Our results look something like this:

Unfortunately, this is slightly harder to interpret than your simple Cox regression. Instead of just looking at the exp(coef) column to see the change in the hazard associated with maleness, we have to do a little math with the raw coefficients. At time T, the hazard ratio associated with maleness is given by the formula exp(-.47 - .0009 * T).

For convenience, we can build a function to give us this value for arbitrary time periods, then graph it in R.
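Here's one way to sketch that helper, plugging in the rounded coefficients from above (-0.47 and -0.0009); exact values will depend on your own fit:

```r
# Hazard ratio for males vs. females at time t, from the fitted model's
# (rounded) coefficients: exp(b1 + b2 * t)
male_hazard_ratio <- function(t) exp(-0.47 - 0.0009 * t)

male_hazard_ratio(0)     # 0.625 -> males churn ~63% as fast at time 0
male_hazard_ratio(1000)  # 0.254 -> ~26% as fast by time 1000 (rounding aside)

# Graph the hazard ratio over the first 1000 time units
curve(male_hazard_ratio, from = 0, to = 1000,
      xlab = "Time", ylab = "Hazard ratio (male vs. female)")
```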

This shows exactly what you'd expect! At time 0, males churn about 63% as fast as females. By time 1000, they're churning only 26% as fast. This is exactly the relationship I built into the data, and which we observed in the Schoenfeld residuals above.

So, there you go! We've successfully modeled the (changing) effects of maleness, and taken care of the proportional hazards issue.

## Conclusion

That's it! A statistically rigorous way to model non-proportional hazards and get some meaningful, interpretable results. If you've got any questions or comments, please follow up! Always happy to entertain questions and good ideas.

*(Also, this is totally off-topic, but has anybody else noticed how the "this is Sparta" meme doesn't use a screen grab from the actual "this is Sparta" scene? Frustrating.)*

**Randi** (May 9, 2016 / 7:00 pm):

Thanks for the informative post. I am new to survival analysis and have found myself grappling with non-proportional hazards. In “Introducing Survival and Event History Analysis” by Melinda Mills, they actually suggest the ‘time:variable’ approach to time-varying effects (on page 154, in the section on dealing with non-proportional hazards). The same approach is also advocated as a way to deal with non-proportional hazards in “Cox Proportional-Hazards Regression for Survival Data in R: An Appendix to An R Companion to Applied Regression, Second Edition”, by J. Fox and S. Weisberg.

Tell me if I understand correctly: the two approaches represent different statistical models. First, in R code:

1. `coxph(Surv(time, death) ~ male + male:time)`

2. `coxph(Surv(time, death) ~ male + tt(male), tt = function(x, t, ...) x * t)`

Now the underlying models:

1. `h(t|X) = h0(t) exp(B1*X + B2*X*T)`

2. `h(t|X) = h0(t) exp(B1*X + B2*X*t)`

where T is the censoring or event time in the data, and t is time in the model. The second approach definitely makes more sense to me, because you can interpret B1 and B2 as the intercept and slope of a line describing how the log hazard ratio changes over time in the model. In contrast, B1 and B2 are harder to interpret for the first model… B1 is the log hazard ratio when T=0 (which is kind of nonsensical, because we can’t have T=0 in the data), and B2 is weird because it shows how the log hazard ratio depends on censoring/event times, which is circular, because event times are the response variable (this point is made in Terry Therneau’s vignette “Using Time Dependent Covariates and Time Dependent Coefficients in the Cox Model”).

Still, it seems to me that in practice, if there is a linear relationship between a coefficient and time, then Model 1 will capture this. The precise meaning of the coefficients is less clear, but the model should still show that 1) a positive B1 means the hazard increases initially, and vice versa, and 2) a positive B2 means the effect increases over time, and vice versa. I would think that the ‘time:variable’ approach is better than ignoring the proportional hazards problem completely, and if your primary interest is in other variables in the model, then incorporating time-varying effects in this way would give you more accurate results for those other variables. I’m not arguing for using the ‘time:variable’ method; it seems incorrect. I am just curious HOW incorrect it is in practice. If it is really unforgivable to model time-varying effects that way, then it is a bit concerning that it is promoted in two fairly popular textbooks.

**daynebatten** (May 10, 2016 / 8:35 am):

Very interesting. Thanks for the comment!

I haven’t been able to look at the Mills book (I’ll need to swing by a library), but I did take a look at Fox and Weisberg. In their case, they advocate the “time:variable” approach in reference to their “Rossi.2” data frame. If you look at that data set closely, you’ll notice that the start and stop times are only 1 unit of time apart. In other words, they’ve created one observation per prisoner per week since release (it’s recidivism data). They then create an interaction term between one of their covariates and the end time. In this case, because each observation represents a single week, the interaction term will represent the value of time during each individual week (as opposed to the future value of time when the prisoner was either censored or experienced an event). Effectively, this produces the same result as the tt method.

So, perhaps I should update my statement a bit… “time:variable” is only acceptable in situations where you produce one observation per individual per unit of time. Does that make sense?
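A sketch of that equivalence on simulated stand-in data (my own construction, not the Rossi data): split each record at every observed event time with `survSplit()`, and the `male:time` interaction on the split data matches the `tt()` fit on the original data.

```r
library(survival)

# Simulated stand-in data (assumption), kept small so the split stays manageable
set.seed(42)
n    <- 200
male <- rbinom(n, 1, 0.5)
raw  <- ifelse(male == 1, rweibull(n, shape = 0.6, scale = 40), rexp(n, rate = 1 / 20))
d <- data.frame(time = pmin(raw, 100), churn = as.integer(raw <= 100), male = male)

# One (tstart, time] counting-process row per subject per interval
# between observed event times
long <- survSplit(Surv(time, churn) ~ ., data = d,
                  cut = sort(unique(d$time[d$churn == 1])))

# In the split data, `time` is the end of each short interval, so
# male:time now interacts male with the current time, as intended
fit_split <- coxph(Surv(tstart, time, churn) ~ male + male:time, data = long)

# The tt() version on the original one-row-per-subject data
fit_tt <- coxph(Surv(time, churn) ~ male + tt(male), data = d,
                tt = function(x, t, ...) x * t)

rbind(split = coef(fit_split), tt = coef(fit_tt))  # essentially identical
```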

I’d be interested to hear if the Mills book is doing the same thing…

**Eugenia** (January 13, 2017 / 6:10 am):

Hi, I found your article very interesting, since it’s the only one that I could find with an explanation of HOW it all runs in R. THANKS for that!

I downloaded your data set to reproduce the results, and what seems weird to me is that male is not a factor; it’s simply an integer, which doesn’t make sense to me. Even though it’s just 0 and 1, these are numbers and not factor levels. Could you please comment on this?

**daynebatten** (January 17, 2017 / 8:12 pm):

Absolutely… Regression operates solely on numbers, so people commonly use “dummy variables” which can take on a value of 1 or 0 to represent categories like “male” and “female.” Of course, R has a concept of a factor, and you could certainly use that. However, your factor would ultimately be converted to a numeric dummy variable “under the hood” by R when you ran the regression anyway…

**Olivier CHOTIN** (January 14, 2018 / 9:08 am):

Hi,

did you test WTTE-RNN in the case of “customer” churn? (I saw your test with the “jet engine failure data from NASA”…)

Thanks a lot.

**daynebatten** (January 30, 2018 / 11:25 am):

I did, though we never took it as far as we could have. Ultimately, the problem ended up being more of a data challenge than a modeling challenge – a lot of the major predictors of customer churn were actually indicators that the customer had already made the decision to leave but hadn’t fully cancelled yet. For example (I’m in telecom), minutes of usage would drop off substantially before the user cancelled, but that was because they had already signed up for another phone service.