Analyzing Customer Churn – Pseudo-Observations

Imagine for a moment that you’ve pulled together the mother of all churn data sets. You’ve got customer lifetime data, demographic data, and usage information. You know how many support tickets your customers have submitted, what those tickets were about, and whether they were happy with the customer service they received. You know what some of your customers had for breakfast this morning. OK, maybe not the breakfast thing. But it’s a lot of data.

Excited about your work, you pop all of this into a Cox regression model, and the proportional hazards test blows up. Majorly. You take most of the variables out of the model, and you’re left with an analysis that doesn’t violate any key assumptions, but that also doesn’t tell you much of anything. What do you do?

[Image: Stuck Pooh]

You, sir, are stuck! Or are you?

One of the easiest ways to tackle these challenges is to create “pseudo-observations” from your survival data. These pseudo-observations can be plugged into regular statistical models that don’t have a proportional hazards assumption. It’s a great way out of a tight spot.

What are pseudo-observations?

Pseudo-observations are computed values that show how each observation contributes to the value of some summary statistic across the entire data set. The summary statistic could be the value of the survival curve at a particular time (e.g., 365 days), the Restricted Mean Survival Time, or some other measure.

Pseudo-observations are computed using a “jackknife” procedure. The jackknife method calculates the value of a summary statistic for the entire data set. It then systematically re-computes the value over and over, leaving out one observation at a time, keeping track of how the statistic changes when each observation is left out. Formally, the i-th pseudo-observation is n times the full-sample estimate, minus (n - 1) times the estimate with observation i left out. That value becomes the pseudo-observation for observation i.
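
To make the mechanics concrete, here’s a toy sketch in R using a plain mean instead of a Kaplan-Meier-based statistic (the two survival times are made up for illustration):

    # Toy jackknife pseudo-observations for a plain mean (no censoring),
    # just to show the mechanics: pseudo_i = n * theta_hat - (n - 1) * theta_hat_minus_i
    survival_times <- c(50, 100)       # two customers: churn at 50 and 100 days
    n <- length(survival_times)
    theta_hat <- mean(survival_times)  # 75 days

    pseudo <- sapply(1:n, function(i) {
      n * theta_hat - (n - 1) * mean(survival_times[-i])
    })

    pseudo  # 50 and 100: with no censoring, the pseudos recover the raw values

With no censoring, the pseudo-observations simply recover the raw values; with censored data they no longer do, which is exactly where they earn their keep.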

Since this process is based on statistics derived from Kaplan-Meier curves, right-censoring is taken care of during pseudo-observation creation. This allows the pseudo-observations to be popped right into relatively vanilla statistical models.

As with many things we’ve discussed in this series, the math here isn’t hard. (See, for example, slides 15-19 of this presentation.) However, we won’t get into it here – it’s more important to see how to apply the methods to the business world.

Creating pseudo-observations

In R, pseudo-observations can be created using… wait for it… the “pseudo” package. Since we’ve covered Restricted Mean Survival Time (RMST) here in the past, we’ll go ahead and calculate pseudo-observations using RMST. The package is more versatile than that, though – it can also calculate pseudos using the value of the survival curve, the cumulative incidence function, or years lost – if you’re into that sort of thing.

For the purposes of this post, we’ll again be using data on our fictional guitar tab subscription service, NetLixx. The data set is the same as the one we used for the Cox regression post – it’s got survival data, along with gender, age, and whether or not the customer signed up using a coupon for one month of free service. You can download the data in CSV format here, or preview it below.

OK! Let’s create some pseudo-observations, using 365-day RMST!
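
Here’s a minimal sketch using the pseudo package’s pseudomean() function. The file name and the time/churned column names are assumptions for illustration, so adjust them to match the actual NetLixx data:

    library(pseudo)

    # Load the NetLixx data (file and column names assumed for this sketch)
    churn_data <- read.csv("netlixx.csv")

    # Pseudo-observations for the 365-day restricted mean survival time.
    # 'time' is days until churn or censoring; 'churned' is 1 if the
    # customer churned, 0 if they were still subscribed (censored).
    churn_data$pseudo <- pseudomean(time = churn_data$time,
                                    event = churn_data$churned,
                                    tmax = 365)

    head(churn_data)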

That wasn’t bad, was it?

Note – if you try to use the pseudo package on a data set involving more than 10,000 or so observations, you may notice that it gets very slow and eventually breaks. To get around this, you can use the appropriately-named “fastpseudo” package instead. This is my own personal implementation of the same algorithm. It generates the same results in a fraction of the time (and with a fraction of the system resources), but only supports RMST pseudos at the time of this writing. I’m working on implementing the other algorithms, and I’ll try to remember to update this post when I do.

Running the Analysis

Now that the pseudo-observations are created, let’s run some actual analysis! We’ll use the geese function from R’s geepack package to fit a Generalized Estimating Equations model. With the settings applied here, it’s basically a normal regression model, except that it’s relatively immune to certain statistical weirdness that can be introduced by the jackknife procedure. Here’s the code!
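
A sketch along those lines, continuing with the churn_data frame and the assumed column names from above:

    library(geepack)

    # geese() wants a cluster id; with one row per customer, each
    # customer is their own cluster
    churn_data$id <- 1:nrow(churn_data)

    # Identity link plus independence working correlation is essentially
    # ordinary regression, but with jackknife variance estimates that
    # play nicely with the pseudo-observations
    fit <- geese(pseudo ~ age + gender + coupon,
                 data = churn_data, id = id,
                 jack = TRUE, family = gaussian,
                 corstr = "independence", scale.fix = FALSE)
    summary(fit)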

And here are our results!

Each coefficient represents the effect of that variable on the 365-day RMST for this group of customers. So, customers who signed up with a coupon provide over 30 fewer days of revenue in their first year due to faster churn. Customers provide a fraction of a day of additional revenue for every year they are older… not much for a one-year difference, but that’ll add up when you’re comparing a 20-year-old to her dad. The coefficient on the gender variable is not significant in this data set.

As you might have noticed, these are basically the same results (directionally, at least) as we got with the Cox regression. However, there’s a big benefit here! In the Cox model, we had to throw out the coupon variable because it was violating assumptions and potentially ruining the validity of our statistics. In this model, we don’t have to worry about it! We’re good to go! (Of course, Cox regression is still more respected in the academic world… but who cares? We’re looking for meaningful business insight.)

To make things even better, the coefficients here can be directly related to revenue. For a $5 a month service, that 30 days of additional lifetime we get out of non-coupon customers is worth, well, $5. That’s powerful information – and it’s easy to understand to boot!
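
For a quick back-of-the-envelope version of that arithmetic (the -30 here is the approximate coupon coefficient described above, not an exact figure):

    # Convert an RMST coefficient (in days) into first-year revenue impact,
    # assuming a $5/month subscription
    monthly_price <- 5
    daily_revenue <- monthly_price * 12 / 365   # about $0.16 per day

    coupon_days <- -30                          # approximate coefficient from above
    coupon_days * daily_revenue                 # about -$4.93 per coupon signup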

Conclusion

This concludes our series of posts on analyzing customer churn, at least for now. Over the last 4 weeks, we’ve covered lots of great methods that can help you get a handle on customer churn. We’ve looked at:

  • Estimating survival rates over time with Kaplan-Meier estimators
  • Evaluating differences between survival curves with the log-rank test
  • Incorporating multiple explanatory variables with Cox regression
  • Quantifying lost revenue with Restricted Mean Survival Time
  • Calculating customer lifetime revenue using back of the envelope methods
  • Creating pseudo-observations for applying vanilla statistics

There are other ways to slice and dice survival analysis, but this toolkit should be very helpful to any analyst trying to understand customer churn. Good luck out there!

Side note: Pseudo-observations in survival analysis is a field that’s been largely pioneered and dominated by Danish biostatisticians… Though I’m not from Denmark personally, I can’t help but feel a slight kinship with these folks through my first name, ancestry, and (apparently) research interests.

5 Responses

  1. Mahi July 13, 2015 / 2:36 am

    Thanks very informative
    Cheers!!!
    Mahi

  2. Frank Sauvage October 26, 2015 / 4:50 pm

    Very insightful set of material. Many thanks for these 4 articles about churn analysis!
    Best wishes

  3. Emiel October 23, 2016 / 8:00 am

Thank you for all the informative articles about Survival Analysis! They have helped me a lot when writing my master’s thesis.

    Cheers!

  4. Toly June 1, 2017 / 12:13 pm

    Thanks for doing this series!

I am a bit confused about how the pseudo-observations are computed. You mentioned looking at the change in RMST when an observation is held out. How does this conceptually impute the value of the held-out observation? Is this similar to a linear model, where each observation has a predicted value based on regression weights? Any chance you could provide a simple example?

    Thanks!

    • daynebatten June 5, 2017 / 7:43 am

      Imagine I have a churn data set with 2 people. One churns at 50 days, the other at 100. Average survival time between them is 75 days. If I hold person #1 out, the average survival time goes up to 100 days. So, I can say that person #1 brings down the average survival time by 25 days.

      If I had a much bigger data set and did this with everybody, I could then do a vanilla regression to find out if there were particular characteristics that were generally associated with folks who increased/decreased the average survival time.

      Restricted Mean Survival Time complicates this a little bit, but conceptually you’re doing something very similar to this.
