Analyzing Customer Churn – Using “tmerge” to Build Data Sets

If you've ever done churn analysis using Cox regression with time-dependent covariates, you know that the hardest part of that type of research is building your base data set. You have to divide each customer's lifetime into "chunks" where the changing values of a host of different predictor variables apply. I've coded this in SQL before, and it gets ugly. Fast.

Merge, everybody, merge!

Thankfully, R has a solution for this that makes life significantly simpler. It's the "tmerge" function, and it's in the survival package. Today, we'll walk through how to use it.

The data

If you'd like to follow along with this post, you can download CSV files of the data I'm using. Essentially, there are three small sample data sets involving customers of a fake company that I'll use for illustration purposes... The first is a base customer data set on 3 customers. It includes their customer IDs and their gender, like this...
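If you'd rather not mess with the downloads, here's a stand-in version you can build by hand (the genders for customers 1 and 2 are made up; I'll call the data frame tmerge_gender to match the code below):

    # Illustrative stand-in for the base customer data (the real values are in the CSV)
    tmerge_gender <- data.frame(customer_id = c(1, 2, 3),
                                gender = c("F", "F", "M"))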

The second set of customer data contains lifetime data... how long did each customer survive before churning or being censored? This is simply a list of customer IDs with a time-to-event variable and a churn indicator.
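Again, a made-up stand-in with the same shape (customer 3's 720-day lifetime and churn are the case we'll trace through below; the other two rows are invented):

    # Illustrative stand-in for the lifetime data: time-to-event plus a churn indicator
    tmerge_event <- data.frame(customer_id = c(1, 2, 3),
                               time = c(365, 400, 720),  # days until churn or censoring
                               churned = c(0, 1, 1))     # 1 = churned, 0 = censored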

Now, you'll notice that in each of these files, the customers have just one record covering their whole lifetime... so far, so good. A merge would be easy.

Suppose, however, we're interested in knowing if customers that have contacted support are more likely to churn. We have a more complicated data set that shows, over the duration of a customer's life, whether or not they've contacted support. This data looks something like this:
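Here's a stand-in with the same shape (customers 1 and 2 are invented and never contact support; customer 3's day-100 contact is the one described below):

    # Illustrative stand-in for the support data: each row gives the value of
    # contacted_support from "time" onward
    tmerge_support <- data.frame(customer_id = c(1, 2, 3, 3),
                                 time = c(0, 0, 0, 100),
                                 contacted_support = c(0, 0, 0, 1))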

You'll see that customer 3 didn't contact support until day 100 of his life, so he has 2 different records... one for the first 100 days, then another to cover the rest of his life. You'll also notice that there are no start/stop variables for time. We only need a "start" variable, and tmerge will figure out the stop time by looking at the data.

OK, so let's get this data merged!

Merging it!

First things first. Let's load the survival package and the CSV files.
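I'm guessing at the file names here (use whatever you saved the downloads as, or just use the stand-in data frames sketched above):

    # Load the survival package (tmerge lives here)
    library(survival)

    # Read in the three sample data sets
    tmerge_gender  <- read.csv("tmerge_gender.csv")
    tmerge_event   <- read.csv("tmerge_event.csv")
    tmerge_support <- read.csv("tmerge_support.csv")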

Our first call to the tmerge function will merge our gender and survival data, creating an "event" variable in the process. Tmerge will automatically create all of the time-related variables it needs to keep merging other things in.

We do this by specifying our two data sets to be merged, specifying the customer id, and telling tmerge that the "time" variable indicates the stop time for the observations (since it's the maximum time of the customer's life). We'll also be creating a tmerge event variable as a function of time and our event binary. Like so:
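Concretely (assuming the columns are named customer_id, time, and churned; adjust to whatever yours are called):

    # Initial merge: baseline gender data + lifetime data.
    # tstop = time marks the end of each customer's follow-up, and
    # churn = event(time, churned) flags the churn event at that time.
    base <- tmerge(data1 = tmerge_gender, data2 = tmerge_event,
                   id = customer_id,
                   tstop = time,
                   churn = event(time, churned))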

Now the real magic happens! We call tmerge, but this time we pass it both the base data set and our support data set. We tell it the id variable again (present in both data sets), and we specify the formula for creating our new time-dependent covariate. We'll create a covariate called "contacted_support," which is built from the time variable and the contacted_support variable in the original support data frame.
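Assuming the support file's columns are named time and contacted_support, that's:

    # Second merge: add the time-dependent covariate from the support data.
    # tdc(time, contacted_support) means "from each 'time' onward, the covariate
    # takes the corresponding 'contacted_support' value."
    final <- tmerge(data1 = base, data2 = tmerge_support,
                    id = customer_id,
                    contacted_support = tdc(time, contacted_support))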

Boom! That's it! Here are our results:
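With the stand-in data above, printing final gives something along these lines (the rows for customers 1 and 2 depend entirely on the made-up values; customer 3's two chunks are the interesting part):

    final
    #   customer_id gender tstart tstop churn contacted_support
    # 1           1      F      0   365     0                 0
    # 2           2      F      0   400     1                 0
    # 3           3      M      0   100     0                 0
    # 4           3      M    100   720     1                 1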

You'll notice that the resulting data frame includes information from all three of our original data sets, and tmerge figured out that it needed to divide customer 3's life into 2 chunks to account for the changing values of the support variable. It also knew that the event variable only needed to apply to the last chunk of customer 3's life. Fantastic. If you wanted to merge additional data in, you could do so without any trouble... just specify another tmerge and it will keep adding it right on!

Odds and ends

Of course, if you look at the tmerge docs, you'll notice that the "tdc" and "event" parameters aren't the only ways to specify new variables in tmerge. In fact, as you may have guessed, "tdc" stands for "time-dependent covariate" and it's used for creating, well, time-dependent covariates. However, you can also use "cumtdc" to create cumulative time-dependent covariates (e.g., total number of support requests created over a customer's life). You can use "event" similarly to merge on various types of additional events (e.g., if you wanted to merge on service plan changes so you could build a competing-hazards model for churn and plan changes) or "cumevent" for a cumulative event counter. Additional info can be found in the docs for the survival package and in its vignette on time-dependent covariates.
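For example, with a hypothetical support_tickets data frame holding one row per ticket and a ticket_time column, a running ticket count would look something like this:

    # Hypothetical: one row per support ticket, with the day it was opened.
    # cumtdc(ticket_time) builds a running count of tickets over each customer's life.
    final <- tmerge(final, support_tickets, id = customer_id,
                    n_tickets = cumtdc(ticket_time))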

Conclusion

If you're working on a large churn analysis project in R and need to use a lot of time-dependent covariates, this is the easiest way I know to build your data set. If you've got any thoughts or questions, be sure to follow up!

10 Responses

  1. Lauren May 4, 2016 / 1:33 am

    I’ll begin by saying I’ve found your posts very helpful 🙂

    I am tempted to use a survival-based analysis for our company’s churn modelling. However, from my experience, the data.frame gets massive when you have a large number of customers (100,000) and you’re also considering a large number of time-varying covariates (that can’t be extrapolated like age).

    Have you encountered this problem? How do you deal with that? Would you move to a distributed solution (are the risk sets able to be processed individually)?

    Cheers

    • daynebatten May 4, 2016 / 7:18 am

      Lauren,

      Thanks for writing. Our company has a lot of customers as well, so we face this problem often. There are several different strategies you could take here. First, you could use a random sample of, say, 1000 customers for your survival analysis. You might even start with a small sample, figure out which variables are important, and then fit a model on a larger sample using only those few variables you identified with the small one. Second, you could run separate, independent Cox regressions with different variables in each one (just be a little more on the lookout for potentially spurious results). Third, Apache Spark can do accelerated failure time modeling on a cluster of computers. It's not quite the same as Cox regression, but it's a very similar concept, and being able to parallelize it across a cluster should let you use really big data. See here: https://spark.apache.org/docs/latest/ml-classification-regression.html#survival-regression

      Honestly, though, I’d just stick with sampling your customers or sampling your variables…

        • daynebatten May 5, 2016 / 8:04 am

          Interesting… I was not aware of the MADlib project. It looks awesome, though! As you might have seen from some of my other posts, I’ve implemented a few statistical algorithms in SQL just because I got tired of having to export data all the time. But this takes it a step further!

        • Lauren May 9, 2016 / 6:29 pm

          Exporting data is tedious 🙂 I only became aware of the MADlib project as a consequence of our company recently implementing the Pivotal HAWQ (SQL over hadoop) framework. Haven’t used it yet, but will be interesting to see how it goes.

  2. Jen August 23, 2016 / 7:26 pm

    Thank you for the post. I found it very helpful.
    What do you do in the scenario of left-truncated data? In your example, you have tstart as 0 for each id. But in reality, different ids may have different starting times. I tried to set tstart as 3 for the 2nd id. As a result, contacted_support comes out as NA.

    • daynebatten August 30, 2016 / 9:44 am

      Good question. The tmerge function has a tstart parameter in addition to a tstop parameter. In tmerge_event, you would need to specify both a start time and an end time for each customer (customer 2 would have a start of 3). Then, in the initial call to tmerge, you’d need to specify both your start and stop parameters.

      Of course, you’d also need to set the time variable in tmerge_support so that customer 2’s first record has a time of 3.
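      Something like this sketch, assuming tmerge_event picks up a hypothetical start_time column (0 for customers 1 and 3, 3 for customer 2):

          # Left-truncated version of the initial call: supply both tstart and tstop
          base <- tmerge(tmerge_gender, tmerge_event, id = customer_id,
                         tstart = start_time, tstop = time,
                         churn = event(time, churned))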

      Hope that helps?

  3. Ivo Fugers April 24, 2017 / 10:30 am

    Great posts, they’re helping me understand how to use survival analysis for analyzing churn.

    In the result of tmerge in your example, customer 3 now has an observation going from t=100 to t=720 where he contacted support and had an event (churn). Can I assume that, in this example, the effect of contacting support lasts for the whole period (from t=100 until t=720)?

    How would you approach it if you want to relate contacting support to churn only if it happened 3 periods before the churn? Would it make sense to create multiple chunks (1 chunk for contacting support, 1 chunk for the churn event, and 1 possible additional chunk for the time between those events)?

    • daynebatten April 24, 2017 / 2:21 pm

      Great question. The size of the chunk you mention basically indicates that we’re going to allow the effect of contacting support to persist for the entire duration of the rest of a customer’s lifetime. In reality, this may be completely unrealistic – you’re probably far more likely to cancel shortly after a support request than you are to cancel 2 years after a support request.

      To account for this you could have a variable that is, say, “contacted support in last 30 days.” Then, after somebody contacts support, they have a 30-day “chunk” where that variable is true. After that, the variable goes back to being false unless they contact support again.
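      One way to build that with tmerge is to turn each support contact into an "on" row plus an "off" row 30 days later and feed both to tdc(). A rough sketch, assuming a hypothetical contacts data frame with one row per support contact (customer_id, contact_time) and contacts spaced more than 30 days apart:

          # Build on/off switch rows for the 30-day window
          on  <- data.frame(customer_id = contacts$customer_id,
                            time = contacts$contact_time, flag = 1)
          off <- data.frame(customer_id = contacts$customer_id,
                            time = contacts$contact_time + 30, flag = 0)
          switches <- rbind(on, off)

          # "Contacted support in the last 30 days" as a time-dependent covariate.
          # Customers with no contact rows will come out NA here; you'd want to
          # recode those to 0 afterwards.
          final <- tmerge(final, switches, id = customer_id,
                          support_last_30 = tdc(time, flag))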

      If you wanted to get really fancy, you could do some sort of variable like “days since contacting support” or log of the same.

      Hope that helps!
