Analyzing Customer Churn – Using “tmerge” to Build Data Sets

If you've ever done churn analysis using cox regression with time-dependent covariates, you know that the hardest part of doing that type of research is building your base data set. You have to divide each customer's lifetime into "chunks" where the changing values of a host of different predictor variables apply. I've coded this in SQL before, and it gets ugly. Fast.

Thankfully, R has a solution for this that makes life significantly more simple. It's the "tmerge" function, and it's in the survival package. Today, we'll walk through how to use it.

The data

If you'd like to follow along with this post, you can download csv files of the data I'm using. Essentially, there's three small sample data sets involving customers of a fake company that I'll use for illustration purposes... The first is a base customer data set on 3 customers. It includes their customer IDs and their gender, like this...

The second set of customer data contains lifetime data... how long did each customer survive before churning or being censored? This is simply a list of customer IDs, with a time-to-event variable, and a churn indicator.

Now, you'll notice that in each of these files, the customers have just one record covering their whole lifetime... so far, so good. A merge would be easy.

Suppose, however, we're interested in knowing if customers that have contacted support are more likely to churn. We have a more complicated data set that shows, over the duration of a customer's life, whether or not they've contacted support. This data looks something like this:

You'll see that customer 3 didn't contact support until day 100 of his life, so he has 2 different records... one for the first 100 days, then another to cover the rest of his life. You'll also notice that there's no start/stop variables for time. We only need a "start" variable, and tmerge will figure out the stop time by looking at the data.

OK, so let's get this data merged!

Merging it!

First things first. Let's load the survival package and the CSV files.

Our first call to the tmerge function will merge our gender and survival data, creating an "event" variable in the process. Tmerge will automatically create all of the time-related variables it needs to keep merging other things in.

We do this by specifying our two data sets to be merged, specifying the customer id, and telling tmerge that the "time" variable indicates the stop time for the observations (since it's the maximum time of the customer life). We'll also be creating a tmerge event variable as a function of time and our event binary. Like so:

Now the real magic happens! We call tmerge, but this time we pass it both the base data set and our support data set. We tell it the id variable again (present in both data sets), and we specify the formula for creating our new time-dependent covariate. We'll create a covariate called "contacted_support," which is simply an extrapolation of the time variable and the contacted_support variable from the original support data frame.

Boom! That's it! Here's our results:

You'll notice that the resulting data frame includes information from all three of our original data sets, and tmerge figured out that it needed to divide customer 3's life into 2 chunks to account for the changing values of the support variable. It also knew that the event variable only needed to apply to the last chunk of customer 3's life. Fantastic. If you wanted to merge additional data in, you could do so without any trouble... just specify another tmerge and it will keep adding it right on!

Odds and ends

Of course, if you look at the tmerge docs, you'll notice that the "tdc" and "event" parameters aren't the only ways to specify new variables in tmerge. In fact, as you may have guessed, "tdc" stands for "time-dependent covariate" and it's used for creating, well, time-dependent covariates. However, you can also use "cumtdc" to create cumulative time-dependent covariates (e.g., total number of support requests created over a customer's life). You can use "event" similarly to merge on various types of additional events (e.g., if you wanted to merge on service plan changes so you could build a competing-hazards model for churn and plan changes) or "cumevent" for a cumulative event counter. Additional info can be found in the docs for the survival package and in this vignette on time-dependent covariates.


If you're working on a large churn analysis project in R and need to use a lot of time-dependent covariates, this is the easiest way I know to build your data set. If you've got any thoughts or questions, be sure to follow up!

19 Responses

  1. Lauren May 4, 2016 / 1:33 am

    I’ll begin by saying I’ve found your posts very helpful 🙂

    I am tempted to use a survival-based analysis for our company’s churn modelling. However, from my experience, the data.frame gets massive when you have a large number of customers (100,000) and you’re also considering a large number of time-varying covariates (that can’t be extrapolated like age).

    Have you encountered this problem? Would you move to a distributed solution (are the risk sets able to be processed individually?).


    How do you deal with that?

    • daynebatten May 4, 2016 / 7:18 am


      Thanks for writing. Our company has a lot of customers as well, so we face this problem often. There are several different strategies you could take here. First, you could use a random sample of, say, 1000 customers for your survival analysis. You may even start with a small sample, figure out which variables are important, and then fit a model on a larger sample using only those few variables you identified with the small one. Second, you might consider doing different, independent, cox regressions with different variables in each one (just be a little more on the lookout for potentially spurious results). Third, Apache Spark can do accelerated failure time modeling on a cluster of computers. It’s not quite the same as cox regression, but it’s a very similar concept, and being able to parallelize it across a cluster should let you use really big data. See here:

      Honestly, though, I’d just stick with sampling your customers or sampling your variables…

        • daynebatten May 5, 2016 / 8:04 am

          Interesting… I was not aware of the MADlib project. It looks awesome, though! As you might have seen from some of my other posts, I’ve implemented a few statistical algorithms in SQL just because I got tired of having to export data all the time. But this takes it a step further!

        • Lauren May 9, 2016 / 6:29 pm

          Exporting data is tedious 🙂 I only became aware of the MADlib project as a consequence of our company recently implementing the Pivotal HAWQ (SQL over hadoop) framework. Haven’t used it yet, but will be interesting to see how it goes.

  2. Jen August 23, 2016 / 7:26 pm

    Thank you for the post. I found it very helpful.
    What do you do in the scenarios of left truncated data. In your example, you have tstart as 0 for each id. But in reality, different id may have different starting time. I tried to set tstart as 3 for the 2nd id. As a result, the contacted_support will be NA.

    • daynebatten August 30, 2016 / 9:44 am

      Good question. The tmerge function has a tstart parameter in addition to a tstop parameter. In tmerge_event, you would need to specify both a start time and an end time for each customer (customer 2 would have a start of 3). Then, in the initial call to tmerge, you’d need to specify both your start and stop parameters.

      Of course, you’d also need to set time in tmerge_support for customer 2 to also have a time of 3.

      Hope that helps?

  3. Ivo Fugers April 24, 2017 / 10:30 am

    Great posts, they’re helping me understanding using survival analysis for analyzing churn.

    In the result of tmerge in your example, customer 3 now has an observation going from t=100 to t=720 where he contacted support and had an event (churn). Can I assume that, in this example, the effect of contacting support lasts for the whole period (from t=100 untill t=720)?

    How would you approach it if you want to relate contacting support to churn only if it happened 3 periods before the churn? Would it make sense to create multiple chuncks (1 chunk for contacting support, 1 chunk for the churn event, 1 possible additional chunk for the time between those events).

    • daynebatten April 24, 2017 / 2:21 pm

      Great question. The size of the chunk you mention basically indicates that we’re going to allow the effect of contacting support to persist for the entire duration of the rest of a customer’s lifetime. In reality, this may be completely unrealistic – you’re probably far more likely to cancel shortly after a support request than you are to cancel 2 years after a support request.

      To account for this you could have a variable that is, say, “contacted support in last 30 days.” Then, after somebody contacts support, they have a 30-day “chunk” where that variable is true. After that, the variable goes back to being false unless they contact support again.

      If you wanted to get really fancy, you could do some sort of variable like “days since contacting support” or log of the same.

      Hope that helps!

  4. Veerle October 11, 2017 / 10:32 pm

    Thank you for this post! This made the tmerge much clearer for me.
    I still have a question though, that I can’t seem to find anywhere on the web: what if your time-dependent variable is a categorical one? Or in my case even ordinal. How can you implement this in the tmerge function?
    Thank you!

    • daynebatten October 12, 2017 / 9:04 am

      I’m not sure why that would matter? In fact, “contacted support,” which is the variable I use as the example in this post is theoretically a categorical variable since it’s just a dummy. But I would expect tmerge to handle even categorical variables with strings the same as anything else.

      Have you run into a specific problem?

  5. Veerle October 12, 2017 / 8:04 pm

    Thank you for your reply. I haven’t ran into a specific problem yet, I just struggle to understand how R would implement this, since than ‘contacted support’ could get the value 0,1,2,3, and in my case its ordinal, so never, rarely, sometimes, often, does R understand that if an interval ends whit a 2 that this should get more weight compared to a 1 if I run a Cox model with this variable as time-varying?
    Thanks again, and sorry if this might not be the right place to ask this.

    • daynebatten October 13, 2017 / 7:42 am

      Ah, so this isn’t so much a question about tmerge as it is about using a categorical/ordinal variable in a cox model in R.

      Using a categorical or ordinal variable in a cox model should really be the same as using it in a standard OLS model (i.e., ‘lm’ in R). I’d recommend just using the variable as a factor (as.factor(variable)) in the model. The model will fit the best fit for the effect of each level of the variable, so you don’t need to explicitly tell it that one level is ordinally higher than another – if there’s a relationship, the model will find it.

  6. Naomi January 29, 2018 / 8:23 pm

    Hi! I know this is an old post but I’m looking for some help using tmerge. I wonder if I can pick your brain?
    Thank you!

    • daynebatten January 30, 2018 / 11:19 am

      I’m not necessarily an expert (and it’s been awhile) but I’ll be happy to help if I can. What’s up?

  7. habsi October 28, 2018 / 6:03 am

    very helpful thanks ,
    what about the time dependent variables that are only dependent on time but not on every customer ?

    • daynebatten April 19, 2019 / 8:03 am

      Good question. If I’m following the question, you’d still have to turn a variable that’s not customer-specific into a customer-specific time dependent variable, because each customer has a different time scale. So, for example, if you had an outage in March 2019, that hits at a different point in the lifetime of a customer that activated in January 2019 vs. one that activated in January 2018. So, the outage isn’t customer-specific, but its interaction with each customer’s lifetime is customer-specific. Does that help?

  8. John Kolassa February 25, 2020 / 8:31 am

    Thanks. This was very helpful. I now understand what this function does. I read a series of other documents, and couldn’t make head or tail of it until I hit your post.

    Is there a slick way to use tmerge to start from a single data frame, with your customer and event data frames merged by ID, and an additional column with support time 100 for subject 3, and NA in this column for other participants? I’m trying to build an example for a class I’m teaching based on the Emerson leukemia data set in Case Studies in Biometry,
    at . There’s a column column for last followup and an indicator for death or censoring, and another column for bone marrow tranplant time, and an indicator for whether the transplant occurred. I want to estimate the time-dependent effect of transplant. If I follow your model, I should first build a data frame analogous to your support, with the time 0 for everyone in the data set, and indicator zero, an then rbind it to another data set for those who had the transplant, with transplant time and indicator 1. Is this the best way to do it? Any suggestions would be appreciated.

Leave a Reply

Your email address will not be published. Required fields are marked *