If you've ever done churn analysis using cox regression with time-dependent covariates, you know that the hardest part of doing that type of research is building your base data set. You have to divide each customer's lifetime into "chunks" where the changing values of a host of different predictor variables apply. I've coded this in SQL before, and it gets ugly. Fast.
If you'd like to follow along with this post, you can download csv files of the data I'm using. Essentially, there's three small sample data sets involving customers of a fake company that I'll use for illustration purposes... The first is a base customer data set on 3 customers. It includes their customer IDs and their gender, like this...
The second set of customer data contains lifetime data... how long did each customer survive before churning or being censored? This is simply a list of customer IDs, with a time-to-event variable, and a churn indicator.
Now, you'll notice that in each of these files, the customers have just one record covering their whole lifetime... so far, so good. A merge would be easy.
Suppose, however, we're interested in knowing if customers that have contacted support are more likely to churn. We have a more complicated data set that shows, over the duration of a customer's life, whether or not they've contacted support. This data looks something like this:
You'll see that customer 3 didn't contact support until day 100 of his life, so he has 2 different records... one for the first 100 days, then another to cover the rest of his life. You'll also notice that there's no start/stop variables for time. We only need a "start" variable, and tmerge will figure out the stop time by looking at the data.
OK, so let's get this data merged!
First things first. Let's load the survival package and the CSV files.
Our first call to the tmerge function will merge our gender and survival data, creating an "event" variable in the process. Tmerge will automatically create all of the time-related variables it needs to keep merging other things in.
We do this by specifying our two data sets to be merged, specifying the customer id, and telling tmerge that the "time" variable indicates the stop time for the observations (since it's the maximum time of the customer life). We'll also be creating a tmerge event variable as a function of time and our event binary. Like so:
Now the real magic happens! We call tmerge, but this time we pass it both the base data set and our support data set. We tell it the id variable again (present in both data sets), and we specify the formula for creating our new time-dependent covariate. We'll create a covariate called "contacted_support," which is simply an extrapolation of the time variable and the contacted_support variable from the original support data frame.
Boom! That's it! Here's our results:
You'll notice that the resulting data frame includes information from all three of our original data sets, and tmerge figured out that it needed to divide customer 3's life into 2 chunks to account for the changing values of the support variable. It also knew that the event variable only needed to apply to the last chunk of customer 3's life. Fantastic. If you wanted to merge additional data in, you could do so without any trouble... just specify another tmerge and it will keep adding it right on!
Odds and ends
Of course, if you look at the tmerge docs, you'll notice that the "tdc" and "event" parameters aren't the only ways to specify new variables in tmerge. In fact, as you may have guessed, "tdc" stands for "time-dependent covariate" and it's used for creating, well, time-dependent covariates. However, you can also use "cumtdc" to create cumulative time-dependent covariates (e.g., total number of support requests created over a customer's life). You can use "event" similarly to merge on various types of additional events (e.g., if you wanted to merge on service plan changes so you could build a competing-hazards model for churn and plan changes) or "cumevent" for a cumulative event counter. Additional info can be found in the docs for the survival package and in this vignette on time-dependent covariates.
If you're working on a large churn analysis project in R and need to use a lot of time-dependent covariates, this is the easiest way I know to build your data set. If you've got any thoughts or questions, be sure to follow up!