Every survival analysis method I've talked about so far in this series has had one thing in common: we've only looked at one event in a customer lifetime (churn). In many cases, that's a perfectly fine way to go about things... we want our customers to stick with us, so churn is the event of interest. So why would we ever need to think about competing risks?
You know, competing risks. Will you die by tornado, or by shark?
There's actually a critical assumption undergirding most survival analysis methods for right-censored data - that censored individuals have the same likelihood of experiencing the event of interest as individuals that never got censored. If this assumption ever gets violated, things like Kaplan-Meier estimators can become wildly inaccurate. (If you need a refresher on Kaplan-Meier curves and other concepts, take a look at my earlier post on basic survival analysis.)
There are numerous strategies for dealing with non-proportional hazards in cox regression analysis. You can stratify your data based on a problematic variable. You can chuck the cox model and create "pseudo-observations" to analyze the gains (or losses) in lifetime within a certain period associated with changes in a variable. If age is a problem (unlikely for customer churn, but it happens a lot in medical contexts), you can use age rather than time in the cohort as your time scale. The list goes on.
But this is statistics! We're supposed to be modeling things!
Statistics. Almost as cool as Sparta.
Well, as it turns out, it's actually possible to directly model how the effects of a variable change with time, so that you can not only handle the proportional hazards problem, but also get a reliable estimate of how hazard ratios for a given variable change with time. The way to do this is actually incredibly simple... we introduce an interaction term between the variable of interest and time. Let's get started!
My previous series of guides on survival analysis and customer churn has become by far the most popular content on this blog, so I'm coming back around to introduce some more advanced techniques...
When you're using cox regression to model customer churn, you're often interested in the effects of variables that change throughout a customer's lifetime. For instance, you might be interested in knowing how many times that customer has contacted support, how many times they've logged in during the last 30 days, or what web browser(s) they use. If you have, say, 3 years of historical customer data and you set up a cox regression on that data using covariate values that are applicable to customers right now, you'll essentially be regressing customer's churn hazards from months or years ago on their current characteristics. Your model will be allowing the future to predict the past. Not terribly defensible.
In the classic double-slit experiment, past events are seemingly affected by current conditions. But unless you're a quantum physicist or Marty McFly, you're probably not going to see causality working this way.
In this post, we'll walk through how to set up a cox regression using "time-dependent covariates," which will allow us to model historical hazard rates on variables whose values were applicable at the time.
My company has a subscription-based business model, which means we spend a lot of time analyzing customer churn. We wanted to include Kaplan-Meier survival curves in some of our executive dashboards, but neither our database (Redshift) nor any of our commonly used dashboarding tools (Tableau, Periscope, etc.) provided the necessary functionality. We could, of course, have pulled data out of the warehouse, analyzed it in R or Python, and pushed it back up, but that's pretty complicated. So we went looking for a better solution.
Based on all available evidence, the survival curve for aliens singing disco music hits 0 at about 43 seconds (N = 1).
As you likely guessed from the title of this post, that better solution involved writing our own code for calculating a Kaplan-Meier estimator in SQL. In this post, I'll be walking through our strategy step-by-step, including the SQL code for calculating the estimators and making that code reusable in Periscope. Let's do this!
In a past post on analyzing churn in the subscription or Software as a Service business, I talked about two different ways to quantify the dollar cost of churn. You could use 1 / churn as an estimation of mean customer lifetime (though this simple method makes a lot of assumptions). Or, you could use “pseudo-observations” to calculate the dollar value of certain groups of customers during a particular time period (which doesn’t let you quantify the full lifetime value of a customer).
But what if there was another way? What if we took our Kaplan-Meier best estimate of our churn curve, fit a linear model to that model, and then projected it out?
A model within a model, if you will. Churnception.
Well, as it turns out, we’d get a reasonable estimation of our lifetime churn curve, which would let us estimate average customer lifetime, and customer lifetime value. Let’s get started.
One of the best ways to learn how a statistical model really works is to code the underlying math for it yourself. Today, we’re going to do that with simple linear regression.
In the book Data Smart, John Foreman introduces a bunch of awesome methodologies by walking you through how to build them in Excel…
Of course, doing regression in SQL also has (some) practical use as well! For example, suppose you wanted to identify which city in a database of temperature records had the biggest warming trend in the last month. This method would send you on your way without having to bring your data into an external tool. Nifty!
Have you ever taken a look at the “probability of outperforming” metric in Google Analytics’ Content Experiments and wondered how it was calculated? Have you ever scratched your head because the numbers didn’t make sense to you? I certainly have. It’s hard to see experiment results like the ones depicted below and not wonder what’s going on underneath the hood.
Real data from a GA content experiment, showing an under-performing variant with a >50% chance of outperforming the original. It’s like trash-talking when you’re down at the half.
In this post, we’ll highlight how Google’s Content Experiments work, why it’s a really smart idea, and why you might still want to do a little bit of the heavy lifting yourself…
Naive Bayes is an extraordinarily diverse algorithm for categorizing things… it can separate fraudulent from non-fraudulent credit card purchases, spam from legitimate email, or dinosaurs from fictional monsters. It’s most frequent application is in “bag of words” models… statistical models that classify a block of text based on the words that appear within it.
This is not a dinosaur…
In this post, we’ll explore the basics of Naive Bayes classification, and we’ll illustrate using a relatively common problem – assigning genders to a group of people when all you have is their names. And, just for kicks and giggles, we’re going to do it in SQL. Because nobody trains models in SQL.
Imagine for a moment that you’ve pulled together the mother of all churn data sets. You’ve got customer lifetime data, demographic data, and usage information. You know how many support tickets your customers have submitted, what those tickets were about, and whether they were happy with the customer service they received. You know what some of your customers had for breakfast this morning. OK, maybe not the breakfast thing. But it’s a lot of data.
Excited about your work, you pop all of this into a cox regression model, and the proportional hazards test blows up. Majorly. You take most of the variables out of the model, and you’re left with some analysis that doesn’t violate any key assumptions, but that also doesn’t tell you much of anything. What do you do?
You, sir, are stuck! Or are you?
One of the easiest ways to tackle these challenges is to create “pseudo-observations” from your survival data. These pseudo-observations can be plugged into regular statistical models that don’t have a proportional hazards assumption. It’s a great way out of a tight spot.
Sometimes, taking action to reduce customer churn costs money. In those instances, it can be helpful to know how much revenue churn is costing you… and how much of it you could recapture. Lucky for us, there’s a stat for that! It’s called “Restricted Mean Survival Time,” and it allows us to easily quantify the monetary impact of changes in customer churn. Let’s think about putting it to use!