Often, complex problems can be readily solved by simple rules that provide an acceptable solution, even if they don’t necessarily get you to the optimal point. The cell suppression problem is a perfect example of this – using a methodology that would be readily apparent to any human faced with tackling the problem with pen and paper, we can create a computerized solution that can appropriately suppress data sets containing tens of thousands of records disaggregated over dozens of dimensions. This heuristic method will likely suppress more data than it really needs to, but when all is said and done, it will finish the job quickly and without completely mangling your statistics.
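To make the pen-and-paper idea concrete, here is a toy sketch of one common rule of thumb: after a sensitive cell is hidden, also hide one more cell in its row and one in its column ("complementary suppression"), so published totals can't be used to back the hidden value out. The table, the greedy smallest-neighbor rule, and the function name are all illustrative assumptions; a full implementation would iterate until every suppressed cell has a partner in every dimension, which is exactly why this heuristic over-suppresses.

```python
# A sketch of greedy complementary suppression (illustrative, not a standard).

def complementary_suppress(table, primary):
    """table: 2D list of cell values; primary: set of (row, col) already
    suppressed. Returns the full set of suppressed coordinates."""
    suppressed = set(primary)
    for r, c in primary:
        # candidate complements: unsuppressed cells sharing a row or column
        row = [(table[r][j], (r, j)) for j in range(len(table[r]))
               if (r, j) not in suppressed]
        col = [(table[i][c], (i, c)) for i in range(len(table))
               if (i, c) not in suppressed]
        for candidates in (row, col):
            if candidates:
                suppressed.add(min(candidates)[1])  # smallest value hides least info
    return suppressed

table = [[1, 40, 25],
         [30, 6, 50],
         [20, 35, 45]]
print(sorted(complementary_suppress(table, {(0, 0)})))
```

Note how hiding one cell forced two more cells out of the table; run the rule to completion over a big table and you can see where the extra suppression comes from.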
Heuristics are kind of like Fermi estimation. Or, more accurately, I needed an image for this post and this was the best thing I could come up with.
Image credit: XKCD
We’ll start with an explanation of the basic idea, then move on to implementing it in code.
The “cell suppression problem” is an overarching term for situations where a researcher must hide certain values in tabular reports in order to protect sensitive personal (or otherwise protected) information. For instance, suppose Wayout County, Alaska has only one resident with a PhD – we’ll call her “Jane.” An economist comes in to study the value of higher education in rural areas, and publishes a list of average salaries disaggregated by county and level of education. Whoops! The average salary for people with PhDs in Wayout County is just Jane’s salary. That researcher has just disclosed Jane’s personal information to the world, and anybody who happens to know her now knows how much money she makes. “Suppressing” or hiding the value of that cell in the report table would have saved a lot of trouble!
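The first line of defense against the Jane problem is "primary suppression": blank out any cell built from too few people. A minimal sketch, with a made-up table layout and a made-up threshold of three:

```python
# Primary suppression: hide any cell whose underlying count falls below a
# disclosure threshold. Layout and threshold are illustrative assumptions.

THRESHOLD = 3  # cells built from fewer than 3 people get suppressed

def suppress(cells):
    """cells: dict mapping (county, education) -> (count, avg_salary).
    Returns the same mapping with sensitive averages replaced by None."""
    return {
        key: (count, avg if count >= THRESHOLD else None)
        for key, (count, avg) in cells.items()
    }

table = {
    ("Wayout", "PhD"):       (1, 87000),   # only Jane -- must be hidden
    ("Wayout", "Bachelors"): (42, 51000),
}
print(suppress(table)[("Wayout", "PhD")])  # -> (1, None): Jane is protected
```

Primary suppression alone isn't enough, though: if the report also publishes row and column totals, a reader can often subtract their way back to the hidden value, which is where the rest of the problem (and the heuristic above) comes in.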
No, not that kind of suppression.
Over the next couple weeks, I’ll be blogging about some algorithms used to solve the cell suppression problem, and showing how to implement them in code. For now, we’re going to start with an introduction to the intricacies of the problem.
Last weekend, I decided to build a bed. I looked up some plans online, made some modifications, drew up a list of the lengths and sizes of lumber I needed, and went to the store to buy lumber. That’s when the trouble started. The Lowe’s near me sells most of the wood I needed in 6ft, 8ft, 10ft, and 12ft lengths, with different prices. And I needed a weird mix of cuts – ranging from only 10 or 11 inches up to 5 feet. How was I supposed to know which lengths to buy, or how many boards I needed?
Of course, I could have just planned out my cuts on a sheet of paper, gotten close to something that looked reasonable, and called it a day. But I figured there had to be a better way. Turns out, there is, and there’s a huge body of academic literature on the subject. The problem I was facing was simply an expanded version of the classic “cutting stock problem.” It’s a basic integer linear programming problem that can be solved pretty easily by commercial optimization software. So, I decided to try out some optimizations!
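Before reaching for an integer-programming solver, it's worth seeing what the pencil-and-paper approach looks like in code. Here's a first-fit-decreasing sketch that packs cuts into boards of a single stock length, opening a new board whenever a cut won't fit; the cut list, board length, and kerf allowance are invented, and a real solver would also mix stock lengths and minimize total cost rather than board count:

```python
# First-fit-decreasing heuristic for a simplified cutting stock problem.

KERF = 0.125  # saw blade width in inches, lost on every cut (assumption)

def first_fit_decreasing(cuts, board_length):
    """Greedily assign cuts (in inches) to boards; returns a list of boards,
    each board being the list of cuts made from it."""
    boards = []     # cuts assigned to each board
    remaining = []  # usable length left on each board
    for cut in sorted(cuts, reverse=True):
        for i, room in enumerate(remaining):
            if cut + KERF <= room:
                boards[i].append(cut)
                remaining[i] -= cut + KERF
                break
        else:  # no existing board had room: buy another one
            boards.append([cut])
            remaining.append(board_length - cut - KERF)
    return boards

cuts = [60, 60, 36, 36, 11, 11, 10, 10]    # a bed-frame-ish cut list, in inches
print(len(first_fit_decreasing(cuts, 96)))  # 8 ft boards needed -> 3
```

First-fit-decreasing is provably close to optimal for bin packing, which is why the "plan it on paper" approach usually gets you within a board or two of the true minimum.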
Naive Bayes is an extraordinarily versatile algorithm for categorizing things… it can separate fraudulent from non-fraudulent credit card purchases, spam from legitimate email, or dinosaurs from fictional monsters. Its most frequent application is in “bag of words” models… statistical models that classify a block of text based on the words that appear within it.
This is not a dinosaur…
In this post, we’ll explore the basics of Naive Bayes classification, and we’ll illustrate using a relatively common problem – assigning genders to a group of people when all you have is their names. And, just for kicks and giggles, we’re going to do it in SQL. Because nobody trains models in SQL.
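The full post works in SQL, but as a preview of the underlying math, here's a toy Python sketch of the same idea: score each gender by P(gender) × P(feature | gender), using the name's last letter as the single bag-of-words-style feature. The six training names and the Laplace smoothing constant are invented for illustration:

```python
# A toy naive Bayes classifier: gender from a name's last letter.
import math
from collections import Counter

train = [("olivia", "F"), ("emma", "F"), ("sophia", "F"),
         ("liam", "M"), ("noah", "M"), ("ethan", "M")]

prior = Counter(g for _, g in train)                 # P(gender)
feature = Counter((g, name[-1]) for name, g in train)  # counts for P(letter | gender)

def classify(name, alpha=1.0):
    """Return the gender with the higher log naive Bayes score.
    alpha is Laplace smoothing so unseen letters don't zero out a class."""
    def score(g):
        return (math.log(prior[g] / len(train))
                + math.log((feature[(g, name[-1])] + alpha)
                           / (prior[g] + alpha * 26)))
    return max(prior, key=score)

print(classify("amelia"))  # -> F (ends in 'a', like the F training names)
```

Everything here is counting and multiplying probabilities, which is exactly why the technique translates to SQL so naturally: `GROUP BY` gives you the counts, and a join plus a `SUM` of log-probabilities gives you the scores.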
Looking for a free geocoding solution and frustrated by the lack of options?
Though it’s poorly publicized (and documented), the US Census Bureau maintains a RESTful API for real-time and batch geocoding that’s free, fast, and accurate. It doesn’t even require an API key. Essentially, it’s everything you’d get out of a roll-your-own PostGIS Census/TIGER solution, without the hassle of having to set it up.
The only limitation is that the API mainly provides geocoding for residential addresses. Because the Census Bureau’s primary task is, well, conducting the national census, they’re a lot more interested in where people live than where they work. So, if you need to map business addresses, you may have to go elsewhere. But if you’re mapping customers (or students, or volunteers, or mailing list recipients, or whatever), the service is pretty on-point. In my work, I’ve seen about a 90% match rate for customer addresses when submitted to Census. Not too shabby.
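Here's a minimal sketch of a single-address lookup using only the standard library. The endpoint, `benchmark` parameter, and response field names follow the Bureau's published documentation at the time of writing, but double-check them before relying on this:

```python
# Build and issue a one-line address query against the Census geocoder.
import json
import urllib.parse
import urllib.request

BASE = "https://geocoding.geo.census.gov/geocoder/locations/onelineaddress"

def build_url(address):
    params = {"address": address,
              "benchmark": "Public_AR_Current",  # current address ranges
              "format": "json"}
    return BASE + "?" + urllib.parse.urlencode(params)

def geocode(address):
    """Return (longitude, latitude) for the first match, or None."""
    with urllib.request.urlopen(build_url(address)) as resp:
        matches = json.load(resp)["result"]["addressMatches"]
    if not matches:
        return None
    coords = matches[0]["coordinates"]
    return coords["x"], coords["y"]

print(build_url("1600 Pennsylvania Ave NW, Washington, DC"))
```

For large customer lists, the service also offers a batch endpoint that accepts a CSV upload, which is much faster than looping over one-line requests.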
Imagine for a moment that you’ve pulled together the mother of all churn data sets. You’ve got customer lifetime data, demographic data, and usage information. You know how many support tickets your customers have submitted, what those tickets were about, and whether they were happy with the customer service they received. You know what some of your customers had for breakfast this morning. OK, maybe not the breakfast thing. But it’s a lot of data.
Excited about your work, you pop all of this into a Cox regression model, and the proportional hazards test blows up. Majorly. You take most of the variables out of the model, and you’re left with an analysis that doesn’t violate any key assumptions, but that also doesn’t tell you much of anything. What do you do?
You, sir, are stuck! Or are you?
One of the easiest ways to tackle these challenges is to create “pseudo-observations” from your survival data. These pseudo-observations can be plugged into regular statistical models that don’t have a proportional hazards assumption. It’s a great way out of a tight spot.
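The construction is a jackknife: for survival at a fixed time t, each customer i gets the pseudo-observation n·S(t) − (n−1)·S₋ᵢ(t), where S is the Kaplan-Meier estimate and S₋ᵢ is the same estimate with customer i left out. A bare-bones sketch with invented data (and a deliberately naive leave-one-out loop; real packages compute this far more efficiently):

```python
# Jackknife pseudo-observations for the survival probability at time t.

def km_survival(data, t):
    """Kaplan-Meier S(t) for [(time, event)] pairs; event=1 means churned."""
    s = 1.0
    for time in sorted({tm for tm, e in data if e and tm <= t}):
        at_risk = sum(1 for tm, _ in data if tm >= time)
        churned = sum(1 for tm, e in data if e and tm == time)
        s *= 1 - churned / at_risk
    return s

def pseudo_observations(data, t):
    n = len(data)
    full = km_survival(data, t)
    return [n * full - (n - 1) * km_survival(data[:i] + data[i + 1:], t)
            for i in range(n)]

data = [(2, 1), (5, 0), (8, 1), (12, 0)]  # (months observed, churned?)
print(pseudo_observations(data, 6))
```

Intuitively, the customer who churned at month 2 gets a pseudo-observation near 0 ("didn't survive to month 6") while the others land near 1, and those values can go straight into an ordinary regression with no proportional hazards assumption in sight.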
What if you could spend $50 per customer to reduce churn in your business by 1 percentage point? Would you do it? Would it make financial sense? Or would you just be burning money?
Sometimes, taking action to reduce customer churn costs money. In those instances, it can be helpful to know how much revenue churn is costing you… and how much of it you could recapture. Lucky for us, there’s a stat for that! It’s called “Restricted Mean Survival Time,” and it allows us to easily quantify the monetary impact of changes in customer churn. Let’s think about putting it to use!
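RMST is just the area under the survival curve up to some horizon τ – the expected number of periods a customer sticks around, out of the first τ. With a monthly curve that's a simple sum, and multiplying by revenue per period puts it in dollars. The curve values and the $30/month figure below are invented:

```python
# Restricted Mean Survival Time as the area under a monthly survival curve.

def rmst(survival, tau):
    """survival: S(t) at t = 0, 1, 2, ... as a step function (S(0) == 1.0).
    Sums the steps over the first tau periods."""
    return sum(survival[t] for t in range(tau))

curve = [1.00, 0.90, 0.82, 0.76, 0.71, 0.67]  # monthly retention, made up
months_kept = rmst(curve, 6)                   # expected months per customer
print(round(months_kept * 30, 2))              # x $30/month revenue per customer
```

To answer the $50 question above, you'd compute RMST twice – once on the current curve and once on the improved-churn curve – and compare the revenue difference per customer against the $50 cost.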
Last week, we discussed using Kaplan-Meier estimators, survival curves, and the log-rank test to start analyzing customer churn data. We plotted survival curves for a customer base, then bifurcated them by gender, and confirmed that the difference between the gender curves was statistically significant. Of course, these methods can only get us so far… What if you want to use multiple variables to predict churn? Will you create 5, 12, or 80 survival curves and try to spot differences between them? Or, what if you want to predict churn based on a continuous variable like age? Surely you wouldn’t want to create a curve for each individual age. That’s preposterous!
Don’t do this. This isn’t useful analysis. This is an etch-a-sketch gone horribly wrong.
Luckily, statisticians (once again, primarily in the medical and engineering fields) are way ahead of us here. A technique called Cox regression lets us do everything we just mentioned in a statistically accurate and user-friendly fashion. In fact, because the technique is powerful, rigorous, and easy to interpret, Cox regression has largely become the “gold standard” for statistical survival analysis. Sound cool? Let’s get started.
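A taste of why the interpretation is so friendly: Cox regression produces one coefficient per predictor, and exponentiating a coefficient gives a hazard ratio – the multiplicative effect of a one-unit change on the instantaneous churn rate. The fitted coefficients below are invented purely to show the arithmetic:

```python
# Turning Cox regression coefficients into hazard ratios (coefficients invented).
import math

coefs = {"monthly_plan": 0.69, "support_tickets": 0.10, "age": -0.02}

for name, beta in coefs.items():
    hr = math.exp(beta)  # hazard ratio for a one-unit increase
    print(f"{name}: hazard ratio {hr:.2f} "
          f"({(hr - 1) * 100:+.0f}% churn risk per unit)")
```

So a coefficient of 0.69 on `monthly_plan` would mean month-to-month customers churn at roughly twice the rate of their comparison group, while the negative coefficient on `age` would mean each additional year slightly lowers churn risk.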
If your company operates on any type of Software as a Service or subscription model, you understand the importance of customer churn to your bottom line. When a customer leaves, you lose not only a recurring source of revenue, but also the marketing dollars you paid out to bring them in. As such, small changes in customer churn can easily bankrupt a profitable business, or turn a slow-mover into a powerhouse.
If you’re ready to get a handle on customer churn in your business, you’re ready to start doing some survival analysis. These statistical methods, which have been applied for decades in medicine and engineering, come in handy any time you’re interested in understanding how long something (customers, patients, car parts) survives and what actions can help it survive longer.
And the best part? The methods involved are mathematically simple, easy to understand and interpret, and widely available in free tools. You don’t need a PhD in stats to do this!
This won’t be you… I promise!