Naive Bayes is an extraordinarily diverse algorithm for categorizing things… it can separate fraudulent from non-fraudulent credit card purchases, spam from legitimate email, or dinosaurs from fictional monsters. It’s most frequent application is in “bag of words” models… statistical models that classify a block of text based on the words that appear within it.
This is not a dinosaur…
In this post, we’ll explore the basics of Naive Bayes classification, and we’ll illustrate using a relatively common problem – assigning genders to a group of people when all you have is their names. And, just for kicks and giggles, we’re going to do it in SQL. Because nobody trains models in SQL.
Looking for a free geocoding solution and frustrated by the lack of options?
Though it’s poorly publicized (and documented), the US Census Bureau maintains a RESTful API for real-time and batch geocoding that’s free, fast, and accurate. It doesn’t even require an API key. Essentially, it’s everything you’d get out of a roll-your-own PostGIS Census/TIGER solution, without the hassle of having to set it up.
The only limitation is that the API mainly provides geocoding for residential addresses. Because the Census Bureau’s primary task is, well, conducting the national census, they’re a lot more interested in where people live than where they work. So, if you need to map business addresses, you may have to go elsewhere. But if you’re mapping customers (or students, or volunteers, or mailing list recipients, or whatever), the service is pretty on-point. In my work, I’ve seen about a 90% match rate for customer addresses when submitted to Census. Not too shabby.
Imagine for a moment that you’ve pulled together the mother of all churn data sets. You’ve got customer lifetime data, demographic data, and usage information. You know how many support tickets your customers have submitted, what those tickets were about, and whether they were happy with the customer service they received. You know what some of your customers had for breakfast this morning. OK, maybe not the breakfast thing. But it’s a lot of data.
Excited about your work, you pop all of this into a cox regression model, and the proportional hazards test blows up. Majorly. You take most of the variables out of the model, and you’re left with some analysis that doesn’t violate any key assumptions, but that also doesn’t tell you much of anything. What do you do?
You, sir, are stuck! Or are you?
One of the easiest ways to tackle these challenges is to create “pseudo-observations” from your survival data. These pseudo-observations can be plugged into regular statistical models that don’t have a proportional hazards assumption. It’s a great way out of a tight spot.
What if you could spend $50 per customer to reduce churn in your business by 1 percentage point. Would you do it? Would it make financial sense? Or would you just be burning money?
Sometimes, taking action to reduce customer churn costs money. In those instances, it can be helpful to know how much revenue churn is costing you… and how much of it you could recapture. Lucky for us, there’s a stat for that! It’s called “Restricted Mean Survival Time,” and it allows us to easily quantify the monetary impact of changes in customer churn. Let’s think about putting it to use!