I just posted a simple implementation of WTTE-RNNs in Keras on GitHub: Keras Weibull Time-to-event Recurrent Neural Networks. I'll let you read up on the details in the linked information, but suffice it to say that this is a specific type of neural net that handles time-to-event prediction in a super intuitive way. If you're thinking of building a model to predict (rather than understand) churn, I'd definitely consider giving this a shot. And with Keras, implementing the model is pretty darn easy.
As proof of the model's effectiveness, here's my demo model (with absolutely no optimization) predicting the remaining useful life of jet engines. It's not perfect by any means, but it's definitely giving a pass to engines that are in the clear, and flagging ones that are more likely to fail (plus a few false positives). I have no idea how much better it could do with some tweaking:
Also, if anybody's curious as to why I've been in a bit of a post desert lately, my wife and I recently had a baby and I haven't been giving as much thought to the blog. However, I have some ideas brewing!
Since my wife and I have a baby on the way, we've spent a lot of time thinking about names lately. We've poured through dozens of lists of thousands of names, we've used sites and other tools, we've researched histories - everything. And we've found that most of the tools weren't terribly helpful.
WHAT IS YOUR NAME?
After playing around with all of those baby naming tools, I recently took a stab myself and built a website that lets you find names that sound like ones you already like. So, put in "Aubrey" and you'll get suggestions like "Aubree," "Avery," and "Audrey." The algorithms aren't perfect yet, the code is currently a massive pile of hacks to support a proof-of-concept, and I have no idea if parents-to-be will find the site useful... but it's been an interesting project to try my hand at.
For today's post, I'll simply be highlighting some of the algorithms I used to find words that sound similar, and how to implement them in SQL. (I won't get into exactly how I put them all together. Can't give away the secret sauce... at least not yet.)
My company has a subscription-based business model, which means we spend a lot of time analyzing customer churn. We wanted to include Kaplan-Meier survival curves in some of our executive dashboards, but neither our database (Redshift) nor any of our commonly used dashboarding tools (Tableau, Periscope, etc.) provided the necessary functionality. We could, of course, have pulled data out of the warehouse, analyzed it in R or Python, and pushed it back up, but that's pretty complicated. So we went looking for a better solution.
Based on all available evidence, the survival curve for aliens singing disco music hits 0 at about 43 seconds (N = 1).
As you likely guessed from the title of this post, that better solution involved writing our own code for calculating a Kaplan-Meier estimator in SQL. In this post, I'll be walking through our strategy step-by-step, including the SQL code for calculating the estimators and making that code reusable in Periscope. Let's do this!
Still migrating old posts due to travel. Next post will be fresh content!
In a previous post, we learned that if you want to maximize your score on any individual turn of a game of "Pass the Pigs," you should always roll when there's less than 22.5 points in your hand, and hold when there's more than 22.5 points in your hand. (If you've never heard of "Pass the Pigs," the rules are explained in the prior post.)
I wouldn't mind being passed this pig for a few minutes... that's some serious cuteness right there.
However, we also concluded that that's not an effective strategy for winning the game as a whole. If you have a score of 0 and your opponent has a score of 99, for example, it would be really silly to stop rolling at 23 points just because the "22.5 rule" says to. So what's a person to do? How do you play effectively? Today, we'll generate a strategy that can help you make an optimal move in any situation. (Hint: you'll need to do a lot of math.)
Since I'm out of town for a bit, I'm migrating over a few relevant posts from an old blog of mine that I'm planning to shut down. Enjoy!
Pass the Pigs is a simple yet addictive dice game that uses cute little plastic pigs as dice. If you've never played, the rules are very straightforward. On each turn, a player rolls two pigs. The pigs will land in different positions, which will determine how many points the player has in their hand for that turn. The player may then decide to "pass the pigs" to the next player. If they do this, all of the points in their hand will be added to their official score. They may also decide to roll the pigs again to try to add more points to their hand before passing the pigs. But they must be careful! If the pigs both land on their sides with one showing a dot and the other showing a blank side, they "pig out" and lose all of the points they've accumulated in their hand! It's risky business. The first player to accumulate a score of 100 or higher wins.
Credit: Larry Moore
Like anything with dice (even pig-shaped dice), Pass the Pigs is a game of chance. That means, with a little effort, we should be able to figure out the probabilities of certain things happening in the game, and develop some optimal strategies. So how do you win at Pass the Pigs? Read on to find out.
Pretty much any language commonly used for data analysis (R, SAS, Python) can calculate the distance between two geographic coordinates with relative ease. But always having to pull your data out of your data warehouse any time you want to do some basic geographic analysis can be frustrating - sometimes it's nice to keep simple queries all in one system. If you've got a spatially enabled version of Postgres or SQL Server, you're in business. But if not, you'll need to roll your own SQL solution.
Because the earth is a sphere, the quickest route between two points is a "Great Circle," which may appear curved on flat maps...
In today's post, we're going to write our own code in vanilla SQL to calculate the distance between two latitude and longitude coordinates.
In a past post on analyzing churn in the subscription or Software as a Service business, I talked about two different ways to quantify the dollar cost of churn. You could use 1 / churn as an estimation of mean customer lifetime (though this simple method makes a lot of assumptions). Or, you could use “pseudo-observations” to calculate the dollar value of certain groups of customers during a particular time period (which doesn’t let you quantify the full lifetime value of a customer).
But what if there was another way? What if we took our Kaplan-Meier best estimate of our churn curve, fit a linear model to that model, and then projected it out?
A model within a model, if you will. Churnception.
Well, as it turns out, we’d get a reasonable estimation of our lifetime churn curve, which would let us estimate average customer lifetime, and customer lifetime value. Let’s get started.
One of the best ways to learn how a statistical model really works is to code the underlying math for it yourself. Today, we’re going to do that with simple linear regression.
In the book Data Smart, John Foreman introduces a bunch of awesome methodologies by walking you through how to build them in Excel…
Of course, doing regression in SQL also has (some) practical use as well! For example, suppose you wanted to identify which city in a database of temperature records had the biggest warming trend in the last month. This method would send you on your way without having to bring your data into an external tool. Nifty!
For example, how much better is the optimal approach than the heuristic? Is there ever a reason not to prefer the optimal approach? And what are some other improvements and techniques that a researcher using these macros might want to know about? I’ll spend this post reflecting on our two solutions and covering a few of these bases.
In last week’s post, we constructed a set of constraints to bound a binary integer program for solving the small cell suppression problem. These constraints allow us to ensure that every group of data points which could be aggregated across in a tabular report contains either 0 or 2+ suppressed cells.
At some point before age five, every kid masters the art of satisfying constraints with solutions that are hilariously non-optimal.
Obviously, there’s plenty of ways we could satisfy our constraints – suppressing everything, for example. But we want choose the optimal pattern of secondarily suppressed cells to minimize data loss. So, we’re going to tackle the problem using binary integer programming in PROC OPTMODEL. Strap yourself in, folks – it’s going to be an exciting ride.