Counting is hard. You might be surprised to hear me say that, but it's true. As a data scientist, I've done it all - everything from simple regression analysis all the way to coding Hadoop MapReduce jobs that process hundreds of billions of data points each month. And, with all that experience, I've found that counting often involves far more time and effort than you'd expect.
He makes it look so easy!
So why do I think counting is so hard? Let me give you three reasons.
Since my wife and I have a baby on the way, we've spent a lot of time thinking about names lately. We've pored over dozens of lists of thousands of names, we've used naming sites and other tools, we've researched histories - everything. And we've found that most of the tools weren't terribly helpful.
WHAT IS YOUR NAME?
After playing around with all of those baby naming tools, I recently took a stab myself and built a website that lets you find names that sound like ones you already like. So, put in "Aubrey" and you'll get suggestions like "Aubree," "Avery," and "Audrey." The algorithms aren't perfect yet, the code is currently a massive pile of hacks to support a proof-of-concept, and I have no idea if parents-to-be will find the site useful... but it's been an interesting project to try my hand at.
For today's post, I'll simply be highlighting some of the algorithms I used to find words that sound similar, and how to implement them in SQL. (I won't get into exactly how I put them all together. Can't give away the secret sauce... at least not yet.)
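I can't speak for the exact blend behind the site, but one classic phonetic algorithm in this family is American Soundex, which reduces a name to a letter plus three digits so that similar-sounding names collide. Here's a sketch in Python rather than SQL (using the common "strip h and w first" simplification):

```python
import itertools

def soundex(word: str) -> str:
    """American Soundex: first letter plus three digits coding the
    remaining consonants. Names that sound alike tend to share a code."""
    digits = {**dict.fromkeys("bfpv", "1"),
              **dict.fromkeys("cgjkqsxz", "2"),
              **dict.fromkeys("dt", "3"), "l": "4",
              **dict.fromkeys("mn", "5"), "r": "6"}
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    # h and w are dropped before coding; vowels and y become "0" separators
    stripped = word[0] + "".join(ch for ch in word[1:] if ch not in "hw")
    coded = "".join(digits.get(ch, "0") for ch in stripped)
    # Collapse runs of the same digit, drop the first letter's own code,
    # then discard the vowel separators and pad to four characters
    collapsed = "".join(k for k, _ in itertools.groupby(coded))
    tail = collapsed[1:].replace("0", "")
    return (word[0].upper() + tail + "000")[:4]
```

Soundex alone maps "Aubrey," "Aubree," and "Avery" all to A160, but puts "Audrey" at A360 - a hint of why a single algorithm isn't enough and several get combined in practice.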
Many of the challenges of using Hadoop with small files are well-documented. But there's one I haven't seen discussed much: optimizing the maximum split size for the CombineFileInputFormat and its derivatives (e.g., CombineTextInputFormat). This is actually a pretty big issue - improper configuration can cause out-of-memory errors or degraded performance.
An Elephant never forgets. But, apparently, it can run out of memory.
Thankfully, with a little bit of knowledge about your input data and your cluster, you can determine a value for this setting that will keep your jobs running along happily.
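To make "a little bit of knowledge about your input data and your cluster" concrete, here's a back-of-envelope heuristic - my own assumption, not a Hadoop API or necessarily the method from the post: size the splits so the job runs in a small number of mapper "waves," while never dropping below one HDFS block:

```python
def suggest_max_split_size(total_input_bytes: int, map_slots: int,
                           waves: int = 2,
                           floor_bytes: int = 128 * 1024 * 1024) -> int:
    """Rough heuristic: aim for map_slots * waves splits total. Larger
    splits mean fewer, fatter mappers; too large and a single mapper can
    exhaust its heap. Never go below one HDFS block (128 MiB here)."""
    target_splits = max(map_slots * waves, 1)
    return max(total_input_bytes // target_splits, floor_bytes)
```

The resulting byte count is what you'd hand to the job, e.g. via `-D mapreduce.input.fileinputformat.split.maxsize=<bytes>`, the property CombineFileInputFormat consults in Hadoop 2.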
Although Amazon Redshift is a fantastic data warehousing product, it lacks several of the features available in other database systems. One such feature is materialized views, which are just what they sound like... views that have been pre-calculated and physically stored on disk for faster querying, and which are updated on request.
No, not that kind of View.
Materialized views are very useful any time you want to have the results of complex queries readily and quickly available, and you don't mind if the data is a few hours stale. For instance, suppose you have an executive dashboard that displays your monthly gross margin for your highly successful app, which has millions of users. The query that does those calculations runs for 20 minutes (way longer than an exec is going to want to wait), but it only needs to be updated once a month. Materialized views are the perfect solution. Calculate once, cache the data, and reference the cache on-demand. Refresh when needed.
The "Redshift View Materializer", now available on GitHub, is a simple Python script that creates tables containing the results of arbitrary SQL queries on-demand. Simply set the script to run as a cron job whenever you want your tables re-created, and you'll end up with a reasonably close approximation of materialized views. The script is in the public domain, but I'd love it if you'd contribute any improvements, bug fixes, or additions!
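The real script is on GitHub; purely as a stripped-down illustration of the core idea (the names and the staging-table swap below are mine, not necessarily how the script does it), the refresh boils down to generating a handful of statements:

```python
def materialize_sql(table: str, query: str) -> list:
    """Build the statements that rebuild `table` from `query`. A staging
    table is populated first, then swapped in inside a transaction, so
    readers never see a half-built table."""
    staging = table + "_staging"
    return [
        "DROP TABLE IF EXISTS %s;" % staging,
        "CREATE TABLE %s AS %s;" % (staging, query),
        "BEGIN;",
        "DROP TABLE IF EXISTS %s;" % table,
        "ALTER TABLE %s RENAME TO %s;" % (staging, table),
        "COMMIT;",
    ]
```

Run those against the cluster on whatever schedule fits - for the monthly gross-margin dashboard above, a once-a-month cron entry is plenty.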
As I've been working with Hadoop a lot in the last several months, I've come to realize that it doesn't deal gracefully with corrupt files (e.g., malformed gzip files). I would throw a cluster at a couple hundred thousand files (of which one or two were bad) and the job would die two hours into execution, throwing EOFException errors all over the place. If I were only processing one file, I suppose that's a reasonably acceptable outcome. But when 99.9% of your files are fine, and the corrupt ones aren't recoverable anyway, there's no sense in blowing up the whole job just because a trivial portion of the data was bad.
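In Hadoop itself the fix lives in a custom RecordReader (Java) that swallows the exception; purely to illustrate the "log and skip, don't die" pattern, here's the same idea in Python:

```python
import gzip
import logging
import zlib

def readable_lines(paths):
    """Yield lines from a batch of gzip files, logging and skipping any
    file that turns out to be truncated or corrupt rather than aborting
    the whole run."""
    for path in paths:
        try:
            with gzip.open(path, "rt", errors="replace") as f:
                yield from f
        except (EOFError, OSError, zlib.error) as exc:
            # BadGzipFile subclasses OSError; truncated streams raise EOFError
            logging.warning("skipping corrupt file %s: %s", path, exc)
```

The good 99.9% of files stream through untouched; the one or two bad ones cost you a log line instead of a dead two-hour job.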
Every survival analysis method I've talked about so far in this series has had one thing in common: we've only looked at one event in a customer lifetime (churn). In many cases, that's a perfectly fine way to go about things... we want our customers to stick with us, so churn is the event of interest. So why would we ever need to think about competing risks?
You know, competing risks. Will you die by tornado, or by shark?
There's actually a critical assumption undergirding most survival analysis methods for right-censored data - that censored individuals have the same likelihood of experiencing the event of interest as individuals who were never censored. If this assumption is violated, things like Kaplan-Meier estimators can become wildly inaccurate. (If you need a refresher on Kaplan-Meier curves and other concepts, take a look at my earlier post on basic survival analysis.)
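To see exactly where that assumption lives, here's a bare-bones Kaplan-Meier estimator in Python (a teaching sketch - use a real survival library in practice). Censored customers simply leave the risk set with no other effect, which is only legitimate if they were no more or less likely to churn than those who remained:

```python
def kaplan_meier(times, events):
    """times[i]: observed duration for customer i;
    events[i]: 1 if churned at times[i], 0 if censored then.
    Returns [(t, S(t))] at each distinct churn time."""
    pts = sorted(zip(times, events))
    at_risk = len(pts)
    surv = 1.0
    curve = []
    i = 0
    while i < len(pts):
        t = pts[i][0]
        block = [e for tt, e in pts if tt == t]
        deaths = sum(block)
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        # Censored customers just drop out of at_risk here -- valid only
        # under independent censoring, the assumption discussed above.
        at_risk -= len(block)
        i += len(block)
    return curve
```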
Kia recommends that I get the oil in my 2009 Rio changed every 7,500 miles. But, anecdotally, it seemed that I always got better gas mileage right after an oil change than I did right before I was due for another one. So, I got to wondering - if an oil change costs $20, but saves me a few MPGs, is it cheaper overall to change my oil sooner than 7,500 miles? If so, where's the optimal point?
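The underlying math is just a break-even comparison: fuel cost per mile plus the oil change amortized over the interval. With made-up numbers (the gas price and the supposed MPG gain below are illustrative, not measurements):

```python
def cost_per_mile(mpg, gas_price, oil_change_cost, interval_miles):
    """Total running cost per mile: fuel, plus the oil change spread
    across the miles driven between changes. Inputs are hypothetical."""
    return gas_price / mpg + oil_change_cost / interval_miles

# If changing at 5,000 miles really bought an extra MPG (30 -> 31)
# at $3.50/gal and $20 per change...
baseline = cost_per_mile(30, 3.50, 20, 7500)
eager = cost_per_mile(31, 3.50, 20, 5000)
```

Under those invented numbers the earlier change would win by a fraction of a cent per mile - which is exactly why it's worth measuring the MPG effect instead of trusting a hunch.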
So, I got all ready to do some fancy math, and began tracking my gas mileage between two oil changes... And this is what I found:
This is why anecdotal evidence is unreliable.
At least based on this one oil change cycle, my gas mileage theory was 100% imagination. I should stick to oil changes at 7,500 mile intervals.
Of course, this is really a sample of only a single oil change cycle. It started in summer and went into winter, so winter fuel blends came out (which generally yield slightly lower MPGs), and I presumably used the A/C less (though my defrost requires my A/C to be on, and I use the defrost a lot). There may be other complicating factors as well. So, I plan to track this all over again and revisit after my next oil change cycle... because gut feelings die hard.
Last week, I looked into options for increasing the performance of an Amazon Redshift cluster that was using 10 large dense compute nodes (dc1.large). While investigating, I noticed that a cluster of 32 dc1.large nodes (the maximum for that node type) had the same number of CPUs, the same amount of storage, and a comparable amount of RAM to a cluster of 2 dc1.8xlarge nodes (the minimum for that node type), while offering significant cost savings for anything less than a 3-year term. This got me wondering... if you want a cluster that's around that size, which is a better bet for performance?
We'll get the answer out of the way early for the tl;dr crowd.
The results of testing out three diverse queries on various-sized Redshift clusters.
The cluster of 32 dc1.large nodes provided substantially (13%, on average) faster performance for all but the shortest query in my test batch. Given the superior performance and lower cost, it was the obvious choice for a cluster of this size. Whenever more performance is called for, it will make sense to switch to a cluster of 3 dc1.8xlarge nodes...
If you've ever done churn analysis using Cox regression with time-dependent covariates, you know that the hardest part of that type of research is building your base data set. You have to divide each customer's lifetime into "chunks" where the changing values of a host of different predictor variables apply. I've coded this in SQL before, and it gets ugly. Fast.
Merge, everybody, merge!
Thankfully, R has a solution for this that makes life significantly simpler: the "tmerge" function in the survival package. Today, we'll walk through how to use it.
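Before diving into tmerge itself, it helps to see the bookkeeping it automates. Here's a conceptual sketch in Python (not tmerge's actual interface): split each customer's lifetime at every covariate change, carrying the current values forward into each chunk:

```python
def split_lifetime(follow_up, churned, covariate_updates):
    """One customer's (tstart, tstop] counting-process rows.
    follow_up: total observed time; churned: True if the lifetime ends
    in churn; covariate_updates: {time: {name: value}} of changes, with
    time 0 holding the starting values."""
    cuts = sorted(t for t in covariate_updates if 0 < t < follow_up)
    bounds = [0] + cuts + [follow_up]
    state, rows = {}, []
    for start, stop in zip(bounds, bounds[1:]):
        state.update(covariate_updates.get(start, {}))
        rows.append({"tstart": start, "tstop": stop,
                     "event": int(churned and stop == follow_up),
                     **state})
    return rows
```

Each row says "from tstart to tstop, these covariate values applied, and the event did or didn't happen at tstop" - exactly the shape a time-dependent Cox model wants, and exactly the SQL that gets ugly fast.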
There are numerous strategies for dealing with non-proportional hazards in Cox regression analysis. You can stratify your data based on a problematic variable. You can chuck the Cox model and create "pseudo-observations" to analyze the gains (or losses) in lifetime within a certain period associated with changes in a variable. If age is a problem (unlikely for customer churn, but it happens a lot in medical contexts), you can use age rather than time in the cohort as your time scale. The list goes on.
But this is statistics! We're supposed to be modeling things!
Statistics. Almost as cool as Sparta.
Well, as it turns out, it's actually possible to directly model how the effects of a variable change with time, so that you can not only handle the proportional hazards problem, but also get a reliable estimate of how hazard ratios for a given variable change with time. The way to do this is actually incredibly simple... we introduce an interaction term between the variable of interest and time. Let's get started!
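What does such a model buy you? Once fit, the hazard ratio for the covariate is no longer a single number but a function of time: with an interaction on g(t) (log time is a common choice), HR(t) = exp(b_main + b_int * g(t)). A tiny helper, with purely illustrative coefficients:

```python
import math

def hazard_ratio(t, beta_main, beta_interaction, g=math.log):
    """Hazard ratio at time t for a one-unit covariate increase, from a
    Cox model with terms x and x:g(t)."""
    return math.exp(beta_main + beta_interaction * g(t))

# Illustrative coefficients only: an effect that starts strong and fades
early = hazard_ratio(1, 0.9, -0.3)   # exp(0.9), since log(1) == 0
late = hazard_ratio(50, 0.9, -0.3)   # smaller: the effect has decayed
```

In R, a model like this is typically fit via coxph's time-transform (tt) argument or by splitting follow-up into episodes, rather than by naively multiplying the covariate by the observed event-time column.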