Many of the challenges of using Hadoop with small files are well-documented. But there's one thing I haven't seen a lot of discussion about: optimizing the maximum split size for the CombineFileInputFormat and its derivatives (e.g., CombineTextInputFormat). But this is actually a pretty big issue - improper configuration can cause out-of-memory errors or degraded performance.
An Elephant never forgets. But, apparently, it can run out of memory.
Thankfully, with a little bit of knowledge about your input data and your cluster, you can determine a value for this setting that will keep your jobs running along happily.
Although Amazon Redshift is a fantastic data warehousing product, it lacks several of the features available in other database systems. One such feature is materialized views, which are just what they sound like... views that have been pre-calculated and physically stored on disk for faster querying, and which are updated on request.
No, not that kind of View.
Materialized views are very useful any time you want to have the results of complex queries readily and quickly available, and you don't mind if the data is a few hours stale. For instance, suppose you have an executive dashboard that displays your monthly gross margin for your highly successful app, which has millions of users. The query that does those calculations runs for 20 minutes (way longer than an exec is going to want to wait), but it only needs to be updated once a month. Materialized views are the perfect solution. Calculate once, cache the data, and reference the cache on-demand. Refresh when needed.
The "Redshift View Materializer", now available on GitHub, is a simple Python script that creates tables containing the results of arbitrary SQL queries on-demand. Simply set the script to run as a cron-job whenever you want your tables re-created, and you'll end up with a reasonably close approximation of materialized views. The script is in the public domain, but I'd love if you'd contribute if you make any improvements, bug fixes, or additions!
As I've been working with Hadoop a lot in the last several months, I've come to realize that it doesn't deal gracefully with corrupt files (e.g., mal-formed gzip files). I would throw a cluster at a couple hundred thousand files (of which one or two were bad) and the job would die two hours into execution, throwing EOFException errors all over the place. If I was only processing one file, I suppose that's a reasonably acceptable outcome. But when 99.9% of your files are fine, and the corrupt ones aren't recoverable anyway, there's no sense in blowing up the whole job just because a trivial portion of the data was bad.
Every survival analysis method I've talked about so far in this series has had one thing in common: we've only looked at one event in a customer lifetime (churn). In many cases, that's a perfectly fine way to go about things... we want our customers to stick with us, so churn is the event of interest. So why would we ever need to think about competing risks?
You know, competing risks. Will you die by tornado, or by shark?
There's actually a critical assumption undergirding most survival analysis methods for right-censored data - that censored individuals have the same likelihood of experiencing the event of interest as individuals that never got censored. If this assumption ever gets violated, things like Kaplan-Meier estimators can become wildly inaccurate. (If you need a refresher on Kaplan-Meier curves and other concepts, take a look at my earlier post on basic survival analysis.)
Kia recommends that I get the oil in my 2009 Rio changed every 7,500 miles. But, anecdotally, it seemed that I always got better gas mileage right after an oil change than I did right before I was due for another one. So, I got to wondering - if an oil change costs $20, but saves me a few MPGs, is it cheaper overall to change my oil sooner than 7,500 miles? If so, where's the optimal point?
So, I got all ready to do some fancy math, and began tracking my gas mileage between two oil changes... And this is what I found:
This is why anecdotal evidence is unreliable.
At least based on this one oil change cycle, my gas mileage theory was 100% imagination. I should stick to oil changes at 7,500 mile intervals.
Of course, this is really a sample of only a single oil change cycle. It started in summer and went into winter, so winter fuel blends came out (which generally have slightly higher MPGs), and I presumably used the A/C less (though my defrost requires my A/C to be on, and I use the defrost a lot). There may be other complicating factors as well. So, I plan to track this all over again and re-visit after my next oil change cycle... because gut feelings die hard.
Last week, I looked into options for increasing the performance of an Amazon Redshift cluster that was currently using 10 large dense compute nodes (dc1.large). While investigating, I noticed that a cluster of 32 dc1.large nodes (the maximum for that node type) had the same number of CPUs, the same amount of storage, and a comparable amount of RAM to a cluster of 2 dc1.8xlarge nodes (the minimum for that node type), while offering significant cost savings for anything less than a 3-year term. This got me wondering... if you want a cluster that's around that size, which is a better bet for performance?
We'll get the answer out of the way early for the tl;dr crowd.
The results of testing out three diverse queries on various-sized Redshift clusters.
The cluster of 32 dc1.large nodes provided substantially (13%, on average) faster performance for all but the shortest query in my test batch. Given the superior performance and lower cost, it was the obvious choice for a cluster of this size. Whenever more performance is called for, it will make sense to switch to a cluster of 3 dc1.8xlarge nodes...
If you've ever done churn analysis using cox regression with time-dependent covariates, you know that the hardest part of doing that type of research is building your base data set. You have to divide each customer's lifetime into "chunks" where the changing values of a host of different predictor variables apply. I've coded this in SQL before, and it gets ugly. Fast.
Merge, everybody, merge!
Thankfully, R has a solution for this that makes life significantly more simple. It's the "tmerge" function, and it's in the survival package. Today, we'll walk through how to use it.
There are numerous strategies for dealing with non-proportional hazards in cox regression analysis. You can stratify your data based on a problematic variable. You can chuck the cox model and create "pseudo-observations" to analyze the gains (or losses) in lifetime within a certain period associated with changes in a variable. If age is a problem (unlikely for customer churn, but it happens a lot in medical contexts), you can use age rather than time in the cohort as your time scale. The list goes on.
But this is statistics! We're supposed to be modeling things!
Statistics. Almost as cool as Sparta.
Well, as it turns out, it's actually possible to directly model how the effects of a variable change with time, so that you can not only handle the proportional hazards problem, but also get a reliable estimate of how hazard ratios for a given variable change with time. The way to do this is actually incredibly simple... we introduce an interaction term between the variable of interest and time. Let's get started!
My previous series of guides on survival analysis and customer churn has become by far the most popular content on this blog, so I'm coming back around to introduce some more advanced techniques...
When you're using cox regression to model customer churn, you're often interested in the effects of variables that change throughout a customer's lifetime. For instance, you might be interested in knowing how many times that customer has contacted support, how many times they've logged in during the last 30 days, or what web browser(s) they use. If you have, say, 3 years of historical customer data and you set up a cox regression on that data using covariate values that are applicable to customers right now, you'll essentially be regressing customer's churn hazards from months or years ago on their current characteristics. Your model will be allowing the future to predict the past. Not terribly defensible.
In the classic double-slit experiment, past events are seemingly affected by current conditions. But unless you're a quantum physicist or Marty McFly, you're probably not going to see causality working this way.
In this post, we'll walk through how to set up a cox regression using "time-dependent covariates," which will allow us to model historical hazard rates on variables whose values were applicable at the time.
I recently downloaded the entire history of the Billboard Hot 100 chart, which has been tracking the most popular 100 US music tracks since 1958. So, I decided to see what I could learn!
Primary colors. Except not quite.
It only took me a couple of minutes to generate some interesting factoids. For instance, did you know…
That Rockin’ Around the Christmas Tree and Nat King Cole’s version of The Christmas Song both made it onto the charts for the first time in 1960… and both reappeared as recently as 2014?
Or that Imagine Dragon’s Radioactive spent more consecutive weeks on the chart than any song ever… at 85 weeks?
That the Beatles managed to have 14 songs on the Billboard Hot 100 at the same time in April of 1964? And that 5 of those songs managed to make the top 10 list simultaneously? Absolutely dominant.
How about that very few artists have managed to produce a #1 single, with no other charted songs… though the feat has been managed by Soulja Boy (with Crank That), Baauer (with Harlem Shake), and Daniel Powter (with Bad Day)? That’s the pinnacle of “one hit wonder” status right there.