Amazon Redshift Performance – Bigger Clusters, or Bigger Nodes?

Last week, I looked into options for increasing the performance of an Amazon Redshift cluster that was running 10 dense compute nodes (dc1.large). While investigating, I noticed that a cluster of 32 dc1.large nodes (the maximum for that node type) has the same number of CPUs and the same amount of storage as a cluster of 2 dc1.8xlarge nodes (the minimum for that node type), along with a comparable amount of RAM, while offering significant cost savings for anything less than a 3-year term. This got me wondering... if you want a cluster around that size, which is the better bet for performance?

We'll get the answer out of the way early for the tl;dr crowd.

The results of testing out three diverse queries on various-sized Redshift clusters.

The cluster of 32 dc1.large nodes provided substantially (13%, on average) faster performance for all but the shortest query in my test batch. Given the superior performance and lower cost, it was the obvious choice for a cluster of this size. Whenever more performance is called for, it will make sense to switch to a cluster of 3 dc1.8xlarge nodes...

Continue reading

Analyzing Customer Churn – Using “tmerge” to Build Data Sets

If you've ever done churn analysis using Cox regression with time-dependent covariates, you know that the hardest part of that kind of research is building your base data set. You have to divide each customer's lifetime into "chunks" over which the changing values of a host of different predictor variables apply. I've coded this in SQL before, and it gets ugly. Fast.

Merge, everybody, merge!

Thankfully, R has a solution for this that makes life significantly simpler. It's the "tmerge" function, and it's in the survival package. Today, we'll walk through how to use it.
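To give a flavor of how it works, here's a minimal sketch with made-up data frames and column names (they're illustrative, not from the post):

```r
library(survival)

# Toy data: one row per customer, plus a log of support contacts.
customers <- data.frame(
  id      = 1:3,
  futime  = c(120, 200, 90),   # days from signup to churn or censoring
  churned = c(1, 0, 1)         # 1 = churned, 0 = still active
)
contacts <- data.frame(
  id  = c(1, 1, 2, 3),
  day = c(10, 45, 30, 15)      # day of each support contact
)

# Step 1: establish each customer's follow-up window and churn event.
base <- tmerge(customers, customers, id = id,
               churn = event(futime, churned))

# Step 2: add a time-dependent covariate -- the cumulative number of
# support contacts -- which splits each customer's follow-up into
# (tstart, tstop] chunks at every contact.
base <- tmerge(base, contacts, id = id, n_contacts = cumtdc(day))

# The result plugs straight into a Cox model in counting-process form.
coxph(Surv(tstart, tstop, churn) ~ n_contacts, data = base)
```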

Continue reading

Analyzing Customer Churn – Time-Dependent Coefficients

There are numerous strategies for dealing with non-proportional hazards in Cox regression analysis. You can stratify your data based on a problematic variable. You can chuck the Cox model and create "pseudo-observations" to analyze the gains (or losses) in lifetime within a certain period associated with changes in a variable. If age is a problem (unlikely for customer churn, but it happens a lot in medical contexts), you can use age rather than time in the cohort as your time scale. The list goes on.

But this is statistics! We're supposed to be modeling things!

Statistics. Almost as cool as Sparta.

Well, as it turns out, it's possible to directly model how the effect of a variable changes with time, so you can not only handle the proportional hazards problem, but also get a reliable estimate of how the hazard ratio for a given variable evolves. The way to do this is actually incredibly simple... we introduce an interaction term between the variable of interest and time. Let's get started!
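As a preview, here's a minimal sketch of that idea using coxph's tt argument, with the survival package's built-in veteran data set standing in for churn data (karno is just a placeholder for whatever covariate is violating proportional hazards):

```r
library(survival)

# Interact karno with log(time) via the tt() mechanism. The tt() term
# estimates how karno's log hazard ratio drifts as time goes on.
fit <- coxph(Surv(time, status) ~ karno + tt(karno),
             data = veteran,
             tt   = function(x, t, ...) x * log(t))
summary(fit)

# A tt() coefficient near zero suggests a roughly constant effect; a
# clearly nonzero one quantifies how the hazard ratio changes over time.
```

The log(t) transform is just one convenient choice; any function of time can go in the tt argument.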

Continue reading

Analyzing Customer Churn – Time-Dependent Covariates

My previous series of guides on survival analysis and customer churn has become by far the most popular content on this blog, so I'm coming back around to introduce some more advanced techniques...

When you're using Cox regression to model customer churn, you're often interested in the effects of variables that change throughout a customer's lifetime. For instance, you might want to know how many times that customer has contacted support, how many times they've logged in during the last 30 days, or what web browser(s) they use. If you have, say, 3 years of historical customer data and you set up a Cox regression on that data using covariate values that apply to customers right now, you'll essentially be regressing customers' churn hazards from months or years ago on their current characteristics. Your model will be allowing the future to predict the past. Not terribly defensible.

In the classic double-slit experiment, past events are seemingly affected by current conditions. But unless you're a quantum physicist or Marty McFly, you're probably not going to see causality working this way.

In this post, we'll walk through how to set up a Cox regression using "time-dependent covariates," which will allow us to model historical hazard rates on variables whose values were applicable at the time.
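The fix boils down to reshaping the data into "counting process" form, where each row covers a (tstart, tstop] slice of a customer's lifetime and carries the covariate values that were true during that slice. Here's a minimal sketch with a hypothetical data frame and column names:

```r
library(survival)

# Hypothetical long-format churn data: one row per customer-interval.
# support_tickets holds the value that applied *during* each interval;
# additional time-varying covariates would simply be more columns.
churn_long <- data.frame(
  id              = c(1, 1, 2, 2, 3, 3, 4, 4),
  tstart          = c(0, 30, 0, 50, 0, 40, 0, 20),   # days since signup
  tstop           = c(30, 80, 50, 120, 40, 150, 20, 60),
  churned         = c(0, 1, 0, 1, 0, 0, 0, 1),       # event flag at tstop
  support_tickets = c(0, 2, 1, 3, 0, 1, 2, 4)
)

# Each interval is evaluated against the covariate values in force at the
# time, so historical hazards are no longer regressed on present-day data.
fit <- coxph(Surv(tstart, tstop, churned) ~ support_tickets,
             data = churn_long)
summary(fit)
```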

Continue reading

Analyzing the Billboard Hot 100 Archive

I recently downloaded the entire history of the Billboard Hot 100 chart, which has tracked the 100 most popular songs in the US since 1958. So, I decided to see what I could learn!

Primary colors. Except not quite.

It only took me a couple of minutes to generate some interesting factoids. For instance, did you know…

  • That Rockin’ Around the Christmas Tree and Nat King Cole’s version of The Christmas Song both made it onto the charts for the first time in 1960… and both reappeared as recently as 2014?
  • Or that Imagine Dragons’ Radioactive spent more consecutive weeks on the chart than any song ever… at 85 weeks?
  • That the Beatles had 14 songs on the Billboard Hot 100 at the same time in April of 1964? And that 5 of those songs made the top 10 simultaneously? Absolutely dominant.
  • How about that very few artists have produced a #1 single with no other charted songs… though the feat has been managed by Soulja Boy (with Crank That), Baauer (with Harlem Shake), and Daniel Powter (with Bad Day)? That’s the pinnacle of “one hit wonder” status right there.

Of course, a lot of these sorts of Hot 100 facts have already been calculated, so I decided to see what new information I could come up with!

Continue reading

Scraping Grocery Store Ads for Fun and Profit

If we’re honest, I imagine most of us would admit we don’t really know what a good price is on the grocery items we purchase regularly. Except for a few high-priced favorites (e.g., ribeyes and salmon) that I watch for sales, I honestly have no idea what’s a regular price and what’s a good deal. How much does a box of raisin bran cost? Whatever the grocery store charges me…

A Harris Teeter in my area. Usually, their prices are higher… except when they’re not. Honestly, I don’t really know.

Obviously, this is a really terrible way to manage my grocery budget – I probably spring for “deals” that are nothing of the sort all the time. So, as a data scientist, I got to thinking… what if I could keep tabs on this? Build a database of historic prices for various items, so I’d know when to pull the trigger on sales. Seems straightforward enough.

Well, I needed some data to get started. So, I figured I’d see if I could programmatically scrape prices out of the online weekly ad for my local Kroger store. In this post I’ll walk through how I got that set up… and, as this project moves along, I’ll post updates on what I do with the data.

Continue reading

Data Show Stores Decorating for Christmas Earlier Each Year

Seems like every year around this time, I hear folks complaining that corporate America is decorating for Christmas way too early. Gone are the days when Christmas was reserved for December – now we're lucky to make it through Halloween without seeing wreaths and Christmas trees everywhere. But is any of it true? Are stores really decorating earlier than they used to? I decided to find out.

Of course, there's not really any available data on when stores around the country start decorating for Christmas (at least to my knowledge). So answering the question required some creativity. I started thinking – if there were one thing that represented corporate America's Christmas decorating traditions, what would it be? The Rockefeller Center Christmas Tree, of course!

The Christmas tree in Rockefeller Center.

First erected in 1933, the Rockefeller Center tree has been set up every year since and has become an unofficial start of the Christmas season for New Yorkers. And, since Rockefeller Center is without a doubt a bastion of American corporatism, the tree gives us a relatively good proxy for measuring the start of the corporate Christmas decorating season. If the tree at Rockefeller Center has been going up earlier every year, there's a good chance that's indicative of a larger national trend.

Continue reading

Override Hadoop’s Default Compression Codec Selections

If you're using a standard input format, Hadoop will automatically choose a codec for reading your data by examining the file extension. So, for example, if you have a file with a ".gz" extension, Hadoop will recognize that it's gzipped and load it with the gzip codec. This is all well and good... until you're trying to work with a bunch of compressed files that don't have the proper extension. Then the feature suddenly becomes a burden.

Apparently, "codec" was the name of a 1980's grocery store in France with a hideously '80s logo.

I recently found myself in just this situation, and scoured the internet looking for tips on how to override Hadoop's codec choices. I couldn't find any good resources, so I went digging in the source to build the solution myself. Hopefully, this post will save somebody else the trouble!

Continue reading

SQL Survival Curves with Redshift and Periscope

My company has a subscription-based business model, which means we spend a lot of time analyzing customer churn. We wanted to include Kaplan-Meier survival curves in some of our executive dashboards, but neither our database (Redshift) nor any of our commonly used dashboarding tools (Tableau, Periscope, etc.) provided the necessary functionality. We could, of course, have pulled data out of the warehouse, analyzed it in R or Python, and pushed it back up, but that's pretty complicated. So we went looking for a better solution.

Based on all available evidence, the survival curve for aliens singing disco music hits 0 at about 43 seconds (N = 1).

As you likely guessed from the title of this post, that better solution involved writing our own code for calculating a Kaplan-Meier estimator in SQL. In this post, I'll be walking through our strategy step-by-step, including the SQL code for calculating the estimators and making that code reusable in Periscope. Let's do this!
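For reference, the quantity being computed is the standard Kaplan-Meier estimate of the survival function,

$$\hat{S}(t) = \prod_{t_i \le t} \left(1 - \frac{d_i}{n_i}\right),$$

where the $t_i$ are the distinct churn times, $d_i$ is the number of customers who churned at $t_i$, and $n_i$ is the number still at risk just before $t_i$. The running product is the piece SQL doesn't hand you directly; one common workaround (not necessarily the exact approach used here) is to take the exponential of a windowed sum of logs.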

Continue reading

Optimal Pass the Pigs Strategy (Part Two)

Still migrating old posts due to travel. Next post will be fresh content!

In a previous post, we learned that if you want to maximize your score on any individual turn of a game of "Pass the Pigs," you should keep rolling as long as you have less than 22.5 points in your hand, and hold once you have more. (If you've never heard of "Pass the Pigs," the rules are explained in the prior post.)
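As a refresher on where a cutoff like that comes from, here's the breakeven condition in symbols (the actual probabilities were estimated from roll data in the earlier post and aren't reproduced here): if $p$ is the chance of a pig-out on a single roll and $g$ is the expected number of points gained on a roll that doesn't pig out, then rolling with $h$ points already in your hand pays off, in expectation, whenever

$$(1 - p)\,g > p\,h, \qquad \text{i.e.,} \qquad h < \frac{(1-p)\,g}{p},$$

and plugging estimates of $p$ and $g$ into that right-hand side is one way to land on a number like 22.5.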

I wouldn't mind being passed this pig for a few minutes... that's some serious cuteness right there.

However, we also concluded that that's not an effective strategy for winning the game as a whole. If you have a score of 0 and your opponent has a score of 99, for example, it would be really silly to stop rolling at 23 points just because the "22.5 rule" says to. So what's a person to do? How do you play effectively? Today, we'll generate a strategy that can help you make an optimal move in any situation. (Hint: you'll need to do a lot of math.)

Continue reading