Finding Similar Sounding Names – Some Basics

Since my wife and I have a baby on the way, we've spent a lot of time thinking about names lately. We've pored through dozens of lists of thousands of names, we've used naming sites and other tools, we've researched name histories - everything. And we've found that most of the tools weren't terribly helpful.

Monty Python: WHAT IS YOUR NAME?

After playing around with all of those baby naming tools, I recently took a stab myself and built a website that lets you find names that sound like ones you already like. So, put in "Aubrey" and you'll get suggestions like "Aubree," "Avery," and "Audrey." The algorithms aren't perfect yet, the code is currently a massive pile of hacks to support a proof-of-concept, and I have no idea if parents-to-be will find the site useful... but it's been an interesting project to try my hand at.

For today's post, I'll simply be highlighting some of the algorithms I used to find words that sound similar, and how to implement them in SQL. (I won't get into exactly how I put them all together. Can't give away the secret sauce... at least not yet.)
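
To give a flavor of what "sounds similar" means algorithmically, here's a quick Python sketch of one of the classic approaches, a slightly simplified Soundex encoding. To be clear, this isn't the site's actual matching logic or the SQL from the post - just an illustration of the general idea.

```python
# A simplified Soundex encoding - names that "sound alike" tend to share a code.
def soundex(name):
    codes = {}
    for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in letters:
            codes[ch] = digit

    name = "".join(ch for ch in name.lower() if ch.isalpha())
    if not name:
        return ""

    encoded = [codes.get(ch, "0") for ch in name]  # vowels, h, w, y -> "0"
    result = [name[0].upper()]
    prev = encoded[0]
    for code in encoded[1:]:
        if code != "0" and code != prev:
            result.append(code)
        prev = code
    return ("".join(result) + "000")[:4]

for n in ["Aubrey", "Aubree", "Avery", "Audrey"]:
    print(n, soundex(n))
# Aubrey, Aubree, and Avery all encode to A160; Audrey comes out as A360,
# which is exactly why a decent matcher needs more than one algorithm.
```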

Continue reading

Optimizing Split Sizes for Hadoop’s CombineFileInputFormat

Many of the challenges of using Hadoop with small files are well documented. But there's one thing I haven't seen much discussion about: optimizing the maximum split size for CombineFileInputFormat and its derivatives (e.g., CombineTextInputFormat). This is actually a pretty big issue - improper configuration can cause out-of-memory errors or degraded performance.

An Elephant never forgets. But, apparently, it can run out of memory.

Thankfully, with a little bit of knowledge about your input data and your cluster, you can determine a value for this setting that will keep your jobs running along happily.
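
I won't spoil the full walkthrough here, but as a very rough illustration (and not necessarily the method from the post), the reasoning is back-of-the-envelope math: spread your total input across the mappers you can actually run, capped so no single mapper has to swallow too much at once. The helper below, the 512 MB cap, and the example numbers are all made up; only the property name is the real knob CombineFileInputFormat honors.

```python
# Hypothetical back-of-the-envelope helper - example numbers, not a recipe.
def suggest_max_split_size(total_input_bytes, usable_map_slots,
                           per_mapper_cap_bytes=512 * 1024 * 1024):
    """Spread the input across the available mappers, but cap the split size."""
    per_mapper = total_input_bytes // max(usable_map_slots, 1)
    return max(1, min(per_mapper, per_mapper_cap_bytes))

# e.g., ~2 TB of small files on a cluster with ~400 usable map slots:
split = suggest_max_split_size(2 * 1024**4, 400)
print("-D mapreduce.input.fileinputformat.split.maxsize=%d" % split)
```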

Continue reading

“Redshift View Materializer” Now on GitHub

Although Amazon Redshift is a fantastic data warehousing product, it lacks several of the features available in other database systems. One such feature is materialized views, which are just what they sound like... views that have been pre-calculated and physically stored on disk for faster querying, and which are updated on request.

No, not that kind of View.

Materialized views are very useful any time you want to have the results of complex queries readily and quickly available, and you don't mind if the data is a few hours stale. For instance, suppose you have an executive dashboard that displays your monthly gross margin for your highly successful app, which has millions of users. The query that does those calculations runs for 20 minutes (way longer than an exec is going to want to wait), but it only needs to be updated once a month. Materialized views are the perfect solution. Calculate once, cache the data, and reference the cache on-demand. Refresh when needed.

The "Redshift View Materializer", now available on GitHub, is a simple Python script that creates tables containing the results of arbitrary SQL queries on-demand. Simply set the script to run as a cron-job whenever you want your tables re-created, and you'll end up with a reasonably close approximation of materialized views. The script is in the public domain, but I'd love if you'd contribute if you make any improvements, bug fixes, or additions!

Dealing with Corrupt Files in Hadoop

Working with Hadoop a lot over the last several months, I've come to realize that it doesn't deal gracefully with corrupt files (e.g., malformed gzip files). I would throw a cluster at a couple hundred thousand files (of which one or two were bad) and the job would die two hours into execution, throwing EOFException errors all over the place. If I were only processing one file, I suppose that's a reasonably acceptable outcome. But when 99.9% of your files are fine, and the corrupt ones aren't recoverable anyway, there's no sense in blowing up the whole job just because a trivial portion of the data was bad.

Hadoop's relatively unknown LostBalance exception.

Turns out, it's not too hard to catch those exceptions within a custom record reader, log a warning message, gracefully ignore the files in question, and go about your business. Here's how to do it.
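
The full write-up covers the custom record reader itself; as a language-neutral taste of the pattern (catch, warn, skip, move on), here's the same idea applied to a local pile of gzip files in Python - explicitly not the Hadoop code from the post.

```python
# The same "catch, warn, and skip" pattern outside of Hadoop: process whatever
# gzip files are readable and log the ones that aren't.
import glob
import gzip
import logging

logging.basicConfig(level=logging.WARNING)

def readable_lines(pattern):
    for path in glob.glob(pattern):
        try:
            with gzip.open(path, "rt", errors="replace") as handle:
                for line in handle:
                    yield line
        except (EOFError, OSError) as err:  # truncated or malformed gzip
            logging.warning("Skipping corrupt file %s: %s", path, err)

lines = sum(1 for _ in readable_lines("data/*.gz"))
print("Processed %d lines from readable files" % lines)
```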

Continue reading

Amazon Redshift Performance – Bigger Clusters, or Bigger Nodes?

Last week, I looked into options for increasing the performance of an Amazon Redshift cluster running 10 dense compute nodes (dc1.large). While investigating, I noticed that a cluster of 32 dc1.large nodes (the maximum for that node type) had the same number of CPUs, the same amount of storage, and a comparable amount of RAM as a cluster of 2 dc1.8xlarge nodes (the minimum for that node type), while offering significant cost savings for anything less than a 3-year term. This got me wondering... if you want a cluster of around that size, which is the better bet for performance?
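
For reference, here's the quick arithmetic behind that observation, using the dc1 specs AWS published at the time (2 vCPU / 15 GiB / 0.16 TB SSD per dc1.large; 32 vCPU / 244 GiB / 2.56 TB SSD per dc1.8xlarge) - double-check the current documentation before leaning on these numbers.

```python
# Per-node specs as published at the time - verify against current AWS docs.
def totals(nodes, vcpu, ram_gib, storage_tb):
    return nodes * vcpu, nodes * ram_gib, nodes * storage_tb

print("32 x dc1.large:   %d vCPU, %d GiB RAM, %.2f TB SSD" % totals(32, 2, 15, 0.16))
print(" 2 x dc1.8xlarge: %d vCPU, %d GiB RAM, %.2f TB SSD" % totals(2, 32, 244, 2.56))
# 32 x dc1.large:   64 vCPU, 480 GiB RAM, 5.12 TB SSD
#  2 x dc1.8xlarge: 64 vCPU, 488 GiB RAM, 5.12 TB SSD
```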

We'll get the answer out of the way early for the tl;dr crowd.

The results of testing out three diverse queries on various-sized Redshift clusters.

The cluster of 32 dc1.large nodes provided substantially (13%, on average) faster performance for all but the shortest query in my test batch. Given the superior performance and lower cost, it was the obvious choice for a cluster of this size. Whenever more performance is called for, it will make sense to switch to a cluster of 3 dc1.8xlarge nodes...

Continue reading

Analyzing Customer Churn – Using “tmerge” to Build Data Sets

If you've ever done churn analysis using Cox regression with time-dependent covariates, you know that the hardest part of doing that type of research is building your base data set. You have to divide each customer's lifetime into "chunks" where the changing values of a host of different predictor variables apply. I've coded this in SQL before, and it gets ugly. Fast.

Merge, everybody, merge!

Thankfully, R has a solution for this that makes life significantly simpler: the "tmerge" function in the survival package. Today, we'll walk through how to use it.

Continue reading

Scraping Grocery Store Ads for Fun and Profit

If we’re honest, I imagine most of us would admit we don’t really know what a good price is on the grocery items we purchase regularly. Except for a few high-priced favorites (e.g., ribeyes and salmon) that I watch for sales, I honestly have no idea what’s a regular price and what’s a good deal. How much does a box of raisin bran cost? Whatever the grocery store charges me…

A Harris Teeter in my area. Usually, their prices are higher… except when they're not. Honestly, I don't really know.

Obviously, this is a really terrible way to manage my grocery budget – I probably spring for “deals” that are nothing of the sort all the time. So, as a data scientist, I got to thinking… what if I could keep tabs on this? Build a database of historic prices for various items, so I’d know when to pull the trigger on sales. Seems straightforward enough.

Well, I needed some data to get started. So, I figured I’d see if I could programmatically scrape prices out of the online weekly ad for my local Kroger store. In this post I’ll walk through how I got that set up… and, as this project moves along, I’ll post updates on what I do with the data.
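
As a preview of the general shape of the thing, the scraper boils down to something like the sketch below with requests and BeautifulSoup. The URL and CSS selectors here are placeholders, not the real markup of Kroger's weekly ad - that's a story for the full post.

```python
# Generic sketch - the URL and selectors are placeholders, not Kroger's markup.
import csv
import datetime

import requests
from bs4 import BeautifulSoup

AD_URL = "https://example.com/weekly-ad"   # placeholder
ITEM, NAME, PRICE = ".ad-item", ".item-name", ".item-price"  # placeholders

def scrape_ad(url=AD_URL):
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    today = datetime.date.today().isoformat()
    for item in soup.select(ITEM):
        name, price = item.select_one(NAME), item.select_one(PRICE)
        if name and price:
            yield today, name.get_text(strip=True), price.get_text(strip=True)

with open("prices.csv", "a", newline="") as out:
    csv.writer(out).writerows(scrape_ad())
```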

Continue reading

Override Hadoop’s Default Compression Codec Selections

If you're using a standard input format, Hadoop will automatically choose a codec for reading your data by examining the file extension. So, for example, if you have a file with a ".gz" extension, Hadoop will recognize that it's gzipped and load it with the gzip codec. This is all well and good... until you're trying to work with a bunch of compressed files that don't have the proper extension. Then the feature suddenly becomes a burden.

Apparently, "codec" was the name of a 1980's grocery store in France with a hideously '80s logo.

I recently found myself in just this situation, and scoured the internet looking for tips on how to override Hadoop's codec choices. I couldn't find any good resources, so I went digging in the source to build the solution myself. Hopefully, this post will save somebody else the trouble!

Continue reading

SQL Survival Curves with Redshift and Periscope

My company has a subscription-based business model, which means we spend a lot of time analyzing customer churn. We wanted to include Kaplan-Meier survival curves in some of our executive dashboards, but neither our database (Redshift) nor any of our commonly used dashboarding tools (Tableau, Periscope, etc.) provided the necessary functionality. We could, of course, have pulled data out of the warehouse, analyzed it in R or Python, and pushed it back up, but that's pretty complicated. So we went looking for a better solution.

Based on all available evidence, the survival curve for aliens singing disco music hits 0 at about 43 seconds (N = 1).

As you likely guessed from the title of this post, that better solution involved writing our own code for calculating a Kaplan-Meier estimator in SQL. In this post, I'll be walking through our strategy step-by-step, including the SQL code for calculating the estimators and making that code reusable in Periscope. Let's do this!
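
Before diving into the SQL, it helps to keep in mind what the Kaplan-Meier estimator actually computes: at each distinct churn time, multiply the running survival probability by (1 - churned / at risk). Here's that same calculation as a few lines of plain Python - a reference point, not the SQL from the post.

```python
# Reference implementation of the estimator the SQL needs to reproduce.
def kaplan_meier(observations):
    """observations: (duration, event) pairs; event=1 churned, 0 still active."""
    survival, curve = 1.0, []
    for t in sorted({d for d, _ in observations}):
        at_risk = sum(1 for d, _ in observations if d >= t)
        churned = sum(1 for d, e in observations if d == t and e == 1)
        if churned and at_risk:
            survival *= 1 - churned / at_risk
            curve.append((t, survival))
    return curve

# Five customers: churned at months 3 and 8; the rest are censored (still active).
print(kaplan_meier([(3, 1), (5, 0), (8, 1), (12, 0), (12, 0)]))
# [(3, 0.8), (8, 0.533...)]
```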

Continue reading

Render Google Maps Tiles with Mapnik and Python

If you want to take a bunch of GIS data and rasterize it as a tiled image map for public consumption, the folks at ESRI would be happy to sell you an expensive solution. Of course, as with oh-so-many projects, you can accomplish the same thing for free with open-source software. In this case, we'll use Python and a library called Mapnik to render beautiful map layers, then display them on Google Maps, just like this demo rendering of my home county!

Ready to get started? Dust off your Python skills, and let's go!
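
To set expectations, the core rendering step is tiny. Here's a minimal python-mapnik sketch that renders a single 256x256 tile from an XML stylesheet - the stylesheet path and bounding box are placeholders, and the Google-tile-to-spherical-mercator math is left out.

```python
# Render one 256x256 tile from a Mapnik XML stylesheet (placeholder paths/bbox).
import mapnik

TILE_SIZE = 256

def render_tile(stylesheet, bbox, out_path):
    m = mapnik.Map(TILE_SIZE, TILE_SIZE)
    mapnik.load_map(m, stylesheet)        # your layers and styles live in the XML
    m.zoom_to_box(mapnik.Box2d(*bbox))    # (minx, miny, maxx, maxy)
    mapnik.render_to_file(m, out_path, "png")

render_tile("style.xml", (-78.9, 35.9, -78.7, 36.1), "tile.png")
```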

Continue reading