Finding Similar Sounding Names – Some Basics

Since my wife and I have a baby on the way, we've spent a lot of time thinking about names lately. We've pored over dozens of lists of thousands of names, we've used sites and other tools, we've researched histories - everything. And we've found that most of the tools weren't terribly helpful.

Monty Python - What is your name?


After playing around with all of those baby naming tools, I recently took a stab myself and built a website that lets you find names that sound like ones you already like. So, put in "Aubrey" and you'll get suggestions like "Aubree," "Avery," and "Audrey." The algorithms aren't perfect yet, the code is currently a massive pile of hacks to support a proof-of-concept, and I have no idea if parents-to-be will find the site useful... but it's been an interesting project to try my hand at.

For today's post, I'll simply be highlighting some of the algorithms I used to find words that sound similar, and how to implement them in SQL. (I won't get into exactly how I put them all together. Can't give away the secret sauce... at least not yet.)
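To give a flavor of what a "sounds like" algorithm does, here's a minimal sketch of classic American Soundex. This is an assumption on my part for illustration: it's a well-known phonetic algorithm, not necessarily one the site uses, and it's written in Python rather than SQL so it can be run on its own. The function name soundex is just a label I picked for this example.

```python
import string

# Consonant groups that Soundex treats as sounding alike.
_GROUPS = (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
           ("l", "4"), ("mn", "5"), ("r", "6"))
_CODES = {ch: digit for letters, digit in _GROUPS for ch in letters}


def soundex(name):
    """Return the 4-character American Soundex code for a name.

    Illustrative sketch only; characters outside a-z are ignored.
    """
    letters = [c for c in name.lower() if c in string.ascii_lowercase]
    if not letters:
        return ""

    first = letters[0]
    result = first.upper()
    prev = _CODES.get(first, "")  # the first letter's code can suppress a repeat

    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w are skipped and do not separate repeated codes
        code = _CODES.get(ch, "")  # vowels (and y) have no code
        if code and code != prev:
            result += code
            if len(result) == 4:
                break
        prev = code  # a vowel resets prev, so a repeated code after it counts

    return result.ljust(4, "0")  # pad short codes with zeros


if __name__ == "__main__":
    for n in ("Aubrey", "Aubree", "Avery", "Audrey"):
        print(n, soundex(n))
```

With this sketch, "Aubrey," "Aubree," and "Avery" all land on the same code, while "Audrey" gets a different one. That's a decent hint at why a single algorithm isn't enough on its own. Several SQL dialects also ship a built-in SOUNDEX function, which is one way to get this kind of matching into a query.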


Optimizing Split Sizes for Hadoop’s CombineFileInputFormat

Many of the challenges of using Hadoop with small files are well-documented. One thing I haven't seen much discussion about, though, is optimizing the maximum split size for the CombineFileInputFormat and its derivatives (e.g., CombineTextInputFormat). It's actually a pretty big issue: improper configuration can cause out-of-memory errors or degraded performance.

An Elephant never forgets. But, apparently, it can run out of memory.

Thankfully, with a little bit of knowledge about your input data and your cluster, you can determine a value for this setting that will keep your jobs running along happily.
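As a rough illustration of the kind of arithmetic involved, here's a back-of-the-envelope sketch in Python. The heuristic is my own assumption for this example, not a definitive formula: spread the input evenly over the map tasks the cluster can run at once, round up to whole blocks, and optionally cap the result so one task doesn't take on more data than it can handle. The function and parameter names (suggest_max_split_bytes, concurrent_map_tasks, cap_bytes) are hypothetical.

```python
MIB = 1024 * 1024


def _ceil_div(a, b):
    """Integer ceiling division, avoiding float rounding."""
    return -(-a // b)


def suggest_max_split_bytes(total_input_bytes, concurrent_map_tasks,
                            block_bytes=128 * MIB, cap_bytes=None):
    """Suggest a maximum split size in bytes (hypothetical heuristic).

    total_input_bytes:    combined size of all the small input files
    concurrent_map_tasks: map tasks the cluster can run at the same time
    block_bytes:          HDFS block size (128 MiB assumed here)
    cap_bytes:            optional upper bound per task, e.g. for memory
    """
    tasks = max(1, concurrent_map_tasks)
    per_task = _ceil_div(total_input_bytes, tasks)

    # Round up to a whole number of blocks, with a floor of one block.
    blocks = max(1, _ceil_div(per_task, block_bytes))
    size = blocks * block_bytes

    if cap_bytes is not None:
        size = min(size, cap_bytes)
    return size


if __name__ == "__main__":
    # Example: 100 GiB of small files on a cluster running 200 maps at once.
    print(suggest_max_split_bytes(100 * 1024 * MIB, 200))
```

The resulting number is what you'd feed to the job's maximum split size setting (the mapreduce.input.fileinputformat.split.maxsize property in Hadoop 2 and later). Treat it as a starting point to test against your own data, not a guarantee.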
