Since my wife and I have a baby on the way, we've spent a lot of time thinking about names lately. We've poured through dozens of lists of thousands of names, we've used sites and other tools, we've researched histories - everything. And we've found that most of the tools weren't terribly helpful.
After playing around with all of those baby naming tools, I recently took a stab myself and built a website that lets you find names that sound like ones you already like. So, put in "Aubrey" and you'll get suggestions like "Aubree," "Avery," and "Audrey." The algorithms aren't perfect yet, the code is currently a massive pile of hacks to support a proof-of-concept, and I have no idea if parents-to-be will find the site useful... but it's been an interesting project to try my hand at.
For today's post, I'll simply be highlighting some of the algorithms I used to find words that sound similar, and how to implement them in SQL. (I won't get into exactly how I put them all together. Can't give away the secret sauce... at least not yet.)
Setting Postgres up...
In order to use the algorithms I'm going to discuss in SQL, you'll need to load the "fuzzystrmatch" Postgres plugin, which is super easy. Check it:
This extension contains a lot of additional code for various string matching utilities. You can view the fuzzystrmatch docs here.
The soundex algorithm was developed in the early 1900's, and has been used primarily for identifying similar sounding last names for the census or national archives. It converts a name into a four digit code consisting of its first letter followed by three digits which each represent a family of consonant sounds. For example, since D and T sound relatively similar, they are both coded as a 3. When encoded in the soundex, lots of common name variations end up producing the same code. So, Smith and Smythe both become S530, for example.
Using Postgres, you can use the soundex() function to get the soundex of a given string. (Imagine that!) You can also use the poorly-named difference() function to find the number of bytes that are shared between the soundexes for two strings. So, for example Drake (D620) and Blake (B420) have a difference of 2 because their soundex codes both share two digits.
Check it out in code:
Of course, the soundex has some notable limitations... It completely ignores first letters that may be pronounced the same. (Kristina and Christina, anyone?) Furthermore, it has no concept of silent consonants... Hayleigh codes to H420, as if the G (coded as 2) was pronounced. This is particularly problematic when it comes to non-English names.
Metaphone and Double Metaphone
The Metaphone algorithm was released in 1990 by computer programmer Lawrence Phillips as a way to tackle many of the limitations of the soundex. Its handling of silent letters, letters with similar pronunciations, and foreign words all make it a substantial improvement over its predecessor. Moreover, the metaphone can also produce codes of any length for words that contain more than 4 consonant sounds. The 'double metaphone,' released in 2000, contains further improvements and produces alternate pronunciations for words that might have different pronunciations.
The metaphone(), dmetaphone(), and dmetaphone_alt() functions in Postgres allow for calculations of metaphone encodings. We'll illustrate in code, showing a couple improvements over the soundex...
It's still not a perfect system, but it's a big step up over soundex!
Levenshtein Edit Distance
The Levenshtein edit distance actually has very little to do with how words sound. Rather, it's simply a representation of how many operations (single-letter insertions, deletions, or substitutions) it would take to convert one string into another. So, cat becomes bat in 1 operation (substitute B for C), yielding a Levenshtein edit distance of 1. My intuition was that words that are spelled more similarly are, on average, likely to be pronounced more similarly. But, more importantly, words whose metaphone encodings have a small Levenshtein edit distance are more likely to sound similar.
Some illustrations in code...
Again, not a perfect solution, but we're definitely starting to get at name similarity in a pretty big way!
For me, next steps are figuring out how to make even better suggestions about name similarity! I'll be looking into CMU's phoneme tools, as well as other systems for mapping names to pronunciations. Notably, soundex and metaphone only focus on consonants, whereas phonetic systems using the "ARPAbet" or the International Phonetic Alphabet focus on consonants and vowels. I'll be sure to follow up on this post as the project moves forward!
Of course, I'll also be cleaning up the code and the site a little bit along the way!