Naive Bayes is an extraordinarily versatile algorithm for categorizing things… it can separate fraudulent from non-fraudulent credit card purchases, spam from legitimate email, or dinosaurs from fictional monsters. Its most frequent application is in “bag of words” models… statistical models that classify a block of text based on the words that appear within it.
In this post, we’ll explore the basics of Naive Bayes classification, and we’ll illustrate using a relatively common problem – assigning genders to a group of people when all you have is their names. And, just for kicks and giggles, we’re going to do it in SQL. Because nobody trains models in SQL.
The Naive Bayes algorithm
Naive Bayes text classifiers are based on the idea that the probability of a block of text falling into a given category is related to how frequently each of its words has appeared in other texts in that category. So, in the world of email classification, it might work something like this: the word “viagra” is far more likely to appear in spam than in legitimate email, so if an email contains “viagra,” that’s pretty good evidence that it’s spam.
Of course, emails have more than one word, so you simply end up multiplying the probability of each word appearing in a given category together to create a “score.” If the spam score is greater than the legit score, you call the email spam. Otherwise, you call it legit. Easy!
If you know your statistics, you’ll recognize that this is just an application of Bayes’ theorem that makes a bunch of ridiculous assumptions about the world (hence “Bayes” and “naive”). If you’re interested in learning more, check out yhat’s summary of the principles behind Naive Bayes. Or, you can always hit up the Wikipedia articles on Bayes’ theorem and Naive Bayes.
Getting some data
Of course, if we’re going to create a classifier, we’re going to need some training data. Luckily, the State of North Carolina has just the thing! Voter rolls in the state are public record and available via FTP. They contain name, gender, and other information on millions of people. Of course, it’s not a perfectly representative sample (North Carolina’s Asian population is relatively small, for example), but it’s still plenty of data to cover commonly used names.
A little bit of data manipulation magic let me extract just the fields necessary from the massive file I got from the FTP site. I kept full name (including prefix and suffix), race, ethnicity, party, gender, and age. Several of those were not strictly necessary for this project, but I wanted to see how well the classifier could do at assigning people to political party or race based on their name. I won’t cover it here, but it works far better than you’d expect – try it out on your own if you’d like.
Here’s a preview of the data we’re working with. Usually, I’d provide a download, but it’s a really large file and I want to be kind to my web host… It’s not that hard to parse the voter file down to this yourself…
Now, we’ll read the data into an Amazon Redshift database.
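Here’s a sketch of what that load might look like. The table layout, bucket path, and IAM role below are placeholders I’ve made up – swap in your own.

```sql
-- Raw voter extract; the column list matches the fields described above
CREATE TABLE voters_raw (
  name_prefix  VARCHAR(10),
  first_name   VARCHAR(30),
  middle_name  VARCHAR(30),
  last_name    VARCHAR(40),
  name_suffix  VARCHAR(10),
  race         VARCHAR(10),
  ethnicity    VARCHAR(10),
  party        VARCHAR(10),
  gender       VARCHAR(1),
  age          INTEGER
);

-- Redshift's COPY command bulk-loads delimited files straight from S3
COPY voters_raw
FROM 's3://your-bucket/nc-voter-extract.txt'
IAM_ROLE 'arn:aws:iam::123456789012:role/YourRedshiftRole'
DELIMITER '\t';
```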
We’ll create a table of voters for which we have full data (first name, middle name, gender). We’ll also tack on an extra random variable for splitting the data into training and testing sets.
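Something along these lines – a sketch, assuming the raw load landed in a table called voters_raw:

```sql
-- Keep only records with complete name and gender information, and
-- attach a uniform random number for the train/test split later on
CREATE TABLE voters AS
SELECT first_name,
       middle_name,
       name_suffix,
       gender,
       RANDOM() AS split_var
FROM voters_raw
WHERE first_name IS NOT NULL
  AND middle_name IS NOT NULL
  AND gender IS NOT NULL;
```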
Finally, we’ll split the data into a “test” and “train” table using the random variable we created earlier.
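For example – the 80/20 proportion here is a judgment call, not gospel:

```sql
-- An 80/20 train/test split using the random variable created earlier
CREATE TABLE train AS
SELECT * FROM voters WHERE split_var < 0.8;

CREATE TABLE test AS
SELECT * FROM voters WHERE split_var >= 0.8;
```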
Building our classifier
Alright, now that we have some data, let’s build a classifier. Remember, the idea here is that the probability of a person being a given gender is related to how likely each part of their name is to come up for that gender. This means that we can use information on first and middle name to make some good determinations – “Sam” might be male or female, but “Sam Marie Smith” is almost certainly female. We’ll also include information about name suffixes… “Jr” and “Sr” usually indicate that somebody’s male. In fact, I can’t think of any name suffixes for women, but that’s why we’re using data and not my intuition.
One small note about uncommon names: our algorithm specifies that we should multiply the probabilities of each part of a name appearing for a given gender together to create a “score” for that gender. But some uncommon names will appear in the test data and not the training data. If our training data doesn’t contain the name “Ihalewf,” but we have an “Ihalewf Daniel Jones” in our test data, does the score for each gender become 0 because we multiply by 0 when calculating the score for each gender? That would be preposterous! Instead, we’ll pretend in our score calculations that we’ve seen every name at least once. And, to keep it fair, we’ll add one extra occurrence to every name we saw in the training data. (If you know your statistics, this is just add-one, or Laplace, smoothing.)
Another note about uncommon names and floating point numbers (how your computer stores decimals). If somebody’s name was something as ridiculously uncommon as (and I’m picking randomly) “Dayne Drake Batten,” you’d multiply the probabilities for their first and middle name together and you’d get a number so small your computer would have a hard time representing it accurately. (Maybe it wouldn’t actually happen with my name, but at some point the numbers would get too small for the computer to work with.) Thankfully, we can solve this problem by adding together natural logs of probabilities to create our scores (instead of multiplying the probabilities themselves). The result will be the same, because math – the log of a product is the sum of the logs, and taking logs preserves which score is bigger.
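To see why, compare the two approaches on a few made-up probabilities:

```sql
-- Multiplying small probabilities shrinks toward zero fast:
-- 0.0001 * 0.00002 * 0.003 = 6e-12, and it only gets smaller with
-- every additional name part. Summing natural logs instead stays in
-- a comfortable numeric range, and since LN() is monotonic, whichever
-- gender wins on the product also wins on the sum of logs.
SELECT LN(0.0001) + LN(0.00002) + LN(0.003) AS log_score;
```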
OK. Sorry for the diversion, but those were some important details… you’ll see them reflected in the code below. Now, back to the coding! Let’s calculate the likelihood of each first name occurring for each gender. (Note that this is Marys / women, not female Marys / total Marys!)
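Something like the following – a sketch, with table and column names that are my own assumptions:

```sql
-- For each (first_name, gender) pair: count occurrences, add one for
-- smoothing, and divide by the total number of people of that gender
CREATE TABLE first_name_probs AS
SELECT t.first_name,
       t.gender,
       (COUNT(*) + 1.0) / g.total AS prob
FROM train t
JOIN (SELECT gender, COUNT(*) AS total
      FROM train
      GROUP BY gender) g
  ON g.gender = t.gender
GROUP BY t.first_name, t.gender, g.total;
```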
I won’t post the code here, but we’ll do the same thing for middle name and name suffix. Then we’re all trained and ready to go!
Trying it out
Let’s go ahead and try it out on the test data! Get excited!
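A sketch of the scoring query – it assumes hypothetical smoothed probability tables first_name_probs(first_name, gender, prob) and middle_name_probs(middle_name, gender, prob); the suffix table would join in exactly the same way:

```sql
-- Score each test record for each gender by summing log probabilities,
-- falling back to the smoothed 1/total for names never seen in training
SELECT AVG(CASE WHEN predicted = gender THEN 1.0 ELSE 0.0 END) AS match_rate
FROM (
  SELECT t.gender,
         CASE WHEN
               COALESCE(LN(f_f.prob), LN(1.0 / gt.f_total))
             + COALESCE(LN(m_f.prob), LN(1.0 / gt.f_total))
             >
               COALESCE(LN(f_m.prob), LN(1.0 / gt.m_total))
             + COALESCE(LN(m_m.prob), LN(1.0 / gt.m_total))
         THEN 'F' ELSE 'M' END AS predicted
  FROM test t
  CROSS JOIN (SELECT SUM(CASE WHEN gender = 'F' THEN 1 ELSE 0 END) AS f_total,
                     SUM(CASE WHEN gender = 'M' THEN 1 ELSE 0 END) AS m_total
              FROM train) gt
  LEFT JOIN first_name_probs  f_f
    ON f_f.first_name = t.first_name AND f_f.gender = 'F'
  LEFT JOIN first_name_probs  f_m
    ON f_m.first_name = t.first_name AND f_m.gender = 'M'
  LEFT JOIN middle_name_probs m_f
    ON m_f.middle_name = t.middle_name AND m_f.gender = 'F'
  LEFT JOIN middle_name_probs m_m
    ON m_m.middle_name = t.middle_name AND m_m.gender = 'M'
) scored;
```

Note that ties break to “M” here purely because of how the CASE is written; it barely matters in practice.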
As you’ll see if you actually try this, our algorithm gets a match rate of about 98.6% on the test data. Not too shabby!
I think there are two key takeaways here. First, with a sufficient training data set, Naive Bayes can assign genders based on name alone with a very high degree of accuracy. Second, SQL is a more capable language than most people would think when it comes to these sorts of operations. It doesn’t have the data science libraries of, say, SAS, R, or Python, but it’s designed to operate incredibly quickly on gobs of data. And, if you’re a decent programmer who can understand math, it’s not too hard to implement basic statistical models. In fact, I’ve actually implemented linear regression in SQL in the past – maybe I’ll write a post about that later!
If I got you interested earlier… Naive Bayes can pick somebody’s political party in North Carolina with 64% accuracy – which is 6 percentage points better than you’d do by making an educated guess using the Democrat/Republican split in the state. It’s not a lot, but remember, the only thing we’re going on here is name. Race is harder (given that there are more than two common categories), but the results are still better than guessing.
What’s in a name? A lot, apparently.