Naive Bayes is an extraordinarily diverse algorithm for categorizing things… it can separate fraudulent from non-fraudulent credit card purchases, spam from legitimate email, or dinosaurs from fictional monsters. It’s most frequent application is in “bag of words” models… statistical models that classify a block of text based on the words that appear within it.

In this post, we’ll explore the basics of Naive Bayes classification, and we’ll illustrate using a relatively common problem – assigning genders to a group of people when all you have is their names. And, just for kicks and giggles, we’re going to do it in SQL. Because nobody trains models in SQL.

## The Naive Bayes algorithm

Naive Bayes text classifiers are based on the idea that the probability of a block of text falling into a given category is related to the historic density of each word appearing in other texts in that category. So, in the world of email classification, it might work something like this: the word “viagra” is far more likely to appear in spam than in legitimate email, so if an email contains “viagra,” that’s pretty good evidence that it’s spam.

Of course, emails have more than one word, so you simply end up multiplying the probability of each word appearing in a given category together to create a “score.” If the spam score is greater than the legit score, you call the email spam. Otherwise, you call it legit. Easy!

If you know your statistics, you realize that this is just an application of Bayes’ theorem that makes a bunch of ridiculous assumptions about the world (hence “Bayes” and “naive”). If you’re interested in learning more, check out yhat’s summary of the principles behind Naive Bayes. Or, you can always hit up the Wikipedia articles on Bayes’ theorem and Naive Bayes.

## Getting some data

Of course, if we’re going to create a classifier, we’re going to need some training data. Luckily, the State of North Carolina has just the thing! Voter rolls in the state are public record and available via FTP. They contain name, gender, and other information on millions of people. Of course, it’s not perfect (maybe NC’s population of Asians is really small, for example), but it’s still plenty of data to cover commonly used names.

A little bit of data manipulation magic let me extract just the fields necessary from the massive file I got from the FTP site. I kept full name (including prefix and suffix), race, ethnicity, party, gender, and age. Several of those were not strictly necessary for this project, but I wanted to see how well the classifier could do at assigning people to political party or race based on their name. I won’t cover it here, but it works far better than you’d expect – try it out on your own if you’d like.

Here’s a preview of the data we’re working with. Usually, I’d provide a download, but it’s a really large file and I want to be kind to my web host… It’s not that hard to parse the voter file down to this yourself…

Now, we’ll read the data into an Amazon Redshift database.

We’ll create a table of voters for which we have full data (first name, middle name, gender). We’ll also tack on an extra random variable for splitting the data into training and testing sets.

Finally, we’ll split the data into a “test” and “train” table using the random variable we created earlier.

## Building our classifier

Alright, now that we have some data, let’s build a classifier. Remember, the idea here is that the probability of a person being a given gender is related to how likely each part of their name is to come up for that gender. This means that we can use information on first and middle name to make some good determinations – “Sam” might be male or female, but “Sam Marie Smith” is almost certainly female. We’ll also include information about name suffixes… “Jr” and “Sr” usually indicate that somebody’s male. In fact, I can’t think of any name suffixes for women, but that’s why we’re using data and not my intuition.

One small note about uncommon names: our algorithm specifies that we should multiply the probabilities of each part of a name appearing for a given gender together to create a “score” for that gender. But some uncommon names will appear in the test data, but not the train data. If our training data doesn’t contain the name “Ihalewf,” but we have an “Ihalewf Daniel Jones” in our test data, does the score for each gender become 0 because we multiply by 0 when calculating the score for each gender? That would be preposterous! Instead, we’ll pretend in our score calculations that we’ve seen all names at least once. And, to keep it fair, we’ll add one extra occurrence to every name we saw in the train data.

Another note about uncommon names and floating point numbers (how your computer stores decimals). If somebody’s name was something as ridiculously uncommon as (and I’m picking randomly) “Dayne Drake Batten,” you’d multiply the probabilities for their first and middle name together and you’d get a number so small your computer would have a hard time representing it accurately. (Maybe it wouldn’t actually happen with my name, but at some point the numbers would get too small for the computer to work with.) Thankfully, we can solve this problem by adding together natural logs of probabilities to create our scores (instead of multiplying the scores themselves). The result will be the same, because math.

OK. Sorry for the diversion, but those were some important details… you’ll see them reflected in the code below. Now, back to the coding! Let’s calculate the likelihood of each first name occurring for each gender. (Note that this is Marys / women, not female Marys / total Marys!)

I won’t post the code here, but we’ll do the same thing for middle name and name suffix. Then we’re all trained and ready to go!

## Trying it out

Let’s go ahead and try it out on the test data! Get excited!

As you’ll see if you actually try this, our algorithm gets a match rate of about 98.6% on the test data. Not too shabby!

## Conclusion

I think there’s two key takeaways here. First, with a sufficient training data set, Naive Bayes can assign genders based on name alone with a very high degree of accuracy. Second, SQL is a more capable language than most people would think when it comes to these sorts of operations. It doesn’t have the data science libraries of, say, SAS, R, or Python, but it’s designed to operate incredibly quickly on gobs of data. And, if you are a decent programmer who can understand math, it’s not too hard to implement basic statistical models. In fact, I’ve actually implemented linear regression in SQL in the past – maybe I’ll write a post about that later!

If I got you interested earlier… Naive Bayes can pick somebody’s political party in North Carolina with 64% accuracy – which is 6 percentage points better than you’d do by making an educated guess using the Democrat/Republican split in the state. It’s not a lot, but remember, the only thing we’re going on here is name. Race is harder (given that there’s more than 2 common categories), but the results are still better than guessing.

What’s in a name? A lot, apparently.

RastislavSeptember 30, 2015 / 5:55 amdear sir,

thank you for this post, it is very interesting. I have a question regarding your code:

why in the code under ‘Building our classifier’ do you divide by max(num) when calculating male/female score?

As far as I noticed it means that if we have in train data more of gender female, than max(num)=number of female, or opposite if we have in train data more of gender male, than max(num)=number of male.

Since join is on gender, I would expect the division by num of given sex type and in case there is a name for both male and female keep the max score of male_score and female_score to avoid duplicity.

Thank you for answer and again for this blog!

best regards,

daynebattenOctober 5, 2015 / 4:35 amSounds like you’ve got it down. We are calculating the chance that a given name belongs to a particular gender, so we are dividing the number of occurrences of that name/gender combination by the total number of people in that gender. The max is just because the join will return the total count of the gender on every row and we just need one value rather than one per row…

Also, sorry for the slow response… I’ve been traveling the last couple of weeks.

GuayasenApril 21, 2016 / 10:46 amWe already tested that algorithm, and we are really happy to say that it works perfectly. Thank u very much, it was very useful!

We got a database with 6475 users with fullname and sex informed, and we found 51 errors (for example, some Laura with “male” informed) . Thanks

daynebattenApril 21, 2016 / 2:23 pmGlad it was helpful! Thanks for dropping a line. Always good to know folks are finding this stuff useful.