Scraping Grocery Store Ads for Fun and Profit

If we’re honest, I imagine most of us would admit we don’t really know what a good price is on the grocery items we purchase regularly. Except for a few high-priced favorites (e.g., ribeyes and salmon) that I watch for sales, I honestly have no idea what’s a regular price and what’s a good deal. How much does a box of raisin bran cost? Whatever the grocery store charges me…

A local Harris Teeter in my area. Usually, their prices are higher… except when they’re not. Honestly, I don’t really know.

Obviously, this is a really terrible way to manage my grocery budget – I probably spring for “deals” that are nothing of the sort all the time. So, as a data scientist, I got to thinking… what if I could keep tabs on this? Build a database of historic prices for various items, so I’d know when to pull the trigger on sales. Seems straightforward enough.

Well, I needed some data to get started. So, I figured I’d see if I could programmatically scrape prices out of the online weekly ad for my local Kroger store. In this post I’ll walk through how I got that set up… and, as this project moves along, I’ll post updates on what I do with the data.

Data Show Stores Decorating for Christmas Earlier Each Year

Seems like every year around this time, I hear folks complaining that corporate America is decorating for Christmas way too early. Gone are the days when Christmas was reserved for December - now we're lucky to make it through Halloween without seeing wreaths and Christmas trees everywhere. But is any of it true? Are stores really decorating earlier than they used to? I decided to find out.

Of course, there's not really any available data on when stores around the country start decorating for Christmas (at least to my knowledge). So answering the question required some creativity. I started thinking - if there was one thing that represented corporate America's Christmas decorating traditions, what would it be? The Rockefeller Center Christmas Tree, of course!

The Christmas tree in Rockefeller Center.

First erected in 1933, the Rockefeller Center tree has been set up every year since and has become an unofficial start of the Christmas season for New Yorkers. And, since Rockefeller Center is without a doubt a bastion of American corporatism, the tree gives us a relatively good proxy for measuring the start of the corporate Christmas decorating season. If the tree at Rockefeller Center has been going up earlier every year, there's a good chance that's indicative of a larger national trend.

Override Hadoop’s Default Compression Codec Selections

If you're using a standard input format, Hadoop will automatically choose a codec for reading your data by examining the file extension. So, for example, if you have a file with a ".gz" extension, Hadoop will recognize that it's gzipped and load it with the gzip codec. This is all well and good... until you're trying to work with a bunch of compressed files that don't have the proper extension. Then the feature suddenly becomes a burden.
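To make the failure mode concrete, here is a minimal Python sketch of extension-based codec selection. It is not Hadoop code and not the solution from this post; it only imitates the idea of looking up a codec by file suffix using the standard library. The names `OPENERS_BY_SUFFIX`, `pick_opener`, `read_bytes`, and the `forced_suffix` parameter are all hypothetical, made up for illustration.

```python
import bz2
import gzip
import os
import tempfile

# Hypothetical suffix-to-opener table, standing in for a codec registry.
OPENERS_BY_SUFFIX = {
    ".gz": gzip.open,
    ".bz2": bz2.open,
}


def pick_opener(path, forced_suffix=None):
    """Choose an opener from the file suffix, or from forced_suffix if given."""
    suffix = forced_suffix or os.path.splitext(path)[1]
    # Unknown or missing suffixes fall back to plain open (no decompression).
    return OPENERS_BY_SUFFIX.get(suffix, open)


def read_bytes(path, forced_suffix=None):
    """Read the whole file through whichever opener was selected."""
    opener = pick_opener(path, forced_suffix)
    with opener(path, "rb") as handle:
        return handle.read()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        # A gzipped file that lacks the ".gz" extension.
        path = os.path.join(workdir, "part-00000")
        with gzip.open(path, "wb") as handle:
            handle.write(b"raisin bran\n")

        # Suffix lookup finds nothing, so we get raw compressed bytes back.
        print(read_bytes(path) == b"raisin bran\n")
        # Forcing the codec recovers the original content.
        print(read_bytes(path, forced_suffix=".gz") == b"raisin bran\n")
```

The point of the sketch is the second call: once the codec choice can be supplied explicitly instead of inferred from the name, a missing or wrong extension stops being a problem.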

Apparently, "Codec" was the name of a 1980s grocery store in France with a hideously '80s logo.

I recently found myself in just this situation, and scoured the internet looking for tips on how to override Hadoop's codec choices. I couldn't find any good resources, so I went digging in the source to build the solution myself. Hopefully, this post will save somebody else the trouble!