Scraping Grocery Store Ads for Fun and Profit

If we’re honest, I imagine most of us would admit we don’t really know what a good price is on the grocery items we purchase regularly. Except for a few high-priced favorites (e.g., ribeyes and salmon) that I watch for sales, I honestly have no idea what’s a regular price and what’s a good deal. How much does a box of raisin bran cost? Whatever the grocery store charges me…

A local Harris Teeter in my area. Usually, their prices are higher… except when they’re not. Honestly, I don’t really know.

Obviously, this is a really terrible way to manage my grocery budget – I probably spring for “deals” that are nothing of the sort all the time. So, as a data scientist, I got to thinking… what if I could keep tabs on this? Build a database of historic prices for various items, so I’d know when to pull the trigger on sales. Seems straightforward enough.

Well, I needed some data to get started. So, I figured I’d see if I could programmatically scrape prices out of the online weekly ad for my local Kroger store. In this post I’ll walk through how I got that set up… and, as this project moves along, I’ll post updates on what I do with the data.
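To make the price-history idea concrete, here is a minimal Python sketch. It is not the approach from the full post: the table layout, the function names, and the 15% discount threshold are all assumptions invented for illustration, and it simply flags a price as a deal when it falls well below the median of what has been recorded so far.

```python
import sqlite3
from statistics import median


def make_db():
    """Create an in-memory price table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE prices (item TEXT, week TEXT, price REAL)")
    return conn


def record_price(conn, item, week, price):
    """Store one observed price for an item in a given week."""
    conn.execute("INSERT INTO prices VALUES (?, ?, ?)", (item, week, price))


def is_good_deal(conn, item, price, discount=0.15):
    """Return True if `price` is at least `discount` below the item's
    median recorded price. The 15% default is an arbitrary assumption."""
    rows = conn.execute(
        "SELECT price FROM prices WHERE item = ?", (item,)
    ).fetchall()
    if not rows:
        # No history yet, so there is nothing to compare against.
        return False
    typical = median(p for (p,) in rows)
    return price <= typical * (1 - discount)


conn = make_db()
# Made-up sample prices, purely for demonstration.
for week, price in [("w1", 3.99), ("w2", 4.29), ("w3", 3.49), ("w4", 4.19)]:
    record_price(conn, "raisin bran", week, price)

deal = is_good_deal(conn, "raisin bran", 2.99)
not_deal = is_good_deal(conn, "raisin bran", 3.99)
```

The median is used rather than the mean so that a single deep sale in the history does not drag down the notion of a "regular" price.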


Send Google Analytics Data to Your Own Server

In last week’s post, we explored how to tag individual users and hits with unique identifiers in Google Analytics, so that an analyst could export raw data from the Google Analytics API for complex statistical analyses not possible in the GA interface. But there are undoubtedly some situations in which even that solution isn’t good enough – Google limits the number of metrics and dimensions you can download in a single query, for example. What do you do then?

Luckily, there’s a solution for this. We’ll just send Google Analytics data on a little detour from the user’s browser to our own web server, process it ourselves, and query it to our hearts’ content!

The 1945 movie, Detour, starring Tom Neal and Ann Savage.

The methodology I’ll summarize today allows an organization to leverage much of the value-add of Google Analytics (for instance, they’ve already done all the hard work of detecting JavaScript, Flash, screen size, page, URL, etc.) while still processing the data on its own servers. It’s a massive win-win.


Random Forest Classifiers as a Web Service in PHP

Recently, I found myself wanting to be able to make real-time, online predictions using a random forest classifier trained in R. Of course, there are many ways to make that happen – I could have used yhat’s ScienceOps product, for example. But, for project-specific reasons, I decided that the best route to go in this case was to get my hands dirty and build my own RESTful API for making predictions using my model.

Apparently, back in 2011, Disney debuted a show called “So Random.” Thankfully, it only ran for a single season…

In this post, we’ll walk through all of the code necessary to export a random forest classifier from R and use it to make real-time online predictions in a PHP script.
