Scraping Grocery Store Ads for Fun and Profit

If we’re honest, I imagine most of us would admit we don’t really know what a good price is on the grocery items we purchase regularly. Except for a few high-priced favorites (e.g., ribeyes and salmon) that I watch for sales, I honestly have no idea what’s a regular price and what’s a good deal. How much does a box of raisin bran cost? Whatever the grocery store charges me…

A local Harris Teeter in my area. Usually, their prices are higher… except when they’re not. Honestly, I don’t really know.

Obviously, this is a really terrible way to manage my grocery budget – I probably spring for “deals” that are nothing of the sort all the time. So, as a data scientist, I got to thinking… what if I could keep tabs on this? Build a database of historic prices for various items, so I’d know when to pull the trigger on sales. Seems straightforward enough.

Well, I needed some data to get started. So, I figured I’d see if I could programmatically scrape prices out of the online weekly ad for my local Kroger store. In this post I’ll walk through how I got that set up… and, as this project moves along, I’ll post updates on what I do with the data.

Exploring the data format

The first thing I did was to start using the Chrome developer tools to monitor HTTP traffic to the Kroger website when I loaded the weekly ad for my local store (Kroger is my go-to grocery store). Turns out, the weekly ad is stored at a URL that looks something like https://wklyads-krogermidatlantic.kroger.com/flyers/krogermidatlantic-weekly?type=2&store_code=00342&chrome=broadsheet&flyer_run_id=##### where 00342 is the store ID of my local Kroger and ##### is the ID for the current “run” (whatever that means) of fliers being distributed by Kroger.

Loading that page turned up something nice and convenient – some JavaScript code that defines an object literal containing all the flyer data. The code started with window['flyerData'] = and then just listed out one massive JavaScript object with everything I needed!
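
Just to illustrate, here's a minimal PHP sketch of pulling that object literal out of the page source. The regex assumes the assignment ends with "};", and the run ID here is a placeholder (finding the real one is covered below):

    <?php

    // Placeholder run ID; see below for how to look up the real one.
    $flyerRunId = '12345';

    // Fetch the weekly ad page for my store (URL structure described above).
    $adUrl = 'https://wklyads-krogermidatlantic.kroger.com/flyers/krogermidatlantic-weekly'
           . '?type=2&store_code=00342&chrome=broadsheet&flyer_run_id=' . $flyerRunId;
    $html = file_get_contents($adUrl);

    // Grab the object literal assigned to window['flyerData'].
    // Assumption: the assignment is terminated by "};".
    if (preg_match("/window\['flyerData'\]\s*=\s*(\{.*?\});/s", $html, $matches)) {
        $flyerJson = $matches[1];
    }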

I further discovered that it didn’t really matter what I put for the flyer run ID – “asdf” worked just fine. Nevertheless, I didn’t want to assume that this would always hold true, so I looked for a way to locate the ID. Turns out, loading https://wklyads-krogermidatlantic.kroger.com/flyers/krogermidatlantic?type=2&store_code=00342&chrome=broadsheet solved the problem. It’s got a handy little JavaScript object defined at window['hostedStack'] = that lists out all of the currently available fliers, their flyer types (weekly vs. seasonal), and their ad run IDs. Awesome.
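
Here's a PHP sketch of that lookup. I'm assuming hostedStack is a JSON array whose entries have keys along the lines of "name" and "flyer_run_id" (those names are guesses, so check the actual object before relying on them):

    <?php

    // Fetch the ad listing page for the store and pull out window['hostedStack'].
    $listUrl = 'https://wklyads-krogermidatlantic.kroger.com/flyers/krogermidatlantic'
             . '?type=2&store_code=00342&chrome=broadsheet';
    $listHtml = file_get_contents($listUrl);

    $weeklyRunId = null;
    if (preg_match("/window\['hostedStack'\]\s*=\s*(\[.*?\]);/s", $listHtml, $matches)) {
        foreach (json_decode($matches[1], true) as $flyer) {
            // Assumed key names: "name" and "flyer_run_id".
            if (stripos($flyer['name'], 'weekly') !== false) {
                $weeklyRunId = $flyer['flyer_run_id'];
                break;
            }
        }
    }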

Kroger usually has a weekly ad and a seasonal ad available at any given time…

I had one final stroke of luck… There are important differences between JavaScript objects and JSON, but in this case, Kroger’s objects just happened to be formatted like valid JSON. (For example, all of the keys were enclosed in double quotes, even though a JavaScript object literal only requires quoting keys that aren’t valid identifiers, like keys containing hyphens or spaces.) This meant that I could simply parse all of this data as JSON and transform it into whatever structure I wanted.
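
In PHP terms, json_decode() handles the extracted string directly. Picking up the $flyerJson variable from the sketch above:

    <?php

    // Parse the extracted flyer data as JSON (true => associative arrays).
    $flyerData = json_decode($flyerJson, true);

    if (json_last_error() !== JSON_ERROR_NONE) {
        die('Flyer data was not valid JSON after all: ' . json_last_error_msg());
    }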

Scripting it up in PHP

I know PHP isn’t as sexy as, say, Python right now, but I’m very familiar with it, and it actually makes a remarkably good scripting language. So, I coded things up! This is pretty self-explanatory and well-commented, so I won’t go into too much detail here… Basically, this code hits the main ad page for my local Kroger to get the list of ads, finds the ID for the ad run containing the word “weekly” in the name, queries for that data, grabs data from relevant fields, and writes it all out to a CSV.

Check it out!
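
For a flavor of that last step, here's a rough sketch of the CSV-writing end of things, picking up from the $flyerData parsed above. The "items" key and the per-item field names are assumptions about the flyer JSON, not its actual schema:

    <?php

    // Write the fields we care about to a dated CSV file.
    // "items", "name", "brand", "current_price", and "valid_to" are placeholder
    // key names; substitute whatever the real flyer JSON uses.
    $out = fopen('kroger-weekly-' . date('Y-m-d') . '.csv', 'w');
    fputcsv($out, array('name', 'brand', 'current_price', 'valid_to'));

    foreach ($flyerData['items'] as $item) {
        fputcsv($out, array(
            isset($item['name']) ? $item['name'] : '',
            isset($item['brand']) ? $item['brand'] : '',
            isset($item['current_price']) ? $item['current_price'] : '',
            isset($item['valid_to']) ? $item['valid_to'] : '',
        ));
    }

    fclose($out);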

Conclusion

This is probably only the first step in an ongoing project. Now that I can grab this data, I’ll need to set up a weekly cron job to pull it and pop it into a database somewhere. Then, after that goes on for a while, I’ll need to actually do something with the data!

Stay tuned to see where this goes!

10 Responses

  1. Syd March 31, 2016 / 12:20 am

    I’ve been struggling to find a large list of brand names of grocery store items for a project I’m working on. This code is interesting. Scraping the grocery stores themselves!

    Any clue how to get such a list of brand-names?

    • daynebatten April 4, 2016 / 10:22 am

      Sounds interesting… You could certainly use this kind of scraping to get there, but I’ll go ahead and do you one better… Here’s a unique list of every brand name I’ve scraped so far. It would need a touch of clean-up (UTF characters got screwed up somewhere along the line), but hopefully it’s a good starting point. https://gist.github.com/daynebatten/885e81e841c2dcae29e41439a37cc2d8

      What’s the project you’re working on?

      (Also, sorry for the slow reply, I was out of town. Hopefully this is still useful to you.)

  2. Randy Corke April 27, 2016 / 10:07 am

    Hi Dayne, this is very interesting and I’d love to talk with you about this. Could you drop me an email?

    • daynebatten April 27, 2016 / 10:44 am

      Sure thing.

  3. Andy Cross October 24, 2016 / 8:28 am

    Have you taken this pet project any further? I found that each store needs individual customization. Then there is change over time. Did you try comparing this to the Bureau of Labor Statistics’ market basket for the Consumer Price Index?

    • daynebatten October 24, 2016 / 9:35 am

      Good questions… No, I haven’t taken this project any further, but I’d still like to. My wife and I had a baby recently, so I’ve had a lot of interests on hold for a bit!

      Yeah, it would definitely be interesting to compare it to Labor’s CPI data. I think you’d need a much longer-term data set to get anything too interesting, though…

  4. Tyler Davis December 22, 2016 / 9:21 am

    Cool project! I’m headed down a similar path myself. Trying to build a price book for my area, so I don’t need to remember prices over time. I’m finding that each store posts its flyer in a totally different format. One of my local grocery stores seems to be posting a scanned JPEG of the paper flyer!

  5. Harper June 5, 2017 / 2:58 pm

    Hi Dayne,

    This is a great article! Thanks for writing it. I am starting on a project for school, and this is my first attempt at any web scraping. I am writing in Python, however, and am having some issues with the urllib libraries. Could you email me? I would love to get your advice!

    • daynebatten June 9, 2017 / 2:00 pm

      Hi Harper,

      I don’t have much experience with web scraping in Python. I’ve used the Requests library, which is great for making HTTP requests, but I suspect Python also has some really good packages that will do a lot of the DOM parsing for you and make extracting information easy.

      Best of luck, and sorry I can’t be of further assistance!
