Scraping Grocery Store Ads for Fun and Profit

If we’re honest, I imagine most of us would admit we don’t really know what a good price is on the grocery items we purchase regularly. Except for a few high-priced favorites (e.g., ribeyes and salmon) that I watch for sales, I honestly have no idea what’s a regular price and what’s a good deal. How much does a box of raisin bran cost? Whatever the grocery store charges me…

harris teeter

A local Harris Teeter in my area. Usually, their prices are higher… except when they’re not. Honestly, I don’t really know.

Obviously, this is a really terrible way to manage my grocery budget – I probably spring for “deals” that are nothing of the sort all the time. So, as a data scientist, I got to thinking… what if I could keep tabs on this? Build a database of historical prices for various items, so I’d know when to pull the trigger on sales. Seems straightforward enough.

Well, I needed some data to get started. So, I figured I’d see if I could programmatically scrape prices out of the online weekly ad for my local Kroger store. In this post I’ll walk through how I got that set up… and, as this project moves along, I’ll post updates on what I do with the data.


Data Show Stores Decorating for Christmas Earlier Each Year

Seems like every year around this time, I hear folks complaining that corporate America is decorating for Christmas way too early. Gone are the days when Christmas was reserved for December - now we're lucky to make it through Halloween without seeing wreaths and Christmas trees everywhere. But is any of it true? Are stores really decorating earlier than they used to? I decided to find out.

Of course, there's not really any available data on when stores around the country start decorating for Christmas (at least to my knowledge). So answering the question required some creativity. I started thinking - if there was one thing that represented corporate America's Christmas decorating traditions, what would it be? The Rockefeller Center Christmas Tree, of course!

Rockefeller Center Tree

The Christmas tree in Rockefeller Center.

First erected in 1933, the Rockefeller Center tree has been set up every year since and has become an unofficial start of the Christmas season for New Yorkers. And, since Rockefeller Center is without a doubt a bastion of American corporatism, the tree gives us a relatively good proxy for measuring the start of the corporate Christmas decorating season. If the tree at Rockefeller Center has been going up earlier every year, there's a good chance that's indicative of a larger national trend.


Override Hadoop’s Default Compression Codec Selections

If you're using a standard input format, Hadoop will automatically choose a codec for reading your data by examining the file extension. So, for example, if you have a file with a ".gz" extension, Hadoop will recognize that it's gzipped and load it with the gzip codec. This is all well and good... until you're trying to work with a bunch of compressed files that don't have the proper extension. Then the feature suddenly becomes a burden.
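To make the failure mode concrete, here is a minimal Python sketch of extension-based codec selection. It only imitates the idea and is not Hadoop's API: the `OPENERS_BY_EXTENSION` table, `choose_opener`, and `read_bytes` are hypothetical names made up for illustration.

```python
import bz2
import gzip
import os

# Hypothetical extension-to-opener table. It mimics the idea of
# extension-based codec lookup and is not Hadoop's actual registry.
OPENERS_BY_EXTENSION = {
    ".gz": gzip.open,
    ".bz2": bz2.open,
}


def choose_opener(path, override=None):
    """Pick a function for opening `path` based on its file extension.

    `override` forces a specific opener, which is the escape hatch you
    need when compressed files are missing their extension.
    """
    if override is not None:
        return override
    extension = os.path.splitext(path)[1].lower()
    # Unknown or missing extensions fall back to reading raw bytes.
    return OPENERS_BY_EXTENSION.get(extension, open)


def read_bytes(path, override=None):
    """Read the whole file with whichever opener was selected."""
    opener = choose_opener(path, override)
    with opener(path, "rb") as handle:
        return handle.read()
```

With this sketch, a gzipped file named `data.gz` decompresses automatically, while the same bytes saved as plain `data` come back still compressed unless you pass `override=gzip.open`.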

Codec Logo

Apparently, "codec" was the name of a 1980s grocery store in France with a hideously '80s logo.

I recently found myself in just this situation, and scoured the internet looking for tips on how to override Hadoop's codec choices. I couldn't find any good resources, so I went digging in the source to build the solution myself. Hopefully, this post will save somebody else the trouble!


SQL Survival Curves with Redshift and Periscope

My company has a subscription-based business model, which means we spend a lot of time analyzing customer churn. We wanted to include Kaplan-Meier survival curves in some of our executive dashboards, but neither our database (Redshift) nor any of our commonly used dashboarding tools (Tableau, Periscope, etc.) provided the necessary functionality. We could, of course, have pulled data out of the warehouse, analyzed it in R or Python, and pushed it back up, but that's pretty complicated. So we went looking for a better solution.

Based on all available evidence, the survival curve for aliens singing disco music hits 0 at about 43 seconds (N = 1).

As you likely guessed from the title of this post, that better solution involved writing our own code for calculating a Kaplan-Meier estimator in SQL. In this post, I'll be walking through our strategy step-by-step, including the SQL code for calculating the estimators and making that code reusable in Periscope. Let's do this!
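For readers who haven't met the Kaplan-Meier estimator before, here is a rough Python sketch of the calculation itself. It is not the SQL from the post, just the standard product-limit formula under the assumption that each customer has one observed duration and a flag saying whether they churned or were still active (censored). The function and argument names are made up for illustration.

```python
from collections import Counter


def kaplan_meier(durations, churned):
    """Return [(time, survival_probability)] at each observed churn time.

    durations: how long each customer was observed (e.g. months subscribed)
    churned:   True if the customer churned at that time, False if they
               were still active when observation ended (censored)
    """
    events = Counter(t for t, c in zip(durations, churned) if c)
    exits = Counter(durations)  # churned or censored, both leave the risk set

    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        if events[t]:
            # Product-limit step: multiply by the share that survived time t.
            survival *= 1.0 - events[t] / at_risk
            curve.append((t, survival))
        # Everyone whose observation ended at t drops out of the risk set.
        at_risk -= exits[t]
    return curve
```

For example, with durations `[1, 2, 2, 3, 4]` and churn flags `[True, True, False, True, False]`, the curve steps down to roughly 0.8 at time 1, 0.6 at time 2, and 0.3 at time 3. The censored customers never trigger a step, but they do shrink the at-risk count for later steps.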


Optimal Pass the Pigs Strategy (Part Two)

Still migrating old posts due to travel. Next post will be fresh content!

In a previous post, we learned that if you want to maximize your score on any individual turn of a game of "Pass the Pigs," you should always roll when there are fewer than 22.5 points in your hand, and hold when there are more than 22.5 points in your hand. (If you've never heard of "Pass the Pigs," the rules are explained in the prior post.)

I wouldn't mind being passed this pig for a few minutes... that's some serious cuteness right there.

However, we also concluded that that's not an effective strategy for winning the game as a whole. If you have a score of 0 and your opponent has a score of 99, for example, it would be really silly to stop rolling at 23 points just because the "22.5 rule" says to. So what's a person to do? How do you play effectively? Today, we'll generate a strategy that can help you make an optimal move in any situation. (Hint: you'll need to do a lot of math.)


Optimal Pass the Pigs Strategy (Part One)

Since I'm out of town for a bit, I'm migrating over a few relevant posts from an old blog of mine that I'm planning to shut down. Enjoy!

Pass the Pigs is a simple yet addictive dice game that uses cute little plastic pigs as dice. If you've never played, the rules are very straightforward. On each turn, a player rolls two pigs. The pigs will land in different positions, which will determine how many points the player has in their hand for that turn. The player may then decide to "pass the pigs" to the next player. If they do this, all of the points in their hand will be added to their official score. They may also decide to roll the pigs again to try to add more points to their hand before passing the pigs. But they must be careful! If the pigs both land on their sides with one showing a dot and the other showing a blank side, they "pig out" and lose all of the points they've accumulated in their hand! It's risky business. The first player to accumulate a score of 100 or higher wins.

Pass the Pigs

Credit: Larry Moore

Like anything with dice (even pig-shaped dice), Pass the Pigs is a game of chance. That means, with a little effort, we should be able to figure out the probabilities of certain things happening in the game, and develop some optimal strategies. So how do you win at Pass the Pigs? Read on to find out.


Distance between Latitude and Longitude Coordinates in SQL

Pretty much any language commonly used for data analysis (R, SAS, Python) can calculate the distance between two geographic coordinates with relative ease. But having to pull your data out of your data warehouse any time you want to do some basic geographic analysis can be frustrating. Sometimes it's nice to keep simple queries all in one system. If you've got a spatially enabled version of Postgres or SQL Server, you're in business. But if not, you'll need to roll your own SQL solution.

Great Circles

Because the earth is (roughly) a sphere, the shortest route between two points is a "Great Circle," which may appear curved on flat maps...

In today's post, we're going to write our own code in vanilla SQL to calculate the distance between two latitude and longitude coordinates.
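As a reference point for the math, here is a small Python sketch of one common way to compute great-circle distance, the haversine formula. This is an assumption on my part about which formula to show (the SQL version may well use a different one, such as the spherical law of cosines), and the function name and the 6371 km mean earth radius are illustrative choices.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; assumes a perfectly spherical earth


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)

    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # Clamp to 1.0 so floating-point rounding can't push asin out of range.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))
```

Everything here is plain trigonometry (`sin`, `cos`, `asin`, `sqrt`, and a degrees-to-radians conversion), which is why the same calculation can be expressed in a database that offers those functions.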


Render Google Maps Tiles with Mapnik and Python

If you want to take a bunch of GIS data and rasterize it as a tiled image map for public consumption, the folks at ESRI would be happy to sell you an expensive solution. Of course, as with oh-so-many projects, you can accomplish the same thing for free with open-source software. In this case, we'll use Python and a library called Mapnik to render beautiful map layers, then display them on Google Maps, just like this demo rendering of my home county!

Ready to get started? Dust off your Python skills, and let's go!


Project Mean Customer Lifetime by Modeling Churn

In a past post on analyzing churn in the subscription or Software as a Service business, I talked about two different ways to quantify the dollar cost of churn. You could use 1 / churn as an estimate of mean customer lifetime (though this simple method makes a lot of assumptions). Or, you could use “pseudo-observations” to calculate the dollar value of certain groups of customers during a particular time period (which doesn’t let you quantify the full lifetime value of a customer).
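To see where the 1 / churn shortcut comes from, here is a quick Python sketch. It assumes the churn rate is constant in every period, which is exactly the kind of big assumption mentioned above. The function name and the `max_periods` cutoff are made up for illustration.

```python
def mean_lifetime(churn_rate, max_periods=10_000):
    """Expected number of periods a customer stays active.

    Assumes the same churn_rate applies in every period. The sum is cut
    off at max_periods, so very small churn rates are slightly
    underestimated.
    """
    expected = 0.0
    surviving = 1.0  # probability the customer is still active
    for _ in range(max_periods):
        expected += surviving          # active during this period
        surviving *= 1.0 - churn_rate  # some share churns before the next one
    return expected
```

With 5% churn per period, the sum converges to 20 periods, matching 1 / 0.05. The shortcut is just the sum of a geometric series, so it is only as good as the constant-churn assumption behind it.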

But what if there was another way? What if we took our Kaplan-Meier best estimate of our churn curve, fit a linear model to that model, and then projected it out?

Inception Squint

A model within a model, if you will. Churnception.

Well, as it turns out, we’d get a reasonable estimate of our lifetime churn curve, which would let us estimate average customer lifetime and customer lifetime value. Let’s get started.


Identifying Trends in SQL with Linear Regression

One of the best ways to learn how a statistical model really works is to code the underlying math for it yourself. Today, we’re going to do that with simple linear regression.

Data Smart Cover

In the book Data Smart, John Foreman introduces a bunch of awesome methodologies by walking you through how to build them in Excel…

Of course, doing regression in SQL also has (some) practical use! For example, suppose you wanted to identify which city in a database of temperature records had the biggest warming trend in the last month. This method would send you on your way without having to bring your data into an external tool. Nifty!
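For a feel of the underlying math, here is a minimal Python sketch of simple linear regression using the usual closed-form least-squares formulas. It is not the SQL from the post, and `fit_line` is a hypothetical name.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx

    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

In the temperature example, `xs` would be the day number and `ys` the recorded temperature for one city, and the city with the largest positive slope has the biggest warming trend. Because the formulas only need sums and averages, they map naturally onto SQL aggregates.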
