Think that sounds like a cool idea? Let’s get started.
Google Analytics’ reporting API (as well as the custom reports module on the site) lets you view metric values for every value of a dimension, no matter how many values there are. If you have 10,000 pages on your site, you can get the number of pageviews for all 10,000 of those pages exported from GA. So, if you had a dimension that uniquely identified each user, for example, you could download data on every user that hit your site. Of course, if you could uniquely identify every hit that was sent to GA, that would really be the holy grail! You could download reports with one row for every hit that was submitted!
Obviously, Google doesn’t provide any built-in dimensions for this – but that doesn’t mean we can’t set it up ourselves using custom dimensions! There are plenty of ways you could set these dimensions up (a unique ID on every hit, for example), but I’ve found the most useful to be a combination of two custom dimensions: a unique user ID, and a time stamp. The combination of these values will uniquely identify each hit (unless, I suppose, somebody manages to fire off two hits within the same millisecond), and you’ll get the added bonus of being able to track the activity of a single user as they move about the site. Awesome!
Preparing the custom dimensions
Before we write any code, we’ll first set up some custom dimensions within GA to hold the information we’ll be submitting. First, we’ll need a “browser id” dimension to hold a unique ID representing a particular browser. (I think in terms of “browsers” instead of “users” since technically one user could end up with more than one ID when they clear their cookies or change computers.) This dimension will be user-scoped, so that we only have to set it the first time a user comes to the site, and it will automatically get associated with everything else that user does on the site.
Second, we’ll need a hit-scoped time stamp dimension. I’ve set mine up to simply record the number of milliseconds since January 1, 1970 (UTC). This dimension will get a different time stamp for every hit a user submits, allowing us to track her activity over time.
Our final setup will have two custom dimensions: a user-scoped “browser id” and a hit-scoped time stamp.
Coding it up
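Here is a minimal sketch of the idea, assuming the analytics.js `ga()` command queue is already loaded on the page. The dimension slots (`dimension1` for the browser id, `dimension2` for the time stamp), the `browser_id` cookie name, the two-year cookie lifetime, and the helper function names are all placeholders of mine, so match them to whatever you set up in your own property.

```javascript
// Sketch only: dimension slots, cookie name, and helper names are assumptions.

// Build a random ID for this browser from the current time plus a random number.
function generateBrowserId(now = Date.now(), rand = Math.random()) {
  return now + '.' + Math.floor(rand * 1e9);
}

// Read one cookie value out of a cookie string such as document.cookie.
function readCookie(cookieString, name) {
  const match = cookieString
    .split('; ')
    .find((pair) => pair.startsWith(name + '='));
  return match ? decodeURIComponent(match.slice(name.length + 1)) : null;
}

// Custom dimension values for one hit: browser ID plus ms since Jan 1, 1970.
function hitDimensions(browserId, now = Date.now()) {
  return { dimension1: browserId, dimension2: String(now) };
}

// Browser-only wiring; skipped when there is no document or no ga() function.
if (typeof document !== 'undefined' && typeof ga === 'function') {
  let browserId = readCookie(document.cookie, 'browser_id');
  if (!browserId) {
    browserId = generateBrowserId();
    // Placeholder cookie domain: swap in your own.
    document.cookie = 'browser_id=' + encodeURIComponent(browserId) +
      '; domain=example.com; path=/; max-age=63072000';
  }
  ga('create', 'UA-XXXXX-Y', 'auto'); // placeholder property ID
  ga('send', 'pageview', hitDimensions(browserId));
}
```

Because the browser id dimension is user-scoped, sending it once per browser would be enough. This sketch sends it on every hit simply to keep the code short.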
It’s really just that easy. We’ve now tagged all our Google Analytics data with custom dimensions that let us pull it down in a raw format!
(Note: if you want to use this code, you’ll need to switch out the cookie domain to match your own. And you’ll also need to swap out the GA property ID… obviously.)
After firing a few dummy pageviews into GA using this methodology, I tried pulling down some raw pageview data. Check it out:
Of course, using the API, you could export tens or even hundreds of thousands of individual pageviews, events, e-commerce transactions, etc. The sky’s the limit. You can pretty much get any data you’re interested in looking at this way.
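To give a rough idea of what a large export involves, here is a sketch that builds the query parameters for paging through results, assuming the Core Reporting API v3 query format (1-based `start-index`, up to 10,000 rows per request). The view ID, the dimension slots, and the `buildPageQueries` helper are hypothetical. It only builds parameter objects, so actually sending the requests and authenticating is left out.

```javascript
// Sketch only: view ID, dimension slots, and helper name are assumptions.
// totalRows would come from the totalResults field of the first response.
function buildPageQueries(viewId, startDate, endDate, totalRows, pageSize = 10000) {
  const queries = [];
  for (let startIndex = 1; startIndex <= totalRows; startIndex += pageSize) {
    queries.push({
      'ids': 'ga:' + viewId,
      'start-date': startDate,
      'end-date': endDate,
      'metrics': 'ga:pageviews',
      // Browser ID + time stamp give one row per hit; pagePath adds context.
      'dimensions': 'ga:dimension1,ga:dimension2,ga:pagePath',
      'sort': 'ga:dimension1,ga:dimension2',
      'start-index': startIndex,
      'max-results': pageSize,
    });
  }
  return queries;
}

// Example: 25,000 rows need three requests.
const pages = buildPageQueries('12345678', '2014-03-01', '2014-03-01', 25000);
console.log(pages.map((q) => q['start-index']));
```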
Now, I know there’s somebody out there going “but sampling!” In my experience, GA samples data within each combination of dimension values but will never return fewer than one row per combination. So, since we’re using a combination of dimensions that lets us effectively identify each hit uniquely, we don’t have to worry about sampling. In practice, I’ve downloaded hundreds of thousands of individual pageviews this way, and never had the API warn me that sampling was occurring. And, even if there were some theoretical limit where sampling would start, you could always segment your queries by GA’s date and time dimensions.
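If you did want to segment by date, one simple way is to split a long date range into one query per day. The sketch below assumes dates in `YYYY-MM-DD` form and uses a hypothetical `dailyRanges` helper. Each returned object would be merged into the rest of your query parameters.

```javascript
// Sketch only: helper name is an assumption. Dates are handled in UTC so that
// daylight-saving changes cannot skip or repeat a day.
function dailyRanges(startDate, endDate) {
  const ranges = [];
  const day = new Date(startDate + 'T00:00:00Z');
  const last = new Date(endDate + 'T00:00:00Z');
  while (day <= last) {
    const iso = day.toISOString().slice(0, 10);
    ranges.push({ 'start-date': iso, 'end-date': iso });
    day.setUTCDate(day.getUTCDate() + 1); // advance one calendar day
  }
  return ranges;
}

console.log(dailyRanges('2014-02-27', '2014-03-01'));
```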
Over the last several months, I’ve been working with a major site that’s been collecting GA data this way and loading it into a data warehouse for further processing. We’ve found dozens of use cases for this, but here are a few general ideas you may want to think about…
- Complex queries – Before we got all this set up, there were countless times where I would be trying to answer some complex question using user segments and custom reports and thinking “this would be so much easier if I could just query the raw data in SQL.” Now I can. It gives me a lot more flexibility, plus the added benefit of knowing exactly how the data is being processed – instead of guessing at what GA is doing behind the scenes.
- Link to customer data – I’ve never done this, but if your site allows users to log in, you could always set the value of a third custom dimension to some sort of user ID when a user logs in. Then, you could merge all of your GA data with other information you have on that user. (Just be sure the user ID you store in GA is a generic, non-identifiable ID. PII is banned by the GA terms of service.)
- Fraud analysis – If you run an e-commerce business, users who make fraudulent purchases on your site likely interact with it very differently than upstanding citizens. Linking site behavior to transactions (via the transaction id in GA e-commerce reporting) can help you determine whether orders are fraudulent or not.
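As a small taste of what working with the raw rows looks like, here is a sketch that rebuilds each browser's path through the site by grouping exported rows on the browser id and ordering them by time stamp. The row layout (browser id, time stamp, page path, pageviews) and the `pathsByBrowser` helper are assumptions, and the sample rows are made up for illustration.

```javascript
// Sketch only: row layout and helper name are assumptions.
// Each row is [browserId, timestampMs, pagePath, pageviews], all strings.
function pathsByBrowser(rows) {
  const paths = new Map();
  for (const [browserId, timestamp, pagePath] of rows) {
    if (!paths.has(browserId)) paths.set(browserId, []);
    paths.get(browserId).push({ time: Number(timestamp), pagePath });
  }
  // Order each browser's hits chronologically.
  for (const hits of paths.values()) {
    hits.sort((a, b) => a.time - b.time);
  }
  return paths;
}

// Made-up sample rows.
const sampleRows = [
  ['b1', '2000', '/cart', '1'],
  ['b2', '1500', '/', '1'],
  ['b1', '1000', '/', '1'],
];
for (const [browserId, hits] of pathsByBrowser(sampleRows)) {
  console.log(browserId, hits.map((h) => h.pagePath).join(' > '));
}
```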
As I said, we’ve found plenty of other uses for this data. But I’d love to hear what others are thinking about… Let me know if you’re using Google Analytics in a similar fashion and, if so, what you’re doing with the data!