Export Raw Data from Google Analytics (the Free Way)

Today, we’re going to use a couple of lines of JavaScript code to get free access to raw data from Google Analytics. That’s a feature that’s usually only available in Google Analytics Premium, a product which will set you back a cool $150,000 a year.

In one how-to video for Premium, for example, the presenter merges customer data with Google Analytics data via Google BigQuery. Luckily, you can unlock these kinds of features without having to take out a second mortgage.

Think that sounds like a cool idea? Let’s get started.

The concept

Google Analytics’ reporting API (as well as the custom reports module on the site) lets you view metric values broken out across an unlimited number of dimension values. If you have 10,000 pages on your site, you can get the number of pageviews for all 10,000 of those pages exported from GA. So, if you had a dimension that uniquely identified each user, you could download data on every user that hit your site. And, of course, if you could uniquely identify every hit that was sent to GA, that would really be the holy grail: you could download reports with one row for every hit that was submitted!

Obviously, Google doesn’t provide any built-in dimensions for this – but that doesn’t mean we can’t set it up ourselves using custom dimensions! There are plenty of ways you could set these dimensions up (a unique ID on every hit, for example), but I’ve found the most useful to be a combination of two custom dimensions: a unique user ID and a time stamp. Together, these values uniquely identify each hit (unless, I suppose, somebody manages to fire off two hits in the same millisecond), and you get the added bonus of being able to track the activity of a single user as they move about the site. Awesome!

Preparing the custom dimensions

Before we write any code, we’ll first set up some custom dimensions within GA to hold the information we’ll be submitting. First, we’ll need a “browser id” dimension to hold a unique ID representing a particular browser. (I think in terms of “browsers” instead of “users” since technically one user could end up with more than one ID when they clear their cookies or change computers.) This dimension will be user-scoped, so that we only have to set it the first time a user comes to the site, and it will automatically get associated with everything else that user does on the site.

Second, we’ll need a hit-scoped time stamp dimension. I’ve set mine up to simply record the number of milliseconds since January 1, 1970. This dimension will get a different time stamp for every hit a user submits, allowing us to track her activity over time.

Our final dimensions will look something like this:

[Image: the two custom dimensions as configured in the GA admin]

Coding it up

The JavaScript code to actually populate these dimensions is ridiculously simple. First, we load Google Analytics. Then, we check a cookie to see if the user has already been assigned an ID. If they haven’t been assigned one, we generate a UUID using Broofa’s JavaScript UUID generator, assign it to custom dimension 1, and set a cookie indicating that the ID has now been set. Next, we assign a time stamp to custom dimension 2. Then, we fire our hits (in this case, a pageview)!
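
A minimal sketch of that flow, assuming analytics.js, the dimension1/dimension2 slots set up above, and a hypothetical browser_id cookie (Broofa’s generator is loaded separately as a global uuid object):

    <script>
      // Standard analytics.js bootstrap
      (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
      (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
      m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
      })(window,document,'script','https://www.google-analytics.com/analytics.js','ga');

      ga('create', 'UA-XXXXXXXX-1', 'auto'); // swap in your own property ID

      // If this browser hasn't been assigned an ID yet, generate a UUID,
      // store it in a cookie, and send it in the user-scoped dimension.
      if (document.cookie.indexOf('browser_id=') === -1) {
        var browserId = uuid.v4(); // Broofa's JavaScript UUID generator
        document.cookie = 'browser_id=' + browserId +
          '; max-age=63072000; domain=.example.com; path=/'; // swap in your domain
        ga('set', 'dimension1', browserId);
      }

      // Hit-scoped time stamp: milliseconds since January 1, 1970
      ga('set', 'dimension2', String(new Date().getTime()));

      // Fire the hit (in this case, a pageview)
      ga('send', 'pageview');
    </script>

Note that both dimensions are set before the pageview fires, so they ride along with the hit.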

It’s really just that easy. We’ve now tagged all our Google Analytics data with custom dimensions that let us pull it down in a raw format!

(Note: if you want to use this code, you’ll need to switch out the cookie domain to match your own. And you’ll also need to swap out the GA property ID… obviously.)

Our results

After firing a few dummy pageviews into GA using this methodology, I tried pulling down some raw pageview data. Check it out:

[Image: raw pageview data exported from GA, one row per hit]

Of course, using the API, you could export tens or even hundreds of thousands of individual pageviews, events, e-commerce transactions, etc. The sky’s the limit. You can pretty much get any data you’re interested in looking at this way.

Now, I know there’s somebody out there going “but sampling!” In my experience, GA samples data within each combination of dimension values, but it will never return fewer than one row per combination. So, since we’re using a combination of dimensions that effectively identifies each hit uniquely, we don’t have to worry about sampling. In practice, I’ve downloaded hundreds of thousands of individual pageviews this way and never had the API warn me that sampling was occurring. And even if there were some threshold where sampling would kick in, you could always segment your queries by GA’s date and time dimensions.

Use cases

Over the last several months, I’ve been working with a major site that’s been collecting GA data this way and loading it into a data warehouse for further processing. We’ve found dozens of use cases for this, but here are a few general ideas you may want to think about…

  • Complex queries – Before we got all this set up, there were countless times where I would be trying to answer some complex question using user segments and custom reports and thinking “this would be so much easier if I could just query the raw data in SQL.” Now I can. It gives me a lot more flexibility, plus the added benefit of knowing exactly how the data is being processed – instead of guessing at what GA is doing behind the scenes.
  • Link to customer data – I’ve never done this, but if your site allows users to log in, you could always set the value of a third custom dimension to some sort of user ID when a user logs in (see the sketch after this list). Then, you could merge all of your GA data with other information you have on that user. (Just be sure the user ID you store in GA is a generic, non-identifiable ID. PII is banned by the GA terms of service.)
  • Fraud analysis – If you run an e-commerce business, users who make fraudulent purchases on your site likely interact with it very differently than upstanding citizens. Linking site behavior to transactions (via the transaction id in GA e-commerce reporting) can help you determine whether orders are fraudulent or not.
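
As a rough sketch of that second idea, assuming analytics.js, a user-scoped dimension3, and an opaque ID generated by your back end at login:

    // Hypothetical: run once the user has logged in. 'u-9f2c41' stands in
    // for a generic, non-identifiable ID (e.g. a hash of your internal user
    // ID), never an email address or anything else that counts as PII.
    ga('set', 'dimension3', 'u-9f2c41');
    ga('send', 'event', 'account', 'login', { nonInteraction: true });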

As I said, we’ve found plenty of other uses for this data. But I’d love to hear what others are thinking about… Let me know if you’re using Google Analytics in a similar fashion and, if so, what you’re doing with the data!

43 Responses

  1. Martin Watts December 8, 2015 / 5:56 am

    Hi,

    Great article; this looks like exactly the solution I’m looking for to track my logged-out users’ navigation around the site.

    Is there any benefit to using this UUID method over passing the clientID into a custom dimension? I believe the clientID is stored in cookies in the same way that the generated UUID would be.

    https://developers.google.com/analytics/devguides/collection/analyticsjs/cookies-user-id?hl=en#getting_the_client_id_from_the_cookie

    Thanks,
    Martin

    • daynebatten December 8, 2015 / 7:52 am

      Great question. Pushing the clientID into a custom dimension should work perfectly fine. We went the UUID route simply as a matter of preference – it put control over the ID generation and storage in our court, so we wouldn’t have to worry about Google changing something up in the future… Of course, Google could change the entire structure of GA’s custom dimensions or reporting at any point, which would theoretically break all of this anyway.
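
      For reference, a minimal sketch of that alternative (assuming analytics.js and the same dimension1 slot) might look like:

        ga('create', 'UA-XXXXXXXX-1', 'auto');
        ga(function(tracker) {
          // Reuse the client ID GA already keeps in its own cookie
          ga('set', 'dimension1', tracker.get('clientId'));
          ga('send', 'pageview');
        });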

  2. Jaap December 23, 2015 / 4:14 am

    Hey, thanks so much for this article, looks really useful.

    Aren’t you worried that Google finds out about the huge amount of data your account is pulling every time, though? I also want to use your trick in a business environment, but I can’t have my client’s Google account suddenly be blocked or something 🙂

    Thanks again,
    Jaap

    • daynebatten December 23, 2015 / 10:08 am

      Great question. As it turns out, Google’s API quotas only limit the number of requests per day (50,000), not the total number of records downloaded. So, downloading 7,000 rows of raw page view data costs the same single request as grabbing your total number of hits for last month. You can only download a maximum of 10,000 rows per request, however, before you have to start paginating the output, so if you have, say, 120,000 unique page views to download, that will eat up 12 requests. It will still take a long time to get to 50,000 this way.
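
      To make the pagination concrete, here’s a rough sketch against the v3 Core Reporting API (the view ID, dates, and token are placeholders; in practice one of Google’s client libraries would handle authentication):

        // Hedged sketch: page through the v3 Core Reporting API 10,000 rows
        // at a time using start-index. Assumes a fetch() implementation and
        // a valid OAuth access token.
        async function downloadRawPageviews(viewId, accessToken) {
          var rows = [];
          var startIndex = 1;
          while (true) {
            var url = 'https://www.googleapis.com/analytics/v3/data/ga' +
              '?ids=ga:' + viewId +
              '&start-date=2015-12-01&end-date=2015-12-31' +
              '&metrics=ga:pageviews' +
              '&dimensions=ga:dimension1,ga:dimension2,ga:pagePath' +
              '&max-results=10000' +
              '&start-index=' + startIndex;
            var res = await fetch(url, {
              headers: { Authorization: 'Bearer ' + accessToken }
            });
            var data = await res.json();
            rows = rows.concat(data.rows || []);
            if (startIndex + 10000 > data.totalResults) break; // last page
            startIndex += 10000;
          }
          return rows; // one array per hit: [browserId, timestamp, path, '1']
        }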

      Of course, to get this full quota, you need to register your application in the Google Developers Console. Registering also lets you keep tabs on your quota usage, so you can be sure you never go over your limit.

      In practice, I work with a pretty large site and have never come anywhere close to burning through my API limits. To help with that, we download everything just once a day and store it permanently in our data warehouse. That way, we never have to download the same data twice and we keep our requests to a minimum.

      Info from Google on API limits for Google Analytics: https://developers.google.com/analytics/devguides/config/mgmt/v3/limits-quotas

  3. Gio April 10, 2016 / 9:59 am

    Is it still working?

    Sorry, I am a complete newbie with the Analytics API.
    1. How did you get that table after header “Our results”?
    2. How are you doing “downloading everything once a day and storing it permanently in our data warehouse”?

    • Gio April 10, 2016 / 11:44 am

      I’ve figured out how to solve Question 1.

    • daynebatten April 12, 2016 / 7:55 am

      Hey Gio. Glad you got question 1 figured out. On 2, we’re using a Python client for the Google Analytics Core Reporting API (or what was formerly known as that, anyway). The APIs have changed a bit over the years, but there are several options for connecting to GA, specifying dimensions and metrics, and downloading reports. Google’s current API Client for Python appears to have the necessary functionality, though I personally haven’t played with it: https://developers.google.com/api-client-library/python/

      Essentially, our daily script has a couple of GA reports that it downloads using the Python API client (each of which may contain tens or hundreds of thousands of individual user actions). The Python script writes out all of the data to a CSV, and we then simply read that CSV into a database table…
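
      (Our job itself is Python, but the shape is simple enough to sketch in the same JavaScript used elsewhere on this page:)

        // Illustrative sketch only; downloadRawPageviews() is the
        // hypothetical paginated fetcher from an earlier comment.
        var fs = require('fs');

        async function dailyExport(viewId, accessToken) {
          var rows = await downloadRawPageviews(viewId, accessToken);
          // One CSV line per hit; a production job would quote/escape fields.
          var csv = rows.map(function (r) { return r.join(','); }).join('\n');
          var name = 'ga_raw_' + new Date().toISOString().slice(0, 10) + '.csv';
          fs.writeFileSync(name, csv);
          // A COPY / LOAD DATA statement then pulls the CSV into the warehouse.
        }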

        • daynebatten April 13, 2016 / 7:19 am

          No problem. And thanks!

          I have used R to access the GA API before, and the RGA package is great. Our daily ETL processes are all written in Python, though, so we stuck with Python for implementing this data load as well.

  4. SamH May 25, 2016 / 3:52 pm

    Great article!

    I take it this is a similar concept to Simo Ahava’s blog article on custom dimensions? I’m very hands-on with Tag Manager at the moment, but Simo uses an event trigger to fire the custom dimensions through. This is great, as this method uses a page view hit, which could be very useful!

    What databases are you using to store the GA data? Would MySQL cut it? What kind of processing do you do on it afterwards?

    Cheers 🙂

    • daynebatten May 26, 2016 / 8:54 am

      Yeah, very similar concept to what Simo’s talking about. My understanding is that once you set a custom dimension on a page, it will get fired through with every subsequent hit. So, if you set them up before the page view hit, they go through.

      We are storing the GA data in Amazon Redshift, since our organization is already using that for some of our larger data warehousing needs. MySQL would definitely cut it for all but the largest of sites, though.

      As for processing, the sky’s the limit! As an example, our affiliate program is only supposed to pay out when our affiliates refer new customers (as opposed to old customers making a new purchase). Previously, we had no way to actually determine whether an affiliate click was a new customer or not. Now, since we can link the affiliate hit to the transaction via the GA data, then use our back-end systems to look at what customer the transaction came from, we are able to audit our affiliate sales and not pay commissions on sales that don’t meet our terms.

  5. daniel August 12, 2016 / 10:44 am

    Hi Dayne

    thanks so much for a great read. I tried implementing it myself as a proof-of-concept and a bit of practice (2nd year CS student) but am running into some difficulty. I’m trying to implement it on the polaroidkidd.com domain; I set up the custom dimensions (included their code in the website) and exchanged your IDs for mine, but it still isn’t being tracked when I open up a custom report in GA.

    Could you please go into a bit more detail about setting up the custom dimension code on the website and exchanging the cookie domain?

    Any help would be appreciated!

    • daynebatten August 12, 2016 / 11:14 am

      Took a look at your site… It looks like you’ve got some extraneous JS code executing (and crashing) above the code you’ve pulled from this post. (I think maybe you meant to comment that out along with the old Google Analytics code?) Your browser’s console should be helpful for looking for these sorts of errors. But this isn’t what you’re asking about…

      On to the point: Based on monitoring network traffic when I visit your site (again, browser developer tools are massively helpful), I see that the hits are flowing properly to Google Analytics. Are you sure your property ID is set correctly? Also, I don’t know how much time you’ve given for this, but it takes data a number of hours (I usually find around 4 or 5, but it could be longer) to get processed by Google and actually show up in Google Analytics reports. Maybe try checking back later?

  6. daniel August 13, 2016 / 5:08 am

    thanks for getting back so quickly. I managed to get it to work with your tips. GTM is still not working correctly, but that’s a different matter.

    When I created the custom dimensions in the GA admin, it gave me a code snippet and asked me to implement it on the site, which, I think, led to them not working properly with your code, or to the browser_id field just being filled with “browser_id” instead of a UUID.

    Again, thank you very much for this great read and the support!

  7. Becky August 17, 2016 / 11:04 pm

    Great post! I absolutely agree with the idea of bringing in a browser ID and a timestamp, if you have the available custom dimensions. Have you considered using the Google Analytics client ID as your browser ID? If a Google Analytics tracker has been created on the page, the client ID should be available to you:
    ga(function(tracker) {
    var clientId = tracker.get('clientId');
    });

    Also, as a small note, you can still get hit by sampling even if you break out your data at the hit level. https://support.google.com/analytics/answer/2637192?hl=en
    Breaking your query up at the day level can absolutely help you get around this, as long as you do not have more than 500,000 sessions (in that property) within a single day.

  8. Michael Billing September 9, 2016 / 3:50 pm

    Hi,
    Super article! I also read about it on Simo Ahava’s blog: http://www.simoahava.com/analytics/improve-data-collection-with-four-custom-dimensions/

    I’d like to add that I actually spent the past 2 years taking this approach a bit further. 🙂

    I have made an ETL app that extracts just about all the raw data from whichever GA property/view you want. You put a script on your site (along the lines of what you suggest; ours is at http://www.scitylana.com/assets/script/tagmanager.txt) that collects the unique ID and timestamp in a custom dimension. (We put the extra info in a single dimension.)

    Now we can make a lot of queries and stitch them together, reproducing the “raw” data.

    The app fetches the data daily and puts it into txt files on your hdd for easy import into analysis tools, DBs or whatever you like.

    Keep up the good work here! 🙂

  9. Ole Petter Løvik September 16, 2016 / 9:33 am

    Great post! I have implemented a similar approach and tested it out on our test servers. I wonder if you have any experience with data update delay from Google when downloading raw data like this through the API. During my (small-scale) testing, I see the data about 10-20 minutes after page activity, but I’m more interested in expected delays for websites with more traffic (20,000 to 100,000+ hits a day). Also, do you filter by dimension2 (the timestamp) when downloading data? Is that safe to use, given processing time at Google?

    …I hope some day custom dimensions will be available through the real-time API…

    • daynebatten September 16, 2016 / 3:04 pm

      In my experience with a reasonably large site, data is maybe 4-8 hours stale by the time it shows up in Google Analytics. Not terrible, though. I actually filter by GA’s date parameters rather than by dimension2… it’s pre-built by Google for filtering on date. I don’t see any reason why you couldn’t go a different route, though, if you could build the filter.

      I’m with you on the real time API…

  10. Harshal Varday September 16, 2016 / 12:48 pm

    A quick question: can we export custom event data?
    Thanks, in advance!

    • daynebatten September 16, 2016 / 3:05 pm

      Yes, you should be able to export virtually everything from GA (though I don’t believe they allow exporting gender or age).

  11. Sarnath K September 18, 2016 / 10:29 am

    Hmm… I doubt that. What exactly do you mean by “combination of dimensions”?

    If the combination of all dimensions is what you are referring to, that is as simple as stating that GA will ignore duplicate rows and give unique rows.

    I am somehow not convinced that we can get around the “sampling” part. Obviously, I am drooling to be proved otherwise…

    • daynebatten September 19, 2016 / 3:05 pm

      By “combination of dimensions” I mean just that – the full list of combinations that occur across all the dimensions in your query.

      In my experience (which involves downloading millions of hits a month this way), I definitely get one record for every hit, since the browser ID and timestamp are unique. How do I know this is true (the “proof,” if you will)? If I count the number of hits on a given day in our data warehouse, it’s the same as what Google reports. If sampling was removing some of my data, the numbers wouldn’t add up.

  12. Jesper Ellegaard December 2, 2016 / 6:59 am

    Great article!
    This is actually implemented as a plug’n’play product by https://scitylana.com. Scitylana will pull all raw data unsampled, user by user, click by click from GA (free edition) onto your local hard drive. From here on you can use data in Excel, Power Pivot, Power BI, SQL Server or whatever you like.

  13. Chintan December 29, 2016 / 12:40 pm

    Hello!

    This is a very good post and most likely what I am looking for. I would really appreciate it if you could help me with my problem below:

    I am trying to get clickstream data for my target website from Google Analytics. The data I am looking to get is customerID, timestamp, pagePath etc…

    How do I add these so that I can see them in custom dimensions? I did not set up the GA account and I do not know much about it. Can you please point me in the right direction?

    I know how to use R. Do I need to use the code from your example in R? Let’s say the website I am interested in is http://www.xyz.com

    Thank you!

  14. aidian March 13, 2017 / 10:37 am

    I tried this approach, and I’m seeing duplicate values for the UUID and timestamp. I’m thinking it’s because we dynamically generate and then cache our site every 10 minutes. Does that seem likely to you? Or am I missing something obvious?
    Thanks

    • aidian March 13, 2017 / 2:15 pm

      Nevermind – figured out my issue. Thanks for the tip!

  15. Mo April 3, 2017 / 1:33 am

    Hello,

    Would you please share a step-by-step procedure to extract the raw data? I have created the custom dimensions but no luck. So I would like to create the UUID, but I couldn’t understand the statement below from the article. Would you help me understand it? Thanks.

    assign it to custom dimension 1, and set a cookie indicating that the ID has now been set. Next, we assign a time stamp to custom dimension 2. Then, we fire our hits (in this case, a pageview)!

    • daynebatten April 3, 2017 / 7:52 am

      This is simply describing the JavaScript code snippet right below it. You can create custom dimensions and metrics from the Google Analytics website, but in order for them to contain anything, they need to be populated by the code on your website. That’s where the JavaScript comes in. It sets a UUID for the browser and places that UUID in one of the custom dimensions you’ve already built.

  16. Bryan April 3, 2017 / 2:08 am

    1. How did you get that table after header “Our results”?

    Urgent Help please.

  17. Michele May 11, 2017 / 9:53 am

    Hi Dayne, it’s probably a trivial question, but this means you’ll only be able to collect data going forward, not what has already been collected. Am I right?

    • daynebatten May 11, 2017 / 1:33 pm

      Yeah, unfortunately, that’s correct…

  18. Tom July 19, 2017 / 7:35 am

    Hi Dayne,

    If I could push a custom event to the GA data layer on completion of a form, with some generic ID attached, would this method enable me to match GA data to the data collected in the form?

    Cheers,

    Tom

    • daynebatten July 25, 2017 / 8:37 am

      Sorry for the slow response. Yes, you could absolutely do that. You’d need to have already assigned a unique ID to that user, though, in order to follow all of their activity, if that’s what you’re interested in. You could look for user IDs that have your custom event, extract the value of the custom event to get the form submission ID you supplied, and then use their user ID to grab all their past behavior.
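
      As a rough sketch, assuming analytics.js and a hit-scoped custom dimension (say, dimension4) reserved for the form ID:

        // Hypothetical: fire on successful form submission. 'form-1234'
        // stands in for whatever generic ID your form system assigns.
        ga('set', 'dimension4', 'form-1234');
        ga('send', 'event', 'forms', 'submit');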

  19. Serge August 25, 2017 / 3:41 am

    Thanks! Your solution is similar to what I’m looking for.

    Is it possible to collect personal data (user email, phone, etc.)?

    • daynebatten August 28, 2017 / 8:49 pm

      Google’s terms of service explicitly ban submitting personally identifiable information to Google Analytics. You can, however, send an anonymous user ID which you can link to your own internal systems if you want. Hope that helps!

        • daynebatten September 6, 2017 / 11:11 am

          Sounds like what you’re looking for can be handled by GA’s User-ID feature… you can read an intro to it here: https://support.google.com/analytics/answer/3123662?hl=en. Essentially, when somebody logs into your site (or you otherwise know who they are), you would send a (non-PII) User ID that you’ve generated to Google. You could extract that from GA later. You could then join using that ID to anything in your database (which could be in BigQuery or anything else).

          Sorry I can’t go into full detail on how to code that up, but that’s way beyond the scope of a blog comment, and would be specific to your site/use case anyway…

  20. Ivo September 19, 2017 / 1:49 am

    Hi Dayne,

    Love your post, it has helped me help so many of my clients with more detailed tracking. Thank you!

    Quick question for you: I have a client who would like to use Google Analytics data that has already been collected over the past 3 years to figure out the following:
    – The unique client ID for each cookie
    – Subsequently, detailed demographic data for each cookie/client ID (age range, gender, etc.)

    Would you know if this would be possible via the Google Analytics API? We tried querying the API but weren’t able to figure it out. (We were able to get the client IDs but weren’t able to extract demo data for each, and when we got demo data it was for grouped client IDs.)

    I am wondering if this is perhaps a feature reserved only for Google Analytics Premium subscribers. Would using BigQuery enable us to do so?

  21. Idan November 8, 2017 / 3:10 am

    Hi Dayne,
    Great post, very helpful!
    I have to say that I don’t understand the benefit of using GA to get the raw data and then pushing it to a cloud DB. Why not push it to a cloud DB in the first place?

    TNX
    Idan

    • daynebatten November 14, 2017 / 7:16 am

      I actually have another post about re-directing the Google Analytics hit straight to your own server, bypassing Google’s infrastructure entirely. You can find it here: http://daynebatten.com/2015/07/send-google-analytics-data-to-your-own-server/

      As for why you might want to send it to Google Analytics and export from there… Google Analytics does a ton of value-adds on the data that flows in there. They geo-locate IPs for you. They tell you what ISP somebody is on. They filter out bots and spam. They aggregate referral information into channels. I could go on. Allowing Google to do all that stuff for you gives you a far more analyst-friendly data set to work with when you export it.
