“Redshift View Materializer” Now on Github

Although Amazon Redshift is a fantastic data warehousing product, it lacks several of the features available in other database systems. One such feature is materialized views, which are just what they sound like... views that have been pre-calculated and physically stored on disk for faster querying, and which are updated on request.

the view

No, not that kind of View.

Materialized views are very useful any time you want to have the results of complex queries readily and quickly available, and you don't mind if the data is a few hours stale. For instance, suppose you have an executive dashboard that displays your monthly gross margin for your highly successful app, which has millions of users. The query that does those calculations runs for 20 minutes (way longer than an exec is going to want to wait), but it only needs to be updated once a month. Materialized views are the perfect solution. Calculate once, cache the data, and reference the cache on-demand. Refresh when needed.

The "Redshift View Materializer", now available on GitHub, is a simple Python script that creates tables containing the results of arbitrary SQL queries on-demand. Simply set the script to run as a cron-job whenever you want your tables re-created, and you'll end up with a reasonably close approximation of materialized views. The script is in the public domain, but I'd love if you'd contribute if you make any improvements, bug fixes, or additions!

8 Responses

  1. Ruben Kogel September 30, 2016 / 7:33 pm

    what is the difference between a materialized view and an ad-hoc table that gets refreshed infrequently and is generated by a custom cron job?

    • daynebatten October 4, 2016 / 3:55 pm

      In this case, absolutely nothing. This is just pre-built script to do the work for you, with some added niceties (re-using code between multiple queries via snippets, for example).

      In a usual database, there’s still not much of a physical difference, except that you don’t need an external script to refresh the view. You can set up a trigger to do it when the underlying data is changed, for example.

  2. PRaveen August 30, 2017 / 11:41 am

    is there any faster way we can make tables as materilized, if i do Create table as select * from for all my tables it is taking way over 2+ hrs.. and i do not want to add 2+ hours to my daily load..

    any suggestions

    • daynebatten September 6, 2017 / 11:14 am

      This would be a totally different set of code (unfortunately), but if you have a huge table where old data stays the same from day to day and you only need to update recent data (say, a table of sales, for example), you could delete relatively recent data and re-insert fresh data every day, leaving the oldest data untouched. That would likely save significant processing time, but only if this fits your use case.

      Have you gone through the usual query optimization steps (sort keys, distribution keys, foreign keys, checking for nested loop joins and the like, etc.)?

  3. Sebastian Heyneman January 4, 2018 / 10:56 am

    Hi Dayne, I appreciate the content on your blog. Thank you for the time and energy you put into it. It makes me want to do something similar.

    We use Redshift and we want to connect an ETL service that’ll pipe data from Redshift to Salesforce. It needs a specific table that is updated on a regular basis. As you point out – we can’t build materialized views to solve the problem.

    My understanding is that your script will allow us to:
    -Specify SQL for the table we want to build
    -Schedule the table to be dropped and rebuilt or appended to on on an hourly or daily basis
    -Keep the same table name after each update

    Is that accurate? Thank you Dayne.


    • daynebatten January 8, 2018 / 9:18 pm

      Yes, that’s roughly correct.

      If you’re just piping data from Redshift to Salesforce, do you even need to be materializing a view, though? Why not just run the query on your Redshift instance at ETL time? Is it some sort of timing constraint?

      • Sebastian January 12, 2018 / 4:14 pm

        Ooh, interesting. What I hear you saying is to just store it as a regular view and have the API tool wait for that view to load and pull from there. Is that roughly accurate?

        • daynebatten January 30, 2018 / 11:26 am

          I think you’re getting there. You don’t really “store” things as views though. A view is just a pre-defined query over a set of tables where you can reference the result set like a table in another query.

Leave a Reply

Your email address will not be published. Required fields are marked *