How to enable Analytics Engineering using DBT

A company that follows an ad-hoc process when doing Analytics generally suffers from the following issues:

  • Fudgeable (aka 💩) analyses, where the final outcome of a report is a series manual and inherently irreproducible steps.
  • People spending countless hours arguing about who has the right numbers due to the inability to share workflows and assumptions.
  • Lack of documentation, where your knowledge of how things work leaves together with the analyst who decides one day that your company is just not worth it.
  • Inability to adapt to new requirements due to lack of testing. Given that every new requirement has the ability to break existing work, tests are gifts you send to your future self and lack of testing leads to brittleness and lack of confidence in making changes.

DBT is meant to tackle a lot of the above and provide Analysts for a way to transform existing workflow through a process that leads to better documentation, higher standards around reproducibility, and reliable testing. Personally. I see DBT not necessarily as only a tool, so much as a change in mindset for how Analytics should be done. In other words, it sets expectations for your Analysts to become more like software engineers and be more rigorous in how they approach reporting.

There are already a lot of articles around the DBT viewpoint, but not necessarily many resources around what might be the best practice in setting up a CI/CD pipeline for DBT. Below, I will cover the following two broad topics, in the hope that this article becomes more of a How-To-Guide. Specifically, I will cover the following

  • How to structure your DBT project for CI/CD and what gotchas you need to be mindful of.
  • How to set up a DBT CI/CD pipeline using Github workflows and host the docs on Netifly.

Note that nothing I describe is “best practice” or even original. The field of Analytics Engineering is still very new, so the “best practice” is a moving target. You can see an example of a DBT CI/CD pipeline on this GitHub link and can access the documentation for the example in here.

How to structure your DBT project

Your DBT project should generally have four environments: staging, local development, development, and production.

The staging environment in your data warehouse should have tables that are produced by external data ingestion tools, such as Airflow, Dagster, Stitch, etc. All analysts and the CI tool should have access to this environment. The general pattern is to try to keep the data untransformed and segregated by source or by producer. Some data warehouses like Snowflake and BigQuery make that very easy to do since you pay only for storage. For Redshift, you might have to compromise on what your store in the staging environment, since DBT currently is not integrated with Redshift Spectrum and assumes that your data is in the warehouse.

In the second environment, you want to make sure that each Analyst has access to their own development schema. Although some might argue that that leads to a proliferation of resource costs due to the fact that some Analysts don’t use the environment correctly, in my experience, there is nothing worse than not trusting your Analysts to do their job and not providing them with access to data and ability to play with it. You can always set requirements around query timeouts or even set up a transparent billing system where people can see the costs of the infrastructure they use. One important thing to note is that an Analyst should never be allowed to use their local development environment for company reporting and that expectation needs to be communicated very clearly. To enforce that, you can even set up a cron job that will periodically drop all the tables from the local development schemas.

When an Analyst is ready to submit their work for review, they can open a pull request which triggers a CI tool (e.g., Github Workflows, Travis, Jenkins) to run tests, build the dbt models, and generate the documentation. None of the models get to the final end-users and instead are run against a development environment (this can be a database, schema, or even a data warehouse). Another member on the team would then be able to review the work, make comments, query the actual data, and confirm that things are working correctly. The CI tool should also check that no selects are being done on public schema and that all models only reference other models or sources that are in staging.

Finally, the production environment simply involves the approval of the PR and a merge to master, which then triggers the CI/CD tool to rerun tests, models, and generate documentation against a production schema.

Gotchas

There are a few gotchas you should be aware of when using the above process:

  • Deleting DBT models does not remove the underlying views and tables. For this reason, you might want to add another job to the CI/CD pipeline that removes tables/views that are not referenced by DBT but that exist in your development and production schemas.
  • Make sure that the CI tool is able to catch instances where Analysts reference tables that are part of the production schemas. DBT models should never be built using public schemas. You can do that easily by checking that compiled queries contain the production schema names.
  • Run tests against sources first to check for data quality issues. This will reduce the feedback cycle in the pipeline. If you run tests against everything, you will be frustrated due to the long feedback cycle.
  • Make sure to use on-run-end in your workflows to grant access to newly created tables to end-users. In the example below I run the grant commands to the looker user after every dbt model run. Alternatively, you can also use alter default privileges to do that.
on-run-end:
- 'grant usage on schema "{{ target.schema }}" to looker_read_only'
- 'grant select on all tables in schema "{{ target.schema }}" to looker_read_only'
  • If your DBT project gets large, it would be a good idea to let the CI tool run only models that have a diff in git. This is particularly important if you use BigQuery or Snowflake since they both charge per compute or the data scanned.
  • If your team is large, you can let the CI tool create a development environment for each pull request.
  • Lastly, it is important to not forget that all of the above is a bit useless unless combined with a scheduler. You can either use DBT Cloud or any of the half dozen open source orchestration tools. If you’re a small startup and have no regulatory constraints, DBT Cloud is actually a really good deal pricewise since managing your own infrastructure does require having a lot of expensive people on the team.

Show me an example

We will go through some of the steps above and set up a CI/CD pipeline using Github Workflows and Actions.

Step 1: Define your environments

Below I define two environments. I’m using a small RDS instance on AWS since Redshift and Snowflake are pricey. Note that I have two outputs: ci and prod. The ci environment will run models against development schema and production will just generate models against public schema.

Step 2: Define your Github workflow

When defining your CI/CD workflow, you need to clearly separate which steps get triggered during a PR and which need to be run on master. DBT makes this very easy to do since it provides a — target parameter for all of its commands. For example, if you want to run models against prod, you can do the following dbt run — target prod.What is very important is to also ensure that the static documentation is set up and available in the PR. The documentation will serve several purposes: 1) it will allow your colleague to understand your work better since they can examine data lineage graphs and 2) it will provide colleagues with a high-level overview of where your work exists in a project. If you configure Netifly correctly, you will find documentation hosted for every commit:

If you’re interested in reproducing this environment, you can take a look in here.

Conclusions

DBT is a wonderful tool that allows us to blur the lines defined by role boundaries. Combined with CI/CD, it allows Analysts to write reusable modules, version control their work and take advantage of the ability to write tests and documentation. Although the project is still relatively new, I’m personally very excited about its future.

--

--

Data Scientist

Love podcasts or audiobooks? Learn on the go with our new app.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store