Data Warehousing with Databricks – Part 2 Transforming Data with DBT

Introduction

In the previous post (Data Warehousing with Databricks – Part 1 Dynamic Extract and Load – Cookie Codes) I demonstrated how Azure Data Factory can take various distributed data sources and bring them into a centralised location via a metadata driven process.

As demonstrated in that post we have our data in a centralised place – we can now transform it to make it fit for purpose for a variety of tasks, such as;

Data Warehousing
Reporting
AI Services

This post is going to show how a technology called DBT can be used to do this.

What is DBT?

DBT (short for data build tool) is a framework that allows raw data to be transformed, tested and documented using SQL, Jinja and configuration files. It focuses specifically on the Transform layer of a ELT (Extract,Load,Transform) process. Effectively you write models in SQL and DBT executes these transformations – but these models can have;

Tests applied
Dependencies calculated at runtime (the order in which executions need to take place)
History Managed (change data capture)
Documentation Generated
Dynamic configuration depending on the environment you are targeting (dev, test, prod etc.)
Load incrementally

DBT is cross platform – it can connect to most modern data warehouses such as red-shift, snowflake, Big Query or in our case Databricks.

Creating Delta Tables

In the previous post we moved files to our centralised storage account as either delimited text files (if the source was a file) or parquet files (if the source was a database).DBT requires the sources to be able to be queried directly via SQL. which means given we are using Databricks – we will need a process pre-dbt execution that converts these raw tables to a Delta format. We can do this in a Databricks notebook that executes after the Azure Data factory tasks (in the previous post) have completed.

The Databricks notebook will use the same metadata that Azure Data Factory used (the list of instructions of what to load). It will loop over this and convert the centralised text files and parquet files to Delta tables. Some code below shows how this works;

As you can see a number of functions are referenced;

get_metadata_json – Fetches the json used in the datafactory as a string, this is what is itterated over and contains the details of the locations of the files written
convert_csv_to_delta / convert_parquet to delta – A simple function that write a delta table based on a csv/parquet passed to it
register_table_in_unity_catalog – This function takes a delta table that exists in storage and registers it in unity catalog (CREATE TABLE statement with the LOCATION as the created detla)

After the execution of this notebook – there is a new schema “raw_delta” in the dataplatform catalog that contains the converted raw files;

DBT Setup

Now we have our data in a format DBT can use – we need to set up a DBT project. Fire up your IDE! You can install dbt and make it initialise a blank project for you as follows.

Use a package manager to install DBT core and the databricks adapter;

Set dbt as a enviroment variable path to make your life easier! (you can reference dbt then rather than a full path to the exe).

once dbt is installed and a recognised path you can then run the following terminal line;

where “example_dbt_project” represents your project name. The process (for databricks) will ask you for;

The host (your databricks workspace url)
http path (the http path of the warehouse / cluster you are using)
Access Token (a PAT token generated to authenticate with databricks)

Once this completes you should see upon opening the relevant folder a structure similar to this;

You will also have a folder created locally under your user named profiles.yaml. This contains your connection / environment details and is therefore outside of your repo (as you dont want these sensitive details in your repo). When deploying dbt via source control you may need to make this file using secured secrets in a pipeline.

DBT Project File

the dbt_project.yml file controls the details about your project – including the paths to relevant folders – most of this file will be left unchanged – the interesting section is named configuring models;

Here I am defining 2 schemas “prepared” and “mart”. My approach will be to take the raw data initially into a prepared schema and then into a mart schema (which will combine data and present it in a good format for reporting (spaces in field names etc..) where possible I am dictating I would like the models to be materialised as views instead of tables.

DBT Models

I can make the prepared and mart tables underneath the parent models folder – these correspond to the declarations made in the project file. I also add a source.yml file to the root of the models folder;

the Sources.yml file contains details of the source data dbt will use. These are the delta tables created in the first part of this post. These are referenced as follows

these sources are referenced in the models using Jinja (Jinja reference | dbt Developer Hub). We can make a very simple prepared customer model as follows;

with contents as below;

As you can see these are SQL statements – Dbt is being told is make a table named prep_customers in the prepared schema with that definition. the source ties back the the sources.yml value raw.customers.

I can tell DBT to execute this model as follows;

after this executes – you should see your datawarehouse / cluster start in databricks and then see some logs showing the task completed;

We can then make other models, I’ve made one for orders, products (wine) and products2 (spirits).

these models do the following;

Cast data types
Trim whitespace
add metadata for timestamp getting loaded
tidy up some data

obviously the tasks here are very subjective depending on the data you have.

Here is an example of products2 data;

Mart Models

The mart models follow the same process – at this point I am making views that AI Services / Power BI might consume from – the 2 projects tables are combined here into one products table for example;

notice how CTE’s are supported within these models. These are commonly used in DBT.

At this point we have all the models we need and we can execute a full execution of dbt (both prepared and mart models). A full execution can be performed by “dbt run”;

Now you can see how transformations are handled in DBT – we can execute a project of transformations in this approach.

DBT Test

One other feature DBT gives a developer / engineer is the ability to write data tests directly in your project. If we create a yml file for our prep customers model;

this file can contain descriptions about our data and tests on the schema. Inside this file I have the following yaml;

here you can see I have witten some description about what the key fields represent and have some inbuild DBT tests enabled. The tests dicated;

Email shouldn’t be null
The customer ID cant be null and must be unique

There are a number of in-built tests in DBT and you also have some flexibility to create your own. Further details can be seen here: About data tests property | dbt Developer Hub

one other example could be in our orders table – to put a test on the referential integrity of an ID;

here you can see we are referring one model to another via Jinja syntax. stating that the customer ID in the order table needs to correspond to a customer ID in the customers table.

We can run the test with the dbt test command;

in our case we can see a failure;

We can log the failures to a table for futher analysis as by using the store-failures option;

This writes to a location derived from your profiles.yml and a default name dbt uses you can see after executing this the folloiwing line;

If we run this in Databricks;

we can see our failure – customer ID 2 has a null email (it was like this at source). A data engineer can then decide whether this should be omitted or not / if any remediation can be done.

DBT Document Generation

the descriptions we write in the model.yml files can be used to automatically generate documentation about our data. If you run the following command DBT will generate documentation;

this generates a data catalog file in json, nice, but not the easiest to read, luckily dbt has another command that will serve a static website to you showing you this json;

This needs to be run after the docs generate command – it shows you a view of your data like this;

The files this command produces can be hosted as a static website so other members of your organisation can view it and is a useful code driven documentation feature.

DBT Incremental Loads

Currently our models would load all the data on each execution, for some forms of data (for example transactional) this might be inefficient as a transaction in the past cannot be updated today. Our orders table is a good example of this. Here we can implement some configuration to load this table incrementally. Meaning that a “watermark” is placed on each load and only more recent records are loaded on subsequent executions. The configuration for this looks as follows;

here you can see in a config section, I state the materialisation option is incremental, with the unique key being the row_id. You can also see at the end of the statement an IF statement saying – add if your order date time is larger than the current biggest order date time in the table (e.g. we only add new records each time).

This can be very effective in financial systems where once something is written into a journal it will never be changed, but be reversed in a subsequent transaction. There is no point loading the historical data more than once in such a setting.

DBT Snapshots

in addition to models, DBT supports change data capture via snapshots. These implement an SCD Type 2 approach to record changes to data over time. Let me walk through an example of this with a new table “Order_Status”;

this table represents the current state of a sales order within our system. Here we can see that the status of order 8 is “accepted”. When the status changes – the system would only display the current status, but for reporting purposes we want to track when the status changed for metrics such as “average order process time” or something of this nature.

We can set up a model in DBT to snapshot this table, by adding a .sql file into the snapshots folder;

Here we have some config stating this is a snapshot and specifying the strategy (based on a timestamp telling you that the record is updated). Snapshots are executed with the following command;

At this point we can see the classic SCD to Valid From / Valid To fields are auto generated by DBT – our example output table from this snapshot is shown below;

you can also see an autogenerated surrogate key (unique per version) dbt_scd_id – again we could define our own surrogates in the model if we wished.

Lets now demonstrate what happens if we make a change in our data updating the order status;

our order is now shipped – and has a new updated_on time (which will trigger the snapshot to see this update)

and re-run the dbt snapshot;

you can see it has end dated the previous state and given us a new record for a new state. This allows us to see the full history of the order – tracked automatically by in-build behaviour in DBT.

Conclusion

Well if you’re anything like me you are thinking “why wouldn’t I just use notebooks for this and write the SQL there?” and to some extent – given the trivial example I’ve used here that is a fair point, the volume of tables / data in this example probably isnt worth the setup time of DBT. That being said if you now imagine a full scale enterprise data-warehouse with hundreds of data points DBT then comes into it’s own. The main strengths for me are the ability for DBT to automatically work out dependency order (effectively the order in which models need to run) this is a huge advantage and changes which introduce new dependencies are catered for automatically.

If you are also using slow changing dimensions the fact these are in-built via snapshots in DBT provides a standardised automated approach to this without having to code any logic around this issue yourself.

Automated Data quality testing and documentation generation also allow for some tedious tasks to be fairly automated (for example: “Do we have a data dictionary?” … “well we can do give me a minute…” )

It’s also worth mentioning there are other features in DBT I’ve not touched upon in this post – I might revisit this in the future, some that spring to mind are;

Macros – allowing you to effectively write re-usable functions you can reference in your dbt models.
Semantic Layers + Metrics – a way of making an abstract layer on top of your DBT model that provides standardized metrics.
Support for Dynamic configuration across different environment (dev, test and prod) – here you can set environments up and reference them in Jinja, a good example is perhaps in dev/test you only want a tiny amount of the actual data you have from prod. You can set models to reference the environment executed and limit rows.

What’s Next?

Well lets just re-cap what has being achieved in the last two posts. We have taken various data sources / formats and combined them into a single centralised point via a process that is easy to extend (our metadata Azure Data Factory).

We have then got a project which can standardise, combine, transform, test and document this data for us, again that is easily extendible. This is no small achievement.

What’s next in terms of developing this solution, well I think that really falls into 2 groups – Dev-Ops operations and using the data.

Dev Ops

Currently I am running DBT from the command line on my local machine – the next step would be to have a process that runs the whole data warehouse job (extract, load, transform) in the cloud. I think this could be done orchestrated entirely from Azure Data Factory, but various other options exist including common. scheduling tools such as Airflow or Dagstar.

The following tasks would need to be executed;

Run the data factory pipeline to move all relevant data contained in the metadata config to lake storage
Execute a Databricks workflow consisting of;
- Notebook to transform Data to Raw Deltas
- Job / Notebook to initialise and run DBT

I would also consider making a deploy pipeline in Azure dev ops to allow us to build and deploy changes to the Data factory pipeline / DBT config automatically

Using the Data

AI Services – Genie

We could set up some AI services to consume from our data, one great example is the Genie service within databricks. Here you can see I have connected it to our mart data, defined relationships and provided some relevant instructions – we are then able to ask the data questions via a chat style prompt and receive answers;

you can reference previous answers in the chat;

and you can even see the SQL query that the service has generated – here is a CTE from this query;

One really cool thing about the Genie service is there is an API for it in public preview – meaning a company could have an “Ask the datawarehouse” page on the intranet.

that being said – it would be worth communicating that the responses are AI generated and might not be 100% correct – logging the responses / code generated is one good way to help update the model to make it more accurate for this purpose.

Using the Data – Power BI

While the Genie AI service is a very slick user interface and “fun” way for business users to interact with our centralised data, it can be incorrect like any AI service. Business reports tend not to want this level of potential ambiguity. I would propose any reports that are not ad-hoc queries be generated with a known codebase. We that purpose we could easily set up a power BI model using our mart tables. Something like below perhaps;

We could add the refresh of the Power BI semantic model to the end of our DW loading jobs giving us an automatically updating model where we could add standardized metrics in DAX and present this to analysts for them to do reporting on.