Data Warehousing with Databricks - Part 1 Dynamic Extract and Load

Introduction

I want you to do me a favour – Imagine you work for a disorganised company (that really wasn’t hard was it!). This company operates in a fairly isolated approach to business with different systems for different departments and very limited ability to link together data to show organisational metrics, its also possible due to the distributed data sources that there is often disagreements on metrics and no clear “source of truth”.

This is a fairly common issue within companies, particularly if they have grown via acquisitions, where you are literally acquiring more datapoints, more systems, more departments that operate outside your original core systems.

Data warehousing is an approach which seeks to solve this issue, the main focus is to pull organisational data from various sources into a centralised location where data is then cleaned, deduplicated, aggregated and ultimately presented to business users.

Sounds great right? but how do we go about doing this?

Well… there are many ways, I will attempt to demonstrate one approach for this. I will split my walkthrough of this into 2 parts;

Extraction and Loading of Data (the physical act of obtaining the data and moving it to a centralised location)
Transformation of Data (the cleaning, deduplicating, aggregation, modelling)

This post is showing how we can use Azure Data Factory with a metadata configuration to make a flexible approach to extracting and loading data to a centralised location, based on a fictional wine shop business.

What are the Objectives?

We know given the disorganised nature of our fictional business that we will need to pull data from various sources and in various formats, we also know it is very likely that in the future new data sources will arise that possibly we didn’t know about, some key features we want in this solution are therefore as follows;

We want a system that can extract data from multiple locations
We want a system that can handle data extraction in multiple formats (text files, databases, json etc..)
We want a system we can easily extend to add new data points without having to recode everything.
We want to store all our data in a centralised Azure Storage account (data lake)
We want Databricks to be used to access the data we move into the centralised location (a data lake house approach)

It is worth adding that this article is not demonstrating a product that is going to solve all your problems, I have spend a day coding on this, however it should seek to show you how this approach could be leveraged to fit your particular business needs.

Azure Data Factory

Azure Data Factory is a cloud service from Microsoft that helps you move and transform data between different places, it supports connectivity to on-premise data sources as well as those in the cloud and also supports a wide variety of data formats. You can set up pipelines to pull data in and send it where it needs to go — all without having to manage the servers yourself. We will be using this service in this post.

Metadata Configuration

One of our objectives it to make something we can easily extend without having to constantly recode our service. One of the ways we can do this is to use metadata configuration within Azure data factory.

Metadata configuration is an approach where within our code we will not hard-code any values to data sources, data locations, and various other settings but instead fetch these from configuration. This works as follows;

I will create a text file (metadata) of things like source file paths, destinations, formats, and transformation rules.

This text file will be read at runtime.

The pipeline can then process many data sources in one execution, with the details of each data source being provided from the metadata text file

This approach means if we need to add new data sources, or change their locations (either target or source) we can do this by editing configuration rather than any code changes.

Creating the Metadata

We need to create a text file in a format that can be easily read by a data factory process. We also need to consider what we need to have in this metadata file to be able to allow it to tell the process what needs to be done. I will use a JSON file to store the configuration and will have objects/attributes in the JSON to tell me;

What data to load (each object within a list will represent this)
The “type” of data to load (text file, database table etc..)
The source location of the data
The target location of the data
The source format of the data
The target format of the data
Some secure way of telling the process where to get credentials (if required)

Depending on what data is to be loaded we may well need to know different properties, for example – if we are connecting to a database we will need to know a database connection string (the server, the database, authentication) – however if we are connecting to a file in cloud storage there is no database server.

because of this each object in our JSON file can have different attributes, based on the type of data we are loading.

Considering all of these factors we could make a metadata config file like below;

Here you can see our configuration is a collection of objects, we specify a csv which appears to be stored in azure storage at the following path stgcustomers/customers/customers/customers.csv it is comma delimited. The target is dataplatformstg1/raw/customers/customers.csv

What we need to do now is set up an Azure Data Factory service that can connect to this metadata, read it and then use the data here to set up a process to copy this file.

Azure Data Factory – Loading a CSV to Storage

In Azure we need to create a new data factory instance. We then need to grant the service principal of this instance access to the storage accounts it needs to read from / to (I am assuming they already exist). This will give the service permissions to read/write data to these locations.

Reading Metadata

Within our data factory our first task is to make it able to read the metadata. Lets assume we are storing our metadata at the following location in azure storage;

dataplatformstg1/adf-metadata/adf-metadata/adf-metadata.json

Create a Linked Service – Azure Storage

We need to create a linked service in Data Factory to allow our data factory to connect to this. Think of a linked service as a way of telling data factory where the data lives, similar to a connection string. this allows the Data factory service to be able to connect to a data source.

Given we are storing our data in an Azure storage account, we need a linked service for Azure storage. Navigate to linked services in data factory and select a new linked service – search for the type of storage account you are using (gen2 in this case);

Click continue.

We now need to enter the linked service details, but remember we are seeking to make the connections we use dynamic – they will be access from the metadata configuration, this means that we want to use one linked service for any data lake storage connection regardless of what the data is. Because of this we need to parameterise the linked service to allow us to inject configuration into it to change the connection at runtime. We will enter the following configuration;

and set a parameter for “accountName”;

this parameter will allow us to inject a storage account at runtime, meaning that when we loop through our metadata we can obtain the attribute of the storage account name and pass it into this linked service.

You may notice I am defaulting the parameter to the storage account that contains our metadata (this location is also the target for our data extractions) but this is not required. Assuming your data factory has access to this storage account via an AD role to the service principal. You should be able to test the connection;

Create a Dataset – JSON

At this point we have the ability for Data Factory to connect to a blob storage account we inject into the linked service (provided it has permissions). We now need to make a dataset object, datasets dictate what data is there within a connection and what format it has. Create a new Dataset for JSON (the format of our metadata) – configure it as follows;

you can see we have 4 parameters configured;

Again, the goal of these parameters is to make this JSON dataset be generic, in that we could use this data set for any JSON data we need to move / write. The parameters injected dictate the location of the json file (accountName/container/directory/file) – we are assuming it will always come from an azure storage account (using the storage account linked service), while this could also be set dynamically (see “Whats Next“) it didn’t seem appropriate for my simplified example.

Reading our Metadata

Given we now have a linked service (connection) and a dataset (specifying what data is there) we are now able to set up a pipeline that reads from our metadata. Create a new pipeline and add the “lookup” task

In the settings of the lookup task select the json dataset and pass appropriate parameters to tell the task its location;

you can then click the preview data button and if it is set up correctly you should be able to see the JSON metadata;

We now have a task in a pipeline which can access and read our metadata.

Creating a CSV Dataset

we can see in our configuration we want to read data from a comma separated text file originating from Azure storage, while we have an azure storage linked service, we don’t currently have a csv dataset – so we will need to make one. Create a generic CSV dataset, this is very similar to the JSON one we made earlier;

Again – we have all the previous parameters to signify the location of the text file in the storage account, we do have one new one “columnDelim” which specifies the delimiter used (comma in this case).

Using Our Metadata

What we need to do now is create tasks and pipelines which use the metadata. We need to consider how this will work, on a high level we know that;

We will loop through each object in the metadata (each object representing a data point to be loaded)
We will evaluate the “type” of data to be loaded – this will dictate what connection we use, what config is expected etc..
Execute the data load for that “type” – moving the data from the source to the target in a specified format.

To achieve this we will need to connect our lookup task to a loop container (looping through each object in the metadata);

the items we are looping over are the output values of the lookup (@activity(‘Lookup Metadata’).output.value). You can also see within the loop is one activity – a switch statement based on the “type”, lets look at that;

Here we are evaluating the “type” attribute in the metadata, if the value is csv – we will execute a new pipeline signifying to load a csv, lets look at this pipeline;

This pipeline consists of a copy data task – we are going to copy a csv from an Azure storage source to an azure storage target, the source/sink can be configured as follows;

Source

Sink (Target)

Here you can see we are using the same generic delim_text dataset for both – but we are changing properties in it. In this case I have made the decision that I want all the data in the target data lake that is delimited to have a pipe (|) delimiter, this is just a small standardisation I am applying, its not entirely necessary I just think it will make some downstream tasks somewhat easier.

Passing Parameters from Metadata to the Load CSV pipeline

If you are observant you might be thinking – how do these parameters get set? well back in our loop task, we can inject these when we execute the child pipeline;

Please note that these parameters are taken from the current item within the loop (each object within the metadata that in this case has a type of csv).

At this point we can try to execute our pipeline – all the configuration is set up to load our CSV from its source storage account to the destination. If we run the Data Factory Main pipeline in debug mode;

we can see our tasks have completed successfully. We can validate this by looking in the target storage account and see the file is written;

we can download / edit it and see that it is now a pipe separated csv and formatted correctly;

This is great, at this point we have a dynamic metadata driven loading process that can move delimited text files to and from various blob storage accounts (they can be in different subscriptions, RG’s etc..)

Bringing in More Data – Additional CSVs

lets demonstrate why this metadata approach is so useful – lets say there are 2 more delimited text files that we need moving from another department within the business, they are at different locations (different storage account and paths) and have a different format (pipe delimited not comma delimited)

products.csv (wines)

products2.csv – Spirits

As all our pipeline code is dynamically driven from the metadata – all we need to do to include these files in our extract and load process is to update the metadata with appropriate configuration, this could be done as follows;

We are telling our configuation there are more files – telling them the location of these files (a different storage account and path) and we also specify the delimiter used and the target destinations. Provided the Data Factory has access to this storage account, this should be sufficient to load these files with no code changes.

If I re-run the data factory main pipeline with this configuration;

you can see that 3 executions of the child pipeline have occurred – we have 3 delimited text files to load now. We can see this was successful and see them in the target location;

Hopefully now if you’ve read this far (firstly.. thanks and well done) you can see why this metadata configuration approach is useful – we need more files loaded, we add them to the list, that’s it.. no code changes.

However this leads to the next question – “OK.. but what if I want something that isn’t a text file – there is something in a database”. Lets look at that now.

Bringing in Different Data – SQL Server

While the metadata configuration allows you to bring in more of the same data by modifying it, whenever you have a new “type” of data – you will need to set up data factory configuration for it, once it is done though you can re-use this config for anything of that “type”.

Lets assume this company has an orders department that uses an SQL server database, we need to move the orders table shown below to our target storage;

What we need to do is very similar to the set up of the generic CSV to Azure storage process we created earlier;

Add an azure Key Vault linked Service (for accessing secrets)
Add a linked Service for SQL Server databases
Add a dataset for an SQL Server Table
Add configuration into the metadata file specifying location, target and authentication secrets
Add a dataset for the target file type (parquet)
Add a new switch statement (sqlserver) and pipeline to execute the copy data task

Azure Key Vault Linked Service

I am going to use SQL Server authentication for this example – you could also use Managed Identities, but this allows me to show an approach of how secrets can be securely access via Data factory with this approach. I have created a key vault in Azure named “dataplatform-kv1”. I have granted my Data factory principal access to read secrets from this key vault and I have one secret stored in the vault – the password for the SQL authentication, the secret name is “orders-db-sql-pw”.

I can then create a linked service to access this key-vault (i’ve hardcoded this, but using the same parametrisation idea you could inject the key vault to use), I simply think having one key vault with the data platform secrets is easier to manage / rotate secrets etc..

Given this linked service data factory can now connect to our key vault and securely access secrets.

Azure SQL DB – Linked Service

the configuration for this is shown below;

We have parameterised this to allow us to connect to any Azure SQL DB that the data factory has access to we have the following parameters;

database_server_name
database_name
sql_user_name
sql_db_secret_name (the name of the secret in our keyvault – not the value)

Azure SQL DB – Dataset

As before – we have the ability to connect to a database via our linked service – but we need a dataset to define the database table object;

here we are using the linked service created and injecting parameters to that – we also have 2 new parameters schema and table_name – these will be injected from the metadata to tell the dataset the particular table we are interested in loaded (in this example dbo.orders).

Adding Metadata Config

We now know the parameters we are using to connect to a particular table within a database, we also know the parameters we need to write to our target storage account – we can add a new metadata object containing these;

.parquet? what is that?

Well…. in the raw layer of my data platform I want to keep things as similar as possible to the source. I dont want to replicate an actual database there, I want files to be loaded into databricks – by using a parquet file I can keep the schema that the original database has. I am making the decision that any Azure SQL table loads will be moved to the raw folder as a parquet file rather than a csv.

Adding Parquet Dataset

As parquet is a format we haven’t used in our data factory configuration so far – we need a dataset for it. This can be seen below;

Modifying Data Factory Config

We now have everything we need to add SqlServer as type to our data factory process we will add a line in the switch statement within the loop and then have a new child pipeline for copying Azure SQL;

the pipeline is a copy data task dynamically set from the current item in the loop;

Source

Sink (Target)

Values injected From the loop;

Executing the Job

we can now run the main pipeline;

you can see an Execute SQL Server Load child pipeline has completed succesfully;

and we can see the parquet file in the target of storage;

At this point we can see that our Azure SQL Server Load has worked, any future Azure SQL Server data loads can now be achieved by simply adding objects to the metadata configuration.

Conclusion / What’s Next

I hope this article has shown you how Azure Data Factory can be used to extract and import data into a centralised location, I also hope that you understand the power of using it in a dynamic way with metadata. the key point here is that is that once you have set up a particular linked service you can re-use as much as you like by editing a text file, so requesting 100 new data points to be moved suddenly becomes trivial.

This is a simple example, and the goal is not to show you an enterprise product, simply to show how some functionality works and what is possible obviously you can take this much further and there are more things to do be done to make this “production ready” these include;

Further parameterisation (you could parameterise the datasets themselves – for example I assume that the delimited text datasets are always on blob storage – they might not be and could be in an AWS S3 bucket or an FTP, making this dynamic would give you much more flexibility in a wider enterprise environment, but for me was beyond the scope of my example).
Dev Ops pipelines – I would want to hold my data factory code in Azure dev ops and have pipelines to build / deploy it, Given we are going to use a Databricks service to transform this data the project would also hold all the infrastructure around that (catalogs, registering the metastore, mounting external locations, permissions etc..), I would use Terraform to manage this.
Transformations – Currently we have moved data to a centralised location, but aren’t really doing anything with it – Part Two of this post will focus on using Databricks to make this centralised data useful.

Cookie Codes

Data Warehousing with Databricks – Part 1 Dynamic Extract and Load