Introduction
In my last post (Learning Databricks – Accessing a Data Lake Using a Service Principal – Cookie Codes) I described the process for setting up a service principal in Azure, storing its credentials securely and accessing them from Databricks. This post continues from that point and uses the service principal to mount an Azure storage account. We will then be able to access files in the storage account and demonstrate how to load them into Dataframes within a Databricks notebook.
What is a Mount
You might remember in the last post that we accessed a file with the abfs protocol. To refresh your memory, abfs is a protocol for accessing files (similar to http but optimised for large data loads). An example of the format of an abfs endpoint is:
abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/<file_name>
There are a couple of issues with this approach to accessing files in the data lake:
- You need to authenticate with your credentials each time
- Semantically it is long winded and less familiar than a file system path
Mounting storage solves these issues. A mount is essentially an abstraction over the abfs endpoints: it is created once by an authorised user and can be re-used until the mount is removed. Think of it the way you might mount a network drive on your personal computer – the storage is accessed via a protocol, but it's abstracted so that you view it as a local drive.
Data We Want to Access
In order to mount a storage location – we first need said location. In my Azure account I have created a data lake with a container named ‘cookiecodes’. In this container is one csv named ‘Useless_info’.
For the purposes of this post we will mount this storage location and import the data. Please note this relies on the service principal made in the previous post.
Note: You don’t necessarily have to use a service principal for this, you could use Active Directory pass-through / Unity Catalog but both of these require Databricks Premium (which I have not subscribed to).
Mounting the Container
At this point we know the location in our storage account we want to mount. Based on the previous post we also know how to fetch our service principal's credentials from the Azure Key Vault using the secrets Databricks utility.
We need to write a notebook that performs the following steps:
- Gets the key-vault secrets
- Sets the spark configuration
- Unmounts the mount if it already exists
- Mounts the storage account container
We will do this by writing a function that can be re-used within a notebook. In this example that offers little benefit, but if you had a project where you needed to mount several locations it would guarantee a level of consistency and reduce the code you need to write.
The Notebook – Mounting / Unmounting
Make a new notebook – I've named mine 2.Mounting_adls_Containers. Firstly, we know we are going to need a function. This function is going to receive 2 parameters, both relating to what we are going to mount:
- The storage account name
- The container name
We also know we are going to need to fetch the client id, tenant id and client secret for our service principal from the Azure Key Vault (note the keys of these differ from the previous post as I re-created them with values not contained in the documented screenshots!).
At this point our notebook should look like the following:
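Something along these lines – note that the function name mount_adls_container and the secret scope and key names below are illustrative placeholders; use whatever names you created in your own key vault:

```python
def mount_adls_container(storage_account_name, container_name):
    # Fetch the service principal credentials from the Azure Key Vault backed
    # secret scope (scope and key names here are placeholders)
    client_id = dbutils.secrets.get(scope="cookie-codes-scope", key="databricks-app-client-id")
    tenant_id = dbutils.secrets.get(scope="cookie-codes-scope", key="databricks-app-tenant-id")
    client_secret = dbutils.secrets.get(scope="cookie-codes-scope", key="databricks-app-client-secret")
```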
The next step is to set the Spark config. Luckily this is well documented (Access Azure Data Lake Storage Gen2 and Blob Storage – Azure Databricks | Microsoft Learn). Using what is detailed there and passing in our secrets, we have the following:
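A sketch of that configuration, continuing inside the same function and following the OAuth settings shown in the Microsoft documentation:

```python
    # OAuth configuration for ABFS using the service principal credentials
    configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint": f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }
```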
Now we need to introduce the unmount / mount methods. These exist in the file system (fs) Databricks utility. As with all these utilities, you can look at the documentation by requesting help in a notebook.
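For example, in a notebook cell:

```python
# List all functions in the file system utility, or ask for help on a single one
dbutils.fs.help()
dbutils.fs.help("mount")
```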
In this documentation we can see a few functions we will be interested in: mount, unmount and mounts.
For our function not to error if someone wants to add a mount that already exists, we should remove that mount and then re-add it. To do that we need to know what mounts already exist (the mounts function); using this we can unmount if required. The following code should achieve this.
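Something like the following, still inside the function, with the mount point built from the two parameters:

```python
    # Build the mount point and remove it first if it already exists
    mount_point = f"/mnt/{storage_account_name}/{container_name}"
    if any(mount.mountPoint == mount_point for mount in dbutils.fs.mounts()):
        dbutils.fs.unmount(mount_point)
```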
Here we are evaluating whether any mountPoint (a property of each individual mount in the mounts collection) is equal to the one we want to create – if it is, we call the unmount method on the mount point we want to create.
Please note the /mnt/ prefix is a requirement for mount points.
Now we have this in place, the only code left is to call the dbutils.fs.mount method for the mount we want to create.
The source value maps the abfs location to the mount, the mount_point is what we will reference and the extra_configs are the Spark configuration we set up earlier with our credentials.
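Putting that together, the final part of the function might look like this:

```python
    # Mount the container, passing the abfss endpoint, the mount point and the
    # Spark configuration containing the service principal credentials
    dbutils.fs.mount(
        source=f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/",
        mount_point=mount_point,
        extra_configs=configs,
    )
```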
At this point our function is complete, but remember running this in a notebook doesn't execute the function – it creates it so that it is callable. We now need a new cell in our notebook to execute it.
We can call our function like this – mounting the cookiecodes container within the cookiecodesdl data lake.
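Using the illustrative function name from the sketch above, that call is simply:

```python
mount_adls_container("cookiecodesdl", "cookiecodes")
```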
We can display all our mounts with the dbutils.fs.mounts function we used earlier. The screenshot below shows our newly created mount.
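For example:

```python
# Show every mount in the workspace, including the one we just created
display(dbutils.fs.mounts())
```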
This is a useful function if users are manually creating mounts without a strict naming convention (creating mounts manually without a function), as you may think a mount points to a particular location when it does not.
Ok – what now?
Well, now you have a mount pointing to a particular location. It was created by the service principal and will remain in place until someone unmounts it. This means no credentials need to be passed to access this location and we can refer to the path /mnt/cookiecodesdl/cookiecodes instead of abfss://cookiecodes@cookiecodesdl.dfs.core.windows.net/
Importing the File
Let's make a new notebook – the mounting task is clearly separate from loading / analysing data. I've created a new notebook named “loading useless info”.
What is the Useless Info
The useless info file is a csv with 3 fields (it really is useless – I just made it in Excel for the purpose of demonstration).
The file looks like this:
There are 2 fields that contain integers and one field that contains strings. My approach for importing any file from a “Bronze” stage (raw import) to a “Silver” stage (processed) is as follows:
- Import statements
- Create a schema for the data
- Read the file into a Dataframe
- Rename/Drop Columns
- Write the file to “Silver” Staging
Note: Databricks documentation refers to 3 stages of data in a medallion architecture:
- Bronze – Raw Ingestion
- Silver – Filtered, Cleaned, Augmented
- Gold – Business Level Aggregations
This is documented at the following page: What is a Medallion Architecture? (databricks.com)
So let's go through these steps – please be aware there are several different approaches for each of them; those shown here are just one of many.
Import Statements
There will be some Python libraries we want to use for this; I tend to use the first cell to import the ones I will use.
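For this example, that first cell might look like this:

```python
# Data types for schema creation plus the col and current_timestamp functions
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.functions import col, current_timestamp
```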
Here I am importing the data types I will use in schema creation, along with the col (column) function and the current_timestamp function from pyspark.sql.
Schema Creation
I need to define the structure of the data, both to enforce consistency (it will error if a different file format were to arrive) and to inform the Dataframe what it will receive. The code below achieves this.
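A sketch of that schema – the column names below are illustrative stand-ins for the two integer columns and one string column in the file:

```python
# Expected structure of the Useless_info csv (names are illustrative)
useless_info_schema = StructType([
    StructField("RowId", IntegerType(), False),
    StructField("Value", IntegerType(), True),
    StructField("UselessInformation", StringType(), True),
])
```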
Here I am using the StructType object (which represents a row of data). A row contains a collection of StructFields (each representing a column of data). I need to reference the names of the fields expected, their data types and whether they are nullable.
Read the file into a Dataframe
Now we have a schema we are able to import the data, passing the schema to tell the read method what to expect. We can also use the mount we created earlier to point to the file. The code below achieves this.
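A sketch of that read, assuming the file sits at the root of the container and is saved with a .csv extension:

```python
# Read the csv via the mount, using the schema defined above
useless_info_df = (
    spark.read
    .option("header", True)
    .schema(useless_info_schema)
    .csv("/mnt/cookiecodesdl/cookiecodes/Useless_info.csv")
)
```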
You can see how we are able to access the mount we created earlier (to the cookiecodes container) from any notebook that now runs on the cluster. We also state that the file contains a header row.
At this point our data is in a Dataframe. We can look at it by displaying the newly created Dataframe.
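For example:

```python
# Render the Dataframe as a table in the notebook
display(useless_info_df)
```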
We have successfully mounted a storage location (using a service principal) and imported data into our notebook in Databricks!
For completeness I will put some examples of the other stages I mentioned.
Rename/Drop Columns
This is the stage in the process where you restructure the dataframe to give fields the names you want to reference in the future and to remove any data you feel is unnecessary for your future data endeavours.
For example with this file I might want to make the titles snake case (row_id, value instead of Value) and I might want to drop the UselessInformation column… as quite frankly it is useless!
It's also sensible to add a timestamp showing when the data was loaded; this may well help with debugging, identifying when jobs failed, or slowly changing dimensions in the future.
The following notebook cell would achieve this.
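A sketch of that cell, using the illustrative column names from the schema above:

```python
# Rename to snake case, drop the useless column and stamp the load time
silver_df = (
    useless_info_df
    .withColumnRenamed("RowId", "row_id")
    .withColumnRenamed("Value", "value")
    .drop(col("UselessInformation"))
    .withColumn("load_timestamp", current_timestamp())
)
```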
Write the file to “Silver” Staging
Now we have our “Silver” data frame, we should write it back to our storage account. I have added a directory in the cookiecodes container named Silver; this will be our target destination. Let's assume we want the file saved as a parquet file. Given this, we can write a cell in our notebook that is very similar to the one that read into the dataframe, but instead writes to our destination (using the mount).
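Something like this, assuming the Silver directory sits at the root of the cookiecodes container:

```python
# Write the processed data to the Silver directory via the mount
silver_df.write.mode("overwrite").parquet("/mnt/cookiecodesdl/cookiecodes/Silver/useless_info")
```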
We should now be able to see our exported “Silver” file in the Silver directory of the storage container. As a sanity check we can even read the parquet file back and display it in our notebook – again, notice how we can use the mount for all of these tasks!
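For example:

```python
# Read the parquet output back via the mount and display it as a sanity check
display(spark.read.parquet("/mnt/cookiecodesdl/cookiecodes/Silver/useless_info"))
```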
Conclusion
I have demonstrated how to set up a mount and the advantages this gives you with authentication and access familiarity when compared to the abfs protocol.
I have begun to explain the medallion architecture Databricks recommends for data engineering, whereby raw files are considered “Bronze”, processed and cleaned files are considered “Silver” and business aggregations are considered “Gold”. This is a much wider topic with a huge amount of variability; I will no doubt write more complete examples of it in the future, rather than this very simple CSV.
It is worth stating that, as you would expect, there is an enormous amount of functionality within Spark for importing, manipulating and exporting data. This was a very basic example, really to demonstrate mounts more than any processing, but you would be amazed how little has to be changed in this code to import JSON or data that is split across multiple files within a folder. The PySpark documentation is a good point of reference for any specific real world tasks you might face. (API Reference — PySpark 3.3.2 documentation (apache.org))
In the near future I will try to pull together a much more detailed example and show progression from raw files to establishing relationships and performing business aggregations, among other features in Databricks.