Introduction
One of the first tasks to learn when using Databricks is how to access data held in a data lake. I will be using Azure as my cloud provider in these posts, so my data lake will be Azure Data Lake Storage (Gen2). This post walks through the steps needed to create a data lake and set up authentication using a Service Principal.
Why a Service Principal?
There are several ways of authenticating against a storage account from Databricks. Probably the most common within organisations is Azure Active Directory credential passthrough (with RBAC), but that feature is only available in Databricks Premium (which I have not subscribed to), and for my cloud environment of a single user it is total overkill.
Service principals are often used for automated tasks: a service principal is effectively a non-human user that runs jobs across the cloud/Databricks estate. Good examples of where a service principal would be used (even if RBAC were in place) are triggering jobs from Azure Data Factory or executing jobs within Databricks directly.
What we will need to create in this task
The following services will need to be created in Azure for this task:
- A Databricks cluster
- Data Lake Storage account
- Service Principal in Active Directory
- Azure Key Vault (to hold authentication secrets)
Databricks Cluster
Within the Databricks workspace we need to create a cluster to execute code against. A cluster is simply one or more virtual machines that run tasks for us (distributed across several machines where possible). Given these are VMs, there is a wide range of settings that can be applied. I have created a basic Standard DS3 cluster.
It has one driver node and one or two worker nodes, meaning there is a maximum of three VMs in the cluster.
Data Lake Storage Account
I created a new storage account in the Azure portal named cookiecodes. We need this to be a Data Lake Storage Gen2 account, so make sure that option (the hierarchical namespace setting) is selected.
Creating a Service Principal
We need to create a Service Principal in Azure Active Directory; it is this principal that we will use to interact with our storage account. The first step is to open Azure Active Directory.
Registering the Service Principal
Select “App registrations”, click “New registration” and enter a relevant name. Once you have registered this new app you will see the following page, which contains some vital information you will need shortly.
You need to note the application (client) ID and the directory (tenant) ID. I have decided to share the IDs and secrets I generate in this post; this would be a security blunder, except that I deleted all the resources I created before uploading this article, so no vulnerabilities are created by doing so.
Generating the Client Secret
Within the principal we just created, click the “Certificates & secrets” option, then select “New client secret”. Enter a relevant name, select an expiry time and click Add.
At this point you will see the window below. You will only be able to see this once, so it's important that you store the generated value somewhere. Copy the value highlighted below.
At this point we have the Service Principal created and have generated a secret for it. We don't want to store these values in plain text as that presents a security risk; we should store them in an Azure Key Vault.
Creating a Key Vault
Create a new Key Vault resource in the Azure portal. Give it a sensible name and put it in your selected region. As the creator, I will have full access to the key vault. In reality we might also want to put some controls in place dictating which IP addresses can connect to it, but for this example that isn't necessary.
We now have a key vault created. Click the “Secrets” option and you will see the following.
We should add our three pieces of sensitive data (client ID, tenant ID and secret) into this vault. Click Generate/Import.
Fill in this form for all three pieces of sensitive data. You should see something similar to the below.
Role Assignment
We are nearly there, but we have two more tasks to complete before all our infrastructure, accounts and roles are set up. We need to add a role assignment on our storage account. Go back to the storage account, click “Access Control (IAM)”, click Add and select “Add role assignment”. Here we need to search for the relevant role, which in this example is “Storage Blob Data Contributor”.
Click on the role and select Next. We will select our previously created Service Principal by clicking the “Select members” option. You should see something similar to the below.
At this point our service principal has access to the storage account.
There is one more task we need to do: add data!
Uploading Data to the Storage Account
This particular post is much more about setting up a Service Principal and using it to authenticate than it is about data analysis (which is a shame, but we will get to that!). Because of this, I am just going to upload a simple CSV file containing nothing particularly interesting. I have the file locally on my machine.
I can then access my storage account from the Azure Portal. I have created a new container (this is basically a folder) named databricks.
If I click on this container, select the “Upload” button and choose my local file, I should then see it uploaded into the container.
Phew! All done with the set-up!
Back to Databricks!
Secret Scopes
Databricks does offer some support for holding secrets itself; however, the better approach (in Azure) is to use Key Vault. This is for two reasons:
- All your Azure Secrets remain in one place
- If you need the same secrets from other services (e.g. Data Factory executing a Databricks notebook), you can still access them
This does mean we need to do a little set-up in Databricks. We need to create what Databricks calls a “Secret Scope” and link it to the key vault.
Navigate to the Databricks homepage (not the workspace page); you can get to this by clicking the Databricks logo next to the search bar.
Once you are here you need to access a hidden UI within Databricks. Take the URL in your browser and add “/secrets/createScope” to the end of the address (note the upper-case “S” in createScope). You should now see the following.
We need to complete this form. These fields are explained below.
| Field Name | Description |
| --- | --- |
| Scope Name | The name of the scope; how you will refer to it. |
| Manage Principal | Creator (allows only the creator to use it; Premium only) or All Users (allows anyone to use it). |
| Key Vault – DNS Name | The URI of the key vault (the Vault URI in the Key Vault's properties). |
| Key Vault – Resource ID | The unique ID of the key vault resource (the Resource ID in the Key Vault's properties). |
If you go back to the key vault you created and look at its properties you should be able to see these values; mine are shown below.
So my completed form in Databricks looks as follows.
Click Create.
Notebooks
We are going to create a notebook attached to our cluster. Notebooks are where you write code to query data in Databricks. They are very similar to Jupyter notebooks (if you have used them): you write code in cells that can be run individually (and in different supported languages if required), and you can also add rich text via Markdown to help document your code.
Put simply this is where you write code.
If you create a new notebook attached to your cluster you should see something like the following.
What we want to do is connect to our storage account using the service principal we created, obtaining the secrets from the key vault.
Luckily Databricks has an in-built utility to help us fetch the secrets: dbutils.secrets.
Let's demonstrate this in our notebook.
The help command shows us all the options within the exposed utility.
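A cell along these lines is all it takes (a minimal sketch of what I ran):

```python
# Show the functions available on the secrets utility
dbutils.secrets.help()
```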
Now let's list the metadata for the scope we created.
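Something like the following does it; I'm assuming here the scope was named cookiecodes-scope in the form earlier, so swap in whatever name you chose:

```python
# List the secrets visible through the Key Vault-backed scope
# ("cookiecodes-scope" is an example name - use the scope you created)
dbutils.secrets.list("cookiecodes-scope")
```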
We can see that we have successfully connected to the key vault through the Databricks scope, as the keys of our three secrets are listed. Let's fetch these and put them in suitably named variables in a new cell.
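A sketch of that cell is below; the scope and key names are examples based on what I stored in the Key Vault earlier (client-id, tenant-id and client-secret), so use the names you picked:

```python
# Fetch each secret from the Key Vault-backed scope into a variable.
# Scope and key names here are examples - match them to your own set-up.
client_id     = dbutils.secrets.get(scope="cookiecodes-scope", key="client-id")
tenant_id     = dbutils.secrets.get(scope="cookiecodes-scope", key="tenant-id")
client_secret = dbutils.secrets.get(scope="cookiecodes-scope", key="client-secret")
```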
At this point we have our secrets stored as variables, but importantly they are never exposed in the notebook. This means that should we publish the code to source control, the sensitive values would never be revealed.
Databricks also prevents this data from being displayed; look at the output of the following command.
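For example, trying to print one of the variables just gives a redacted placeholder:

```python
# Databricks redacts secret values in notebook output
print(client_secret)
# Output: [REDACTED]
```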
It is aware I'm trying to display sensitive information and shows '[REDACTED]'.
Now that we have the necessary information for our service principal, we can set up a connection to the storage account. Thankfully the settings for this are well documented.
We can enter the Spark settings and replace the secrets with our variables.
Please note that the name of the storage account is referenced in the keys (the first value in each of the pairs passed).
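Mine looked roughly like this (a sketch following the documented OAuth settings for ADLS Gen2; the variables come from the earlier cell and cookiecodes is my storage account name):

```python
# OAuth configuration for the storage account, using the service principal.
# "cookiecodes" is my storage account name - note it appears in every key.
storage_account = "cookiecodes"

spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")
```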
Notice how, again, nothing is exposed: we are using our variables (which can't be displayed) to handle the sensitive data. At this point we should be able to connect to our file in the storage account (fingers crossed). Let's try to query it.
To access the file we need to use the ABFS protocol. This is similar to HTTP but optimised for big data workloads; it is a Hadoop filesystem driver for Azure Data Lake. The structure of the path is as follows:
abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/<file_name>
Yeah… not the easiest. There is a way of abstracting this to act like a file system, which you can do by creating a mount (possibly the next post), but this post is already getting long and it is getting late!
Please Google this protocol if that's your bag! The example below is me trying to access the file I uploaded.
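A sketch of that read, using the container and storage account from earlier; the file name sample.csv is just a placeholder for whatever you uploaded:

```python
# Read the CSV from the data lake over abfss.
# Container "databricks" and account "cookiecodes" are from earlier;
# "sample.csv" stands in for the file you uploaded.
df = spark.read.csv(
    "abfss://databricks@cookiecodes.dfs.core.windows.net/sample.csv",
    header=True,
    inferSchema=True,
)
display(df)
```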
You can also see the data frame displayed with data from the file!
Well… that was a marathon, but now the necessary set-up exists and I can just connect from each notebook.
Well that would be the case had I not displayed secrets all over this site. Time for me to remove everything I just created!
Conclusion
I hope this post will be incredibly useful to anyone who needs to set up a service principal within Databricks. While most users of Databricks will likely have RBAC via Active Directory credential passthrough for their own accounts, it's quite probable they will need to create a service principal should they want to automate any jobs within Databricks, so this process is worth knowing.