Introduction
This post continues the Databricks theme. The posts so far have covered an introduction to Spark / Databricks as a concept and a couple of administrative tasks (creating a service principal, mounting and accessing storage). This post is similar in nature: it shows how you can connect GitHub to Databricks so you can store your code in a repository, branch and raise pull requests in the way you are familiar with.
Creating a Repository
It goes without saying that if we want to connect a GitHub repository to Databricks we first need a repository. Create one in GitHub as normal. Here I have created a repository named DatabricksNotebooks. It is currently empty.
Access Tokens
We will need an access token to allow Databricks to authenticate with this repository. These can be obtained in GitHub. Click your user icon in the top-right corner of the screen and select “Settings”.
Then select “Developer Settings”.
From here select “Personal Access Tokens” and then “Generate a personal access token”. Enter a note that describes what the token is for.
You can also set the scopes dictating what this token is allowed to do in the repository. Here I have selected the DatabricksNotebooks repository with the “repo” scope.
Scroll down and click the generate token button. You will then see a screen that shows the token. Copy this – you will not see it again, and if you lose it you will have to create another one. It looks like below;
Note: Don’t worry – I revoked this token and created another one, this isn’t a security breach!
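As a quick sanity check before handing the token to Databricks, you can confirm it authenticates against the GitHub API. This is a minimal sketch, assuming the requests library is available and that you have copied the token into an environment variable I am calling GITHUB_TOKEN (the variable name is just for this example):

```python
# Sanity-check a newly generated GitHub personal access token.
# GITHUB_TOKEN is a placeholder environment variable for this example.
import os
import requests

token = os.environ["GITHUB_TOKEN"]

resp = requests.get(
    "https://api.github.com/user",
    headers={"Authorization": f"token {token}"},
)
resp.raise_for_status()  # raises if the token is invalid or expired
print("Token authenticates as:", resp.json()["login"])
```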
Now we have a repository and token, we can set up our connection in Databricks
Databricks – Initiating Git Integration
Open your Databricks workspace. Click your username on the top right of the page and select User Settings. From here select the “Git Integration” tab. You should see something like below;
We are using GitHub, so we need to select that as our Git provider. Enter your GitHub username and the token we just generated, then click Save.
At this point the token is securely saved in Databricks, so we should delete any other copies of it we made along the way!
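As an aside, the same step can be scripted against the Databricks Git Credentials REST API rather than clicking through User Settings. This is a hedged sketch rather than the exact flow above: the DATABRICKS_HOST, DATABRICKS_TOKEN and GITHUB_TOKEN environment variable names are placeholders, and you should check the API reference for your workspace before relying on the exact provider string.

```python
# Store the GitHub username + personal access token in Databricks via the
# Git Credentials API (the scripted equivalent of the Git Integration tab).
# All environment variable names below are placeholders for this example.
import os
import requests

host = os.environ["DATABRICKS_HOST"]      # e.g. your workspace URL
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

resp = requests.post(
    f"{host}/api/2.0/git-credentials",
    headers=headers,
    json={
        "git_provider": "gitHub",                      # provider string Databricks uses for GitHub
        "git_username": "my-github-username",          # your GitHub account
        "personal_access_token": os.environ["GITHUB_TOKEN"],
    },
)
resp.raise_for_status()
print(resp.json())  # returns the credential id Databricks assigned
```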
Databricks – Databricks Repo
We now need to add our repository to our Databricks workspace. Click the Repos option on the sidebar.
And select “Add Repo”. You should now see the following.
What this will do is clone our Git repository to make a “local” copy, but unlike traditional Git usage our local copy will live in the Databricks workspace. Enter the Git URL, the repository provider and the repository name (what you want it to appear as in Databricks).
Click “Create Repo” (this really means clone repo). At this point I can see my repository in the Databricks workspace.
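For completeness, the clone can also be done through the Databricks Repos REST API. Again this is a hedged sketch rather than exactly what I did in the UI; the workspace path under /Repos and the environment variable names are illustrative.

```python
# "Add Repo" via the Databricks Repos API: clones the GitHub repository into
# the workspace at the given /Repos/... path. Placeholder names throughout.
import os
import requests

host = os.environ["DATABRICKS_HOST"]
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

resp = requests.post(
    f"{host}/api/2.0/repos",
    headers=headers,
    json={
        "url": "https://github.com/my-github-username/DatabricksNotebooks.git",
        "provider": "gitHub",
        "path": "/Repos/me@example.com/DatabricksNotebooks",
    },
)
resp.raise_for_status()
print(resp.json())  # includes the repo id and the branch that was checked out
```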
The repository is empty (as we would expect). Let’s look at how we can interact with this repository.
Interacting with the repository
Before we go into the details of creating branches it is worth looking at common branching strategies used in Git. Below is a diagram that shows a fairly standard Git branching strategy.
- We have our main branch – what will actually be released to production
- We have a dev branch – which is where all the individual developments are merged, and when a release is made this development branch is merged into main
- We have individual feature branches – developed at different times and merged into dev
Initial Commit
Before we create any branches we need an initial commit – something has to exist in the repository. Let’s make a folder structure to hold our notebooks.
You can create objects on the currently checked-out branch of the repository by right-clicking on it and selecting Create.
Alternatively, you can import items by selecting the Import option. In this example I am going to create a folder named “Notebooks” and add a notebook into this folder;
Notice how the user interface doesn’t actually change much – I am interacting with folders and notebooks the same way as I did before, they just now live in a branch in a repo.
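If you would rather script this step, the Databricks Workspace API can create the folder and import a notebook into the repo path. This is a sketch under the assumption that the Workspace API accepts /Repos/... paths in your workspace; the paths and notebook name are illustrative.

```python
# Create the Notebooks folder and import a one-cell Python notebook into the
# checked-out Databricks repo via the Workspace API. Paths are illustrative.
import base64
import os
import requests

host = os.environ["DATABRICKS_HOST"]
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}
repo_path = "/Repos/me@example.com/DatabricksNotebooks"

# Create the folder inside the repo
requests.post(
    f"{host}/api/2.0/workspace/mkdirs",
    headers=headers,
    json={"path": f"{repo_path}/Notebooks"},
).raise_for_status()

# Import a one-cell notebook (the same content as this post's example notebook)
source = 'print "abc"\n'
requests.post(
    f"{host}/api/2.0/workspace/import",
    headers=headers,
    json={
        "path": f"{repo_path}/Notebooks/ExampleNotebook",
        "format": "SOURCE",
        "language": "PYTHON",
        "content": base64.b64encode(source.encode()).decode(),
        "overwrite": True,
    },
).raise_for_status()
```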
To make this initial commit I can click the branch button. This opens a view showing my changes – I have one changed file in the Notebooks folder.
I need to enter a commit message and then I can commit my changes and push them to my GitHub-hosted repository.
If I go to the repository URL in GitHub I can see this new folder/file there.
Creating a branch
Oh no, there is a bug! print “abc” is not a valid command – to achieve what we want here we need to write print(“abc”). We need to make a feature branch to fix this bug.
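For clarity, that line is Python 2 statement syntax; Databricks notebooks run Python 3, where print is a built-in function and the statement form will not even parse:

```python
# Python 3 (what Databricks notebooks run) treats print as a function.
# The old Python 2 statement form fails to parse:
#   print "abc"   ->  SyntaxError: Missing parentheses in call to 'print'

# The fix we will commit on the feature branch:
print("abc")
```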
To create a branch click on the repository and then click the “Create Branch” button.
Click Create. I now have two branches in my repository. When you create a new branch Databricks automatically checks it out, but you can toggle between the two. Let’s update the offending print line in the notebook on my feature branch. Edit the notebook as normal.
Now revisit the repo page. You can see the change that was made in the window;
If I commit and push this change, I can now see it in that branch on GitHub.
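If you want to confirm from outside the GitHub UI that the branch arrived, listing the repository’s branches via the GitHub REST API works too. A small sketch; the repository owner and the feature branch name shown in the comment are illustrative.

```python
# List the branches of the DatabricksNotebooks repository to confirm that the
# feature branch pushed from Databricks is now on GitHub. Names are illustrative.
import os
import requests

resp = requests.get(
    "https://api.github.com/repos/my-github-username/DatabricksNotebooks/branches",
    headers={"Authorization": f"token {os.environ['GITHUB_TOKEN']}"},
)
resp.raise_for_status()
print([b["name"] for b in resp.json()])  # e.g. ['main', 'feature/fix-print']
```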
Pull Requests
Now I need to take my feature branch and merge it into the main branch. This is not done in Databricks at all; you raise and merge the pull request in GitHub in the usual way.
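If you prefer to script that part as well, the pull request can be raised through the GitHub REST API. A hedged sketch: the branch name feature/fix-print and the title/body text are placeholders for whatever you actually called yours.

```python
# Open a pull request from the feature branch into main via the GitHub API.
# Branch name, title and body are placeholders for this example.
import os
import requests

resp = requests.post(
    "https://api.github.com/repos/my-github-username/DatabricksNotebooks/pulls",
    headers={"Authorization": f"token {os.environ['GITHUB_TOKEN']}"},
    json={
        "title": "Fix print syntax in example notebook",
        "head": "feature/fix-print",   # the feature branch pushed from Databricks
        "base": "main",                # or dev, depending on your branching strategy
        "body": "Replaces the Python 2 print statement with print(\"abc\").",
    },
)
resp.raise_for_status()
print(resp.json()["html_url"])  # link to review and merge the pull request
```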
Conclusion
This post has shown how to create a new GitHub repository for Databricks artifacts, how to create separate branches, and how to commit and push them to your GitHub repository. The process is not complicated to set up, but if you are working with team members who have limited Git experience, it is probably worth explaining the notions of branches, branching strategies, commits, and pushing / pulling code before letting them loose on Databricks Repos. The main reason for this is that the Git implementation in Databricks still requires a number of tasks to be undertaken in the Git provider, specifically;
- Creating a pull request.
- Resolving merge conflicts.
- Merging or deleting branches.
- Rebasing a branch.
That being said, it’s great that there is support for source control in this way, and it greatly benefits teams with multiple developers working on the same data engineering / analysis tasks, or any team looking to incorporate DevOps practices into their Databricks usage.