Introduction
I had a job interview today and one of the topics raised was importing JSON-based data (which can be considered unstructured) into a table format (structured). I offered a few ways this could be done and I think I answered the question appropriately, but the interviewer said a number of applicants had found it difficult. Because of this I thought I would show a couple of ways it can be achieved. Obviously there are hundreds of ways of doing this, but I have shown approaches using the most common languages for a Data Engineer, SQL and Python, along with an example of how an ETL tool could be used.
Example JSON
We can use a JSON file containing colours for these examples.
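The original file isn't reproduced here, but assume something along the lines of the sketch below: a collection of colours under a "Colors" root, each entry carrying a nested code object with rgba and hex values (the values are purely illustrative).

    {
      "Colors": [
        {
          "color": "black",
          "category": "hue",
          "type": "primary",
          "code": { "rgba": [0, 0, 0, 1], "hex": "#000" }
        },
        {
          "color": "white",
          "category": "value",
          "type": "neutral",
          "code": { "rgba": [255, 255, 255, 1], "hex": "#FFF" }
        }
      ]
    }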
We want to take this JSON and format it into a table as below:
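Roughly one row per colour, with the nested code values flattened out into their own columns (again purely illustrative):

    color  | category | type    | rgba               | hex
    black  | hue      | primary | [0, 0, 0, 1]       | #000
    white  | value    | neutral | [255, 255, 255, 1] | #FFF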
Using Azure SQL Server
Let's assume we have our JSON file in an Azure storage container.
We also have an Azure SQL Database, which we can connect to with SQL Management Studio.
Accessing the File
Firstly we need a means of accessing the data in the storage account. We can do this by generating a shared access signature for the container; when creating it we can specify which IP addresses are allowed, the time window during which access to the container is permitted, and so on.
Once we have this signature we need to bind it to a database credential in our database. To do this we need to run two commands. Firstly we need to create a database master key, a symmetric key used to protect the private keys of certificates and asymmetric keys that are present in the database.
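A minimal sketch of the first command (the password is just a placeholder):

    CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<StrongPassword123!>';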
Here the password must meet the Windows password policy requirements of the computer that is running the instance of SQL Server.
Once we have this we can create our database credential to access the storage account, via the shared access signature.
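Something along these lines (the credential name is my own choice and the secret is a placeholder; note that the SAS token should be pasted without its leading question mark):

    CREATE DATABASE SCOPED CREDENTIAL BlobCredential
    WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
    SECRET = '<SAS token>';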
The secret here is the SAS token generated by the creation of the shared access signature from the storage account.
We can now use this credential to map the blob storage account as an external data source; this can be done as below.
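For example (the data source name, storage account and container are placeholders; the credential is the one created above):

    CREATE EXTERNAL DATA SOURCE BlobStorage
    WITH (
        TYPE = BLOB_STORAGE,
        LOCATION = 'https://<storageaccount>.blob.core.windows.net/<container>',
        CREDENTIAL = BlobCredential
    );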
The database should now be able to access the file.
Querying the JSON File
You can use the OPENROWSET and OPENJSON functions in SQL to access the file in blob storage directly and parse the JSON. If I run a query along the following lines:
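(The sketch below is not the exact query; the file name, data source name and column definitions are assumptions based on the sample file shown earlier.)

    SELECT j.BulkColumn, c.*
    FROM OPENROWSET(
            BULK 'colors.json',
            DATA_SOURCE = 'BlobStorage',
            SINGLE_CLOB
         ) AS j
    CROSS APPLY OPENJSON(j.BulkColumn, '$.Colors')
    WITH (
        color    VARCHAR(50)   '$.color',
        category VARCHAR(50)   '$.category',
        type     VARCHAR(50)   '$.type',
        rgba     NVARCHAR(MAX) '$.code.rgba' AS JSON,
        hex      VARCHAR(10)   '$.code.hex'
    ) AS c;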
I obtain the result set below.
BulkColumn contains the whole JSON file, but you can see each property is parsed out into the following columns that were cross applied. Note that the rgba column still contains JSON, as it is a nested value holding other elements; we could parse this further into its separate parts if we wished.
We have successfully taken the unstructured JSON file and parsed it into a table format using just SQL.
Using Pandas
If we didn't want to use SQL functions and instead wanted a scripted approach to this task, Python would be a common choice. Given pandas' popularity in data science, we could use it to perform the task.
Firstly we need to read the file, using the JSON library.
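A minimal sketch, assuming the file is available locally as colors.json:

    # read the JSON file into a Python dictionary
    import json

    with open("colors.json") as f:
        data = json.load(f)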
From here we can use a very useful pandas function called json_normalize to convert the JSON into a dataframe.
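A sketch of that step (json_normalize is a top-level pandas function from pandas 1.0 onwards; in older versions it lives under pandas.io.json):

    # flatten the "Colors" collection into a dataframe;
    # nested properties such as code.rgba and code.hex become dotted column names
    import json
    import pandas as pd

    with open("colors.json") as f:
        data = json.load(f)

    df = pd.json_normalize(data["Colors"])
    print(df)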
This gives us the result we are after; the output to the console is as follows.
We could then write some fairly standard code to insert this dataframe into the database.
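A sketch of that code, continuing with the dataframe from the previous snippet and assuming pyodbc, a placeholder connection string and a pre-existing dbo.Colours table:

    import pyodbc

    # rename the dotted column names produced by json_normalize
    df = df.rename(columns={"code.rgba": "rgba", "code.hex": "hex"})

    # cast everything (including the rgba lists) to strings
    df = df.astype(str)

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=<server>.database.windows.net;"
        "DATABASE=<database>;UID=<user>;PWD=<password>"
    )
    cursor = conn.cursor()

    # loop through the dataframe, issuing one parameterised insert per row
    for row in df.itertuples(index=False):
        cursor.execute(
            "INSERT INTO dbo.Colours (colour, category, type, rgba, hex) "
            "VALUES (?, ?, ?, ?, ?)",
            row.color, row.category, row.type, row.rgba, row.hex,
        )
    conn.commit()
    conn.close()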
We renamed the columns that carry the hierarchy path (the properties of code) so their names do not contain full stops, cast the objects as strings, and then set up a connection string, looping through the dataframe and writing an insert SQL statement for each row (the colours table already existed in the database).
Upon completion we can select from the database.
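For example, against the table assumed above:

    SELECT colour, category, type, rgba, hex FROM dbo.Colours;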
We can then see that our values were inserted as we expected.
Using an ETL Tool: Azure Data Factory
Firstly you need to set up Linked Services in Azure Data Factory; these will be connection strings to the source JSON file (in blob storage) and the target SQL database.
From here you need to create two datasets: one representing the file to be imported and one representing the table to be imported into. I created a target table in the database with a simple create table statement:
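Something along these lines (the table name, column names and types are assumptions):

    CREATE TABLE dbo.Colours (
        colour   VARCHAR(50),
        category VARCHAR(50),
        type     VARCHAR(50),
        rgba     VARCHAR(100),
        hex      VARCHAR(10)
    );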
The database dataset points to this table, and the JSON dataset points to the specific file to be imported.
You can test these datasets with the preview-data option to check they work. You will need to configure a service account to allow the data factory to access the DB.
Once we have this we can make a new pipeline with a copy data task.
We set the source to the JSON file and the sink to the target database.
The only stage left is to instruct the task how to map the JSON structure to the database table; this is done via the Mapping tab.
Here we say we are iterating over a JSON collection, with the root of the collection being "Colors". We then map our attributes. The rgba property is a JSON array; we could handle it as a secondary collection, but we have not done this in the previous examples, so I tell the task to handle it as a string (map complex values to string).
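For reference, in the pipeline's JSON this mapping appears as a TabularTranslator on the copy activity; a rough sketch, with the property paths assumed from the sample file shown earlier:

    "translator": {
        "type": "TabularTranslator",
        "mappings": [
            { "source": { "path": "['color']" },        "sink": { "name": "colour" } },
            { "source": { "path": "['category']" },     "sink": { "name": "category" } },
            { "source": { "path": "['type']" },         "sink": { "name": "type" } },
            { "source": { "path": "['code']['rgba']" }, "sink": { "name": "rgba" } },
            { "source": { "path": "['code']['hex']" },  "sink": { "name": "hex" } }
        ],
        "collectionReference": "$['Colors']",
        "mapComplexValuesToString": true
    }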
We are now in a position to run the Data Factory pipeline.
The pipeline has completed successfully (the duration is 57 seconds because the job was queued for a while on the cloud agent; the performance is certainly not this bad!).
We can then check the database table and see that we have our data imported in a structured format.
Conclusion
Obviously there are hundreds of ways of achieving this; most programming languages and ETL tools have this functionality.
If the job needed to be repeatable, automated and possibly extended frequently (new files loaded as part of a job), I would probably prefer the ETL tool route (Data Factory/Databricks/Azure Automation), whereas if this were more of a one-off, side-of-the-desk task, scripting it is faster to set up and requires fewer resources in the cloud to perform the task.
Regarding what I was told in this interview, I must admit I was surprised to hear that many candidates who applied for this role struggled to explain how to perform this task. I think perhaps it's that pure database (SQL) developers are not necessarily familiar with unstructured data, whereas developers who write object-oriented code or use JSON data a lot via web services see this type of problem daily.