You Can’t Put the Genie Back Into the Bottle – Genie AI/BI

AI services are everywhere now, even (apparently) used in washing machine cycles (SAMSUNG Series 6 AI Energy), whether you like it or not those 2 letters (A.I) are on the lips of every manager you will come across for the foreseeable future, if you’re reading this blog you have no doubt heard many conversations stating “we need to use more AI”. To some extent AI is a technology currently looking for a home, touted as a silver bullet that will revolutionise the world, everyone has jumped on the bandwagon offering some services that are in my opinion pointless sales pitches (yes.. i’m looking at you washing machine cycle!) to some very useful offerings that greatly enrich human computer interaction.

Recently I have being using Databricks Genie, an AI service that allows a user to type questions in natural language about their data and receive replies in the form of SQL queries or text statements. While I am massively sceptical about the use of a lot of AI services (I can’t blame the marketing peeps, if you think you hear about using AI a lot, imagine them!), this is a service I can see real value in as it allows casual non-technical business users to be able to access, obtain and interrogate data. I am going to walkthrough a small example of setting up a Genie space in Databricks and then show how you can access it via an API.

Property Inc – Our Fictional Company

we have a fictional company that has a data lakehouse in Databricks. This company works in the property rental sector and has the following tables;

  • Rental Space – A table listing all the properties owned by Property Inc, these properties have a status “occupied, under renovation etc..) and an address
  • Estate – A parent of Rentalspace – property inc can own several rental properties in close proximity to each other for example it might own flat12 and flat13 of a building, these 2 rentalspaces would exist in the same estate
  • Tenant A customer who is renting a rental Space from property Inc, we will hold their name, email, phone number etc..
  • Lease – The contract that links a tenant to a rentalspace, this will contain a tenant id, rentalspace id, start date, end date and rent amount

It is known that there are some data quality issues within Property Inc and we are tasked with using AI to help find these and increase user interaction with the existing data warehouse in Databricks.

Setting up the Genie Workspace

We have an existing data warehouse in Databricks, it uses Unity Catalog (required for using Genie) and access Polices are all appropriate. What we want to do is create a Genie workspace to allow us to use AI services provided by Databricks on our data.

Create A Genie Space

Within Databricks select the Genie tab (under the SQL menu);

Click New and Make a new Genie space – you should see the menu below;

In this menu we can pick the tables we want this genie space to access – it is advisable to keep each workspace relatively small to start with (5-6 tables) and to keep them grouped within a particular business function, in this example we will call our space “portfolio_space” – as the tables we reference refer to the property portfolio within Property Inc. We can select the tables of interest;

Click Create.

We now have a Genie space that can reference these tables – Yes, it is as simple as that! we can now ask Genie questions about our data and have it automatically generate SQL and execute it for us. Look at the example below;

Here you can see the question – “Show me all the rentalspaces in london” – you can also see the output dataframe. If we click on “Show Code” you can see the SQL Generated by Genie;

Pretty cool eh?

Instructions and Using Genie

Obviously Genie has very little context about how the operations of this business work – if we refer to specific internal company language it might not understand what we are referring to for example we might want to refer to a specific report that is known internally within the business. We already know the SQL that generates this report – and we want that to be executed and only that when someone refers to “the portfolio report”.

If we navigate to the instructions tab – we can set known relationships in the data, known queries, functions and explain some context that genie will evaluate.

Here I am saving a query I have written and telling genie – when people state “the portfolio report” they mean this.

The question “show me the portfolio report” – now results in executing this exact query;

We can also define relationships – so Genie doesn’t have to guess;

and we can also write free-text instructions;

which can be evaluated as shown below;

Notice how searching for “under repair” correctly identifes statuses of under renovation.

Here you can see how the “current financial year” can be referenced;

you can also request basic graphs;

As you can see this is really powerful for a casual user who doesn’t know SQL or power BI and previously is not able to access data confidently. It however isn’t fool-proof and like all AI systems will not be 100% accurate.

Recommendations for Genie Workspaces

One important thing to consider is that AI services are not 100% accurate, while we are getting the huge advantage of allowing more business users to interact with their data, we do need to make them well aware that there is the possibility of incorrect data being provided to them. Databricks themselves makes several recommendations to help minimise this issue;

  • Start off small – expose a small set of tables and a manageable number of columns for specific topics
  • Use different genie spaces for different business departmental needs
  • Add primary key/Foreign key constraints to make Genie away of connections
  • Annotate the data (field descriptions etc..) so that Genie can understand what different fields represent
  • Add instructions to transfer business keywords into logic – for example define when your fiscal year starts / ends – provide queries for existing reports (“Show me the portfolio report”)
  • convert text so that it isn’t case sensitive
  • define trusted assets (register functions with Databricks for example to show currency conversions)
  • Monitor and review what questions the business are answering – encourage a culture of the business reporting obvious false answers.

Genie API

So hopefully at this point you can see how the service works, but its not a great user experience, for example;

  • Do you want to give non technical staff access to databricks?
  • Do they want to use Databricks?

the answer to both these is almost certainly a No.

Luckily There is an API exposing Genie Services, meaning you could access Genie from literally anywhere. There are a number of endpoints that are exposed – I will talk through a few of them very lightly. the full documentation is available at Genie API.

The endpoints I will use in my example are stated below;

MethodNameUrlBodyDescription
GETList Genie Spaces<workspace>/api/2.0/genie/spacesList the Genie Workspaces available in the workspace, you will need the workspace ID property to reference the other endpoints
POSTStart Conversation<workspace>/api/2.0/genie/spaces/<GenieSpaceId>/start-conversationThe minimum expected is JSON dictionary asking a question with the content key;

{
“content”: “show me all rentalspaces”
}
This is where the initial starting question of a genie conversation is submitted to the API service.

This will return the conversation and Message ID in its reply
GETGet Conversation Message<workspace>/api/2.0/genie/spaces/<GenieSpaceId>/conversations/<ConversationId>/messages/<MessageId>This method is used to poll the service for when the result is ready – you will need to handle this in code monitoring the status of when the reply is available to be consumed.

This returns the Attachment ID
GETGet Message Attachment Query Result<workspace>/api/2.0/genie/spaces/<GenieSpaceId>/conversations/<conversationId>/messages/<MesageId>/query-result/<attachmentId>This returns the dataset / text from the service – the answer to the question you asked
POSTCreate Conversation Message<workspace>/api/2.0/genie/spaces/<GenieSpaceId>/conversations/<ConversationId>/messagesThe minimum expected is JSON dictionary asking a question with the content key;

{
“content”: “Filter to Just those in London”
}
This asks another question in the conversational chain, allowing an operator to reference the context generated from previous answers.

Genie API – Walkthrough

I have knocked up some code in python that uses this API, its not fool-proof and is something I did in a couple of hours this morning – but I think it would show any developer who wanted to use this a solid example of how this could work.

Function to Make HTTP Requests

As we are going to be making several requests to an API, we should write one function that handles this – something like the code below would achieve this;

Here we have a re-usable function that will support authentication via a bearer token, GET and POST requests, any headers, bodys or params they might have and handle the returning of values as well as capturing request failures.

Get Genie Space ID

We need to know the ID of our genie space, we can use our make_request function to call the endpoint that will list them – here is a function using that endpoint.

Here is the execution of it;

you can see I am obtaining my authentication from a secret scope I have set up, I have specified my workspace url and have successfully called the API. In the reply you can see one genie space ID 01f0949e561d1e61857cfd19f1cdaae3. (Please note – i’ve deleted this workspace prior to publishing this article – so showing this is of no security risk, try to hack it if you want, it doesnt exist now)

Starting a Conversation

here is a function that starts a new conversation with the genie service – the question is passed into this function as a string. This can be executed as follows;

The question has being submitted to the genie service. You can actually see this logged in the Databricks UI monitoring tab;

However the service does need some time before the question is “answered” – we need to obtain from the reply json the message_id and conversation_id – we will use these to poll the service until our “answer” is complete;

Here you can see obtaining these values from the reply of the start_conversation request.

Polling for a Reply

We need a function that will keep asking the Genie service “have you answered the question yet” – we will put some parameters in place for this, if a query takes more than 5 minutes – we will ask this question every 5 seconds and in the unlikely event that the query was not answered within 5 minutes – we will abort and return a timeout error;

Here you can see a function that achieves this, the key thing here is it is looping until the response status value is COMPLETED – meaning that Genie has produced a result. One this status is reached i call another function – that displays the result.

Displaying the result

Obviously there are thousands of ways this section could work – it all depends how you are viewing the result (in a website, on a console, in a chat bot etc..) – I am executing this from Databricks – and I want to return a displayed dataframe and / or narrative from the Genie service. this function is as follows;

One key thing to note here is that there are 2 types of reply you will get from the Genie service – one of an executed sql query (results and a query) to answer your question or text – narrative from the Genie service asking for more details, saying it doesn’t understand etc.. this is handed in this function via the if statement – if query exists in the reply – it has an SQL based result, otherwise the reply is text based.

Getting Result Dataset

to obtain the actual dataset returned in an SQL query we need to call a specific endpoint – we have a function for that, this was called in the display_results function;

Creating a Dataframe

As I am executing this from a notebook – I wanted to convert the result dataset into a dataframe so it looked nice, the referenced “create_df_from_results” function does this;

I return everything as a string – but this could be easily extended should your application need to cast values appropriately.

Continuing the Conversation

As adding additional messages to an existing conversation is a different endpoint – we also need to have a function to call that, this is near identical to the start conversation function – just referencing the continuation endpoint;

Genie API – Example Conversation

Executing these functions via databricks we can see how this all fits together;

we can start our conversation and obtain the response as follows;

this results in;

Notice how the polling ran 3 times (fetching meta data, asking ai, completed) – allowing a front end to update if required. the result returns narrative from the service, the result data set (as a Dataframe) and the SQL query executed by Genie. It took 16.31 seconds to complete

we can continue this same conversation as below – the executing code would be;

and the returned result would be;

This shows replies that are SQL queries – we could continue the same conversation and get a text reply (to show the distinction);

here I am asking the question “how hot is the sun”, Genie is not Chat GPT – and will not answer questions unrelated to the data provided. It replies not with a query result set – but narrative – we handled this outcome in our display_result method.

Genie API – Recommendations

Of course there are things to consider with using this API – the following are things to consider

  • the use of a Serverless Data warehouse Cluster – users will not want to wait 10 minutes for a cluster to spin up in Databricks to answer their questions
  • Cost – you are going to need compute available and will be charged for this time
  • Authentication – in this example as its one user and dev I set up a secret scope with a PAT token – in production instances its recommended to use OAuth2 and have users in specific AD groups per Genie space, or use service accounts – please refer to the documentation here: Authorize user access to Databricks with OAuth
  • Implement your own queuing, genie does not manage request retries – any services using this might benefit from implementing a message queue
  • Start a new conversation each session to avoid unintented context reuse
  • remove old conversations (there is an API endpoint for this) – there is a limit of 10,000
  • Monitor the space and look at he responses users receive – if many users are asking the same question – perhaps make an actual report for it in power BI or something similar – or add it as a known query within genie

Conclusions – What’s next?

I don’t believe in silver bullets, I hope no one looks at this technology and thinks “we no longer need reporting or analysis” – this is far from the case. The real strength of this service is it allows users who are not technical the ability to query data easily, much much more easily than previous methods. Data Experts will still need to be on hand to make sure that the data stored is as accurate as possible and to monitor and review the space activities. Despite its connivence I would argue that anything often requested via a service like this probably should be productionised as a report regardless.

I would hope that an approach like this would encourage data completeness and accuracy in the core systems that feed the warehouse to be improved, as the value of the results obtained from the AI service is directly linked to the data quality and completeness in the core systems. I would also like to hope this moves away from the classic “johns spreadsheet” approach to non-technical reporting whereby a member of staff has enriched data on a spreadsheet which is a single point of failure.

As for the potential uses for the API – two spring to mind, both similar

  • The use of an Azure Chat bot (a hosted web app) to have a teams Channel “Ask Genie”
  • The use of the API in a core application / Intranet page – “Ask Genie”

It simply depends how people feel best placed to interact with the service.

I might well do a post on the Teams Bot option in the future.