The Data Lakehouse – Introducing Databricks

Introduction

One of the goals in starting this blog was to give myself a space to learn new technologies I haven’t used much in the past, and hopefully to provide interesting examples to help others learn them too. I have never used Databricks before but have heard a lot about it; I simply haven’t worked with clients who chose to use it.

Curiosity has finally got the better of me, and I suspect the next few posts on this site will be introducing and explaining some basic features within Databricks, with the goal of building an end-to-end solution in Databricks from scratch.

What is Databricks?

Databricks was founded in 2013 by the creators of Apache Spark. It is built on a lakehouse architecture in the cloud, which seeks to combine the best elements of data warehousing (data management and performance) with the low-cost, flexible object stores of data lakes. Databricks’ main purpose is to allow clients to store, clean and present vast amounts of data from disparate sources. One real-world example of this is Shell monitoring sensor data from 200m valves to predict ahead of time if any will break.

The heart of Databricks is of course (given the founders) Apache Spark, a lightning-fast unified analytics engine for big data and machine learning.

So why not just use Spark?

Well, Spark is just one of three core components within Databricks:

  • Spark – the engine for processing data
  • Delta Lake – a storage layer that brings reliability (ACID transactions) to data in the lake
  • MLflow – model deployment and monitoring for machine learning tasks

So Databricks contains much more than just Spark. On top of that, Spark by itself is quite difficult to manage: you need to create and administer your own cluster, monitor the jobs running on it, and so on. Because Databricks runs as a service in the cloud, clusters can be spun up on demand and are managed for you. At its most basic level, Databricks is a cloud service that provides a managed Spark environment.
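To make this a little more concrete, here is a minimal PySpark sketch of the kind of code you would run on a Databricks cluster. The file path and column names are hypothetical, and in a Databricks notebook a SparkSession called spark is already provided, so the builder step is only needed outside that environment.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# In a Databricks notebook a SparkSession named `spark` already exists;
# the builder below is only needed when running the script elsewhere.
spark = SparkSession.builder.appName("lakehouse-intro").getOrCreate()

# Hypothetical sensor readings stored as CSV files in cloud object storage.
readings = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv("/mnt/raw/sensor_readings/")
)

# A simple distributed aggregation: average pressure per valve.
avg_pressure = (
    readings
    .groupBy("valve_id")
    .agg(F.avg("pressure").alias("avg_pressure"))
)

avg_pressure.show()
```

The cluster running this is created and managed by Databricks; the code itself is just standard Spark.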

Why do we need this?

To understand the argument for this architecture we need to go back – way back to the late 1980s – and look at the evolution of data warehousing architectures from then to now. These broadly fall into two different approaches.

Data Warehouses

In the 1980s businesses realised that they needed to move past operational relational databases to systems that could handle higher volumes of data generated at a faster pace. They designed data warehouses to collect and consolidate data for business analysis.

They took data from several sources, loaded it into a warehouse and used this warehouse to generate reporting. The data had to be structured, clean and fit into pre-defined schemas. The disadvantage of this approach was inflexibility: unstructured or semi-structured data couldn’t be ingested easily because it didn’t fit into the pre-defined schemas.

In addition, as companies grew, data collection massively increased in volume and variety. Warehouses struggled to cope with this variety, and this led to the 2000s “Big Data” evolution.

Big Data – Data Lakes

In the 2000s the phrase “Big Data” was coined. Companies, increasingly seeing the value of their data, started to demand that all kinds of data be stored for analysis, including a huge amount of unstructured data. Because of this, a more flexible data storage approach was needed, and data lakes were born. Data lakes were predominantly cloud based and allowed for storage of structured, semi-structured and unstructured data; they also had some integrated support for AI and machine learning.

However, data lakes were not perfect. They don’t support ACID transactions, so concurrent updates and inserts on the data couldn’t be handled reliably, and the quality of the data suffered. Implementations also often found that analytical performance was slow, due to the huge volume of data now being stored.

A lot of businesses tried to solve this issue by using a data lake to store unstructured data (logs, text, images, video) and traditional data warehouses to store their structured data. However, this often created complex infrastructure and disjointed silos of data with frequent duplication, as data was copied back and forth across systems.

One architecture recently proposed to solve this problem is the Data Lakehouse, and this is the fundamental model within Databricks.

The Data Lakehouse

The term Data Lakehouse was first coined by the creators of Databricks in a paper titled “Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics”. The paper argues that traditional data warehouse architecture will wither in the coming years and be replaced by a new architecture – the Lakehouse – which has three key criteria:

  • Be based on open, direct-access data formats (such as Parquet)
  • Have first-class support for machine learning and data science
  • Offer state-of-the-art performance

In addition to these, the paper argues that this new architecture can address several major challenges with data warehouses, including data staleness, reliability, total cost of ownership, data lock-in and limited use-case support.
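As a small, hypothetical illustration of the first criterion (continuing the sketch from earlier, so spark and avg_pressure are assumed to already exist), data written in an open format like Parquet sits in ordinary cloud storage and can be read straight back by Spark, or by any other Parquet-aware engine, without going through a proprietary warehouse format.

```python
# Write the aggregated results to cloud storage in the open Parquet format.
avg_pressure.write.mode("overwrite").parquet("/mnt/lake/avg_pressure")

# Read the same files straight back; any tool that understands Parquet
# can access them directly, with no proprietary engine in between.
spark.read.parquet("/mnt/lake/avg_pressure").show()
```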

This architecture seeks to combine the benefits of both data lakes and traditional data warehousing. The data lakehouse can store all types of data together, allowing it to provide a single, reliable source of data.

The diagram below shows the transition across these three architectures:

Databricks connects directly to cloud storage, yet it still allows for ACID transactions (through Delta Lake). It uses the Spark engine to compute across multiple nodes, allowing billions of rows of data to be processed. It can handle all data formats, and does this in one centralised service.
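As a rough sketch of what this looks like in practice (again assuming a Databricks or Delta-enabled Spark environment where spark already exists, and using made-up table names and values), the snippet below writes a Delta Lake table on top of cloud storage and then updates it in place, something a plain lake of Parquet files cannot do transactionally.

```python
from pyspark.sql import Row

# Hypothetical valve statuses written as a Delta table backed by cloud storage.
valves = spark.createDataFrame([
    Row(valve_id=1, status="ok"),
    Row(valve_id=2, status="ok"),
])
valves.write.format("delta").mode("overwrite").saveAsTable("valve_status")

# Delta Lake brings ACID transactions to the lake: this update either
# commits fully or not at all, even with concurrent readers and writers.
spark.sql("UPDATE valve_status SET status = 'faulty' WHERE valve_id = 2")

spark.sql("SELECT * FROM valve_status").show()
```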

Conclusion

This post has introduced the core concepts of the data lakehouse, the central model behind Databricks. In the future I will write some posts showing implementations of various aspects of this approach, with the goal of demonstrating a working lakehouse. The key takeaways for me while writing this post are that Databricks is a cloud-based service that lets you process data on managed Spark clusters, and that its central architecture looks to combine the flexibility and low cost of object storage with the reliability and performance of traditional data warehousing.