Delta Lake

Aarfah Ahmad
8 min read · May 27, 2020

Hello!

I recently took a course on Delta Lake by Databricks, and as much as I enjoyed going through it, it also enlightened me about the various features that help resolve many of the difficulties organisations face while working with data. This made me realise that I should help others get familiar with these concepts too, hence this article :)

In this article, I will try to help you understand data lakes, Delta Lake, and why Delta Lake is better.

To start with, let us understand the term Data Lake:

Basically, going by the term ‘Data Lake’, we can understand that it is a lake consisting of data. I know it sounds a little obvious, but to picture it better, imagine a lake (pool) where data is dumped in its raw format. (See Image 1)

A data lake is a centralised repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data or clean it, and run different types of analytics — from dashboards and visualisations to big data processing, real-time analytics, and machine learning to guide better decisions.

Image 1: A pool where data can be dumped (a Data Lake)

Data Warehouse & Data Lake

Don’t confuse a Data Warehouse with a Data Lake. Let’s understand the difference.

Data Lake is a storage repository that inexpensively stores a vast amount of raw data in its native format.

A data warehouse is a database optimized to analyse relational data coming from transactional systems and line of business applications.

While warehouses excel in handling structured data, most enterprises have to deal with unstructured, semi-structured, and data with high variety, velocity, and volume. Data warehouses are not suited for many of these use cases, and they are certainly not the most cost-efficient.

You can say that the data in a Data Warehouse is organized: it is cleaned, enriched, and transformed so it can act as the “single source of truth” that users can trust. A Data Lake, in contrast, consists of data that is unclean, unprocessed, or not in a standard format.

Coming to Data Lake again, while suitable for storing data, data lakes lack some critical features. Data lakes do not support ACID transactions, do not enforce data quality, and their lack of consistency/isolation makes it almost impossible to mix appends and reads, and batch and stream jobs.

So, to resolve these issues with both the Data Warehouse and the Data Lake, DELTA LAKE (the Data Lakehouse) comes to the rescue.

Data Lakehouse — Data Lake + Data Warehouse

Data Lakehouse combines the capabilities of Data Lake and Data Warehouse. Merging data lakes and data warehouses into a single system means that data teams can move faster as they are able to use data without needing to access multiple systems.

A Data Lakehouse basically combines the best elements of both to resolve their drawbacks. Data lakehouses also ensure that teams have the most complete and up-to-date data available for data science, machine learning, and business analytics projects.

As you can see in Image 2, the Lakehouse lets us do all our jobs, including Data Science, ML, BI, and Streaming Analytics, from the same source. This is unlike a Data Lake, where data has to be integrated into different platforms before producing any output, or a Data Warehouse, where unstructured data is difficult to handle.

Image 2: Data Lakehouse

Delta Lake

Delta Lake is the open-source tool that enables the Data Lakehouse, helping us leverage the processing power of our pre-built/pre-owned Spark infrastructure. It also brings ACID guarantees to distributed storage.

A data lake built using Delta Lake is ACID compliant, meaning that the data stored inside of the data lake has guaranteed consistency.

Due to this guaranteed data consistency, Delta Lake is considered a robust data store, whereas a traditional data lake is not. Databricks Delta is a mechanism for effectively managing the flow of data (the data pipeline) to and from a Data Lake. Databricks has open sourced Delta Lake so that users can easily build reliable data lakes.

So, we can say: Delta Lake = Data Lake + ACID Properties

Databricks Delta Lake offers ACID transactions via optimistic concurrency control between writes, snapshot isolation so that readers don’t see garbage data while someone is writing, data versioning and rollback, and schema enforcement to better handle schema changes and deal with data type changes.
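
Since the article is Spark-centric, here is a minimal PySpark sketch of those guarantees in action. This is not from the course: it assumes the open-source delta-spark package is installed (on Databricks the session is already Delta-enabled), and the path /tmp/delta/events and column names are purely illustrative.

```python
# Minimal sketch, assuming `pip install delta-spark` (on Databricks the
# session is already Delta-enabled). Path and columns are illustrative.
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = (SparkSession.builder
           .appName("delta-demo")
           .config("spark.sql.extensions",
                   "io.delta.sql.DeltaSparkSessionExtension")
           .config("spark.sql.catalog.spark_catalog",
                   "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Writing a Delta table uses the same DataFrame API as Parquet, just with
# format("delta"); every write is an ACID-compliant commit.
events = spark.createDataFrame(
    [(1, "click"), (2, "view")], ["event_id", "event_type"])
events.write.format("delta").mode("overwrite").save("/tmp/delta/events")

# Readers always see a consistent snapshot, even while a writer is running.
spark.read.format("delta").load("/tmp/delta/events").show()

# Schema enforcement: an append with a mismatched schema is rejected
# instead of silently corrupting the table.
bad = spark.createDataFrame([("oops", "click")], ["event_id", "event_type"])
try:
    bad.write.format("delta").mode("append").save("/tmp/delta/events")
except Exception as err:
    print("Rejected by schema enforcement:", type(err).__name__)
```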

We will see each of these terms in detail below. Delta Lake is built upon three main pillars, which address reliability, performance, and engineering challenges:

  • Clean, quality data
  • Consistent views across batch and stream data workloads
  • Optimized and easy to adopt

Clean, quality data:

  1. ACID transactions ensure that only complete writes are committed.
  2. Schema enforcement automatically handles schema variations to prevent insertion of bad records during ingestion.
  3. Time Travel, part of Delta Lake’s built-in data versioning, enables rollbacks, full historical audit trails, and reproducible machine learning experiments (see the sketch after this list).
  4. Exactly once semantics ensures that data are neither missed nor repeated erroneously.
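
As a quick illustration of Time Travel, here is a sketch that reuses the illustrative /tmp/delta/events table from the earlier snippet; the version numbers are only examples.

```python
# Time Travel sketch: read an older snapshot of the (illustrative) events table.
from delta.tables import DeltaTable

previous = (spark.read.format("delta")
            .option("versionAsOf", 0)   # option("timestampAsOf", ...) also works
            .load("/tmp/delta/events"))
previous.show()

# The commit history behind the versions doubles as an audit trail.
(DeltaTable.forPath(spark, "/tmp/delta/events")
 .history()
 .select("version", "timestamp", "operation")
 .show())

# One simple way to roll back: overwrite the table with the old snapshot.
previous.write.format("delta").mode("overwrite").save("/tmp/delta/events")
```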

Consistent views across batch and stream data workloads:

  1. Snapshot isolation provides support for multiple simultaneous writers and readers.
  2. Mixed streaming and batch data mean that a table in Delta Lake is a batch table as well as a streaming source and sink. Streaming data ingestion, batch historic backfill, and interactive queries all just work out of the box.
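
A quick sketch of that second point, again using the illustrative /tmp/delta/ paths: the same Delta table acts as a streaming source while remaining an ordinary batch table.

```python
# Stream from one (illustrative) Delta table into another.
stream = spark.readStream.format("delta").load("/tmp/delta/events")

query = (stream.writeStream
         .format("delta")
         .option("checkpointLocation", "/tmp/delta/_checkpoints/events_copy")
         .trigger(once=True)            # process what is available, then stop
         .start("/tmp/delta/events_copy"))
query.awaitTermination()

# Meanwhile, the very same table remains an ordinary batch table:
(spark.read.format("delta")
 .load("/tmp/delta/events")
 .groupBy("event_type").count().show())
```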

Optimized and easy to adopt:

  1. Scalable metadata handling leverages Spark’s distributed processing power to handle all the metadata for petabyte-scale tables with billions of files with ease.
  2. Effective on-premises means that Delta Lake works well with Hadoop Distributed File System (HDFS) on-premises.
  3. Compatibility with Spark APIs means that Delta Lake is easy to adopt for Spark users.
  4. As an open-source format, Delta Lake eliminates data lock-in; adopting it does not force you to use only Delta Lake.
  5. In-place import allows efficient, fast import from Parquet to Delta format (see the sketch below).
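
The in-place import in point 5 can be as small as the following sketch; the Parquet directory /tmp/parquet/sales is illustrative, and a partitioned table would additionally need a PARTITIONED BY clause.

```python
# In-place import sketch: CONVERT TO DELTA writes a Delta transaction log
# next to the existing Parquet files instead of rewriting the data.
spark.sql("CONVERT TO DELTA parquet.`/tmp/parquet/sales`")

# The same directory can now be read (and written) as a Delta table.
spark.read.format("delta").load("/tmp/parquet/sales").show()
```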

A few important terms

The Enterprise Decision Support System: An enterprise decision support system, or EDSS, is an information system used by an organization to help decision-making.

An EDSS might be built:

· on-premises or in the cloud

· leveraging data lake, data warehouse, or data lakehouse technology

Analytical and Operational Layers: An important concept in data system design is the distinction between the EDSS and the Operational Data Store (ODS).

The EDSS and the ODS in a typical enterprise data system perform two distinct functions:

· an EDSS performs Online Analytical Processing (OLAP)

· an ODS performs Online Transaction Processing (OLTP)

Levels of data:

i. Operational (ODS):

· Day-to-day data

· Frequently accessed

· Ex: An individual’s credit score on a financial website

ii. Atomic (EDSS):

· SSOT (Single Source of Truth)

· All values of any data attribute over time

· Most granular

· Fully integrated across all tables and data stores

· Ex: An individual’s credit history

iii. Departmental (Data mart):

· To meet specific requirements of a team, department, use case

· Smaller view of atomic data

· Ex: Monthly customer report

iv. Individual (Ad-hoc):

· To answer highly specific questions

· Most refined and aggregate

· Ex: One-time data query

Architecture

The most common architecture followed by companies is the cloud-based architecture (cloud-based data warehouse, or CDW). (See Image 3)

Image 3: Cloud-based architecture (data warehouse)

Challenges faced using the above architecture:

1. Data Lake not queryable

Because data in the data lake exists only as raw data, it is not queryable. In other words, the data housed in a CDW data lake is only accessible by running an ETL process to load it into the data warehouse.

2. Black-box processing of queries

CDWs typically offer little opportunity for optimizing data queries, often lacking the ability to index data or provide transparency into the query execution plan.

3. Expense

A CDW is expensive to build and maintain. Expenses accrue in several areas:

  • ETL processes required for leveraging other cloud-based analytics systems
  • Storage in vendor object-stores

4. Structured data only

A CDW is by nature a structured data store. It can only ingest structured data.

Delta Architecture

Delta architecture supports both:

  • Batch processing
  • Stream processing

This was not possible in the Lambda architecture, which required separate pipelines for batch and stream processing.

We can think of the Delta Architecture provided as a combined Data Warehouse and Data Lake technology where we can have our data cleaned and processed on the go. With the Delta architecture, multiple tables of data are kept in the same data lake. We write batch and stream data to the same table. Data is written to a series of progressively cleaner and more refined tables.

By doing this, we achieve our goal of having a Single Source of Truth for BI, SQL analytics, real-time monitoring, data science, and machine learning.

As we move ahead in the process, the data transitions from Bronze to Silver to Gold (think of these as levels of value that the data gains along the way). The data keeps improving in quality as we progress further.


Image 4: Data transformation through Bronze, Silver, and Gold tables in Delta Lake

Consider the data moving through different tables named Bronze, Silver, and Gold. Data from streaming and batch jobs is first captured in the Bronze tables in its raw format. In the Silver tables it is cleaned and processed into normalised data that is easier to query. The Gold tables, also called summary tables, contain aggregated key business metrics that are queried frequently.
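
To make the flow concrete, here is a small batch-style sketch of the three hops. The paths, columns, and cleaning rules are illustrative, and in practice each hop is often its own streaming or scheduled job.

```python
from pyspark.sql import functions as F

# Bronze: raw events landed as-is from streaming and batch sources.
bronze = spark.read.format("delta").load("/tmp/delta/bronze/events")

# Silver: deduplicated and filtered so the data is easy to query.
silver = (bronze
          .dropDuplicates(["event_id"])
          .filter(F.col("event_type").isNotNull()))
silver.write.format("delta").mode("overwrite").save("/tmp/delta/silver/events")

# Gold: aggregated business metrics served to dashboards and reports.
gold = silver.groupBy("event_type").agg(F.count("*").alias("event_count"))
gold.write.format("delta").mode("overwrite").save("/tmp/delta/gold/event_counts")
```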

In this image (Image 5), we can see data streams moving through a data pipeline. These streams have low latency or are manually triggered, eliminating the need for schedules or jobs. In addition, this architecture supports methods not historically supported by an ODW or CDW, such as DELETE, MERGE, UPDATE, and upserts, which are useful for GDPR or CCPA compliance.
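
As a sketch of what those operations look like on a Delta table (the users table, its path, and the user_id = 42 predicate are purely illustrative):

```python
from delta.tables import DeltaTable

users = DeltaTable.forPath(spark, "/tmp/delta/silver/users")

# DELETE: e.g. honouring a GDPR "right to be forgotten" request.
users.delete("user_id = 42")

# MERGE (upsert): update matching rows, insert the rest.
updates = spark.createDataFrame([(7, "new@example.com")], ["user_id", "email"])
(users.alias("t")
 .merge(updates.alias("u"), "t.user_id = u.user_id")
 .whenMatchedUpdate(set={"email": "u.email"})
 .whenNotMatchedInsertAll()
 .execute())
```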

Thanks for reading! Please do share the article, if you liked it. Any comments or suggestions are welcome!
