With the growing popularity of data lakes and big data services in the cloud, people are questioning the need for a database in their analytic solutions. This article discusses how a database fits into a modern analytic architecture.
The data lakehouse
Data lakehouses are increasingly used in the industry. They combine every resource needed to ingest, clean, transform and present data to consumers. These consumers can be other systems like machine learning workspaces and APIs, as well as data science notebooks and BI report users. Lakehouses can ingest and store all kinds of data, as described in the following table.
| Type of data | Description | Examples |
| --- | --- | --- |
| Unstructured | Data without any usable structure. | Text documents like PDF, word processing and image files; media files like music or videos. |
| Semi-structured | Data has some structural elements, but without a formal or strict organization of them. | Log files from IoT systems, JSON and XML documents, web pages, etc. |
| Structured | Data is organized into a formal schema. | Databases, Excel spreadsheets, delimited files, etc. |
Major vendors such as Databricks and Microsoft (with Synapse Analytics) support data lakehouse features. Databricks introduced the Delta Lake project and open-sourced it in early 2019. It brings the following benefits to a data lake:
- Delta Sharing: enables sharing data from Delta tables with other systems without having to copy it first.
- ACID transactions: ensure data integrity.
- Time travel: data is snapshotted, and every change made to it can be viewed and used by your pipelines. This helps when we need to revert to a specific version of the data or audit it (see the first sketch after this list).
- The open-source Parquet file format is leveraged for its compression and encoding schemes.
- Delta tables can be filled using both streaming and batch processing (see the second sketch after this list).
- Schema enforcement: the schema is enforced to prevent bad data from landing in a table and to help maintain data quality.
- Schema evolution: although not recommended in a production environment, this feature allows the schema to change over time. A simple option (mergeSchema) makes it possible to add new columns to a table, as shown in the first sketch below. In a production environment, proper DDL statements like ALTER TABLE are more suitable to prevent pipeline failures down the road.
- Update, Merge and Delete support: prior to Delta, data was loaded in append-only mode. DML statements can now be used with Delta tables (see the third sketch below).
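To make a couple of these features concrete, here is a minimal PySpark sketch of time travel and schema evolution. It assumes a SparkSession named `spark` with the delta-spark package configured (as in a Databricks or Synapse notebook); the table path and the `new_rows` DataFrame are hypothetical:

```python
# Minimal sketch, assuming a SparkSession `spark` with delta-spark configured
# (e.g., a Databricks or Synapse notebook) and a hypothetical path /mnt/lake/sales.

# Time travel: read the table as of an earlier version, or as of a timestamp.
v3 = spark.read.format("delta").option("versionAsOf", 3).load("/mnt/lake/sales")
old = (spark.read.format("delta")
       .option("timestampAsOf", "2022-01-01")
       .load("/mnt/lake/sales"))

# Schema evolution: append a DataFrame whose schema adds a new column;
# mergeSchema adds the column to the table instead of failing the write.
# `new_rows` is a hypothetical DataFrame with the extra column.
(new_rows.write.format("delta")
 .mode("append")
 .option("mergeSchema", "true")
 .save("/mnt/lake/sales"))
```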
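Similarly, the same Delta table can be fed by both batch and streaming writers. A sketch under the same assumptions, where `batch_df` and `events_stream` are hypothetical batch and streaming DataFrames:

```python
# Batch load: a plain append into the hypothetical table.
batch_df.write.format("delta").mode("append").save("/mnt/lake/sales")

# Streaming load: Structured Streaming writes to the same Delta path;
# the checkpoint location tracks progress across restarts.
(events_stream.writeStream
 .format("delta")
 .outputMode("append")
 .option("checkpointLocation", "/mnt/lake/_checkpoints/sales")
 .start("/mnt/lake/sales"))
```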
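And for DML support, Delta exposes UPDATE, DELETE and MERGE through both SQL and a Python API. A sketch of an upsert with the DeltaTable API, where `updates` and the `order_id` key are illustrative names:

```python
from delta.tables import DeltaTable

# Upsert incoming rows (`updates`, a hypothetical DataFrame) into the table,
# matching on a hypothetical business key `order_id`.
target = DeltaTable.forPath(spark, "/mnt/lake/sales")
(target.alias("t")
 .merge(updates.alias("s"), "t.order_id = s.order_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())

# Deletes are also supported, e.g. removing a hypothetical cancelled order.
target.delete("order_id = 42")
```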
An exhaustive list of Delta table features can be found on the Delta.io web page. Delta tables can be integrated into Azure Synapse Analytics quite easily, as they are supported by both SQL serverless and Spark pools. Data Factory and Synapse pipelines have direct connectors for Delta tables.
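For example, a Synapse Spark pool can read a Delta table directly from the storage account backing the lake. A minimal sketch, where the ADLS Gen2 account, container and folder names are hypothetical:

```python
# Minimal sketch from a Synapse Spark pool notebook; the storage account,
# container and folder names below are hypothetical.
path = "abfss://datalake@mystorageacct.dfs.core.windows.net/curated/sales"
sales = spark.read.format("delta").load(path)

# Expose the table to Spark SQL for ad hoc queries.
sales.createOrReplaceTempView("sales")
spark.sql("SELECT COUNT(*) AS row_count FROM sales").show()
```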
Microsoft is developing a similar concept in Synapse Analytics called Lake Database. These databases can contain tables that are linked to an Azure Storage account folder or file. They are still in preview at the time of writing, and they support the CSV and Parquet file formats. What's nice about Lake Database tables is that we have a designer inside Synapse Analytics to model our tables based on our files. Microsoft also offers templates that are out-of-the-box database solutions for domains such as Accounting, Customer, etc. Like Databricks Delta tables, Lake Database tables can be used by SQL serverless pools and Spark pools, as well as Data Factory and Synapse pipelines. Since Lake databases are quite new, I will write other articles on that topic in the coming weeks.
The role of the database in the mix
So, if we’re using Delta tables or Lake Database tables, and we can query them directly with tools like Azure Synapse workspaces, Power BI, etc., do we still need a database? The answer is, most of the time, yes. As good and powerful as data lakes are, they still struggle to deliver fast query results. Databricks added a new query engine called Photon a few months ago, but it might become quite expensive with heavy usage, since it requires compute services to be up and running all the time to serve analytic report needs. Data lake databases also lack features like granular row- and column-level security, dynamic data masking to hide the values of certain fields, etc.
Also, someone will likely want to compare or integrate the data lake tables with their existing application or database systems. Although Synapse Analytics provides access to data lake files, other systems might not have this capability out of the box. Some data mart patterns, like slowly changing dimensions (SCD) Type 2, are also better handled in a database than in a data lake implementation.
Flexible and scalable
But the main reason to keep a database in an analytical solution, IMHO, is flexibility and scalability. For flexibility, I can use multiple databases on top of my data lake for specific purposes or different business processes. I can use an Azure SQL serverless database for small workloads and save money by not having a compute engine (read: a cluster) running all the time. For scalability, I can scale my database up and down depending on my end users’ needs. I can even stop the database engine outside business hours, the same way I could stop my data lake compute engine.
Helping report users
Another factor where having databases in our solution helps is that, while the data lake contains tons of data, my systems might only need a subset of it. For example, most of my reports might use current or short-term data most of the time and need to render as fast as possible. I have seen end users waiting minutes for reports that spanned only a few months of data; the problem was that the database model was not optimized for such reports. People will readily understand that a report takes a long time to render when many years of data must be displayed, and such reports could easily be served by a dedicated connection to the data lake.
Performance optimizations
Databases offer more levers for query optimization than data lake tables. There are also several database types when it comes to SQL Server in Azure: symmetric multiprocessing (SMP) databases like Azure SQL (serverless, managed instance or Hyperscale) or massively parallel processing (MPP) databases like Azure Synapse dedicated SQL pools can be used to expose data for different use cases. The latter has replicated tables for dimensions and hash-distributed tables for fact tables. Azure SQL Database also has different types of indexes that can be used to optimize query performance.
Conclusion
The need for a database in an analytical solution is still present, even with modern data lake database systems. The latter do not have all the features offered by good old relational databases. Most of the time, it’s better to segregate our workloads: the data lake for fast processing of big amounts of data, and the database for light transformations and the consumption layer. This way, we get a secure, flexible solution that does not cost too much in terms of operation and maintenance.