Do we need a database in a data analytics project?

With the growing popularity of data lakes and big data services in the cloud, people are questioning the need for a database in their analytics solution. This article will talk about how a database fits in a modern analytics architecture.

The data lakehouse

Data lakehouses are increasingly used in the industry. They combine every resource needed to ingest, clean, transform and present data to consumers. These consumers can be other systems like machine learning workspaces and other APIs, as well as data science notebooks and BI report users. Lakehouses can ingest and store all kinds of data, as described in the list below.

  • Unstructured: data without any usable structure. Examples: text documents like PDF, word processing and image files; media files like music or videos.
  • Semi-structured: data with some structural elements but no formal or strict organization of them. Examples: log files from IoT systems, JSON and XML, web pages, etc.
  • Structured: data organized into a formal schema. Examples: databases, Excel spreadsheets, delimited files, etc.

 

Major vendors such as Databricks and Microsoft (with Synapse Analytics) support data lakehouse features. Databricks introduced the Delta Lake project and open sourced it in early 2019. It brings the following benefits to a data lake:

  • Delta Sharing: enables data sharing from Delta tables with other systems without having to copy the data first.
  • ACID transactions: ensure data integrity.
  • Time travel: data is snapshotted, and every change made to it can be viewed and used by your pipelines. It helps when we need to revert to a specific version of the data or audit it.
  • Parquet file format: this open-source format is leveraged for its compression and encoding schemes.
  • Delta tables can be filled using both streaming and batch processing.
  • Schema enforcement prevents bad data and helps enforce data quality.
  • Schema evolution: although not recommended in a production environment, this feature allows the schema to change over time. A simple option (mergeSchema) allows new columns to be added to a table. For a productionized environment, proper DDL statements like ALTER TABLE are more suitable to prevent pipeline failures down the road.
  • Update, Merge and Delete support: prior to Delta, data was loaded in append-only mode. DML statements can now be used with Delta tables (see the sketch after this list).
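
Below is a minimal PySpark sketch of two of these features, MERGE (upsert) and time travel. It assumes a Databricks notebook where spark is available; the paths and column names are hypothetical.

from delta.tables import DeltaTable

# Hypothetical incoming batch of customer updates.
updates_df = spark.read.format("parquet").load("/mnt/raw/customers")

# MERGE (upsert) into an existing Delta table; append-only lakes could not do this before Delta.
target = DeltaTable.forPath(spark, "/mnt/delta/customers")
(target.alias("t")
    .merge(updates_df.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Time travel: read the table as it was at an earlier version, e.g. to audit or revert a change.
previous_df = spark.read.format("delta").option("versionAsOf", 0).load("/mnt/delta/customers")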

An exhaustive list of Delta table features can be found on the Delta.io web site. Delta tables can be integrated into Azure Synapse Analytics quite easily as they are supported by both SQL Serverless and Spark pools. Data Factory and Synapse pipelines have direct connectors for Delta tables.

Microsoft is developing a similar concept in Synapse Analytics called Lake Database. These databases can contain tables that are linked to an Azure Storage account folder or file. They are still in preview at the time of writing, and they support the CSV and Parquet file formats. What's nice with Lake Database tables is that we have a designer inside Synapse Analytics to design our tables based on our files. Microsoft also offers templates that are out-of-the-box solutions for database domains such as Accounting, Customer, etc. Like Databricks Delta, Lake Database tables can also be used by SQL Serverless pools and Spark pools, as well as Data Factory and Synapse pipelines. Since Lake databases are quite new, I will write other articles on that topic in the coming weeks.

The role of the database in the mix

So, if we’re using Delta tables or Lake Database tables, can we query them directly using various methods like Azure Synapse workspaces, Power BY, etc. Do we still need a database? The answer is most of the time, yes. As good and powerful data lakes are, they are still struggling to deliver fast query results. Databricks added a new query engine called Photon a few months ago, but it might become quite expensive with heavy usage. It requires computation services to be up and running all the time to serve the analytic report needs. Data lake databases do not have features like granular row and column level security, dynamic data masking to hide certain fields values, etc.

Also, someone will want to compare or integrate the data lake tables with their existing application or database systems. Although Synapse Analytics provides access to the data lake files, other systems might not have this capability out of the box. Some data mart patterns like SCD Type 2 are also better handled in a database than in a data lake implementation.

Flexible and scalable

But the main reason to keep a database in an analytical solution, in my humble opinion, is flexibility and scalability. Flexibility: I can use multiple databases on top of my data lake for specific purposes or different business processes. I can use an Azure SQL serverless database for small workloads and save money by not having a compute engine (read: a cluster) running all the time. I also have the possibility to scale my database up and down depending on my end users' needs. I can even stop the database engine outside business hours, the same way I could stop my data lake compute engine.

Helping report users

Another area where having databases in our solution can help: while the data lake contains tons of data, my systems might only need a subset of it. For example, most of my reports might use current or short-term data most of the time in order to render as fast as possible. I have seen end users waiting minutes for reports that spanned only a few months of data. The problem was that the database model was not optimized for such reports. People will easily understand that a report takes a long time to render when many years of data are displayed; such reports could easily be served by a specific connection to the data lake.

Performance optimizations

Databases offer more levers in terms of query optimization compared to data lake tables. There are also several database flavors when it comes to SQL Server in Azure. SMP databases like Azure SQL (serverless, managed instance, Hyperscale) or massively parallel processing (MPP) databases like Azure Synapse Dedicated SQL pools can be used to expose data for different use cases. The latter offers replicated tables for dimensions and distributed tables for fact tables. Azure SQL database also has different types of indexes that can be used to optimize query performance.

Conclusion

The need for a database in an analytical solution is still present, even with modern data lake database systems. The latter do not have all the features offered by good old relational databases. Most of the time, it's better to segregate our workloads: the data lake for fast processing of large amounts of data, and the database for light transformations and the consumption layer. This way, we get a secure, flexible solution that does not cost too much in terms of operations and maintenance.

SSIS vs TSQL and Azure migration perspectives

Microsoft has always made SSIS flexible enough to let developers build their ETL with different techniques. This article will talk about an old debate on how SSIS should be used as an ETL tool. I am sharing my own experience with ETL'ing in SSIS and its implications when it comes to migrating our ETLs to Azure.

The “DTS” way: using SSIS as an orchestrator

DTS (Data Transformation Services) was around from the end of the 20th century to the mid-2000s. It relied on ActiveX technology to move and transform data, and ActiveX was not very performant. ETL developers therefore used DTS most of the time as an orchestrator of database stored procedures: pipelines were designed to call a series of stored procedures that did the data transformations. It was ELT, not ETL.

We can still use SSIS the same way we used DTS: use the control flow to orchestrate stored procedures. But there are a few caveats with this approach:

  • SSIS is not used to its full potential. Stored procedures still run on the database server and might create resource contention on it.
  • Performance is harder to tune: from an SSIS perspective, a stored procedure is a black box. SSIS has no clue what's going on when it runs a stored procedure; it simply calls it. I've seen stored procedures called from OLE DB sources in data flows, some executing very poorly. These data flows were very hard to debug or maintain when they jammed or slowed down, because the majority of the transformations (if not all) were done in the underlying stored procedure.
  • Transformation logic is hidden: by contrast, a data flow task in SSIS exposes the transformations, and it's easy to understand the transformation flow.
  • The database might require supplemental indexes or views to speed up or ease the ETL process.

Bottom line: if you use TSQL to do the ETL, using SSIS is a bit overkill, as the tool can do the same kinds of transformations and many more.

Using the SSIS Data Flow

The SSIS data flow task has a lot of transforms available in its toolbox. They can read and write data from a variety of sources like databases, files, on-premises Hadoop files, etc. With the addition of the Azure Feature Pack, data can be read from or written to Azure storage in various formats like Avro or Parquet. Data flows have a rich set of optimized transformations that allow all kinds of operations, like pivoting or unpivoting the data, adding or modifying columns in the pipeline, or simply joining or splitting data.

Data flows transform the data in memory, so performance is greatly improved. They use strong data types, which helps optimize memory consumption throughout the various transformations. Between every transformation, we can view the transformed data by enabling data viewers; they help greatly when developing or debugging a transformation. Using the various transformations changes the way we design our ETLs: we spread out and expose our transformation logic on the data flow design pane. This makes our packages more self-documenting and much easier to maintain.

There is also a fairly large set of commercial third-party components that provide lots of useful extensions to SSIS. If you ever need to read or validate a JSON file, or read and write to a SharePoint list, there are components available for all of these.

Bottom line, if you’re using SSIS, the Data Flow task is your friend when it comes to data transformations. Of course, you can still use TSQL when it makes more sense. Let’s say you have two OLE DB sources in your package: Source A brings 1 million rows and Source B 1000. Both sources ar pulling data from the same RDBMS and inner joining the two sources would result to a dataset of 50000 records. It make sense to join the two sources in a single OLE DB source. Otherwise, bringing 1 million rows to the pipeline and filtering down 950 000 would make SSIS doing the join much slower that using the database for the same job.

Cloud migration perspectives

Migrating SSIS or TSQL stored procedures to the cloud (Azure) can be done using the following two techniques:

  1. Lift and shift the ETL programs to Azure and use them almost as is.
  2. Re-write the ETL programs, ensuring the migration embraces the full cloud scalability paradigm.

If you’re using TSQL, you have to male sure that the Azure database supports all code syntax you used when it was developed on an on-premises server. If you’re using SSIS, you will need Azure Data Factory with its SSIS Integration Runtime to run your packages in Azure almost as is.

Doing a simple lift and shift does not mean that we're using the Azure cloud technologies to their full potential. A re-write is required for that. This can be done with three major ETL technologies in Azure:

  • Azure Data Factory: using its mapping data flows, we can specify the joins and partitioning strategy that help distribute the ETL to many workers.
  • Azure Synapse: Microsoft is extending what was known as Azure SQL Data Warehouse to create a unified analytics experience. This technology will leverage parallel ETL distribution as it always did, but this time Spark is added to the mix.
  • Azure Databricks: this is a Spark-specific technology, so distributed processing is a no-brainer for this one.

Now, if we use TSQL and want to migrate our ETLs, we will have to reverse engineer our TSQL code and re-write the logic using one of the technologies stated above to ensure we're using the cloud at its full potential. If we're using SSIS with the data flow, all the logic should be displayed in the package, making the re-write much easier, especially if Data Factory and mapping data flows are leveraged. Data Factory data flows use many transformations found in SSIS data flows, so migrating the ETL logic should be very easy to do.

If Databricks is used, much of the TSQL can run leveraging Spark SQL, but since it's not TSQL, some modifications to the code might be necessary, as not all TSQL syntax is supported in Databricks. With SSIS, it means a complete re-write of the ETL. But since the logic is spread out in the data flows, it should be easy to migrate the sources and do the transformations using DataFrames.
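
As a rough, hedged illustration, here is what a typical join-and-derive transformation (the kind you would find in a TSQL statement or an SSIS data flow) could look like once rewritten with PySpark DataFrames; the paths, tables and column names are hypothetical.

from pyspark.sql import functions as F

# Hypothetical source tables already landed in the lake as Delta tables.
orders_df = spark.read.format("delta").load("/mnt/delta/orders")
customers_df = spark.read.format("delta").load("/mnt/delta/customers")

# Equivalent of a TSQL INNER JOIN plus a derived column and a filter.
result_df = (orders_df.alias("o")
    .join(customers_df.alias("c"), F.col("o.customer_id") == F.col("c.customer_id"), "inner")
    .withColumn("line_total", F.col("o.quantity") * F.col("o.unit_price"))
    .filter(F.col("o.order_date") >= "2020-01-01")
    .select("c.customer_name", "o.order_id", "line_total"))

# Write the result to a hypothetical data mart layer.
result_df.write.format("delta").mode("overwrite").save("/mnt/delta/sales_mart")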

Bottom line: cloud migration is not a simple flip of a switch; some work has to be done. Microsoft has a tool that can greatly help us assess the complexity of a lift and shift of our code: the Data Migration Assistant. For a re-write, there is no tool. I wish there was a tool that would script our SSIS packages to PySpark dataframes though; that would certainly be very helpful for lots of people.

Avoid SQL Server nvarchar(max) data type mapping in Databricks

When saving a dataframe to SQL Server using JDBC, the resulting data type for a string column is nvarchar(max). This happens when we let JDBC create the table without supplemental instructions. We seldom need our string columns to use this data type, and it limits the functions we can use with them, such as substring().

This article will show some options to avoid having an nvarchar(max) column in our table.

A simple query

First things first, we'll connect to a database. For this article, I am using a sample database, AdventureWorksLT, from Microsoft's GitHub site. With Databricks, we connect to the server using JDBC and query the "SalesLT.Product" table to bring back two columns (a read sketch follows the column list):

  • ProductID: integer column;
  • Name: nvarchar column, it will be seen as a string in the dataframe.
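
Here is a minimal sketch of that read, assuming a hypothetical Azure SQL server name, user and Databricks secret scope for the password.

jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=AdventureWorksLT"

product_df = (spark.read
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "(SELECT ProductID, Name FROM SalesLT.Product) AS src")
    .option("user", "etl_user")  # hypothetical credentials
    .option("password", dbutils.secrets.get(scope="sql", key="etl-password"))
    .load())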

Now that we have retrieved data in our dataframe, we're going to save it back to a different table. In real life, we would have transformed the data quite a bit, but the goal of this article is to explain how to save a dataframe back to SQL Server with correct data types.

Save the dataframe; the simple way

The first thing we will do is simply save the dataframe, overwriting the destination table. Overwriting the table without any other options means we are going to drop and recreate the table.
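
A minimal sketch of this simple save could look like the following; the destination table name is hypothetical and the connection options are the same as for the read above.

(product_df.write
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "SalesLT.ProductCopy")  # hypothetical destination table
    .option("user", "etl_user")
    .option("password", dbutils.secrets.get(scope="sql", key="etl-password"))
    .mode("overwrite")  # with no other options, drops and recreates the table
    .save())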

The resulting table structure is shown below; I took the screenshot with Azure Data Studio. As we can see, the "ProductName" column's data type is nvarchar(max). This is not the best option, but it ensures that whatever is stored in the dataframe's string column will fit; it's a safe choice.

There’s a way to specify the data type of the dataframe’s columns while saving it to the database. But in order to find the right length, we need to profile the data. Still in Databricks, we’ll find the maximum length of the “ProductName” column in our dataframe as shown below.

Specifying a data type

Now that we know the maximum length in the column, we use the "createTableColumnTypes" option when we save the dataset. However, specifying "nvarchar(100)" fails: the "nvarchar" data type is not supported for now.

Using the "varchar" data type executes successfully, as shown below.
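
Here is a sketch of that save using "createTableColumnTypes", assuming the same connection options as before; the length comes from the profiling step above.

(product_df.write
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "SalesLT.ProductCopy")
    .option("user", "etl_user")
    .option("password", dbutils.secrets.get(scope="sql", key="etl-password"))
    .option("createTableColumnTypes", "Name varchar(100)")  # nvarchar is rejected here
    .mode("overwrite")
    .save())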

When we look at the table structure in Azure Data Studio, we clearly see that the command created a table with the specified structure.

 

However, I am not satisfied with this. I dislike using the "varchar" data type: we live in a connected world now, and sooner or later this data type will limit us with international character sets. To circumvent this, we have to create the table first, as shown below.

Truncate the table

Next, we’ll use the “truncate” option with the save command. This will NOT drop and recreate the table but simply empty (truncate) the table.

Going back to Azure Data Studio, we notice that the table’s structure has not been altered by the save with overwrite command.

 

But we have to be careful: if the structure of the dataframe changes by adding or renaming a column, the command will fail as shown below.

 

To sum up

There are some valid options to avoid being stuck with an "nvarchar(max)" column while saving a dataframe with Databricks. An improvement I would like to see is support for the "nvarchar" data type; this might come in the near future.

SSIS Error: Cannot process request in SSDT

I was recently deploying an SSIS project to another server and I encountered the following error:

Microsoft Visual Studio

——————————

Cannot process request because the process (114328) has exited.

I searched the web to find out what could be causing this error. The project opened on my development machine executed flawlessly. Of course, we cannot use my development machine as an execution server, so I had to find the answer.

When we create an SSIS project in SSDT/Visual Studio 2017, the deployment target server is set to SQL Server 2017. I hadn't changed it when I created the project and was trying to execute a package on SSIS 2016… this was my mistake.


I simply set the Deployment Target Server (right-click on the SSIS project → Configuration Properties) of my dev machine's SSIS project to SQL Server 2016 and the issue was gone!

Simple things like that sometimes make us lose lots of valuable time! Hopefully this article will help some of you, as the error message I got was not really helpful.

 

Databricks-cli setup

Databricks-cli is used for Databricks administration. With it, we can manage the following items:

  • Clusters: utility to interact with Databricks clusters.
  • Configure: configures host and authentication info for the CLI.
  • Fs: utility to interact with DBFS.
  • Groups: utility to interact with Databricks groups.
  • Jobs: utility to interact with jobs.
  • Libraries: utility to interact with libraries.
  • Runs: utility to interact with job runs.
  • Secrets: utility to interact with the Databricks secret API.
  • Stack: utility to deploy and download Databricks resource stacks.
  • Workspace: utility to interact with the Databricks workspace.

The remainder of this page will describe how to set up Databricks-cli.

Requirements

We need access to a Databricks workspace token. Also, Python has to be installed on the developer's PC.

Python installation

From https://www.python.org/downloads/windows/, download the latest Windows version as shown in the following screenshot. You can opt for the web installer or the full installer; one or the other doesn't matter.

Once the setup starts, you’ll get the following screen (the Python version may differ).

Make sure that the two checkboxes are selected:

  • Install launcher for all users
  • Add Python to PATH

Once the installation completes, go to the environment variables of your PC:

Click on environment variables

Select Path in the "User variables" (upper window) and click on "Edit" as highlighted below.

Click on New as shown below

Enter the following string:

%USERPROFILE%\AppData\Local\Programs\Python\Python37\Scripts

And click on OK in both windows to save the path as shown below.

Open a command window and type the following command:

pip3 install databricks-cli

The installation will proceed. At the end of it, open the %USERPROFILE%\AppData\Local\Programs\Python\Python37\Scripts folder and you should see "databricks.exe" as shown in the following screenshot.

Now, back to the command window, type the following command:

databricks --version 

The version of databricks should now be displayed as shown below.

We’re now ready to configure databricks-cli.

Databricks-cli configuration

Once databricks-cli is installed, we have to connect to an existing Databricks workspace. In order to do so, we’ll use the following command:

databricks configure --token

as shown in the following screenshot.

The Databricks host is the URL of the Azure region where the workspace has been created; in our case, East US. The token can be obtained from the workspace itself: open the Databricks workspace you want to link to databricks-cli and follow this link to create the token for the workspace.
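
For reference, the configure command writes the host and token to a .databrickscfg file in the user's profile folder. It should look roughly like the following sample; the host URL and token below are placeholders, not real values.

[DEFAULT]
host = https://eastus.azuredatabricks.net
token = dapi0123456789abcdef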

Databricks Load XML libraries

At my client's site, we're using Databricks in conjunction with Azure Data Factory to transform data coming from HTTP connections. These connections' data ends up in an Azure blob. When we receive JSON data, Databricks and most Azure components know how to deal with such data; Databricks has JSON libraries already available for us to use.

We started receiving XML files from a provider lately. Using Databricks, I thought that I would be able to load the data into a dataframe as easily as I do with JSON from Azure blob storage. Here's the command I use in Python to load JSON data:

df = spark.read.option("multiLine", True).json(completeFilePath)

The command tells Databricks to load my "completeFilePath" content from my blob storage into a dataframe called "df". I use the "multiLine" option because the JSON data spans multiple lines.

Naively, I thought that reading an XML file would be as easy; I would just have to issue a command like the following in Python:

df = spark.read.option("multiLine", True).xml(completeXMLFilePath)

Unfortunately, it's not that easy. There's no native XML library available in Databricks when we create a workspace; we need to load one. After some research on the web, I found this link on GitHub: https://github.com/databricks/spark-xml. From there, we can download the library and load it in the notebook. Here are the instructions on how to refer to the library in the notebook:

Spark compiled with Scala 2.11
$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-xml_2.11:0.4.1

From a Databricks notebook perspective, it is not obvious how to attach the library: we cannot use this code anywhere in the notebook. I don't pretend to be a Databricks expert, but even when I looked at the cluster configuration, I could not find a way to attach such a library, as shown in the following screenshot.

Not even from the Libraries tab:

More research on the web pointed me to the right place. From the notebook folder menu, we click on the drop-down arrow, select Create and then Library from the submenu that appears, as shown below.

The following screen appears. Select Maven Coordinate as the source and then click on the "Search Spark Packages and Maven Central" button.

From the search page, type "xml" and click on the search magnifier icon as shown below. From the list that appears in the center, select "spark-xml", choose the release that matches your cluster (2.11 in my case) and click on "+ Select" right beside it as shown below.

The screen closes and you're back on the "Create Library" screen. Click on the "Create Library" button at the bottom of the screen.

The next screen allows us to attach the library to a cluster. Click on the "Attach automatically to all clusters" checkbox if you want to be able to use it with any cluster you create, whether it's a job or an interactive cluster. Since we're calling our notebook from Azure Data Factory, it'll be a job cluster, so this checkbox has to be checked for us.

When we select the checkbox, we get a confirmation window; click on "Confirm". You should see that the library is attached once done.

Going back to the Cluster’s library tab, we now see that the library is attached to it.

Voilà! The library is attached and ready to use. From our notebook, we can now load a file by issuing the following command:

df = sqlContext.read.format('xml').options(rowTag='<my row tag>').load(completeXMLFilePath)

And the dataframe "df" gets loaded with the content of the "completeXMLFilePath" file from our blob storage.

Hope this article saves you some time, as I spent a couple of hours finding out how to attach the XML library to my clusters.

Useful Visio templates

A great part of my work has to do with documentation and diagramming. I'm always using various templates and I'll share them in this blog article.

Hope this helps you when you have to document your architecture!

Data Platform MVP 2016

I am proud and honored to have received my renewal as a Microsoft Most Valuable Professional (MVP) for an 8th consecutive year. Since 2009, I have had the pleasure of being part of a community that helps improve Microsoft's Data Platform products while being very generous in sharing knowledge with their peers and with the developer community around the world. I hope to be part of this community for many more years!

I'm glad and honoured to be awarded Data Platform Microsoft Most Valuable Professional (MVP) for another year. Since 2009, eight years now, I have had the pleasure of being part of a community that helps shape Microsoft Data Platform products while being very generous in their involvement with their fellow MVPs and the worldwide development community. I hope to be part of it for a long time!

Attend the 24HOP of the French-speaking PASS virtual group

I am joining Isabelle Van Campenhoudt (@isabellevancampenhoudt) in inviting you to the first 24 HOP (Hours of PASS) of our French-speaking virtual group. We are proud to have organized this event. Isabelle and I have selected high-calibre speakers from all over the worldwide French-speaking community. The session schedule is now available, so I invite you to register for the sessions that interest you; it's free!

 

Christian Coté

#24HOPfrancophone

Disable SSIS packages from Entry Point packages using expressions

For years I have been controlling how tasks in event handlers or execute package tasks were called by using empty sequence containers and expressions. Here is an example of a package call I used to do:

My package has a parameter named CallPackage1 as shown in the picture below:

The above parameter is then used in the path's precedence constraint expression to determine whether the package will be called or not:

 

So, whenever the parameter was set, the package would execute or not. Since SSIS 2008, we have had the ability to disable the package execution without using a sequence container like the one illustrated above: we simply set an expression on the "Disable" property of the execute package task, as shown in the screenshot below.

The execute package task now has an adornment in SSIS 2012 and above telling us that an expression is used. In previous versions, we had to install BIDS Helper to see such an adornment. Even if we set the parameter "CallPackage1" to true, the execute package task will not appear disabled unless we close and reopen the package.

Before package execution:

After the value was changed and the package executed, or after it was closed and reopened:

 

So, what's the advantage of using the execute package task's "Disable" property instead of having a sequence container above the execute package tasks? Using a sequence container leads to clarity: we clearly see that the package is called based on a parameter's value. When using the Disable expression, it can be harder to see what the task's expression is and what it does. But when we have particularly large entry point packages (master packages, i.e. packages that call child packages), it's easier to use the Disable property expression of the execute package task, because having all tasks linked to a sequence container leads to a noisier control flow and doesn't enhance package clarity in that case. As Mies van der Rohe said: "Less is More".
