Microsoft always made SSIS flexible enough to enable developers to develop their ETL with different techniques. This article will talk about an old debate on how SSIS should be used as an ETL tool. I am sharing here my own experience on ETL’ing with SSIS and its implication when it comes to migrate our ETL’s to Azure.
The “DTS” way: using SSIS as an orchestrator
DTS (Data Transformation Services) has been around from the end of 20th century to the mid-2000’s. It relied on ActiveX technology to move and transform data. ActiveX was not very performant. ETL developers used DTS most of the time as an orchestrator of database stored procedures. Pipelines were designed to call a series of stored procedures that were doing data transformations. It was ELT, not ETL.
We can still use SSIS the same way we used DTS: use the control flow to orchestrate stored procedures. But there are a few caveats with this approach:
- SSIS is not used to its to its full potential. Stored procedures still run on the database server and might create contention of resources on it.
- Harder to tune-up performances: from an SSIS perspective, a stored procedure is a black box. SSIS has no clue on what’s going on when it runs a stored procedure, it simply calls it. I’ve seen stored procedures called in a dataflow OLE DB some executing very poorly. These dataflows were very hard to debug when they jammed, performances were slowing down or maintain because majority of transformations (if not all) were done in the underlying stored procedure.
- Transformations logic are hidden: a dataflow task in SSIS exposes the transformations and it’s easy to understand the transformation flow.
- Database might require supplemental indexes or views to speed up or ease the ETL process.
Bottom line, if you use TSQL to do the ETL, using SSIS is a bit overkill as the tool can do the same kind of transformations and many more.
Using the SSIS Data Flow
The SSIS data flow task has a lot of transforms available in its toolbox. They can read data and write from a variety of data sources like databases, files, on-premises Hadoop files, etc. With the addition of Azure Feature Pack, data can be read/written in various format like Avro or Parquet to an Azure storage. Data flows have a rich set of optimized transformations that allow doing all kind of transformations like pivoting or unpivoting the data, adding or modifying columns in the pipeline or simply joining or splitting data.
Data Flows are transforming the data in memory, so the performance greatly improved. They use strong data types that helps optimize memory consumption throughout the various transformations’ usage. Between every transformation, we can view the transformed data by enabling data viewers. They help greatly to develop or debug a transformation. Using various transformations leads to changing the way we design our ETL’s: we spread out and expose our transformation logic on the Data Flow design pane. Doing it allows to make our package more self-documenting and much easier to maintain.
There is also a quite large commercial third party’s components that provides lots of useful extension to SSIS. IF you ever need to read or validate a JSON file, read and write to a SharePoint list, there are components available for all of these.
Bottom line, if you’re using SSIS, the Data Flow task is your friend when it comes to data transformations. Of course, you can still use TSQL when it makes more sense. Let’s say you have two OLE DB sources in your package: Source A brings 1 million rows and Source B 1000. Both sources ar pulling data from the same RDBMS and inner joining the two sources would result to a dataset of 50000 records. It make sense to join the two sources in a single OLE DB source. Otherwise, bringing 1 million rows to the pipeline and filtering down 950 000 would make SSIS doing the join much slower that using the database for the same job.
Cloud migration perspectives
Migrating SSIS vs TSQL stored procedure to the cloud, Azure, can be easily done using the two following techniques:
- Lift and Shift of the ETL programs in Azure and use them almost as is.
- Re-write the ETL programs and ensuring the migration embrace the full cloud scalability paradigm.
If you’re using TSQL, you have to male sure that the Azure database supports all code syntax you used when it was developed on an on-premises server. If you’re using SSIS, you will need Azure Data Factory with its SSIS Integration Runtime to run your packages in Azure almost as is.
Doing a simple lift and shift does not mean that we’re using the Azure cloud technologies at its full potential. A re-write is required for that. This can be done by three major ETL technologies in Azure:
- Azure Data Factory: using the mapping data flows of this technology, we can specify the joins and partitioning strategy that help distribute the ETL to many workers.
- Azure Synapse: Microsoft is extending what was known as Azure Data Warehouse to create a unified Analytics experience. This technology will leverage parallel ETL distribution as it always did but this time, Spark is added to the mix.
- Azure Databricks: This is a Spark specific technology, so it means that distributed processing is a no brainer for this one.
Now, if we use TSQL and want to migrate our ETL’s, we will have to reverse engineer our TSQL code and re-write the logic using one of the technologies stated above to ensure we’re fully using cloud at its full potential. If we’re using SSIS with the data flow, all the logic should be displayed in the package making the re-write much easier, especially if Data Factory and mapping data flows are leveraged. Data Factory data flows use many transformations found in SSIS data flows, so ETL logic migration should be very easy to do.
If Databricks is used, many TSQL can run leveraging SQL but since it’s not TSQL, some modifications to the code might be necessary as not all TSQL syntax is supported in Databricks. With SSIS, it means a complete re-write of the ETL. But since the logic is spread out in the data flows, it should be easy to migrate the sources and do transformations using DataFrames.
Bottom line, cloud migration is not a simple flip of the switch. Some work as to be done. Microsoft has a tool for that can greatly help us to asses the complexity for a Lift and Shift of our code: Data Migration Assistant. For a re-write, there’s no tool for it . I whish there was a tool that would script our SSIS packages to PySpark dataframe though. That would certainly be very helpful for lots a people.