This SQLCMD collation change script saved my day!

I was faced today with a SQL Server collation issue. The new SQL Server instance we had set up was using the wrong collation at both the instance level and the database/table/column level. This blog post describes how I fixed both issues.

Changing the instance-level collation

This was the easiest task. I followed a well-documented method that can be found here. This is the command I used:

Setup /QUIET /ACTION=REBUILDDATABASE /INSTANCENAME=mssqlserver /SQLSYSADMINACCOUNTS="<MyMachine\SysadminUser>" /SAPWD=<MySysadminPassword> /SQLCOLLATION=Latin1_General_100_CI_AI

As indicated in the command, we're rebuilding the system databases here. We have to run it with a sysadmin account since we're modifying the core databases of our instance.

Changing the database collation

This is the most challenging part of the process. It's not enough to change the collation at the database level; we also have to change the collation of all tables/columns/keys/indexes to complete the collation change.
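To illustrate the two levels involved, here is a minimal T-SQL sketch; the database, table and column names are hypothetical, and in practice any constraints or indexes on the column must be dropped and recreated around the ALTER:

```sql
-- Changing the database-level collation only affects objects created afterwards:
ALTER DATABASE MyDatabase COLLATE Latin1_General_100_CI_AI;

-- Existing character columns keep their old collation and must be altered
-- one by one (hypothetical table/column shown):
ALTER TABLE dbo.Employee
    ALTER COLUMN [Name] NVARCHAR(50)
    COLLATE Latin1_General_100_CI_AI NOT NULL;
```

Multiply that second statement by every character column in every table and you can see why doing it by hand does not scale.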

The first technique is to clone the database structure and copy over all the data. One challenge here is copying the data in the right order, since the database can have complex primary key/foreign key relationships: we have to perform the insertions in the right order. This gets harder when the database has hundreds of tables, and it quickly becomes a long and tedious task.

Since my time was limited (life is always too short in IT), I had to find a more efficient way of changing the collation of all my database objects. I did some web searches, figuring I was surely not the first one to face this challenge. After several searches, I found a script that solved my problem. Here is the link to the MSDN article. It is dated 2006, but collation conversion hasn't changed much since then.

In short, kudos to Igor; he saved me a lot of time, and the script ran in under 5 minutes in my case. Since I had to convert 30+ databases, all of them were converted in half a day, which was tremendous for me and my client!

Thanks again Igor!


Azure Data Factory book on Kindle

I’m glad to announce that the book I’ve been working on with Cem Demircioglu (@cemdemircioglu) is on pre-sale at Amazon:

We're in the revision phase of the book, and it should be available well before the advertised date (January 2016). Writing a book is surely a lot of work, but the knowledge you get out of it is invaluable!

#Azure, #SSIS

Lookups in SSIS, fail on duplicate option

While writing an earlier article on how to comment SSIS packages, I noticed a property I had never noticed before: we can make a lookup fail when there are duplicates in the reference set.

Suppose we have a small package like the one in Figure 1.

Figure 1

We have a source and a lookup; the lookup simulates a dimension that has duplicate reference column values. Here is the reference set returned by the dimension.

Figure 2

The name "Christian" appears twice with two different IDs. This should never happen in dimensions, but it might be that other columns make the rows unique, e.g. different last names. In the example above, though, I intentionally created duplicate values for the purposes of this article. The lookup matches on the name column and returns the associated ID. Figure 3 is taken from the "Columns" tab of the lookup "lkp_DimEmployee_name" shown in Figure 2.

Figure 3

What happens when we have duplicate reference values ("Christian" in this case) is that SSIS issues a warning at the execution of the package. Here is the warning from the execution pane in SSIS:

[lkp_DimEmployee_name [5]] Warning: The lkp_DimEmployee_name encountered duplicate reference key values when caching reference data. This error occurs in Full Cache mode only. Either remove the duplicate key values, or change the cache mode to PARTIAL or NO_CACHE.

SSIS is warning us that the returned ID might not be consistent from one execution to another, or that the lookup might not find anything, because there are duplicate values in the column we use to look up the ID. It can return 1 or 2 depending on several factors, like how many rows we retrieve from SQL Server and in what order SQL Server reads those rows. Let's say we want to retrieve only the first row, and that first row is the one with the minimum ID. Since lookups return only one row, we simply have to order our reference set by ID, with something like "ORDER BY ID ASC" in our SELECT statement. That way, we would be sure to retrieve a consistent result every time.
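As a sketch (assuming a hypothetical dbo.DimEmployee table), the lookup query can enforce this ordering, or go one step further and deduplicate the reference set entirely:

```sql
-- Deterministic: always cache the row with the lowest ID first
SELECT ID, [Name]
FROM dbo.DimEmployee
ORDER BY ID ASC;

-- Deduplicated: one row per name, keeping the minimum ID
SELECT MIN(ID) AS ID, [Name]
FROM dbo.DimEmployee
GROUP BY [Name];
```

With the GROUP BY variant, the duplicate warning disappears altogether, since the reference set no longer contains duplicate keys.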

But our real problem is that we have duplicate references (the name "Christian" appears twice), and this might lead to errors in data analysis down the road. Such duplication might be considered a fatal error in data warehouse loads. Therefore, there is a lookup property that will fail the package if there are duplicates in our reference set.

Figure 4

This property is set to False by default, and we simply get a warning like the one highlighted above when there are duplicates in the column used as a key in the lookup reference dataset. If we set it to True, the package will fail at execution no matter how we set "Specify how to handle rows with no matching entries", because the reference set is evaluated before the data flow executes, while the lookup cache is built (Figure 5).

Figure 5

This setting prevents misinterpretations of key assignment; the data flow will fail before executing, period. This might be annoying, but it is easily avoidable if we do some data cleansing or tune our lookup query. It's never good news when we have more than one value for a key column that should be unique. SSIS helps us prevent errors at key assignment, and this leads to accurate results in our data warehouse!

Commenting your SSIS data flows

SSIS data flows are very powerful and can sometimes be very complex and long. Therefore, like in any other program, we would like the ability to comment sections of the data flows. This post describes various ways to do this.


The obvious way to add comments or notes is to use annotations. Annotations can be used everywhere in a package: control flow, data flows and event handlers. They are efficient and easy to move around anywhere in the package.

Figure 1

Now, one feature I like a lot in SSIS data flows is the ability to re-arrange data flow objects by using the "Format" → "Auto Layout" → "Diagram" command from the top menu. This aligns objects correctly most of the time.

Figure 2

Now, when it comes to annotations, it doesn't work properly. As you can see in Figure 3 below, the annotation has moved to the bottom of the package. If we had used many of these annotations, how could we know which transform(s) they belong to? As of now, there's no way to tie an annotation to specific transforms. There is a Connect issue related to this that was closed in 2010.

Figure 3

Objects grouping

Since SSIS 2012, we have the ability to group objects in a data flow. We can therefore group the annotation with the lookup and get something similar to Figure 1 above. The result is shown in Figure 4 below:

Figure 4

Then, using the "Format" → "Auto Layout" → "Diagram" command from the menu will also re-format the grouped objects, but at least they are still grouped. However, the group is now moved to the bottom of the data flow.

Figure 5

So, we’re now back to square one basically.

Path annotation

Another option, which might be useful, is to use the annotation property of the various paths. The regular paths underneath the lookup have "Lookup Match Output" assigned by default.

Figure 6

This is because (as shown in Figure 6) the "PathAnnotation" property is set to "AsNeeded" and, by default, it displays the "SourceName" property. We can't change the "SourceName" property; it is read-only. We can change the "Name" property, though.

In Figure 7 below, I changed the “Name” property and set the “PathAnnotation” property to “Name”.

Figure 7

Now, instead of having “Lookup Match Output”, we have a more descriptive path annotation.


Objects “Description” property

Another way of commenting our data flow transforms or control flow tasks is to use the "Description" property of every object we want to comment on. To see the content of the "Description" property, we have to hover over the object, as shown in Figure 8.

Figure 8

By default, it contains "Look up values in a reference dataset by using exact matching.". We can change this using the property pane; we'll then see whatever we typed in the object's description.

Figure 9

This might be useful to see what’s happening in the designer but the downside is that we can’t really know if a comment exists unless we hover the mouse over it.

To sum up

As useful as annotations are, they cannot be tied to a specific task or transform. So every time the task or transform moves, we have to carry the annotation manually.

Grouping helps but it takes significant space on the package.

Path annotations give us more flexibility because we can see the annotation on the package and it is tied to the path. Also, since we changed the name of the path, that name can be referenced in the execution reports and logs. Still, it looks like a hack.

Descriptions are another option, but they are only visible when we hover the mouse over the object.

If Microsoft could let us tie annotations to a specific object, it would resolve many of the issues we talked about in this post. Let's hope they fix it in future versions.

À ne pas manquer/Don’t miss PASS SQL Saturday 466 Montreal

Hi, don’t miss the Montreal SQL Saturday this coming November 21st! Here is the link to register:

Registration is free; there's an optional fee ($10) for lunch if you're interested in eating at the event location.

The schedule for the event is available here:

Renewed as SQL Server MVP! I'm glad to be nominated MVP for the 7th time! @MVPAward @Microsoft

Use HCatalog in HDInsight

HCatalog adds a lot of functionality when we want to reuse schemas between processing tools (Pig, MapReduce). It greatly simplifies data consumption by exposing the table definitions stored in the Hive metastore. There's great documentation about using HCatalog here. Now, let's look at this scenario: a developer created an external table in Hive, and another wants to use it in Pig.

For example, a Pig developer wants to use the HiveSampleTable. This is a sample table that is created when you create an HDInsight cluster. Without HCatalog, the Pig developer would have to know where the data is stored and its structure. First things first, she needs to instruct Pig to use HCatalog with this switch when Pig is called:

C:\apps\dist\hadoop> %pig_home%\bin\pig -useHCatalog

The above command allows Pig to leverage the HCatalog.

Then, she can declare a variable that points to the HiveSampleTable in Hive.

SampleTable = LOAD 'HiveSampleTable' USING org.apache.hive.hcatalog.pig.HCatLoader();
2015-09-11 23:01:59,308 [main] INFO hive.metastore - Trying to connect to metastore with URI thrift://headnodehost:9083
2015-09-11 23:01:59,391 [main] INFO hive.metastore - Connected to metastore.
2015-09-11 23:02:00,058 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS

Now, if she calls a describe command, here is what she gets:

grunt> DESCRIBE SampleTable;
2015-09-11 23:03:58,829 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
SampleTable: {clientid: chararray,querytime: chararray,market: chararray,deviceplatform: chararray,devicemake: chararray,devicemodel: chararray,state: chararray,country: chararray,querydwelltime: double,sessionid: long,sessionpagevieworder: long}

We can clearly see here that we are leveraging the HCatalog metastore, since Pig has recognized the underlying file and its structure. Moving forward, we can interact with this variable without any knowledge of the underlying file location and schema.
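For instance, she can keep working against the table's column names without ever touching the underlying files. A hypothetical follow-up might look like this (the 'en-US' market value is an assumption for illustration):

```pig
-- Load through HCatalog, then filter and project by column name
SampleTable = LOAD 'HiveSampleTable' USING org.apache.hive.hcatalog.pig.HCatLoader();
USRows  = FILTER SampleTable BY market == 'en-US';
Devices = FOREACH USRows GENERATE devicemake, devicemodel;
DUMP Devices;
```

If the Hive developer later moves the external table's files, this script keeps working unchanged, since the location is resolved through the metastore.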

Happy HDInsight coding! :)


Azure Data Factory editor improvements

Azure Data Factory (ADF) is always improving. This post will talk about how the editor has improved since its introduction.

SHIFT+F10, get the help menu

Pressing SHIFT+F10, the following menu appears.

From the image above, you can see that several commands are available from this menu:

  • Go to Symbol: displays a list of elements in the JSON document; selecting one takes you to it.
  • Change All Occurrences: as the name indicates, changes all occurrences of the element you selected.
  • Format Code: corrects the indentation and aligns the code for you. If you close an artifact and re-open it, the formatting is applied automatically.
  • Command Palette: displays the available commands in ADF. Even though the shortcut key is F1, it doesn't behave like it does in desktop programs; it will not open another browser tab displaying the documentation. It only provides a list of commands.

Go to Symbol

This command allows you to navigate faster inside the JSON document. Press SHIFT+F10 and choose "Go to Symbol" from the menu.

A list of objects present in your JSON document appears:

You can see in the upper right corner of the sub-menu how many symbols your document has. Now we'll navigate to the "Availability" element in the document:

As we navigate the list using the keyboard's down arrow, the corresponding symbol is highlighted. Clicking on it with the mouse or pressing Enter moves the cursor to the symbol:

Notice that the symbol is selected and its definition or contextual help is displayed above it.


Change All Occurrences

This command allows you to rename all occurrences of a specific symbol.

In the example below, we'll change all "name" symbols to "test". It doesn't make sense to do this in real life, but I'm doing it only to show you how the command works.

In a real-world scenario, you would use this command to rename multiple occurrences of a structure element's name.

Format Code

This command is really simple; it resets the code formatting to make it more readable. For example, let's say we messed up the formatting of our structure elements.

Using “Format Code” command will re-format the JSON structure:


Command Palette

This feature is a kind of help, since it shows you all the available commands of the editor.

As you can see, there's a textbox at the top that allows you to search for a specific command. I typed the letter "t", and all commands containing that letter are displayed.


Now, let's see how we can use IntelliSense in ADF. Like in any .NET language, IntelliSense can be triggered by typing CTRL+Spacebar. For example, when we type structures in datasets using the ADF editor, we simply have to type "na" without completing the word to "name". Press TAB and voila, IntelliSense completes the tag for us.

When it comes to the data types, starting to type the name of a data type will bring up a list of data types containing what we have typed so far.

So now it’s easier to create our structures in datasets. Structures are important for several reasons and I’ll explain them in a future post.

Validation and help

The editor also validates as we type. For example, we need to provide a type for each element in a dataset's structure. Failure to provide one will lead to an error when we try to publish the dataset. In the image below, I highlighted the incomplete structure element.

In the case above, it’s clear that we have to provide a valid data type for the element.
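For reference, here is a sketch of what a complete dataset with a valid structure could look like in ADF (v1) JSON; the dataset name is taken from the example later in this post, while the linked service name, table name and columns are hypothetical:

```json
{
    "name": "AdventureWorksLTDW-Dataset",
    "properties": {
        "type": "AzureSqlTable",
        "linkedServiceName": "AdventureWorksLT-LinkedService",
        "structure": [
            { "name": "CustomerID", "type": "Int32" },
            { "name": "FirstName", "type": "String" }
        ],
        "typeProperties": { "tableName": "SalesLT.Customer" },
        "availability": { "frequency": "Day", "interval": 1 }
    }
}
```

Each entry in "structure" needs both a "name" and a valid "type"; leaving the "type" out is exactly the situation the editor flags above.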


Another example is when external policies are used in the dataset. If one of the policies is not correctly provided, the ADF editor will show a warning (green underline) where there's missing or non-conforming data.

As shown in the above screen clip, we are shown what the property should look like, along with the description of the property.

To sum up

The Azure Data Factory editor has come a long way since its introduction at the end of February. There are remaining caveats, like the fact that IntelliSense won't list user-created objects in the various artifacts. For example, in a pipeline, it would be nice if the editor could suggest the input or output of an activity:

In the above code, I highlighted what I would like the editor to integrate into IntelliSense: the capability to list available datasets (in this case, AdventureWorksLTDW-Dataset). We don't have this possibility for the moment. Probably in the future, though. Anyway, I hope so. :)

Anyway, I’m sure that everyone that uses the editor will be more productive with these enhancements.
