Sunday, December 14, 2008

Talking Pipelines : SSIS and Powershell

Of late I have been fiddling with Powershell quite a bit as I toil with SDS Shell. The simplicity with which one can build seemingly complex logic using Powershell really appeals to me and, inevitably, I have started to think of Powershell as a lightweight data integration tool. There are plenty of comparisons to be drawn with the data integration tool that I know best, SQL Server Integration Services (SSIS), and so I’m going to compare the two herein.

Powershell has the notion of a pipeline – a means by which data can be passed from one unit of work to another. SSIS has a pipeline that does the same thing. In Powershell those units of work are called cmdlets, in SSIS they are called components. There are some subtle differences between the two but ostensibly the Powershell pipeline does the same thing as the SSIS pipeline – it takes data from one place and puts it somewhere else.

Earlier today I watched a video on Channel 9 of a discussion with the chief Powershell dude, Jeffrey Snover [check out the video at Expert to Expert: Erik Meijer and Jeffrey Snover - Inside PowerShell]. There’s some great information in the video and I picked out the following quotes.

Firstly on the underlying engine:

“this is all running in an object-flow engine … it’s like a data flow engine except we have objects instead of data … we pass control to this first thing, it runs until it emits data, that data is then passed to the Powershell engine.” (4:40 in)

That sounds almost analogous to the SSIS pipeline engine to me. A SSIS pipeline also has things (called components) that emit data, and that data is passed to other components by the underlying engine. We are able to build our own custom Powershell cmdlets in .Net; similarly, we can build custom SSIS components in .Net as well.

Now onto the internal architecture of that engine:

“We have some buffering in here .. you can control the buffer … every [cmdlet] supports an output buffer and that is implemented by the engine” (9:20)

The Powershell pipeline engine uses buffers just like the SSIS pipeline engine does. And every cmdlet has an output buffer, just like SSIS components. Cool, the similarities are becoming uncanny! I should point out one big difference though: a Powershell cmdlet can have only one output buffer, whereas a SSIS component can have many of them.
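As a quick illustration of the buffering Snover describes: it is exposed to us via the -OutBuffer common parameter, which every cmdlet supports because (as he says) it is implemented by the engine rather than by each cmdlet. This is just a sketch; the directory used is arbitrary:

```powershell
# -OutBuffer tells the engine to accumulate objects in the upstream
# cmdlet's output buffer before handing the batch to the next cmdlet,
# instead of streaming them through one at a time.
Get-ChildItem C:\Windows -OutBuffer 9 |
    Select-Object -First 20 Name
```

You won’t see any difference in the output, only in how the engine schedules the hand-off between the two cmdlets, which is exactly the point: the buffering is an engine concern, just as it is in the SSIS pipeline.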

Not everything in Powershell is analogous to SSIS though. Another quote from the video:

“What Powershell does is it takes a look at the downstream [cmdlet] and it takes a look at its data requirements and the data we have … we’ll take a look at what data [the upstream cmdlet] has and whatever [the downstream cmdlet] needs and we do whatever casting or coercion is required to meet the data contract” (5:00)

That’s actually quite different from SSIS. The Powershell pipeline is doing type inference (using .Net reflection if you must know) at execution-time in order to determine how it should wire up its cmdlets, and the engine takes care of binding data between them; it’s a very functional approach because we tell the Powershell pipeline WHAT to do, not HOW to do it. That’s not the case with SSIS, which is wholly imperative; we tell SSIS exactly which datatypes it is going to be dealing with and we tell it exactly HOW it should wire them together. In this regard the Powershell pipeline and the SSIS pipeline differ in architecture and behaviour.
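You can see this coercion for yourself with a hypothetical cmdlet-style function (written here using the advanced function syntax from the Powershell v2 CTP; the function name is mine, not a built-in):

```powershell
# The parameter declares [string] input, yet piping in the Int32 42 works:
# the engine coerces 42 to the string '42' to satisfy the parameter's
# data contract, exactly as Snover describes.
function Get-TextLength {
    param([Parameter(ValueFromPipeline=$true)][string]$Text)
    process { $Text.Length }
}

42 | Get-TextLength    # emits 2, the length of the coerced string '42'
```

Try doing that in SSIS without an explicit Data Conversion component and you won’t get far.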

Another key difference between the two is that the Powershell pipeline has only one input and one output whereas the SSIS pipeline can have multiple inputs and multiple outputs. This really is one of the key differentiators of SSIS; you can simply do much more in the SSIS pipeline than you can in the Powershell pipeline.

Enough talking, let’s do a real comparison of these things. The following screenshot shows a SSIS dataflow (aka the SSIS pipeline) that simply reads in the contents of a file and then outputs it straight back out to another file:

[Image: SSIS dataflow with a flat file source connected directly to a flat file destination]

Not too difficult. In Powershell we define the same operation thus:

PS C:\> Import-Csv source.csv | Export-Csv destination.csv

We’re using two cmdlets, Import-Csv and Export-Csv, and we pass them the parameters they need to do their job. The vertical bar (|) that you see is very important – it means take the output of Import-Csv and pass it as input to Export-Csv. Note that we haven’t had to tell Powershell about any of the column data types of source.csv, nor which column in the source maps to which column in the destination. All that stuff is inferred by the pipeline engine. Very clever stuff indeed.
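If you want to see what the pipeline engine inferred, pipe the output into Get-Member (this assumes the source.csv from the example above exists):

```powershell
# Each row emitted by Import-Csv is a PSObject whose NoteProperties are
# named after the CSV's header columns - the "schema" the engine inferred
# without us declaring anything up front.
Import-Csv source.csv | Get-Member -MemberType NoteProperty
```

Compare that with SSIS, where the same column metadata would have had to be declared in the flat file connection manager at design-time.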

That’s an example of a pipeline operation in its simplest form. Let’s make it a bit more complicated by introducing a sorting operation. In SSIS that looks like this:

[Image: SSIS dataflow with a Sort transformation between the flat file source and destination]

And now in Powershell:

PS C:\> Import-Csv source.csv | Sort-Object -Property name | Export-Csv destination.csv

This Powershell command contains a new cmdlet, Sort-Object, which takes the objects output from Import-Csv, sorts them on a property called [name], and passes them on to Export-Csv. It’s so simple. I love it!!

At this stage you may be wondering why we would even bother with SSIS given how simple it is to do this stuff in Powershell but the reality is that most data integration tasks aren’t as simple as

import -> sort -> export

If the required logic is anything more complicated than “take data from one place, change it, and put it in one other place” then you’re going to have to resort to something a little more heavy-duty, and in a Microsoft shop SSIS is perfect for that.

There is another very good reason to use SSIS as well: performance. Let’s take a look at a video of the first SSIS pipeline from above operating on a 10MB file containing ~89,000 rows:

[Its likely that if you’re viewing this blog post in a newsreader then you won’t be able to see them in which case click through to the original blog post.]

The whole thing ran in approximately 7 seconds (it would have been even quicker had we run it outside of the design environment); we got some upward movement in CPU usage and memory remained flat. Let’s compare that to doing the same thing in Powershell:

In Powershell it took over 60 seconds and we got near-constant 100% CPU usage, so it’s clear that SSIS is significantly faster. Just to prove the point I ran some additional tests in which I measured the time taken by both pipelines to import and export files of varying sizes. The results are tabulated and charted below:
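For the record, the Powershell timings were gathered by hand, but here is a sketch of how you could reproduce them yourself using the built-in Measure-Command cmdlet (again assuming a source.csv to hand):

```powershell
# Measure-Command runs the scriptblock and returns a TimeSpan, so the
# elapsed duration can be read straight off the result.
$elapsed = Measure-Command {
    Import-Csv source.csv | Export-Csv destination.csv
}
$elapsed.TotalSeconds
```

Run it against files of increasing size and you should see the same broadly linear growth that the table below shows.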

File Size (MB)    2      4      6      8      10
SSIS              0.9    1.3    1.7    2.3    2.6
PS                7.6    15.1   22.9   31.3   37.5

[Chart: comparison of SSIS and Powershell pipeline execution times]

None of this should be surprising. We discussed earlier how Powershell has to infer the data types and the bindings at execution-time and none of that comes cheaply. SSIS of course is optimised to give us lightning fast throughput, that’s why we have to define the data types and the column bindings up front at design-time.

In conclusion, Powershell has many similarities to SSIS, but there are very few situations (if any) where there is not an obvious choice about which should be used. Use the right tool for the job; don’t use a screwdriver to try and hammer in a nail! For all you SSIS guys out there, I hope I’ve convinced you that Powershell could be a very, very useful string to add to your bow.

As always, I would welcome any comments.

-Jamie

Published 09 December 2008 22:02 by jamie.thomson
