SSIS: Removing HTML from a field using Script Component

Let’s say you are transferring data from the database of one program to the database of another program.  The problem is that one of the fields, say it’s a field called “usedescription” that is a data type of DT_NTEXT, has HTML in it because the program used that HTML formatting to display the text a certain way on it’s screen.  However, the new program doesn’t need that HTML and perhaps that HTML even causes issues.  So, how do you move the data from one database to the other and properly remove the HTML?

In SSIS, you can use the script component and the HTML Agility Pack.  The HTML Agility Pack is a free code library that you can use to parse HTML (or remove it as we are doing).  First, go to the following website to download the library:

http://htmlagilitypack.codeplex.com

Note that you can now get the library via Nuget.

Once you’ve downloaded the library to your downloads folder and unzipped it, note that the HTML Agility Pack contains sub-folders for the various versions of .Net and even WinRT and Windows Phone.  Since we are using SSIS on SQL Server 2014, copy the files in the .Net40 directory to the C:\Windows\Microsoft.NET\Framework\v4.0.30319\ directory of the server that will be running your SSIS packages.  Next, open up your script component and set it in your data flow and add the field in question to the list of inputs (don’t forget to set it to Read/Write).  Click on Edit Script and when the code editor comes up add the HtmlAgilityPack.DLL to your list of references (you may have to browse for it to add it).

image

You will also need to add the namespace: using HtmlAgilityPack;

Here is the code in the script component with comments along the way (note that I took out the stuff we don’t need):

image

When running our SSIS package, the HTML Agility Pack removes unwanted HTML quite nicely.  Check out our before and after screenshots:

Before:

image

After:

image

Obviously, there is more clean-up to be done.  However, getting the HTML out was a good first step.

JamesNT

SSIS: Using Expressions to build SQL Statements for ADO.NET Connections

In SQL Server Integration Services, you can specify that an OLE DB Connection use a SQL Statement from a variable.

image

Using this approach, you can dynamically build SQL statements using the OLE DB Connection.  But what about ADO.NET?

image

It appears we have no way to dynamically build SQL statements when using ADO.Net providers.  And to think I’ve been standardizing on them.  On the other hand, maybe we do have a way.  I made a package with two variables.  One is a DateTime called LoadControl and the other is a string called strSQL.  I’m going to load a DateTime from a load control table into the LoadControl variable then use the LoadControl variable to build the WHERE clause of a SQL statement to pull out all medical claims with a date of service greater than or equal to the LoadControl date.  First, our variables.

image

Nothing hard about that.  Next, our Execute SQL task to populate the LoadControl variable.

image

image

Next up, our data flow. 

image

The only thing I’ve done is set up the ADO.Net Source.  Now we need to get our strSQL variable populated.  First, be certain to set the property EvaluateAsExpression to TRUE for strSQL.

image

Next, create an expression for this variable like so.  Notice that since DateTime variables cannot be NULL when you create them, SSIS fills in the current DateTime, hence the 5/3/2015 7:38:04 PM.

image

Now the interesting part.  From the Control Flow, single-left click your data flow to highlight it.  Now, look over at your Properties for your data flow.  Scroll down to the Misc. section. 

image

That’s right, you see the SQL statement for the ADO.Net source.  Of course, this is where it is important to call your connection sources something meaningful so you can find them readily (I didn’t bother since we only have one).    Notice that we have two spots for the ADO.Net source:  SQL Command and TableOrViewName.  We aren’t going to change the SQL statement there.  Rather, go down further until you see Expressions.  That’s right, build an expression.

image

image

Notice that for this expression, we only need the strSQL variable.  Once you have that saved, put a data viewer on your package and run it.

image

Notice that only dates for 4/10/2015 or higher are shown (I added an Order By to the SQL Statement in strSQL and 4/10/2015 is what was in our LoadControl Table).  This is where our Expression was evaluated and placed in for the SQL Command of the ADO.Net source.  One thing of note, notice how before I set up a SQL statement in the ADO.NET Connection when I first created it earlier.  This statement is IGNORED when the Expression is evaluated.  However, if the SQL Statement in your Expression adds or changes columns, you may need to go into the Advanced Editor of the ADO.NET Connection and click Refresh to get those changes to show.  Otherwise, your new or changed columns may not show up in the data flow right away.

image

With this approach, you can still dynamically build SQL statements for ADO.NET Connections like you can OLE DB Connections.  A little more work, yes, but I think worth it when you have lots of Script Tasks/Components that need to use Connection Managers.

JamesNT