Saturday, May 30, 2015

A diversion: Python

For almost ten years I've been an ardent devotee of the Python language, initially developed by Guido van Rossum at Centrum Wiskunde & Informatica in Amsterdam, the Netherlands. I learned it from the ground up during my participation in the GRAMPS project, motivated by the unfolding history of my remarkable family.
A few years back, I was approached by Raul Chong of IBM's Big Data University to help author an introductory book on the language. That book is nearing completion!
Along with the book, I was invited to give a two-hour introductory class on Python at Ryerson University. That class went very well, with more than 100 students hanging on to the end. Naturally two hours is hardly enough to do more than scratch the surface, especially for non-Computer Science students, many of whom had little programming experience. Still, I was very much energized by the opportunity and the response.
I was able to use Visual Studio 2013 Express with Python support added via NuGet as my demonstration IDE. Though there are some evident growing pains (especially with respect to debugging), it has the potential to become one of the best free Python IDEs available.

Update: Link to book: Getting Started with Python

Sunday, May 24, 2015

Integration Services, MySql and SET FMTONLY ON



Like many large organizations, the company I work for has a large, heterogeneous database environment.  We have at least four RDBMSs (five, if you count MS Access) in use to support the various lines of business.  That list includes multiple instances of SQL Server, Oracle, DB2 and MySql.  Some of those came as part of certain vendor packages.  This article concerns my adventures integrating data from MySql into SQL Server for reporting purposes.

One of the instances of MySql that I need to pull data from to populate a data warehouse on SQL Server is the database used by SANscreen, developed by Onaro and now part of the NetApp product suite.  It provides information on storage allocations and usage in our SAN. Developing an ETL process for this data load has been challenging -- even frustrating at times -- but very satisfying as the barriers we encountered along the way began to fall.

The first thing we did was install the MySql ODBC drivers into our development instance of SQL Server and verify that we could at least connect to the SANScreen database.  That worked and we could see the SANScreen tables under the Linked Servers node of System Objects in SSMS.  On to the queries!
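For reference, the linked server itself can be defined over the MySql ODBC driver roughly like this -- a sketch only, with the DSN name and credentials as placeholders rather than our actual configuration:

-- Create a linked server named SANSCREEN over a MySql ODBC system DSN (the DSN name is a placeholder)
EXEC master.dbo.sp_addlinkedserver
    @server     = N'SANSCREEN',
    @srvproduct = N'MySQL',
    @provider   = N'MSDASQL',
    @datasrc    = N'SANscreen_DSN';

-- Map local logins to a MySql account (credentials are placeholders)
EXEC master.dbo.sp_addlinkedsrvlogin
    @rmtsrvname  = N'SANSCREEN',
    @useself     = 'FALSE',
    @locallogin  = NULL,
    @rmtuser     = N'sanscreen_reader',
    @rmtpassword = N'********';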

We learned right away that we could not use SQL Server's four-part naming to query the MySql tables, since the MySql ODBC driver does not support the required interfaces.  So, we built views using OPENQUERY that work quite well.  A typical view looks like this:
CREATE VIEW MySqlTable AS
SELECT col1, col2, col3
FROM OPENQUERY(SANSCREEN, '
    SELECT col1, col2, col3
    FROM MySqlSchema.MySqlTable;
');
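With the view in place, the MySql data can be queried like any local table; a trivial example:

-- The OPENQUERY plumbing is hidden behind the view
SELECT col1, col2, col3 FROM dbo.MySqlTable;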
No problem!  We used the views to build ETL packages in Integration Services and the world was a happier place.  After a few runs, however, it became obvious that we were pulling largely the same data every night for some of the biggest tables.  With 20/20 hindsight we realized this was to be expected, since those tables are essentially data warehouse fact tables.  Only a few hundred or thousand rows were new for any given run.  The rest (millions of rows, in some cases) were from older transactions.  An incremental load approach was needed.

Writing incremental loads in Integration Services means that you need some column you can use to define the start of the new data.  That is commonly a column containing a transaction date or something that can be mapped to such a date.  You want to be able to restrict your query to rows that are new with respect to that column.  Then, you want to save the new, maximum value of that column to use for the next incremental load.  We found columns that mapped to dates and decided to use those.  Adding those dates to our query created a new challenge.
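The bookkeeping side of this is simple enough: a small control table can carry the high-water mark between runs.  Here is a sketch -- the table, column and staging-table names are illustrative, not our actual schema:

-- One row per source table, holding the highest date loaded so far (illustrative schema)
CREATE TABLE dbo.ETL_HighWaterMark (
    SourceTable  sysname  NOT NULL PRIMARY KEY,
    LastLoadDate datetime NOT NULL
);

-- Read the starting point for the next incremental load
DECLARE @StartDate datetime;
SELECT @StartDate = LastLoadDate
FROM dbo.ETL_HighWaterMark
WHERE SourceTable = 'MySqlTable';

-- After a successful load, advance the mark to the maximum date just loaded
UPDATE dbo.ETL_HighWaterMark
SET LastLoadDate = (SELECT MAX(datecol) FROM dbo.StagingTable)
WHERE SourceTable = 'MySqlTable';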

We didn't want to retrieve the whole table from MySql and then filter out the new rows.  That would defeat one of our key goals in the incremental load approach: only pull data from MySql that we need.  So, we needed to pass that date to the call to OPENQUERY.  Unfortunately, OPENQUERY does not support a variable as its second parameter.  Our nice, neat views would not do the job.  We needed new stored procedures.  The procedures have the general form:
CREATE PROCEDURE usp_MySqlTable
    @StartDate datetime = NULL
AS
BEGIN
    DECLARE @query nvarchar(max) = '
        SELECT col1, col2, col3
        FROM OPENQUERY(SANSCREEN, ''
            SELECT col1, col2, col3
            FROM MySqlSchema.MySqlTable
            WHERE datecol >= ''''{StartDate}'''';
        '');
    ';
    SET @query = REPLACE(@query, '{StartDate}', CONVERT(varchar(19), @StartDate, 120));
    EXEC sp_executesql @query;
END;
(Note that I like to declare a variable containing the whole query including place-holders for variables, then replace those variables before executing the query.  This way, I can see the whole query at a glance.  I find this easier to follow than alternatives that assemble the query from parts.  YMMV.)
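For a nightly run, the call is then just one line (the date below is only an example):

-- Pull only the rows added since the last successful load
EXEC dbo.usp_MySqlTable @StartDate = '2015-05-01 00:00:00';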

Voila!  A stored procedure that calls MySql with a variable start date!  Worked like a charm in all our unit tests.  So, now we went on to build SSIS packages that use the new procs.  More pain ensued...

Since we were no longer working with views, there had to be some way for Integration Services to figure out the metadata for the source adapter. Because we use OLE DB Sources in our packages, we changed the data access mode from Table or View to SQL Command. The commands look like:
EXEC usp_MySqlTable ?
and we put a variable holding a default starting date in the Parameters section.  (As an aside: for some reason, when you do this, the Integration Services designer cannot preview the results.  It throws an error, complaining that parameters were not supplied, even though they clearly are.)  Once we had changed the view to the call to the stored procedure, I clicked on Columns to make sure that everything was still OK. We waited...and waited...and waited some more...until the designer threw an error saying that the query had timed out!  What on earth was going on?  To find out, we fired up SQL Server Profiler to observe.  We found a sequence like this (reformatted for readability):
DECLARE @p1 int
SET @p1 = 0
EXEC sp_prepare @p1 OUTPUT, N'@P1 datetime', N'exec [dbo].[usp_MySqlTable] @StartDate = @P1;', 1
SELECT @p1
SET NO_BROWSETABLE ON
SET FMTONLY ON
EXEC sp_execute 1, '01-01-1992 01:01:01:000'
Of course, SSIS needs to know the metadata, hence the SET FMTONLY ON.  But where did that weird date come from?  We had not specified January 1, 1992 as the start date!  (Note that this remains a mystery.)  The consequence of the bogus date is that MySql is asked to return the entire table, all four million rows of it.  Since this is inside an OPENQUERY, SQL Server cannot know what the metadata is until something is returned.  MySql chose to process a large portion of the table before returning anything to SQL Server, hence the timeout.

What to do?  Well, we thought that if we could somehow limit the results returned by MySql, in spite of the bogus date in 1992, we might be able to avoid the timeout and make the package design less frustrating as well.  Reading through the MySql docs, we saw that there is a LIMIT clause available for MySql queries.  That's it, we thought!  Now we needed a way to include that clause in the query, but only conditionally, since we wanted it there only when FMTONLY was ON.  So how could we know that?  BOL was no help, but some Internet searches turned up a gem.


In a nutshell, when FMTONLY is ON, SQL Server tries all conditional branches in the stored procedure, looking for metadata that it can use.  That means that a batch like this:
SET FMTONLY ON;
DECLARE @fmtonly bit = 0;
IF 1 = 0 SET @fmtonly = 1;
will actually set the variable @fmtonly to 1 when FMTONLY is indeed ON.  What we wanted to do was add a LIMIT clause.  So, to begin with, we modified the query template in the stored procedure to:
DECLARE @query nvarchar(max) = '
    SELECT col1, col2, col3
    FROM OPENQUERY(SANSCREEN, ''
        SELECT col1, col2, col3
        FROM MySqlSchema.MySqlTable
        WHERE datecol >= ''''{StartDate}'''' /* LIMIT */;
    '');
';

Note that adding a comment to the query passed to MySql does not have any ill effects, and it gives us a place to insert a proper LIMIT clause.  So, before actually executing the dynamic SQL, we added this one line:
IF 1=0 SET @query = replace(@query, '/* LIMIT */', ' LIMIT 1');
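With that in place, the query that actually reaches MySql during the metadata-only pass looks roughly like this (note the mystery 1992 date that SSIS supplies):

-- What MySql receives when SSIS probes for metadata
SELECT col1, col2, col3
FROM MySqlSchema.MySqlTable
WHERE datecol >= '1992-01-01 01:01:01' LIMIT 1;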
The effect?  It worked!  Now, when SSIS does its FMTONLY run, only one row is returned from MySql, almost instantly.  No more lengthy waits and time outs!


Friday, May 22, 2015

Problems in SSIS land

Using SQL Server 2012, SSDT with Visual Studio 2012

There's an interesting article on MSDN, "OLE DB Source", discussing, among other things, how to use parameters when calling a stored procedure in an OLE DB Source component. The section "Specifying Parameters by Using Ordinal Positions" (about halfway down the page) shows an example using the AdventureWorks database. Unfortunately, the example doesn't work, at least not if you try to Preview the results.

Note that I want to use Preview as a sanity check that I've set things up correctly. Instead, it's driving me insane!

I set up my test exactly as described:
  1. OLE DB Connection Manager to AdventureWorks (I used the 2012 version for the test) 
  2. OLE DB Source Component using the Connection Manager in step 1 
  3. Data Access mode: SQL Command 
  4. SQL Command text: EXEC uspGetWhereUsedProductID ?, ? -- copied from the web page 
  5. Two variables created: User::StartProductID (int) and User::CheckDate (datetime), set to 0 and getdate(), respectively 
  6. Two parameters in the SQL query Parameters window: Parameter0 mapped to StartProductID, Parameter1 mapped to CheckDate 
  7. Back in the OLE DB Source Editor, clicked Preview ------->>> BOOM!  

"No value given for one or more required parameters."

 
So, it seems that the documentation is wrong (or incomplete) on this point. Curious to see what was going on, I fired up Profiler to trace the calls. I was surprised to see this sequence:


declare @p1 int
set @p1=0
exec sp_prepare @p1 output,N'@P1 int,@P2 datetime',N'EXEC uspGetWhereUsedProductID @P1, @P2',1
select @p1
exec sp_execute 1,0,'01-01-1992 01:01:01:000' 

So, I learned that the Preview did not use the variable values I had set. Why?

Since the ordinal positions method didn't work, I tried the second option, "Specifying Parameters by Using Names".

To do this, I changed my SQL command to:



EXEC uspGetWhereUsedProductID 
    @StartProductID=?,
    @CheckDate=?
;
 
 
And changed the mappings to:

@StartProductID --> User::StartProductID
@CheckDate --> User::CheckDate

Then I clicked Preview again. Same result ----> BOOM! Same error message, and it still didn't use my variable settings. So, this is totally messed up.

HOWEVER! When I actually execute the task under debug in SSDT, it runs fine, uses the variable values I set, etc.

So, the big questions:
  1. Why does Preview bomb? From what I can see in the trace, it shouldn't. That is, it properly prepares the call to the stored procedure (except for my variable values). FWIW, I copied the statements from Profiler and ran them fine in SSMS. So why does the same sequence fail in the SSDT Preview? 
  2. Why doesn't it use my variable values? 
  3. Where did the date '01-01-1992 01:01:01:000' come from? Is it just a historical reference to SQL-92?