Tuesday, 20 May 2008

Reporting Services: Passing MultiValue Parameters

I recently came across an article by Wayne Sheffield on SQL Server Central which contained a very neat idea for passing multi-value parameters from SSRS to a SQL stored proc by using XML.

Because SQL stored procs can't handle arrays, it can't handle parameters with multiple values. There are a few ugly ways around this of course by using delimiters and manipulating strings but that just isn't pretty at all. Wayne's idea is to use XML string parameters.

So SSRS would send a string in the following format:


<root>
       <node>
              <element>element data</element>
       </node>
</root>


It would look something like this:


<customers>
       <customer>
              <customerid>1234</customerid>
       </customer>
</customers>

Wayne has written a bit of code which you can add to your report or create a DLL for which can then be referenced by your report.


Function ReturnXML(ByVal MultiValueList As Object, ByVal Root As String, ByVal Node As String
       **************************************************************************
       Returns an XML string by using the specified values.
        Parameters:MultiValueList - a multi value list from SSRS
        Root, Node, Element - String to use in building the XML string
        **************************************************************************
        Dim ReturnString = ""
        Dim sParamItem As Object
        ReturnString = "<" & Root & ">"
        For Each sParamItem In MultiValueList
               ReturnString &= "<" & Node & "><" & Element & ">" & Replace(Replace(sParamItem,"&","&"),"<", "<") & ""
       Next
       ReturnString &= ""
       Return (ReturnString)
End Function


This code would be referenced in your Reporting Services parameter like:

ReturnXML(Parameters!MultiValue.Value, "Customers", "Customer", "CustomerId")

To then use your XML parameter within the stored proc:

Select CustomerId, CustomerName, ActiveFlag
From tCustomer a
INNER JOIN @ipCustomerList.nodes('/Customers/Customer') AS x(item) ON a.CustomerId = x.item.value('CustomerId[1]', 'integer')


Pretty handy no?

Thursday, 8 May 2008

Datamining Part II - Terminology

Datamining, like all other IT subjects has it's own lingo. This quick blog post will explain them.

Datamining
Datamining attempts to deduce knowledge by examining existing data

Case
A case is a unit of measure.
It equates to a single appearance of an entity. In relational terms that would mean one row in a table. A case includes all the information relating to an entity.

Variable
The attributes of a case.

Model
A model stores information about variables, the algorithms used and their parameters and extracted knowledge. A model can be descriptive or predictive - it's behaviour is driven by the algorithm which was used to derive it.

Structure
A structure stores models.

Algorithm
My definition here is from the perspective of datamining rather than a general definition. An algorithm is a method of mining data. Some methods are predictive (forecasting) and some are relative (showing relationships). 7 algorithms are included with SQL Server 2005.

Neural Network
An algorithm designed to predict in a non-linear fashion, like a human neuron. Often used to predict outcomes based on previous behaviour.

Decision Tree
An algorithm which provides tree-like output showing paths or rules to reach an end point or value.

Naive Bayes
An algorithm often used for classifying text documents, it shows probability based on independant data.

Clustering
An algorithm which groups cases based on similar characteristics. Often used to identify anomalies or outliers.

Association
An algorithm describes how often events have occured together. Defines an 'itemset' from a single transaction. Often used to detect cross-selling opportunities.

Sequence
An algorithm which is every similar to the association algorithm except that it also includes time.

Time Series
An algorithm used to forecast future values of a time series based on past values. Also known as Auto Regression Trees (ART).

Cluster
A cluster is a grouping of related data.

Discrete
This is more a statistical term than a strictly datamining term however it is used frequently - hence it's inclusion here. Discrete refers to values which are not sequential and have a finite set of values eg true/false

Continuous
Continuous data can have any value in an interval of real numbers. That is, the value does not have to be an integer. Continuous is the opposite of discrete.

Outlier
Data that falls well outside the statistical norms of other data. An outlier is data that should be closely examined.

Antecedent
When an association between two variables is defined, the first item (or left-hand side) is called the antecedent. For example, in the relationship "When a prospector buys a pick, he buys a shovel 14% of the time," "buys a pick" is the antecedent.

Leaf
A node at it's lowest level - it has no more splits.

Mean
The arithmetic average of a dataset

Median
The arithmetic middle value of a dataset

Standard Deviation
Measures the spread of the values in the data set around the median.

Skew
Measures the symmetry of the data set ie is it skewed in a particular direction on either side of the median

Kurtosis
Measures whether the data set has lots of peaks or is flat in relation to a normal distribution

Monday, 5 May 2008

SQL Server Releases

Microsoft have announced that they are changing their approach to releases for SQL Server. This is interesting because SQL Server releases can be a touchy subject for businesses, particularly those with big server farms. Inevitably the development team wants the Service Pack to be installed ASAP whereas the server team is keen to protect their stable server and pretend service packs don't exist. This means a lot of pushing and shoving.

This new approach should help to alleviate the pressure a little but I'm not altogether convinced.

· Smaller Service Packs which will be easier to deploy
I suspect smaller service packs will make server teams less inclined to come to the party because less inclusions on a per service pack basis inherently implies more service packs.

· Higher quality of Service Pack releases due to reduced change introduced
It's all very well to say that the quality is better but that's a very airy fairy 'benefit' which I can't imagine will go down very effectively with server teams as an argument for implementation. It's just not very quantitative which means server teams are likely to ignore it.

· Predictable Service Pack scheduling to allow for better customer test scheduling and deployment planning.
On this point, I demure. This can have a huge impact on getting releases implemented. Presuming of course that you can get your server team to operate on a scheduled release process themselves. It's all very well for the vendor to do it but if the server team doesn't ALSO do it, there's no gain. That said, I believe that such a process SHOULD be followed. I just don't see it as terribly likely. I fervently hope to be disproven.

It's really easy to be cynical about this approach and say 'my organisation will never do this'. Which is the trap I've fallen into here I realise, but the fact of the matter is, good on Microsoft for considering these issues and attempting to find ways to improve them. The approach is right and a positive move. Now the onus is on us to follow in their footsteps. This should be a wakeup call to server and development teams to find more common ground, to develop processes which satisfy everyone's needs and to communicate with each other better.

For more details:
http://blogs.msdn.com/sqlreleaseservices/archive/2008/04/27/a-changed-approach-to-service-packs.aspx

Tuesday, 15 April 2008

Datamining – Reviewing the Data Source

I'm currently going through the Training Kit for Exam 40-445 – the MCTS SQL BI exam. And data-mining, joy of joys, an area where I'm not exactly at expert level, is the focus of my study. And of course, by all reports, data-mining makes up a large chunk of the exam.

In the book, they provide sample code for the 4 moments of a particular attribute which can be used to validate test vs training data in reviewing source data for a data mining project.

The Four Moments

Moments are individual measures or parameters of the moment method. The moment method estimates and equates samples of data. The moments most commonly used in data mining are:

Mean – the centre point of the data set
Standard Deviation – measures the spread of the values in the data set around the centre point
Skew – measures the symmetry of the data set ie is it skewed in a particular direction on either side of the centre point
Kurtosis – measures whether the data set has lots of peaks or is flat in relation to a normal distribution

Datasets

In reviewing source data, you need a training data set (to teach the patterns) and a test data set to validate whether that which was learnt is in fact correct!



In the code below, we are returned 2 rows – a row for each dataset - showing the 4 moments for both. If the values are fairly similar in both rows then we know the sampling is valid.

Sample Code

This example uses, from the AdventureWorksDW database:

  • the 'CustomerKey' column in order to identify records for sampling

  • the 'Age' column as the attribute which we are measuring the 4 moments against
--populate testing data
Select * into TK445_Ch09_TestSet
from vTargetMail
where rand(checksum(newid())%1000000000+CustomerKey) <.2

-- populate training data
Select * into TK445_Ch09_TrainingSet
from vTargetMail
where rand(checksum(newid())%1000000000+CustomerKey) <.2
except select * from TK445_Ch09_TestSet

--can be used to validate that datasets have been populated
--select 1,* from TK445_Ch09_TrainingSet union all select 2,* from TK445_Ch09_TestSet

SELECT
Mean = MIN(m.mean),
StdDev = Min(m.StdDev),
Skew = SUM(((Age*1.0-m.mean)/m.StdDev)*((Age*1.0-m.mean)/m.StdDev)*((Age*1.0-m.mean)/m.StdDev))*MIN(m.corrfact1),
Kurt = sum(square(square(((Age*1.0-m.mean)/m.StdDev)))) * MIN(m.corrfact2) - min(m.subfact)
FROM TK445_Ch09_TrainingSet v
CROSS JOIN (SELECT
mean = AVG(age*1.0),
stddev = STDEV(Age),
corrfact1 = count(*)*1.0 / (Count(*)-1) / (count(*)-2),
corrfact2 = count(*)*1.0 * (Count(*)+1) / (count(*)-1) / COUNT(*)-2 / COUNT(*)-3,
subfact = 3.0 * SQUARE((COUNT(*)-1)) / (COUNT(*)-2) / (COUNT(*)-3)
FROM TK445_Ch09_TrainingSet) m

UNION

SELECT
Mean = MIN(m.mean),
StdDev = Min(m.StdDev),
Skew = SUM(((Age*1.0-m.mean)/m.StdDev)*((Age*1.0-m.mean)/m.StdDev)*((Age*1.0-m.mean)/m.StdDev))*MIN(m.corrfact1),
Kurt = sum(square(square(((Age*1.0-m.mean)/m.StdDev)))) * MIN(m.corrfact2) - min(m.subfact)
FROM TK445_Ch09_TestSet v
CROSS JOIN (SELECT
mean = AVG(age*1.0),
stddev = STDEV(Age),
corrfact1 = count(*)*1.0 / (Count(*)-1) / (count(*)-2),
corrfact2 = count(*)*1.0 * (Count(*)+1) / (count(*)-1) / COUNT(*)-2 / COUNT(*)-3,
subfact = 3.0 * SQUARE((COUNT(*)-1)) / (COUNT(*)-2) / (COUNT(*)-3)
FROM TK445_Ch09_TrainingSet) m

Thursday, 13 March 2008

Microsoft 70-556 Exam - MCTS Performance Point

Last week I completed and successfully passed the 70-556 exam - which gives me an MCTS PerformancePoint certification.

The exam was relatively simple - I scored highly in all areas but one. Obviously a weak point there on my part... my success in the other areas was due to a week and a bit of study using the Rational Guide to PerformancePoint (authored by my esteemed colleagues Adrian Downes and Nick Barclay). I have no experience with the product whatsoever but going through the samples in the book got me through. Much thanks to them!

The exam is strongly skewed towards Planning - only a few questions on Monitoring & Analysing - and the M&A questions were very simple. There are quite a few questions on Business Rules and Configuring Models & Dimensions - and these are complex areas so spend time on them in your study.

If you are doing the exam and using the Rational Guide, don't ignore the Bonus chapters - you'll need them!

Next up is the 70-445 (MCTS SQL BI) which is much harder... I'm currently spending a lot of time with the Microsoft Training kit. Wish me luck...

Thursday, 21 February 2008

Using TSQL Queries for AD Authentication

Yes you can directly query AD using TSQL queries. And it's REALLY easy.

Why you ask – well, LDAP can be painfully slow if you have to trawl through groups eg if you are storing the results in a table or outputting them to screen or suchlike. I've found this method to be pretty reliable and reasonable with regards to performance.

First things first… on a SQL server, you need to create a linked server to AD as follows:

EXEC master.dbo.sp_addlinkedserver @server = N'ADSI', @srvproduct=N'Active Directory Service Interfaces', @provider=N'ADSDSOObject', @datasrc=N'adsdatasource'
GO
EXEC master.dbo.sp_serveroption @server=N'ADSI', @optname=N'collation compatible', @optvalue=N'false'
GO
EXEC master.dbo.sp_serveroption @server=N'ADSI', @optname=N'data access', @optvalue=N'true'
GO
EXEC master.dbo.sp_serveroption @server=N'ADSI', @optname=N'dist', @optvalue=N'false'
GO
EXEC master.dbo.sp_serveroption @server=N'ADSI', @optname=N'pub', @optvalue=N'false'
GO
EXEC master.dbo.sp_serveroption @server=N'ADSI', @optname=N'rpc', @optvalue=N'false'
GO
EXEC master.dbo.sp_serveroption @server=N'ADSI', @optname=N'rpc out', @optvalue=N'false'
GO
EXEC master.dbo.sp_serveroption @server=N'ADSI', @optname=N'sub', @optvalue=N'false'
GO
EXEC master.dbo.sp_serveroption @server=N'ADSI', @optname=N'connect timeout', @optvalue=N'0'
GO
EXEC master.dbo.sp_serveroption @server=N'ADSI', @optname=N'collation name', @optvalue=null
GO
EXEC master.dbo.sp_serveroption @server=N'ADSI', @optname=N'lazy schema validation', @optvalue=N'false'
GO
EXEC master.dbo.sp_serveroption @server=N'ADSI', @optname=N'query timeout', @optvalue=N'0'
GO
EXEC master.dbo.sp_serveroption @server=N'ADSI', @optname=N'use remote collation', @optvalue=N'true'
GO

That's all the configuration you have to do. You are good to go!

Below is an example of the TSQL syntax.

Note that the DC is the domain in which the AD resides – in this instance it has three parts MyDomainName,DC=MyTopLevelDomain,DC=MyDomainCountryCode eg DC=microsoft,DC=com,DC=au. In this example, I am returning the username of all users who are a member of the AD group called GAMyADGroupName. This example checks if a particular user is a member of a particular AD group:

SELECT *
FROM OPENQUERY(ADSI,
'SELECT sAMAccountName
FROM ''LDAP://DC=csr,DC=com,DC=au''
WHERE objectCategory = ''Person''
AND objectClass = ''user''
AND memberOf = ''CN=GAMyADGroupName,OU=Other,OU=Global Groups,OU=Security Groups,DC=MyDomainName,DC=com,DC=au''
AND sAMAccountName = ''MyUserName''')

Monday, 18 February 2008

SSIS Synchronous/Asynchronous Script Transformations in a Data Flow

Asynchronous just means it consumes all rows before progressing to the next task in the data flow – unlike most other tasks where the entire data flow is executed for each row before the next row is commenced. Script tasks can be synchronous (eg Lookup) or asynchronous (eg Sort or Aggregate).

A synchronous script transformation is identified because the output is linked to a specified input. This is done in the Inputs and Outputs area of the task by entering the ID from the Input properties in the SynchronousInputID field of the Output properties. For each row of input, there will be one row of output.

An asynchronous script transformation is identified firstly because the SynchronousInputID field of the Output properties contains "None" or "0". When a new row is to be added, it is explicitly done using:

Output0Buffer.AddRow()

In order to aggregate data, flatten rows or concatenate data from different rows, you simply set a flag to identify when a new row should be added. To do this you need to declare a boolean in the class but outside the main sub:

Public Class ScriptMain

Inherits UserComponent

Private blnFirstRow As Boolean = True
Dim opMessage1 As String = ""

Public Overrides
Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)

In this example, I'm taking all rows and flattenning them into one row so the first step in my sub is:

If blnFirstRow = True Then Output0Buffer.AddRow()
blnFirstRow = False
opMessage1 = opMessage1 & " " & Row.ipMessage.ToString

Note that I then concatenate a variable with a new value… Alternately, you could set a flag so that a new row is added when certain conditions are met such as the value of the ipCustomerNo changes.

One other thing to add - you need to use the PostExecute() sub to write to variables.