Monday 22 December 2008

SQL 2005 SP3 has landed

SP3 for SQL 2005 is now available for download.

Included are a number of bug fixes, performance enhancements for SSRS running in SharePoint integrated mode, improved rendering of PDF fonts in SSRS, support for SQL 2008 databases with SQL 2005 Notification Services, and support for Teradata databases as a data source for Report Models.

This Service Pack is cumulative and can be applied to any SQL 2005 instance, regardless of which service pack (if any) is already installed.
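If you're not sure what a given instance is already running, a quick way to check (works on any SQL 2005 instance) is:

-- Check the version, service pack level and edition of the current instance
SELECT SERVERPROPERTY('ProductVersion') AS ProductVersion,
       SERVERPROPERTY('ProductLevel') AS ProductLevel,   -- e.g. RTM, SP2, SP3
       SERVERPROPERTY('Edition') AS Edition;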

Wednesday 17 December 2008

PPS-P: Where, oh where, does my data go?

A user enters a value into a PerformancePoint Planning form... so what happens next? And why are we the least bit interested?

Firstly, to answer the why do we care question, ask yourself this...

  1. When do your extracts from PPS run (presuming you run them more often than, say, once overnight)?

  2. Do users (e.g. approvers) expect to see updates immediately, particularly at certain times of the cycle, i.e. month end?

  3. What data do reviewers, approvers and report consumers expect to see? Approved? Currency converted?

So we do care after all. So now to what actually happens. In fact, there are a few things that can happen. Let's examine each separately.

(1) The user selects the Save Privately action

  • The data is stored on the local file system - known as the Offline Cache. This location is usually C:\Documents and Settings\%username%\Local Settings\Application Data\Microsoft\PerformancePoint\OfflineCache
  • In the Offline Cache, the Catalog.xml file stores a list of all assignments stored locally
  • In the Offline Cache, each assignment is stored in a .dat file. The .dat file is an XML file which contains binaries.
  • Data is NOT sent to the SQL server or the cube
  • The status of the assignment is still 'started'
(2) The user selects the Submit Draft action
  • The data is stored in the SQL table called MG_ModelName_Default_Partition
  • The resulting values from Definition Rules are also stored in the MG table; however, they are linked to the 'No Assignment' assignment
  • Once the pre-configured polling period has passed, the cube called ApplicationLabel_ModelSiteLabel will be refreshed
  • The status of the assignment is 'Partial'
(3) The user selects the Submit Final action.
  • The data is stored in the SQL table called MG_ModelName_Default_Partition
  • The resulting values from Definition Rules are also stored in the MG table; however, they are linked to the 'No Assignment' assignment
  • Once the pre-configured polling period has passed, the cube called ApplicationLabel_ModelSiteLabel will be refreshed
  • The status of the assignment is 'Submitted'

(4) A data movement job occurs

  • The data is copied from the source model's SQL table (MG_SourceModelName_Default_Partition) to the destination model's equivalent table (MG_DestinationModelName_Default_Partition)
  • Data is aggregated when the scope of a dimension (in the data association) has been set to "All" on the source model.
  • All data is copied: Partial, Submitted, Approved
Then of course the question is, how long does all this take? Sadly, there's no simple answer to that question, as so aptly discussed by Peter Ebley. To paraphrase Peter, there are a few steps that occur:
  1. Send it to the server (Status = Pending)
    Depends on the size of the submission (How many rows? Is the workbook attached? How many dimensions?) and network bandwidth.

  2. Process the Queue (Status = WaitProcess)
    Depends on the Planning Process Poll Interval (recommendation is 1 minute) and how many items are in the queue and of course, how long it takes to validate the submission.

  3. Cube Processes
    Depends on the OLAP Cube Refresh Interval property in the Workflow tab of PAC.

Weighing in on Project Gemini

My reaction to Project Gemini to date has been lukewarm. Sure, self-service cubes in Excel sound sexy and look sexy... but... we don't let our users have free rein with data for a reason. These controls didn't just happen because we want to keep ourselves in a job (I hope!).

I'm gonna take a stab and suggest that every BI pro has walked into a job and found an environment where there are 2000 reports on a single system and some of them consume all the machine resources and take 10 minutes to execute. That sort of malarkey. So we did the hard slog, got the BAs' backs up and whipped everyone back into line. The users were happier because the reports executed quickly and were easy to find, you were happier because there were far fewer reports to maintain, the server team was happier because the box didn't have to be rebooted every day, and the BAs were happier because everyone else was happy (one hopes!).

This is why we put controls in place.

So. Doesn't this leave Project Gemini dead in the water in the real world?

Well, I read an interesting perspective from Patrick Husting which made me re-think my thoughts (no pun intended).

"People are always going to do self service reporting in some way whether we like it or not, no matter how good the underlying data is, so why not do it in a controlled manner where everything is audited and logged and the IT team has full visibility of what is going"

Ok so, to quote from a really crappy movie I saw recently "just because it is, doesn't mean it should be". But I think he might actually be right. To a point.

I really do hate to be a naysayer. And I'm not just saying that. But, I have a big BUT (please no jokes!).

Where is all this magical data that is being analysed coming from in the first place? From systems. Well, guess what: we don't let users have holus-bolus access to our databases, precisely to avoid the situations caused by our predecessors.

Actually that's not my only issue... users do NOT have a big picture view of the world. They don't care what other parts of the business are doing and they are not interested in how data aggregates upwards beyond their little fish pond. They are not interested in whether there is a single source of truth or if their version of the truth is different to someone else's. They are not interested in network bandwidth or performance at other sites.

Nor should they care about this stuff, that's not their job. It's ours.

It's all very well to say "everything is audited and logged" but aren't we supposed to operate preventatively not reactively?

So I have to ask, isn't that what something like ProClarity is for? To allow users to slice and dice data however they choose while ensuring that the structures and models have been put in place to consider the big picture?

You know what, I think I've just talked myself back to my original stance. It's sexy and I like it but I can't envision practical applications in the real world yet.

That said, I'm always happy to be proven wrong.

Tuesday 18 November 2008

PerformancePoint Planning - Enable/Disable Spreading

Contrary to the incorrect instructions on Office Online, to enable and disable spreading on a matrix you must do the following:

From the Advanced group on the PPS ribbon, you select Options and the SpreadingType drop-down allows you to enable/disable spreading.

Tuesday 11 November 2008

Using SAP as a Data Source

I was recently asked a question which I thought might be of value to everyone.

Can you please let me know...
1. Is share Point and Performance Point can be used in SAP BI or is it strictly Integrated with MircoSoft Technologies!!

So my response was...

I’m assuming you mean “can you use SAP as a data source?”

Firstly, SharePoint isn’t a BI technology as such. It’s a collaboration, portal & document management tool. However it does have the ability to create lists and simple KPIs, for which you could use SAP data as a source if you wanted to.

That said, I’m sure your SAP operations team would never allow you to link directly to SAP – you would need to extract into a database first.

Secondly, you can certainly use SAP as a data source for both PerformancePoint modules - Planning and Monitoring & Analyzing. However you would never link directly to the system from PerformancePoint. You would almost always extract from your transactional system into a data warehouse and that would be your data source (there may be exceptions from time to time but generally it is wise to separate your reporting engine and your transactional engine. There are numerous gains in doing so - security, performance, aggregation, modelling).

The other key consideration is licensing. SAP is pretty strict about licensing when it comes to extracting data so you would need to take that into consideration also.

Hope that answers your question!

Wednesday 5 November 2008

PerformancePoint Planning - Impersonate a Contributor during Form Development

It would be absolutely invaluable to have the ability to impersonate a 'contributor' during form development.

Currently, when developing a form, I have 2 sessions of Excel open - one as me and one as a contributor. This means I also need to create a cycle and an assignment, and republish every time I make a change to the form I'm working on - merely to validate that I am getting 'yellow cells' as expected (which is by no means a sure thang!).

A feature, similar to Analysis Services where I can impersonate a particular user/role would eliminate the need for all that extra effort - republishing, regenerating etc.

If you too would value this functionality, please vote for it at https://connect.microsoft.com/feedback/ViewFeedback.aspx?FeedbackID=380070&SiteID=181

PerformancePoint Planning - Object Names

Just a quick FYI....

In PerformancePoint Planning, objects are named as follows:

  • Cubes: ApplicationLabel_ModelSiteLabel
  • Databases: As specified in PAC during create
  • Application Label must be unique and cannot be changed
  • When connecting to an application, the Application Name is displayed

Tuesday 4 November 2008

Microsoft Azure.... the blue loo

Microsoft announced its Azure service last week. In their words, Azure is a tool that allows you to "Build new applications in the cloud - or use interoperable services that run on Microsoft infrastructure to extend and enhance your existing applications". From our perspective as BI developers, Azure hosts not just data but also Visual Studio solutions, i.e. development and deployment within the cloud.

Not that it's limited to Microsoft tools and environments of course.

So I immediately joined the CTP for SQL Services. I've downloaded the SQL Data Services SDK and have been placed on a waitlist for access to the cloud.

From what I can tell so far, you can use REST or SOAP for CRUD operations and LINQ for queries. Interestingly, there is no 'schema' as such - new properties can be added to entities as required - no need to add a new 'column'. What that means in practical terms I'm yet to find out. I suppose it will become apparent once I can access the cloud.

More info will follow shortly.

Tuesday 28 October 2008

What’s New in SQL 2008 – Part 3 – Table Type

The new table type allows you to save a table definition in the database, which can be reused later to define table variables and table-valued parameters for stored procedures and functions. To create a type, we use the CREATE TYPE statement, for example:

CREATE TYPE dbo.ProductList as TABLE (
    ProductName varchar(200) NULL,
    ProductCategory varchar(200) NOT NULL)


Important Notes:

  • The metadata is visible using sys.table_types (see the query sketch after these notes)
  • Variables/parameters cannot be set to NULL – if no table is supplied, an empty table is generated
  • Not affected by transaction rollbacks
  • Do not cause recompilations
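As a quick sanity check, once the dbo.ProductList type created above exists it should show up in the catalog view:

-- List user-defined table types and their underlying table object ids
SELECT name, type_table_object_id, is_user_defined
FROM sys.table_types;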


Examples

Create Table-Type Variables

DECLARE @t as dbo.ProductList;
INSERT INTO @t VALUES
('Grey Slate Paver 10x10','Paver'),
('Blue Slate Paver 10x10','Paver'),
('Heritage Blue 12x6','Brick')

Creating a Stored Procedure with a Table-typed Parameter

CREATE PROC dbo.uspGetOrdersforProduct (
@t as dbo.ProductList READONLY,
@StartDate as Date,
@EndDate as Date
) AS
SELECT o.OrderId,
o.OrderDate,
o.CustomerId,
o.ProductId,
o.Amount,
o.Qty
FROM tOrder as o
INNER JOIN tProduct t on t.ProductId = o.ProductId
INNER JOIN @t as f on f.ProductCategory = t.ProductCategory and (f.ProductName is NULL or f.ProductName = t.ProductName)
WHERE o.OrderDate between @StartDate and @EndDate

Passing Table-parameters into Stored Procedures

DECLARE @t as dbo.ProductList;
INSERT INTO @t VALUES
(NULL,'Paver'),
(NULL,'Brick')

EXEC dbo.uspGetOrdersforProduct @t, '20091001', '20091031' -- @StartDate and @EndDate are required; sample dates shown

What’s New in SQL 2008 – Part 2 – Data Types

Date and Time

  • DATE - 3 bytes - e.g. '2009-10-01' - Client providers: OLEDB DBTYPE_DBDATE, ADO.Net DateTime, ODBC SQL_DATE

  • TIME - 3-5 bytes - e.g. '13:12:00.1234567' - Accurate to 100 nanoseconds - Client providers: OLEDB DBTYPE_DBTIME, ADO.Net TimeSpan, ODBC SQL_TIME

  • DATETIME2 - 6-8 bytes - e.g. '2009-10-01 13:12:00.1234567' - Accurate to 100 nanoseconds; precision (decimal places) can be specified, e.g. DATETIME2(3) - Client providers: OLEDB DBTYPE_DBTIMESTAMP, ADO.Net DateTime, ODBC SQL_TIMESTAMP

  • DATETIMEOFFSET - 8-10 bytes - e.g. '2009-10-01 13:12:00.1234567+02:00' - Accurate to 100 nanoseconds; time zone aware - Client providers: OLEDB DBTYPE_DBTIMESTAMP, ADO.Net DateTime, ODBC SQL_TIMESTAMP


There are a number of new functions which can also be used with these new datatypes (see the sketch after this list), including:

  • SYSDATETIME (returns DATETIME2), SYSUTCDATETIME, SYSDATETIMEOFFSET (includes time zone in return value) – Current Date and Time
  • SWITCHOFFSET – to change the timezone for a DATETIMEOFFSET value
  • TODATETIMEOFFSET – convert non-timezone-aware values to time-zone-aware
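A quick sketch of these functions in action (results will obviously depend on when and where you run it, and the offsets below are just examples):

-- Current date/time via the new SQL 2008 functions, plus time zone conversions
SELECT SYSDATETIME() AS NowDateTime2,
       SYSUTCDATETIME() AS NowUtcDateTime2,
       SYSDATETIMEOFFSET() AS NowWithOffset,
       SWITCHOFFSET(SYSDATETIMEOFFSET(), '+02:00') AS SwitchedToPlusTwo,
       TODATETIMEOFFSET(SYSDATETIME(), '+10:00') AS MadeTimeZoneAware;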

Other

HIERARCHYID

Size (bytes): stored as VARBINARY, e.g. around 5 bytes for 100,000 records across 6 levels

Usage:

DECLARE @h hierarchyid = hierarchyid::GetRoot()
DECLARE @c hierarchyid
SET @c = @h.GetDescendant(NULL, NULL)
SELECT @c.ToString()

Returns: '/1/'

Notes: Represents a position in a hierarchy. It is actually a CLR data type with several methods, e.g.

  • GetDescendant/IsDescendantOf
  • GetLevel/GetRoot
  • GetReparentedValue
  • ToString/Read/Write (CLR)

What's New in SQL 2008 - Part 1 - Functionality


Declare & Initialise

DECLARE @iValue as integer = 100;

Compound Assignments

SET @iValue -= 50

UPDATE dimCustomer SET Balance += 200

The supported compound assignment operators are +=, -=, *=, /= and %=.

Multi-Row Inserts Using VALUES

INSERT INTO dimProduct (Label, BusinessKey, LastUpdatedBy, LastUpdatedDate) VALUES
('Grey Slate Paver 10x10',1111,'dbo','20091001 13:12:00'),
('Blue Slate Paver 10x10',1112,'dbo','20091001 13:12:00'),
('Black Slate Paver 10x10',1113,'dbo','20091001 13:12:00')

The insert is executed as a single transaction.

Interestingly, the VALUES row constructor can also be used as a derived table:

SELECT * FROM (VALUES
('Grey Slate Paver 10x10',1111,'dbo','20091001 13:12:00'),
('Blue Slate Paver 10x10',1112,'dbo','20091001 13:12:00'),
('Black Slate Paver 10x10',1113,'dbo','20091001 13:12:00')
) as p (Label, BusinessKey, LastUpdatedBy, LastUpdatedDate)

MERGE

MERGE INTO dbo.dimProduct as d
USING dbo.dimProductStage as s ON d.ProductKey = s.ProductId
WHEN MATCHED THEN
    UPDATE SET
        d.ProductName = s.ProductName,
        d.ProductCategory = s.ProductCategory
WHEN NOT MATCHED THEN
    INSERT (ProductName, ProductCategory)
    VALUES (s.ProductName, s.ProductCategory)
WHEN NOT MATCHED BY SOURCE THEN DELETE
OUTPUT $action,
    deleted.ProductId as DeletedId,
    inserted.ProductId as InsertedId;

Grouping Sets

SELECT Year(OrderDate) as OrderYear,
    SalespersonId,
    OrderType,
    count(OrderId) as OrderCount,
    sum(Amount) as OrderAmount
FROM factOrders
GROUP BY GROUPING SETS (
    (OrderType, Year(OrderDate)),
    (SalespersonId, OrderType, Year(OrderDate)),
    (SalespersonId, Year(OrderDate)),
    (Year(OrderDate)),
    ());

The empty parentheses () represent an 'ALL' (grand total) grouping.


Filtered Indexes

CREATE NONCLUSTERED INDEX idxOrders_Product
ON factOrders(OrderId,ProductId)
WHERE CreditFlag IS NULL

Change Data Capture


** to be discussed in further detail in an upcoming post **

Tracks data changes (inserts, updates and deletes) into change tables.
Enabled using sys.sp_cdc_enable_db and then sys.sp_cdc_enable_table (see the sketch below)
Disabled using sys.sp_cdc_disable_table and sys.sp_cdc_disable_db
Uses SQL Agent
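As a rough sketch only (the schema and table names here are made up, and you'll need appropriate permissions plus a running SQL Agent):

-- Enable CDC at the database level, then for a single table
EXEC sys.sp_cdc_enable_db;

EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name = N'factOrders',   -- hypothetical table name
    @role_name = NULL;              -- NULL = don't gate access behind a database role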


SharePoint Magazine - Part 2 available now

My next article for SharePoint Magazine is available now.

http://sharepointmagazine.net/technical/administration/sell-sell-sell-why-build-a-dashboard-anyway-part-2-of-6

Tuesday 14 October 2008

PerformancePoint Planning: Error When Deleting Members of a Member Set

Unfortunately I discovered a problem in PPS Planning today which caused me a few headaches... thought I would pass it on to you!! The issue occurs when editing/creating a dimension member set... a save will fail if you outdent members and then delete a member immediately afterwards. Let me demonstrate.


Let's presume you have a structure something like this:


--- Expenses
--- --- Salaries & Related
--- --- --- Salaries
--- --- --- --- Base Salary
--- --- --- --- Bonuses
--- --- --- --- Allowances
--- --- --- Tax Expense


But you want to remove the Salaries level and end up with a structure that looks like this:


--- Expenses
--- --- Salaries & Related
--- --- --- Base Salary
--- --- --- Bonuses
--- --- --- Allowances
--- --- Tax Expense


If you do this:

  1. Outdent the 3 leaf level members
  2. Remove the Salary level member
  3. Save

You will get an error when you save.

In order for it to work correctly, you MUST (in this order):

  1. Outdent the 3 leaf level members
  2. Save
  3. Remove the Salary level member
  4. Save

I have logged this problem on Microsoft Connect. Please feel free to vote! https://connect.microsoft.com/feedback/ViewFeedback.aspx?FeedbackID=375153&SiteID=181

Thursday 18 September 2008

SharePoint Magazine - Part 1 Out Now!

I've commenced a 6-part series on building PerformancePoint dashboards in SharePoint which you can read at SharePoint Magazine.

I'll be releasing a new part each week for the next 6 weeks covering the various aspects of building dashboards - including why bother at all!

Have a read if you're interested. Feedback so far has been excellent.

Monday 25 August 2008

Interesting Read

Great article comparing the PIVOT (SQL 2005+) function with doing it the old-fashioned way! Take a read...

http://www.sqlservercentral.com/articles/T-SQL/63681/

Thursday 3 July 2008

SQL: Concatenating Rows into a Scalar Value

This is an issue which pops up from time to time, often when passing values back to .Net applications.

The Problem
Taking values in rows and creating a single string value containing a delimited list of all those values.

The Old Solution
The method I have used in the past is to use the FOR XML clause and build a string that way. It works.

DECLARE @DepartmentName VARCHAR(1000)
SELECT @DepartmentName = (SELECT Name + ','
FROM AdventureWorks.HumanResources.Department
WHERE GroupName <> 'Executive General and Administration'
FOR XML PATH (''))
SELECT @DepartmentName AS DepartmentNames


The Sexier Solution
This idea comes courtesy of Ken Simmons at SQL Tips... so simple and I'm kind of annoyed that I've never tried it myself!

DECLARE @DepartmentName VARCHAR(1000)
SELECT @DepartmentName = COALESCE(@DepartmentName,'') + Name + ';'
FROM AdventureWorks.HumanResources.Department
WHERE GroupName <> 'Executive General and Administration'
SELECT @DepartmentName AS DepartmentNames


What Were the Actual Results from Testing?
Sadly, the sexier solution under-performs in comparison to the FOR XML solution.

Method / Milliseconds / Bytes Received / Rows Returned
FOR XML: 78 / 989 / 13
COALESCE: 265 / 1050560 / 921


Obviously this lag is due to the number of rows which are returned to the client. Based on the stats, the FOR XML approach does all its aggregation and concatenation server-side whereas the COALESCE method does it client-side.

So I guess that means the old way is still the best. Or is it? I wonder in what other circumstances COALESCE could be used? Ok so it under-performed in this instance but I suspect there are other uses which could be of great value. Got any ideas?

Tuesday 20 May 2008

Reporting Services: Passing MultiValue Parameters

I recently came across an article by Wayne Sheffield on SQL Server Central which contained a very neat idea for passing multi-value parameters from SSRS to a SQL stored proc by using XML.

Because SQL stored procs can't handle arrays, they can't handle parameters with multiple values. There are a few ugly ways around this of course, using delimiters and manipulating strings, but that just isn't pretty at all. Wayne's idea is to use XML string parameters.

So SSRS would send a string in the following format:


<root>
       <node>
              <element>element data</element>
       </node>
</root>


It would look something like this:


<customers>
       <customer>
              <customerid>1234</customerid>
       </customer>
</customers>

Wayne has written a bit of code which you can add to your report or create a DLL for which can then be referenced by your report.


Function ReturnXML(ByVal MultiValueList As Object, ByVal Root As String, ByVal Node As String, ByVal Element As String) As String
    '**************************************************************************
    ' Returns an XML string by using the specified values.
    ' Parameters: MultiValueList - a multi value list from SSRS
    '             Root, Node, Element - strings to use in building the XML string
    '**************************************************************************
    Dim ReturnString As String = ""
    Dim sParamItem As Object
    ReturnString = "<" & Root & ">"
    For Each sParamItem In MultiValueList
        ' Escape & and < in the data, then wrap it in the element and node tags
        ReturnString &= "<" & Node & "><" & Element & ">" & _
            Replace(Replace(sParamItem, "&", "&amp;"), "<", "&lt;") & _
            "</" & Element & "></" & Node & ">"
    Next
    ReturnString &= "</" & Root & ">"
    Return (ReturnString)
End Function


This code would be referenced in your Reporting Services parameter like:

=Code.ReturnXML(Parameters!MultiValue.Value, "Customers", "Customer", "CustomerId")

To then use your XML parameter within the stored proc:

Select CustomerId, CustomerName, ActiveFlag
From tCustomer a
INNER JOIN @ipCustomerList.nodes('/Customers/Customer') AS x(item) ON a.CustomerId = x.item.value('CustomerId[1]', 'integer')
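For completeness, here's a minimal sketch of the stored procedure shell that would accept the XML string (the proc name and columns are just placeholders):

CREATE PROC dbo.uspGetCustomersByList (
    @ipCustomerList XML
) AS
-- Shred the XML parameter and join it to the customer table
SELECT a.CustomerId, a.CustomerName, a.ActiveFlag
FROM tCustomer a
INNER JOIN @ipCustomerList.nodes('/Customers/Customer') AS x(item)
    ON a.CustomerId = x.item.value('(CustomerId)[1]', 'int')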


Pretty handy no?

Thursday 8 May 2008

Datamining Part II - Terminology

Datamining, like all other IT subjects, has its own lingo. This quick blog post will explain the key terms.

Datamining
Datamining attempts to deduce knowledge by examining existing data

Case
A case is a unit of measure.
It equates to a single appearance of an entity. In relational terms that would mean one row in a table. A case includes all the information relating to an entity.

Variable
The attributes of a case.

Model
A model stores information about variables, the algorithms used and their parameters, and the extracted knowledge. A model can be descriptive or predictive - its behaviour is driven by the algorithm which was used to derive it.

Structure
A structure stores models.

Algorithm
My definition here is from the perspective of datamining rather than a general definition. An algorithm is a method of mining data. Some algorithms are predictive (forecasting) and some are descriptive (showing relationships). Several algorithms ship with SQL Server 2005; the main ones are described below.

Neural Network
An algorithm designed to predict in a non-linear fashion, like a human neuron. Often used to predict outcomes based on previous behaviour.

Decision Tree
An algorithm which provides tree-like output showing paths or rules to reach an end point or value.

Naive Bayes
An algorithm often used for classifying text documents; it shows probability based on independent data.

Clustering
An algorithm which groups cases based on similar characteristics. Often used to identify anomalies or outliers.

Association
An algorithm which describes how often events have occurred together. Defines an 'itemset' from a single transaction. Often used to detect cross-selling opportunities.

Sequence
An algorithm which is very similar to the association algorithm except that it also includes time.

Time Series
An algorithm used to forecast future values of a time series based on past values. Also known as Auto Regression Trees (ART).

Cluster
A cluster is a grouping of related data.

Discrete
This is more a statistical term than a strictly datamining term, however it is used frequently - hence its inclusion here. Discrete refers to values which are not sequential and have a finite set of values eg true/false

Continuous
Continuous data can have any value in an interval of real numbers. That is, the value does not have to be an integer. Continuous is the opposite of discrete.

Outlier
Data that falls well outside the statistical norms of other data. An outlier is data that should be closely examined.

Antecedent
When an association between two variables is defined, the first item (or left-hand side) is called the antecedent. For example, in the relationship "When a prospector buys a pick, he buys a shovel 14% of the time," "buys a pick" is the antecedent.

Leaf
A node at its lowest level - it has no more splits.

Mean
The arithmetic average of a dataset

Median
The arithmetic middle value of a dataset

Standard Deviation
Measures the spread of the values in the data set around the mean.

Skew
Measures the symmetry of the data set ie is it skewed in a particular direction on either side of the mean

Kurtosis
Measures whether the data set has lots of peaks or is flat in relation to a normal distribution

Monday 5 May 2008

SQL Server Releases

Microsoft have announced that they are changing their approach to releases for SQL Server. This is interesting because SQL Server releases can be a touchy subject for businesses, particularly those with big server farms. Inevitably the development team wants the Service Pack to be installed ASAP whereas the server team is keen to protect their stable server and pretend service packs don't exist. This means a lot of pushing and shoving.

This new approach should help to alleviate the pressure a little but I'm not altogether convinced.

· Smaller Service Packs which will be easier to deploy
I suspect smaller service packs will make server teams less inclined to come to the party, because fewer inclusions per service pack inherently implies more service packs.

· Higher quality of Service Pack releases due to reduced change introduced
It's all very well to say that the quality is better but that's a very airy fairy 'benefit' which I can't imagine will go down very effectively with server teams as an argument for implementation. It's just not very quantitative which means server teams are likely to ignore it.

· Predictable Service Pack scheduling to allow for better customer test scheduling and deployment planning.
On this point, I demur. This can have a huge impact on getting releases implemented. Presuming of course that you can get your server team to operate on a scheduled release process themselves. It's all very well for the vendor to do it but if the server team doesn't ALSO do it, there's no gain. That said, I believe that such a process SHOULD be followed. I just don't see it as terribly likely. I fervently hope to be disproven.

It's really easy to be cynical about this approach and say 'my organisation will never do this'. Which is the trap I've fallen into here I realise, but the fact of the matter is, good on Microsoft for considering these issues and attempting to find ways to improve them. The approach is right and a positive move. Now the onus is on us to follow in their footsteps. This should be a wakeup call to server and development teams to find more common ground, to develop processes which satisfy everyone's needs and to communicate with each other better.

For more details:
http://blogs.msdn.com/sqlreleaseservices/archive/2008/04/27/a-changed-approach-to-service-packs.aspx

Tuesday 15 April 2008

Datamining – Reviewing the Data Source

I'm currently going through the Training Kit for Exam 70-445 – the MCTS SQL BI exam. And data-mining, joy of joys, an area where I'm not exactly at expert level, is the focus of my study. And of course, by all reports, data-mining makes up a large chunk of the exam.

In the book, they provide sample code for the 4 moments of a particular attribute which can be used to validate test vs training data in reviewing source data for a data mining project.

The Four Moments

Moments are individual measures, or parameters, used in the method of moments, which estimates distribution parameters so that samples of data can be compared. The moments most commonly used in data mining are:

Mean – the centre point of the data set
Standard Deviation – measures the spread of the values in the data set around the centre point
Skew – measures the symmetry of the data set ie is it skewed in a particular direction on either side of the centre point
Kurtosis – measures whether the data set has lots of peaks or is flat in relation to a normal distribution

Datasets

In reviewing source data, you need a training data set (to teach the patterns) and a test data set to validate whether that which was learnt is in fact correct!



In the code below, we are returned 2 rows – a row for each dataset - showing the 4 moments for both. If the values are fairly similar in both rows then we know the sampling is valid.

Sample Code

This example uses, from the AdventureWorksDW database:

  • the 'CustomerKey' column in order to identify records for sampling

  • the 'Age' column as the attribute which we are measuring the 4 moments against
--populate testing data
Select * into TK445_Ch09_TestSet
from vTargetMail
where rand(checksum(newid())%1000000000+CustomerKey) <.2

-- populate training data
Select * into TK445_Ch09_TrainingSet
from vTargetMail
where rand(checksum(newid())%1000000000+CustomerKey) <.2
except select * from TK445_Ch09_TestSet

--can be used to validate that datasets have been populated
--select 1,* from TK445_Ch09_TrainingSet union all select 2,* from TK445_Ch09_TestSet

SELECT
Mean = MIN(m.mean),
StdDev = Min(m.StdDev),
Skew = SUM(((Age*1.0-m.mean)/m.StdDev)*((Age*1.0-m.mean)/m.StdDev)*((Age*1.0-m.mean)/m.StdDev))*MIN(m.corrfact1),
Kurt = sum(square(square(((Age*1.0-m.mean)/m.StdDev)))) * MIN(m.corrfact2) - min(m.subfact)
FROM TK445_Ch09_TrainingSet v
CROSS JOIN (SELECT
mean = AVG(age*1.0),
stddev = STDEV(Age),
corrfact1 = count(*)*1.0 / (Count(*)-1) / (count(*)-2),
corrfact2 = count(*)*1.0 * (Count(*)+1) / (count(*)-1) / (COUNT(*)-2) / (COUNT(*)-3),
subfact = 3.0 * SQUARE((COUNT(*)-1)) / (COUNT(*)-2) / (COUNT(*)-3)
FROM TK445_Ch09_TrainingSet) m

UNION

SELECT
Mean = MIN(m.mean),
StdDev = Min(m.StdDev),
Skew = SUM(((Age*1.0-m.mean)/m.StdDev)*((Age*1.0-m.mean)/m.StdDev)*((Age*1.0-m.mean)/m.StdDev))*MIN(m.corrfact1),
Kurt = sum(square(square(((Age*1.0-m.mean)/m.StdDev)))) * MIN(m.corrfact2) - min(m.subfact)
FROM TK445_Ch09_TestSet v
CROSS JOIN (SELECT
mean = AVG(age*1.0),
stddev = STDEV(Age),
corrfact1 = count(*)*1.0 / (Count(*)-1) / (count(*)-2),
corrfact2 = count(*)*1.0 * (Count(*)+1) / (count(*)-1) / (COUNT(*)-2) / (COUNT(*)-3),
subfact = 3.0 * SQUARE((COUNT(*)-1)) / (COUNT(*)-2) / (COUNT(*)-3)
FROM TK445_Ch09_TestSet) m

Thursday 13 March 2008

Microsoft 70-556 Exam - MCTS Performance Point

Last week I completed and successfully passed the 70-556 exam - which gives me an MCTS PerformancePoint certification.

The exam was relatively simple - I scored highly in all areas but one. Obviously a weak point there on my part... my success in the other areas was due to a week and a bit of study using the Rational Guides to PerformancePoint (authored by my esteemed colleagues Adrian Downes and Nick Barclay) - the Planning volume and the Monitoring & Analyzing Volume. Much thanks to them!

The exam is strongly skewed towards Planning - only a few questions on Monitoring & Analysing - and the M&A questions were very simple. There are quite a few questions on Business Rules and Configuring Models & Dimensions - and these are complex areas so spend time on them in your study.

If you are doing the exam and using the Rational Guide books, don't ignore the Bonus chapters - you'll need them!

Next up is the 70-445 (MCTS SQL BI) which is much harder... I'm currently spending a lot of time with the Microsoft Training kit. Wish me luck...

Thursday 21 February 2008

Using TSQL Queries for AD Authentication

Yes you can directly query AD using TSQL queries. And it's REALLY easy.

Why, you ask? Well, LDAP queries can be painfully slow if you have to trawl through groups, e.g. if you are storing the results in a table or outputting them to screen or suchlike. I've found this method to be pretty reliable and reasonable with regards to performance.

First things first… on a SQL server, you need to create a linked server to AD as follows:

EXEC master.dbo.sp_addlinkedserver @server = N'ADSI', @srvproduct=N'Active Directory Service Interfaces', @provider=N'ADSDSOObject', @datasrc=N'adsdatasource'
GO
EXEC master.dbo.sp_serveroption @server=N'ADSI', @optname=N'collation compatible', @optvalue=N'false'
GO
EXEC master.dbo.sp_serveroption @server=N'ADSI', @optname=N'data access', @optvalue=N'true'
GO
EXEC master.dbo.sp_serveroption @server=N'ADSI', @optname=N'dist', @optvalue=N'false'
GO
EXEC master.dbo.sp_serveroption @server=N'ADSI', @optname=N'pub', @optvalue=N'false'
GO
EXEC master.dbo.sp_serveroption @server=N'ADSI', @optname=N'rpc', @optvalue=N'false'
GO
EXEC master.dbo.sp_serveroption @server=N'ADSI', @optname=N'rpc out', @optvalue=N'false'
GO
EXEC master.dbo.sp_serveroption @server=N'ADSI', @optname=N'sub', @optvalue=N'false'
GO
EXEC master.dbo.sp_serveroption @server=N'ADSI', @optname=N'connect timeout', @optvalue=N'0'
GO
EXEC master.dbo.sp_serveroption @server=N'ADSI', @optname=N'collation name', @optvalue=null
GO
EXEC master.dbo.sp_serveroption @server=N'ADSI', @optname=N'lazy schema validation', @optvalue=N'false'
GO
EXEC master.dbo.sp_serveroption @server=N'ADSI', @optname=N'query timeout', @optvalue=N'0'
GO
EXEC master.dbo.sp_serveroption @server=N'ADSI', @optname=N'use remote collation', @optvalue=N'true'
GO

That's all the configuration you have to do. You are good to go!

Below is an example of the TSQL syntax.

Note that the DC parts identify the domain in which the AD resides – in this instance it has three parts, DC=MyDomainName,DC=MyTopLevelDomain,DC=MyDomainCountryCode, e.g. DC=microsoft,DC=com,DC=au. The example below checks whether a particular user (MyUserName) is a member of a particular AD group (GAMyADGroupName); drop the sAMAccountName filter and it returns the usernames of all members of the group instead (see the variation after the query):

SELECT *
FROM OPENQUERY(ADSI,
'SELECT sAMAccountName
FROM ''LDAP://DC=MyDomainName,DC=com,DC=au''
WHERE objectCategory = ''Person''
AND objectClass = ''user''
AND memberOf = ''CN=GAMyADGroupName,OU=Other,OU=Global Groups,OU=Security Groups,DC=MyDomainName,DC=com,DC=au''
AND sAMAccountName = ''MyUserName''')
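And the variation mentioned above - to list the usernames of all members of the group rather than checking a single user, simply drop the sAMAccountName filter (same placeholder names as before):

SELECT *
FROM OPENQUERY(ADSI,
'SELECT sAMAccountName
FROM ''LDAP://DC=MyDomainName,DC=com,DC=au''
WHERE objectCategory = ''Person''
AND objectClass = ''user''
AND memberOf = ''CN=GAMyADGroupName,OU=Other,OU=Global Groups,OU=Security Groups,DC=MyDomainName,DC=com,DC=au''')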

Monday 18 February 2008

SSIS Synchronous/Asynchronous Script Transformations in a Data Flow

Asynchronous just means the transformation consumes all of its input rows before passing anything on to the next step in the data flow – unlike a synchronous transformation, where each row flows through to the output as it is processed. Script transformations can be written to behave either synchronously (like a Lookup) or asynchronously (like a Sort or Aggregate).

A synchronous script transformation is identified because the output is linked to a specified input. This is done in the Inputs and Outputs area of the task by entering the ID from the Input properties in the SynchronousInputID field of the Output properties. For each row of input, there will be one row of output.

An asynchronous script transformation is identified firstly because the SynchronousInputID field of the Output properties contains "None" or "0". When a new row is to be added, it is explicitly done using:

Output0Buffer.AddRow()

In order to aggregate data, flatten rows or concatenate data from different rows, you simply set a flag to identify when a new row should be added. To do this you need to declare a boolean in the class but outside the main sub:

Public Class ScriptMain
    Inherits UserComponent

    Private blnFirstRow As Boolean = True
    Dim opMessage1 As String = ""

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)

In this example, I'm taking all rows and flattening them into one row, so the first step in my sub is:

If blnFirstRow = True Then Output0Buffer.AddRow()
blnFirstRow = False
opMessage1 = opMessage1 & " " & Row.ipMessage.ToString

Note that I then concatenate a variable with a new value… Alternatively, you could set a flag so that a new row is added when certain conditions are met, such as when the value of ipCustomerNo changes.

One other thing to add - you need to use the PostExecute() sub to write to variables.

Tuesday 8 January 2008

VS Db Pro – Script Deployment - Database Context

When deploying databases using Db Pro the scripts which are executed have a database context. Visual Studio will automatically include a 'USE [YourDbName]' where YourDbName is taken from the project's BUILD property. This is the context for the scripts to execute. This is important because if you change it in any of the scripts it can mess with your deployment.

For example, say you create a script called 'AgentJobs.sql' in the Post-Deployment folder and add a reference to it in Script.PostDeployment.sql. This means your AgentJobs script will be executed when you do a deployment. Now suppose you add a 'USE MSDB' statement in that script: your deployment runs successfully, HOWEVER any scripts which execute after AgentJobs, e.g. Data.sql (scripts execute in alphabetical order), will now have MSDB as their context rather than [YourDbName] as expected.

So make sure that you use fully qualified references to objects in other databases and DON'T use 'USE'!
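For example, instead of switching context to msdb in a post-deployment script, reference its objects with three-part names (the job name below is purely illustrative):

-- AgentJobs.sql (post-deployment): no USE statement, fully qualified references only
EXEC msdb.dbo.sp_add_job @job_name = N'MyNightlyLoadJob';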