Tuesday, 15 April 2008

Datamining – Reviewing the Data Source

I'm currently going through the Training Kit for Exam 40-445 – the MCTS SQL BI exam. And data-mining, joy of joys, an area where I'm not exactly at expert level, is the focus of my study. And of course, by all reports, data-mining makes up a large chunk of the exam.

In the book, they provide sample code for the 4 moments of a particular attribute which can be used to validate test vs training data in reviewing source data for a data mining project.

The Four Moments

Moments are individual measures or parameters of the moment method. The moment method estimates and equates samples of data. The moments most commonly used in data mining are:

Mean – the centre point of the data set
Standard Deviation – measures the spread of the values in the data set around the centre point
Skew – measures the symmetry of the data set ie is it skewed in a particular direction on either side of the centre point
Kurtosis – measures whether the data set has lots of peaks or is flat in relation to a normal distribution

Datasets

In reviewing source data, you need a training data set (to teach the patterns) and a test data set to validate whether that which was learnt is in fact correct!



In the code below, we are returned 2 rows – a row for each dataset - showing the 4 moments for both. If the values are fairly similar in both rows then we know the sampling is valid.

Sample Code

This example uses, from the AdventureWorksDW database:

  • the 'CustomerKey' column in order to identify records for sampling

  • the 'Age' column as the attribute which we are measuring the 4 moments against
--populate testing data
Select * into TK445_Ch09_TestSet
from vTargetMail
where rand(checksum(newid())%1000000000+CustomerKey) <.2

-- populate training data
Select * into TK445_Ch09_TrainingSet
from vTargetMail
where rand(checksum(newid())%1000000000+CustomerKey) <.2
except select * from TK445_Ch09_TestSet

--can be used to validate that datasets have been populated
--select 1,* from TK445_Ch09_TrainingSet union all select 2,* from TK445_Ch09_TestSet

SELECT
Mean = MIN(m.mean),
StdDev = Min(m.StdDev),
Skew = SUM(((Age*1.0-m.mean)/m.StdDev)*((Age*1.0-m.mean)/m.StdDev)*((Age*1.0-m.mean)/m.StdDev))*MIN(m.corrfact1),
Kurt = sum(square(square(((Age*1.0-m.mean)/m.StdDev)))) * MIN(m.corrfact2) - min(m.subfact)
FROM TK445_Ch09_TrainingSet v
CROSS JOIN (SELECT
mean = AVG(age*1.0),
stddev = STDEV(Age),
corrfact1 = count(*)*1.0 / (Count(*)-1) / (count(*)-2),
corrfact2 = count(*)*1.0 * (Count(*)+1) / (count(*)-1) / COUNT(*)-2 / COUNT(*)-3,
subfact = 3.0 * SQUARE((COUNT(*)-1)) / (COUNT(*)-2) / (COUNT(*)-3)
FROM TK445_Ch09_TrainingSet) m

UNION

SELECT
Mean = MIN(m.mean),
StdDev = Min(m.StdDev),
Skew = SUM(((Age*1.0-m.mean)/m.StdDev)*((Age*1.0-m.mean)/m.StdDev)*((Age*1.0-m.mean)/m.StdDev))*MIN(m.corrfact1),
Kurt = sum(square(square(((Age*1.0-m.mean)/m.StdDev)))) * MIN(m.corrfact2) - min(m.subfact)
FROM TK445_Ch09_TestSet v
CROSS JOIN (SELECT
mean = AVG(age*1.0),
stddev = STDEV(Age),
corrfact1 = count(*)*1.0 / (Count(*)-1) / (count(*)-2),
corrfact2 = count(*)*1.0 * (Count(*)+1) / (count(*)-1) / COUNT(*)-2 / COUNT(*)-3,
subfact = 3.0 * SQUARE((COUNT(*)-1)) / (COUNT(*)-2) / (COUNT(*)-3)
FROM TK445_Ch09_TrainingSet) m