I'm currently going through the Training Kit for Exam 40-445 – the MCTS SQL BI exam. And data-mining, joy of joys, an area where I'm not exactly at expert level, is the focus of my study. And of course, by all reports, data-mining makes up a large chunk of the exam. In the book, they provide sample code for the 4 moments of a particular attribute which can be used to validate test vs training data in reviewing source data for a data mining project. The Four Moments Mean – the centre point of the data set Datasets In reviewing source data, you need a training data set (to teach the patterns) and a test data set to validate whether that which was learnt is in fact correct! In the code below, we are returned 2 rows – a row for each dataset - showing the 4 moments for both. If the values are fairly similar in both rows then we know the sampling is valid. Sample Code This example uses, from the AdventureWorksDW database:
Moments are individual measures or parameters of the moment method. The moment method estimates and equates samples of data. The moments most commonly used in data mining are:
Standard Deviation – measures the spread of the values in the data set around the centre point
Skew – measures the symmetry of the data set ie is it skewed in a particular direction on either side of the centre point
Kurtosis – measures whether the data set has lots of peaks or is flat in relation to a normal distribution
--populate testing data
Select * into TK445_Ch09_TestSet
from vTargetMail
where rand(checksum(newid())%1000000000+CustomerKey) <.2
-- populate training data
Select * into TK445_Ch09_TrainingSet
from vTargetMail
where rand(checksum(newid())%1000000000+CustomerKey) <.2
except select * from TK445_Ch09_TestSet
--can be used to validate that datasets have been populated
--select 1,* from TK445_Ch09_TrainingSet union all select 2,* from TK445_Ch09_TestSet
SELECT
Mean = MIN(m.mean),
StdDev = Min(m.StdDev),
Skew = SUM(((Age*1.0-m.mean)/m.StdDev)*((Age*1.0-m.mean)/m.StdDev)*((Age*1.0-m.mean)/m.StdDev))*MIN(m.corrfact1),
Kurt = sum(square(square(((Age*1.0-m.mean)/m.StdDev)))) * MIN(m.corrfact2) - min(m.subfact)
FROM TK445_Ch09_TrainingSet v
CROSS JOIN (SELECT
mean = AVG(age*1.0),
stddev = STDEV(Age),
corrfact1 = count(*)*1.0 / (Count(*)-1) / (count(*)-2),
corrfact2 = count(*)*1.0 * (Count(*)+1) / (count(*)-1) / COUNT(*)-2 / COUNT(*)-3,
subfact = 3.0 * SQUARE((COUNT(*)-1)) / (COUNT(*)-2) / (COUNT(*)-3)
FROM TK445_Ch09_TrainingSet) m
UNION
SELECT
Mean = MIN(m.mean),
StdDev = Min(m.StdDev),
Skew = SUM(((Age*1.0-m.mean)/m.StdDev)*((Age*1.0-m.mean)/m.StdDev)*((Age*1.0-m.mean)/m.StdDev))*MIN(m.corrfact1),
Kurt = sum(square(square(((Age*1.0-m.mean)/m.StdDev)))) * MIN(m.corrfact2) - min(m.subfact)
FROM TK445_Ch09_TestSet v
CROSS JOIN (SELECT
mean = AVG(age*1.0),
stddev = STDEV(Age),
corrfact1 = count(*)*1.0 / (Count(*)-1) / (count(*)-2),
corrfact2 = count(*)*1.0 * (Count(*)+1) / (count(*)-1) / COUNT(*)-2 / COUNT(*)-3,
subfact = 3.0 * SQUARE((COUNT(*)-1)) / (COUNT(*)-2) / (COUNT(*)-3)
FROM TK445_Ch09_TrainingSet) m
Tuesday, 15 April 2008
Datamining – Reviewing the Data Source
Posted by Kristen Hodges at 9:36 am 1 comments
Labels: Certification, Data-Mining, SQL
Subscribe to:
Posts (Atom)