Showing posts with label Certification. Show all posts
Showing posts with label Certification. Show all posts

Tuesday, 15 April 2008

Datamining – Reviewing the Data Source

I'm currently going through the Training Kit for Exam 40-445 – the MCTS SQL BI exam. And data-mining, joy of joys, an area where I'm not exactly at expert level, is the focus of my study. And of course, by all reports, data-mining makes up a large chunk of the exam.

In the book, they provide sample code for the 4 moments of a particular attribute which can be used to validate test vs training data in reviewing source data for a data mining project.

The Four Moments

Moments are individual measures or parameters of the moment method. The moment method estimates and equates samples of data. The moments most commonly used in data mining are:

Mean – the centre point of the data set
Standard Deviation – measures the spread of the values in the data set around the centre point
Skew – measures the symmetry of the data set ie is it skewed in a particular direction on either side of the centre point
Kurtosis – measures whether the data set has lots of peaks or is flat in relation to a normal distribution

Datasets

In reviewing source data, you need a training data set (to teach the patterns) and a test data set to validate whether that which was learnt is in fact correct!



In the code below, we are returned 2 rows – a row for each dataset - showing the 4 moments for both. If the values are fairly similar in both rows then we know the sampling is valid.

Sample Code

This example uses, from the AdventureWorksDW database:

  • the 'CustomerKey' column in order to identify records for sampling

  • the 'Age' column as the attribute which we are measuring the 4 moments against
--populate testing data
Select * into TK445_Ch09_TestSet
from vTargetMail
where rand(checksum(newid())%1000000000+CustomerKey) <.2

-- populate training data
Select * into TK445_Ch09_TrainingSet
from vTargetMail
where rand(checksum(newid())%1000000000+CustomerKey) <.2
except select * from TK445_Ch09_TestSet

--can be used to validate that datasets have been populated
--select 1,* from TK445_Ch09_TrainingSet union all select 2,* from TK445_Ch09_TestSet

SELECT
Mean = MIN(m.mean),
StdDev = Min(m.StdDev),
Skew = SUM(((Age*1.0-m.mean)/m.StdDev)*((Age*1.0-m.mean)/m.StdDev)*((Age*1.0-m.mean)/m.StdDev))*MIN(m.corrfact1),
Kurt = sum(square(square(((Age*1.0-m.mean)/m.StdDev)))) * MIN(m.corrfact2) - min(m.subfact)
FROM TK445_Ch09_TrainingSet v
CROSS JOIN (SELECT
mean = AVG(age*1.0),
stddev = STDEV(Age),
corrfact1 = count(*)*1.0 / (Count(*)-1) / (count(*)-2),
corrfact2 = count(*)*1.0 * (Count(*)+1) / (count(*)-1) / COUNT(*)-2 / COUNT(*)-3,
subfact = 3.0 * SQUARE((COUNT(*)-1)) / (COUNT(*)-2) / (COUNT(*)-3)
FROM TK445_Ch09_TrainingSet) m

UNION

SELECT
Mean = MIN(m.mean),
StdDev = Min(m.StdDev),
Skew = SUM(((Age*1.0-m.mean)/m.StdDev)*((Age*1.0-m.mean)/m.StdDev)*((Age*1.0-m.mean)/m.StdDev))*MIN(m.corrfact1),
Kurt = sum(square(square(((Age*1.0-m.mean)/m.StdDev)))) * MIN(m.corrfact2) - min(m.subfact)
FROM TK445_Ch09_TestSet v
CROSS JOIN (SELECT
mean = AVG(age*1.0),
stddev = STDEV(Age),
corrfact1 = count(*)*1.0 / (Count(*)-1) / (count(*)-2),
corrfact2 = count(*)*1.0 * (Count(*)+1) / (count(*)-1) / COUNT(*)-2 / COUNT(*)-3,
subfact = 3.0 * SQUARE((COUNT(*)-1)) / (COUNT(*)-2) / (COUNT(*)-3)
FROM TK445_Ch09_TrainingSet) m

Thursday, 13 March 2008

Microsoft 70-556 Exam - MCTS Performance Point

Last week I completed and successfully passed the 70-556 exam - which gives me an MCTS PerformancePoint certification.

The exam was relatively simple - I scored highly in all areas but one. Obviously a weak point there on my part... my success in the other areas was due to a week and a bit of study using the Rational Guides to PerformancePoint (authored by my esteemed colleagues Adrian Downes and Nick Barclay) - the Planning volume and the Monitoring & Analyzing Volume. Much thanks to them!

The exam is strongly skewed towards Planning - only a few questions on Monitoring & Analysing - and the M&A questions were very simple. There are quite a few questions on Business Rules and Configuring Models & Dimensions - and these are complex areas so spend time on them in your study.

If you are doing the exam and using the Rational Guide books, don't ignore the Bonus chapters - you'll need them!

Next up is the 70-445 (MCTS SQL BI) which is much harder... I'm currently spending a lot of time with the Microsoft Training kit. Wish me luck...