Thursday, 8 May 2008

Datamining Part II - Terminology

Datamining, like all other IT subjects, has its own lingo. This quick blog post explains the key terms.

Datamining
Datamining attempts to deduce knowledge by examining existing data.

Case
A case is a unit of measure. It equates to a single appearance of an entity. In relational terms that would mean one row in a table. A case includes all the information relating to an entity.

Variable
An attribute of a case.

Model
A model stores information about variables, the algorithms used and their parameters, and the extracted knowledge. A model can be descriptive or predictive; its behaviour is driven by the algorithm which was used to derive it.

Structure
A structure stores models.

Algorithm
My definition here is from the perspective of datamining rather than a general definition. An algorithm is a method of mining data. Some methods are predictive (forecasting) and some are descriptive (showing relationships). SQL Server 2005 includes the seven algorithms described below.

Neural Network
An algorithm designed to predict in a non-linear fashion, like a human neuron. Often used to predict outcomes based on previous behaviour.

Decision Tree
An algorithm which provides tree-like output showing paths or rules to reach an end point or value.

Naive Bayes
An algorithm often used for classifying text documents. It calculates probabilities on the assumption that the input attributes are independent of one another.

Clustering
An algorithm which groups cases based on similar characteristics. Often used to identify anomalies or outliers.

Association
An algorithm which describes how often events have occurred together. It defines an 'itemset' from the items in a single transaction. Often used to detect cross-selling opportunities.
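To make the idea of items occurring together a bit more concrete, here is a minimal sketch in Python. It is not how SQL Server implements association mining; the transactions and the pair_counts function are made up for illustration, and it only counts pairs of items rather than itemsets of any size.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each set holds the items bought together once.
transactions = [
    {"pick", "shovel", "pan"},
    {"pick", "shovel"},
    {"pick", "lamp"},
    {"shovel", "pan"},
]

def pair_counts(transactions):
    """Count how often each pair of items appears in the same transaction."""
    counts = Counter()
    for items in transactions:
        # Sort so that a pair is always stored in the same order.
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return counts

counts = pair_counts(transactions)

# Support: the fraction of all transactions that contain the pair.
support = {pair: n / len(transactions) for pair, n in counts.items()}

for pair, n in counts.most_common():
    print(pair, n, support[pair])
```

Pairs with high support are the obvious candidates for cross-selling.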

Sequence Clustering
An algorithm which is very similar to the association algorithm except that it also takes the order of events over time into account.

Time Series
An algorithm used to forecast future values of a time series based on past values. Also known as Autoregressive Trees (ART).

Cluster
A cluster is a grouping of related data.

Discrete
This is more a statistical term than a strictly datamining term; however, it is used frequently, hence its inclusion here. Discrete refers to values which come from a finite set of distinct values rather than a continuous range, e.g. true/false.

Continuous
Continuous data can have any value in an interval of real numbers. That is, the value does not have to be an integer. Continuous is the opposite of discrete.

Outlier
Data that falls well outside the statistical norms of the other data. An outlier is data that should be closely examined.
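One common way to put a number on "well outside the statistical norms" is to count how many standard deviations a value sits from the mean. The sketch below assumes that approach; the find_outliers function, the threshold of 2 and the sample readings are my own choices for illustration, not a rule from any particular tool.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values are identical, so nothing can stand out.
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical readings with one value that clearly does not fit.
readings = [10, 12, 11, 13, 12, 11, 10, 95]
print(find_outliers(readings))
```

The flagged values are candidates for a closer look, not automatically errors.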

Antecedent
When an association between two variables is defined, the first item (or left-hand side) is called the antecedent. For example, in the relationship "When a prospector buys a pick, he buys a shovel 14% of the time," "buys a pick" is the antecedent.
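The percentage in a rule like this is usually called the confidence: of the transactions that contain the antecedent, the share that also contain the right-hand side. Here is a minimal sketch, assuming transactions are stored as Python sets; the confidence function and the purchase data are hypothetical and will not reproduce the 14% figure.

```python
def confidence(transactions, antecedent, consequent):
    """Share of transactions containing the antecedent that also contain the consequent."""
    # `antecedent <= t` is the subset test: every antecedent item is in t.
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        return 0.0
    with_both = [t for t in with_antecedent if consequent <= t]
    return len(with_both) / len(with_antecedent)

# Hypothetical purchases by prospectors.
purchases = [
    {"pick", "shovel"},
    {"pick"},
    {"pick", "lamp"},
    {"shovel"},
    {"pick", "shovel", "pan"},
]

rule_confidence = confidence(purchases, {"pick"}, {"shovel"})
print(f"buys a pick -> buys a shovel: {rule_confidence:.0%}")
```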

Leaf
A node at the lowest level of a tree; it has no more splits.

Mean
The arithmetic average of a dataset.

Median
The middle value of a dataset when its values are sorted.

Standard Deviation
Measures the spread of the values in the data set around the mean.

Skew
Measures the symmetry of the data set, i.e. whether it is skewed in a particular direction on either side of the mean.

Kurtosis
Measures whether the data set is sharply peaked or flat in relation to a normal distribution.
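The statistical terms above can be tried out with Python's standard statistics module. This is only a sketch: the skewness and excess_kurtosis helpers are my own, they use the simple population (moment-based) formulas, and they assume the data is not all the same value. Statistics packages often apply sample corrections and will give slightly different numbers.

```python
from statistics import mean, median, pstdev

def skewness(values):
    """Population skewness: 0 for symmetric data, positive for a long right tail."""
    mu, sigma = mean(values), pstdev(values)
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)

def excess_kurtosis(values):
    """Population kurtosis minus 3, so a normal distribution scores about 0."""
    mu, sigma = mean(values), pstdev(values)
    return sum(((v - mu) / sigma) ** 4 for v in values) / len(values) - 3

# Hypothetical symmetric data set.
data = [1, 2, 2, 3, 3, 3, 4, 4, 5]

print("mean:", mean(data))
print("median:", median(data))
print("standard deviation:", pstdev(data))
print("skew:", skewness(data))
print("kurtosis (excess):", excess_kurtosis(data))
```

A negative excess kurtosis means the data is flatter than a normal distribution, a positive one means it is more peaked.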
