Datamining, like all other IT subjects, has its own lingo. This quick blog post explains the most common terms.

**Datamining**

Datamining attempts to deduce knowledge by examining existing data.

**Case**

A case is a unit of measure.

It equates to a single appearance of an entity. In relational terms that would mean one row in a table. A case includes all the information relating to an entity.

**Variable**

The attributes of a case.
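
As a rough illustration, a case can be pictured as one record and its variables as the fields of that record. The `Customer` type and its field names below are made up for this example.

```python
from dataclasses import dataclass, fields


@dataclass
class Customer:
    """One case: a single appearance of a customer entity (hypothetical)."""
    customer_id: int
    age: int
    owns_home: bool


# Each instance is one case, equivalent to one row in a table.
cases = [
    Customer(customer_id=1, age=34, owns_home=True),
    Customer(customer_id=2, age=51, owns_home=False),
]

# The variables are the attributes that every case carries.
variables = [f.name for f in fields(Customer)]
print(variables)
```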

**Model**

A model stores information about variables, the algorithms used and their parameters, and the knowledge extracted. A model can be descriptive or predictive - its behaviour is driven by the algorithm which was used to derive it.

**Structure**

A structure stores models.

**Algorithm**

My definition here is from the perspective of datamining rather than a general definition. An algorithm is a method of mining data. Some methods are predictive (forecasting) and some are descriptive (showing relationships). Seven algorithms are included with SQL Server 2005.

**Neural Network**

An algorithm designed to predict in a non-linear fashion, like a human neuron. Often used to predict outcomes based on previous behaviour.

**Decision Tree**

An algorithm which provides tree-like output showing paths or rules to reach an end point or value.
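
To make the "paths or rules" idea concrete, here is a small sketch that walks a hand-built tree. It does not learn the tree from data, which is what the real algorithm does; the questions, outcomes and the `follow_path` helper are all invented for the example.

```python
# A hand-built tree (hypothetical). Each internal node asks one question;
# each end point, a plain string here, holds the final value.
tree = {
    "question": "owns_home",
    "yes": {
        "question": "age_over_40",
        "yes": "likely buyer",
        "no": "possible buyer",
    },
    "no": "unlikely buyer",
}


def follow_path(node, case):
    """Walk from the root to an end point, recording the rule followed."""
    path = []
    while isinstance(node, dict):
        answer = "yes" if case[node["question"]] else "no"
        path.append(f"{node['question']}={answer}")
        node = node[answer]
    return path, node


path, outcome = follow_path(tree, {"owns_home": True, "age_over_40": False})
print(" -> ".join(path), "=>", outcome)
```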

**Naive Bayes**

An algorithm often used for classifying text documents. It calculates probabilities on the assumption that the input variables are independent of one another.
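
A minimal sketch of the idea for text classification, assuming a tiny made-up training set and add-one smoothing. The documents, labels and the `classify` function are illustrative only and are not how any particular product implements the algorithm.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training documents: (text, label).
training = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project status meeting", "ham"),
]

word_counts = defaultdict(Counter)  # label -> word -> count
label_counts = Counter()            # label -> number of documents
vocabulary = set()

for text, label in training:
    label_counts[label] += 1
    for word in text.split():
        word_counts[label][word] += 1
        vocabulary.add(word)


def classify(text):
    """Return the label with the highest log-probability for the text."""
    scores = {}
    for label in label_counts:
        # Prior: how common the label is in the training data.
        score = math.log(label_counts[label] / sum(label_counts.values()))
        total_words = sum(word_counts[label].values())
        for word in text.split():
            # Each word contributes independently (the "naive" assumption).
            # Add-one smoothing avoids a zero probability for unseen words.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


print(classify("cheap pills"))
print(classify("monday meeting"))
```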

**Clustering**

An algorithm which groups cases based on similar characteristics. Often used to identify anomalies or outliers.
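
One well-known way to group cases is k-means. The sketch below applies it to a single variable; the ages and starting centres are made up, and this is not necessarily the method any given datamining product uses.

```python
def k_means_1d(values, centres, iterations=10):
    """Group numbers around the nearest centre, then move each centre to
    the mean of its group. A simple k-means sketch for one variable."""
    for _ in range(iterations):
        groups = [[] for _ in centres]
        for value in values:
            nearest = min(range(len(centres)),
                          key=lambda i: abs(value - centres[i]))
            groups[nearest].append(value)
        # Keep the old centre if a group ends up empty.
        centres = [sum(g) / len(g) if g else c
                   for g, c in zip(groups, centres)]
    return centres, groups


# Hypothetical customer ages; the starting centres are arbitrary guesses.
ages = [21, 23, 25, 61, 64, 66]
centres, groups = k_means_1d(ages, centres=[20.0, 30.0])
print(centres)
print(groups)
```

A case that sits far from every centre is a candidate anomaly.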

**Association**

An algorithm which describes how often events have occurred together. It defines an 'itemset' from a single transaction. Often used to detect cross-selling opportunities.
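
As a sketch of the counting involved, the snippet below tallies how often pairs of items appear in the same transaction. The transactions are invented, and "support" is the usual name for the resulting share rather than a term from this post.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; each one is the set of items bought together.
transactions = [
    {"pick", "shovel", "pan"},
    {"pick", "shovel"},
    {"pick", "pan"},
    {"shovel", "boots"},
]

pair_counts = Counter()
for items in transactions:
    # Every pair of items within a single transaction is a 2-item itemset.
    for pair in combinations(sorted(items), 2):
        pair_counts[pair] += 1

# Support: the share of transactions that contain the itemset.
support = {pair: count / len(transactions)
           for pair, count in pair_counts.items()}
print(support[("pick", "shovel")])
```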

**Sequence**

An algorithm which is very similar to the association algorithm except that it also takes the order of events over time into account.

**Time Series**

An algorithm used to forecast future values of a time series based on past values. Also known as Auto Regression Trees (ART).

**Cluster**

A cluster is a grouping of related data.

**Discrete**

This is more a statistical term than a strictly datamining term; however, it is used frequently, hence its inclusion here. Discrete refers to values which are not sequential and have a finite set of values, e.g. true/false.

**Continuous**

Continuous data can have any value in an interval of real numbers. That is, the value does not have to be an integer. Continuous is the opposite of discrete.

**Outlier**

Data that falls well outside the statistical norms of other data. An outlier is data that should be closely examined.
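
One simple way to flag candidates for closer examination is to measure how many standard deviations each value lies from the mean. The data and the threshold of 2 below are assumptions made for the example.

```python
from statistics import mean, pstdev

# Hypothetical order values with one suspicious entry.
order_values = [52, 48, 50, 51, 49, 47, 53, 250]

centre = mean(order_values)
spread = pstdev(order_values)

# Flag values more than two standard deviations from the mean.
# The threshold of 2 is a common rule of thumb, not a fixed rule.
outliers = [v for v in order_values if abs(v - centre) > 2 * spread]
print(outliers)
```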

**Antecedent**

When an association between two variables is defined, the first item (or left-hand side) is called the antecedent. For example, in the relationship "When a prospector buys a pick, he buys a shovel 14% of the time," "buys a pick" is the antecedent.
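
The percentage in a rule like this is commonly called its confidence. Here is a small sketch of how it could be calculated; the transactions are made up, so the result is not the 14% quoted above, and "consequent" is the usual name for the right-hand side rather than a term from this post.

```python
# Hypothetical purchases; each set is one transaction.
transactions = [
    {"pick", "shovel"},
    {"pick"},
    {"pick", "pan"},
    {"pick", "shovel", "pan"},
    {"shovel"},
]

antecedent = "pick"     # left-hand side of the rule
consequent = "shovel"   # right-hand side of the rule

with_antecedent = [t for t in transactions if antecedent in t]
with_both = [t for t in with_antecedent if consequent in t]

# Confidence: of the transactions containing the antecedent,
# the share that also contain the right-hand side.
confidence = len(with_both) / len(with_antecedent)
print(f"{confidence:.0%}")
```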

**Leaf**

A node at the lowest level of a tree - it has no more splits.

**Mean**

The arithmetic average of a dataset.

**Median**

The middle value of a dataset once its values are sorted.

**Standard Deviation**

Measures the spread of the values in the data set around the mean.
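
Python's standard `statistics` module covers these three measures. The dataset below is made up, and `pstdev` treats it as a whole population rather than a sample.

```python
from statistics import mean, median, pstdev

# Hypothetical dataset.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(mean(data))    # arithmetic average
print(median(data))  # middle value of the sorted data
print(pstdev(data))  # population standard deviation, measured around the mean
```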

**Skew**

Measures the asymmetry of the data set, i.e. whether it is skewed in a particular direction on either side of the mean.

**Kurtosis**

Measures whether the distribution of the data set is sharply peaked or flat in relation to a normal distribution.
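
Skew and kurtosis can be sketched as the third and fourth standardised moments. The helper names and sample data below are invented, and the functions use the population formulas; statistics packages often apply sample corrections, so their results may differ slightly.

```python
from statistics import mean, pstdev


def skewness(data):
    """Third standardised moment: 0 for symmetric data, positive when
    the long tail is on the right."""
    m, s = mean(data), pstdev(data)
    return sum(((x - m) / s) ** 3 for x in data) / len(data)


def excess_kurtosis(data):
    """Fourth standardised moment minus 3, so that a normal
    distribution scores 0."""
    m, s = mean(data), pstdev(data)
    return sum(((x - m) / s) ** 4 for x in data) / len(data) - 3


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 10]
print(skewness(symmetric))
print(skewness(right_tailed))
print(excess_kurtosis(symmetric))
```

A negative result from `excess_kurtosis` indicates a flatter shape than a normal distribution, and a positive result a more peaked one.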
