A Data Mining Glossary
A
Accuracy. A measure of a predictive model that reflects the proportionate number of
times that the model is correct when applied to data.
Application Programming Interface (API). The formally defined programming
language interface between a program (system control program, licensed program)
and its user.
Artificial Intelligence. The scientific field concerned with the creation of intelligent
behavior in a machine.
Artificial Neural Network (ANN). See Neural Network.
Association Rule. A rule in the form of “if this then that” that associates events in a
database. For example, the association between items purchased together at a supermarket.
B
Back Propagation. One of the most common learning algorithms for training neural
networks.
Binning. The process of breaking up continuous values into bins. Usually done as a
preprocessing step for some data mining algorithms. For example, breaking up age
into ten-year bins.
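The ten-year age example can be sketched as follows. This is only an illustration: the function name `bin_age`, the label format, and the sample ages are assumptions, not part of the glossary.

```python
def bin_age(age, width=10):
    """Return a label such as '30-39' for the bin that contains `age`."""
    low = (age // width) * width          # lower edge of the bin
    return f"{low}-{low + width - 1}"

ages = [23, 35, 37, 41, 68]               # hypothetical whole-year ages
print([bin_age(a) for a in ages])
# ['20-29', '30-39', '30-39', '40-49', '60-69']
```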
Brute Force Algorithm. A computer technique that finds an optimal solution by
exhaustively repeating very simple steps. Brute force algorithms stand in
contrast to complex techniques that are less wasteful in moving toward an optimal
solution but are harder to construct and are more computationally expensive to
execute.
C
Cardinality. The number of different values a categorical predictor or OLAP
dimension can have. High cardinality predictors and dimensions have large numbers
of different values (e.g. zip codes), low cardinality fields have few different values
(e.g. eye color).
CART. Classification and Regression Trees. A type of decision tree algorithm that
automates the pruning process through cross validation and other techniques.
CHAID. Chi-Square Automatic Interaction Detector. A decision tree that uses
contingency tables and the chi-square test to create the tree.
Classification. The process of learning to distinguish and discriminate between
different input patterns using a supervised training algorithm. Classification is the
process of determining that a record belongs to a group.
Clustering. The process of grouping similar records (input patterns) together based on
their locality and connectivity within the n-dimensional space, using an unsupervised
training algorithm.
Collinearity. The property of two predictors showing significant correlation without a
causal relationship between them.
Conditional Probability. The probability of an event happening given that some other
event has already occurred. For example, the chance of a person committing fraud is
much greater given that the person has previously committed fraud.
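A small sketch of the fraud example, estimated from counts. The records below are invented purely for illustration, and the function names are assumptions.

```python
# Hypothetical records: (had_prior_fraud, committed_fraud)
records = [
    (True, True), (True, False), (True, True), (False, False),
    (False, False), (False, True), (False, False), (False, False),
]

def prior_probability(records):
    """P(fraud): share of all records where fraud occurred."""
    return sum(1 for _, fraud in records if fraud) / len(records)

def conditional_probability(records):
    """P(fraud | prior fraud): share of prior-fraud records where fraud occurred."""
    given = [fraud for prior, fraud in records if prior]
    return sum(1 for fraud in given if fraud) / len(given)

print(prior_probability(records))        # 3 of 8 records
print(conditional_probability(records))  # 2 of the 3 prior-fraud records
```

In this made-up data the conditional probability (2/3) is well above the prior probability (3/8), which is the pattern the definition describes.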
Coverage. A number that represents either the number of times that a rule can be
applied or the percentage of times that it can be applied.
CRM. See Customer Relationship Management.
Cross Validation (and Test Set Validation). The process of holding aside some
data that is not used to build a predictive model and later using that data to
estimate the accuracy of the model on unseen data, simulating the real-world
deployment of the model.
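One common way to hold data aside is k-fold splitting, where each record is held out exactly once. The sketch below only produces the index splits. The name `k_fold_indices` is an assumption, and in practice the records would usually be shuffled first.

```python
def k_fold_indices(n_records, k):
    """Yield (train, test) index lists; each record is held out exactly once."""
    indices = list(range(n_records))
    for fold in range(k):
        test = indices[fold::k]                           # every k-th record
        train = [i for i in indices if i % k != fold]     # everything else
        yield train, test

for train, test in k_fold_indices(10, 5):
    # A model would be built on `train` and its accuracy estimated on `test`.
    print(test)
```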
Customer Relationship Management. The process by which companies manage
their interactions with customers.
D
Data mining. The process of efficient discovery of nonobvious valuable patterns
from a large collection of data.
Database Management System (DBMS). A software system that controls and
manages the data to eliminate data redundancy and to ensure data integrity,
consistency and availability, among other features.
Decision Trees. A class of data mining and statistical methods that form tree-like
predictive models.
E
Embedded Data Mining. An implementation of data mining where the data mining
algorithms are embedded into existing data stores and information delivery processes
rather than requiring data extraction and new data stores.
Entropy. A measure often used in data mining algorithms that measures the disorder
of a set of data.
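The glossary does not say which entropy formula is meant. A minimal sketch, assuming Shannon entropy in bits over class labels (the function name is illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

print(entropy(["yes", "yes", "no", "no"]))    # evenly mixed: maximum disorder
print(entropy(["yes", "yes", "yes", "yes"]))  # all the same: no disorder
```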
Error Rate. A number that reflects the rate of errors made by a predictive model. It is
one minus the accuracy.
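The relationship between accuracy and error rate can be shown in a few lines. The sample predictions are made up for illustration.

```python
def accuracy(predicted, actual):
    """Fraction of records where the model's prediction is correct."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

def error_rate(predicted, actual):
    """One minus the accuracy."""
    return 1 - accuracy(predicted, actual)

predicted = [1, 0, 1, 1]   # hypothetical model output
actual    = [1, 0, 0, 1]   # hypothetical true values
print(accuracy(predicted, actual), error_rate(predicted, actual))  # 0.75 0.25
```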
Expert System. A data processing system comprising a knowledge base (rules), an
inference (rules) engine, and a working memory.
Exploratory Data Analysis. The processes and techniques for general exploration of
data for patterns in preparation for more directed analysis of the data.
F
Factor Analysis. A statistical technique which seeks to reduce the number of total
predictors from a large number to only a few “factors” that have the majority of the
impact on the predicted outcome.
Field. The structural component of a database that is common to all records in the
database. Fields have values. Also called features, attributes, variables, table columns,
dimensions.
Front Office. The part of a company's computer system that is responsible for
keeping track of relationships with customers.
Fuzzy Logic. A system of logic based on the fuzzy set theory.
Fuzzy Set. A set of items whose degree of membership in the set may range from 0 to
1.
Fuzzy system. A set of rules using fuzzy linguistic variables described by fuzzy sets
and processed using fuzzy logic operations.
G
Genetic algorithm. A method of solving optimization problems using parallel search,
based on Darwin's biological model of natural selection and survival of the fittest.
Genetic operator. An operation on the population member strings in a genetic
algorithm that is used to produce new strings.
Gini Metric. A measure of the disorder reduction caused by the splitting of data in a
decision tree algorithm. Gini and the entropy metric are the most popular ways of
selecting predictors in the CART decision tree algorithm.
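A sketch of the disorder reduction from one split, assuming the usual Gini impurity (one minus the sum of squared class proportions). The function names and the sample labels are illustrative.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure set, larger for more mixed sets."""
    total = len(labels)
    return 1 - sum((c / total) ** 2 for c in Counter(labels).values())

def gini_reduction(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

parent = ["y", "y", "n", "n"]
print(gini_reduction(parent, ["y", "y"], ["n", "n"]))  # a perfect split
print(gini_reduction(parent, ["y", "n"], ["y", "n"]))  # a useless split
```

A tree-building algorithm would compare this reduction across candidate predictors and split on the one with the largest value.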
H
Hebbian Learning. One of the simplest and oldest forms of training a neural
network. It is loosely based on observations of the human brain. The neural net link
weights are strengthened between any nodes that are active at the same time.
Hill Climbing Search. A simple optimization technique that modifies a proposed
solution by a small amount and then accepts it if it is better than the previous solution.
The technique can be slow and suffers from being caught in local optima.
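A minimal one-dimensional sketch of the idea, including the local-optimum problem. The scoring functions, step size, and starting points are assumptions chosen for illustration.

```python
def hill_climb(score, start, step=0.5, max_iters=1000):
    """Repeatedly move to a slightly changed solution while it scores better."""
    current = start
    for _ in range(max_iters):
        # Propose a small change in each direction and keep the better one.
        candidate = max((current - step, current + step), key=score)
        if score(candidate) <= score(current):
            break  # no neighbor improves: a (possibly local) optimum
        current = candidate
    return current

def single_peak(x):
    return -(x - 3) ** 2

def two_peaks(x):
    # Small peak at x = -2 (score 1), larger peak at x = 3 (score 5).
    return 1 - (x + 2) ** 2 if x < 0 else 5 - (x - 3) ** 2

print(hill_climb(single_peak, 0.0))   # reaches the only peak at 3.0
print(hill_climb(two_peaks, -3.0))    # stops at the smaller peak at -2.0
```

The second call shows the weakness named in the definition: starting near the smaller peak, the search never finds the better solution at 3.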
Hypothesis Testing. The statistical process of proposing a hypothesis to explain the
existing data and then testing to see the likelihood of that hypothesis being the
explanation.
I
ID3. One of the earliest decision tree algorithms.
Independence (statistical). The property of two events displaying no causality or
relationship of any kind. This can be quantitatively defined as occurring when the
product of the probabilities of each event is equal to the probability of both events
occurring.
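The product rule can be checked exactly on a small sample space. The sketch below uses two fair dice. The events and helper names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # all rolls of two fair dice

def probability(event):
    """Exact probability of an event (a predicate over outcomes)."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def independent(a, b):
    """True when P(A and B) equals P(A) * P(B)."""
    return probability(lambda o: a(o) and b(o)) == probability(a) * probability(b)

def first_is_six(o):
    return o[0] == 6

def second_is_even(o):
    return o[1] % 2 == 0

def total_is_twelve(o):
    return sum(o) == 12

print(independent(first_is_six, second_is_even))   # True
print(independent(first_is_six, total_is_twelve))  # False
```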
Intelligent Agent. A software application which assists a system or a user by
automating a task. Intelligent agents must recognize events and use domain
knowledge to take appropriate actions based on those events.
K
Kohonen Networks. A type of neural network in which the nodes learn as local
neighborhoods, so the locality of the nodes is important in the training process.
They are often used for clustering.
Knowledge Discovery. A term often used interchangeably with data mining.
L
Lift. A number representing the increase in responses from a targeted marketing
application using a predictive model over the response rate achieved when no model
is used.
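Lift is commonly reported as a ratio of response rates. The sketch below assumes that reading, and the campaign figures are invented for illustration.

```python
def lift(targeted_responses, targeted_size, baseline_responses, baseline_size):
    """Response rate with the model divided by the response rate without it."""
    targeted_rate = targeted_responses / targeted_size
    baseline_rate = baseline_responses / baseline_size
    return targeted_rate / baseline_rate

# Hypothetical campaign: 60 responses from 1,000 model-selected customers,
# versus 200 responses from 10,000 customers contacted with no model.
print(lift(60, 1000, 200, 10000))   # roughly 3: three times the baseline rate
```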
M
Machine Learning. A field of science and technology concerned with building
machines that learn. In general it differs from Artificial Intelligence in that learning is
considered to be just one of a number of ways of creating an artificial intelligence.
Memory-Based Reasoning (MBR). A technique for classifying records in a database
by comparing them with similar records that are already classified. A form of nearest
neighbor classification.
Minimum Description Length (MDL) Principle. The idea that the least complex
predictive model (with acceptable accuracy) will be the one that best reflects the true
underlying model and performs most accurately on new data.
Model. A description that adequately explains and predicts relevant data but is
generally much smaller than the data itself.
N
Nearest Neighbor. A data mining technique that performs prediction by finding the
prediction value of records (near neighbors) similar to the record to be predicted.
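A minimal sketch of nearest neighbor prediction, assuming numeric predictors, Euclidean distance, and a majority vote among the k closest records. These choices, the function name, and the sample data are assumptions, not something the glossary specifies.

```python
import math
from collections import Counter

def nearest_neighbor_predict(training, query, k=3):
    """training: list of (feature_tuple, label); returns the majority label
    among the k training records closest to `query`."""
    by_distance = sorted(training, key=lambda rec: math.dist(rec[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

training = [
    ((1, 1), "low"), ((1, 2), "low"), ((2, 1), "low"),
    ((8, 8), "high"), ((9, 8), "high"), ((8, 9), "high"),
]
print(nearest_neighbor_predict(training, (2, 2)))   # low
print(nearest_neighbor_predict(training, (8, 7)))   # high
```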
Neural Network. A computing model based on the architecture of the brain. A neural
network consists of multiple simple processing units connected by adaptive weights.
Nominal Categorical Predictor. A predictor that is categorical (finite cardinality)
but where the values of the predictor have no particular order. For example, red,
green, blue as values for the predictor “eye color”.
O
Occam’s Razor. A rule of thumb used by many scientists that advocates favoring the
simplest theory that adequately explains (or predicts) an event. This is more formally
captured for machine learning and data mining as the minimum description length
principle.
On-Line Analytical Processing (OLAP). Computer-based techniques used to
analyze trends and perform business analysis using multidimensional views of
business data.
Ordinal Categorical Predictor. A categorical predictor (i.e. has a finite number of
values) where the values have order but do not convey meaningful intervals or
distances between them. For example, the values high, middle and low for the income
predictor.
Outlier Analysis. A type of data analysis that seeks to determine and report on
records in the database that are significantly different from expectations. The
technique is used for data cleansing, spotting emerging trends and recognizing
unusually good or bad performers.
Overfitting. The effect in data analysis, data mining and biological learning of
training too closely on limited available data and building models that do not
generalize well to new unseen data. At the limit, overfitting is synonymous with rote
memorization where no generalized model of future situations is built.
P
Predictor. The column or field in a database that could be used to build a predictive
model to predict the values in another field or column. Also called variable,
independent variable, dimension, or feature.
Prediction. 1. The column or field in a database that currently has an unknown value that
will be assigned when a predictive model is run over other predictor values in the record.
Also called dependent variable, target, classification. 2. The process of applying a
predictive model to a record. Generally prediction implies the generation of unknown
values within time series, though in this book prediction is used to mean any process
for assigning values to previously unassigned fields including classification and
regression.
Predictive Model. A model created or used to perform prediction. In contrast to
models created solely for pattern detection, exploration or general organization of the
data.
Principal Components Analysis. A data analysis technique that re-expresses a set of
correlated predictors as a smaller number of uncorrelated combinations (the principal
components) that capture as much of the variation in the data as possible.
Prior Probability. The probability of an event occurring without dependence on
(that is, not conditional on) some other event. In contrast to conditional probability.
R
Radial Basis Function Networks. Neural networks that combine some of the
advantages of neural networks with those of nearest neighbor techniques. In radial
basis functions the hidden layer is made up of nodes that represent prototypes or
clusters of records.
Record. The fundamental data structure used for performing data analysis. Also
called a table row or example. A typical record would be the structure that contains all
relevant information pertinent to one particular customer or account.
Regression. A data analysis technique classically used in statistics for building
predictive models for continuous prediction fields. The technique automatically
determines a mathematical equation that minimizes some measure of the error
between the prediction from the regression model and the actual data.
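As one concrete case, a straight line can be fitted by minimizing the sum of squared errors, which has a closed-form solution. The sketch assumes a single predictor and squared error as the error measure. The data points are invented.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1, 2, 3, 4]          # hypothetical predictor values
ys = [3, 5, 7, 9]          # hypothetical continuous prediction field
slope, intercept = fit_line(xs, ys)
print(slope, intercept)    # 2.0 1.0, i.e. y = 2x + 1
```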
Reinforcement learning. A training model where an intelligence engine (e.g. neural
network) is presented with a sequence of input data followed by a reinforcement
signal.
Relational Database (RDB). A database built to conform to the relational data
model; includes the catalog and all the data described therein.
Response. A binary prediction field that indicates response or non-response to a
variety of marketing interventions. The term is generally used when referring to
models that predict response or to the response field itself.
S
Sampling. The process by which only a fraction of all available data is used to build a
model or perform exploratory analysis. Sampling can provide relatively good models
at much less computational expense than using the entire database.
Segmentation. The process or result of the process that creates mutually exclusive
collections of records that share similar attributes either in unsupervised learning
(such as clustering) or in supervised learning for a particular prediction field.
Sensitivity Analysis. The process which determines the sensitivity of a predictive
model to small fluctuations in predictor value. Through this technique end users can
gauge the effects of noise and environmental change on the accuracy of the model.
Simulated Annealing. An optimization algorithm loosely based on the physical
process of annealing metals through controlled heating and cooling.
Structured Query Language (SQL). A standard language for the access of data in a
relational database.
Supervised learning. A class of data mining and machine learning applications and
techniques where the system builds a model based on the prediction of a well-defined
prediction field. This is in contrast to unsupervised learning where there is no
particular goal aside from pattern detection.
Support. The relative frequency or number of times a rule produced by a rule
induction system occurs within the database. The higher the support the better the
chance of the rule capturing a statistically significant pattern.
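Support, together with the Coverage entry earlier in this glossary, can be illustrated on a few supermarket baskets. The sketch assumes both are reported as relative frequencies, with coverage counting baskets where the "if" part holds and support counting baskets where the whole rule holds. The baskets and function names are made up.

```python
baskets = [                      # hypothetical transactions
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]

def coverage(baskets, antecedent):
    """Share of baskets in which the rule's 'if' part applies."""
    return sum(1 for b in baskets if antecedent <= b) / len(baskets)

def support(baskets, antecedent, consequent):
    """Share of baskets in which both the 'if' and 'then' parts occur."""
    both = antecedent | consequent
    return sum(1 for b in baskets if both <= b) / len(baskets)

# Rule: if bread then butter
print(coverage(baskets, {"bread"}))             # 0.75
print(support(baskets, {"bread"}, {"butter"}))  # 0.5
```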
T
Targeted Marketing. The marketing of products to select groups of consumers that
are more likely than average to be interested in the offer.
Time-series forecasting. The process of using a data mining tool (e.g., neural
networks) to learn to predict temporal sequences of patterns, so that, given a set of
patterns, it can predict a future value.
U
Unsupervised learning. A data analysis technique whereby a model is built without a
well-defined goal or prediction field. The systems are used for exploration and general
data organization. Clustering is an example of an unsupervised learning system.
V
Visualization. Graphical display of data and models which helps the user in
understanding the structure and meaning of the information contained in them.