Data Mining
Chapter 5: A Systematic Overview of
Data Mining Algorithms
Fall 2011
Ming Li
Department of Computer Science and Technology
Nanjing University
Data mining algorithm components
• Mining task: the purpose of the data mining exercise
• Model or pattern structure: the functional form of the model or pattern that is used for fitting the data
• Score function: used for judging the quality of the fitted models or patterns
• Optimization and search method: used for searching over parameters or structures to find the one with the best quality w.r.t. the score function
• Data management technique: used for handling data access efficiently during the mining process
Why adopt such a systematic view?
Such a systematic (reductionist) point of view emphasizes the fundamental properties of an algorithm, avoiding the tendency to think in terms of lists of algorithms:
Algorithm =
{model structure, score function, search method, data management technique}
• Analysis
Clarifies the role of each component and makes it easier to compare competing algorithms
• Synthesis
Builds novel data mining algorithms with different properties by combining different components in various ways (see the sketch below)
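A minimal sketch of this decomposition; the field names are mine for illustration, the slides only name the four components:

```python
# An algorithm viewed as an explicit 4-tuple of components.
from dataclasses import dataclass
from typing import Callable

@dataclass
class MiningAlgorithm:
    model_structure: str      # e.g., "decision tree"
    score_function: Callable  # judges the quality of a fitted model/pattern
    search_method: Callable   # searches over parameters or structures
    data_management: str      # e.g., "in-memory", "linear scans of the database"

# Synthesis: a new algorithm arises by recombining components, e.g., the
# CART structure scored by log-likelihood instead of cross-validated loss.
```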
Case studies

| Component | CART | BP Neural Network | Apriori | Vector-Space |
|---|---|---|---|---|
| Task | Classification & Regression | Classification & Regression | Rule Pattern Discovery | Content-based Retrieval |
| Structure | Decision Tree | Neural Network | Association Rules | Vector of term occurrences |
| Score Function | Cross-validated Loss Function | Squared Error | Support / Accuracy | Angle between two vectors |
| Search Method | Greedy search over Structures | Gradient Descent on Parameters | Breadth-First with Pruning | Various techniques |
| Data Management Techniques | Unspecified | Unspecified | Linear Scans | Various fast indexing strategies |
Case 1: CART – overview
CART constructs a decision tree from the data.
General procedure (a code sketch of the growing phase follows the steps):
1. Construct a root node that contains the whole data set.
2. Select the attribute that benefits the task most according to some criterion computed on the data within the current node (e.g., the Gini index for CART, GainRatio for C4.5).
3. Split the examples of the current node into subsets based on the values of the selected attribute.
4. Create a new child node of the current node for each subset and pass the examples in the subset to that node.
5. Recursively repeat steps 2-4 on each child node until some stopping criterion is met (e.g., a minimum number of examples per node, or no further gain in predictive ability).
6. Pruning of the tree is usually required to avoid overfitting.
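A minimal sketch of the growing phase (steps 1-5), assuming integer class labels, numeric attributes, and axis-parallel splits scored by the Gini index; pruning (step 6) is omitted:

```python
import numpy as np

def gini(y):
    """Gini impurity of an integer label vector."""
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Greedily pick the (attribute, threshold) pair with the lowest weighted Gini."""
    best = None  # (weighted_gini, attribute, threshold)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:  # the largest value cannot split
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t)
    return best

def grow(X, y, min_pts=5):
    """Recursively grow the tree until nodes are pure or too small (steps 2-5)."""
    split = None if (len(y) < min_pts or gini(y) == 0.0) else best_split(X, y)
    if split is None:  # stopping criterion met: predict the majority class
        return {"leaf": True, "label": int(np.bincount(y).argmax())}
    _, j, t = split
    mask = X[:, j] <= t
    return {"leaf": False, "attr": j, "thresh": t,
            "left": grow(X[mask], y[mask], min_pts),
            "right": grow(X[~mask], y[~mask], min_pts)}
```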
Case 1: CART – model structure
The model structure used in CART is a decision tree.
A decision tree is a flow-chart-like tree structure:
– each internal node is a test on an attribute
– each branch represents an outcome of the test
– each leaf holds a class label or class distribution
The path from the root to a leaf leads an unseen instance to its prediction.
Case 1: CART – Score function
The misclassification error (classification) or the mean squared error, MSE (regression), is used as the score function for a given instantiation of the model structure (i.e., the classification or regression tree):

Classification: $\frac{1}{n}\sum_{i=1}^{n} I(y_i \neq \hat{y}_i)$

Regression (MSE): $\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

Cross-validation is used to estimate this misclassification error / MSE (see the sketch below).
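A minimal sketch of the cross-validated estimate, using scikit-learn purely for illustration (the slides do not prescribe a library, and the dataset here is a placeholder):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                # placeholder dataset
tree = DecisionTreeClassifier(criterion="gini")  # CART-style splits
# 5-fold cross-validated accuracy; misclassification error = 1 - accuracy
scores = cross_val_score(tree, X, y, cv=5, scoring="accuracy")
print("estimated misclassification error:", 1.0 - scores.mean())
```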
Case 1: CART – Search Method
Two-phase greedy search over tree structures of different complexity:
• Growing
Recursively expand the tree from a root node into a large tree by greedily selecting the best attribute to split on for the data arriving at the current node.
• Pruning
Recursively prune back specific branches of this large tree (in order to avoid overfitting).
Case 1: CART – data management technique
• Unspecified.
• Assumption for CART: all data fit in main memory.
• Any data management technique that facilitates the learning process can be used.
Case 2: Backpropagation – overview
• Backpropagation (BP) is a common method of training artificial neural networks to perform a given task.
• General procedure (a numpy sketch follows the steps):
1. Present a training sample to the neural network.
2. Compare the network's output to the desired output for that sample; calculate the error in each output neuron.
3. For each neuron, calculate what the output should have been, and a scaling factor indicating how much lower or higher the output must be adjusted to match the desired output. This is the local error.
4. Adjust the weights of each neuron to lower its local error.
5. Assign "blame" for the local error to neurons at the previous level, giving greater responsibility to neurons connected by stronger weights.
6. Repeat from step 3 on the neurons at the previous level, using each one's "blame" as its error.
Case 2: BP – model structure
The model structure used in BP is a multi-layer perceptron: a non-linear transformation of weighted sums of the inputs.
• The multi-layer perceptron is a type of feedforward neural network.
– Each unit is connected only to the units in the next layer
– Adjacent layers are fully connected
• The input layer takes the attribute values as input signals, and the units of the output layer give the prediction.
• The hidden layers process the input signals and pass intermediate results on to the output layer.
[Figure: a multi-layer perceptron with input, hidden, and output layers connected by weights]
A multi-layer perceptron with one hidden layer can approximate any continuous function to arbitrary accuracy, given enough hidden units.
Case 2: BP – Score function
• The sum of squared errors (squared loss) is used as the score function.
– The SSE score function is widely used in BP.
– In fact, any differentiable function can be used in its place (e.g., the log-likelihood).
– In classification, it is an upper bound on the misclassification (0-1) error.
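One way to see the upper-bound claim (a sketch under my own assumption that targets are coded as $y \in \{-1, +1\}$ and the prediction is the sign of the output $\hat{y}$): a misclassification means $y\hat{y} \le 0$, and then

$$ I(y\hat{y} \le 0) \le (y - \hat{y})^2, \quad \text{since } (y - \hat{y})^2 = 1 - 2y\hat{y} + \hat{y}^2 \ge 1. $$

Summing over the training examples bounds the number of misclassifications by the SSE.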
Case 2: BP – Search method
Greedy search over the model parameters, with the model complexity fixed.
• Bridging the model instantiation and the score function
– Model parameters: all the weights that connect units in different layers
– The output is a non-linear transformation of weighted sums of the input, computed from the input layer to the output layer, layer by layer.
– Plugging the output into the score function yields a new non-linear function whose variables are the model parameters.
• Gradient descent over this augmented score function is used to find the parameters that give the minimum value of the score function (see the update rule below).
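In symbols, with $S(w)$ the augmented score function and $\eta$ a step size (the symbols are mine; the slides leave the update unstated):

$$ w^{(t+1)} = w^{(t)} - \eta \, \nabla_w S\big(w^{(t)}\big) $$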
Case 2: BP – Data management technique
• Unspecified
• Two different ways to train a multi-layer perceptron with BP:
– Online (update after each example)
– Batch (update after a pass over all examples)
• Any data management technique can be chosen to facilitate either training strategy.
Case 3: Apriori – overview
Apriori is a popular algorithm that discovers association rules in a data set.
• General procedure:
1. Identify all frequent itemsets.
2. Generate strong association rules by splitting each frequent itemset into two subsets and placing the subsets on the two sides of "⇒", such that the rule is "accurate" and "frequently observed".
Case 3: Apriori – Model structure
The model structure used in Apriori is a set of association rules, each with a support and a confidence:
A = 1 AND B = 1 ⇒ C = 1 [ps, pc]
• Support (ps): how frequently the items on both sides of the rule appear together in the database
• Confidence (pc): how likely the discovered rule is to be "correct", i.e., how often the consequent holds among records satisfying the precondition
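In symbols, for the example rule above (a standard formulation; the hat denotes estimation from the database):

$$ p_s = \hat{P}(A=1,\, B=1,\, C=1), \qquad p_c = \hat{P}(C=1 \mid A=1,\, B=1) = \frac{p_s}{\hat{P}(A=1,\, B=1)} $$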
Case 3: Apriori – Score function
• A 0-1 score function is used (written out below):
a pattern gets a score of 1 if it satisfies both threshold conditions, and a score of 0 otherwise.
– The score function is computed for each rule.
– The support and confidence are estimated over the dataset.
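Concretely, with user-chosen thresholds $s_{\min}$ and $c_{\min}$ (the threshold names are mine for illustration):

$$ S(\text{rule}) = \begin{cases} 1 & \text{if } p_s \ge s_{\min} \text{ and } p_c \ge c_{\min}, \\ 0 & \text{otherwise.} \end{cases} $$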
Case 3: Apriori – Search method
Breadth-first search with pruning over the rule structures (in fact, over the frequent itemsets); a sketch follows.
• Breadth-first search:
– Find all frequent itemsets containing 2 items.
– Generate all frequent itemsets containing 3 items from the frequent 2-itemsets, and so on, level by level.
– Repeat until no new frequent itemsets are found.
Without pruning, this search has exponential complexity!
• Pruning
– Key idea: any non-empty subset of a frequent k-itemset is itself a frequent itemset.
– Do not generate a candidate k-itemset if any of its subsets is not frequent.
– Since data are usually sparse, the pruning can be very effective.
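A minimal sketch of this level-wise search with pruning, assuming transactions are given as sets of items and min_support is a count threshold (both representation choices are mine):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return all frequent itemsets, level by level (breadth-first)."""
    # Level 1: count single items with one scan of the database
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s for s, c in counts.items() if c >= min_support}
    all_frequent, k = set(frequent), 2
    while frequent:
        # Generate candidate k-itemsets by joining frequent (k-1)-itemsets
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        # One linear scan of the database counts each candidate's support
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c for c, n in counts.items() if n >= min_support}
        all_frequent |= frequent
        k += 1
    return all_frequent

# Example: apriori([{"a", "b", "c"}, {"a", "b"}, {"a", "c"}], min_support=2)
```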
Case 3: Apriori – data management techniques
• Multiple linear scans of the database
– To find the frequent k-itemsets (k = 2, 3, ...), a linear scan of the records in the database is required to compute the support of each candidate itemset.
– To generate rules from the identified frequent itemsets, a linear scan of the records in the database is required to compute the confidence of each resulting association rule.
Case 4: Vector-space Algorithm
• Task: retrieve the k documents in the database most similar to the query.
• Model structure: vectors of term occurrences.
• Score function: cosine similarity, i.e., the angle between the two vectors (in fact, any similarity that reflects the user's notion of semantic relatedness can be used); a sketch follows.
• Search method: various methods.
• Data management techniques: various fast indexing strategies.
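A minimal sketch of the model and score function, assuming documents and the query are already encoded as term-occurrence vectors over a shared vocabulary (the encoding step is omitted):

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two term-occurrence vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k documents most similar to the query."""
    sims = [cosine(query_vec, d) for d in doc_vecs]
    return sorted(range(len(sims)), key=lambda i: -sims[i])[:k]
```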
In summary
• In the reductionist (component-based) point of view:
– Data mining algorithms are combinations of different components.
– All components are essential.
– The relative importance of the components varies from problem to problem and from task to task.
• Such a point of view offers a platform for:
– systematically describing data mining algorithms (a common representation format)
– systematically comparing data mining algorithms (w.r.t. a certain component)
– systematically designing novel data mining algorithms for given problems (by assembling different components)
Let’s move to
Chapter 6