Introduction To Data Mining
What Is Data Mining?
• A tool
• Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
• Core of KDD
• Integration of Multiple technologies
Part of KDD (Knowledge Discovery in Databases)
[Figure: the KDD process. Data → Selection → Target Data → Preprocessing → Preprocessed Data → Data Mining → Patterns → Interpretation/Evaluation → Knowledge]
Adapted from: U. Fayyad, et al. (1995), “From Knowledge Discovery to Data Mining: An Overview,” Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press
Integration of Multiple Technologies
[Figure: Data Mining at the center, drawing on Artificial Intelligence, Machine Learning, Database Management, Statistics, Visualization, Algorithms, and other knowledge]
Why Data Mining?
• We are drowning in data (the data explosion problem), but starving for knowledge!
• Solution: Data warehousing and data mining
– Data warehousing and on-line analytical processing
– Mining interesting knowledge (rules, regularities, patterns,
constraints) from data in large databases
• A lot of potential applications
– Market analysis and management
• Target marketing, customer relationship management (CRM),
market basket analysis, cross selling, market segmentation
– Risk analysis and management
• Forecasting, customer retention, improved underwriting, quality
control, competitive analysis
– Health care …
Data mining process
[Figure: State the problem + hypothesis; Operational Database → Data Warehouse → SQL Queries → Data Mining → Interpretation & Evaluation → Result Application; Knowledge-base]
Knowledge from Data Mining
• Association rules
• Sequential Association
• Classification rules
• Clustering
• Deviation Detection
•…
Association Rules
• Identify associations in the data (correlation [A,B] and causality [A->B])
• Indicate the significance of each association (only interesting if its confidence exceeds a certain measure)
• Not all associations are interesting (too trivial, negative association)
E.g. market-basket analysis
“Find groups of items commonly purchased together”
– People who purchase fish are likely to purchase wine
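A minimal sketch of how the confidence of a rule such as fish -> wine could be computed. The transactions, the support measure, and the 0.6 threshold are assumptions made for illustration; none of them come from the slides.

```python
# Hypothetical market-basket transactions (illustrative data, not from the slides).
transactions = [
    {"fish", "wine", "bread"},
    {"fish", "wine"},
    {"fish", "milk"},
    {"bread", "milk"},
    {"wine", "cheese"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support(A and B) / support(A)."""
    joint = support(set(antecedent) | set(consequent), transactions)
    return joint / support(antecedent, transactions)


rule_support = support({"fish", "wine"}, transactions)          # 2 of 5 baskets
rule_confidence = confidence({"fish"}, {"wine"}, transactions)  # 2 of 3 fish baskets

MIN_CONFIDENCE = 0.6  # assumed threshold for calling a rule interesting
is_interesting = rule_confidence >= MIN_CONFIDENCE
print(f"fish -> wine: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

Confidence alone only measures co-occurrence. It does not establish that buying fish causes buying wine.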
Sequential Associations
• Find event sequences that are unusually likely
• Requires “training” event list, known “interesting” events
• Must be robust in the face of additional “noise” events
Uses:
• Failure analysis and prediction
Technologies:
• Dynamic programming (Dynamic time warping)
• “Custom” algorithms
E.g. “Find common sequences of warnings/faults within 10 minute periods”
– Warn 2 on Switch C preceded by Fault 21 on Switch B
– Fault 17 on any switch preceded by Warn 2 on any switch
Time   Switch  Event
21:10  B       Fault 21
21:11  A       Warn 2
21:13  C       Warn 2
21:20  A       Fault 17
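As a rough sketch, the event log above can be scanned for one candidate pattern ("event X preceded by event Y within 10 minutes"). This only checks a given pair of events. It is not a full sequence-mining algorithm, and the function name preceded_within is made up for this example.

```python
from datetime import datetime, timedelta

# Event log from the example table: (time, switch, event).
events = [
    ("21:10", "B", "Fault 21"),
    ("21:11", "A", "Warn 2"),
    ("21:13", "C", "Warn 2"),
    ("21:20", "A", "Fault 17"),
]


def preceded_within(events, first, then, window_minutes=10):
    """Return ((time, switch), (time, switch)) pairs where `first` precedes `then` within the window."""
    window = timedelta(minutes=window_minutes)
    parsed = [(datetime.strptime(t, "%H:%M"), switch, event) for t, switch, event in events]
    pairs = []
    for t1, switch1, event1 in parsed:
        for t2, switch2, event2 in parsed:
            if event1 == first and event2 == then and timedelta(0) < t2 - t1 <= window:
                pairs.append(((t1.strftime("%H:%M"), switch1), (t2.strftime("%H:%M"), switch2)))
    return pairs


fault21_then_warn2 = preceded_within(events, "Fault 21", "Warn 2")
warn2_then_fault17 = preceded_within(events, "Warn 2", "Fault 17")
print(fault21_then_warn2)
print(warn2_then_fault17)
```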
Classification rules
• Classify a set of data based on their values in certain attributes
• Requires “training data” that have predefined attributes
Uses:
• Profiling
Technologies:
• Generate decision trees (results are human understandable)
• Neural Nets
E.g. “Route documents to most likely interested parties”
– English or non-English?
– Domestic or Foreign?
[Figure: Training Data → tool → produces → classifier → Groups]
Clustering
• Group a set of data based on the conceptual clustering principle (i.e. maximizing the intraclass similarity and minimizing the interclass similarity)
• No “training data”: without predefined attributes
Uses:
• Demographic analysis
Technologies:
• Self-Organizing Maps
• Probability Densities
• Conceptual Clustering
E.g. “Group people with similar travel profiles”
Clusters:
– George, Patricia
– Jeff, Evelyn, Chris
– Rob
Deviation Detection
• Find unexpected values, outliers
Uses:
• Failure analysis
• Anomaly discovery for analysis
Technologies:
• Clustering/classification methods
• Statistical techniques
• Visualization
E.g. “Find unusual occurrences in IBM stock prices”
Sample date  Event            Occurrences
58/07/04     Market closed    317 times
59/01/06     2.5% dividend    2 times
59/04/04     50% stock split  7 times
73/10/09     not traded       1 time
Date      Close   Volume  Spread
58/07/02  369.50  314.08  .022561
58/07/03  369.25  313.87  .022561
58/07/04  Market Closed
58/07/07  370.00  314.50  .022561
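One simple statistical technique for finding outliers is a z-score test: flag any value that lies more than a chosen number of standard deviations from the mean. The sketch below uses hypothetical closing prices, not the IBM data, and the 2.0 threshold and the name find_outliers are assumptions.

```python
from statistics import mean, stdev


def find_outliers(values, threshold=2.0):
    """Return (index, value) pairs more than `threshold` sample standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [(i, v) for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical daily closing prices; the 410.00 entry is planted as an anomaly.
closes = [369.50, 369.25, 370.00, 369.75, 370.25, 369.00, 410.00, 369.50, 370.00, 369.25]
outliers = find_outliers(closes)
print(outliers)
```

With very short series a single extreme value also inflates the standard deviation, so a low threshold (or a robust measure such as the median) may be a better choice in practice.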
Popular Data Mining Techniques
• Supervised
– Decision trees
– Rule induction
– Regression models
– Neural Networks
…
• Unsupervised
—K-means clustering
—Self-organizing maps
…
Supervised vs. Unsupervised
• Supervised algorithms
» Learning by example:
– Use training data for which the value of the response variable is already known
– Create a model by running the algorithm on the training data
– Identify a class label for the incoming new data
» Driven by real business problems and historical data
• Unsupervised algorithms
» Do not use training data with a known response variable.
» Patterns may not be known in advance
Supervised Algorithms
Decision Trees
• A tree structure where non-terminal nodes represent
tests on one or more attributes and terminal nodes
reflect decision outcomes.
• Advantages of decision trees
—Understandable
—Relatively fast
—Easy to translate into SQL queries
• Disadvantages of decision trees
— Limited to one output attribute
— Decision tree algorithms are unstable: small changes in the training data can produce a different tree
• Types of decision trees
—CHAID: Chi-Square Automatic Interaction Detection
—CART: Classification and Regression Trees
…
Table 1.1 • Hypothetical Training Data for Disease Diagnosis
Patient ID#  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
1            Yes          Yes    Yes             Yes         Yes       Strep throat
2            No           No     No              Yes         Yes       Allergy
3            Yes          Yes    No              Yes         No        Cold
4            Yes          No     Yes             No          No        Strep throat
5            No           Yes    No              Yes         No        Cold
6            No           No     No              Yes         No        Allergy
7            No           No     Yes             No          No        Strep throat
8            Yes          No     No              Yes         Yes       Allergy
9            No           Yes    No              Yes         Yes       Cold
10           Yes          Yes    No              Yes         Yes       Cold
Swollen Glands
  No: Fever
    No: Diagnosis = Allergy
    Yes: Diagnosis = Cold
  Yes: Diagnosis = Strep Throat
Figure 1.1 A decision tree for the data in Table 1.1
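As a minimal sketch, the tree in Figure 1.1 can be written as a small Python function and checked against the rows of Table 1.1. The function name diagnose and the tuple layout are assumptions made for illustration. The tree is hand-coded here, not learned by a decision tree algorithm.

```python
# Rows of Table 1.1:
# (patient id, sore throat, fever, swollen glands, congestion, headache, diagnosis)
TRAINING = [
    (1, "Yes", "Yes", "Yes", "Yes", "Yes", "Strep throat"),
    (2, "No", "No", "No", "Yes", "Yes", "Allergy"),
    (3, "Yes", "Yes", "No", "Yes", "No", "Cold"),
    (4, "Yes", "No", "Yes", "No", "No", "Strep throat"),
    (5, "No", "Yes", "No", "Yes", "No", "Cold"),
    (6, "No", "No", "No", "Yes", "No", "Allergy"),
    (7, "No", "No", "Yes", "No", "No", "Strep throat"),
    (8, "Yes", "No", "No", "Yes", "Yes", "Allergy"),
    (9, "No", "Yes", "No", "Yes", "Yes", "Cold"),
    (10, "Yes", "Yes", "No", "Yes", "Yes", "Cold"),
]


def diagnose(swollen_glands, fever):
    """Walk the tree of Figure 1.1: test Swollen Glands first, then Fever."""
    if swollen_glands == "Yes":
        return "Strep throat"
    return "Cold" if fever == "Yes" else "Allergy"


# Fraction of training rows for which the tree reproduces the recorded diagnosis.
correct = sum(diagnose(row[3], row[2]) == row[6] for row in TRAINING)
training_accuracy = correct / len(TRAINING)
print(f"training accuracy: {training_accuracy:.0%}")
```

Only two of the five input attributes are tested. Sore throat, congestion, and headache are not needed to separate the three diagnoses in this training set.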
Table 1.2 • Data Instances with an Unknown Classification
Patient ID#  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
11           No           No     Yes             Yes         Yes       ?
12           Yes          Yes    No              No          Yes       ?
13           No           No     No              No          Yes       ?
Rule induction
• The extraction of useful independent if-then rules from data based on statistical significance
• If rules produce conflicting predictions, resolve the conflict according to confidence
• Advantage and disadvantage
– Understandable
– May not cover all possible situations
IF = Antecedent
THEN = Consequence
E.g.
IF Swollen Glands = Yes
THEN Diagnosis = Strep Throat
IF Swollen Glands = No & Fever = Yes
THEN Diagnosis = Cold
IF Swollen Glands = No & Fever = No
THEN Diagnosis = Allergy
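A minimal sketch of applying these three rules to the unclassified patients of Table 1.2. The data layout and the function name classify are assumptions. The confidence of 1.0 reflects that each rule is correct for every row of Table 1.1 that it matches.

```python
# Each rule: (antecedent conditions, consequence, confidence).
RULES = [
    ({"Swollen Glands": "Yes"}, "Strep throat", 1.0),
    ({"Swollen Glands": "No", "Fever": "Yes"}, "Cold", 1.0),
    ({"Swollen Glands": "No", "Fever": "No"}, "Allergy", 1.0),
]


def classify(instance, rules=RULES):
    """Fire every rule whose antecedent matches; resolve conflicts by highest confidence."""
    fired = [
        (confidence, consequence)
        for conditions, consequence, confidence in rules
        if all(instance.get(attr) == value for attr, value in conditions.items())
    ]
    if not fired:
        return None  # no rule covers this instance
    return max(fired, key=lambda pair: pair[0])[1]


# Patients 11-13 from Table 1.2 (only the attributes the rules test are listed).
UNKNOWN = {
    11: {"Swollen Glands": "Yes", "Fever": "No"},
    12: {"Swollen Glands": "No", "Fever": "Yes"},
    13: {"Swollen Glands": "No", "Fever": "No"},
}
predictions = {patient: classify(instance) for patient, instance in UNKNOWN.items()}
print(predictions)
```

Returning None when no rule fires makes the stated disadvantage visible: a rule set does not necessarily cover every situation.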
Neural Networks
• Non-linear predictive models that learn through training and resemble biological neural networks in structure
• Means of efficiently modeling large and complex problems in which there may be hundreds of predictor variables that have many interactions
• Disadvantage
– Difficult to understand
– Can require significant amounts of time to train and to prepare data
–…
[Figure: Input Layer → Hidden Layer → Output Layer]
Figure 2.2 A multilayer fully connected neural network
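To make the layered structure concrete, here is a sketch of one forward pass through a small fully connected network with a sigmoid activation. The 3-2-1 layout, the weights, and the input values are all hypothetical, and training (how the weights would be learned) is not shown.

```python
import math


def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(inputs, weights, biases):
    """One fully connected layer: each node outputs sigmoid(weighted sum of inputs + bias)."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + bias)
        for node_weights, bias in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Feed the inputs through each (weights, biases) layer in turn."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


# Hypothetical network: 3 inputs -> 2 hidden nodes -> 1 output node.
hidden_layer = ([[0.2, -0.4, 0.1], [0.7, 0.3, -0.5]], [0.0, 0.1])
output_layer = ([[0.6, -0.3]], [0.05])

prediction = forward([1.0, 0.0, 1.0], [hidden_layer, output_layer])
print(prediction)
```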
Regression Models
• Statistical techniques
• Using existing values to forecast what
other values will be.
Y = a + b1(X1) + b2(X2) + b3(X3) +
b4(X4) + b5(X5) …
• Many types of regression (linear regression, logistic regression …)
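A minimal sketch of the linear case with a single predictor: the intercept a and slope b are estimated by ordinary least squares, then used in Y = a + b1(X1) + ... to forecast a new value. The data points and the function names are assumptions chosen so that the fit is easy to check by hand.

```python
def fit_simple_linear(xs, ys):
    """Least-squares estimates (a, b) for the model y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict(a, coefficients, features):
    """Evaluate Y = a + b1*X1 + b2*X2 + ... for one instance."""
    return a + sum(b * x for b, x in zip(coefficients, features))


# Hypothetical existing values lying exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

a, b = fit_simple_linear(xs, ys)
forecast = predict(a, [b], [5.0])
print(f"a={a:.2f}, b={b:.2f}, forecast for x=5: {forecast:.2f}")
```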
K-Means Clustering
• Unsupervised algorithm
• Steps of algorithm
1. Choose a value for K, the total number of clusters.
2. Randomly choose K points as cluster centers.
3. Assign the remaining instances to their closest
cluster center.
4. Calculate a new cluster center for each cluster.
5. Repeat steps 3-4 until the cluster centers do not change.
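The steps above can be sketched in plain Python as follows. The sample points, the fixed random seed, the iteration cap, and the use of squared Euclidean distance are assumptions made for illustration. For simplicity every instance, including the initial centers, is assigned in step 3.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, seed=0, max_iterations=100):
    """Return (centers, clusters) after running the basic K-means loop."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step 2: random initial centers
    clusters = []
    for _ in range(max_iterations):
        # Step 3: assign each instance to its closest cluster center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Step 4: recompute each center as the mean of its cluster.
        new_centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        # Step 5: stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


# Hypothetical 2-D instances forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, clusters = k_means(points, k=2)  # step 1: K = 2
print(centers)
print(clusters)
```

On less cleanly separated data the result can depend on the initial centers, so it is common to run the algorithm several times with different seeds.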
Table 2.3 • The Credit Card Promotion Database
Income Range ($)  Magazine Promotion  Watch Promotion  Life Insurance Promotion  Credit Card Insurance  Sex     Age
40–50K            Yes                 No               No                        No                     Male    45
30–40K            Yes                 Yes              Yes                       No                     Female  40
40–50K            No                  No               No                        No                     Male    42
30–40K            Yes                 Yes              Yes                       Yes                    Male    43
50–60K            Yes                 No               Yes                       No                     Female  38
20–30K            No                  No               No                        No                     Female  55
30–40K            Yes                 No               Yes                       Yes                    Male    35
20–30K            No                  Yes              No                        No                     Male    27
30–40K            Yes                 No               No                        No                     Male    43
30–40K            Yes                 Yes              Yes                       No                     Female  41
40–50K            No                  Yes              Yes                       No                     Female  43
20–30K            No                  Yes              Yes                       No                     Male    29
50–60K            Yes                 Yes              Yes                       No                     Female  39
40–50K            No                  Yes              No                        No                     Male    55
20–30K            No                  No               Yes                       Yes                    Female  19
A Hypothesis for the Credit
Card Promotion Database
A combination of one or more of the dataset attributes differentiates Acme Credit Card Company card holders who have taken advantage of the life insurance promotion from those card holders who have chosen not to participate in the promotional offer.
Cluster 1
# Instances: 3
Sex: Male => 3
Female => 0
Age: 43.3
Credit Card Insurance:
Yes => 0
No => 3
Life Insurance Promotion: Yes => 0
No => 3
Cluster 2
# Instances: 5
Sex: Male => 3
Female => 2
Age: 37.0
Credit Card Insurance:
Yes => 1
No => 4
Life Insurance Promotion: Yes => 2
No => 3
Cluster 3
# Instances: 7
Sex: Male => 2
Female => 5
Age: 39.9
Credit Card Insurance:
Yes => 2
No => 5
Life Insurance Promotion: Yes => 7
No => 0
Figure 2.3 An unsupervised
cluster of the credit card database
Choosing a Data Mining
Technique
• Know which kind of knowledge you want to get
• Know your data
--What is the interaction between input and
output attributes?
--What is the Distribution of the Data?
--Which Attributes Best Define the Data?
• Know the difference among different data
mining techniques
Questions to Determine Data
Mining Applicability
1. Can the problem be clearly defined?
2. Does potentially meaningful data exist?
3. Does data contain hidden knowledge or
is it just filled with facts?
4. Is the “juice worth the squeeze?”
Data Mining vs. OLAP
Data Mining
• Discovery-based (inductive process)
• Mine data warehouse and others
• Can provide information you didn’t expect
OLAP
• Verification-based (deductive process)
• DSS tool for data warehouse
• Pre-defined queries
Data Mining vs. Data Query
Data Mining
• For hidden knowledge
• Try to get the answer as accurate as possible
• Results are the analysis of the data
• Data needs to be prepared before producing results
Data Query
• For a specific question
• Answer to query is 100% accurate if the data is correct
• Results are a subset of the data
• Need not prepare data