Part I
Data Mining Fundamentals
Data Mining: A First View
Chapter 1
1.1 Data Mining: A Definition
Data Mining
The process of employing one or more
computer learning techniques to
automatically analyze and extract
knowledge from data.
Induction-based Learning
The process of forming general
concept definitions by observing
specific examples of concepts to be
learned.
Knowledge Discovery in
Databases (KDD)
The application of the scientific
method to data mining.
Data mining is
one step of the KDD process.
1.2 What Can Computers Learn?
Four Levels of Learning
• Facts
• Concepts
• Procedures
• Principles
Facts
A fact is a simple statement of truth.
Concepts
A concept is a set of objects, symbols, or
events grouped together because they
share certain characteristics.
Procedures
A procedure is a step-by-step course
of action to achieve a goal.
Principles
Principles are general truths or laws
that are basic to other truths.
Computers & Learning
Computers are good at learning concepts.
Concepts are the output of a data mining
session.
Three Concept Views
• Classical View
• Probabilistic View
• Exemplar View
Classical View
All concepts have definite
defining properties.
Probabilistic View
People store and recall concepts
as generalizations created by
observations.
Exemplar View
People store and recall likely
concept exemplars that are used
to classify unknown instances.
Supervised Learning
• Build a learner model using data
instances of known origin.
• Use the model to determine the
outcome for new instances of
unknown origin.
Supervised Learning:
A Decision Tree Example
Decision Tree
A tree structure where non-terminal
nodes represent tests on one or more
attributes and terminal nodes reflect
decision outcomes.
Table 1.1 • Hypothetical Training Data for Disease Diagnosis

Patient ID#  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
1            Yes          Yes    Yes             Yes         Yes       Strep throat
2            No           No     No              Yes         Yes       Allergy
3            Yes          Yes    No              Yes         No        Cold
4            Yes          No     Yes             No          No        Strep throat
5            No           Yes    No              Yes         No        Cold
6            No           No     No              Yes         No        Allergy
7            No           No     Yes             No          No        Strep throat
8            Yes          No     No              Yes         Yes       Allergy
9            No           Yes    No              Yes         Yes       Cold
10           Yes          Yes    No              Yes         Yes       Cold
Swollen Glands
  Yes: Diagnosis = Strep Throat
  No: Fever
    No: Diagnosis = Allergy
    Yes: Diagnosis = Cold
Figure 1.1 A decision tree for the data
in Table 1.1
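
As a minimal sketch, the tree in Figure 1.1 can be written as a nested structure and checked against the training data. The names `TREE`, `TRAINING`, `classify`, and `training_errors` are illustrative assumptions, not part of the original slides. Only the two attributes the tree tests are copied from Table 1.1.

```python
# Figure 1.1 as a nested structure: an internal node is a pair
# (attribute, {attribute value: subtree}); a leaf is a diagnosis string.
TREE = ("Swollen Glands", {
    "Yes": "Strep Throat",
    "No": ("Fever", {
        "No": "Allergy",
        "Yes": "Cold",
    }),
})

# Columns of Table 1.1 that the tree uses:
# (patient ID, swollen glands, fever, diagnosis).
TRAINING = [
    (1, "Yes", "Yes", "Strep Throat"),
    (2, "No", "No", "Allergy"),
    (3, "No", "Yes", "Cold"),
    (4, "Yes", "No", "Strep Throat"),
    (5, "No", "Yes", "Cold"),
    (6, "No", "No", "Allergy"),
    (7, "Yes", "No", "Strep Throat"),
    (8, "No", "No", "Allergy"),
    (9, "No", "Yes", "Cold"),
    (10, "No", "Yes", "Cold"),
]


def classify(tree, instance):
    """Follow the attribute tests from the root until a leaf is reached."""
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[instance[attribute]]
    return tree


def training_errors():
    """Return the IDs of training patients that the tree misclassifies."""
    errors = []
    for pid, glands, fever, diagnosis in TRAINING:
        predicted = classify(TREE, {"Swollen Glands": glands, "Fever": fever})
        if predicted != diagnosis:
            errors.append(pid)
    return errors
```

Under these assumptions, `training_errors()` returns an empty list: the tree reproduces every diagnosis in the training data while testing only two of the five attributes.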
Table 1.2 • Data Instances with an Unknown Classification

Patient ID#  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
11           No           No     Yes             Yes         Yes       ?
12           Yes          Yes    No              No          Yes       ?
13           No           No     No              No          Yes       ?
Production Rules
IF Swollen Glands = Yes
THEN Diagnosis = Strep Throat
IF Swollen Glands = No & Fever = Yes
THEN Diagnosis = Cold
IF Swollen Glands = No & Fever = No
THEN Diagnosis = Allergy
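
One way to apply the production rules to the unclassified instances of Table 1.2 is sketched below. The rule encoding, the first-match strategy, and the names `RULES`, `UNKNOWN`, and `apply_rules` are assumptions made for illustration. The attribute values are those read from Table 1.2.

```python
# Each production rule: (conditions that must all hold, conclusion).
RULES = [
    ({"Swollen Glands": "Yes"}, "Strep Throat"),
    ({"Swollen Glands": "No", "Fever": "Yes"}, "Cold"),
    ({"Swollen Glands": "No", "Fever": "No"}, "Allergy"),
]

# Table 1.2, restricted to the two attributes the rules test.
UNKNOWN = {
    11: {"Swollen Glands": "Yes", "Fever": "No"},
    12: {"Swollen Glands": "No", "Fever": "Yes"},
    13: {"Swollen Glands": "No", "Fever": "No"},
}


def apply_rules(instance, rules=RULES):
    """Return the conclusion of the first rule whose conditions all match."""
    for conditions, conclusion in rules:
        if all(instance.get(attr) == value for attr, value in conditions.items()):
            return conclusion
    return None  # no rule fires


predictions = {pid: apply_rules(inst) for pid, inst in UNKNOWN.items()}
```

With these inputs the rules label patient 11 as Strep Throat, patient 12 as Cold, and patient 13 as Allergy. Because the three rules are mutually exclusive, the order in which they are tried does not change the result.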
Unsupervised Clustering
A data mining method that builds
models from data without predefined
classes.
The Acme Investors Dataset
Table 1.3 • Acme Investors Incorporated

Customer ID  Account Type  Margin Account  Transaction Method  Trades/Month  Sex  Age    Favorite Recreation  Annual Income
1005         Joint         No              Online              12.5          F    30–39  Tennis               40–59K
1013         Custodial     No              Broker              0.5           F    50–59  Skiing               80–99K
1245         Joint         No              Online              3.6           M    20–29  Golf                 20–39K
2110         Individual    Yes             Broker              22.3          M    30–39  Fishing              40–59K
1001         Individual    Yes             Online              5.0           M    40–49  Golf                 60–79K
The Acme Investors Dataset &
Supervised Learning
1. Can I develop a general profile of an online investor?
2. Can I determine if a new customer is likely to open a
margin account?
3. Can I build a model to predict the average number of trades
per month for a new investor?
4. What characteristics differentiate female and male
investors?
The Acme Investors Dataset &
Unsupervised Clustering
1. What attribute similarities group customers
of Acme Investors together?
2. What differences in attribute values
segment the customer database?
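
The first question can be explored with a simple pairwise similarity score, sketched here. This is not a full clustering algorithm. The simple matching measure, the choice of the five categorical columns from Table 1.3, and all function names are assumptions for illustration.

```python
from itertools import combinations

# Categorical columns of Table 1.3:
# (account type, margin account, transaction method, sex, favorite recreation).
CUSTOMERS = {
    1005: ("Joint", "No", "Online", "F", "Tennis"),
    1013: ("Custodial", "No", "Broker", "F", "Skiing"),
    1245: ("Joint", "No", "Online", "M", "Golf"),
    2110: ("Individual", "Yes", "Broker", "M", "Fishing"),
    1001: ("Individual", "Yes", "Online", "M", "Golf"),
}


def similarity(a, b):
    """Fraction of attributes on which two customers agree (simple matching)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def pairwise_similarities(customers=CUSTOMERS):
    """Similarity score for every unordered pair of customer IDs."""
    return {
        (i, j): similarity(customers[i], customers[j])
        for i, j in combinations(sorted(customers), 2)
    }


def most_similar_pairs(customers=CUSTOMERS):
    """Return the pairs that share the highest score, plus that score."""
    scores = pairwise_similarities(customers)
    best = max(scores.values())
    return sorted(pair for pair, s in scores.items() if s == best), best
```

A clustering method would go further and use scores like these to place similar customers in the same group, without being told in advance what the groups are.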
1.3 Is Data Mining Appropriate
for My Problem?
Data Mining or Data Query?
• Shallow Knowledge
• Multidimensional Knowledge
• Hidden Knowledge
• Deep Knowledge
Shallow Knowledge
Shallow knowledge is factual. It can
be easily stored and manipulated in a
database.
Multidimensional Knowledge
Multidimensional knowledge is also
factual. On-line Analytical Processing
(OLAP) tools are used to manipulate
multidimensional knowledge.
Hidden Knowledge
Hidden knowledge represents patterns
or regularities in data that cannot be
easily found using database query.
However, data mining algorithms can
find such patterns with ease.
Deep Knowledge
Deep knowledge is knowledge stored
in a database that can only be found if
we are given some direction about what
we are looking for.
Data Mining vs. Data Query: An
Example
• Use data query if you already
know, or almost know, what you are
looking for.
• Use data mining to find regularities
in data that are not obvious.
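
To illustrate the data query side of this contrast, the sketch below answers a fully specified question with plain SQL through Python's built-in `sqlite3` module. The table name `acme`, the column names, and the function name are assumptions. The rows are a subset of the columns of the Acme Investors table (Table 1.3).

```python
import sqlite3


def margin_account_customers():
    """Data query: the question is known in advance, so SQL answers it directly."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE acme (customer_id INTEGER, margin_account TEXT, "
        "transaction_method TEXT, trades_per_month REAL)"
    )
    conn.executemany(
        "INSERT INTO acme VALUES (?, ?, ?, ?)",
        [
            (1005, "No", "Online", 12.5),
            (1013, "No", "Broker", 0.5),
            (1245, "No", "Online", 3.6),
            (2110, "Yes", "Broker", 22.3),
            (1001, "Yes", "Online", 5.0),
        ],
    )
    rows = conn.execute(
        "SELECT customer_id FROM acme "
        "WHERE margin_account = 'Yes' ORDER BY customer_id"
    ).fetchall()
    conn.close()
    return [customer_id for (customer_id,) in rows]
```

The query retrieves stored facts (which customers hold a margin account). It does not reveal which attributes tend to go together with holding one; that kind of regularity is what a data mining method looks for.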
1.4 Expert Systems or Data
Mining?
Expert System
A computer program that emulates
the problem-solving skills of one or
more human experts.
Knowledge Engineer
A person trained to interact with an
expert in order to capture their
knowledge.
Data -> Data Mining Tool -> If Swollen Glands = Yes Then Diagnosis = Strep Throat
Human Expert -> Knowledge Engineer -> Expert System Building Tool -> If Swollen Glands = Yes Then Diagnosis = Strep Throat
Figure 1.2 Data mining vs. expert
systems
1.5 A Simple Data Mining
Process Model
Operational Database / Data Warehouse -> SQL Queries -> Data Mining -> Interpretation & Evaluation -> Result Application
Figure 1.3 A simple data mining
process model
Assembling the Data
• The Data Warehouse
• Relational Databases and Flat Files
The Data Warehouse
The data warehouse is a historical
database designed for decision
support.
Mining the Data
Interpreting the Results
Result Application
1.6 Why Not Simple Search?
• Nearest Neighbor Classifier
• K-nearest Neighbor Classifier
Nearest Neighbor Classifier
Classification is performed by searching
the training data for the instance closest
in distance to the unknown instance.
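
A minimal nearest neighbor sketch over the training data of Table 1.1 follows. The slides do not name a distance measure, so the Hamming distance used here (the number of attributes that differ) is an assumption, as are the names `TRAINING`, `hamming`, and `nearest_neighbor`.

```python
# Table 1.1, one row per patient (IDs 1 to 10, in order):
# ((sore throat, fever, swollen glands, congestion, headache), diagnosis).
TRAINING = [
    (("Yes", "Yes", "Yes", "Yes", "Yes"), "Strep Throat"),
    (("No", "No", "No", "Yes", "Yes"), "Allergy"),
    (("Yes", "Yes", "No", "Yes", "No"), "Cold"),
    (("Yes", "No", "Yes", "No", "No"), "Strep Throat"),
    (("No", "Yes", "No", "Yes", "No"), "Cold"),
    (("No", "No", "No", "Yes", "No"), "Allergy"),
    (("No", "No", "Yes", "No", "No"), "Strep Throat"),
    (("Yes", "No", "No", "Yes", "Yes"), "Allergy"),
    (("No", "Yes", "No", "Yes", "Yes"), "Cold"),
    (("Yes", "Yes", "No", "Yes", "Yes"), "Cold"),
]


def hamming(a, b):
    """Number of attribute positions where the two instances differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest_neighbor(query, training=TRAINING):
    """Return (diagnosis, distance) of the closest training instance.

    Ties go to the earliest row, because min() keeps the first minimum.
    """
    features, diagnosis = min(training, key=lambda row: hamming(row[0], query))
    return diagnosis, hamming(features, query)
```

For example, `nearest_neighbor(("Yes", "Yes", "No", "No", "Yes"))` finds patient 10 at distance 1 and returns the diagnosis Cold. Every classification scans the whole training set, and no general concept definition is produced along the way.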
1.7 Data Mining Applications
Customer Intrinsic Value
Figure 1.4 Intrinsic vs. actual
customer value (scatter plot; axes: Intrinsic (Predicted) Value, Actual Value)