Data Mining:
Knowledge Discovery in Databases
Peter van der Putten
ALP Group, LIACS
Pre-University College
LAPP-Top Computer Science
February 2005
Topics
• Lecture
• Demo Data Mining tool
• Exercises Data Mining tool
• Breaks TBD
Data mining case study
DARPA’s Bio-surveillance Agent
Data mining case study
Credit Scoring for Loan Acceptance
© Chordiant Software
All applications
Expert knowledge: 29.8% accepted, 12.7% infection
Prediction model plus rules: 34.5% accepted, 9.1% infection
[Chart axis: accepted volume]
Data mining case study
Classifying Leukemia
• Problem:
– Leukemia (different types of Leukemia cells look
very similar)
– Given data for a number of samples (patients), can
we
• Accurately diagnose the disease?
• Predict outcome for given treatment?
• Recommend best treatment?
• Solution
– Data mining on micro-array data
Data mining case study
Classifying Leukemia
• 38 training patients, 34 test patients, ~7,000 patient attributes
(microarray gene data)
• 2 classes: Acute Lymphoblastic Leukemia (ALL) vs Acute Myeloid
Leukemia (AML)
• Use train data to build diagnostic model
• Results on test data:
33/34 correct, 1 error may be mislabeled
5 million terabytes created in 2002
• UC Berkeley 2003 estimate: 5 exabytes (5
million terabytes) of new data was created in
2002.
• Twice as much information was created in 2002
as in 1999 (~30% growth rate)
• Other growth rate estimates even higher
• Very little data will ever be looked at by a human
• Knowledge Discovery is NEEDED to make
sense and use of data.
Dilbert puts data mining in perspective
Sources of (artificial) intelligence
• Reasoning versus learning
• Learning from data
– Patient data
– Customer records
– Stock prices
– Piano music
– Criminal mugshots
– Websites
– Robot perceptions
– Etc.
Some working definitions….
• ‘Data Mining’ and ‘Knowledge Discovery in Databases’
(KDD) are used interchangeably
• Data mining =
– The process of discovery of interesting, meaningful and
actionable patterns hidden in large amounts of data
• Multidisciplinary field originating from artificial
intelligence, pattern recognition, statistics, machine
learning, bioinformatics, econometrics, ….
Some working definitions….
• Concepts: kinds of things that can be learned
– Aim: intelligible and operational concept description
– Example: the relation between patient characteristics
and the probability to be diabetic
• Instances: the individual, independent examples of a
concept
– Example: a patient, candidate drug etc.
• Attributes: measuring aspects of an instance
– Example: age, weight, lab tests, microarray data etc
Pattern or attribute space
Data mining tasks
• Predictive data mining
– Classification: classify an instance into a category
– Regression: estimate some continuous value
• Descriptive data mining
– Matching & search: finding instances similar to x
– Clustering: discovering groups of similar instances
– Association rule extraction: if a & b then c
– Summarization: summarizing group descriptions
– Link detection: finding relationships
– …
Data Mining Tasks: Search
Finding best matching instances
Every instance is a point in pattern space. Attributes are the
dimensions of an instance, e.g. age, weight, gender etc.
Pattern spaces may be high dimensional (10 to thousands of
dimensions)
[Figure: pattern space with axes e.g. age and weight]
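Finding the best matching instances can be sketched in a few lines of Python. This is only an illustration: the slides do not say which similarity measure is meant, so Euclidean distance is assumed here, and the patient data and the function names `distance` and `best_matches` are made up for the example.

```python
import math

def distance(a, b):
    """Euclidean distance between two instances given as equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def best_matches(query, instances, n=1):
    """Return the n instances that lie closest to the query in pattern space."""
    return sorted(instances, key=lambda inst: distance(query, inst))[:n]

# Hypothetical patients as (age, weight in kg) points.
patients = [(25, 60), (40, 82), (67, 90), (70, 75)]
print(best_matches((68, 88), patients, n=2))
```

In practice attributes are often rescaled first, otherwise an attribute with a large numeric range dominates the distance.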
Data Mining Tasks: Classification
The goal of a classifier is to separate classes on the basis of
known attributes
The classifier can be applied to an instance with unknown class
For instance, classes are healthy (circle) and sick (square);
attributes are age and weight
[Figure: pattern space with axes age and weight]
Data Mining Tasks: Clustering
Clustering is the discovery of groups in a set of instances
Groups are different, instances in a group are similar
In 2 to 3 dimensional pattern space you could just visualise
the data and leave the recognition to a human end user
In >3 dimensions this is not possible
[Figure: pattern space with axes e.g. age and weight]
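When the pattern space has more than three dimensions, an algorithm has to find the groups. The slides do not name a clustering algorithm; the sketch below assumes k-means as one common choice, with hypothetical four-attribute patient data and hand-picked starting centroids.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means: alternate between assigning points and moving centroids."""
    for _ in range(iterations):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            groups[nearest].append(p)
        # Move each centroid to the mean of its group; keep it if the group is empty.
        centroids = [
            tuple(sum(dim) / len(g) for dim in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return groups, centroids

# Hypothetical instances with four attributes, so they cannot simply be plotted.
points = [
    (20, 55, 110, 4.5), (22, 60, 115, 4.8), (25, 58, 112, 5.0),
    (60, 85, 140, 7.0), (65, 90, 150, 7.5), (63, 88, 145, 7.2),
]
groups, centroids = kmeans(points, centroids=[points[0], points[3]])
print(groups)
```

The number of groups is fixed in advance by the number of starting centroids, which is a limitation of this particular technique.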
Examples of Classification Techniques
• Majority class vote
• Machine learning & AI
• Decision trees
• Nearest neighbor
• Neural networks
• Genetic algorithms / evolutionary computing
• Artificial Immune Systems
• Good old statistics
• …..
Example Classification Algorithm 1
Decision Trees
20000 patients: age > 67?
  yes: 1200 patients: weight > 85kg?
    yes: 400 patients, diabetic (50%)
    no: 800 patients, diabetic (10%)
  no: 18800 patients: gender = male?
    etc.
Decision Trees in Pattern Space
The goal of the classifier is to separate classes (circle, square)
on the basis of the attributes age and weight
Each line corresponds to a split in the tree
Decision areas are ‘tiles’ in pattern space
[Figure: pattern space with axes age and weight]
Special Cases of Decision Trees
• Depth = 0
– Majority class classifier (ZeroR)
• Depth = 1
– One question only
– Also known as decision stump
• Depth = n
– Any number of branches
• Various algorithms exist to learn the tree from data
– The major difference is the criterion that determines on which
attribute value to split
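The two smallest special cases can be sketched as follows. This is an illustration under assumptions: the split criterion used here is simply the number of misclassified training instances, whereas tree learners in practice often use other criteria such as information gain. The data and the function names are hypothetical.

```python
from collections import Counter

def majority_class(labels):
    """Depth 0 (ZeroR): always predict the most frequent class."""
    return Counter(labels).most_common(1)[0][0]

def learn_stump(values, labels):
    """Depth 1: find the threshold on one numeric attribute with the fewest errors."""
    best = None
    for threshold in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        if not left or not right:
            continue  # a split must send instances to both branches
        left_class, right_class = majority_class(left), majority_class(right)
        errors = sum(l != left_class for l in left) + sum(l != right_class for l in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_class, right_class)
    if best is None:
        raise ValueError("need at least two distinct attribute values")
    return best[1:]

# Hypothetical training data: one attribute (age) and a class label per patient.
ages = [30, 45, 50, 68, 72, 80]
labels = ["healthy", "healthy", "healthy", "sick", "healthy", "sick"]

print(majority_class(labels))
print(learn_stump(ages, labels))  # (threshold, class if age <= threshold, class otherwise)
```

A full tree learner applies the same split search recursively to each branch.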
Example classification algorithm 2:
Nearest Neighbour
• The data itself is the classification model, so there is no
abstraction like a tree etc.
• For a given instance x, search the k instances
that are most similar to x
• Classify x as the most frequent class among the k
most similar instances
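The three steps above can be sketched directly. The sketch assumes numeric attributes and Euclidean distance as the similarity measure; the training data and the name `knn_classify` are invented for the example.

```python
import math
from collections import Counter

def knn_classify(x, training, k=3):
    """Classify x by majority vote among its k nearest training instances."""
    # The training data is the model: sort it by distance to x and keep the k closest.
    neighbours = sorted(training, key=lambda item: math.dist(x, item[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical training instances: ((age, weight in kg), class).
training = [
    ((25, 60), "healthy"), ((30, 65), "healthy"), ((35, 70), "healthy"),
    ((68, 95), "sick"), ((72, 90), "sick"), ((75, 100), "sick"),
]
print(knn_classify((70, 92), training, k=3))
```

Choosing an odd k for a two-class problem avoids tied votes.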
Nearest Neighbor in Pattern Space
Any decision area possible
Condition: enough data available
[Figure: classification in pattern space with axes e.g. age and weight; the legend marks the new instance]
Example classification algorithm 3:
Neural Networks
• Inspired by neuronal computation in the brain (McCulloch &
Pitts 1943 (!))
[Figure: network with input, e.g. customer characteristics, and output, e.g. response]
• Input (attributes) is coded as activation on the input layer
neurons; activation feeds forward through a network of weighted
links between neurons and causes activations on the output
neurons (for instance diabetic yes/no)
• The algorithm learns to find optimal weights using the training
instances and a general learning rule.
Neural Networks
• Example simple network (2 layers)
[Figure: inputs age and body_mass_index, connected by the weights w_age and w_body_mass_index to the output "probability of being diabetic"]
• Probability of being diabetic = f(age × w_age + body_mass_index ×
w_body_mass_index)
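The forward pass of this simple network fits in one function. The slide does not say what f is; the sketch assumes the logistic (sigmoid) function, which maps any weighted sum to a value between 0 and 1. The optional bias term and the weight values are also assumptions made for the illustration, not learned values.

```python
import math

def sigmoid(z):
    """Logistic function: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(age, bmi, w_age, w_bmi, bias=0.0):
    """Output activation: f(age * w_age + bmi * w_bmi), plus an optional bias."""
    return sigmoid(age * w_age + bmi * w_bmi + bias)

# Made-up weights; a learning rule would normally fit these to training data.
print(predict(age=60, bmi=30, w_age=0.02, w_bmi=0.05, bias=-3.0))
print(predict(age=40, bmi=30, w_age=0.02, w_bmi=0.05, bias=-3.0))
```

With these positive weights, a higher age gives a higher output, which is all the example is meant to show.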
Neural Networks in Pattern Space
Classification
Simple network: only a line available (why?) to separate
classes
Multilayer network: any classification boundary possible
[Figure: pattern space with axes e.g. age and weight]
Decision Tree Demo in WEKA,
an open source data mining tool
Descriptive data mining:
association rules
• Discovery of interesting patterns
• Rule format: if A (and B and C etc) then Z
• Example:
– If customer buys potatoes (A) and sauerkraut (B) then
customer buys sausage (Z)
• Quality measures for a rule
– Support of the condition: how often do potatoes and sauerkraut
occur together (A, B)
– Confidence of the rule: how often do sausages then also occur,
divided by the support of the condition (is A, B → Z always true?)
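Both quality measures are easy to compute by counting. The sketch below uses a handful of invented shopping baskets; the function names `support` and `confidence` are chosen for the example and do not refer to any particular tool.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(condition, consequence, transactions):
    """Support of condition and consequence together, divided by support of the condition."""
    both = set(condition) | set(consequence)
    return support(both, transactions) / support(condition, transactions)

# Hypothetical shopping baskets.
baskets = [
    {"potatoes", "sauerkraut", "sausage"},
    {"potatoes", "sauerkraut", "sausage"},
    {"potatoes", "sauerkraut"},
    {"potatoes", "bread"},
    {"bread", "sausage"},
]
print(support({"potatoes", "sauerkraut"}, baskets))
print(confidence({"potatoes", "sauerkraut"}, {"sausage"}, baskets))
```

Here potatoes and sauerkraut occur together in 3 of 5 baskets, and sausage is also present in 2 of those 3. The confidence is undefined when the condition never occurs, so a real implementation would need to guard against dividing by zero.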
Association rule demo in WEKA
Some examples of my research areas
(Jointly with students)
• Mix between applications and new algorithms
– Video mining: recognize settings, porn filtering
– Artificial Immune Systems: copying learning ability of immune
systems
– Predicting Survival Rate for Throat Cancer Patients
– Crime Data Mining
– Fusing Data from Multiple Sources
– Decisioning: offering the right product to the right customer
using predictions
– Bias variance evaluation: distinguish between different
sources of error for a classifier
What have we learned so far?
• Case Studies
• Learning versus reasoning
• Data mining definitions
• Data mining tasks
• Example data mining techniques for classification
• Example data mining techniques for association rules
• WEKA Demos
• And now: lab sessions