WEKA-application
Overview
Definitions of Data Mining
• Search for Valuable Information in Large Volumes of Data.
• Exploration & Analysis, by Automatic or Semi-Automatic Means, of Large Quantities of Data in order to Discover Meaningful Patterns & Rules.
• The automated extraction of predictive information from large databases.
Knowledge Discovery in Databases
Methodology
1. Identify the problem
2. Prepare the data
3. Explore models
4. Use the model
5. Monitor the model
Data Mining Is...
Data Mining Is Not...
Data Mining Tasks
• Classification [Predictive]
• Clustering [Descriptive]
• Association Rule Discovery [Descriptive]
• Sequential Pattern Discovery [Descriptive]
• Regression [Predictive]
• Deviation Detection [Predictive]
From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996
@relation escape.symbolic
@attribute age {junior, adult, senior}
@attribute from {Europe, America, Asia}
@attribute education {university, college}
@attribute occupation {TRUE, FALSE}
@attribute tourX {yes, no}
@data
junior,Europe,university,FALSE,no
junior,Europe,university,TRUE,no
adult,Europe,university,FALSE,yes
senior,America,university,FALSE,yes
senior,Asia,college,FALSE,yes
senior,Asia,college,TRUE,no
adult,Asia,college,TRUE,yes
junior,America,university,FALSE,no
junior,Asia,college,FALSE,yes
senior,America,college,FALSE,yes
junior,America,college,TRUE,yes
adult,America,university,TRUE,yes
adult,Europe,college,FALSE,yes
senior,America,university,TRUE,no
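The ARFF snippet above is plain text and straightforward to parse by hand. A minimal Python sketch (a hypothetical helper, not WEKA's own loader), shown here on a shortened two-attribute version of the file:

```python
# Minimal ARFF reader sketch: parses nominal @attribute declarations
# and comma-separated @data rows into dictionaries.
def load_arff(text):
    attributes, rows = [], []
    for line in text.splitlines():
        line = line.strip()
        if line.lower().startswith("@attribute"):
            name = line.split(None, 2)[1]
            values = line[line.index("{") + 1:line.rindex("}")]
            attributes.append((name, [v.strip() for v in values.split(",")]))
        elif line and not line.startswith("@"):
            rows.append(dict(zip([a[0] for a in attributes], line.split(","))))
    return attributes, rows

arff = """@relation escape.symbolic
@attribute age {junior, adult, senior}
@attribute tourX {yes, no}
@data
junior,no
adult,yes"""
attrs, data = load_arff(arff)
print(attrs[0])  # ('age', ['junior', 'adult', 'senior'])
```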
Automatic construction of a decision tree
• at each level, search for the most discriminating attribute
• Partition(data P)
  if all the elements of P are in the same class then return;
  otherwise, for each attribute A do
    evaluate the quality of the partitioning on A
  use the best partitioning to divide P into P1, P2, ..., Pn
  for i = 1 to n do Partition(Pi)
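The partitioning procedure above can be sketched in Python as a simplified ID3-style builder, using information gain as the measure of split quality (the slide does not name a specific measure, so that choice is an assumption; function names are illustrative):

```python
import math
from collections import Counter

def entropy(rows, target):
    # Shannon entropy of the class distribution in rows
    counts = Counter(r[target] for r in rows)
    return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())

def partition(rows, attrs, target):
    # if all elements of P are in the same class, return a leaf
    classes = {r[target] for r in rows}
    if len(classes) == 1:
        return classes.pop()
    if not attrs:  # no attribute left: majority-class leaf
        return Counter(r[target] for r in rows).most_common(1)[0][0]
    # evaluate the quality of the split on each attribute A (information gain)
    def gain(a):
        groups = Counter(r[a] for r in rows)
        rem = sum(n / len(rows) * entropy([r for r in rows if r[a] == v], target)
                  for v, n in groups.items())
        return entropy(rows, target) - rem
    best = max(attrs, key=gain)
    # use the best split to divide P into P1..Pn and recurse
    rest = [a for a in attrs if a != best]
    return {best: {v: partition([r for r in rows if r[best] == v], rest, target)
                   for v in {r[best] for r in rows}}}

rows = [{"age": "junior", "tourX": "no"},
        {"age": "adult", "tourX": "yes"},
        {"age": "junior", "tourX": "no"}]
print(partition(rows, ["age"], "tourX"))
```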
Decision tree
[Figure: decision tree with root node Age. junior -> Education (university: NO, college: YES); adult -> YES; senior -> Occupation (true: NO, false: YES)]
age = junior
| education = university: no (3.0)
| education = college: yes (2.0)
age = adult: yes (4.0)
age = senior
| occupation = TRUE: no (2.0)
| occupation = FALSE: yes (3.0)
Number of Leaves : 5
Size of the tree : 8
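The J48 output above reads as nested if/else tests; a direct Python transcription (hypothetical helper name) applied to one instance:

```python
def classify(instance):
    # Direct transcription of the J48 tree printed above
    # (attribute names and values come from the ARFF file)
    if instance["age"] == "junior":
        return "no" if instance["education"] == "university" else "yes"
    if instance["age"] == "adult":
        return "yes"
    # age == senior
    return "no" if instance["occupation"] == "TRUE" else "yes"

print(classify({"age": "senior", "education": "college", "occupation": "FALSE"}))  # yes
```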
Association Rules
Another tool related to exploratory data analysis, knowledge
discovery and machine learning
Example: LHS ==> RHS,
where both LHS and RHS are sets of items;
if every item in LHS is purchased in a transaction,
then it is likely that the items in RHS will also be purchased.
Apriori
Minimum support: 0.2
Minimum confidence: 0.9
Number of cycles performed: 17
Best rules found:
1. education=college occupation=FALSE 4 ==> tourX=yes 4 (1)
2. from=Asia 4 ==> education=college 4 (1)
3. age=adult 4 ==> tourX=yes 4 (1)
4. from=Asia tourX=yes 3 ==> education=college 3 (1)
5. age=senior occupation=FALSE 3 ==> tourX=yes 3 (1)
6. age=senior tourX=yes 3 ==> occupation=FALSE 3 (1)
7. age=junior education=university 3 ==> tourX=no 3 (1)
8. age=junior tourX=no 3 ==> education=university 3 (1)
9. from=Asia occupation=FALSE 2 ==> education=college tourX=yes 2 (1)
10. from=Asia education=college occupation=FALSE 2 ==> tourX=yes 2 (1)
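In this output format, e.g. "age=adult 4 ==> tourX=yes 4 (1)", the first number is the LHS support count, the second the count of transactions matching both sides, and the parenthesized value the confidence. A small sketch (with a hypothetical `rule_stats` helper) recomputing these figures from the 14 ARFF instances:

```python
FIELDS = ("age", "from", "education", "occupation", "tourX")
DATA = [dict(zip(FIELDS, row.split(","))) for row in """\
junior,Europe,university,FALSE,no
junior,Europe,university,TRUE,no
adult,Europe,university,FALSE,yes
senior,America,university,FALSE,yes
senior,Asia,college,FALSE,yes
senior,Asia,college,TRUE,no
adult,Asia,college,TRUE,yes
junior,America,university,FALSE,no
junior,Asia,college,FALSE,yes
senior,America,college,FALSE,yes
junior,America,college,TRUE,yes
adult,America,university,TRUE,yes
adult,Europe,college,FALSE,yes
senior,America,university,TRUE,no""".splitlines()]

def rule_stats(lhs, rhs):
    # Return (LHS count, LHS-and-RHS count, confidence) for itemsets given as dicts
    matches = lambda r, items: all(r[k] == v for k, v in items.items())
    n_lhs = sum(matches(r, lhs) for r in DATA)
    n_both = sum(matches(r, lhs) and matches(r, rhs) for r in DATA)
    return n_lhs, n_both, n_both / n_lhs

# Rule 3 from the output: age=adult 4 ==> tourX=yes 4 (1)
print(rule_stats({"age": "adult"}, {"tourX": "yes"}))  # (4, 4, 1.0)
```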
Naïve Bayesian Classifiers
Bayes rule (or law or theorem)
P(Y|X) = P(X|Y) P(Y) / P(X)
Conditional independence
If A and B are conditionally independent given C, then
P(A,B|C) = P(A|C) P(B|C)
Class yes: P(C) = 0.625
Attribute age:        junior 0.25     adult 0.41666667    senior 0.33333333
Attribute from:       Europe 0.25     America 0.41666667  Asia 0.33333333
Attribute education:  university 0.36363636    college 0.63636364
Attribute occupation: TRUE 0.36363636    FALSE 0.63636364

Class no: P(C) = 0.375
Attribute age:        junior 0.5      adult 0.125    senior 0.375
Attribute from:       Europe 0.375    America 0.375  Asia 0.25
Attribute education:  university 0.71428571    college 0.28571429
Attribute occupation: TRUE 0.57142857    FALSE 0.42857143
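These tables are consistent with Laplace smoothing (add one to every count): P(yes) = (9+1)/(14+2) = 0.625 and P(age=junior|yes) = (2+1)/(9+3) = 0.25. A sketch reproducing the age table from the 14 instances (helper names are illustrative):

```python
from collections import Counter

# (age, tourX) pairs from the 14 ARFF instances above
PAIRS = [("junior","no"),("junior","no"),("adult","yes"),("senior","yes"),
         ("senior","yes"),("senior","no"),("adult","yes"),("junior","no"),
         ("junior","yes"),("senior","yes"),("junior","yes"),("adult","yes"),
         ("adult","yes"),("senior","no")]
AGES = ("junior", "adult", "senior")

def p_class(c):
    # Laplace smoothing: add 1 per class value (2 classes)
    return (sum(t == c for _, t in PAIRS) + 1) / (len(PAIRS) + 2)

def p_age_given(c):
    counts = Counter(a for a, t in PAIRS if t == c)
    n = sum(counts.values())
    # Laplace smoothing: add 1 per attribute value (3 ages)
    return {a: (counts[a] + 1) / (n + 3) for a in AGES}

print(p_class("yes"))                # 0.625
print(p_age_given("yes")["junior"])  # 0.25
```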
Clustering
Aim: discover regularities in the data, i.e. find clusters of instances.
Distance measures
• Similarity is usually defined by a distance measure
• Two basic distance measures are in common use
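The slide does not name the two measures; assuming the usual pair, Euclidean and Manhattan distance, a minimal sketch:

```python
import math

# Assumption: the two basic measures meant here are Euclidean and Manhattan.
def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
```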
Number of clusters: 2

Cluster: 0 Prior probability: 0.5643
Attribute: age
Discrete Estimator. Counts = 2.9 3.63 4.37 (Total = 10.9)
Attribute: from
Discrete Estimator. Counts = 2.34 3.79 4.78 (Total = 10.9)
Attribute: education
Discrete Estimator. Counts = 2.56 7.34 (Total = 9.9)
Attribute: occupation
Discrete Estimator. Counts = 4.22 5.68 (Total = 9.9)
Attribute: tourX
Discrete Estimator. Counts = 7.74 2.16 (Total = 9.9)

Cluster: 1 Prior probability: 0.4357
Attribute: age
Discrete Estimator. Counts = 4.1 2.37 2.63 (Total = 9.1)
Attribute: from
Discrete Estimator. Counts = 3.66 4.21 1.22 (Total = 9.1)
Attribute: education
Discrete Estimator. Counts = 6.44 1.66 (Total = 8.1)
Attribute: occupation
Discrete Estimator. Counts = 3.78 4.32 (Total = 8.1)
Attribute: tourX
Discrete Estimator. Counts = 3.26 4.84 (Total = 8.1)

=== Clustering stats for training data ===
Cluster Instances
0       7 ( 50%)
1       7 ( 50%)
Log likelihood: -4.00994
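Each "Discrete Estimator. Counts" line holds smoothed per-cluster frequency counts; normalizing a count by its total gives a per-cluster probability which, multiplied by the cluster prior, yields a cluster posterior up to normalization. A sketch using only the tourX counts from the output above (the yes/no value order is assumed to follow the ARFF declaration):

```python
# Priors and tourX counts taken from the EM output above;
# mapping counts to yes/no in ARFF declaration order is an assumption.
PRIORS = {0: 0.5643, 1: 0.4357}
TOURX = {0: {"yes": 7.74, "no": 2.16}, 1: {"yes": 3.26, "no": 4.84}}

def posterior(value):
    # P(cluster | tourX=value), using only this single attribute for brevity
    score = {c: PRIORS[c] * TOURX[c][value] / sum(TOURX[c].values())
             for c in PRIORS}
    z = sum(score.values())
    return {c: s / z for c, s in score.items()}

print(posterior("yes"))  # cluster 0 is the more likely cluster for tourX=yes
```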
Example: WEKA
Reference
Theory:
Berry and Linoff, Data Mining Techniques, 1997
http://www3.shore.net/~kht/dmintro/dmintro.htm
Software:
http://www.cs.waikato.ac.nz/ml/weka/