WEKA Application Overview

Definitions of Data Mining
• Search for valuable information in large volumes of data.
• Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules.
• The automated extraction of predictive information from large databases.

Knowledge Discovery in Databases: Methodology
1. Identify the problem
2. Prepare the data
3. Explore models
4. Use the model
5. Monitor the model

Data Mining Tasks
• Classification [Predictive]
• Clustering [Descriptive]
• Association Rule Discovery [Descriptive]
• Sequential Pattern Discovery [Descriptive]
• Regression [Predictive]
• Deviation Detection [Predictive]
From [Fayyad, et al.], Advances in Knowledge Discovery and Data Mining, 1996

Example dataset in ARFF format:

@relation escape.symbolic
@attribute age {junior, adult, senior}
@attribute from {Europe, America, Asia}
@attribute education {university, college}
@attribute occupation {TRUE, FALSE}
@attribute tourX {yes, no}
@data
junior,Europe,university,FALSE,no
junior,Europe,university,TRUE,no
adult,Europe,university,FALSE,yes
senior,America,university,FALSE,yes
senior,Asia,college,FALSE,yes
senior,Asia,college,TRUE,no
adult,Asia,college,TRUE,yes
junior,America,university,FALSE,no
junior,Asia,college,FALSE,yes
senior,America,college,FALSE,yes
junior,America,college,TRUE,yes
adult,America,university,TRUE,yes
adult,Europe,college,FALSE,yes
senior,America,university,TRUE,no

Automatic Construction of a Decision Tree
• At each level, search for the most discriminating attribute:

partition(data P)
    if all elements of P are in the same class then return
    otherwise, for each attribute A
        evaluate the quality of the partitioning on A
    use the best partitioning to divide P into P1, P2, ..., Pn
    for i = 1 to n do partition(Pi)

Decision Tree
The tree learned by J48 splits on age at the root, on education in the junior branch, and on occupation in the senior branch (a Java sketch for reproducing it follows the Apriori output below):

age = junior
|   education = university: no (3.0)
|   education = college: yes (2.0)
age = adult: yes (4.0)
age = senior
|   occupation = TRUE: no (2.0)
|   occupation = FALSE: yes (3.0)

Number of Leaves: 5
Size of the tree: 8

Association Rules
• Another tool related to exploratory data analysis, knowledge discovery, and machine learning.
• Example: LHS ==> RHS, where both LHS and RHS are sets of items; if every item in LHS is purchased in a transaction, then it is likely that the items in RHS will also be purchased.

Apriori
Minimum support: 0.2
Minimum confidence: 0.9
Number of cycles performed: 17

Best rules found:
1. education=college occupation=FALSE 4 ==> tourX=yes 4 (1)
2. from=Asia 4 ==> education=college 4 (1)
3. age=adult 4 ==> tourX=yes 4 (1)
4. from=Asia tourX=yes 3 ==> education=college 3 (1)
5. age=senior occupation=FALSE 3 ==> tourX=yes 3 (1)
6. age=senior tourX=yes 3 ==> occupation=FALSE 3 (1)
7. age=junior education=university 3 ==> tourX=no 3 (1)
8. age=junior tourX=no 3 ==> education=university 3 (1)
9. from=Asia occupation=FALSE 2 ==> education=college tourX=yes 2 (1)
10. from=Asia education=college occupation=FALSE 2 ==> tourX=yes 2 (1)
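A minimal sketch of how the J48 tree above could be reproduced with the WEKA Java API. It assumes the ARFF listing has been saved as escape.arff; the file name and the class name TourXTree are our own choices, not part of the slides:

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TourXTree {
    public static void main(String[] args) throws Exception {
        // Load the dataset shown earlier (file name is an assumption).
        Instances data = DataSource.read("escape.arff");
        // The last attribute, tourX, is the class to predict.
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();       // WEKA's C4.5 decision-tree learner
        tree.buildClassifier(data);
        System.out.println(tree);   // prints the tree, number of leaves, and tree size
    }
}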
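Similarly, a sketch of the Apriori run above under the same assumptions (same assumed file name; TourXRules is our own name). setLowerBoundMinSupport and setMinMetric correspond to the minimum support and minimum confidence reported in the output:

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TourXRules {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("escape.arff"); // assumed file name
        Apriori apriori = new Apriori();
        apriori.setLowerBoundMinSupport(0.2); // minimum support: 0.2
        apriori.setMinMetric(0.9);            // minimum confidence: 0.9
        apriori.buildAssociations(data);      // no class attribute needed for rule mining
        System.out.println(apriori);          // prints the "Best rules found" list
    }
}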
Naïve Bayesian Classifiers

Bayes rule (or law, or theorem):
P(Y|X) = P(X|Y) P(Y) / P(X)

Conditional independence: if A and B are conditionally independent given C, then
P(A, B|C) = P(A|C) P(B|C)

Model learned from the escape.symbolic data; the probabilities are Laplace-smoothed relative frequencies over the 14 training instances (a Java sketch for reproducing this model follows the references):

Class yes: P(C) = 0.625
Class no:  P(C) = 0.375

Class yes:
    age:        junior 0.25,  adult 0.41666667,  senior 0.33333333
    from:       Europe 0.25,  America 0.41666667,  Asia 0.33333333
    education:  university 0.36363636,  college 0.63636364
    occupation: TRUE 0.36363636,  FALSE 0.63636364

Class no:
    age:        junior 0.5,  adult 0.125,  senior 0.375
    from:       Europe 0.375,  America 0.375,  Asia 0.25
    education:  university 0.71428571,  college 0.28571429
    occupation: TRUE 0.57142857,  FALSE 0.42857143

Example: for x = (junior, Europe, university, FALSE),
P(yes) P(x|yes) = 0.625 × 0.25 × 0.25 × 0.36363636 × 0.63636364 ≈ 0.0090
P(no) P(x|no) = 0.375 × 0.5 × 0.375 × 0.71428571 × 0.42857143 ≈ 0.0215
so after normalizing, P(no|x) ≈ 0.70 and the instance is classified as no.

Clustering
Aim: discover regularities in data, i.e., find clusters of instances.

Distance measures
• Similarity is usually defined by distance.
• Two basic distance measures are Euclidean and Manhattan distance.

EM output for two clusters (a Java sketch follows the references):

Number of clusters: 2

Cluster: 0, Prior probability: 0.5643
    Attribute: age         Discrete Estimator. Counts = 2.9 3.63 4.37 (Total = 10.9)
    Attribute: from        Discrete Estimator. Counts = 2.34 3.79 4.78 (Total = 10.9)
    Attribute: education   Discrete Estimator. Counts = 2.56 7.34 (Total = 9.9)
    Attribute: occupation  Discrete Estimator. Counts = 4.22 5.68 (Total = 9.9)
    Attribute: tourX       Discrete Estimator. Counts = 7.74 2.16 (Total = 9.9)

Cluster: 1, Prior probability: 0.4357
    Attribute: age         Discrete Estimator. Counts = 4.1 2.37 2.63 (Total = 9.1)
    Attribute: from        Discrete Estimator. Counts = 3.66 4.21 1.22 (Total = 9.1)
    Attribute: education   Discrete Estimator. Counts = 6.44 1.66 (Total = 8.1)
    Attribute: occupation  Discrete Estimator. Counts = 3.78 4.32 (Total = 8.1)
    Attribute: tourX       Discrete Estimator. Counts = 3.26 4.84 (Total = 8.1)

=== Clustering stats for training data ===
Cluster  Instances
0        7 (50%)
1        7 (50%)

Log likelihood: -4.00994

References
Theory: Berry & Linoff, Data Mining Techniques, 1997; http://www3.shore.net/~kht/dmintro/dmintro.htm
Software: WEKA, http://www.cs.waikato.ac.nz/ml/weka/
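Returning to the Naïve Bayes model above: a minimal sketch for training it and querying a posterior with the WEKA Java API, under the same escape.arff assumption (TourXBayes is our own name; index 0 of the returned distribution is the class value yes, the first value declared in the ARFF header):

import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TourXBayes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("escape.arff"); // assumed file name
        data.setClassIndex(data.numAttributes() - 1);    // class attribute: tourX

        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(data); // estimates the priors and conditional tables shown above
        System.out.println(nb);

        // Posterior for the first training instance (junior, Europe, university, FALSE):
        // roughly P(yes|x) = 0.30, P(no|x) = 0.70, matching the hand computation above.
        double[] posterior = nb.distributionForInstance(data.instance(0));
        System.out.println("P(yes|x) = " + posterior[0] + ", P(no|x) = " + posterior[1]);
    }
}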
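And a sketch of the EM clustering run above, again assuming escape.arff (TourXClusters is our own name). No class index is set, so all five attributes, including tourX, take part in the clustering, which is consistent with the tourX estimator appearing in the output:

import weka.clusterers.EM;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TourXClusters {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("escape.arff"); // assumed file name
        EM em = new EM();
        em.setNumClusters(2);    // fix k = 2 instead of letting EM choose by cross-validation
        em.buildClusterer(data);
        System.out.println(em);  // cluster priors and per-attribute discrete estimator counts
    }
}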