CSE 711: DATA MINING
Sargur N. Srihari
E-mail: [email protected]
Phone: 645-6164, ext. 113

Slide 2: CSE 711 Texts
Required Text
1. Witten, I. H., and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2000.
Recommended Texts
1. Adriaans, P., and D. Zantinge, Data Mining, Addison-Wesley, 1998.

Slide 3: Input for Data Mining/Machine Learning
• Concepts
  – the result of the learning process
  – must be intelligible and operational
• Instances
• Attributes

Slide 4: Concept Learning
• Four styles of learning in data mining:
  – classification learning (supervised)
  – association learning (associations between features)
  – clustering
  – numeric prediction

Slide 5: Iris Data – Clustering Problem

        Sepal Length  Sepal Width  Petal Length  Petal Width
    1       5.1          3.5          1.4           0.2
    2       4.9          3.0          1.4           0.2
    3       4.7          3.2          1.3           0.2
    4       4.6          3.1          1.5           0.2
    5       5.0          3.6          1.4           0.2
    …
   51       7.0          3.2          4.7           1.4
   52       6.4          3.2          4.5           1.5
   53       6.9          3.1          4.9           1.5
   54       5.5          2.3          4.0           1.3
   55       6.5          2.8          4.6           1.5
    …
  101       6.3          3.3          6.0           2.5
  102       5.8          2.7          5.1           1.9
  103       7.1          3.0          5.9           2.1
  104       6.3          2.9          5.6           1.8
  105       6.5          3.0          5.8           2.2

Slide 6: Weather Data – Numeric Class

  Outlook   Temperature  Humidity  Windy  Play-time
  sunny         85          85     false      5
  sunny         80          90     true       0
  overcast      83          86     false     55
  rainy         70          96     false     40
  rainy         68          80     false     65
  rainy         65          70     true      45
  overcast      64          65     true      60
  sunny         72          95     false      0
  sunny         69          70     false     70
  rainy         75          80     false     45
  sunny         75          70     true      50
  overcast      72          90     true      55
  overcast      81          75     false     75
  rainy         71          91     true      10

Slide 7: Instances
• The input to a machine learning scheme is a set of instances
• The matrix of examples versus attributes is a flat file
• Input data as instances is common but also restrictive in representing relationships between objects

Slide 8: Family Tree Example
[Figure: family tree. Peter (M) = Peggy (F), with children Steven (M), Graham (M), and Pam (F). Grace (F) = Ray (M), with children Ian (M), Pippa (F), and Brian (M). Pam = Ian, with children Anna (F) and Nikki (F).]

Slide 9: Two Ways of Expressing the Sister-of Relation

  (a) All pairs:
  First Person   Second Person   Sister-of?
  Peter          Peggy           no
  Peter          Steven          no
  …              …               …
  Steven         Peter           no
  Steven         Graham          no
  Steven         Pam             yes
  Steven         Grace           no
  …              …               …
  Ian            Pippa           yes
  …              …               …
  Anna           Nikki           yes
  …              …               …
  Nikki          Anna            yes

  (b) Positive pairs only:
  First Person   Second Person   Sister-of?
  Steven         Pam             yes
  Graham         Pam             yes
  Ian            Pippa           yes
  Brian          Pippa           yes
  Anna           Nikki           yes
  Nikki          Anna            yes
  All the rest                   no

Slide 10: Family Tree As Table

  Name     Gender   Parent1   Parent2
  Peter    male     ?         ?
  Peggy    female   ?         ?
  Steven   male     Peter     Peggy
  Graham   male     Peter     Peggy
  Pam      female   Peter     Peggy
  Ian      male     Grace     Ray

Slide 11: Sister-of As Table (combines the two tables)

  First Person                        Second Person                      Sister-of?
  Name    Gender  Parent1  Parent2    Name   Gender  Parent1  Parent2
  Steven  male    Peter    Peggy      Pam    female  Peter    Peggy     yes
  Graham  male    Peter    Peggy      Pam    female  Peter    Peggy     yes
  Ian     male    Grace    Ray        Pippa  female  Grace    Ray       yes
  Brian   male    Grace    Ray        Pippa  female  Grace    Ray       yes
  Anna    female  Pam      Ian        Nikki  female  Pam      Ian       yes
  Nikki   female  Pam      Ian        Anna   female  Pam      Ian       yes
  All the rest                                                          no

Slide 12: Rule for the Sister-of Relation

  If second person's gender = female
  and first person's parent1 = second person's parent1
  then sister-of = yes

(A code sketch applying this rule follows slide 14 below.)

Slide 13: Denormalization
• A relationship between different nodes of a tree is recast into a set of independent instances
• Two records are joined and made into one by the process of flattening
• Relationships among more than two records would be combinatorially large

Slide 14: Denormalization Can Produce Spurious Discoveries
• Supermarket database:
  – customers and products-bought relation
  – products and supplier relation
  – suppliers and their address relation
• Denormalizing produces a flat file; each instance has:
  – customer, product, supplier, supplier address
• Database mining tool then "discovers":
  – customers that buy beer also buy chips
  – the supplier address can be "discovered" from the supplier!
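The flattened sister-of table turns a relational rule into a test over single rows. Below is a minimal sketch, not from the slides, of slide 12's rule applied against the family-tree table of slide 10; the Person record and the use of "?" for unknown parents are illustrative assumptions.

  import java.util.List;

  public class SisterOf {
      // One row of the denormalized family-tree table (slide 10).
      record Person(String name, String gender, String parent1, String parent2) {}

      // Slide 12's rule: sister-of = yes when the second person is female
      // and both people share parent1. Unknown parents ("?") never match,
      // so pairs like (Peter, Peggy) correctly stay "no".
      static boolean sisterOf(Person first, Person second) {
          return second.gender().equals("female")
                  && !first.parent1().equals("?")
                  && first.parent1().equals(second.parent1());
      }

      public static void main(String[] args) {
          List<Person> people = List.of(
                  new Person("Peter",  "male",   "?",     "?"),
                  new Person("Peggy",  "female", "?",     "?"),
                  new Person("Steven", "male",   "Peter", "Peggy"),
                  new Person("Pam",    "female", "Peter", "Peggy"),
                  new Person("Ian",    "male",   "Grace", "Ray"),
                  new Person("Pippa",  "female", "Grace", "Ray"));

          // Enumerate all ordered pairs, as in table (a) of slide 9.
          for (Person first : people)
              for (Person second : people)
                  if (first != second && sisterOf(first, second))
                      System.out.println(first.name() + " has sister " + second.name());
      }
  }

On this six-person subset the sketch prints the two positive pairs (Steven–Pam, Ian–Pippa), while the "?" guard keeps the first-generation pairs negative, mirroring the yes/no column of slide 11.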
Slide 15: Relations Need Not Be Finite
• The relation ancestor-of involves arbitrarily long paths through the tree
• Inductive logic programming learns rules such as:

  If person-1 is a parent of person-2
  then person-1 is an ancestor of person-2

  If person-1 is a parent of person-2
  and person-2 is an ancestor of person-3
  then person-1 is an ancestor of person-3

Slide 16: Inductive Logic Programming Can Learn Recursive Rules from a Set of Relation Instances

  First Person                       Second Person                      Ancestor-of?
  Name   Gender  Parent1  Parent2    Name    Gender  Parent1  Parent2
  Peter  male    ?        ?          Steven  male    Peter    Peggy     yes
  Peter  male    ?        ?          Pam     female  Peter    Peggy     yes
  Peter  male    ?        ?          Anna    female  Pam      Ian       yes
  Peter  male    ?        ?          Nikki   female  Pam      Ian       yes
  Pam    female  Peter    Peggy      Nikki   female  Pam      Ian       yes
  Grace  female  ?        ?          Ian     male    Grace    Ray       yes
  Grace  female  ?        ?          Nikki   female  Pam      Ian       yes
  Other examples here                                                   yes
  All the rest                                                          no

• Drawbacks of such techniques: they do not cope with noisy data and are so slow as to be unusable in practice; not covered in the book

Slide 17: Summary of Data-Mining Input
• Input is a table of independent instances of the concept to be learned (file-mining!)
• Relational data is more complex than a flat file
• A finite set of relations can be recast into a single table
• Denormalization can result in spurious data

Slide 18: Attributes
• Each instance is characterized by a set of predefined features, e.g., the iris data
• Different instances may have different features:
  – suppose instances are transportation vehicles
  – number of wheels is useful for land vehicles but not for ships
  – number of masts is applicable to ships but not to land vehicles
• One feature may depend on the value of another
  – e.g., spouse's name depends on married/unmarried
  – use an "irrelevant value" flag

Slide 19: Attribute Values
• Nominal
  – outlook = sunny, overcast, rainy
• Ordinal
  – temperature = hot, mild, cool
  – hot > mild > cool
• Interval
  – ordered and measured in fixed units, e.g., temperature in °F
  – differences are meaningful, not sums
• Ratio
  – inherently defines a zero point, e.g., distance between points
  – real numbers; all mathematical operations apply

Slide 20: Preparing the Input
• Denormalization
• Integrate data from different sources
  – marketing study: sales dept, billing dept, service dept
• Each source may have varying conventions, errors, etc.
• Enterprise-wide database integration is data warehousing

Slide 21: ARFF File for the Weather Data

  % ARFF file for the weather data with some numeric features
  %
  @relation weather

  @attribute outlook {sunny, overcast, rainy}
  @attribute temperature numeric
  @attribute humidity numeric
  @attribute windy {true, false}
  @attribute play? {yes, no}

  @data
  %
  % 14 instances
  %
  sunny, 85, 85, false, no
  sunny, 80, 90, true, no
  overcast, 83, 86, false, yes
  rainy, 70, 96, false, yes
  rainy, 68, 80, false, yes
  rainy, 65, 70, true, no
  overcast, 64, 65, true, yes
  sunny, 72, 95, false, no
  sunny, 69, 70, false, yes
  rainy, 75, 80, false, yes
  sunny, 75, 70, true, yes
  overcast, 72, 90, true, yes
  overcast, 81, 75, false, yes
  rainy, 71, 91, true, no

(A sketch that loads this file follows slide 25 below.)

Slide 22: Simple Disjunction
[Figure: decision tree for a simple disjunction, with tests on attributes a, b, c, and d; y/n branches lead to leaves of class x or n.]

Slide 23: Exclusive-Or Problem

  If x = 1 and y = 0 then class = a
  If x = 0 and y = 1 then class = a
  If x = 0 and y = 0 then class = b
  If x = 1 and y = 1 then class = b

[Figure: the four points of the XOR problem plotted on x and y axes (class a at (1,0) and (0,1), class b at (0,0) and (1,1)), and the equivalent decision tree: a root test x = 1? with a test y = 1? on each branch, and leaves a and b.]

Slide 24: Replicated Subtree

  If x = 1 and y = 1 then class = a
  If z = 1 and w = 1 then class = a
  Otherwise class = b

[Figure: decision tree with three-valued tests on x, y, z, and w; the z–w subtree that encodes the second rule is replicated under several branches.]

Slide 25: New Iris Flower

  Sepal Length  Sepal Width  Petal Length  Petal Width  Type
      5.1           3.5          2.6           0.2        ?
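As a bridge from the file format to the toolkit, here is a minimal sketch, assuming the Weka library that accompanies the required text is available (weka.jar on the classpath, class names as in current Weka 3 releases) and that a weather.arff file with the contents of slide 21 sits in the working directory; it loads the data and builds a decision tree.

  import java.io.BufferedReader;
  import java.io.FileReader;
  import weka.classifiers.trees.J48;
  import weka.core.Instances;

  public class LoadWeather {
      public static void main(String[] args) throws Exception {
          // Parse the ARFF header and the 14 data rows shown on slide 21.
          Instances data = new Instances(
                  new BufferedReader(new FileReader("weather.arff")));

          // The class attribute (play?) is the last one declared in the header.
          data.setClassIndex(data.numAttributes() - 1);

          // Build a C4.5-style decision tree and print its textual form.
          J48 tree = new J48();
          tree.buildClassifier(data);
          System.out.println(tree);
      }
  }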
Slide 26: Rules for the Iris Data

  Default: Iris-setosa
  except if petal-length ≥ 2.45 and petal-length < 5.355
            and petal-width < 1.75
         then Iris-versicolor
              except if petal-length ≥ 4.95 and petal-width < 1.55
                     then Iris-virginica
              else if sepal-length < 4.95 and sepal-width ≥ 2.45
                     then Iris-virginica
  else if petal-length ≥ 3.35
         then Iris-virginica
              except if petal-length < 4.85 and sepal-length < 5.95
                     then Iris-versicolor

Slide 27: The Shapes Problem
[Figure: assorted shapes; shaded shapes are standing, unshaded shapes are lying.]

Slide 28: Training Data for the Shapes Problem

  Width  Height  Sides  Class
    2      4       4    standing
    3      6       4    standing
    4      3       4    lying
    7      8       3    standing
    7      6       3    lying
    2      9       4    standing
    9      1       4    lying
   10      2       3    lying

Slide 29: CPU Performance Data

  (a) Linear regression (see the sketch after slide 32 below):

  PRP = -56.1 + 0.049 MYCT + 0.015 MMIN + 0.006 MMAX
        + 0.630 CACH - 0.270 CHMIN + 1.46 CHMAX

  (b) [Figure: regression tree splitting on CHMIN, CACH, MMAX, MMIN, CHMAX, and MYCT; each leaf gives a predicted PRP value with (instance count/percent error), e.g., 19.3 (28/8.7%).]

Slide 30: CPU Performance Data (continued)

  (c) Model tree: [Figure: tree splitting on CHMIN, CACH, and MMAX, with six linear models LM1–LM6 at the leaves, each annotated with (instance count/percent error), e.g., LM1 (65/7.32%).]

  LM1: PRP = 8.29 + 0.004 MMAX + 2.77 CHMIN
  LM2: PRP = 20.3 + 0.004 MMIN - 3.99 CHMIN + 0.946 CHMAX
  LM3: PRP = 38.1 + 0.012 MMIN
  LM4: PRP = 10.5 + 0.002 MMAX + 0.698 CACH + 0.969 CHMAX
  LM5: PRP = 285 - 1.46 MYCT + 1.02 CACH - 9.39 CHMIN
  LM6: PRP = -65.8 + 0.03 MMIN - 2.94 CHMIN + 4.98 CHMAX

Slide 31: Partitioning Instance Space
[Figure]

Slide 32: Ways to Represent Clusters
[Figure]
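To make the regression equation of slide 29(a) concrete, here is a minimal sketch, not from the slides, that evaluates it for a single machine; the attribute values in main are invented for illustration.

  public class CpuPrp {
      // Slide 29(a): PRP = -56.1 + 0.049 MYCT + 0.015 MMIN + 0.006 MMAX
      //                    + 0.630 CACH - 0.270 CHMIN + 1.46 CHMAX
      static double prp(double myct, double mmin, double mmax,
                        double cach, double chmin, double chmax) {
          return -56.1 + 0.049 * myct + 0.015 * mmin + 0.006 * mmax
                  + 0.630 * cach - 0.270 * chmin + 1.46 * chmax;
      }

      public static void main(String[] args) {
          // Hypothetical machine: 125 ns cycle time, 256 KB / 6000 KB
          // min/max main memory, 16 KB cache, 1 to 32 channels.
          System.out.printf("Predicted PRP = %.1f%n",
                  prp(125, 256, 6000, 16, 1, 32));
      }
  }

A model tree such as slide 30's would work the same way, except that the tree's splits on CHMIN, CACH, and MMAX first select which of LM1 through LM6 to evaluate.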