CSE 711: DATA MINING
Sargur N. Srihari
E-mail: [email protected]
Phone: 645-6164, ext. 113

CSE 711 Texts
Required Text
1. Witten, I. H., and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2000.
Recommended Texts
1. Adriaans, P., and D. Zantinge, Data Mining, Addison-Wesley, 1998.
2. Groth, R., Data Mining: A Hands-on Approach for Business Professionals, Prentice-Hall PTR, 1997.
3. Kennedy, R., Y. Lee, et al., Solving Data Mining Problems through Pattern Recognition, Prentice-Hall PTR, 1998.
4. Weiss, S., and N. Indurkhya, Predictive Data Mining: A Practical Guide, Morgan Kaufmann, 1998.

Introduction
• Challenge: how to manage ever-increasing amounts of information
• Solution: Data Mining and Knowledge Discovery in Databases (KDD)

Information as a Production Factor
• Most international organizations produce more information in a week than many people could read in a lifetime

Data Mining Motivation
• Mechanical production of data creates a need for mechanical consumption of data
• Large databases hold vast amounts of information
• The difficulty lies in accessing it

KDD and Data Mining
• KDD: extraction of knowledge from data
• Official definition: "the non-trivial extraction of implicit, previously unknown, and potentially useful knowledge from data"
• Data Mining: the discovery stage of the KDD process

Data Mining
• The process of discovering patterns, automatically or semi-automatically, in large quantities of data
• The patterns discovered must be useful: meaningful in that they lead to some advantage, usually economic

KDD and Data Mining
[Figure 1.1: Data mining is a multi-disciplinary field, drawing on machine learning, expert systems, databases, statistics, and visualization.]

Data Mining vs. Query Tools
• SQL: when you know exactly what you are looking for
• Data Mining: when you only vaguely know what you are looking for

Practical Applications
• KDD is more complicated than initially thought
• 80% of the effort goes into preparing the data
• 20% goes into mining the data

Data Mining Techniques
• Not so much a single technique
• More the idea that there is more knowledge hidden in the data than shows itself on the surface
• Any technique that helps to extract more out of data is useful:
  • Query tools
  • Statistical techniques
  • Visualization
  • On-line analytical processing (OLAP)
  • Case-based learning (k-nearest neighbor)
  • Decision trees
  • Association rules
  • Neural networks
  • Genetic algorithms

Machine Learning and the Methodology of Science
[Figure: the empirical cycle of scientific research — observation → analysis → theory → prediction → observation]

Machine Learning...
• Reality: an infinite number of swans
• Analysis of a limited number of observations leads to theory formation: "All swans are white"
• A single contrary observation is enough to falsify the theory via a failed prediction

A Kangaroo in Mist
[Figure: panels (a)–(f) illustrating the complexity of search spaces]

Association Rules
• Definition: given a set of transactions, where each transaction is a set of items, an association rule is an expression X ⇒ Y, where X and Y are sets of items.
• Intuitive meaning of such a rule: transactions in the database that contain the items in X tend also to contain the items in Y.
• Example: 98% of customers that purchase tires and automotive accessories also buy some automotive services. Here, 98% is called the confidence of the rule.
• The support of the rule X ⇒ Y is the percentage of transactions that contain both X and Y.
• Problem: the problem of mining association rules is to find all rules which satisfy a user-specified minimum support and minimum confidence.
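To make the support and confidence definitions concrete, here is a minimal sketch over a hypothetical list of transactions (the data, echoing the tires-and-accessories example, is invented for illustration):

```python
# Support and confidence of an association rule X => Y.
# Transactions are sets of items; the data below is hypothetical.

def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(transactions, x, y):
    """Of the transactions containing X, the fraction that also contain Y."""
    return support(transactions, x | y) / support(transactions, x)

transactions = [
    {"tires", "accessories", "services"},
    {"tires", "accessories", "services"},
    {"tires", "accessories"},
    {"battery"},
]

x, y = {"tires", "accessories"}, {"services"}
print(support(transactions, x | y))    # 0.5  (2 of 4 transactions contain X and Y)
print(confidence(transactions, x, y))  # 0.666... (2 of the 3 containing X also contain Y)
```

A rule-mining algorithm would enumerate candidate itemsets and keep only rules whose support and confidence clear the user-specified minima.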
Association Rules
• Applications include cross-marketing, attached mailing, catalog design, loss-leader analysis, add-on sales, store layout, and customer segmentation based on buying patterns.

Example Data Sets
• Contact Lens (symbolic)
• Weather (symbolic)
• Weather (numeric + symbolic)
• Iris (numeric; outcome: symbolic)
• CPU Performance (numeric; outcome: numeric)
• Labor Negotiations (missing values)
• Soybean

Contact Lens Data

age             spectacle prescription  astigmatism  tear production rate  recommended lenses
young           myope                   no           reduced               none
young           myope                   no           normal                soft
young           myope                   yes          reduced               none
young           myope                   yes          normal                hard
young           hypermetrope            no           reduced               none
young           hypermetrope            no           normal                soft
young           hypermetrope            yes          reduced               none
young           hypermetrope            yes          normal                hard
pre-presbyopic  myope                   no           reduced               none
pre-presbyopic  myope                   no           normal                soft
pre-presbyopic  myope                   yes          reduced               none
pre-presbyopic  myope                   yes          normal                hard
pre-presbyopic  hypermetrope            no           reduced               none
pre-presbyopic  hypermetrope            no           normal                soft
pre-presbyopic  hypermetrope            yes          reduced               none
pre-presbyopic  hypermetrope            yes          normal                none
presbyopic      myope                   no           reduced               none
presbyopic      myope                   no           normal                none
presbyopic      myope                   yes          reduced               none
presbyopic      myope                   yes          normal                hard
presbyopic      hypermetrope            no           reduced               none
presbyopic      hypermetrope            no           normal                soft
presbyopic      hypermetrope            yes          reduced               none
presbyopic      hypermetrope            yes          normal                none

Structural Patterns
• Part of a structural description:
  If tear production rate = reduced then recommendation = none
  Otherwise, if age = young and astigmatic = no then recommendation = soft
• This example is simplistic because all combinations of possible attribute values are represented in the table
• In most learning situations, the set of examples given as input is far from complete
• Part of the job is to generalize to other, new examples

Weather Data

outlook   temperature  humidity  windy  play
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
rainy     mild         high      false  yes
rainy     cool         normal    false  yes
rainy     cool         normal    true   no
overcast  cool         normal    true   yes
sunny     mild         high      false  no
sunny     cool         normal    false  yes
rainy     mild         normal    false  yes
sunny     mild         normal    true   yes
overcast  mild         high      true   yes
overcast  hot          normal    false  yes
rainy     mild         high      true   no

Weather Problem
• The attributes allow 36 possible combinations (3 × 3 × 2 × 2 = 36), of which 14 are present in the set of examples
  If outlook = sunny and humidity = high then play = no
  If outlook = rainy and windy = true then play = no
  If outlook = overcast then play = yes
  If humidity = normal then play = yes
  If none of the above then play = yes

Weather Data with Some Numeric Attributes

outlook   temperature  humidity  windy  play
sunny     85           85        false  no
sunny     80           90        true   no
overcast  83           86        false  yes
rainy     70           96        false  yes
rainy     68           80        false  yes
rainy     65           70        true   no
overcast  64           65        true   yes
sunny     72           95        false  no
sunny     69           70        false  yes
rainy     75           80        false  yes
sunny     75           70        true   yes
overcast  72           90        true   yes
overcast  81           75        false  yes
rainy     71           91        true   no

Classification and Association Rules
• Classification rules predict the classification of an example, here in terms of whether to play or not:
  If outlook = sunny and humidity > 83 then play = no
• Association rules strongly associate different attribute values. Association rules derived from the weather table:
  If temperature = cool then humidity = normal
  If humidity = normal and windy = false then play = yes
  If outlook = sunny and play = no then humidity = high
  If windy = false and play = no and humidity = high then outlook = sunny

Rules for Contact Lens Data
If tear production rate = reduced then recommendation = none
If age = young and astigmatic = no and tear production rate = normal then recommendation = soft
If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft
If age = presbyopic and spectacle prescription = myope and astigmatic = no then recommendation = none
If spectacle prescription = hypermetrope and astigmatic = no and tear production rate = normal then recommendation = soft
If spectacle prescription = myope and astigmatic = yes and tear production rate = normal then recommendation = hard
If age = young and astigmatic = yes and tear production rate = normal then recommendation = hard
If age = pre-presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none
If age = presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none

Decision Tree for Contact Lens Data
[Figure: decision tree — tear production rate: reduced → none; normal → astigmatism: no → soft; yes → spectacle prescription: myope → hard, hypermetrope → none]

Iris Data

      sepal length  sepal width  petal length  petal width  type
1     5.1           3.5          1.4           0.2          Iris setosa
2     4.9           3.0          1.4           0.2          Iris setosa
3     4.7           3.2          1.3           0.2          Iris setosa
4     4.6           3.1          1.5           0.2          Iris setosa
5     5.0           3.6          1.4           0.2          Iris setosa
…
51    7.0           3.2          4.7           1.4          Iris versicolor
52    6.4           3.2          4.5           1.5          Iris versicolor
53    6.9           3.1          4.9           1.5          Iris versicolor
54    5.5           2.3          4.0           1.3          Iris versicolor
55    6.5           2.8          4.6           1.5          Iris versicolor
…
101   6.3           3.3          6.0           2.5          Iris virginica
102   5.8           2.7          5.1           1.9          Iris virginica
103   7.1           3.0          5.9           2.1          Iris virginica
104   6.3           2.9          5.6           1.8          Iris virginica
105   6.5           3.0          5.8           2.2          Iris virginica

Iris Rules Learned
• If petal length < 2.45 then Iris setosa
• If sepal width < 2.10 then Iris versicolor
• If sepal width < 2.45 and petal length < 4.55 then Iris versicolor
• ...
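A rule set like the weather rules is applied as an ordered list: the first matching rule fires. The following sketch (the tuple encoding of the 14 examples is an illustrative choice, not from the lecture) checks the five weather rules against the symbolic weather data:

```python
# Apply the ordered weather rule list to the 14 symbolic weather examples.
# Each tuple: (outlook, temperature, humidity, windy, play).
weather = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]

def classify(outlook, temperature, humidity, windy):
    """First matching rule wins; the last rule is the default."""
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity == "normal":
        return "yes"
    return "yes"  # if none of the above

correct = sum(classify(o, t, h, w) == play for o, t, h, w, play in weather)
print(f"{correct}/14 examples classified correctly")  # 14/14
```

The rule order matters: the rainy-and-windy rule must be tried before the humidity-normal rule, or the sixth example (rainy, cool, normal, windy) would be misclassified.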
CPU Performance Data

     cycle time (ns)  main memory (Kb)  cache (Kb)  channels         performance
     MYCT             MMIN    MMAX      CACH        CHMIN    CHMAX   PRP
1    125              256     6000      256         16       128     198
2    29               8000    32000     32          8        32      269
3    29               8000    32000     32          8        32      220
4    29               8000    32000     32          8        32      172
5    29               8000    16000     32          8        16      132
…
207  125              2000    8000      0           2        14      52
208  480              512     8000      32          0        0       67
209  480              1000    4000      0           0        0       45

CPU Performance
• Numerical prediction: the outcome is expressed as a linear sum of weighted attributes
• Regression equation:
  PRP = -55.9 + 0.049 MYCT + … + 1.48 CHMAX
• Regression can discover linear relationships, but not non-linear ones

Linear Regression
[Figure: a simple linear regression line for the loan data set (debt vs. income)]

Labor Negotiations Data

attribute                   type                     1     2     3     …   40
duration                    (number of years)        1     2     3         2
wage increase first year    percentage               2%    4%    4.3%      4.5
wage increase second year   percentage               ?     5%    4.4%      4.0
wage increase third year    percentage               ?     ?     ?         ?
cost of living adjustment   {none, tcf, tc}          none  tcf   ?         none
working hours per week      (number of hours)        28    35    38        40
pension                     {none, ret-allw, …}      none  ?     ?         ?
standby pay                 percentage               ?     13%   ?         ?
shift-work supplement       percentage               ?     5%    4%        4
education allowance         {yes, no}                yes   ?     ?         ?
statutory holidays          (number of days)         11    15    12        12
vacation                    {below-avg, avg, gen}    avg   gen   gen       avg
long-term disability        {yes, no}                no    ?     ?         yes
dental plan contribution    {none, half, full}       none  ?     full      full
bereavement assistance      {yes, no}                no    ?     ?         yes
health plan contribution    {none, half, full}       none  ?     full      half
acceptability of contract   {good, bad}              bad   good  good      good

Decision Trees for the Labor Negotiations Data
[Figure: two decision trees for the labor negotiations data]

wage increase first year
  ≤ 2.5: bad
  > 2.5:
    statutory holidays
      > 10: good
      ≤ 10:
        wage increase first year
          ≤ 4: bad
          > 4: good

wage increase first year
  ≤ 2.5:
    working hours per week
      ≤ 36: bad
      > 36:
        health plan contribution
          none: bad
          half: good
          full: bad
  > 2.5:
    statutory holidays
      > 10: good
      ≤ 10:
        wage increase first year
          ≤ 4: bad
          > 4: good

Soybean Data

             attribute                number of values  sample value
Environment  time of occurrence       7                 July
             precipitation            3                 above normal
             temperature              3                 normal
Seed         condition                2                 normal
             mold growth              2                 absent
             discoloration            2                 absent
Fruit        condition of fruit pods  4                 normal
Leaves       condition                2                 abnormal
             yellow leaf spot halo    3                 absent
             leaf spot margins        3                 no data
Stem         condition                2                 abnormal
             stem lodging             2                 yes
             stem cankers             4                 above the soil line
Roots        condition                3                 normal
Diagnosis                             19                diaporthe stem canker

Two Example Rules
If [leaf condition is normal and stem condition is abnormal and stem cankers is below soil line and canker lesion color is brown]
then diagnosis is rhizoctonia root rot
If [leaf malformation is absent and stem condition is abnormal and stem cankers is below soil line and canker lesion color is brown]
then diagnosis is rhizoctonia root rot

Classification
[Figure: a simple linear classification boundary for the loan data set (debt vs. income); the shaded region denotes class "no loan"]

Clustering
[Figure: a simple clustering of the loan data set into three clusters (debt vs. income); note that the original labels are replaced by +'s]

Non-Linear Classification
[Figure: classification boundaries learned by a non-linear classifier (such as a neural network) for the loan data set]

Nearest Neighbor Classifier
[Figure: classification boundaries for a nearest-neighbor classifier for the loan data set]
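The nearest-neighbor boundaries come from a very simple procedure: classify each new point by the label of its closest training example. A minimal 1-nearest-neighbor sketch, using hypothetical (income, debt) loan points invented for illustration:

```python
import math

# 1-nearest-neighbor classifier on hypothetical (income, debt) loan data.
# Training points and labels are illustrative, not from the lecture.
train = [
    ((20.0, 9.0), "no loan"),
    ((25.0, 8.0), "no loan"),
    ((60.0, 3.0), "loan"),
    ((75.0, 2.0), "loan"),
]

def classify_1nn(point):
    """Return the label of the closest training example (Euclidean distance)."""
    nearest = min(train, key=lambda ex: math.dist(point, ex[0]))
    return nearest[1]

print(classify_1nn((22.0, 8.5)))  # no loan
print(classify_1nn((70.0, 2.5)))  # loan
```

This is the case-based learning technique listed earlier: no model is built in advance, so the decision boundary is implicit in the stored examples, which is why it can look jagged compared with a linear classifier.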