Data mining methods – Spring 2005
University of Helsinki / Dept. of CS, Pirjo Moen

Course outline: Introduction 17.1., KDD process 20.1., Concept description 31.1., Classification 14.2., Clustering 28.2., Association rules 14.3., Conclusions 4.4., Exam 12.4.

Concept description

Descriptive vs. predictive data mining

Descriptive mining:
∗ describe concepts or task-relevant data sets in concise, summarative, informative, or discriminative form

Predictive mining:
∗ based on data and analysis, construct models for the database, and predict the trend and properties of unknown data

Concept description – Contents
∗ What is concept description?
∗ Data generalization
∗ Analytical characterization
∗ Mining class comparisons
∗ Mining descriptive statistical measures
∗ Summary

What is concept description?

Characterization:
∗ provide a concise summarization of the given collection of data

Comparison:
∗ provide descriptions comparing two or more collections of data

Concept description vs. OLAP

Concept description:
∗ can handle complex data types of the attributes and their aggregations
∗ a more automated process

OLAP:
∗ restricted to a small number of dimension and measure types
∗ a user-controlled process

Data cube approach: computations and results in data cubes

Strengths:
∗ an efficient implementation of data generalization
∗ computation of various kinds of measures, e.g., count() or sum()
∗ roll-up and drill-down

Limitations:
∗ only dimensions of simple non-numeric data and measures of simple aggregated numeric values
∗ lack of intelligent analysis
Data generalization
∗ a process which abstracts a large set of task-relevant data in a database from low conceptual levels to higher ones
∗ summarization-based characterization
∗ two main approaches: the data cube (OLAP) approach and attribute-oriented induction (AOI)

Attribute-oriented induction

Data focusing:
∗ collect the task-relevant data (initial relation)

Data generalization:
∗ attribute removal or attribute generalization

Data aggregation:
∗ merge identical generalized tuples and accumulate their respective counts

Presentation of the generalized relation

Data generalization in AOI

Two methods for generalization:
∗ attribute removal: there is no generalization operator on attribute A, or A's higher-level concepts are expressed in terms of other attributes
∗ attribute generalization: a generalization operator exists for A

Attribute generalization control:
∗ attribute threshold control
∗ generalized relation threshold control

Example of class characterization

DMQL: describe general characteristics of graduate students in the Big_University database:

use Big_University_DB
mine characteristics as "Science_Students"
in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
from student
where status in "graduate"

Corresponding SQL statement:

select name, gender, major, birth_place, birth_date, residence, phone#, gpa
from student
where status in {"Msc", "MBA", "PhD"}
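The AOI steps above (attribute removal, attribute generalization through concept hierarchies, then merging identical generalized tuples while accumulating counts) can be sketched as follows. This is a minimal illustration only; the concept hierarchy and the student rows are hypothetical, not the lecture's data.

```python
from collections import Counter

# Hypothetical concept hierarchies: map low-level values to higher-level concepts.
hierarchy = {
    "major": {"CS": "Science", "Physics": "Science", "MBA": "Business"},
    "birth_place": {"Vancouver, BC, Canada": "Canada", "Seattle, WA, USA": "Foreign"},
}
# Attributes with no useful generalization operator are removed outright.
remove = {"name", "phone"}

def generalize(tuples):
    """Drop removed attributes, lift the rest through the hierarchies,
    then merge identical generalized tuples and accumulate their counts."""
    counts = Counter()
    for t in tuples:
        g = tuple(sorted(
            (a, hierarchy.get(a, {}).get(v, v))
            for a, v in t.items() if a not in remove
        ))
        counts[g] += 1
    return counts

students = [
    {"name": "Jim",   "major": "CS",      "birth_place": "Vancouver, BC, Canada", "phone": "687-4598"},
    {"name": "Laura", "major": "Physics", "birth_place": "Seattle, WA, USA",      "phone": "420-5232"},
    {"name": "Ann",   "major": "CS",      "birth_place": "Vancouver, BC, Canada", "phone": "111-2222"},
]
for row, count in generalize(students).items():
    print(dict(row), count)
```

The two CS students from Vancouver collapse into one generalized tuple (Science, Canada) with count 2, which is exactly the data aggregation step.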
Example of class characterization (2)

Initial relation:

Name            Gender  Major    Birth_place             Birth_date  Residence                 Phone#    Gpa
Jim Woodman     M       CS       Vancouver, BC, Canada   08.12.76    3511 Main St., Richmond   687-4598  3.67
Scott Lachance  M       CS       Montreal, Que, Canada   28.07.75    345 1st Ave., Richmond    253-9106  3.70
Laura Lee       F       Physics  Seattle, WA, USA        25.08.70    125 Austin Ave., Burnaby  420-5232  3.83
...             ...     ...      ...                     ...         ...                       ...       ...

Generalized relation:

Gender  Major    Birth_region  Age_range  Residence  Gpa        Count
M       Science  Canada        20-25      Richmond   Very-good  16
F       Science  Foreign       25-30      Burnaby    Excellent  22
...     ...      ...           ...        ...        ...        ...

Presentation of generalized results

Generalized relation:
∗ a relation where some or all attributes are generalized, with counts or other aggregation values accumulated

Cross tabulation:
∗ mapping results into cross-tabulation form

Visualization techniques:
∗ pie charts, bar charts, curves, cubes, and other visual forms

Quantitative characteristic rules:
∀X, target_class(X) ⇒ condition1(X) [t:w1] ∨ … ∨ conditionm(X) [t:wm]

Cross tabulation (Gender / Birth_region):

Gender / Birth_region  Canada  Foreign  Total
M                      16      14       30
F                      10      22       32
Total                  26      36       62

Example of class characterization (3)

(Bar chart of gender and birth place omitted; it plots the cross tabulation above.)

Quantitative characteristic rule:
grad(x) ∧ male(x) ⇒ birth_region(x) = "Canada" [t: 53%] ∨ birth_region(x) = "Foreign" [t: 47%]

Analytical characterization (2)

What is attribute relevance analysis?
∗ a statistical method for preprocessing data:
  filter out irrelevant or weakly relevant attributes,
  retain or rank the relevant attributes
∗ relevance is related to dimensions and levels
∗ used in analytical characterization and analytical comparison
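The t-weights in the characteristic rule above come straight from the cross tabulation: of the 30 male graduate students, 16 were born in Canada and 14 abroad. A minimal sketch:

```python
# t-weight of each disjunct in a quantitative characteristic rule:
# the count of the generalized tuple divided by the total count of the
# target class (here: male students, by birth region, from the crosstab).
counts = {"Canada": 16, "Foreign": 14}
total = sum(counts.values())  # 30
t_weight = {region: n / total for region, n in counts.items()}
print({r: f"{w:.0%}" for r, w in t_weight.items()})
# → {'Canada': '53%', 'Foreign': '47%'}
```

By construction the t-weights of all disjuncts of one rule sum to 100%.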
Analytical characterization

Why attribute relevance analysis?
∗ Which dimensions should be included?
∗ How high a level of generalization?
∗ Automatic vs. interactive
∗ Reduce the number of attributes

Analytical characterization (3)

How is attribute relevance analysis done?

Preprocessing for class characterization or comparison:
∗ Data collection
∗ Analytical generalization: need for attribute relevance analysis
∗ Relevance analysis: sort and select the most relevant dimensions and levels; use, for example, information gain analysis to identify highly relevant dimensions and levels
∗ Attribute-oriented induction for class description on the selected dimensions/levels

Relevance measures

A quantitative relevance measure determines the classifying power of an attribute within a set of data:
∗ information gain (ID3)
∗ gain ratio (C4.5)
∗ gini index
∗ contingency table statistics
∗ uncertainty coefficient

Entropy and information gain

A data set S contains si tuples of class Ci for i = 1, …, m.

Expected information required to classify an arbitrary tuple:
I(s1, …, sm) = − Σ_{i=1..m} (si/s) log2(si/s),  where s = s1 + … + sm

Entropy of attribute A with values {a1, a2, …, av}:
E(A) = Σ_{j=1..v} ((s1j + … + smj)/s) · I(s1j, …, smj)

Information gained by partitioning on attribute A:
Gain(A) = I(s1, …, sm) − E(A)

Example of analytical characterization

Given:
∗ attributes: name, gender, major, birth_place, birth_date, phone#, and gpa
∗ concept hierarchies of the attributes
∗ an attribute analytical threshold for each attribute
∗ an attribute generalization threshold for each attribute
∗ an attribute relevance threshold

Task:
∗ mine general characteristics describing graduate students using analytical characterization

Example of analytical characterization (2)

1. Data collection
∗ target class: graduate students
∗ contrasting class: undergraduate students

2. Analytical generalization using attribute analytical thresholds
∗ attribute removal: remove name and phone#
∗ attribute generalization: generalize major, birth_place, birth_date and gpa; accumulate counts
∗ candidate relation: gender, major, birth_country, age_range and gpa

Example of analytical characterization (3)

Graduate students (120):

Gender  Major        Birth_country  Age_range  Gpa        Count
M       Science      Canada         20-25      Very good  16
F       Science      Foreign        25-30      Excellent  22
M       Engineering  Foreign        25-30      Excellent  18
F       Science      Foreign        25-30      Excellent  25
M       Science      Canada         20-25      Excellent  21
F       Engineering  Canada         20-25      Excellent  18

Undergraduate students (130):

Gender  Major        Birth_country  Age_range  Gpa        Count
M       Science      Foreign        < 20       Very good  18
F       Business     Canada         < 20       Fair       20
M       Business     Canada         < 20       Fair       22
F       Science      Canada         20-25      Fair       24
M       Engineering  Foreign        20-25      Very good  22
F       Engineering  Canada         < 20       Excellent  24

Example of analytical characterization (4)

∗ calculate the expected information required to classify an arbitrary tuple:
I(s1, s2) = I(120, 130) = −(120/250) log2(120/250) − (130/250) log2(130/250) = 0.9988

∗ calculate the entropy of each attribute; start by calculating the expected information for each value of the attribute:
For major = "Science": s11 = 84, s21 = 42, I(s11, s21) = 0.9183
For major = "Engineering": s12 = 36, s22 = 46, I(s12, s22) = 0.9892
For major = "Business": s13 = 0, s23 = 42, I(s13, s23) = 0

Example of analytical characterization (5)

∗ calculate the expected information required to classify a given sample if the data set is partitioned according to the attribute (i.e., the entropy); for attribute major:
E(major) = (126/250) I(s11, s21) + (82/250) I(s12, s22) + (42/250) I(s13, s23) = 0.7873

∗ calculate the information gain for each attribute; for major:
Gain(major) = I(s1, s2) − E(major) = 0.2115

Information gain for all the attributes:
Gain(gender) = 0.0003
Gain(birth_country) = 0.0407
Gain(major) = 0.2115
Gain(gpa) = 0.4490
Gain(age_range) = 0.5971

Example of analytical characterization (6)

3. Relevance analysis
∗ attribute relevance threshold: 0.1
∗ remove irrelevant/weakly relevant attributes from the candidate relation => drop gender and birth_country
∗ remove the contrasting-class candidate relation

4. Initial working relation (W0) derivation

W0: graduate students

Major        Age_range  Gpa        Count
Science      20-25      Very_good  16
Science      25-30      Excellent  47
Science      20-25      Excellent  21
Engineering  20-25      Excellent  18
Engineering  25-30      Excellent  18
5. Perform attribute-oriented induction on W0 using attribute generalization thresholds

Mining class comparisons

Comparison: comparing two or more classes.

Method:
∗ Partition the set of relevant data into the target class and the contrasting class(es)
∗ Generalize both classes to the same high-level concepts
∗ Compare tuples with the same high-level descriptions
∗ Present for every tuple its description and two measures:
  support: distribution within a single class
  comparison: distribution between classes
∗ Highlight the tuples with strong discriminant features

Relevance analysis:
∗ find the attributes (features) which best distinguish the different classes

Example of analytical comparison

Task: compare graduate and undergraduate students using a discriminant rule.

DMQL query:

use Big_University_DB
mine comparison as "grad_vs_undergrad_students"
in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
for "graduate_students" where status in "graduate"
versus "undergraduate_students" where status in "undergraduate"
analyze count%
from student

Example of analytical comparison (2)

Given:
∗ attributes: name, gender, major, birth_place, birth_date, residence, phone# and gpa
∗ concept hierarchies on all attributes
∗ an attribute analytical threshold for each attribute
∗ an attribute generalization threshold for each attribute
∗ an attribute relevance threshold

Example of analytical comparison (3)

1. Data collection
∗ target and contrasting classes

2. Attribute relevance analysis
∗ remove the attributes name, gender, major and phone#
3. Synchronous generalization
∗ controlled by user-specified dimension thresholds
∗ produces prime target and contrasting class(es) relations

Example of analytical comparison (4)

Generalized relation for graduate students:

Birth_country  Age_range  Gpa        Count%
Canada         20-25      Good       5.53%
Canada         25-30      Good       2.32%
Canada         Over_30    Very_good  5.86%
…              …          …          …
Foreign        Over_30    Excellent  4.68%

Example of analytical comparison (5)

4. Presentation
∗ as generalized relations, crosstabs, bar charts, pie charts, or rules
∗ with contrasting measures that reflect the comparison between the target and contrasting classes, e.g. count%

5. Drill down, roll up and other OLAP operations on the target and contrasting classes to adjust the levels of abstraction of the resulting descriptions

Quantitative discriminant rules

Cj = the target class
qa = a generalized tuple that covers some tuples of class Cj
∗ qa can also cover some tuples of the contrasting class(es)

d_weight = count(qa ∈ Cj) / Σ_{i=1..m} count(qa ∈ Ci)

Quantitative discriminant rule:
∀X, target_class(X) ⇐ condition(X) [d: d_weight]

Quantitative discriminant rules (2)

Count distribution for graduate and undergraduate students:

Status         Birth_country  Age_range  Gpa   Count
Graduate       Canada         25-30      Good  90
Undergraduate  Canada         25-30      Good  210

Quantitative discriminant rule:
∀X, graduate(X) ⇐ birth_country(X) = "Canada" ∧ age_range(X) = "25-30" ∧ gpa(X) = "Good" [d: 30%]
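The d_weight above is just the target-class share of all tuples covered by the generalized tuple; this sketch reproduces the 30% from the 90 graduate vs. 210 undergraduate counts.

```python
# d_weight of a generalized tuple q_a for target class Cj:
#   count(q_a in Cj) / sum over all classes Ci of count(q_a in Ci)
def d_weight(target_count, all_counts):
    return target_count / sum(all_counts)

# (Canada, 25-30, Good): 90 graduate tuples vs. 210 undergraduate tuples
w = d_weight(90, [90, 210])
print(f"{w:.0%}")  # → 30%
```

A d_weight near 1 marks a strongly discriminant description for the target class; near 0, for the contrasting class.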
d_weight ranges over [0, 1]; in the rule above, d_weight = 90/(90+210) = 30%.

Generalized relation for undergraduate students:

Birth_country  Age_range  Gpa        Count%
Canada         15-20      Fair       5.53%
Canada         15-20      Good       4.53%
…              …          …          …
Canada         25-30      Good       5.02%
…              …          …          …
Foreign        Over_30    Excellent  0.68%

Class descriptions

Quantitative characteristic rule (necessary):
∀X, target_class(X) ⇒ condition1(X) [t:w1] ∨ … ∨ conditionm(X) [t:wm]

Quantitative discriminant rule (sufficient):
∀X, target_class(X) ⇐ condition1(X) [d:w1] ∨ … ∨ conditionm(X) [d:wm]

Quantitative description rule (necessary and sufficient):
∀X, target_class(X) ⇔ condition1(X) [t:w1, d:w'1] ∨ … ∨ conditionm(X) [t:wm, d:w'm]

Mining descriptive statistical measures

Motivation:
∗ to better understand the data: central tendency, variation and spread

Central tendency measures:
∗ mean, median, max, min, etc.

Data dispersion measures:
∗ quartiles, outliers, variance, etc.

Measuring the central tendency

Mean:
∗ arithmetic mean or weighted arithmetic mean

Median:
∗ the middle value if there is an odd number of values; otherwise, the average of the two middle values
∗ for other types of data: estimated by interpolation

Mode:
∗ the value that occurs most frequently in the data
∗ distributions: unimodal, bimodal, trimodal; in general, multimodal

Measuring the dispersion of data

Quartiles, outliers and boxplots:
∗ quartiles: Q1 (25th percentile), Q3 (75th percentile)
∗ inter-quartile range: IQR = Q3 − Q1
∗ five-number summary: min, Q1, M, Q3, max
∗ boxplot: the ends of the box are the quartiles, the median is marked, whiskers extend outside the box; outliers are plotted individually
∗ outlier: usually, a value more than 1.5 × IQR below Q1 or above Q3
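The three central tendency measures above can be sketched as follows (the data is a toy example, not from the lecture):

```python
from collections import Counter

def mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    """Middle value for an odd count; average of the two middle values otherwise."""
    ys = sorted(xs)
    n = len(ys)
    mid = n // 2
    return ys[mid] if n % 2 else (ys[mid - 1] + ys[mid]) / 2

def mode(xs):
    """The most frequent value (one of them, if the data is multimodal)."""
    (value, _), = Counter(xs).most_common(1)
    return value

data = [3, 1, 4, 1, 5, 9, 2, 6]
print(mean(data), median(data), mode(data))  # → 3.875 3.5 1
```

With an even number of values the median is interpolated between the two middle values, exactly as the slide states.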
Measuring the dispersion of data (2)

Variance:
s² = (1/(n−1)) Σ_{i=1..n} (xi − x̄)² = (1/(n−1)) [ Σ xi² − (1/n) (Σ xi)² ]

Standard deviation:
∗ the square root of the variance
∗ measures spread about the mean
∗ is zero if and only if all the values are equal
∗ both the deviation and the variance are algebraic measures, and thus scalable in large databases

Boxplot analysis

Five-number summary of a distribution:
∗ minimum, Q1, M, Q3, maximum

Boxplot:
∗ the data is represented with a box
∗ the ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
∗ the median is marked by a line within the box
∗ whiskers: two lines outside the box extend to the minimum and maximum

(Example boxplot figure omitted.)

Quantile plot:
∗ displays all of the data, allowing the user to assess both the overall behavior and unusual occurrences
∗ plots quantile information: for data xi sorted in increasing order, a percentage fi indicates that approximately 100·fi % of the data are below or equal to the value xi
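The dispersion measures above can be sketched as follows. Note that several quartile conventions exist; this sketch uses the simple "median of each half" rule, so the exact Q1/Q3 values may differ slightly from other definitions.

```python
def variance(xs):
    """Sample variance with the n-1 denominator, as in the formula above."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / (n - 1)

def median(ys):
    n = len(ys)
    mid = n // 2
    return ys[mid] if n % 2 else (ys[mid - 1] + ys[mid]) / 2

def five_number_summary(xs):
    """(min, Q1, M, Q3, max) with quartiles as medians of the two halves."""
    ys = sorted(xs)
    half = len(ys) // 2
    q1, q3 = median(ys[:half]), median(ys[-half:])
    return min(ys), q1, median(ys), q3, max(ys)

data = [6, 47, 49, 15, 42, 41, 7, 39, 43, 40, 36]
mn, q1, m, q3, mx = five_number_summary(data)
iqr = q3 - q1
print((mn, q1, m, q3, mx))
print("variance:", variance(data))
print("outlier fences:", (q1 - 1.5 * iqr, q3 + 1.5 * iqr))
```

Values outside the 1.5 × IQR fences are the candidates a boxplot would plot individually as outliers.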
Quantile-quantile plot:
∗ graphs the quantiles of one univariate distribution against the corresponding quantiles of another
∗ allows the user to view whether there is a shift in going from one distribution to another

Histograms:
∗ a univariate graphical method
∗ consists of a set of rectangles that reflect the counts or frequencies of the classes present in the given data

Scatter plot:
∗ each pair of values is treated as a pair of coordinates and plotted as a point in the plane
∗ provides a first look at bivariate data, to see clusters of points, outliers, etc.

Data mining vs. machine learning

Typical machine learning methods for concept description follow a learning-from-examples paradigm.

Difference in philosophies and basic assumptions:
∗ in learning-from-examples: positive examples are used for generalization, negative examples for specialization
∗ in data mining: generalization-based; specialization is implemented by backtracking the generalization to a previous state
∗ the size of the set of training examples also differs

Difference in methods of generalization:
∗ machine learning generalizes on a tuple-by-tuple basis
∗ data mining generalizes on an attribute-by-attribute basis

Loess curve:
∗ adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence
∗ a loess (local regression) curve is fitted by setting two parameters: a smoothing parameter, and the degree of the polynomials that are fitted by the regression

Incremental mining of concept description

Incremental mining: revision based on newly added data ΔDB:
∗ generalize ΔDB to the same level of abstraction as in the generalized relation R, to derive ΔR
∗ union R ∪ ΔR, i.e., merge counts and other statistical information, to produce a new relation R'

A similar philosophy can be applied to data sampling, parallel and/or distributed mining, etc.

Summary

∗ Concept description: characterization and discrimination
∗ OLAP-based and attribute-oriented induction
∗ Analytical characterization and comparison
∗ Mining descriptive statistical measures in large databases
∗ Presentation of descriptions

References

Y. Cai, N. Cercone, and J. Han. Attribute-oriented induction in relational databases. In G. Piatetsky-Shapiro and W. J. Frawley, editors, Knowledge Discovery in Databases, pages 213-228. AAAI/MIT Press, 1991.
C. Carter and H. Hamilton. Efficient attribute-oriented generalization for knowledge discovery from large databases. IEEE Trans. Knowledge and Data Engineering, 10:193-208, 1998.
S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. ACM SIGMOD Record, 26:65-74, 1997.
W. Cleveland. Visualizing Data. Hobart Press, Summit, NJ, 1993.
J. L. Devore. Probability and Statistics for Engineering and the Sciences, 4th ed. Duxbury Press, 1995.
T. G. Dietterich and R. S. Michalski. A comparative review of selected methods for learning from examples. In Michalski et al., editors, Machine Learning: An Artificial Intelligence Approach, Vol. 1, pages 41-82. Morgan Kaufmann, 1983.
J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. Data Mining and Knowledge Discovery, 1:29-54, 1997.
J. Han, Y. Cai, and N. Cercone. Data-driven discovery of quantitative rules in relational databases. IEEE Trans. Knowledge and Data Engineering, 5:29-40, 1993.
J. Han and Y. Fu. Exploration of the power of attribute-oriented induction in data mining. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 399-421. AAAI/MIT Press, 1996.
R. A. Johnson and D. A. Wichern. Applied Multivariate Statistical Analysis, 3rd ed. Prentice Hall, 1992.

References (2)

E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB'98, New York, NY, Aug. 1998.
H. Liu and H. Motoda. Feature Selection for Knowledge Discovery and Data Mining. Kluwer Academic Publishers, 1998.
R. S. Michalski. A theory and methodology of inductive learning. In Michalski et al., editors, Machine Learning: An Artificial Intelligence Approach, Vol. 1. Morgan Kaufmann, 1983.
T. M. Mitchell. Version spaces: A candidate elimination approach to rule learning. IJCAI'77, Cambridge, MA.
T. M. Mitchell. Generalization as search. Artificial Intelligence, 18:203-226, 1982.
T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
D. Subramanian and J. Feigenbaum. Factorization in experiment generation. AAAI'86, Philadelphia, PA, Aug. 1986.

Thanks

Thanks to Jiawei Han of Simon Fraser University for his slides, which greatly helped in preparing this lecture!