Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Concept Description and Data Generalization (baseado nos slides do livro: Data Mining: C & T) Two categories of data mining Descriptive mining: describes concepts or taskrelevant data sets in concise, summarative, informative, discriminative forms Predictive mining: Based on data and analysis, constructs models for the database, and predicts the trend and properties of unknown data 2003/04 Sistemas de Apoio à Decisão (LEIC Tagus) What is Concept Description? Concept description (or class description): generates descriptions for characterization and comparison of data Characterization: provides a concise and succinct summarization of the given collection of data Class comparison (or discrimination): provides descriptions comparing two or more collections of data 2003/04 Sistemas de Apoio à Decisão (LEIC Tagus) Data Generalization A process which abstracts a large set of taskrelevant data in a database from a low conceptual levels to higher ones. 1 2 3 4 5 Conceptual levels Approaches: • Data cube approach(OLAP approach) • Attribute-oriented induction approach 2003/04 Sistemas de Apoio à Decisão (LEIC Tagus) Concept Description vs OLAP Similarities: Data generalization Presentation of data summarization at multiple levels of abstraction. Interactive drilling, pivoting, slicing and dicing. Differences: Complex data types of the attributes and their aggregations Automated process to find relevant attributes and generalization degree Dimension relevance analysis and ranking when there are many relevant dimensions. 2003/04 Sistemas de Apoio à Decisão (LEIC Tagus) Attribute-Oriented Induction Proposed in 1989 (KDD ‘89 workshop) Not confined to categorical data nor particular measures. How it is done? Collect the task-relevant data (initial relation) using a relational database query Perform data generalization by attribute removal or attribute generalization, based on the nb. of distinct values of each attribute. Apply aggregation by merging identical, generalized tuples and accumulating their respective counts Interactive presentation with users 2003/04 Sistemas de Apoio à Decisão (LEIC Tagus) Basic Principles (1) Data focusing: task-relevant data, including dimensions, and the result is the initial (working) relation. Attribute-removal: remove attribute A if there is a large set of distinct values for A but: (1) there is no generalization operator on A, or (2) A’s higher level concepts are expressed in terms of other attributes. Attribute-generalization: If there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator and generalize Sistemas de Apoio à Decisão (LEIC Tagus) 2003/04 A. Basic Principles (2) Two methods to control a generalization process: Attribute-threshold control: typical 2-8, specified/default if the number of distinct values in an attribute is greater than the att. threshold, then removal or generalization applies Generalized relation threshold control: sets a threshold for the generalized (final) relation/rule size If the number of distinct tuples in the generalized relation is greater than the threshold, then further generalization applies 2003/04 Sistemas de Apoio à Decisão (LEIC Tagus) Basic Principles (3) Acummulate count or other aggregate values : to provide statistical information about the data at diff. levels of abstraction Ex: Count value for a tuple in the initial relation is 1, When generalizing data, n tuples in the initial relation result in groups of identical tuples merged into a single generalized tuple (count is n) 2003/04 Sistemas de Apoio à Decisão (LEIC Tagus) Basic Algorithm 1. InitialRel: Query processing of task-relevant data, deriving the initial relation. 2. PreGen: Based on the analysis of the number of distinct values in each attribute, determine generalization plan for each attribute: removal? or how high to generalize? 3. PrimeGen: Based on the PreGen plan, perform generalization to the right level to derive a “prime generalized relation”, accumulating the counts. 4. Presentation: User interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross tabs, visualization presentations. 2003/04 Sistemas de Apoio à Decisão (LEIC Tagus) Class Characterization: Example (1) Describe general characteristics of graduate students in the Big-University database (in DMQL) use Big_University_DB mine characteristics as “Science_Students” in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa from student where status in “graduate” Corresponding SQL statement: select name, gender, major, birth_place, birth_date, residence, phone#, gpa from student where status in {“Msc”, “MBA”, “PhD” } 2003/04 Sistemas de Apoio à Decisão (LEIC Tagus) Class Characterization: An Example (2) Initial Relation Name Gender Jim Woodman Scott Lachance Laura Lee … F … Removed Retained 2003/04 M M Major Birth-Place Birth_date Residence Phone # GPA Vancouver,BC, 8-12-76 Canada CS Montreal, Que, 28-7-75 Canada Physics Seattle, WA, USA 25-8-70 … … … 3511 Main St., Richmond 345 1st Ave., Richmond 687-4598 3.67 253-9106 3.70 125 Austin Ave., Burnaby … 420-5232 … 3.83 … Sci,Eng, Bus City Removed Excl, VG,.. CS Country Age range Sistemas de Apoio à Decisão (LEIC Tagus) Class Characterization: An Example (3) Prime Generalized Relation Gender M F … Cross-tab 2003/04 Major Birth_region Age_range Residence Science Canada 20-25 Richmond Science Foreign 25-30 Burnaby … … … … GPA Count Very-good 16 Excellent 22 … … Birth_Region Canada Foreign Total Gender M 16 14 30 F 10 22 32 Total 26 36 62 Sistemas de Apoio à Decisão (LEIC Tagus) Presentation of Generalized Results (1) Generalized relation: Relations where some or all attributes are generalized, with counts or other aggregation values accumulated. Cross tabulation: Mapping results into cross tabulation form (similar to contingency tables). Visualization techniques: Pie charts, bar charts, curves, cubes, and other visual forms. 2003/04 Sistemas de Apoio à Decisão (LEIC Tagus) Presentation— Generalized Relation 2003/04 Sistemas de Apoio à Decisão (LEIC Tagus) Presentation—Crosstab 2003/04 Sistemas de Apoio à Decisão (LEIC Tagus) Presentation of Generalized Results (2) A generalized relation may also be represented in the form of logic rules Cj = target class qa = a generalized tuple describing the target class t-weight for qa: percentage of tuples of the target class from the initial working class that are covered by qa range: [0, 1] t weight count(q a ) n count(q i 1 2003/04 Sistemas de Apoio à Decisão (LEIC Tagus) i ) Presentation of Generalized Results (3) Quantitative characteristic rules: Mapping generalized result into characteristic rules with quantitative information associated with it grad( x) male( x) birth_ region( x) "Canada"[t :53%] birth_ region( x) " foreign"[t : 47%]. The disjunction of the conditions forms a necessary condition of the target class, i.e., all tuples of the target class must satisfy the condition Not a sufficient condition of the target class, since a tuple satisfying the same condition could belong to another class 2003/04 Sistemas de Apoio à Decisão (LEIC Tagus) Attribute Relevance Analysis (1) Why? Which dimensions should be included? How high level of generalization? Automatic vs. interactive Reduce # attributes; easy to understand patterns What? statistical method for preprocessing data 2003/04 filter out irrelevant or weakly relevant attributes retain or rank the relevant attributes relevance related to dimensions and levels analytical characterization, analytical comparison Sistemas de Apoio à Decisão (LEIC Tagus) Attribute relevance analysis (2) How? 1. 2. 3. Data Collection Preliminary relevance analysis using conservative AOI Analytical Generalization 4. Attribute-oriented Induction for class description 2003/04 Use information gain analysis (e.g., entropy or other measures) to identify highly relevant dimensions and levels. Sort and select the most relevant dimensions and levels. Using a less conservative threshold for AOI Sistemas de Apoio à Decisão (LEIC Tagus) Relevance Measures Quantitative relevance measure: determines the classifying power of an attribute within a set of data. Methods: information gain (ID3) gain ratio (C4.5) gini index 2 contingency table statistics uncertainty coefficient 2003/04 Sistemas de Apoio à Decisão (LEIC Tagus) Entropy and Information Gain S contains si tuples of class Ci for i = {1, …, m} Entropy or expected information measures info required to classify any arbitrary tuple m E ( S ) I( s1,s2,...,sm) i 1 si si log 2 s s Entropy of attribute A with values {a1,a2,…,av} s1 j ... smj I (s1 j,..., smj) s j 1 v E(A) Information gained by branching on attribute A Gain(A) I(s 1, s 2 ,..., sm) E(A) 2003/04 Sistemas de Apoio à Decisão (LEIC Tagus) Example of Analytical Characterization (1) Task Mine general characteristics describing graduate students using analytical characterization Given 2003/04 attributes name, gender, major, birth_place, birth_date, phone#, and gpa Gen(ai) = concept hierarchies on ai Ui = attribute analytical thresholds for ai Ti = attribute generalization thresholds for ai R = attribute relevance threshold Sistemas de Apoio à Decisão (LEIC Tagus) Example of Analytical Characterization (2) 1. Data collection 2. target class: graduate student contrasting class: undergraduate student Analytical generalization using Ui attribute removal attribute generalization 2003/04 remove name and phone# generalize major, birth_place, birth_date and gpa accumulate counts candidate relation: gender, major, birth_country, age_range and gpa Sistemas de Apoio à Decisão (LEIC Tagus) Example: Analytical characterization (3) gender major birth_country age_range gpa count M F M F M F Science Science Engineering Science Science Engineering Canada Foreign Foreign Foreign Canada Canada 20-25 25-30 25-30 25-30 20-25 20-25 16 22 18 25 21 18 Very_good Excellent Excellent Excellent Excellent Excellent Candidate relation for Target class: Graduate students (=120) gender major birth_country age_range gpa count M F M F M F Science Business Business Science Engineering Engineering Foreign Canada Canada Canada Foreign Canada <20 <20 <20 20-25 20-25 <20 18 20 22 24 22 24 Very_good Fair Fair Fair Very_good Excellent Candidate relation for Contrasting class: Undergraduate students (=130) 2003/04 Sistemas de Apoio à Decisão (LEIC Tagus) Example: Analytical characterization (4) 3. Relevance analysis Calculate expected info required to classify an arbitrary tuple I(s 1, s 2 ) I( 120,130 ) 120 120 130 130 log 2 log 2 0.9988 250 250 250 250 Calculate entropy of each attribute: e.g. major For major=”Science”: S11=84 For major=”Engineering”:S12=36 For major=”Business”: S13=0 Number of grad students in “Science” 2003/04 S21=42 S22=46 S23=42 I(s11,s21)=0.9183 I(s12,s22)=0.9892 I(s13,s23)=0 Number of undergrad students in “Science” Sistemas de Apoio à Decisão (LEIC Tagus) Example: Analytical Characterization (5) Calculate expected info required to classify a given sample if S is partitioned according to the attribute E(major) 126 82 42 I ( s11, s 21 ) I ( s12, s 22 ) I ( s13, s 23 ) 0.7873 250 250 250 Gain(major) I(s 1, s 2 ) E(major) 0.2115 Calculate information gain for each attribute 2003/04 Information gain for all attributes Sistemas de Apoio à Decisão (LEIC Tagus) Gain(gender) = 0.0003 Gain(birth_country) = 0.0407 Gain(major) Gain(gpa) = 0.2115 = 0.4490 Gain(age_range) = 0.5971 Example: Analytical characterization (5) 4. Initial working relation (W0) derivation R = 0.1 remove irrelevant/weakly relevant attributes from candidate relation => drop gender, birth_country remove contrasting class candidate relation major Science Science Science Engineering Engineering age_range 20-25 25-30 20-25 20-25 25-30 gpa Very_good Excellent Excellent Excellent Excellent count 16 47 21 18 18 Initial target class working relation W0: Graduate students 5. 2003/04 Perform attribute-oriented induction on W0 using Ti Sistemas de Apoio à Decisão (LEIC Tagus) Mining Class Comparisons Comparison: Comparing two or more classes Method: Partition the set of relevant data into the target class and the contrasting class(es) Generalize both classes to the same high level concepts Compare tuples with the same high level descriptions Present for every tuple its description and two measures support - distribution within single class comparison - distribution between classes Highlight the tuples with strong discriminant features Relevance Analysis: 2003/04 Find attributes (features) which best distinguish different classes Sistemas de Apoio à Decisão (LEIC Tagus) Quantitative Discriminant Rules Cj = target class qa = a generalized tuple covers some tuples of target class but can also cover some tuples of contrasting class d-weight range: [0, 1] d weight count(qa Cj ) m count(q a Ci ) i 1 quantitative discriminant rule form X, target_class(X) condition(X) [d : d_weight] 2003/04 Sistemas de Apoio à Decisão (LEIC Tagus) Example (1) Compare the general properties between the graduate students and the undergraduate students at the BigUniversity database, given the attributes: name, gender, etc (in DMQL) use Big_University_DB mine comparison as “Grad-vs-Undergrad” in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa from “graduate_students” where status in “graduate” versus “undergraduate_students” where status in “undergraduate” analyze count% from student Sistemas de Apoio à Decisão 2003/04 (LEIC Tagus) Example (2) Count distribution between graduate and undergraduate students for a generalized tuple Status Birth_country Age_range Gpa Count Graduate Canada 25-30 Good 90 Undergraduate Canada 25-30 Good 210 Quantitative discriminant rule X , graduate _ student ( X ) birth _ country( X ) " Canada" age _ range( X ) "25 30" gpa( X ) " good " [d : 30%] 2003/04 where 90/(90+210) = 30% Sistemas de Apoio à Decisão (LEIC Tagus) Class Description Quantitative characteristic rule X, target_class(X) condition(X) [t : t_weight] necessary Quantitative discriminant rule X, target_class(X) condition(X) [d : d_weight] sufficient Quantitative description rule X, target_class(X) condition 1(X) [t : w1, d : w 1] ... conditionn(X) [t : wn, d : w n] 2003/04 necessary and sufficient Sistemas de Apoio à Decisão (LEIC Tagus) Example: Quantitative Description Rule Loc./item TV Computer Both_items Count t-wt d-wt Count t-wt d-wt Count t-wt d-wt Europe 80 25% 40% 240 75% 30% 320 100% 32% N_Am 120 17.65% 60% 560 82.35% 70% 680 100% 68% Both_ regions 200 20% 80% 100% 1000 100% 100% 100% 800 Crosstab showing associated t-weight, d-weight values and total number (in thousands) of TVs and computers sold at AllElectronics in 1998 Quantitative description rule for target class Europe X, Europe(X) (item(X) " TV" ) [t : 25%, d : 40%] (item(X) " computer" ) [t : 75%, d : 30%] 2003/04 Sistemas de Apoio à Decisão (LEIC Tagus) Bibliografia (Livro) Data Mining: Concepts and Techniques, J. Han & M. Kamber, Morgan Kaufmann, 2001 (Capítulo 5 – livro 2001, Secção 3.7 – draft) (Livro) Machine Learning, T. Mitchell, McGraw-Hill, 1997 (Secção 3.4) 2003/04 Sistemas de Apoio à Decisão (LEIC Tagus) Information-Theoretic Approach Decision tree each internal node tests an attribute each branch corresponds to an attribute value each leaf node assigns a classification ID3 algorithm build decision tree based on training objects with known class labels to classify testing objects rank attributes with information gain measure minimal height 2003/04 the least number of tests to classify an object Sistemas de Apoio à Decisão (LEIC Tagus) Top-Down Induction of Decision Tree Attributes = {Outlook, Temperature, Humidity, Wind} PlayTennis = {yes, no} Outlook sunny Humidity high no 2003/04 rain overcast Wind yes normal yes strong no Sistemas de Apoio à Decisão (LEIC Tagus) weak yes