Download Quantitative Discriminant Rules

Concept Description and Data Generalization (baseado nos slides do livro: Data Mining: C & T) Two categories of data mining Descriptive mining: describes concepts or taskrelevant data sets in concise, summarative, informative, discriminative forms Predictive mining: Based on data and analysis, constructs models for the database, and predicts the trend and properties of unknown data 2003/04 Sistemas de Apoio à Decisão (LEIC Tagus) What is Concept Description? Concept description (or class description): generates descriptions for characterization and comparison of data  Characterization: provides a concise and succinct summarization of the given collection of data  Class comparison (or discrimination): provides descriptions comparing two or more collections of data 2003/04 Sistemas de Apoio à Decisão (LEIC Tagus) Data Generalization  A process which abstracts a large set of taskrelevant data in a database from a low conceptual levels to higher ones. 1 2 3 4 5 Conceptual levels Approaches: • Data cube approach(OLAP approach) • Attribute-oriented induction approach 2003/04 Sistemas de Apoio à Decisão (LEIC Tagus) Concept Description vs OLAP Similarities:  Data generalization  Presentation of data summarization at multiple levels of abstraction.  Interactive drilling, pivoting, slicing and dicing. Differences:  Complex data types of the attributes and their aggregations  Automated process to find relevant attributes and generalization degree  Dimension relevance analysis and ranking when there are many relevant dimensions. 2003/04 Sistemas de Apoio à Decisão (LEIC Tagus) Attribute-Oriented Induction  Proposed in 1989 (KDD ‘89 workshop)  Not confined to categorical data nor particular measures.  How it is done?  Collect the task-relevant data (initial relation) using a relational database query  Perform data generalization by attribute removal or attribute generalization, based on the nb. of distinct values of each attribute.  Apply aggregation by merging identical, generalized tuples and accumulating their respective counts  Interactive presentation with users 2003/04 Sistemas de Apoio à Decisão (LEIC Tagus) Basic Principles (1) Data focusing: task-relevant data, including dimensions, and the result is the initial (working) relation. Attribute-removal: remove attribute A if there is a large set of distinct values for A but: (1) there is no generalization operator on A, or (2) A’s higher level concepts are expressed in terms of other attributes. Attribute-generalization: If there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator and generalize Sistemas de Apoio à Decisão (LEIC Tagus) 2003/04 A. Basic Principles (2)  Two methods to control a generalization process: Attribute-threshold control: typical 2-8, specified/default  if the number of distinct values in an attribute is greater than the att. threshold, then removal or generalization applies Generalized relation threshold control: sets a threshold for the generalized (final) relation/rule size  If the number of distinct tuples in the generalized relation is greater than the threshold, then further generalization applies 2003/04 Sistemas de Apoio à Decisão (LEIC Tagus) Basic Principles (3) Acummulate count or other aggregate values : to provide statistical information about the data at diff. levels of abstraction Ex: Count value for a tuple in the initial relation is 1, When generalizing data, n tuples in the initial relation result in groups of identical tuples merged into a single generalized tuple (count is n) 2003/04 Sistemas de Apoio à Decisão (LEIC Tagus) Basic Algorithm 1. InitialRel: Query processing of task-relevant data, deriving the initial relation. 2. PreGen: Based on the analysis of the number of distinct values in each attribute, determine generalization plan for each attribute: removal? or how high to generalize? 3. PrimeGen: Based on the PreGen plan, perform generalization to the right level to derive a “prime generalized relation”, accumulating the counts. 4. Presentation: User interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross tabs, visualization presentations. 2003/04 Sistemas de Apoio à Decisão (LEIC Tagus) Class Characterization: Example (1)  Describe general characteristics of graduate students in the Big-University database (in DMQL) use Big_University_DB mine characteristics as “Science_Students” in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa from student where status in “graduate”  Corresponding SQL statement: select name, gender, major, birth_place, birth_date, residence, phone#, gpa from student where status in {“Msc”, “MBA”, “PhD” } 2003/04 Sistemas de Apoio à Decisão (LEIC Tagus) Class Characterization: An Example (2) Initial Relation Name Gender Jim Woodman Scott Lachance Laura Lee … F … Removed Retained 2003/04 M M Major Birth-Place Birth_date Residence Phone # GPA Vancouver,BC, 8-12-76 Canada CS Montreal, Que, 28-7-75 Canada Physics Seattle, WA, USA 25-8-70 … … … 3511 Main St., Richmond 345 1st Ave., Richmond 687-4598 3.67 253-9106 3.70 125 Austin Ave., Burnaby … 420-5232 … 3.83 … Sci,Eng, Bus City Removed Excl, VG,.. CS Country Age range Sistemas de Apoio à Decisão (LEIC Tagus) Class Characterization: An Example (3) Prime Generalized Relation Gender M F … Cross-tab 2003/04 Major Birth_region Age_range Residence Science Canada 20-25 Richmond Science Foreign 25-30 Burnaby … … … … GPA Count Very-good 16 Excellent 22 … … Birth_Region Canada Foreign Total Gender M 16 14 30 F 10 22 32 Total 26 36 62 Sistemas de Apoio à Decisão (LEIC Tagus) Presentation of Generalized Results (1) Generalized relation:  Relations where some or all attributes are generalized, with counts or other aggregation values accumulated. Cross tabulation:  Mapping results into cross tabulation form (similar to contingency tables).  Visualization techniques: Pie charts, bar charts, curves, cubes, and other visual forms. 2003/04 Sistemas de Apoio à Decisão (LEIC Tagus) Presentation— Generalized Relation 2003/04 Sistemas de Apoio à Decisão (LEIC Tagus) Presentation—Crosstab 2003/04 Sistemas de Apoio à Decisão (LEIC Tagus) Presentation of Generalized Results (2)  A generalized relation may also be represented in the form of logic rules Cj = target class qa = a generalized tuple describing the target class t-weight for qa: percentage of tuples of the target class from the initial working class that are covered by qa range: [0, 1] t  weight  count(q a ) n  count(q i 1 2003/04 Sistemas de Apoio à Decisão (LEIC Tagus) i ) Presentation of Generalized Results (3) Quantitative characteristic rules: Mapping generalized result into characteristic rules with quantitative information associated with it grad( x)  male( x)  birth_ region( x) "Canada"[t :53%] birth_ region( x) " foreign"[t : 47%].  The disjunction of the conditions forms a necessary condition of the target class, i.e., all tuples of the target class must satisfy the condition  Not a sufficient condition of the target class, since a tuple satisfying the same condition could belong to another class 2003/04 Sistemas de Apoio à Decisão (LEIC Tagus) Attribute Relevance Analysis (1)  Why?      Which dimensions should be included? How high level of generalization? Automatic vs. interactive Reduce # attributes; easy to understand patterns What?  statistical method for preprocessing data     2003/04 filter out irrelevant or weakly relevant attributes retain or rank the relevant attributes relevance related to dimensions and levels analytical characterization, analytical comparison Sistemas de Apoio à Decisão (LEIC Tagus) Attribute relevance analysis (2)  How? 1. 2. 3. Data Collection Preliminary relevance analysis using conservative AOI Analytical Generalization   4. Attribute-oriented Induction for class description  2003/04 Use information gain analysis (e.g., entropy or other measures) to identify highly relevant dimensions and levels. Sort and select the most relevant dimensions and levels. Using a less conservative threshold for AOI Sistemas de Apoio à Decisão (LEIC Tagus) Relevance Measures Quantitative relevance measure: determines the classifying power of an attribute within a set of data. Methods: information gain (ID3)  gain ratio (C4.5)  gini index  2 contingency table statistics  uncertainty coefficient  2003/04 Sistemas de Apoio à Decisão (LEIC Tagus) Entropy and Information Gain S contains si tuples of class Ci for i = {1, …, m} Entropy or expected information measures info required to classify any arbitrary tuple m E ( S )  I( s1,s2,...,sm)   i 1 si si log 2 s s Entropy of attribute A with values {a1,a2,…,av} s1 j  ...  smj I (s1 j,..., smj) s j 1 v E(A)   Information gained by branching on attribute A Gain(A)  I(s 1, s 2 ,..., sm)  E(A) 2003/04 Sistemas de Apoio à Decisão (LEIC Tagus) Example of Analytical Characterization (1) Task  Mine general characteristics describing graduate students using analytical characterization Given      2003/04 attributes name, gender, major, birth_place, birth_date, phone#, and gpa Gen(ai) = concept hierarchies on ai Ui = attribute analytical thresholds for ai Ti = attribute generalization thresholds for ai R = attribute relevance threshold Sistemas de Apoio à Decisão (LEIC Tagus) Example of Analytical Characterization (2) 1. Data collection   2. target class: graduate student contrasting class: undergraduate student Analytical generalization using Ui  attribute removal   attribute generalization    2003/04 remove name and phone# generalize major, birth_place, birth_date and gpa accumulate counts candidate relation: gender, major, birth_country, age_range and gpa Sistemas de Apoio à Decisão (LEIC Tagus) Example: Analytical characterization (3) gender major birth_country age_range gpa count M F M F M F Science Science Engineering Science Science Engineering Canada Foreign Foreign Foreign Canada Canada 20-25 25-30 25-30 25-30 20-25 20-25 16 22 18 25 21 18 Very_good Excellent Excellent Excellent Excellent Excellent Candidate relation for Target class: Graduate students (=120) gender major birth_country age_range gpa count M F M F M F Science Business Business Science Engineering Engineering Foreign Canada Canada Canada Foreign Canada <20 <20 <20 20-25 20-25 <20 18 20 22 24 22 24 Very_good Fair Fair Fair Very_good Excellent Candidate relation for Contrasting class: Undergraduate students (=130) 2003/04 Sistemas de Apoio à Decisão (LEIC Tagus) Example: Analytical characterization (4)  3. Relevance analysis  Calculate expected info required to classify an arbitrary tuple I(s 1, s 2 )  I( 120,130 )    120 120 130 130 log 2  log 2  0.9988 250 250 250 250 Calculate entropy of each attribute: e.g. major For major=”Science”: S11=84 For major=”Engineering”:S12=36 For major=”Business”: S13=0 Number of grad students in “Science” 2003/04 S21=42 S22=46 S23=42 I(s11,s21)=0.9183 I(s12,s22)=0.9892 I(s13,s23)=0 Number of undergrad students in “Science” Sistemas de Apoio à Decisão (LEIC Tagus) Example: Analytical Characterization (5)  Calculate expected info required to classify a given sample if S is partitioned according to the attribute E(major)  126 82 42 I ( s11, s 21 )  I ( s12, s 22 )  I ( s13, s 23 )  0.7873 250 250 250 Gain(major)  I(s 1, s 2 )  E(major)  0.2115  Calculate information gain for each attribute  2003/04 Information gain for all attributes Sistemas de Apoio à Decisão (LEIC Tagus) Gain(gender) = 0.0003 Gain(birth_country) = 0.0407 Gain(major) Gain(gpa) = 0.2115 = 0.4490 Gain(age_range) = 0.5971 Example: Analytical characterization (5) 4. Initial working relation (W0) derivation    R = 0.1 remove irrelevant/weakly relevant attributes from candidate relation => drop gender, birth_country remove contrasting class candidate relation major Science Science Science Engineering Engineering age_range 20-25 25-30 20-25 20-25 25-30 gpa Very_good Excellent Excellent Excellent Excellent count 16 47 21 18 18 Initial target class working relation W0: Graduate students 5. 2003/04 Perform attribute-oriented induction on W0 using Ti Sistemas de Apoio à Decisão (LEIC Tagus) Mining Class Comparisons  Comparison: Comparing two or more classes  Method:  Partition the set of relevant data into the target class and the contrasting class(es)  Generalize both classes to the same high level concepts  Compare tuples with the same high level descriptions  Present for every tuple its description and two measures    support - distribution within single class  comparison - distribution between classes Highlight the tuples with strong discriminant features Relevance Analysis:  2003/04 Find attributes (features) which best distinguish different classes Sistemas de Apoio à Decisão (LEIC Tagus) Quantitative Discriminant Rules   Cj = target class qa = a generalized tuple covers some tuples of target class   but can also cover some tuples of contrasting class d-weight  range: [0, 1] d  weight  count(qa  Cj ) m  count(q a  Ci ) i 1  quantitative discriminant rule form  X, target_class(X)  condition(X) [d : d_weight] 2003/04 Sistemas de Apoio à Decisão (LEIC Tagus) Example (1)  Compare the general properties between the graduate students and the undergraduate students at the BigUniversity database, given the attributes: name, gender, etc (in DMQL) use Big_University_DB mine comparison as “Grad-vs-Undergrad” in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa from “graduate_students” where status in “graduate” versus “undergraduate_students” where status in “undergraduate” analyze count% from student Sistemas de Apoio à Decisão 2003/04 (LEIC Tagus) Example (2) Count distribution between graduate and undergraduate students for a generalized tuple  Status Birth_country Age_range Gpa Count Graduate Canada 25-30 Good 90 Undergraduate Canada 25-30 Good 210 Quantitative discriminant rule X , graduate _ student ( X )  birth _ country( X ) " Canada" age _ range( X ) "25  30" gpa( X ) " good " [d : 30%]  2003/04 where 90/(90+210) = 30% Sistemas de Apoio à Decisão (LEIC Tagus) Class Description  Quantitative characteristic rule  X, target_class(X)  condition(X) [t : t_weight]   necessary Quantitative discriminant rule  X, target_class(X)  condition(X) [d : d_weight]   sufficient Quantitative description rule  X, target_class(X)  condition 1(X) [t : w1, d : w 1]  ...  conditionn(X) [t : wn, d : w n]  2003/04 necessary and sufficient Sistemas de Apoio à Decisão (LEIC Tagus) Example: Quantitative Description Rule Loc./item TV Computer Both_items Count t-wt d-wt Count t-wt d-wt Count t-wt d-wt Europe 80 25% 40% 240 75% 30% 320 100% 32% N_Am 120 17.65% 60% 560 82.35% 70% 680 100% 68% Both_ regions 200 20% 80% 100% 1000 100% 100% 100% 800 Crosstab showing associated t-weight, d-weight values and total number (in thousands) of TVs and computers sold at AllElectronics in 1998  Quantitative description rule for target class Europe  X, Europe(X)  (item(X) " TV" ) [t : 25%, d : 40%]  (item(X) " computer" ) [t : 75%, d : 30%] 2003/04 Sistemas de Apoio à Decisão (LEIC Tagus) Bibliografia   (Livro) Data Mining: Concepts and Techniques, J. Han & M. Kamber, Morgan Kaufmann, 2001 (Capítulo 5 – livro 2001, Secção 3.7 – draft) (Livro) Machine Learning, T. Mitchell, McGraw-Hill, 1997 (Secção 3.4) 2003/04 Sistemas de Apoio à Decisão (LEIC Tagus) Information-Theoretic Approach  Decision tree     each internal node tests an attribute each branch corresponds to an attribute value each leaf node assigns a classification ID3 algorithm    build decision tree based on training objects with known class labels to classify testing objects rank attributes with information gain measure minimal height  2003/04 the least number of tests to classify an object Sistemas de Apoio à Decisão (LEIC Tagus) Top-Down Induction of Decision Tree Attributes = {Outlook, Temperature, Humidity, Wind} PlayTennis = {yes, no} Outlook sunny Humidity high no 2003/04 rain overcast Wind yes normal yes strong no Sistemas de Apoio à Decisão (LEIC Tagus) weak yes

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Quantitative Discriminant Rules