Data Mining Primitives
CS 5331, Rattikorn Hewett, Texas Tech University

Outline
- Motivation
- Data mining primitives
- Data mining query languages
- Designing GUIs for data mining systems
- Architectures

Motivation: Why Primitives?
- Data mining systems uncover a large set of patterns; not all of them are interesting.
- Data mining should be an interactive process in which the user directs what is to be mined.
- Users need data mining primitives to communicate with the data mining system, by incorporating them in a data mining query language.
- Benefits:
  - More flexible user interaction
  - A foundation for the design of graphical user interfaces
  - Standardization of data mining industry and practice

Data Mining Primitives
Data mining tasks can be specified in the form of data mining queries by five primitives:
- Task-relevant data (input)
- The kinds of knowledge to be mined (function and output)
- Background knowledge (interpretation)
- Interestingness measures (evaluation)
- Visualization of the discovered patterns (presentation)

Task-Relevant Data
Specify the data to be mined:
- Database, data warehouse, relation, or cube
- Conditions for selection and grouping
- Relevant attributes

Kinds of Knowledge to Be Mined
Specify the data mining "functions":
- Characterization/discrimination
- Association
- Classification/prediction
- Clustering

Background Knowledge
Typically given in the form of concept hierarchies:
- Schema hierarchy, e.g., street < city < state < country
- Set-grouping hierarchy, e.g., {30..49} -> low, {50..100} -> high, {low, high} -> all
- Operation-derived hierarchy, e.g., decomposing an email address yields login-name < department < university < organization
- Rule-based hierarchy, e.g., the rule 87 <= temperature < 90 -> normal_temperature

Interestingness Measures
Objective measures:
- Simplicity: e.g., (association) rule length, (decision) tree size. Simpler rules are easier to understand and more likely to be interesting.
- Certainty: validity of the rule. A rule A => B has confidence = #(tuples containing both A and B) / #(tuples containing A). Related measures: classification reliability or accuracy, certainty factor, rule strength, rule quality, discriminating weight, etc.
- Utility: potential usefulness. A rule A => B has support = #(tuples containing both A and B) / sample size; a noise threshold on the description filters low-support patterns.
- Novelty: not previously known, surprising; also used to remove redundant rules.
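The certainty and utility measures above can be illustrated on market-basket style data. A minimal sketch, with made-up transactions and item names:

```python
# Sketch of the support and confidence measures described above.
# The transactions and item names are hypothetical.

def support(transactions, items):
    """support(A and B) = #(transactions containing all items) / sample size."""
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)

def confidence(transactions, a, b):
    """confidence(A => B) = #(A and B) / #(A)."""
    n_a = sum(1 for t in transactions if a <= t)
    n_ab = sum(1 for t in transactions if (a | b) <= t)
    return n_ab / n_a if n_a else 0.0

transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread"},
    {"milk", "butter"},
]

# Rule {milk} => {bread}
print(support(transactions, {"milk", "bread"}))       # 2/4 = 0.5
print(confidence(transactions, {"milk"}, {"bread"}))  # 2/3 ≈ 0.667
```

Note that confidence is directional: it divides by the count of the antecedent A, not of the consequent B.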
Visualization of Discovered Patterns
- Specify the form in which to view the patterns, e.g., rules, tables, charts, decision trees, cubes, reports, etc.
- Specify operations for data exploration at multiple levels of abstraction, e.g., drill-down, roll-up, etc.

DMQL (Data Mining Query Language)
- A DMQL can provide the ability to support ad hoc and interactive data mining.
- By providing a standardized language, we hope to achieve an effect similar to the one SQL had on relational databases:
  - A foundation for system development and evolution
  - Facilitating information exchange, technology transfer, commercialization, and wide acceptance
- DMQL is designed around the primitives described earlier.

Languages and Standardization Efforts
- Association rule language specifications:
  - MSQL (Imielinski & Virmani '99)
  - MineRule (Meo, Psaila & Ceri '96)
  - Query flocks, based on Datalog syntax (Tsur et al. '98)
- OLE DB for DM (Microsoft 2000): based on OLE, OLE DB, and OLE DB for OLAP; integrates DBMS, data warehouse, and data mining.
- CRISP-DM (CRoss-Industry Standard Process for Data Mining)

Designing GUIs Based on a DMQL
What tasks should be considered in the design of GUIs based on a data mining query language?
- Data collection and data mining query composition
- Presentation of discovered patterns
- Hierarchy specification and manipulation
- Manipulation of data mining primitives
- Interactive multilevel mining
- Other information
CRISP-DM provides a platform and process structure for effective data mining, with emphasis on deploying data mining technology to solve business problems.

Architectures
How should a data mining system be coupled with a DB/DW system?
- No coupling: flat file processing; not recommended.
- Loose coupling: fetching data from the DB/DW.
- Semi-tight coupling: enhanced DM performance; the DB/DW system provides efficient implementations of a few data mining primitives, e.g., sorting, indexing, aggregation, histogram analysis, multiway join, and precomputation of some statistical functions.
- Tight coupling: a uniform information processing environment; DM is smoothly integrated into the DB/DW system, and mining queries are optimized using indexing, query processing methods, etc.

Concept Description
CS 5331, Rattikorn Hewett, Texas Tech University

Outline
- Review terms
- Characterization
  - Summarization
  - Hierarchical generalization
  - Attribute relevance analysis
- Comparison/discrimination
- Descriptive statistical measures

Review Terms
Descriptive vs. predictive data mining:
- Descriptive: describes the data set in concise, summarative, informative, discriminative forms.
- Predictive: constructs models representing the data set and uses them to predict the behavior of unknown data.
Concept description involves:
- Characterization: provides a concise and succinct summarization of a given collection of data.
- Comparison (discrimination): provides descriptions comparing two or more collections of data.
Concept Description vs. OLAP
- Concept description: can handle complex data types (e.g., text, image) for the attributes and their aggregations; a more automated process.
- OLAP: restricted to a small number of dimension and measure data types; a user-controlled process.

Characterization Methods
One approach to characterization is to transform data from low conceptual levels to high ones: "data generalization". E.g., daily sales -> annual sales; Biology -> Science.
Two methods:
- Summarization, as in the data cube's OLAP
- Hierarchical generalization: attribute-oriented induction

Summarization by OLAP
- Data are stored in data cubes.
- Identify summarization computations, e.g., count(), sum(), average(), max().
- Perform the computations and store the results in data cubes.
- Generalization and specialization can then be performed on a data cube by roll-up and drill-down.
- This is an efficient implementation of data generalization.
- Limitations:
  - Can handle only simple non-numeric data types for dimensions
  - Can handle only summarization of numeric data
  - Does not guide users on which dimensions to explore or which levels to reach

Attribute-Oriented Induction
- Proposed in 1989 (KDD '89 workshop).
- Not confined to categorical data, nor to particular measures.
- How is it done?
  - Collect the task-relevant data (the initial relation) using a relational database query.
  - Perform generalization by attribute removal or attribute generalization.
  - Apply aggregation by merging identical generalized tuples and accumulating their respective counts.
  - Present the results to users interactively.

Basic Elements
- Data focusing: select the task-relevant data, including dimensions; the result is the initial relation.
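The roll-up style of data generalization described above (daily sales -> annual sales) can be sketched in plain Python; the sales records are hypothetical:

```python
# Minimal sketch of roll-up as data generalization: daily sales are
# generalized to annual totals by aggregating along the time dimension.
# The sales records are made up for illustration.
from collections import defaultdict

daily_sales = [
    ("2023-01-15", 120.0),
    ("2023-06-02", 80.0),
    ("2024-03-10", 200.0),
]

annual_sales = defaultdict(float)
for date, amount in daily_sales:
    year = date[:4]               # climb the time hierarchy: day -> year
    annual_sales[year] += amount  # sum() is the summarization computation

print(dict(annual_sales))         # {'2023': 200.0, '2024': 200.0}
```

Drill-down would be the inverse operation: re-expanding a generalized cell into its finer-grained constituents, which requires keeping (or recomputing from) the base data.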
- Attribute removal: if attribute A has a large set of distinct values and either (a) there is no generalization operator on A, or (b) A's higher-level concepts are expressed in terms of other attributes (giving redundancy), remove A.
- Attribute generalization: if attribute A has a large set of distinct values and there exists a set of generalization operators on A, select an operator and generalize A.
- Generalization threshold controls:
  - Attribute generalization threshold: controls the number of distinct values an attribute may keep before generalization or removal (~2-8, specified or default).
  - Relation generalization threshold: controls the final relation/rule size (~10-30).

General Steps
- InitialRel: query processing of the task-relevant data, deriving the initial relation.
- PreGen: based on an analysis of the number of distinct values in each attribute, determine a generalization plan for each attribute: removal, or how high to generalize.
- PrimeGen: based on the PreGen plan, perform generalization to the right level to derive the "prime generalized relation", accumulating the counts.
- Presentation: user interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross tabs, and visualization presentations.

Example
DMQL: describe the general characteristics of graduate students in the Big-University database.

use Big_University_DB
mine characteristics as "Science_Students"
in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
from student
where status in "graduate"

Initial relation:

Name           | Gender | Major   | Birth_place           | Birth_date | Residence                | Phone#   | GPA
Jim Woodman    | M      | CS      | Vancouver, BC, Canada | 8-12-76    | 3511 Main St., Richmond  | 687-4598 | 3.67
Scott Lachance | M      | CS      | Montreal, Que, Canada | 28-7-75    | 345 1st Ave., Richmond   | 253-9106 | 3.70
Laura Lee      | F      | Physics | Seattle, WA, USA      | 25-8-70    | 125 Austin Ave., Burnaby | 420-5232 | 3.83
...            | ...    | ...     | ...                   | ...        | ...                      | ...      | ...

Generalization plan: name removed; gender retained; major generalized to {Sci, Eng, Bus}; birth_place generalized to country; birth_date generalized to age range; residence generalized to city; phone# removed; gpa generalized to categories (Excellent, Very-good, ...).
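The induction steps above (apply a generalization plan, then merge identical generalized tuples while accumulating counts) can be sketched as follows; the concept hierarchy and student rows are hypothetical stand-ins for the Big-University data:

```python
# Sketch of attribute-oriented induction on hypothetical data:
# generalize each attribute per a plan, then merge identical
# generalized tuples while accumulating their counts.
from collections import Counter

# Hypothetical concept hierarchy for 'major': value -> higher-level concept
major_hierarchy = {"CS": "Science", "Physics": "Science", "Civil": "Eng"}

rows = [
    {"name": "Jim",   "major": "CS",      "gpa": 3.67},
    {"name": "Scott", "major": "CS",      "gpa": 3.70},
    {"name": "Laura", "major": "Physics", "gpa": 3.83},
]

# Generalization plan: remove 'name' (many distinct values, no
# generalization operator); climb the hierarchy on 'major'; 'gpa' is
# dropped here for brevity (the slides generalize it to ranges instead).
def generalize(row):
    return (major_hierarchy.get(row["major"], "Other"),)

# Merging identical generalized tuples = counting them.
prime_relation = Counter(generalize(r) for r in rows)
print(prime_relation)   # Counter({('Science',): 3})
```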
Transform the DMQL to the corresponding SQL statement:

select name, gender, major, birth_place, birth_date, residence, phone#, gpa
from student
where status in {"Msc", "MBA", "PhD"}

Prime generalized relation:

Gender | Major   | Birth_country | Age_range | Residence | GPA       | Count
M      | Science | Canada        | 20-25     | Richmond  | Very-good | 16
F      | Science | Foreign       | 25-30     | Burnaby   | Excellent | 22
...    | ...     | ...           | ...       | ...       | ...       | ...

Cross tabulation (Birth_region by Gender):

Birth_region | M  | F  | Total
Canada       | 16 | 10 | 26
Foreign      | 14 | 22 | 36
Total        | 30 | 32 | 62

Presentation of Results
- Generalized relation: a relation where some or all attributes are generalized, with counts or other aggregation values accumulated.
- Cross tabulation: mapping the results into a cross tabulation.
- Visualization techniques: pie charts, bar charts, curves, cubes, and other visual forms.
- Quantitative characteristic rules: mapping the generalized result into characteristic rules with associated quantitative information, e.g., with t = typical:

grad(x) AND male(x) => birth_region(x) = "Canada" [t: 53%] OR birth_region(x) = "foreign" [t: 47%]

Analysis of Attribute Relevance
- Purpose: filter out statistically irrelevant attributes, or rank attributes for mining.
- Idea: compute a measure that quantifies the relevance of an attribute with respect to a given class or concept.
- Irrelevant attributes lead to inaccurate or unnecessarily complex patterns.
- An attribute is highly relevant for classifying/predicting a class if its values can likely be used to distinguish that class from others. E.g., to describe cheap vs. expensive cars, is "color" a relevant attribute? What about using "color" to compare bananas and apples?
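The t-weights in the rule above follow directly from the cross-tab counts for male graduates (16 born in Canada, 14 foreign-born):

```python
# Computing the t-weights of the quantitative characteristic rule:
# t measures the fraction of target-class tuples covered by each disjunct.
counts = {"Canada": 16, "foreign": 14}   # male-graduate counts from the cross tab

total = sum(counts.values())             # 30 male graduates in total
t_weights = {region: round(100 * n / total) for region, n in counts.items()}
print(t_weights)                         # {'Canada': 53, 'foreign': 47}
```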
Relevance measures can be:
- Information gain
- The Gini index
- Uncertainty
- Correlation coefficients

Example
How relevant is the attribute "major" to the classification of graduate vs. undergraduate students? Relevance measure: information gain.

Review of the formulae. For a set S of samples, each labeled with a class in C, where p_i is the probability that a sample in S belongs to class i:

Ent(S) = - SUM_{i in C} p_i log2 p_i

The expected information needed to classify a sample if S is partitioned into sets S_i, one for each value i of attribute A:

I(A) = SUM_{i in dom(A)} (|S_i| / |S|) Ent(S_i)

Information gain: Gain(A) = Ent(S) - I(A)

Data (the Age_range and GPA columns are mostly elided in the source):

Gender | Major    | Birth_country | Count
M      | Science  | Canada        | 16
F      | Science  | Foreign       | 22
M      | Eng      | Foreign       | 18
F      | Science  | Foreign       | 25
M      | Science  | Canada        | 21
F      | Eng      | Canada        | 18
M      | Science  | Foreign       | 18
F      | Business | Canada        | 20
M      | Business | Canada        | 22
F      | Science  | Canada        | 24
M      | Eng      | Foreign       | 22
F      | Eng      | Canada        | 24

The first six rows are the 120 graduates; the last six are the 130 undergraduates.
Dom(Major) = {Science, Eng, Business}. Partition the data into Sc, Eng, and Bus: the sets of data points whose Major is Science, Eng, and Business, respectively.
Class counts per Major value:
- 120 graduates: Science = 84 (= 16 + 22 + 25 + 21), Eng = 36, Business = 0
- 130 undergraduates: Science = 42, Eng = 46, Business = 42

Ent(S)   = -(120/250) log2(120/250) - (130/250) log2(130/250) = 0.9988
Ent(Sc)  = -(84/126) log2(84/126) - (42/126) log2(42/126) = 0.9183
Ent(Eng) = -(36/82) log2(36/82) - (46/82) log2(46/82) = 0.9893
Ent(Bus) = -(42/42) log2(42/42) = 0   (the 0-count Science/graduate term contributes 0)
I(Major) = (126/250) Ent(Sc) + (82/250) Ent(Eng) + (42/250) Ent(Bus) = 0.7873

Gain(Major) = Ent(S) - I(Major) = 0.9988 - 0.7873 = 0.2115

Ent(S) is the class information captured from S; I(Major) is the expected class information induced by the attribute "Major". Similarly, one can find Gain(Gender), Gain(Birth_country), Gain(Age_range), and Gain(GPA).
- We can rank the "importance" or degree of "relevance" of attributes by their Gain values.
- We can use a threshold to prune attributes that are less "relevant".

Class Comparison
- Goal: mine properties (or rules) that compare a target class with a contrasting class.
- The two classes must be comparable. E.g., address and gender are not comparable; store_address and home_address are comparable; CS students and Eng students are comparable.
- Comparable classes should be generalized to the same conceptual level.
- Approaches:
  - Use attribute-oriented induction or a data cube to generalize the data for the two contrasting classes, then compare the results.
  - Pattern recognition approach: approximate discriminating rules from a data set, and repeatedly fine-tune them until the errors are small enough.

Descriptive Statistical Measures
Data characteristics that can be computed:
- Central tendency: mean, median. When is the mean not an appropriate measure? For a very large data set, how do we compute the median?
- Five-number summary: Min, Quartile 1 (Q1), Median, Quartile 3 (Q3), Max.
- Dispersion: variance, standard deviation: the spread about the mean. What does variance = 0 mean?
- Outliers: detected by rules of thumb, e.g., values falling at least 1.5 x (Q3 - Q1) above Q3 or below Q1.
- Useful displays: boxplots, quantile-quantile (q-q) plots, scatter plots, loess curves.

References
E. Baralis and G. Psaila. Designing templates for mining association rules.
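A small sketch of the five-number summary and the 1.5 x IQR outlier rule of thumb; the sample values are made up, and the quartile convention used (median of each half) is only one of several in common use:

```python
# Five-number summary and the 1.5*IQR outlier rule of thumb.
# Data values are hypothetical; quartiles use the "median of halves" convention.
from statistics import median

def five_number_summary(data):
    s = sorted(data)
    mid = len(s) // 2
    lower = s[:mid]               # lower half (median excluded for odd n)
    upper = s[-mid:] if mid else s
    q1, q3 = median(lower), median(upper)
    return min(s), q1, median(s), q3, max(s)

data = [2, 4, 4, 5, 6, 7, 9, 30]
mn, q1, med, q3, mx = five_number_summary(data)
iqr = q3 - q1
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
print((mn, q1, med, q3, mx), outliers)   # (2, 4.0, 5.5, 8.0, 30) [30]
```

The same five numbers are exactly what a boxplot displays, with the outliers drawn as individual points beyond the whiskers.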
Journal of Intelligent Information Systems, 9:7-32, 1997.
Microsoft Corp. OLE DB for Data Mining, version 1.0, http://www.microsoft.com/data/oledb/dm, Aug. 2000.
J. Han, Y. Fu, W. Wang, K. Koperski, and O. R. Zaiane. DMQL: A data mining query language for relational databases. DMKD'96, Montreal, Canada, June 1996.
T. Imielinski and A. Virmani. MSQL: A query language for database mining. Data Mining and Knowledge Discovery, 3:373-408, 1999.
M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94, Gaithersburg, Maryland, Nov. 1994.
R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. VLDB'96, pages 122-133, Bombay, India, Sept. 1996.
A. Silberschatz and A. Tuzhilin. What makes patterns interesting in knowledge discovery systems. IEEE Trans. on Knowledge and Data Engineering, 8:970-974, Dec. 1996.
S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD'98, Seattle, Washington, June 1998.
D. Tsur, J. D. Ullman, S. Abiteboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A generalization of association-rule mining. SIGMOD'98, Seattle, Washington, June 1998.