CSE 8392 SPRING 1999
DATA MINING: PART I
Professor Margaret H. Dunham
Department of Computer Science and Engineering
Southern Methodist University
Dallas, Texas 75275
(214) 768-3087
fax: (214) 768-3085
email: [email protected]
www: http://www.seas.smu.edu/~mhd
January 1999

CSE 8392 SPRING 1999 OUTLINE
• Course Objective: To examine Data Mining concepts. A database perspective (rather than AI or statistics) is taken.
• I. Introduction and Related Topics
• II. Core Topics
• III. Advanced Topics
• IV. Case Studies
• V. Student Presentations
• VI. Summary and Future Trends

CSE 8392 Spring 1999

INTRODUCTION AND RELATED TOPICS
• Section Objective: Provide an introduction to data mining concepts. Briefly examine related concepts and background topics.
• Historical Perspective
  – Gleaning Knowledge from the Data
  – User expectations increase as the amount/sophistication of collected data increases.
  – Reality vs Extracted Data
    • Physical View: Reality; Information Need
    • Database View: Data; Query

Related Topics (to be covered)
  – Knowledge Discovery
  – Information Retrieval
  – Fuzzy Sets
  – Data Warehousing and OLAP
  – Dimensional Modeling

Data Mining Overview
• What is Data Mining?
  – Definition: Fayyad, p. 9
  – A.k.a.
    • Exploratory data analysis
    • Unsupervised pattern recognition
    • Data driven discovery
    • Deductive learning
• Data Mining determines patterns in the data
  – Non-trivial
  – Valid
  – Novel
  – Potentially useful
  – Interesting
  – General and simple
  – Understandable

DM Techniques (R[1])
• DM involves many different algorithms to accomplish different things. All have the following components in common.
  – Model (Must fit a model to the data.)
    • Function/Purpose
    • Representation
  – Preference Criteria (How to choose one model over another?)
  – Search Algorithm (How to search the data)
• Example (Loan Data, fig 1.1 p6 in Fayyad):
  – Model: Classification, Linear Function
  – Preference: What best fits data? (Fig 1.2 or 1.4)
  – Search Algorithm: Linear search of database

DM Model Functions (R[1])
• Classification - Map data into predefined groups
• Regression - Map data to real valued prediction variable
• Clustering - Map data into groups defined by data itself
• Summarization - Map subsets of data into simple description
• Dependency Modeling - Identify dependencies among data items
• Link Analysis - Identify other relationships among data (association rules)
• Sequence Analysis - Identify sequential patterns in data

DM Historical Perspective
• Late 70’s: Spreadsheet analysis
• 80’s: Transactional databases support data storage and retrieval
• Early 90’s: Growing interest in end user support (a.k.a. decision support)
  – Issue: transactional databases are not designed for decision support
• Mid 90’s: Dedicated data warehouses for decision support and multidimensional analysis
• Late 90’s: Proliferation; new concepts (data marts)
• DM Tools: Neovista, Red Brick

Data Mining Metrics
• Berson, Tables 17-1, 17-2, 17-3, p 347
• Accuracy
• Clarity
• Dirty Data
• Dimensionality
• Raw Data (Preprocessing)
• RDBMS embedding
• Scalability
• Speed
• Validation

DM Issues
• Overfitting
• Outliers
• Closed World Assumption
• Database schemas and database models
• Algorithms for data mining
• Interpretation and visualization of results
• Size of databases
• Multimedia data, Spatio-Temporal Data
• Changing data
• Integration
• DM Applications
  – Market basket analysis
  – Stock analysis and selection
  – Fraud detection and prevention
  – Crisis prediction and prevention

KNOWLEDGE DISCOVERY IN DATABASES (KDD)
• “Overall process of discovering useful knowledge from data.” (p28 in R[1])
• Defn: R[1] p 30
• Steps: Fig 1, p29 R[1] (Fig 1.3 in Fayyad)
• Data Mining is one step in the KDD process
• KDD objective is usually not clear or exact. May require time with the customer to understand needs.
• Data usually has problems - needs cleaning
  – Incorrect/missing data
  – Extract from multiple sources and compare
  – Delete anomalous data and sources
  – Different data types/metrics

FUZZY SETS and LOGIC
• Set membership described by a real valued membership function with values in [0,1]
• Ex: Set of all tall people
• Set membership function: f(x) = "x is tall" iff height(x) > 6 ft.
• Note that this is a simple classification problem. As in the Loan example, the results are not exact.
• Basis of many classification and clustering approaches
• In a conventional DB how do you retrieve all tall people?
  – Three valued logic: True, False, Maybe
  – Multi-valued logic: More than 2 values

Fuzzy Logic
• Reasoning with uncertainty
• Extends multivalued logic; allows user to communicate using imprecise concepts, e.g.
  – “good” and “bad”
  – “close to” and “far away”
• Avoids brittleness of rule based reasoning by introducing probability of set membership
  – Allows for smoother transition between classification sets in the domain
  – Example
    • Berson figure 16.2, page 325

INFORMATION RETRIEVAL
• Store and retrieve documents based on fuzzy queries
• Predecessor of web based access
• Ex: Store information about all articles in all IEEE Transactions journals and retrieve all documents dealing with heaps.
• Overview
  – Conventional IR Systems
  – Query Structures (Keywords)
  – Matching (Multivalued logic)
  – Measures
  – Text Analysis Techniques
  – IR Related Topics

Conventional IR Systems
• Library card catalogs
• Documents (Library Science)
  – Formatted
  – Unformatted (Text)
  – Mixed
• Document Surrogates
  – Identifiers
  – Titles, names, and dates
  – Abstracts, extracts, reviews
  – Summaries of Numerical Data
  – Image Descriptions

IR Queries
• Query Structures
  – Matching Criteria
  – Boolean Queries
  – Vector
  – Fuzzy
  – Natural Language
• Logical combination of keywords
• Weight associated with keywords
• Similarity measures

Similarity Measures
  – Document Vector: $D_i = (d_{i1}, d_{i2}, \ldots, d_{in})$
  – Different Measures: $Sim(D_i, D_j) = \sum_{k=1}^{n} d_{ik} d_{jk}$
  – Salton and McGill, Introduction to Modern Information Retrieval, 1984, McGraw-Hill, pp201-204.
  – Similarity uses:
    • Document-Document
    • Query-Query
    • Document-Query

IR Document/Query Matching
• Matching Process
  – Relevance and Similarity Measures
  – Boolean based matching
    • Logical match
  – Vector based matching
    • Threshold match
  – Probabilistic Match
    • $P(\text{relevant}) = n / N$, where n is the number of relevant documents and N is the total number of documents
  – Fuzzy Matching
  – Proximity Matching
  – Weighting
  – Relative Importance of Items

IR Matching
• Scaling
  – Impact of Sample Size
  – Clustering
  – Centroids
• Measures
  – Precision
  – Recall

IR Indexing
• Text Analysis
  – Indexing is the assignment of keywords or terms that represent document content
    • Originally a library science problem that has grown with the advent of web based searches
  – Indexing types
    • Automated vs. manual
    • Controlled vs. uncontrolled
    • Single term vs. terms in context
    • Deep vs. shallow

IR Indexing
• General Steps
  – 1. Assignment of terms or concepts capable of representing content
  – 2. Assignment to each term of a weight or value
• Indexing
  – Vector based
    • Start with excerpts, remove high frequency words
      – Stop list
      – Thesaurus
    • Compute discrimination values of terms

IR Retrieval
• Retrieval or Classification
  – Vector based
    • Same starting point as with indexing
    • Compute weighting factors
    • Assign to each document a weighted term vector
  – Similarity Measures
    • Measure similarity between document/query
    • Results normalized to the range 0 to 1

IR Retrieval
  – Inverse Document Frequency
    • Assumes a term's importance is proportional to its standard occurrence frequency in a document, and inversely proportional to the total number of documents in which the term occurs.
    • Also used for similarity measurement
  – Inverted Indexing of Document
  – Concept Hierarchy
    • DAG of concepts
    • Follow nodes from general to more specific
    • Tag articles with low level concepts so that each may be distinguished from ancestors

IR Related Topics
• Information Retrieval Related Topics
  – Text Analysis
  – Fuzzy Sets
  – Extending Databases
  – Hypertext
  – Digital Libraries
  – Data Mining
• Web based browsers

DATA WAREHOUSING AND OLAP
  – Preparations for Mining: Data Warehousing
    • Extracting the data (from RDBMS)
    • Storing the data
      – Data warehouse or data mart
    • Cleansing the data
    • Mining the data
      – Often with multidimensional queries
• Definition
  – Blend of technologies
  – Integration
  – Enables Strategic Use of Data
• Architecture
  – Figure 6.1, page 116

DW Migration
• Migration from Relational Database to Data Warehouse
  – Differences (Relational vs. Data Warehouse)
  – Procedure for Migration
    • Extraction
    • Cleanup
    • Transformation
    • Migration
• Issues
  – Multiple sources
  – Database Heterogeneity
  – Data Heterogeneity

DW Design
• Data Warehouse Design Considerations
  Nine Step Method:
  – Subject Matter
  – Fact Table contents
  – Dimensioning
  – Fact Selection
  – Precalculations
  – Rounding out dimension table
  – Duration selection
  – What about change?
  – Query priorities
• Technical Considerations
  – Hardware
  – Communications Infrastructure
  – Data Structures

More on DW
• Benefits
  – Development of strategic information and resources
  – Hypothesis testing
  – Knowledge discovery
• Data Marts
  – Definition: a mini data warehouse for data mining
  – Directed at a partition of data
  – Dedicated user group
  – May be physically separate
  – Drivers
    • Urgent user requirements
    • Small budget
    • Absence of sponsor
    • Decentralization
    • Smaller project size

DIMENSIONAL MODELING
• Dimensional Modeling
  – Describes relationships in the data that will be mined
  – Relatively new concept, still developing
  – A technique for visualizing data models
  – Schema (Star and Snowflake)
  – Facts - A collection of related data items, consisting of measures and context data
  – Dimensions - A collection of members or units of the same type of view. Axis for modeling. Sets the context for the facts.
  – Measures - Numeric attribute of fact (What is stored about sales data)
• Focus - Tends to be on numeric data
• MD Analysis vs. DM - Figure 4, R[3]

Data Cube
• Way to visualize facts and dimensions
• Hypercube (more than 3 dimensions)
• May be nested
• Figure 13.1, p249, Berson
• Figure 15, R[3]

Star Schema
  – Contains large fact table and a surrounding set of dimension tables
  – A.k.a. constellation or multistar model
  – Figure 9.1, p171, Berson
  – Following from Figure 18, R[3]
[Figure: star schema with a central Sales Facts table surrounded by Time, Customer, Part No., Salesperson, and Product dimension tables]

Snowflake Schema
• Sometimes dimensions have hierarchies among themselves
• N:1 relationships among members of a dimension may be subdivided
• Decomposition yields a snowflake like schema
[Figure: snowflake schema with a central Sales Facts table; Time, Customer, Part No., Salesperson, and Product dimension tables; and additional Week, Month, Location, and Manager dimension tables produced by the decomposition]

OLAP (On-Line Analytical Processing)
• Multidimensional database
• Allows user to analyze data using elaborate, multidimensional, complex views
• MOLAP - Multidimensional OLAP. Supported by specialized DBMS/software systems. (Data structures, temporal)
  – May not be general enough for other uses
  – Access limited and optimized for OLAP processing
  – Fig 13.3 p 253, Berson
• ROLAP - Relational OLAP. Underlying data stored in traditional (relational) DBMS and accessed by traditional query language (SQL).
  – Layer on top of DBMS. Middleware.
  – May have poor performance for OLAP applications
  – Fig 13.4 p 254, Berson

OLAP Operations
• Move view of facts down/up dimensions
  – Drill Down
  – Roll Up
  – Figure 3, R[3]
  – Figure 16, R[3]
• Look at data by partitioning the cube
  – Slice - Look at subcube to get more specific data
  – Dice - Rotate cube to look at another dimension
  – Figure 17, R[3]
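
The OLAP operations listed above can be illustrated with a small sketch in Python. It is not taken from the course material: the fact rows, the field names (`month`, `week`, `customer`, `product`, `amount`), and the helper functions are all hypothetical, chosen to resemble the sales example in the star schema slide. Roll up and drill down are modeled as aggregation at a coarser or finer level of the time dimension, slice as fixing one dimension to a value, and dice in the sense used on the slide (rotating the view to another dimension).

```python
from collections import defaultdict

# Hypothetical sales facts; each row joins the measure ("amount")
# with its context in the time, customer, and product dimensions.
SALES_FACTS = [
    {"month": "Jan", "week": 1, "customer": "Acme", "product": "bolt", "amount": 100},
    {"month": "Jan", "week": 2, "customer": "Acme", "product": "nut", "amount": 50},
    {"month": "Jan", "week": 2, "customer": "Zenith", "product": "bolt", "amount": 70},
    {"month": "Feb", "week": 5, "customer": "Zenith", "product": "nut", "amount": 30},
]


def aggregate(facts, dims):
    """Sum the 'amount' measure, grouped by the given dimension names."""
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[d] for d in dims)
        totals[key] += row["amount"]
    return dict(totals)


def roll_up(facts):
    """Move up the time hierarchy: view sales per month."""
    return aggregate(facts, ["month"])


def drill_down(facts):
    """Move down the time hierarchy: view sales per (month, week)."""
    return aggregate(facts, ["month", "week"])


def slice_cube(facts, dim, value):
    """Keep the subcube in which one dimension is fixed to a value."""
    return [row for row in facts if row[dim] == value]


def dice(facts, dim):
    """Rotate the view (slide's usage of 'dice'): aggregate along another dimension."""
    return aggregate(facts, [dim])


if __name__ == "__main__":
    print(roll_up(SALES_FACTS))
    print(drill_down(SALES_FACTS))
    print(roll_up(slice_cube(SALES_FACTS, "customer", "Acme")))
    print(dice(SALES_FACTS, "product"))
```

A real MOLAP or ROLAP system would precompute or index such aggregates rather than scanning the fact rows on every query; the linear scan here is only meant to make the meaning of each operation concrete.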