CS/EngMt/CpEng 404 Data Mining & Knowledge Discovery
Dan St. Clair
Lect 1 – Intro. to Data Mining & Data Warehouses
2002 by D. C. St. Clair

Information Age Produces Large Amounts of Data
• Data collected on almost everything
• WWW rich data resource
• Data warehouses required to hold data

The problem: How do we turn information into useful knowledge?
Solution: Data mining & knowledge discovery

Data Mining & Knowledge Discovery
This class provides
• Tools & techniques for producing useful knowledge from information
• Experience in using these tools

Data Mining & Knowledge Discovery in CS 404
• We will study
  – Data warehouses
  – Classification & association rule miners (C4.5)
  – Neural networks (BP, SOM)
  – Classical tools
    • Correlation
    • Regression
    • Clustering
• We will do several projects requiring mining knowledge from “real” data

CS 404 Class Information
Prerequisites: CS 347 (Artificial Intelligence) or CS 304 (Database Systems) and Stat 215
Texts:
• Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
• Quinlan, J., C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.

CS 404 Class Information
Reference: (This or a similar Matlab reference is recommended.)
Hanselman, D. and Littlefield, B., Mastering Matlab 6: A Comprehensive Tutorial and Reference, Prentice Hall, 2001.
Software:
• C4.5 – provided to class w/o charge
• Matlab – can purchase from Mathworks or can log in to UMR
• Microsoft Excel (provided on UMR CLC computers)

CS 404 Class Information
Instructor: D.C. St. Clair, Ph.D.
325 Computer Science
Phone: (573) 341-6352
Fax: (573) 341-4501
e-mail: [email protected]
Class web page: www.umr.edu/~stclair or http://web.umr.edu/~stclair/class/classfiles/cs404_fs02/
Things you will find on the class web page:
• Syllabus
• Schedule
• Homework assignments
• Lecture notes

Who am I?
• Professor and Chair, UMR Computer Science Dept.
• Research area -- Data mining, machine intelligence, neural networks
  (diagnostics, intelligent graphics, data mining, pattern recognition & analysis, system monitoring & assessment)
• “Applied” experience
  – Union Pacific Technologies Intelligent Systems Advisor
  – Visiting Principal Scientist, McDonnell Douglas Research Laboratories
  – NASA’s Johnson Space Center
  – Defense: Navy, Army, and Air Force
  – Co-founder & former Chief Scientist of intelligent software systems company

Even More CS 404 Class Information
Han, one of the authors of the data mining text, has a web page at www.cs.sfu.ca/~han/DM_Book.html which contains several interesting things, including:
1. A list of errata for the data mining book
2. A set of slides he uses in the data mining course he teaches. [I will be using some of these slides in my lectures.]
You may want to check these out.

Topics to Be Covered in Lecture 1
Intro. to Data Mining & Knowledge Discovery
• Intro. to CS 404 (We just finished this.)
• What is Data Mining & KD?
• Data sources
• Data mining tasks
• Data warehousing (Ch. 2)
• Multidimensional data models & schema

Data -- Information -- Knowledge
The set of values
  12345 67890 1000.00 2846.92 SA CK
has no meaning. It is data, but it is NOT information.
Information: Information is the result of organizing data into meaningful quantities. The following relational table helps turn the data into information, since it associates meaning with the data:

  Account Number   Balance   Type
  12345            1000.00   SA
  67890            2846.92   CK

A database is a “structured” collection of data stored and operated on within a management environment known as a Database Management System (DBMS) or database system. The DBMS helps to transform data into information.
Knowledge can be created from information.

What Is Data Mining? How Does It Differ From Existing Database Technologies?
Data Sources: Databases, data warehouses, Internet
Decision Support Systems: Tools for asking questions & doing analyses when you know what you want to ask and where you are going. (Ex. OLAP tools)
Data Mining: Process of discovering knowledge (meaningful new correlations, patterns, and trends) in data by sifting through large amounts of data (100M-10G) using pattern recognition as well as statistical and mathematical techniques.

Other Names Used in Conjunction With Data Mining
• Knowledge discovery (mining) in databases (KDD)
• Knowledge extraction
• Data/pattern analysis
• Data archeology
• Data dredging
• Information harvesting
• What is not data mining
  – (Deductive) query processing
  – Expert systems or small ML/statistical programs
Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
Data Mining Example

Potential-Customer*
  Person        Age   Sex   Income      Customer
  Ann Smith     32    F     10,000      yes
  Joan Gray     53    F     1,000,000   yes
  Mary Blythe   27    F     20,000      no
  Jane Brown    55    F     20,000      yes
  Bob Smith     30    M     100,000     yes
  Jack Brown    50    M     200,000     yes

Married-To
  Husband      Wife
  Bob Smith    Ann Smith
  Jack Brown   Jane Brown

Knowledge Within A Relation
  IF Income(Person) ≥ 100,000 THEN Potential-Customer(Person)
  IF Sex(Person) = F AND Age(Person) ≥ 32 THEN Potential-Customer(Person)

Knowledge From Multiple Relations
  IF Married-To(Person, Spouse) AND Income(Person) ≥ 100,000 THEN Potential-Customer(Spouse)
  IF Married-To(Person, Spouse) AND Potential-Customer(Person) THEN Potential-Customer(Spouse)

* Dzeroski, Saso, Inductive Logic Programming and Knowledge Discovery in Databases, Advances in Knowledge Discovery and Data Mining, Ed. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, & R. Uthurusamy, AAAI Press, 1996, pp. 117-152.

Simple Concept Learning -- Example
“Routine”, “well-understood” chemistry experiment performed numerous times.
• Expected result occurred about half the time
• Unexpected result occurred remainder of the time
Numerous repetitions of experiment produced similar results.
Careful analysis determined:
• One result produced when setup was in sunlight
• Second result produced when setup was in shade
Careful investigation showed: Experiment sensitive to ultraviolet radiation
Result: Patented method for determining presence of ultraviolet radiation

The Knowledge Discovery Process
Figure: Data Sources -> Selection -> Target Data -> Preprocessing -> Preprocessed Data -> Transformation -> Transformed Data -> Data Mining -> Patterns / Models -> Interpretation/Evaluation -> Knowledge
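The rules in the Data Mining Example above can be checked mechanically against the two tables. The following is a minimal sketch, not part of the original slides: the dictionary layout and the function names are hypothetical, the comparison is assumed to be "greater than or equal", and each Married-To row is read as (Person, Spouse) in the order listed. The second function combines the income rule with the first multi-relation rule, which is one possible reading of how the rules would be used together.

```python
# Hypothetical encoding of the Potential-Customer table from the slide.
people = {
    "Ann Smith":   {"age": 32, "sex": "F", "income": 10_000,    "customer": True},
    "Joan Gray":   {"age": 53, "sex": "F", "income": 1_000_000, "customer": True},
    "Mary Blythe": {"age": 27, "sex": "F", "income": 20_000,    "customer": False},
    "Jane Brown":  {"age": 55, "sex": "F", "income": 20_000,    "customer": True},
    "Bob Smith":   {"age": 30, "sex": "M", "income": 100_000,   "customer": True},
    "Jack Brown":  {"age": 50, "sex": "M", "income": 200_000,   "customer": True},
}

# Married-To rows, read as (Person, Spouse) in the order given on the slide.
married_to = [("Bob Smith", "Ann Smith"), ("Jack Brown", "Jane Brown")]


def single_relation_rules(name):
    """Apply the two rules that use only the Potential-Customer table."""
    p = people[name]
    return p["income"] >= 100_000 or (p["sex"] == "F" and p["age"] >= 32)


def multi_relation_rules(name):
    """Apply the income rule plus the rule that also uses Married-To."""
    if people[name]["income"] >= 100_000:
        return True
    return any(
        spouse == name and people[person]["income"] >= 100_000
        for person, spouse in married_to
    )


actual = {n for n, p in people.items() if p["customer"]}
predicted_single = {n for n in people if single_relation_rules(n)}
predicted_multi = {n for n in people if multi_relation_rules(n)}

print(sorted(predicted_single))
print(sorted(predicted_multi))
```

On this six-row table both rule sets select the same people as the Customer column, which is what makes them plausible candidates for "knowledge" mined from the data.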
Source (Knowledge Discovery Process figure): Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., From Data Mining To Knowledge Discovery In Databases, AI Magazine, Fall 1996.

Data Sources
• Relational Databases
• Data Warehouses
• WWW
• Audio
• Video
• Printed Materials
  :

Relational Databases

Multidimensional Data Cube
Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.

Data Mining Tasks
• Predictive – Perform inference on current data
• Descriptive (KDD) – Characterize general properties of data
Notes:
– A measure of certainty or “belief” must be associated with each pattern
– “Interesting” patterns must be identified

Kinds of Data Patterns to Be “Mined”
• Concept/class description
• Association analyses
• Classification & prediction
• Cluster analysis
• Outlier analysis
Concept/Class Descriptions
Example 1: Produce a description summarizing characteristics of customers who purchase diapers
• Objective: produce a description of those in the target class
• Characterizes class/concept
Example 2: What properties distinguish diaper buyers from other store customers?
• Discriminates class/concept
• Leads to other questions
  – What else do they buy?
  – When do they purchase these items?

Association Analysis
Assoc. Anal. -- discovery of association relationships between attribute-value conditions. Such relationships may be expressed in many ways. One common way is through association rules.
  X => Y
  A1 ^ ... ^ Am => B1 ^ ... ^ Bn

Association Rules Example
  age(X, “20..29”) ^ income(X, “20K..29K”) => buys(X, “CD changer”) [support = 2%, confidence = 60%]
• Support: % of data instances satisfying all three components of the rule
• Confidence: of the data instances where the hypothesis is satisfied, the % for which the conclusion is predicted correctly

Classification & Prediction
Figure: scatter plot of Debt vs. Income, with data points marked o and x
Source: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., From Data Mining To Knowledge Discovery In Databases, AI Magazine, Fall 1996.

Classification (nonlinear)
Figure: scatter plot of Debt vs. Income, with data points marked o and x and regions labeled No Loan and Loan
Source: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., From Data Mining To Knowledge Discovery In Databases, AI Magazine, Fall 1996.

Cluster Analysis
Figure: scatter plot of Debt vs. Income, with data points marked +
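The support and confidence figures attached to the association rule example above can be computed directly from a set of records. The sketch below is illustrative only: the five records are made up, the field names are hypothetical, and the function is a plain counting implementation rather than any particular mining algorithm.

```python
def support_and_confidence(records, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent.

    support    = fraction of all records satisfying antecedent AND consequent
    confidence = fraction of antecedent-satisfying records that also
                 satisfy the consequent
    """
    n_total = len(records)
    n_antecedent = sum(1 for r in records if antecedent(r))
    n_both = sum(1 for r in records if antecedent(r) and consequent(r))
    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Made-up toy data; not taken from the lecture or the textbook.
records = [
    {"age": 25, "income": 25_000, "buys_cd_changer": True},
    {"age": 22, "income": 21_000, "buys_cd_changer": True},
    {"age": 28, "income": 29_000, "buys_cd_changer": False},
    {"age": 45, "income": 60_000, "buys_cd_changer": True},
    {"age": 35, "income": 25_000, "buys_cd_changer": False},
]


def in_age_and_income_band(r):
    """age in 20..29 and income in 20K..29K (the rule's left-hand side)."""
    return 20 <= r["age"] <= 29 and 20_000 <= r["income"] <= 29_000


def buys_cd_changer(r):
    """The rule's right-hand side."""
    return r["buys_cd_changer"]


support, confidence = support_and_confidence(
    records, in_age_and_income_band, buys_cd_changer
)
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```

In this toy data three records satisfy the left-hand side and two of those also buy a CD changer, so support is 2 of 5 records and confidence is 2 of 3.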
Source (Cluster Analysis figure): Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., From Data Mining To Knowledge Discovery In Databases, AI Magazine, Fall 1996.

Some Major Data Mining Issues
• Mining methodologies
• User interaction
• Performance (accuracy, robustness)
• Heterogeneous databases
• Interestingness

Chapter 2: Data Warehousing and OLAP Technology for Data Mining
• What is a data warehouse?
• A multi-dimensional data model
• Data warehouse architecture
• Data warehouse implementation
• From data warehousing to data mining

What Is a Data Warehouse?
DWs provide architectures and tools to support the systematic
– organization,
– understanding, and
– use
of data.
Note: DWs may consist of data from numerous sources, including business, scientific, and engineering data.
Features of a Data Warehouse
• Subject-oriented -- organized around major subjects
• Integrated -- integrates multiple heterogeneous data sources
  – Relational databases
  – Flat files
  – On-line transaction records
• Consistency is enforced
• Time-variant -- data stored to provide historical data
• Nonvolatile
  – Physically separate from operational environment
  – Operations on data: initial loading & retrieval

OLTP vs. OLAP

                       OLTP                                  OLAP
  users                clerk, IT professional                knowledge worker
  function             day to day operations                 decision support
  DB design            application-oriented                  subject-oriented
  data                 current, up-to-date, detailed,        historical, summarized, multidimensional,
                       flat relational, isolated             integrated, consolidated
  usage                repetitive                            ad-hoc
  access               read/write, index/hash on prim. key   lots of scans
  unit of work         short, simple transaction             complex query
  # records accessed   tens                                  millions
  # users              thousands                             hundreds
  DB size              100MB-GB                              100GB-TB
  metric               transaction throughput                query throughput, response

Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.

Multidimensional Data Models
Figure 2.1 3-D data cube AllElectronics sales data
All figure references in this lecture are to the text: Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.

4-D Data Cube of AllElectronics Sales Data
Figure 2.2 4-D data cube AllElectronics sales data
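One way to picture the data cubes referenced above is as a table of cells keyed by one value per dimension, from which coarser views are obtained by summing out dimensions. The sketch below assumes three dimensions (time, item, location) and a single units_sold measure; the records are invented for illustration and are not the AllElectronics figures from the text, and the function name is hypothetical.

```python
from collections import defaultdict

# Invented sales records: (time, item, location, units_sold).
sales = [
    ("Q1", "TV", "Vancouver", 10),
    ("Q1", "TV", "Toronto", 5),
    ("Q1", "phone", "Vancouver", 7),
    ("Q2", "TV", "Vancouver", 3),
    ("Q2", "phone", "Toronto", 4),
]

DIMENSIONS = ("time", "item", "location")


def aggregate_cube(records, keep):
    """Sum units_sold over every dimension that is not named in `keep`."""
    positions = [DIMENSIONS.index(d) for d in keep]
    cells = defaultdict(int)
    for rec in records:
        key = tuple(rec[i] for i in positions)
        cells[key] += rec[-1]  # last field is the units_sold measure
    return dict(cells)


full_cube = aggregate_cube(sales, DIMENSIONS)          # 3-D view
by_time_item = aggregate_cube(sales, ("time", "item"))  # location summed out
by_time = aggregate_cube(sales, ("time",))              # item and location summed out
grand_total = aggregate_cube(sales, ())                 # every dimension summed out

print(by_time_item)
print(by_time)
print(grand_total)
```

Adding a fourth dimension, as in the 4-D cube, would only mean one more field per record and one more name in DIMENSIONS; the aggregation logic stays the same.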
Fig. 2.3 A Lattice of Cuboids
  0-D (apex) cuboid:  all
  1-D cuboids:        time; item; location; supplier
  2-D cuboids:        time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
  3-D cuboids:        time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
  4-D (base) cuboid:  time, item, location, supplier

Conceptual Modeling of Data Warehouses
• Modeling data warehouses: dimensions & measures
  – Star schema: A fact table in the middle connected to a set of dimension tables
  – Snowflake schema: A refinement of the star schema in which some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
  – Fact constellations: Multiple fact tables share dimension tables; viewed as a collection of stars, and therefore called a galaxy schema or fact constellation
Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.

Fig. 2.4 Example of Star Schema
  Sales Fact Table:  time_key, item_key, branch_key, location_key
                     Measures: units_sold, dollars_sold, avg_sales
  time:              time_key, day, day_of_the_week, month, quarter, year
  item:              item_key, item_name, brand, type, supplier_type
  branch:            branch_key, branch_name, branch_type
  location:          location_key, street, city, province_or_state, country

Fig. 2.5 Example of Snowflake Schema
  Sales Fact Table:  time_key, item_key, branch_key, location_key
                     Measures: units_sold, dollars_sold, avg_sales
  time:              time_key, day, day_of_the_week, month, quarter, year
  item:              item_key, item_name, brand, type, supplier_key
  supplier:          supplier_key, supplier_type
  branch:            branch_key, branch_name, branch_type
  location:          location_key, street, city_key
  city:              city_key, city, province_or_state, country

Fig. 2.6 Example of Fact Constellation
  Sales Fact Table:     time_key, item_key, branch_key, location_key
                        Measures: units_sold, dollars_sold, avg_sales
  Shipping Fact Table:  time_key, item_key, shipper_key, from_location, to_location
                        Measures: dollars_cost, units_shipped
  time:                 time_key, day, day_of_the_week, month, quarter, year
  item:                 item_key, item_name, brand, type, supplier_type
  branch:               branch_key, branch_name, branch_type
  location:             location_key, street, city, province_or_state, country
  shipper:              shipper_key, shipper_name, location_key, shipper_type

Slides for Figs. 2.4-2.6 are modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.

A Data Mining Query Language, DMQL: Language Primitives
• Cube Definition (Fact Table)
    define cube <cube_name> [<dimension_list>]: <measure_list>
• Dimension Definition (Dimension Table)
    define dimension <dimension_name> as (<attribute_or_subdimension_list>)
• Special Case (Shared Dimension Tables)
  – First time as “cube definition”
  – define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>
Defining a Star Schema in DMQL
  define cube sales_star [time, item, branch, location]:
      dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
  define dimension time as (time_key, day, day_of_week, month, quarter, year)
  define dimension item as (item_key, item_name, brand, type, supplier_type)
  define dimension branch as (branch_key, branch_name, branch_type)
  define dimension location as (location_key, street, city, province_or_state, country)

University of Missouri-Rolla
Copyright 2001 Curators of University of Missouri
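As a rough illustration of the three measures in the sales_star definition (dollars_sold as a sum, avg_sales as an average, units_sold as a row count), the sketch below groups a handful of fact rows by a chosen set of dimension keys. This is plain Python rather than DMQL; the fact rows, their key values, and the function name are all hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: four dimension keys plus sales_in_dollars.
fact_rows = [
    {"time_key": 1, "item_key": 10, "branch_key": 100, "location_key": 1000, "sales_in_dollars": 200.0},
    {"time_key": 1, "item_key": 10, "branch_key": 100, "location_key": 1000, "sales_in_dollars": 100.0},
    {"time_key": 1, "item_key": 11, "branch_key": 100, "location_key": 1000, "sales_in_dollars": 50.0},
    {"time_key": 2, "item_key": 10, "branch_key": 101, "location_key": 1001, "sales_in_dollars": 400.0},
]


def sales_star_measures(rows, group_by):
    """Compute the three sales_star measures for each group of fact rows."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[k] for k in group_by)
        groups[key].append(row["sales_in_dollars"])
    return {
        key: {
            "dollars_sold": sum(values),             # sum(sales_in_dollars)
            "avg_sales": sum(values) / len(values),  # avg(sales_in_dollars)
            "units_sold": len(values),               # count(*)
        }
        for key, values in groups.items()
    }


by_time_and_item = sales_star_measures(fact_rows, ("time_key", "item_key"))
for key, measures in sorted(by_time_and_item.items()):
    print(key, measures)
```

Grouping by a different tuple of keys, for example ("branch_key",), gives the same measures at a different level of the cube without touching the fact rows.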