Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
CENG 770 • Data mining (knowledge discovery from data) – Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data • Alternative names – Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. Definition by Gartner Group • “Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.” • (Deductive) query processing • Expert systems or small ML/statistical programs Mining Large Data Sets - Motivation • There is often information “hidden” in the data that is not readily evident • Human analysts may take weeks to discover useful information • Much of the data is never analyzed at all 4,000,000 3,500,000 The Data Gap 3,000,000 2,500,000 2,000,000 1,500,000 Total new disk (TB) since 1995 1,000,000 Number of analysts 500,000 0 1995 1996 1997 1998 1999 From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications” Artificial Intelligence Machine Learning Database Management Statistics Visualization Algorithms Data Mining Data Mining: History of the Field • Knowledge Discovery in Databases workshops started ‘89 – Now a conference under the auspices of ACM SIGKDD – IEEE conference series started 2001 7 A Brief History of Data Mining Society • 1989 IJCAI Workshop on Knowledge Discovery in Databases (PiatetskyShapiro) – Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991) • 1991-1994 Workshops on Knowledge Discovery in Databases – Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996) • 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98) – Journal of Data Mining and Knowledge Discovery (1997) • 1998 ACM SIGKDD, SIGKDD’1999-2001 conferences, and SIGKDD Explorations • More conferences on data mining – PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc. CS490D 8 • Market Analysis, Customer Relationships Management (CRM) • Churn Analysis • Risk Analysis and Management • Fraud Detection, Counter Terrorism • Network Intrusion Detection • Web Site Restructuring • Recommendation • Scientific Applications What Can Data Mining Do? • Cluster • Classify – Categorical, Regression • Summarize – Summary statistics, Summary rules • Link Analysis / Model Dependencies – Association rules • Sequence analysis – Time-series analysis, Sequential associations • Detect Deviations 10 Clustering • Find groups of similar data items • Statistical techniques require some definition of “distance” (e.g. between travel profiles) while conceptual techniques use background concepts and logical descriptions “Group people with similar travel profiles” – George, Patricia – Jeff, Evelyn, Chris – Rob Clusters 11 Classification • Find ways to separate data items into pre-defined groups • Requires “training data”: Data items where group is known “Route documents to most likely interested parties” – English or non-english? – Domestic or Foreign? Training Data tool produces Groups classifier 12 Association Rules • Identify dependencies in the data: – X makes Y likely • Indicate significance of each dependency “Find groups of items commonly purchased together” – People who purchase fish are extraordinarily likely to purchase wine – People who purchase Turkey are extraordinarily likely to purchase cranberries D a t e / T i m e / R e g i s t e r F i s h T u r k e y C r a n b e r r i e s W i n e … YYY … 1 2 / 6 1 3 : 1 5 2 N NNY … 1 2 / 6 1 3 : 1 6 3 Y 13 Sequential Associations • Find event sequences that are unusually likely “Find common sequences of warnings/faults within 10 minute periods” – Warn 2 on Switch C preceded by Fault 21 on Switch B – Fault 17 on any switch preceded by Warn 2 on any switch T im e S w itc hE v e n t B F a u lt2 1 2 1 :1 0 A W a rn2 2 1 :1 1 C W a rn2 2 1 :1 3 A F a u lt1 7 2 1 :2 0 14 Recommendation Techniques • Given database of user preferences, predict preference of new user • Example: – Predict what new movies you will like based on • your past preferences • others with similar past preferences • their preferences for the new movies – Predict what books/CDs a person may want to buy (and suggest it, or give discounts to tempt customer) 15 Knowledge Discovery in Databases: Process Interpretation/ Evaluation Data Mining Knowledge Preprocessing Patterns Selection Preprocessed Data Data Target Data adapted from: U. Fayyad, et al. (1995), “From Knowledge Discovery to Data Mining: An Overview,” Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press 16 Data Mining and Business Intelligence Increasing potential to support business decisions Making Decisions Data Presentation Visualization Techniques Data Mining Information Discovery End User Business Analyst Data Analyst Data Exploration Statistical Analysis, Querying and Reporting Data Warehouses / Data Marts OLAP, MDA Data Sources Paper, Files, Information Providers, Database Systems, OLTP DBA • Learning the application domain – relevant prior knowledge and goals of application • • • Creating a target data set: data selection Data cleaning and preprocessing: (may take 60% of effort!) Data reduction and transformation – Find useful features, dimensionality/variable reduction, invariant representation • Choosing functions of data mining – • • • summarization, classification, regression, association, clustering Choosing the mining algorithm(s) Data mining: search for patterns of interest Pattern evaluation and knowledge presentation – visualization, transformation, removing redundant patterns, etc. • Use of discovered knowledge What is Data? Attributes • Collection of data objects and their attributes • An attribute is a property or characteristic of an object – Examples: eye color of a person, temperature, etc. – Attribute is also known as variable, field, characteristic, or Objects feature • A collection of attributes describe an object – Object is also known as record, point, case, sample, entity, or instance age <=30 <=30 31…40 >40 >40 >40 31…40 <=30 <=30 >40 <=30 31…40 31…40 >40 income high high high medium low low low medium low medium medium medium high medium student credit_rating buys_computer no fair no no excellent no no fair yes no fair yes yes fair yes yes excellent no yes excellent yes no fair no yes fair yes yes fair yes yes excellent yes no excellent yes yes fair yes no excellent no Attribute Values • Attribute values are numbers or symbols assigned to an attribute • Distinction between attributes and attribute values – Same attribute can be mapped to different attribute values • Example: height can be measured in feet or meters – Different attributes can be mapped to the same set of values • Example: Attribute values for ID and age are integers • But properties of attribute values can be different – ID has no limit but age has a maximum and minimum value Types of data sets • Record – Data Matrix – Document Data – Transaction Data • Graph – World Wide Web – Molecular Structures • Ordered – – – – Spatial Data Temporal Data Sequential Data Genetic Sequence Data Record Data • Data that consists of a collection of records, each of which consists of a fixed set of attributes age <=30 <=30 31…40 >40 >40 >40 31…40 <=30 <=30 >40 <=30 31…40 31…40 >40 income high high high medium low low low medium low medium medium medium high medium student credit_rating buys_computer no fair no no excellent no no fair yes no fair yes yes fair yes yes excellent no yes excellent yes no fair no yes fair yes yes fair yes yes excellent yes no excellent yes yes fair yes no excellent no Data Matrix • If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multidimensional space, where each dimension represents a distinct attribute • Such data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute Document Data • Each document becomes a `term' vector, – each term is a component (attribute) of the vector, – the value of each component is the number of times the corresponding term occurs in the document. team coach pla y ball score game wi n lost timeout season Document 1 3 0 5 0 2 6 0 2 0 2 Document 2 0 7 0 2 1 0 0 3 0 0 Document 3 0 1 0 0 1 2 2 0 3 0 Transaction Data • A special type of record data, where – each record (transaction) involves a set of items. – For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitute a transaction, while the individual products that were purchased are the items. TID Items 1 Bread, Coke, Milk 2 3 4 5 Beer, Bread Beer, Coke, Diaper, Milk Beer, Bread, Diaper, Milk Coke, Diaper, Milk Graph Data • Examples: Generic graph and HTML Links 2 1 5 2 5 <a href="papers/papers.html#bbbb"> Data Mining </a> <li> <a href="papers/papers.html#aaaa"> Graph Partitioning </a> <li> <a href="papers/papers.html#aaaa"> Parallel Solution of Sparse Linear System of Equations </a> <li> <a href="papers/papers.html#ffff"> N-Body Computation and Dense Linear System Solvers Ordered Data • Sequences of transactions Items/Events An element of the sequence Discrete and Continuous Attributes • Discrete Attribute – Has only a finite or countably infinite set of values – Examples: zip codes, counts, or the set of words in a collection of documents – Often represented as integer variables. – Note: binary attributes are a special case of discrete attributes • Continuous Attribute – Has real numbers as attribute values – Examples: temperature, height, or weight. – Practically, real values can only be measured and represented using a finite number of digits. – Continuous attributes are typically represented as floating-point variables. Important Characteristics of Structured Data – Dimensionality • Curse of Dimensionality – Sparsity • Only presence counts – Resolution • Patterns depend on the scale