Part IV
New Database Applications and Technologies
From the mid-to-late 1960s through the late 1990s, database technology developed at a speed, and found a breadth of application, that few other technologies can match.
Driving forces:
- Advances in computing environments
- Integration with other related technologies
- Demands of new applications
- ...
Database Technology: Development and Research Directions
- Parallel databases
- Distributed databases
- Mobile and embedded databases
- Object-oriented and object-relational databases
- Deductive databases, knowledge bases, active databases
- Multimedia databases
- Spatial databases
- Data warehouses and OLAP, data mining
- Information integration
- Database technology in the Web environment
- ...
Chapter 11: Knowledge Discovery in Databases
Slides are basically from:
Data Mining:
Concepts and Techniques
— Slides for Textbook —
Jiawei Han and Micheline Kamber
Simon Fraser University, Canada
http://www.cs.sfu.ca

- Motivation: Why knowledge discovery in databases?
- Data warehouse and OLAP
- Data mining
Motivation: "Necessity Is the Mother of Invention"
- Data explosion problem
  - Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses, and other information repositories
  - We are drowning in data, but starving for knowledge!
- Solution: data warehousing and data mining
  - Data warehousing and on-line analytical processing
  - Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
Evolution of Database Technology
- 1960s: data collection, database creation, IMS and network DBMS
- 1970s: relational data model, relational DBMS implementation
- 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
- 1990s-2000s: data mining and data warehousing, multimedia databases, and Web databases

- Motivation: Why knowledge discovery in databases?
- Data warehouse and OLAP
  - What is a data warehouse?
  - A multi-dimensional data model
  - Data warehouse architecture
  - Data warehouse application
- Data mining
What Is a Data Warehouse?
- Defined in many different ways, but not rigorously.
  - A decision support database that is maintained separately from the organization's operational database
  - Supports information processing by providing a solid platform of consolidated, historical data for analysis
- "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." — W. H. Inmon
- Data warehousing: the process of constructing and using data warehouses
Data Warehouse—Subject-Oriented
- Organized around major subjects, such as customer, product, and sales
- Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
- Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process
Data Warehouse—Integrated
- Constructed by integrating multiple, heterogeneous data sources
  - relational databases, flat files, on-line transaction records
- Data cleaning and data integration techniques are applied
  - Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among the different data sources
    - E.g., hotel price: currency, tax, whether breakfast is covered, etc.
  - When data is moved to the warehouse, it is converted
Data Warehouse—Time-Variant
- The time horizon for the data warehouse is significantly longer than that of operational systems
  - Operational database: current value data
  - Data warehouse data: provide information from a historical perspective (e.g., the past 5-10 years)
- Every key structure in the data warehouse
  - contains an element of time, explicitly or implicitly
  - but the key of operational data may or may not contain a "time element"
Data Warehouse—Non-Volatile
- A physically separate store of data transformed from the operational environment
- Operational update of data does not occur in the data warehouse environment
  - Does not require transaction processing, recovery, and concurrency control mechanisms
  - Requires only two operations in data accessing: initial loading of data and access of data
Data Warehouse vs. Heterogeneous DBMS
- Traditional heterogeneous DB integration:
  - Build wrappers/mediators on top of the heterogeneous databases
  - Query-driven approach:
    - When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set
    - Complex information filtering; sites compete for resources
- Data warehouse: update-driven, high performance
Data Warehouse vs. Operational DBMS
- OLTP (on-line transaction processing)
  - Major task of traditional relational DBMS
  - Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
- OLAP (on-line analytical processing)
  - Major task of data warehouse systems
  - Data analysis and decision making
- Distinct features (OLTP vs. OLAP):
  - User and system orientation: customer vs. market
  - Data contents: current, detailed vs. historical, consolidated
  - Database design: ER + application vs. star + subject
  - View: current, local vs. evolutionary, integrated
  - Access patterns: update vs. read-only but complex queries
OLTP vs. OLAP

                    OLTP                              OLAP
users               clerk, IT professional            knowledge worker
function            day-to-day operations             decision support
DB design           application-oriented              subject-oriented
data                current, up-to-date, detailed,    historical, summarized,
                    flat relational, isolated         multidimensional,
                                                      integrated, consolidated
usage               repetitive                        ad-hoc
access              read/write, index/hash            lots of scans
                    on primary key
unit of work        short, simple transaction         complex query
# records accessed  tens                              millions
# users             thousands                         hundreds
DB size             100MB-GB                          100GB-TB
metric              transaction throughput            query throughput, response
Why a Separate Data Warehouse?
- High performance for both systems
  - DBMS—tuned for OLTP: access methods, indexing, concurrency control, recovery
  - Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view, consolidation
- Different functions and different data:
  - missing data: decision support requires historical data which operational DBs do not typically maintain
  - data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources
  - data quality: different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled

- Motivation: Why knowledge discovery in databases?
- Data warehouse and OLAP
  - What is a data warehouse?
  - A multi-dimensional data model
  - Data warehouse architecture
  - Data warehouse application
- Data mining
Multidimensional Data Model
- A data warehouse is based on a multidimensional data model which views data in the form of a data cube
  - Measure attributes: attributes which measure some value and can be aggregated upon, such as dollars_sold
  - Dimension attributes: attributes on which measure attributes and summaries of measure attributes are viewed, such as item (item_name, brand, type) or time (day, week, month, quarter, year)
  - Data cube: a multidimensional view of data, such as sales
A Concept Hierarchy for the Dimension location
[Figure: hierarchy all > region (Europe, North_America) > country (Germany, Spain; Canada, Mexico) > city (Frankfurt, ...; Vancouver, Toronto, ...) > office (L. Chan, M. Young, ...).]
Multidimensional Data
- Sales volume as a function of product, month, and region
- Dimensions: Product, Location, Time
- Hierarchical summarization paths:
  - Product: Industry > Category > Product
  - Location: Region > Country > City > Office
  - Time: Year > Quarter > Month (or Week) > Day
A Sample Data Cube
[Figure: a 3-D cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), and Country (U.S.A, Canada, Mexico, sum); one cell of the sum plane gives the total annual sales of TV in the U.S.A.]
Cuboids and Data Cubes
- In the data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
Cuboids Corresponding to the Cube
[Figure: lattice of cuboids for dimensions product, date, country —
  0-D (apex) cuboid: all
  1-D cuboids: product; date; country
  2-D cuboids: product,date; product,country; date,country
  3-D (base) cuboid: product,date,country]
More Examples of Cuboids in a Cube
[Figure: lattice of cuboids for dimensions time, item, location, supplier —
  0-D (apex) cuboid: all
  1-D cuboids: time; item; location; supplier
  2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
  3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
  4-D (base) cuboid: time,item,location,supplier]
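The lattice above can be enumerated mechanically: an n-dimensional cube has 2^n cuboids, one per subset of the dimensions. A minimal sketch (helper name is ours, not from the slides):

```python
from itertools import combinations

def cuboids(dimensions):
    """Enumerate all 2^n cuboids of a data cube: one per subset of dimensions."""
    n = len(dimensions)
    lattice = {}
    for k in range(n + 1):  # 0-D (apex) up to n-D (base) cuboids
        lattice[k] = list(combinations(dimensions, k))
    return lattice

lattice = cuboids(["time", "item", "location", "supplier"])
total = sum(len(v) for v in lattice.values())
# 4 dimensions -> 2**4 = 16 cuboids: 1 apex, 4 one-dim, 6 two-dim, 4 three-dim, 1 base
```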
Conceptual Modeling of Data Warehouses
- Modeling data warehouses: dimensions & measures
  - Star schema: a fact table in the middle connected to a set of dimension tables
  - Snowflake schema: a refinement of the star schema in which some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
  - Fact constellation: multiple fact tables share dimension tables; viewed as a collection of stars, it is therefore also called a galaxy schema
Example of a Star Schema
- Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
- Dimension tables:
  - time: time_key, day, day_of_the_week, month, quarter, year
  - item: item_key, item_name, brand, type, supplier_type
  - branch: branch_key, branch_name, branch_type
  - location: location_key, street, city, province_or_state, country
Example of a Snowflake Schema
- Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
- Dimension tables (partially normalized):
  - time: time_key, day, day_of_the_week, month, quarter, year
  - item: item_key, item_name, brand, type, supplier_key
  - supplier: supplier_key, supplier_type
  - branch: branch_key, branch_name, branch_type
  - location: location_key, street, city_key
  - city: city_key, city, province_or_state, country
Example of a Fact Constellation
- Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
- Shipping fact table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
- Shared dimension tables:
  - time: time_key, day, day_of_the_week, month, quarter, year
  - item: item_key, item_name, brand, type, supplier_type
  - branch: branch_key, branch_name, branch_type
  - location: location_key, street, city, province_or_state, country
  - shipper: shipper_key, shipper_name, location_key, shipper_type
Typical OLAP Operations
- Roll up (drill-up): summarize data
  - by climbing up a hierarchy or by dimension reduction
- Drill down (roll down): reverse of roll-up
  - from higher-level summary to lower-level summary or detailed data, or introducing new dimensions
- Slice and dice:
  - project and select
- Pivot (rotate):
  - reorient the cube; visualization; 3D to a series of 2D planes
- Other operations
  - drill across: involving (across) more than one fact table
  - drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
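Roll-up and slice can be shown on a tiny in-memory fact table. A minimal sketch (data and function names are ours, not from the slides): rolling up the location dimension from city to country just re-aggregates the measure at the coarser level, and a slice fixes one dimension value.

```python
from collections import defaultdict

# Tiny fact table: (city, quarter) -> units_sold (made-up sample data)
fact = {
    ("Vancouver", "Q1"): 100, ("Vancouver", "Q2"): 120,
    ("Toronto",   "Q1"):  80, ("Frankfurt", "Q1"):  60,
}
# One level of the location concept hierarchy: city -> country
city_to_country = {"Vancouver": "Canada", "Toronto": "Canada",
                   "Frankfurt": "Germany"}

def roll_up(fact, mapping):
    """Roll up the location dimension from city to country by re-aggregating."""
    out = defaultdict(int)
    for (city, quarter), units in fact.items():
        out[(mapping[city], quarter)] += units
    return dict(out)

def slice_quarter(fact, quarter):
    """Slice: select the sub-cube for one fixed quarter."""
    return {k: v for k, v in fact.items() if k[1] == quarter}

rolled = roll_up(fact, city_to_country)
# rolled[("Canada", "Q1")] == 180  (Vancouver 100 + Toronto 80)
```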

- Motivation: Why knowledge discovery in databases?
- Data warehouse and OLAP
  - What is a data warehouse?
  - A multi-dimensional data model
  - Data warehouse architecture
  - Data warehouse application
- Data mining
The Process of Data Warehouse Design
- Top-down, bottom-up approaches, or a combination of both
  - Top-down: starts with overall design and planning (mature)
  - Bottom-up: starts with experiments and prototypes (rapid)
- From a software engineering point of view
  - Waterfall: structured and systematic analysis at each step before proceeding to the next
  - Spiral: rapid generation of increasingly functional systems, with short turnaround times
Multi-Tiered Architecture
[Figure: data sources (operational DBs and other sources) feed, via extract/transform/load/refresh and a monitor & integrator, into the data warehouse and data marts (data storage tier, with metadata); OLAP servers serve the front-end tools: analysis, query, reports, and data mining.]
Three Data Warehouse Models
- Enterprise warehouse
  - collects all of the information about subjects spanning the entire organization
- Data mart
  - a subset of corporate-wide data that is of value to a specific group of users; its scope is confined to specific, selected groups, such as a marketing data mart
  - independent vs. dependent (fed directly from the warehouse) data marts
- Virtual warehouse
  - a set of views over operational databases
  - only some of the possible summary views may be materialized
Data Warehouse Development: A Recommended Approach
[Figure: define a high-level corporate data model; build distributed data marts and an enterprise data warehouse in parallel, refining the model at each stage; eventually integrate them into a multi-tier data warehouse.]
OLAP Server Architectures
- Relational OLAP (ROLAP)
  - uses a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support the missing pieces
  - includes optimization of the DBMS back end, implementation of aggregation navigation logic, and additional tools and services
  - greater scalability
- Multidimensional OLAP (MOLAP)
  - array-based multidimensional storage engine (sparse-matrix techniques)
  - fast indexing to pre-computed summarized data
- Hybrid OLAP (HOLAP)
  - user flexibility, e.g., low level: relational, high level: array
- Specialized SQL servers
  - specialized support for SQL queries over star/snowflake schemas

- Motivation: Why knowledge discovery in databases?
- Data warehouse and OLAP
  - What is a data warehouse?
  - A multi-dimensional data model
  - Data warehouse architecture
  - Data warehouse application
- Data mining
Data Warehouse Usage
- Three kinds of data warehouse applications
  - Information processing
    - supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts, and graphs
  - Analytical processing
    - multidimensional analysis of data warehouse data
    - supports basic OLAP operations: slice-dice, drilling, pivoting
  - Data mining
    - knowledge discovery from hidden patterns
    - supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools
- The differences among the three tasks
From On-Line Analytical Processing to On-Line Analytical Mining (OLAM)
- Why on-line analytical mining?
  - High quality of data in data warehouses
    - the DW contains integrated, consistent, cleaned data
  - Available information-processing infrastructure surrounding data warehouses
    - ODBC, OLE DB, Web access, service facilities, reporting and OLAP tools
  - OLAP-based exploratory data analysis
    - mining with drilling, dicing, pivoting, etc.
  - On-line selection of data mining functions
    - integration and swapping of multiple mining functions, algorithms, and tasks
- Architecture of OLAM
An OLAM Architecture
[Figure: Layer 4 — user interface (user GUI API; mining query in, mining result out); Layer 3 — OLAM engine and OLAP engine side by side (OLAP/OLAM); Layer 2 — data cube API over MDDBs with metadata; Layer 1 — database API with filtering & integration over the databases and the data warehouse, fed by data cleaning and data integration from the data repository.]

- Motivation: Why knowledge discovery in databases?
- Data warehouse and OLAP
- Data mining
  - Introduction to data mining
  - Characterization
  - Association rule mining
  - Classification
  - Cluster analysis
What Is Data Mining?
- Data mining (knowledge discovery in databases):
  - extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
- Alternative names and their "inside stories":
  - Data mining: a misnomer?
  - Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archaeology, data dredging, information harvesting, business intelligence, etc.
Why Data Mining? — Potential Applications
- Database analysis and decision support
  - Market analysis and management
    - target marketing, customer relation management, market basket analysis, cross selling, market segmentation
  - Risk analysis and management
    - forecasting, customer retention, improved underwriting, quality control, competitive analysis
  - Fraud detection and management
- Other applications
  - Text mining (news groups, email, documents) and Web analysis
  - Intelligent query answering
Data Mining: A KDD Process
- Data mining: the core of the knowledge discovery process
[Figure: databases → data cleaning and data integration → data warehouse → selection → task-relevant data → data mining → pattern evaluation → knowledge.]
Architecture of a Typical Data Mining System
[Figure, top to bottom: graphical user interface; pattern evaluation; data mining engine (consulting a knowledge base); database or data warehouse server; data cleaning, data integration, and filtering over the databases and the data warehouse.]
Data Mining: On What Kind of Data?
- Relational databases
- Data warehouses
- Transactional databases
- Advanced DB and information repositories
  - Object-oriented and object-relational databases
  - Spatial databases
  - Time-series data and temporal data
  - Text databases and multimedia databases
  - Heterogeneous and legacy databases
  - WWW
Data Mining Functionalities (1)
- Concept description: characterization and discrimination
  - Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
- Association
  - contains(x, "computer") ⇒ contains(x, "software") [support = 1%, confidence = 75%]
  - age(X, "20..29") ^ income(X, "20..29K") ⇒ buys(X, "PC") [2%, 60%]
  - Multi-dimensional vs. single-dimensional association
Data Mining Functionalities (2)
- Classification and prediction
  - Finding models (functions) that describe and distinguish classes or concepts for future prediction
  - E.g., classify countries based on climate, or classify cars based on gas mileage
  - Presentation: decision tree, classification rules, neural network
  - Prediction: predict some unknown or missing numerical values
- Cluster analysis
  - Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
  - Clustering principle: maximize the intra-class similarity and minimize the inter-class similarity
Data Mining Functionalities (3)
- Outlier analysis
  - Outlier: a data object that does not comply with the general behavior of the data
  - Can be considered noise or an exception, but is quite useful in fraud detection and rare-event analysis
- Trend and evolution analysis
  - Trend and deviation: regression analysis
  - Sequential pattern mining, periodicity analysis
  - Similarity-based analysis
- Other pattern-directed or statistical analyses
Are All the "Discovered" Patterns Interesting?
- A data mining system/query may generate thousands of patterns; not all of them are interesting
  - Suggested approach: human-centered, query-based, focused mining
- Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates a hypothesis that a user seeks to confirm
- Objective vs. subjective interestingness measures:
  - Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  - Subjective: based on the user's beliefs about the data, e.g., unexpectedness, novelty, actionability, etc.
Data Mining: Confluence of Multiple Disciplines
[Figure: data mining at the intersection of database technology, machine learning, statistics, information science, visualization, and other disciplines.]

- Motivation: Why knowledge discovery in databases?
- Data warehouse and OLAP
- Data mining
  - Introduction to data mining
  - Characterization
  - Association rule mining
  - Classification
  - Cluster analysis
Data Generalization and Summarization-Based Characterization
- Data generalization
  - A process which abstracts a large set of task-relevant data in a database from low conceptual levels to higher ones
  - [Figure: conceptual levels 1 (highest) through 5 (lowest); generalization moves data up the levels]
- Approaches:
  - Data cube approach (OLAP approach)
  - Attribute-oriented induction approach
Characterization: Data Cube Approach (without using AO-Induction)
- Perform computations and store the results in data cubes
- Strengths
  - An efficient implementation of data generalization
  - Computation of various kinds of measures, e.g., count(), sum(), average(), max()
  - Generalization and specialization can be performed on a data cube by roll-up and drill-down
- Limitations
  - Handles only dimensions of simple non-numeric data and measures of simple aggregated numeric values
  - Lacks intelligent analysis: cannot tell which dimensions should be used or what level the generalization should reach
Attribute-Oriented Induction
- Proposed in 1989 (KDD '89 workshop)
- Not confined to categorical data nor to particular measures
- How is it done?
  - Collect the task-relevant data (the initial relation) using a relational database query
  - Perform generalization by attribute removal or attribute generalization
  - Apply aggregation by merging identical, generalized tuples and accumulating their respective counts
  - Present the results to users interactively
Basic Principles of Attribute-Oriented Induction
- Data focusing: task-relevant data, including dimensions; the result is the initial relation
- Attribute removal: remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A's higher-level concepts are expressed in terms of other attributes
- Attribute generalization: if there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator and generalize A
- Attribute-threshold control: typically 2-8; specified or default
- Generalized relation threshold control: controls the final relation/rule size (see example)
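The core AOI step (generalize each attribute by climbing its concept hierarchy, then merge identical tuples while accumulating counts) fits in a few lines. A minimal sketch with made-up data and hand-written hierarchies, not from the slides:

```python
from collections import Counter

# Hypothetical initial relation: (gender, major, birth city); names already removed
initial = [
    ("M", "CS",      "Vancouver"), ("F", "CS",      "Montreal"),
    ("M", "Physics", "Seattle"),   ("F", "Biology", "Vancouver"),
]

# One-level generalization operators (concept hierarchies) for the sketch
major_to_field = {"CS": "Science", "Physics": "Science", "Biology": "Science"}
city_to_region = {"Vancouver": "Canada", "Montreal": "Canada",
                  "Seattle": "Foreign"}

def generalize(relation):
    """Climb each attribute one level, then merge identical generalized
    tuples, accumulating their counts (the AOI aggregation step)."""
    return Counter(
        (gender, major_to_field[major], city_to_region[city])
        for gender, major, city in relation
    )

prime = generalize(initial)
# e.g. ("F", "Science", "Canada") occurs twice in the prime generalized relation
```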
Class Characterization: An Example

Initial relation (Name and Phone # removed; Gender retained; Major generalized to Sci/Eng/Bus; Birth-Place generalized city → country; Birth_date generalized to age range; Residence generalized to city; GPA generalized to Excl, VG, ...):

Name            Gender  Major    Birth-Place            Birth_date  Residence                 Phone #   GPA
Jim Woodman     M       CS       Vancouver, BC, Canada  8-12-76     3511 Main St., Richmond   687-4598  3.67
Scott Lachance  M       CS       Montreal, Que, Canada  28-7-75     345 1st Ave., Richmond    253-9106  3.70
Laura Lee       F       Physics  Seattle, WA, USA       25-8-70     125 Austin Ave., Burnaby  420-5232  3.83
...             ...     ...      ...                    ...         ...                       ...       ...

Prime generalized relation:

Gender  Major    Birth_region  Age_range  Residence  GPA        Count
M       Science  Canada        20-25      Richmond   Very-good  16
F       Science  Foreign       25-30      Burnaby    Excellent  22
...     ...      ...           ...        ...        ...        ...

Crosstab (Gender × Birth_Region):

         Canada  Foreign  Total
M        16      14       30
F        10      22       32
Total    26      36       62
Presentation of Generalized Results
- Generalized relation:
  - relations where some or all attributes are generalized, with counts or other aggregation values accumulated
- Cross tabulation:
  - mapping results into cross-tabulation form (similar to contingency tables)
- Visualization techniques:
  - pie charts, bar charts, curves, cubes, and other visual forms
- Quantitative characteristic rules:
  - mapping generalized results into characteristic rules with quantitative information attached, e.g.,
    grad(x) ∧ male(x) ⇒ birth_region(x) = "Canada" [t: 53%] ∨ birth_region(x) = "foreign" [t: 47%]
Presentation—Generalized Relation / Presentation—Crosstab
[Two slides showing a generalized relation and a crosstab, as in the example above.]

- Motivation: Why knowledge discovery in databases?
- Data warehouse and OLAP
- Data mining
  - Introduction to data mining
  - Characterization
  - Association rule mining
  - Classification
  - Cluster analysis
What Is Association Rule Mining?
- Association rule mining:
  - finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories
- Applications:
  - basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
- Examples:
  - Rule form: "antecedent ⇒ consequent [support, confidence]"
  - buys(x, "diapers") ⇒ buys(x, "beers") [0.5%, 60%]
  - major(x, "CS") ^ takes(x, "DB") ⇒ grade(x, "A") [1%, 75%]
Association Rules
- Given
  - a database of customer transactions
  - each transaction is a list of items (purchased by a customer in one visit)
- Find all rules that correlate the presence of one set of items with another set of items
  - Example: 98% of people who purchase tires and auto accessories also get automotive services done
  - Any number of items may appear in the consequent/antecedent of a rule
  - It is possible to specify constraints on the rules (e.g., find only rules involving home laundry appliances)
- Characteristics
  - Number of items: hundreds of thousands; number of transactions: millions; data set sizes: high gigabytes
  - Discovering all rules, rather than verifying whether a given rule holds
  - Performance and scalability are therefore central concerns
Association Rules — Problem Statement
- I = {i1, i2, ..., im}: a set of literals, called items
- Transaction T: a set of items such that T ⊆ I
- Database D: a set of transactions
- A transaction T contains X, a set of some items, if X ⊆ T
- An association rule is an implication of the form X ⇒ Y, where X, Y ⊂ I and X ∩ Y = ∅
Association Rules — Confidence and Support
- The confidence of rule X ⇒ Y is the percentage of transactions in D containing X that also contain Y:
  c = (number of transactions containing both X and Y) / (number of transactions containing X)
- The support of rule X ⇒ Y is the percentage of transactions in D that contain both X and Y:
  s = (number of transactions containing both X and Y) / (total number of transactions in D)
- Goal: discover all rules that meet the minimum user-specified confidence and support.
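The two formulas translate directly into code. A minimal sketch (function names are ours; the transactions are the ones used in the worked example that follows):

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(x, y, transactions):
    """Fraction of transactions containing X that also contain Y."""
    return support(x | y, transactions) / support(x, transactions)

D = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
s = support({"A", "C"}, D)       # 2 of 4 transactions -> 0.5
c = confidence({"A"}, {"C"}, D)  # 2 of the 3 A-transactions -> 0.666...
```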
Mining Association Rules
- Find the frequent itemsets: the sets of items that have minimum support
  - A subset of a frequent itemset must also be a frequent itemset; i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets (the Apriori property)
  - Iteratively find the frequent itemsets with cardinality from 1 to k (k-itemsets)
- Use the frequent itemsets to generate the desired rules
Mining Association Rules — Example

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

With min. support 50% and min. confidence 50%:

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%

For the rule A ⇒ C:
- support = support({A, C}) = 50%
- confidence = support({A, C}) / support({A}) = 66.6%
The Apriori Algorithm — Example

Database D:
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D for candidate 1-itemsets C1, keep those with support >= 2 as L1:
C1: {1} 2, {2} 3, {3} 3, {4} 1, {5} 3
L1: {1} 2, {2} 3, {3} 3, {5} 3

Join L1 with itself to get C2, scan D for counts, keep frequent ones as L2:
C2: {1 2} 1, {1 3} 2, {1 5} 1, {2 3} 2, {2 5} 3, {3 5} 2
L2: {1 3} 2, {2 3} 2, {2 5} 3, {3 5} 2

Join L2 with itself (and prune) to get C3, scan D for counts:
C3: {2 3 5}
L3: {2 3 5} 2
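The level-wise search traced above can be sketched as a short function: count candidates against the database, keep the frequent ones, join survivors into the next level, and prune candidates with an infrequent subset. A minimal sketch, not production code:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise Apriori: return {frequent itemset (frozenset): count}."""
    items = {frozenset([i]) for t in transactions for i in t}

    def frequent(cands):
        # Scan the database and keep candidates meeting min_support
        return {c: n for c in cands
                if (n := sum(1 for t in transactions if c <= t)) >= min_support}

    result, level, k = {}, frequent(items), 1
    while level:
        result.update(level)
        k += 1
        # Join step: union pairs of frequent (k-1)-itemsets into k-candidates
        cands = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent
        cands = {c for c in cands
                 if all(frozenset(s) in level for s in combinations(c, k - 1))}
        level = frequent(cands)
    return result

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
freq = apriori(D, min_support=2)
# e.g. freq[frozenset({2, 3, 5})] == 2, matching L3 of the example
```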
Using Frequent Itemsets to Generate Rules
- Association rules can be generated as follows:
  - For each frequent itemset l, generate all nonempty proper subsets of l
  - For each nonempty subset s of l, output the rule "s ⇒ (l − s)" if support_count(l) / support_count(s) >= min_conf, where min_conf is the minimum confidence threshold
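The rule-generation step reuses the support counts already gathered during the frequent-itemset phase, so no extra database scan is needed. A minimal sketch (names ours, data from the running example):

```python
from itertools import combinations

def generate_rules(freq, min_conf):
    """Emit (antecedent, consequent, confidence) for every frequent itemset l
    and every nonempty proper subset s of l meeting min_conf."""
    rules = []
    for l, l_count in freq.items():
        if len(l) < 2:
            continue  # rules need both sides nonempty
        for r in range(1, len(l)):
            for s in map(frozenset, combinations(l, r)):
                conf = l_count / freq[s]  # support_count(l) / support_count(s)
                if conf >= min_conf:
                    rules.append((s, l - s, conf))
    return rules

# Support counts from the running example (4 transactions, min. support 2)
freq = {frozenset({"A"}): 3, frozenset({"B"}): 2, frozenset({"C"}): 2,
        frozenset({"A", "C"}): 2}
rules = generate_rules(freq, min_conf=0.5)
# yields A => C with confidence 2/3 and C => A with confidence 1.0
```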
Summary
- Association rule mining
  - is probably the most significant contribution from the database community to KDD
  - a large number of papers have been published
  - many interesting issues have been explored
- An interesting research direction:
  - association analysis in other types of data: spatial data, multimedia data, time-series data, etc.

- Motivation: Why knowledge discovery in databases?
- Data warehouse and OLAP
- Data mining
  - Introduction to data mining
  - Characterization
  - Association rule mining
  - Classification
  - Cluster analysis
Classification
- Classification:
  - predicts categorical class labels
  - classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data
- Typical applications
  - credit approval
  - target marketing
  - medical diagnosis
  - treatment-effectiveness analysis
Classification—A Two-Step Process
- Model construction: describing a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  - The set of tuples used for model construction: the training set
  - The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage: classifying future or unknown objects
  - Estimate the accuracy of the model
    - The known label of each test sample is compared with the classified result from the model
    - Accuracy rate is the percentage of test-set samples that are correctly classified by the model
    - The test set is independent of the training set; otherwise over-fitting will occur
Classification Process (1): Model Construction

Training data:
NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

The classification algorithm produces a classifier (model), e.g.:
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Classification Process (2): Use the Model in Prediction

Testing data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) — Tenured?
Supervised vs. Unsupervised Learning
- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  - New data are classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Classification by Decision Tree Induction
- Decision tree
  - A flow-chart-like tree structure
  - Internal nodes denote tests on attributes
  - Branches represent outcomes of the tests
  - Leaf nodes represent class labels or class distributions
- Decision tree generation consists of two phases
  - Tree construction
    - At the start, all the training examples are at the root
    - Partition the examples recursively based on selected attributes
  - Tree pruning
    - Identify and remove branches that reflect noise or outliers
- Use of a decision tree: classifying an unknown sample
  - Test the attribute values of the sample against the decision tree
Training Dataset (following an example from Quinlan's ID3)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no
Output: A Decision Tree for "buys_computer"
[Figure: the root tests age?
  age <=30 → test student? (no → no; yes → yes)
  age 30..40 → yes
  age >40 → test credit rating? (excellent → no; fair → yes)]
Another Example (from Quinlan)

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N
A Sample Decision Tree
[Figure: the root tests Outlook
  sunny → test humidity (high → N; normal → P)
  overcast → P
  rain → test windy (true → N; false → P)]
Algorithm for Decision Tree Induction
- Basic algorithm (a greedy algorithm)
  - The tree is constructed in a top-down, recursive, divide-and-conquer manner
  - At the start, all the training examples are at the root
  - Attributes are categorical (continuous-valued attributes are discretized in advance)
  - Examples are partitioned recursively based on selected attributes
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
- Conditions for stopping the partitioning
  - All samples for a given node belong to the same class
  - There are no remaining attributes for further partitioning (majority voting is employed to label the leaf)
  - There are no samples left
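The attribute-selection heuristic the slides name, information gain, is the reduction in class entropy obtained by splitting on an attribute; ID3 picks the attribute with the largest gain at each node. A minimal sketch (rows and names are ours, not from the slides):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, label):
    """Entropy reduction from splitting `rows` (list of dicts) on `attr`."""
    base = entropy([r[label] for r in rows])
    split = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[label] for r in rows if r[attr] == value]
        split += len(subset) / len(rows) * entropy(subset)
    return base - split

# Four made-up weather rows, just to exercise the measure
rows = [
    {"outlook": "sunny", "play": "N"}, {"outlook": "sunny", "play": "N"},
    {"outlook": "overcast", "play": "P"}, {"outlook": "rain", "play": "P"},
]
gain = information_gain(rows, "outlook", "play")  # a perfect split: gain = 1.0
```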
Summary
- Classification is an extensively studied problem (mainly in statistics, machine learning, and neural networks)
- Classification is probably one of the most widely used data mining techniques, with many extensions
- Scalability is still an important issue for database applications; combining classification with database techniques should therefore be a promising topic
- Research directions: classification of non-relational data, e.g., text, spatial, and multimedia data

- Motivation: Why knowledge discovery in databases?
- Data warehouse and OLAP
- Data mining
  - Introduction to data mining
  - Characterization
  - Association rule mining
  - Classification
  - Cluster analysis
Examples of Clustering Applications
- Marketing: help marketers discover distinct groups in their customer bases, then use this knowledge to develop targeted marketing programs
- Land use: identification of areas of similar land use in an earth-observation database
- Insurance: identifying groups of motor insurance policy holders with a high average claim cost
- City planning: identifying groups of houses according to their house type, value, and geographical location
- Earthquake studies: observed earthquake epicenters should be clustered along continental faults
What Is Good Clustering?
- A good clustering method will produce high-quality clusters with
  - high intra-class similarity
  - low inter-class similarity
- The quality of a clustering result depends on both the similarity measure used by the method and its implementation
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
Major Clustering Approaches
- Partitioning algorithms: construct various partitions and then evaluate them by some criterion
- Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion
- Density-based: based on connectivity and density functions
- Grid-based: based on a multiple-level granularity structure
- Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the given models
The K-Means Clustering Method
- Example
[Figure: four scatter plots on 0-10 axes showing k-means iterations — initial assignment of points to k seed centers, recomputation of cluster means, reassignment of points, and the final converged clusters.]
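The iterations in the figure alternate two steps: assign each point to its nearest center, then move each center to the mean of its assigned points. A minimal sketch on 2-D points (data and seeding choice are ours, not from the slides):

```python
from math import dist
from statistics import mean

def k_means(points, k, iters=20):
    """Plain k-means: assign each point to the nearest center, recompute
    centers as cluster means, repeat for a fixed number of iterations."""
    centers = points[:k]  # naive seeding: the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        centers = [
            (mean(x for x, _ in c), mean(y for _, y in c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Two well-separated blobs (made-up coordinates)
pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, clusters = k_means(pts, k=2)
# each blob ends up in its own cluster of three points
```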
Hierarchical Clustering
- Uses a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition.
[Figure: objects a, b, c, d, e — agglomerative clustering (AGNES) proceeds step 0 → 4, merging a, b into ab and c, d, e into cde, then into abcde; divisive clustering (DIANA) runs the same steps in reverse, step 4 → 0.]
AGNES (Agglomerative Nesting)
- Introduced in Kaufmann and Rousseeuw (1990)
- Implemented in statistical analysis packages, e.g., S-Plus
- Uses the single-link method and the dissimilarity matrix
- Merges the nodes that have the least dissimilarity
- Continues in a non-descending fashion
- Eventually all nodes belong to the same cluster
[Figure: three scatter plots on 0-10 axes showing clusters being progressively merged.]
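The single-link criterion AGNES uses measures the dissimilarity between two clusters as the distance between their closest members; the algorithm repeatedly merges the least-dissimilar pair. A minimal O(n^3) sketch stopping at k clusters rather than one (a simplification of AGNES, with made-up data):

```python
from math import dist

def single_link(points, k):
    """Agglomerative clustering with the single-link criterion: repeatedly
    merge the two clusters whose closest members are nearest, until k remain."""
    clusters = [[p] for p in points]  # start: every point is its own cluster
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single-link dissimilarity: distance of the closest pair
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # merge the least-dissimilar pair
    return clusters

pts = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
out = single_link(pts, k=3)
# the two tight pairs merge first; (10, 0) stays a singleton
```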
DIANA (Divisive Analysis)
- Introduced in Kaufmann and Rousseeuw (1990)
- Implemented in statistical analysis packages, e.g., S-Plus
- Inverse order of AGNES
- Eventually each node forms a cluster on its own
[Figure: three scatter plots on 0-10 axes showing one cluster being progressively split apart.]