Tutorial based on the book: Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber
This material was developed with financial help of the WUSA fund of Austria.
made by Radmilo Pesic & Branko Golubovic

Introduction

What motivated data mining?
Necessity is the mother of invention.
• Data Collection and Database Creation (1960s and earlier)
• Database Management Systems (1970s-early 1980s)
• Advanced Databases Systems (mid-1980s-present)
• Web-based Databases Systems (1990s-present)
• Data Warehousing and Data Mining (mid-1980s-present)
• New Generation of Integrated Information Systems (2000-…)

What Is Data Mining?
Extracting or “mining” knowledge from large amounts of data.
1. Data cleaning
2. Data integration
3. Data selection
4. Data transformation
5. Data mining
6. Pattern evaluation
7. Knowledge presentation
[Figure: data mining as a step in knowledge discovery. Databases and flat files → Cleaning and Integration → Data warehouse → Selection and Transformation → Data Mining → Patterns → Evaluation and Presentation → Knowledge]

Components of a typical data mining system:
• Database, data warehouse, or other information repository
• Database or data warehouse server
• Knowledge base
• Data mining engine
• Pattern evaluation module
• Graphical user interface
[Figure: architecture of a typical data mining system, from top to bottom: Graphical user interface; Pattern evaluation; Data mining engine (with Knowledge base); Database or data warehouse server; Database and Data warehouse]

Data Mining – On What Kind of Data?
• Relational Databases
• Data Warehouses
• Transactional Databases
• Advanced Database Systems and Advanced Database Applications (object-oriented, object-relational, spatial, temporal, time-series, text, multimedia, heterogeneous, legacy databases, and the World Wide Web)

Relational Databases

customer
cust_ID | name | address | age | income | credit_info | …
C1 | Smith, Sandy | 5463 E Hastings, Burnaby, BC V5A 4S9, Canada | 21 | $27000 | 1 | …
…

item
item_ID | name | brand | category | type | price | place_made | supplier | cost
I3 | high_res_TV | Toshiba | high resolution | TV | $988.00 | Japan | NikoX | $600.00
I8 | multidiscCDplay | Sanyo | multidisc | CD player | $369.00 | Japan | Music Front | $120.00
…

employee
empl_ID | name | category | group | salary | commission
E55 | Jones, Jane | home entertainment | manager | $18,000 | 2%
…

branch
branch_ID | name | address
B1 | City Square | 369 Cambie St., Vancouver, BC V5L 3A2, Canada
…

purchases
trans_ID | cust_ID | empl_ID | date | time | method_paid | amount
T100 | C1 | E55 | 09/21/98 | 15:45 | Visa | $1357.00
…

item_sold
trans_ID | item_ID | qty
T100 | I3 | 1
T100 | I8 | 2
…

works_at
empl_ID | branch_ID
E55 | B1
…

Data Warehouses
[Figure: Typical architecture of a data warehouse for AllElectronics. Data sources in Chicago, New York, Toronto, and Vancouver are cleaned, transformed, integrated, and loaded into the data warehouse, which clients access through query and analysis tools.]

[Figure: a data cube of sales with dimensions time (quarters), item (types: computer, home entertainment, security, phone), and address (cities); for example, the cell <Vancouver, Q1, security> holds 400. The cube shows Chicago 440, New York 1560, Toronto 395, Vancouver 605. Drill-down on time data for Q1 gives months (Jan 150, Feb 100, March 150). Roll-up on address gives countries (USA 2000, Canada 1000).]

Text Databases and Multimedia Databases
• Text databases can be highly unstructured, semistructured, or well structured
• Multimedia databases store image, audio, and video data
• Such data require a lot of storage space; it is continuous-media data

Heterogeneous Databases and Legacy Databases

The World Wide Web
• mining path traversal patterns

Data Mining Functionalities
What Kinds of Patterns Can Be Mined?
• Concept/Class Description: Characterization and Discrimination
• Association Analysis
• Classification and Prediction
• Cluster Analysis
• Outlier Analysis
• Evolution Analysis

Are All of the Patterns Interesting?
A pattern is interesting if it is:
• easily understood
• valid
• (potentially) useful
• novel
or if it
• confirms the user’s hypothesis
An interesting pattern represents knowledge!

Objective measures of pattern interestingness:
• support
• confidence
Subjective measures of pattern interestingness:
• the pattern is unexpected
• the pattern is actionable
• the pattern is expected (it confirms a hypothesis)
Can a data mining system generate all of the interesting patterns?
Can a data mining system generate only interesting patterns?

Classification of Data Mining Systems
[Figure: data mining as a confluence of multiple disciplines: database technology, statistics, information science, visualization, machine learning, and other disciplines]
• according to the kinds of databases mined (relational, data warehouse, object-oriented…)
• according to the kinds of knowledge mined (association, classification, clustering…; generalized, primitive-level, or knowledge at multiple levels; regularities or irregularities)
• according to the kinds of techniques utilized (autonomous, interactive exploratory, or query-driven systems; data warehouse oriented, statistics…)
• according to the applications adapted (for finance, DNA, etc.)
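The two objective measures listed above, support and confidence, can be illustrated with a short Python sketch. This is only a minimal illustration under stated assumptions: the transaction data is hypothetical, and the function names `support` and `confidence` are chosen here for readability rather than taken from the slides. Support is computed as the fraction of transactions containing an itemset, and confidence as the estimated conditional probability of the consequent given the antecedent.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) for the rule antecedent => consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return both / base if base else 0.0


# Hypothetical market-basket data, invented for this illustration only.
transactions = [
    frozenset({"computer", "software"}),
    frozenset({"computer", "printer"}),
    frozenset({"computer", "software", "printer"}),
    frozenset({"phone"}),
]

# Rule: computer => software
rule_support = support(transactions, {"computer", "software"})
rule_confidence = confidence(transactions, {"computer"}, {"software"})
print(rule_support, rule_confidence)
```

With this toy data, two of the four transactions contain both items, and two of the three transactions containing a computer also contain software. A mining system would typically keep only rules whose support and confidence exceed user-chosen thresholds.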
Major Issues in Data Mining
Mining methodology and user interaction issues:
• Mining different kinds of knowledge in databases
• Interactive mining of knowledge at multiple levels of abstraction
• Incorporation of background knowledge
• Data mining query languages and ad hoc data mining
• Presentation and visualization of data mining results
• Handling noisy or incomplete data
• Pattern evaluation – the interestingness problem
Performance issues:
• Efficiency and scalability of data mining algorithms
• Parallel, distributed, and incremental mining algorithms
Issues relating to the diversity of database types:
• Handling of relational and complex types of data
• Mining information from heterogeneous databases and global information systems

Data Warehouse and OLAP Technology for Data Mining

What Is a Data Warehouse?
“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision making process.” W.H. Inmon
• Subject-oriented
• Integrated
• Time-variant
• Nonvolatile

How are organizations using the information from data warehouses?
• Increasing customer focus
• Repositioning products and managing product portfolios
• Analyzing operations and looking for sources of profit
• Managing the customer relationships, making environmental corrections, and managing the cost of corporate assets
Different approaches to heterogeneous database integration:
• Query-driven approach (wrappers and integrators)
• Update-driven approach

Differences Between Operational Database Systems and Data Warehouses
• Users and system orientation
• Data contents
• Database design
• View
• Access patterns
Why have a separate data warehouse?
A Multidimensional Data Model
From Tables and Spreadsheets to Data Cubes
• A data cube is defined by dimensions and facts
• Dimension table
• Fact table

location = “Chicago”
time | home ent. | comp. | phone | sec.
Q1 | 854 | 882 | 89 | 623
Q2 | 943 | 890 | 64 | 698
Q3 | 1032 | 924 | 59 | 789
Q4 | 1129 | 992 | 63 | 870

location = “New York”
time | home ent. | comp. | phone | sec.
Q1 | 1087 | 968 | 38 | 872
Q2 | 1130 | 1024 | 41 | 925
Q3 | 1034 | 1048 | 45 | 1002
Q4 | 1142 | 1091 | 54 | 984

location = “Toronto”
time | home ent. | comp. | phone | sec.
Q1 | 818 | 746 | 43 | 591
Q2 | 894 | 769 | 52 | 682
Q3 | 940 | 795 | 58 | 728
Q4 | 978 | 864 | 59 | 784

location = “Vancouver”
time | home ent. | comp. | phone | sec.
Q1 | 605 | 825 | 14 | 400
Q2 | 680 | 952 | 31 | 512
Q3 | 812 | 1023 | 30 | 501
Q4 | 927 | 1038 | 38 | 580

A 2-D view of sales data for AllElectronics, and its 3-D data cube representation (dimensions: time in quarters, item types, and location in cities)

[Figure: A 4-D data cube representation of sales data for AllElectronics: one 3-D cube (time, item, location) per supplier (“SUP1”, “SUP2”, …)]

Lattice of cuboids, making up a 4-D data cube:
• 0-D (apex) cuboid: all
• 1-D cuboids: time; item; location; supplier
• 2-D cuboids: time, item; time, location; time, supplier; item, location; item, supplier; location, supplier
• 3-D cuboids: time, item, location; time, item, supplier; time, location, supplier; item, location, supplier
• 4-D (base) cuboid: time, item, location, supplier

Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Databases
Star schema:
• a large central table (fact table)
• a set of smaller attendant tables (dimension tables), one for each dimension
sales fact table: time_key, item_key, branch_key, location_key, dollars_sold, units_sold
time dimension table: time_key, day, day_of_week, month, quarter, year
branch dimension table: branch_key, branch_name, branch_type
item dimension table: item_key, item_name, brand, type, supplier_type
location dimension table: location_key, street, city, province_or_state, country

Snowflake schema:
• a variant of the star schema, where some dimension tables are normalized
• reduces redundancies, but also reduces the effectiveness of browsing
sales fact table: time_key, item_key, branch_key, location_key, dollars_sold, units_sold
time dimension table: time_key, day, day_of_week, month, quarter, year
branch dimension table: branch_key, branch_name, branch_type
item dimension table: item_key, item_name, brand, type, supplier_key
supplier dimension table: supplier_key, supplier_type
location dimension table: location_key, street, city_key
city dimension table: city_key, city, province_or_state, country

Fact constellation:
• multiple fact tables share dimension tables
sales fact table: time_key, item_key, branch_key, location_key, dollars_sold, units_sold
shipping fact table: item_key, time_key, shipper_key, from_location, to_location, dollars_cost, units_shipped
time dimension table: time_key, day, day_of_week, month, quarter, year
branch dimension table: branch_key, branch_name, branch_type
item dimension table: item_key, item_name, brand, type, supplier_type
shipper dimension table: shipper_key, shipper_name, location_key, shipper_type
location dimension table: location_key, street, city, province_or_state, country

Defining multidimensional schema
• DMQL – data mining query language
• Syntax:
cube definition:
    define cube <cube_name> [<dimension_list>]: <measure_list>
dimension definition:
    define dimension <dimension_name> as (<attribute_or_subdimension_list>)

Example:
• Constellation schema defined in DMQL:
    define cube sales [time, item, branch, location]:
        dollars_sold=sum(sales_in_dollars), units_sold=count(*)
    define dimension time as (time_key, day, day_of_week, month, quarter, year)
    define dimension item as (item_key, item_name, brand, type, supplier_type)
    define dimension branch as (branch_key, branch_name, branch_type)
    define dimension location as (location_key, street, city, province_or_state, country)
    define cube shipping [time, item, shipper, from_location, to_location]:
        dollars_cost=sum(cost_in_dollars), units_shipped=count(*)
    define dimension time as time in cube sales
    define dimension item as item in cube sales
    define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
    define dimension from_location as location in cube sales
    define dimension to_location as location in cube sales

Measures: Their Categorization and Computation
Measures, based on the aggregate function:
• Distributive
• Algebraic
• Holistic

Introducing Concept Hierarchies
• A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level concepts.
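A concept hierarchy can be sketched in Python as a chain of dictionaries, one per level, with a roll-up that re-aggregates a distributive measure (here a sum) at a higher level. This is an illustrative sketch only: the names `CITY_TO_STATE`, `STATE_TO_COUNTRY`, and `roll_up` are hypothetical, the mappings follow the location hierarchy used in this tutorial (city, province_or_state, country), and the per-city sales values are the ones shown in the tutorial's roll-up illustration.

```python
from collections import defaultdict

# Concept hierarchy for location: city -> province_or_state -> country.
CITY_TO_STATE = {
    "Vancouver": "British Columbia",
    "Toronto": "Ontario",
    "New York": "New York",
    "Chicago": "Illinois",
}
STATE_TO_COUNTRY = {
    "British Columbia": "Canada",
    "Ontario": "Canada",
    "New York": "USA",
    "Illinois": "USA",
}


def roll_up(sales, *mappings):
    """Sum a measure after mapping each key up through the given hierarchy levels."""
    totals = defaultdict(int)
    for key, amount in sales.items():
        for mapping in mappings:
            key = mapping[key]  # climb one level of the hierarchy
        totals[key] += amount
    return dict(totals)


# Per-city values as they appear in the roll-up illustration of this tutorial.
city_sales = {"Chicago": 440, "New York": 1560, "Toronto": 395, "Vancouver": 605}

by_state = roll_up(city_sales, CITY_TO_STATE)
by_country = roll_up(city_sales, CITY_TO_STATE, STATE_TO_COUNTRY)
print(by_country)
```

Because sum is a distributive measure, the country totals could equally be computed from `by_state` instead of from the city-level data; the design choice here is simply to keep one generic function for any number of levels.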
[Figure: a concept hierarchy for the dimension location, with levels all, country, province_or_state, city]
all
    Canada
        British Columbia: Vancouver, Victoria
        Ontario: Toronto, Ottawa
    USA
        New York: New York, Buffalo
        Illinois: Chicago

• Hierarchical and lattice structures of attributes in warehouse dimensions:
Hierarchy for location: street < city < province_or_state < country
Lattice for time: day < {month < quarter; week} < year

OLAP Operations in the Multidimensional Data Model
• Roll-up
• Drill-down
• Slice and dice
• Pivot (rotate)
• Other (drill-across, drill-through)

[Figure: roll-up and drill-down on the sales cube (time in quarters, item types, location in cities). Roll-up on location (from cities to countries) replaces Chicago, New York, Toronto, and Vancouver with USA (2000) and Canada (1000). Drill-down on time (from quarters to months) replaces Q1 to Q4 with January to December (January 150, February 100, March 150).]

[Figure: slice, dice, and pivot on the sales cube. Dice for (location=“Toronto” or “Vancouver”) and (time=“Q1” or “Q2”) and (item=“home entertainment” or “computer”). Slice for time=“Q1” gives a 2-D table of item (types) by location (cities), with the Vancouver values 605, 825, 14, 400. Pivot rotates the axes of that 2-D slice, exchanging item (types) and location (cities).]

A Starnet Query Model for Querying Multidimensional Databases
[Figure: starnet model with one radial line per dimension: location (street, city, province_or_state, country, continent), customer (name, category, group), item (name, brand, category, type), time (day, month, quarter, year)]

Data Warehouse Architecture
Steps for the Design and Construction of Data Warehouses
The Design of a Data Warehouse: A Business Analysis Framework
• top-down view
• data source view
• data warehouse view
• business query view

The Process of Data Warehouse Design
• top-down approach
• bottom-up approach
• combined approach
• waterfall method
• spiral method
Steps of the warehouse design:
1) Choosing a business process to model;
2) Choosing the grain of the business process;
3) Choosing the dimensions;
4) Choosing the measures.

A Three-Tier Data Warehouse Architecture
[Figure: three-tier architecture. Top tier: front-end tools (query/report, analysis, data mining). Middle tier: OLAP server. Bottom tier: data warehouse server (data warehouse, data marts, metadata repository, monitoring, administration), fed from operational databases and external sources through extract, clean, transform, load, and refresh.]

There are three data warehouse models:
• Enterprise warehouse
• Data mart
• Virtual warehouse

Types of OLAP Servers: ROLAP versus MOLAP versus HOLAP
Relational OLAP (ROLAP) servers:
• use of relational or extended-relational DBMS
• greater scalability
Multidimensional OLAP (MOLAP) servers:
• use of data cube – fast indexing
• possible low storage utilization – use of compression
Hybrid OLAP (HOLAP) servers:
• scalability of ROLAP and faster computation of MOLAP
• Microsoft SQL Server 7.0 OLAP Services supports a hybrid OLAP server

Data Warehouse Implementation
• SQL group by
• Data cube computation extends SQL with the compute cube operator
• Example:
“Compute the sum of sales, grouping by item and city.”
“Compute the sum of sales, grouping by item.”
“Compute the sum of sales, grouping by city.”
• The possible group by’s are the following:
{(city, item, year), (city, item), (city, year), (item, year), (city), (item), (year), ()}

Lattice of cuboids:
• 0-D (apex) cuboid: ()
• 1-D cuboids: (city), (item), (year)
• 2-D cuboids: (city, item), (city, year), (item, year)
• 3-D (base) cuboid: (city, item, year)

    define cube sales [item, city, year]: sum(sales_in_dollars)
    compute cube sales

• The number of cuboids in an n-dimensional data cube is $2^n$
• The number of cuboids in an n-dimensional data cube where each dimension has a concept hierarchy (for example day < week < month < quarter < year) is:
$$T = \prod_{i=1}^{n} (L_i + 1)$$
where $L_i$ is the number of levels of dimension $i$; the extra 1 accounts for the virtual top level all.
• Example: if the cube has 10 dimensions and each dimension has 4 levels, the total number of cuboids that can be generated will be $5^{10} \approx 9.8 \times 10^6$

Partial Materialization: Selected Computation of Cuboids
There are three choices for data cube materialization given a base cuboid:
(1) do not precompute any of the “nonbase” cuboids (no materialization)
(2) precompute all of the cuboids (full materialization)
(3) selectively compute a proper subset of the whole set of possible cuboids (partial materialization)
The partial materialization of cuboids should consider three factors:
• identify the subset of cuboids to materialize,
• exploit the materialized cuboids during query processing, and
• efficiently update the materialized cuboids during load and refresh.

Multiway Array Aggregation in the Computation of Data Cubes
ROLAP:
• Sorting, hashing, and grouping operations are applied to the dimension attributes in order to reorder and cluster related tuples.
• Grouping is performed on some subaggregates as a “partial grouping step”. These “partial groupings” may be used to speed up the computation of other subaggregates.
• Aggregates may be computed from previously computed aggregates, rather than from the base fact tables.
MOLAP:
• Partition the array into chunks.
• Compute aggregates by visiting cube cells.

[Figure: A 3-D array for the dimensions A (a0 to a3), B (b0 to b3), and C (c0 to c3), organized into 64 chunks numbered 1 to 64]

Indexing OLAP Data
• Bitmap indexing
• Join indexing

Bitmap Indexing

Base table
RID | item | city
R1 | H | V
R2 | C | V
R3 | P | V
R4 | S | V
R5 | H | T
R6 | C | T
R7 | P | T
R8 | S | T

Item bitmap index table
RID | H | C | P | S
R1 | 1 | 0 | 0 | 0
R2 | 0 | 1 | 0 | 0
R3 | 0 | 0 | 1 | 0
R4 | 0 | 0 | 0 | 1
R5 | 1 | 0 | 0 | 0
R6 | 0 | 1 | 0 | 0
R7 | 0 | 0 | 1 | 0
R8 | 0 | 0 | 0 | 1

City bitmap index table
RID | V | T
R1 | 1 | 0
R2 | 1 | 0
R3 | 1 | 0
R4 | 1 | 0
R5 | 0 | 1
R6 | 0 | 1
R7 | 0 | 1
R8 | 0 | 1

Indexing OLAP data using bitmap indices

Join Indexing
[Figure: Linkages between a sales fact table and dimension tables for location and item: location Main Street links to sales T57, T238, and T884; item Sony-TV links to sales T57 and T459]

Join index table for location/sales
location | sales_key
… | …
Main Street | T57
Main Street | T238
Main Street | T884
… | …

Join index table for item/sales
item | sales_key
… | …
Sony-TV | T57
Sony-TV | T459
… | …

Join index table linking two dimensions location/item/sales
location | item | sales_key
… | … | …
Main Street | Sony-TV | T57
… | … | …

Join index tables based on the linkages between the sales fact table and dimension tables for location and item

Efficient Processing of OLAP Queries
1. Determine which operations should be performed on the available cuboids
2. Determine to which materialized cuboid(s) the relevant operations should be applied

Metadata Repository
• A description of the structure of the data warehouse
• Operational metadata
• The algorithms used for summarization
• The mapping from the operational environment to the data warehouse
• Data related to system performance
• Business metadata

Data Warehouse Back-End Tools and Utilities
• Data extraction
• Data cleaning
• Data transformation
• Load
• Refresh

Further Development of Data Cube Technology
Discovery-Driven Exploration of Data Cubes
• SelfExp
• InExp
• PathExp

Sum of sales: change in sales over time
Month: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Total: 1% -1% 0% 1% 3% -1% -9% -1% 2% -4% 3%

Avg. sales: change in sales for each item-time combination
Month: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Sony b/w printer: 9% -8% 2% -5% 14% -4% 0% 41% -13% -15% -11%
Sony color printer: 0% 0% 3% 2% 4% -10% -13% 0% 4% -6% 4%
HP b/w printer: -2% 1% 2% 3% 8% 0% -12% -9% 3% -3% 6%
HP color printer: 0% 0% -2% 1% 0% -1% -7% -2% 1% -4% 1%
IBM desktop computer: 1% -2% -1% -1% 3% 3% -10% 4% 1% -4% -1%
IBM laptop computer: 0% 0% -1% 3% 4% 2% -10% -2% 0% -9% 3%
Toshiba desktop comp.: -2% -5% 1% 1% -1% 1% 5% -3% -5% -1% -1%
Toshiba laptop comp.: 1% 0% 3% 0% -2% -2% -5% 3% 2% -1% 0%
Logitech mouse: 3% -2% -1% 0% 4% 6% -11% 2% 1% -4% 0%
Ergo-way mouse: 0% 0% 2% 3% 1% -2% -2% -5% 0% -5% 8%

Avg. sales: change in sales for the item IBM desktop computer per region
Month: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
North: -1% -3% -1% 0% 3% 4% -7% 1% 0% -3% -3%
South: -1% 1% -9% 6% -1% -39% 9% -34% 4% 1% 7%
East: -1% -2% 2% -3% 1% 18% -2% 11% -3% -2% -1%
West: 4% 0% -1% -3% 5% 1% -18% 8% 5% -8% 1%

Complex Aggregation at Multiple Granularities: Multifeature Cubes
• Example 1: Query 1: A simple data cube query. Find the total sales in 2000, broken down by item, region, and month, with subtotals for each dimension.
• Example 2: Query 2: A complex query. Grouping by all subsets of {item, region, month}, find the maximum price in 2000 for each group, and the total sales among all maximum price tuples.
    select    item, region, month, MAX(price), SUM(R.sales)
    from      Purchases
    where     year=2000
    cube by   item, region, month: R
    such that R.price=MAX(price)

• Example 3: Query 3: An even more complex query. Grouping by all subsets of {item, region, month}, find the maximum price in 2000 for each group. Among the maximum price tuples, find the minimum and maximum item shelf life. Also find the fraction of the total sales due to tuples that have minimum shelf life within the set of all maximum price tuples, and the fraction of the total sales due to tuples that have maximum shelf life within the set of all maximum price tuples.
    select    item, region, month, MAX(price), MIN(R1.shelf), MAX(R1.shelf),
              SUM(R1.sales), SUM(R2.sales), SUM(R3.sales)
    from      Purchases
    where     year=2000
    cube by   item, region, month: R1, R2, R3
    such that R1.price=MAX(price) and
              R2 in R1 and R2.shelf=MIN(R1.shelf) and
              R3 in R1 and R3.shelf=MAX(R1.shelf)

From Data Warehousing to Data Mining
Data Warehouse Usage
• Information processing
• Analytical processing
• Data mining

From On-Line Analytical Processing to On-Line Analytical Mining
• High quality of data in data warehouses
• Available information processing infrastructure surrounding data warehouses
• OLAP-based exploratory data analysis
• On-line selection of data mining functions

Architecture for On-Line Analytical Mining
[Figure: An integrated OLAM and OLAP architecture. Layer 4, user interface: graphical user interface API, receiving constraint-based mining queries and returning mining results. Layer 3, OLAP/OLAM: OLAM engine and OLAP engine on top of the cube API. Layer 2, multidimensional database: MDDB and meta data on top of the database API. Layer 1, data repository: databases and data warehouse, with filtering, data cleaning, and data integration.]

Data Preprocessing

[Figure: Forms of data preprocessing. Data cleaning. Data integration. Data transformation: -2, 32, 100, 59, 48 becomes -0.02, 0.32, 1.00, 0.59, 0.48. Data reduction: a table with attributes A1, A2, A3, …, A126 and transactions T1, T2, T3, T4, …, T2000 is reduced to attributes A1, A3, …, A115 and transactions T1, T4, …, T1456.]

Data Cleaning
Missing values
1. Ignore the tuple
2. Fill in the missing value manually
3. Use a global constant to fill in the missing value
4. Use the attribute mean to fill in the missing value
5. Use the attribute mean for all samples belonging to the same class as the given tuple
6. Use the most probable value to fill in the missing value

Noisy data
• Binning
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equidepth) bins:
    Bin 1: 4, 8, 15
    Bin 2: 21, 21, 24
    Bin 3: 25, 28, 34
Smoothing by bin means:
    Bin 1: 9, 9, 9
    Bin 2: 22, 22, 22
    Bin 3: 29, 29, 29
Smoothing by bin boundaries:
    Bin 1: 4, 4, 15
    Bin 2: 21, 21, 24
    Bin 3: 25, 25, 34
• Clustering
• Combined computer and human inspection
• Regression
Inconsistent data

Data Integration and Transformation
Data Integration
Data Transformation
• Smoothing
• Aggregation
• Generalization
• Normalization
• Attribute construction

Data Reduction
• Data cube aggregation
• Dimension reduction
• Data compression
• Numerosity reduction
• Discretization and concept hierarchy generation

Dimensionality reduction
1. Stepwise forward selection
2. Stepwise backward elimination
3. Combination of forward selection and backward elimination
• Decision tree induction
• Example:
Forward selection
    Initial attribute set: {A1,A2,A3,A4,A5,A6}
    Initial reduced set: {} → {A1} → {A1,A4}
    Reduced attribute set: {A1,A4,A6}
Backward elimination
    Initial attribute set: {A1,A2,A3,A4,A5,A6}
    → {A1,A3,A4,A5,A6} → {A1,A4,A5,A6}
    Reduced attribute set: {A1,A4,A6}
Decision tree induction
    Initial attribute set: {A1,A2,A3,A4,A5,A6}
    A4?
        Y: A1? (Y: Class1, N: Class2)
        N: A6? (Y: Class1, N: Class2)
    Reduced attribute set: {A1,A4,A6}
Greedy (heuristic) methods for attribute subset selection.

Data Compression
• Wavelet transforms
• Principal components analysis

Numerosity Reduction
• Regression and log-linear models
• Histograms
• Clustering
• Sampling

Histogram Examples
[Figure: A histogram for price using singleton buckets; each bucket represents one price-value/frequency pair. Axes: price ($) and count.]
[Figure: An equiwidth histogram for price, where values are aggregated so that each bucket has a uniform width of $10 (buckets 1-10, 11-20, 21-30). Axes: price ($) and count.]

Discretization And Concept Hierarchy Generation
($0…$1000]
    ($0…$200]: ($0…$100], ($100…$200]
    ($200…$400]: ($200…$300], ($300…$400]
    ($400…$600]: ($400…$500], ($500…$600]
    ($600…$800]: ($600…$700], ($700…$800]
    ($800…$1000]: ($800…$900], ($900…$1000]
A concept hierarchy for the attribute price.

Discretization And Concept Hierarchy Generation for Numeric Data
• Binning
• Histogram analysis
• Cluster analysis
• Entropy-based discretization
• Segmentation by natural partitioning

Concept Hierarchy Generation for Categorical Data
• Specification of a partial ordering of attributes explicitly at the schema level by users or experts
• Specification of a portion of a hierarchy by explicit data grouping
• Specification of a set of attributes, but not of their partial ordering
• Specification of only a partial set of attributes

country: 15 distinct values
province_or_state: 365 distinct values
city: 3,567 distinct values
street: 674,339 distinct values
Automatic generation of a schema concept hierarchy based on the number of distinct attribute values.
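The binning example for noisy data in the Data Cleaning part can be reproduced with a few lines of Python. The sketch below is an illustration under stated assumptions, not a reference implementation: the function names are hypothetical, bin means are rounded to whole dollars to match the slide, and a value exactly halfway between the two bin boundaries is assigned to the lower boundary (the tutorial does not say how such ties are resolved).

```python
def equidepth_bins(sorted_values, depth):
    """Split already-sorted values into consecutive bins of `depth` values each."""
    return [sorted_values[i:i + depth] for i in range(0, len(sorted_values), depth)]


def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) mean of that bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        # Assumption: ties go to the lower boundary.
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


# Sorted data for price (in dollars), as given in the tutorial.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]

bins = equidepth_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
print(bins, by_means, by_boundaries)
```

Run on the tutorial's price data, the sketch yields the same three bins and the same smoothed values as listed in the binning example.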
Credits:
Radmilo Pešić
Branko Golubović
Veljko Milutinović