Survey
September 9, 2003, UMASSD, CIS, Iren Valova

Outline
- Data mining: Hmm, what is it?
- Data warehousing
- Examples
- Discussions

As if it is not clear already …
Now that we are in the course, perhaps she will really tell us what data mining is all about?!?!
- Data mining: the extraction of implicit, previously unknown, and potentially useful information from large bodies of data, often accumulated for other purposes.
- Mining: large volumes of low-grade data are sifted through to find something interesting.

Data mining vs. database query
- Database query languages are used when finding and reporting information within the database (found in terms of factual entries). They provide data with no analysis.
- Data mining is used when looking for patterns and associations in the database or data warehouse that are not explicitly present or coded in the database. Analysis is needed to unveil hidden patterns and relationships.

It becomes claer… I meant "clear"?
DB can answer:
- get a list of all department store customers who used a credit card to buy a gas grill
- get a list of all patients who have had at least one heart attack and whose cholesterol < 20
- get a list of all credit card holders who used their credit card to purchase more than $300 in groceries during January
DM can answer:
- develop a general profile for credit card customers who take advantage of promotions offered with their credit card billing
- determine which patient is likely to come back to work after a heart attack
- differentiate individuals who are poor credit risks from those who are likely to make their loan payments on time

As usual, a picture is worth millions.

Patient ID  S. Throat  Fever  Sw. Glands  Congest.  Headache  Diagnosis
R           C          C      C           C         C         C
U           I          I      I           I         I         O
1           YES        YES    YES         YES       YES       ST
2           NO         NO     NO          YES       YES       AL
3           YES        YES    NO          YES       NO        COLD
4           YES        NO     YES         NO        NO        ST
5           NO         YES    NO          YES       NO        COLD
6           NO         NO     NO          YES       NO        AL
7           NO         NO     YES         NO        NO        ST
8           YES        NO     NO          YES       YES       AL
9           NO         YES    NO          YES       YES       COLD
10          YES        YES    NO          YES       YES       COLD

Database query: List all patients with Diagnosis of Cold, or Diagnosis of Allergy and Swollen Glands.
Database query: List all patients with Swollen Glands and Diagnosis of ST.

One more picture for good measure

Income Range  Magazine Promo  Watch Promo  Life Ins Promo  Credit Card  Sex     Age
C             C               C            C               C            C       R
I             I               I            O               I            I       I
40-50,000     Yes             No           No              No           Male    45
30-40,000     Yes             Yes          Yes             No           Female  40
40-50,000     No              No           No              No           Male    42
30-40,000     Yes             Yes          Yes             Yes          Male    43
50-60,000     Yes             No           Yes             No           Female  38
20-30,000     No              No           No              No           Female  55
30-40,000     Yes             No           Yes             Yes          Male    35
20-30,000     No              Yes          No              No           Male    27
30-40,000     Yes             No           No              No           Male    43
30-40,000     Yes             Yes          Yes             No           Female  41
40-50,000     No              Yes          Yes             No           Female  43
20-30,000     No              Yes          Yes             No           Male    29
50-60,000     Yes             Yes          Yes             No           Female  39
40-50,000     No              Yes          No              No           Male    55
20-30,000     No              No           Yes             Yes          Female  19

Data warehousing
Well, then we move on with Data Warehouses.

In order to facilitate decision making, the data in a data warehouse are organized around major subjects, such as customer, item, supplier, and activity. The data are stored to provide information from a historical perspective (such as from the past 5-10 years) and are typically summarized. For example, rather than storing the details of each sales transaction, the data warehouse may store a summary of the transactions per item type for each store or, summarized to a higher level, for each sales region.

Suppose that AllElectronics is a successful international company, with branches around the world. Each branch has its own set of databases.
The president of AllElectronics has asked you to provide an analysis of the company's sales per item type per branch for the third quarter. This is a difficult task, particularly since the relevant data are spread out over several databases, physically located at numerous sites. If AllElectronics had a data warehouse, this task would be easy.

A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site. Data warehouses are constructed via a process of data cleaning, data transformation, data integration, data loading, and periodic data refreshing.

A data warehouse is usually modeled by a multidimensional database structure, where each dimension corresponds to an attribute or a set of attributes in the schema, and each cell stores the value of some aggregate measure, such as count or sales-amount.

[Figures: Data warehousing; Multi-dimensional Data Cube (two slides)]

Data warehouse vs. Data Mart
- Data warehouse: collects information about subjects that span an entire organization.
- Data mart: a departmental subset of a data warehouse.

OLAP
[Figure: drill-down on time data]
By providing multidimensional data views and the precomputation of summarized data, data warehouse systems are well suited for On-Line Analytical Processing, or OLAP. OLAP operations make use of background knowledge regarding the domain of the data being studied in order to allow the presentation of data at different levels of abstraction. Such operations accommodate different user viewpoints.
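The multidimensional model and the roll-up idea described above can be sketched in a few lines of Python. This is only an illustrative sketch: the sales facts, the city-to-country mapping, and the function names are assumptions made up for the example, not part of the slides or of any particular OLAP product. Each cube cell holds an aggregate measure (here a sales amount), and rolling up along the location dimension re-aggregates cells at a coarser level of abstraction.

```python
from collections import defaultdict

# Hypothetical sales facts: (quarter, item_type, city, sales_amount).
FACTS = [
    ("Q1", "TV", "Vancouver", 100.0),
    ("Q1", "TV", "Toronto", 150.0),
    ("Q1", "PC", "Frankfurt", 200.0),
    ("Q2", "TV", "Vancouver", 120.0),
    ("Q2", "PC", "Toronto", 80.0),
]

# Hypothetical one-level concept hierarchy for the location dimension.
CITY_TO_COUNTRY = {
    "Vancouver": "Canada",
    "Toronto": "Canada",
    "Frankfurt": "Germany",
}


def build_cube(facts):
    """Aggregate facts into cells keyed by (quarter, item_type, city)."""
    cube = defaultdict(float)
    for quarter, item_type, city, amount in facts:
        cube[(quarter, item_type, city)] += amount
    return dict(cube)


def roll_up_location(cube, hierarchy):
    """Roll up the location dimension from city to country."""
    rolled = defaultdict(float)
    for (quarter, item_type, city), amount in cube.items():
        rolled[(quarter, item_type, hierarchy[city])] += amount
    return dict(rolled)


cube = build_cube(FACTS)
by_country = roll_up_location(cube, CITY_TO_COUNTRY)
print(by_country[("Q1", "TV", "Canada")])  # 250.0
```

Drill-down is the reverse direction: going from the country-level cells back to the finer city-level cells, which is only possible because the detailed cube is still available.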
[Figure: roll-up on address]

DW and OLAP for DM
Data warehousing provides architectures and tools for business executives to systematically organize, understand, and use their data to make strategic decisions.

In the last several years, many firms have spent millions of dollars in building enterprise-wide data warehouses. Many people feel that with competition mounting in every industry, data warehousing is the latest must-have marketing weapon: a way to keep customers by learning more about their needs.

DW and major characteristics
Loosely speaking, a data warehouse refers to a database that is maintained separately from an organization's operational databases. Data warehouse systems allow for the integration of a variety of application systems. They support information processing by providing a solid platform of consolidated historical data for analysis.
A data warehouse is:
- organized around major subjects
- constructed by integrating multiple heterogeneous sources
- a store in which every key structure contains an element of time
- a physically separate store of data transformed from the application data found in the operational environment

Data Warehouse vs. Operational DBMS
OLTP (on-line transaction processing)
- Major task of traditional relational DBMS
- Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
OLAP (on-line analytical processing)
- Major task of data warehouse system
- Data analysis and decision making
Distinct features (OLTP vs. OLAP):
- User and system orientation: customer vs. market
- Data contents: current, detailed vs. historical, consolidated
- Database design: ER + application vs. star + subject
- View: current, local vs. evolutionary, integrated
- Access patterns: update vs. read-only but complex queries

OLTP vs. OLAP

                    OLTP                                 OLAP
users               clerk, IT professional               knowledge worker
function            day-to-day operations                decision support
DB design           application-oriented                 subject-oriented
data                current, up-to-date, detailed,       historical, summarized, multidimensional,
                    flat relational, isolated            integrated, consolidated
usage               repetitive                           ad-hoc
access              read/write, index/hash on prim. key  lots of scans
unit of work        short, simple transaction            complex query
# records accessed  tens                                 millions
# users             thousands                            hundreds
DB size             100MB-GB                             100GB-TB
metric              transaction throughput               query throughput, response

From Tables and Spreadsheets to Data Cubes
- A data warehouse is based on a multidimensional data model which views data in the form of a data cube.
- A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions.

Why Separate Data Warehouse?
- High performance for both systems:
  - DBMS: tuned for OLTP (access methods, indexing, concurrency control, recovery)
  - Warehouse: tuned for OLAP (complex OLAP queries, multidimensional view, consolidation)
- Different functions and different data:
  - missing data: decision support requires historical data, which operational DBs do not typically maintain
  - data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources
  - data quality: different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled

From Tables and Spreadsheets to Data Cubes (cont.)
- Dimension tables, such as item (item_name, brand, type) or time (day, week, month, quarter, year)
- Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables

Cube formation
[Figures: cube formation (two slides)]

Cube: A Lattice of Cuboids
- 0-D (apex) cuboid: all
- 1-D cuboids: time; item; location; supplier
- 2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
- 3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
- 4-D (base) cuboid: time, item, location, supplier

Conceptual Modeling of Data Warehouses
Modeling data warehouses: dimensions & measures
- Star schema: a fact table in the middle connected to a set of dimension tables
- Snowflake schema: a refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
- Fact constellations: multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation

[Figures: star schema example; snowflake schema example; fact constellation schema]

A Data Mining Query Language: DMQL
Cube Definition (Fact Table)
    define cube <cube_name> [<dimension_list>]: <measure_list>
Dimension Definition (Dimension Table)
    define dimension <dimension_name> as (<attribute_or_subdimension_list>)
Special Case (Shared Dimension Tables)
    First time as "cube definition"
    define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>

Defining a Star Schema in DMQL
    define cube sales_star [time, item, branch, location]:
        dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
    define dimension time as (time_key, day, day_of_week, month, quarter, year)
    define dimension item as (item_key, item_name, brand, type, supplier_type)
    define dimension branch as (branch_key, branch_name, branch_type)
    define dimension location as (location_key, street, city, province_or_state, country)

Defining a Snowflake Schema in DMQL
    define cube sales_snowflake [time, item, branch, location]:
        dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
    define dimension time as (time_key, day, day_of_week, month, quarter, year)
    define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))
    define dimension branch as (branch_key, branch_name, branch_type)
    define dimension location as (location_key, street, city(city_key, province_or_state, country))

Defining a Fact Constellation in DMQL
    define cube sales [time, item, branch, location]:
        dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
    define dimension time as (time_key, day, day_of_week, month, quarter, year)
    define dimension item as (item_key, item_name, brand, type, supplier_type)
    define dimension branch as (branch_key, branch_name, branch_type)
    define dimension location as (location_key, street, city, province_or_state, country)
    define cube shipping [time, item, shipper, from_location, to_location]:
        dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
    define dimension time as time in cube sales
    define dimension item as item in cube sales
    define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
    define dimension from_location as location in cube sales
    define dimension to_location as location in cube sales

Measures: Three Categories
- distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning. E.g., count(), sum(), min(), max().
- algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function. E.g., avg(), min_N(), standard_deviation().
- holistic: if there is no constant bound on the storage size needed to describe a subaggregate. E.g., median(), mode(), rank().

A Concept Hierarchy: Dimension (location)
- all: all
- region: Europe, ..., North_America
- country: Germany, ..., Spain; Canada, ..., Mexico
- city: Frankfurt, ...; Vancouver, ..., Toronto
- office: L. Chan, ..., M. Wind

View of Warehouses and Hierarchies
Specification of hierarchies
- Schema hierarchy: day < {month < quarter; week} < year
- Set_grouping hierarchy: {1..10} < inexpensive

Multidimensional Data
- Sales volume as a function of product, month, and region
- Dimensions: Product, Location, Time
- Hierarchical summarization paths:
  - Product: Industry, Category, Product
  - Location: Region, Country, City, Office
  - Time: Year, Quarter, Month or Week, Day

[Figure: Date axis (2Qtr, 3Qtr, 4Qtr, sum); annotation: total annual sales of TV in U.S.A.]
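The three measure categories named in this deck (distributive, algebraic, holistic) can be illustrated with a small Python sketch. The partitioned values below are made up for the example. The point is only that sum() can be combined from partial sums, avg() can be derived from a bounded number of distributive results (a sum and a count), while median() cannot in general be recovered from per-partition medians.

```python
import statistics

# Hypothetical data, already split into three partitions.
partitions = [[4, 8, 15], [16, 23], [42]]
all_values = [v for part in partitions for v in part]

# Distributive: combining partial sums gives the sum over all the data.
partial_sums = [sum(part) for part in partitions]
total = sum(partial_sums)

# Algebraic: avg() is computed from two distributive results, sum and count.
partial_counts = [len(part) for part in partitions]
average = sum(partial_sums) / sum(partial_counts)

# Holistic: the median of per-partition medians is generally not the
# median of the full data set, so no fixed-size summary is enough.
partition_medians = [statistics.median(part) for part in partitions]
median_of_medians = statistics.median(partition_medians)
true_median = statistics.median(all_values)

print(total, average)                   # 108 18.0
print(median_of_medians, true_median)   # 19.5 15.5
```

This distinction matters for cube computation: distributive and algebraic measures can be rolled up from lower-level cuboids, whereas a holistic measure generally has to go back to the underlying data.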
A Sample Data Cube
[Figure: data cube with dimensions Product (TV, PC, VCR, sum), Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), and Country (U.S.A., Canada, Mexico, sum)]

Cuboids Corresponding to the Cube
- 0-D (apex) cuboid: all
- 1-D cuboids: product; date; country
- 2-D cuboids: product,date; product,country; date,country
- 3-D (base) cuboid: product, date, country

Browsing a Data Cube
- Visualization
- OLAP capabilities
- Interactive manipulation

Typical OLAP Operations
- Roll up (drill-up): summarize data by climbing up hierarchy or by dimension reduction
- Drill down (roll down): reverse of roll-up; from higher level summary to lower level summary or detailed data, or introducing new dimensions
- Slice and dice: project and select
- Pivot (rotate): reorient the cube, visualization, 3D to series of 2D planes
- Other operations:
  - drill across: involving (across) more than one fact table
  - drill through: through the bottom level of the cube to its back-end relational tables (using SQL)

A Star-Net Query Model
[Figure: star-net with radial lines Customer Orders, Shipping Method, Time, Product, Location, Organization, Promotion, and Customer. Labels along the lines: CONTRACTS, ORDER; AIR-EXPRESS, TRUCK; ANNUALLY, QTRLY, DAILY; PRODUCT LINE, PRODUCT ITEM, PRODUCT GROUP; CITY, COUNTRY, REGION; SALES PERSON, DISTRICT, DIVISION. Each circle is called a footprint.]

Data Warehouse Design Process
- Top-down, bottom-up approaches or a combination of both
  - Top-down: starts with overall design and planning (mature)
  - Bottom-up: starts with experiments and prototypes (rapid)
- From software engineering point of view
  - Waterfall: structured and systematic analysis at each step before proceeding to the next
  - Spiral: rapid generation of increasingly functional systems, with quick turnaround
- Typical data warehouse design process
  - Choose a business process to model, e.g., orders, invoices, etc.
  - Choose the grain (atomic level of data) of the business process
  - Choose the dimensions that will apply to each fact table record
  - Choose the measure that will populate each fact table record

Multi-Tiered Architecture
[Figure: Data Sources (operational DBs, other sources); Extract, Transform, Load, Refresh, with Monitor & Integrator and Metadata; Data Storage (Data Warehouse, Data Marts); Serve; OLAP Engine (OLAP Server); Front-End Tools (Analysis, Query, Reports, Data mining)]

Three Data Warehouse Models
- Enterprise warehouse: collects all of the information about subjects spanning the entire organization
- Data mart: a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart
  - Independent vs. dependent (directly from warehouse) data mart
- Virtual warehouse:
  - A set of views over operational databases
  - Only some of the possible summary views may be materialized

Data warehouse usage
Business executives in almost every industry use the data collected, integrated, preprocessed, and stored in data warehouses and data marts to perform data analysis and make strategic decisions. In many firms, data warehouses are used as an integral part of a plan-execute-assess "closed-loop" feedback system for enterprise management.
Data warehouses are used extensively in banking and financial services, consumer goods and retail distribution sectors, and controlled manufacturing, such as demand-based production.

Typically, the longer a data warehouse has been in use, the more it will have evolved. This evolution takes place throughout a number of phases. Initially, the data warehouse is mainly used for generating reports and answering predefined queries. Progressively, it is used to analyze summarized and detailed data, where the results are presented in the form of reports and charts. Later, the data warehouse is used for strategic purposes, performing multidimensional analysis and sophisticated slice-and-dice operations. Finally, the data warehouse may be employed for knowledge discovery and strategic decision making using data mining tools.

DW and DM applications
- DW:
  - Information processing supports querying, basic statistical analysis, and reporting.
  - Analytical processing supports basic OLAP operations, including slice-and-dice, drill-down, roll-up, and pivoting. It operates on historical data.
- DM: knowledge discovery by finding hidden patterns and associations, constructing analytical models, and performing classification and prediction.

So, now DW? What about DM?
Say what?
"How does data mining relate to information processing and on-line analytical processing?"
OLAP functions are essentially for user-directed data summary and comparison (by drilling, pivoting, slicing, dicing, and other operations). Information processing, based on queries, can find useful information. However, answers to such queries reflect the information directly stored in databases or computable by aggregate functions. They do not reflect sophisticated patterns or regularities buried in the database. Therefore, information processing is not data mining.
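The difference between a query and a mined pattern can be made concrete with the ten-patient table shown earlier in the deck. The sketch below is only an illustration: the attribute names, the function names, and the very simple "exact rule" search are assumptions chosen for brevity, not a method the slides prescribe. The query returns rows that are literally stored in the table, while the rule search reports a regularity that is not stored anywhere as a fact.

```python
from collections import Counter, defaultdict

# Attribute order for each patient row; the last field is the diagnosis.
ATTRIBUTES = ["sore_throat", "fever", "swollen_glands", "congestion", "headache"]

# Patient table from the deck (ST = strep throat, AL = allergy).
PATIENTS = {
    1: ("YES", "YES", "YES", "YES", "YES", "ST"),
    2: ("NO", "NO", "NO", "YES", "YES", "AL"),
    3: ("YES", "YES", "NO", "YES", "NO", "COLD"),
    4: ("YES", "NO", "YES", "NO", "NO", "ST"),
    5: ("NO", "YES", "NO", "YES", "NO", "COLD"),
    6: ("NO", "NO", "NO", "YES", "NO", "AL"),
    7: ("NO", "NO", "YES", "NO", "NO", "ST"),
    8: ("YES", "NO", "NO", "YES", "YES", "AL"),
    9: ("NO", "YES", "NO", "YES", "YES", "COLD"),
    10: ("YES", "YES", "NO", "YES", "YES", "COLD"),
}


def query_swollen_glands_and_st(patients):
    """Database-style query: patients with swollen glands and diagnosis ST."""
    idx = ATTRIBUTES.index("swollen_glands")
    return sorted(
        pid for pid, row in patients.items()
        if row[idx] == "YES" and row[-1] == "ST"
    )


def find_exact_rules(patients, min_support=3):
    """Toy pattern search: attribute=value pairs that always co-occur
    with a single diagnosis in at least min_support rows."""
    counts = defaultdict(Counter)
    for row in patients.values():
        for attr, value in zip(ATTRIBUTES, row):
            counts[(attr, value)][row[-1]] += 1
    rules = []
    for (attr, value), diagnoses in counts.items():
        support = sum(diagnoses.values())
        if len(diagnoses) == 1 and support >= min_support:
            rules.append((attr, value, next(iter(diagnoses)), support))
    return sorted(rules)


print(query_swollen_glands_and_st(PATIENTS))  # [1, 4, 7]
print(find_exact_rules(PATIENTS))  # [('swollen_glands', 'YES', 'ST', 3)]
```

On ten rows such a rule is of course only suggestive; the example is meant to show the kind of output each approach produces, not to claim a medical finding.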
Data mining covers a much broader spectrum than simple OLAP operations because it not only performs data summary and comparison, but also performs association, classification, prediction, clustering, time-series analysis, and other data analysis tasks.

"Do OLAP systems perform data mining? Are OLAP systems actually data mining systems?"
The functionalities of OLAP and data mining can be viewed as disjoint: OLAP is a data summarization/aggregation tool that helps simplify data analysis, while data mining allows the automated discovery of implicit patterns and interesting knowledge hidden in large amounts of data.

One last attempt: DM rules!
Data mining can help business managers find and reach more suitable customers, as well as gain critical business insights that may help to drive market share and raise profits. In addition, data mining can help managers understand customer group characteristics and develop optimal pricing strategies accordingly, correct item bundling based not on intuition but on actual item groups derived from customer purchase patterns, reduce promotional spending, and at the same time increase the overall net effectiveness of promotions.

An OLAM Architecture
[Figure: four-layer architecture. Layer 4, User Interface: User GUI API, taking a mining query and returning a mining result. Layer 3, OLAP/OLAM: OLAM Engine and OLAP Engine, over the Data Cube API. Layer 2, MDDB: MDDB and Meta Data, built through Filtering & Integration over the Database API. Layer 1, Data Repository: Databases and Data Warehouse, fed through Filtering, Data cleaning, and Data integration.]

Thank you !!!