Download What is a Data Warehouse?

Data Mining and Data Warehousing: Concepts and Techniques Course outlines Motivation Evolution of Database Technology - overview Why Data Mining? — Potential Applications What Is Data Mining? Data Mining: A KDD Process Data Mining: On What Kind of Data? What is a Data Warehouse? Data Warehouse vs. other systems, OLTP vs. OLAP Conceptual Modeling of Data Warehouses Defining a Snowflake Schema in Data Mining Query Language DMQL Multi-Tiered Architecture - Approaches to Building OLAP Server Indexing OLAP Data: Bitmap Index Data Warehouse Back-End Tools and Utilities From OLAP to On Line Analytical Mining OLAM, An OLAM Architecture Data Mining Functionalities Are All the “Discovered” Patterns Interesting? Market-Basket Data; typical case, Frequent Pairs in SQL, A-Priori Algorithm Motivation Data explosion problem: Automated data collection tools and database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories. Data are collected from everywhere and in huge amounts  We are Data Rich but Information Poor  How to make good use of your data? Data warehousing and data mining  On-line analytical processing  Extraction of interesting knowledge (rules, patterns,…) from data in large databases.  Bring together scattered information from multiple sources as to provide a consistent database source for decision support queries.  Provide architectures and tools for business executives to systematically organize, understand, and use their data to make strategic decisions. 2 Evolution of Database Technology - overview 1960s: Data collection, database creation, IMS and network DBMS. 1970s: Relational data model, relational DBMS implementation. 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.). 1990s: Data mining and data warehousing, multimedia databases, and Web technology. 3 Why Data Mining? — Potential Applications  Database analysis and decision support  Market analysis and management  target marketing,  customer relation management, Customer profiling, Identifying customer requirements, market basket analysis, cross selling, market segmentation.  Risk analysis and management  Finance planning and asset evaluation:  Forecasting, customer retention, improved underwriting, quality control, competitive analysis.  Fraud detection and management  Applications- widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.  Approach- use historical data to build models of fraudulent behavior and use data mining to help identify similar instances. More examples:  Detecting inappropriate medical treatment  Detecting telephone fraud  money laundering - detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)  Other Applications:  Text mining and Web analysis.  Etc. 4 What Is Data Mining?  Data mining (part of knowledge discovery in databases):  Extraction of interesting ( non-trivial, implicit, previously unknown and potentially useful) information from data in large databases  Alternative names and their “inside stories”:  Data mining: a misnomer?  Knowledge Discovery in Databases (KDD: SIGKDD), knowledge extraction, data archeology, data dredging, information harvesting, business intelligence, etc.  What is not data mining?  (Deductive) query processing.  Expert systems or small statistical programs 5 Data Mining: A KDD Process Data mining: the core of knowledge discovery process. Pattern Evaluation Knowledge Data Mining Task-relevant Data Data Warehouse Data Cleaning Selection Data Integration Databases 6 Knowledge Pattern Evaluation Data Mining Taskrelevant Data Steps of a KDD Process Data Warehouse Data Cleanin g Selection Data Integration Learning the application domain – relevant prior knowledge and goals of application Data where-housing Databases  Creating a target data set: data selection  Data cleaning and preprocessing: (may take 60% of effort!)  Data reduction and projection – Find useful features, dimensionality/variable reduction, invariant representation. Data mining  Choosing functions of data mining - summarization, classification, regression, association, clustering.  Choosing the mining algorithm(s)  Data mining: search for patterns of interest  Interpretation: analysis of results - visualization, transformation, removing redundant patterns, etc.  Use of discovered knowledge 7 Data Mining and Business Intelligence Increasing potential to support business decisions Making Decisions Data Presentation Visualization Techniques Data Mining Information Discovery End User Business Analyst Data Analyst Data Exploration Statistical Analysis, Querying and Reporting Data Warehouses / Data Marts OLAP, MDA Data Sources DBA Paper, Files, Information Providers, Database Systems, OLTP 8 Data Mining: On What Kind of Data? Data mining is performed on data coming from: Relational databases Transactional databases Advanced DB systems and information repositories  Object-oriented and object-relational databases  Spatial databases  Time-series data and temporal data  Text databases and multimedia databases  Heterogeneous and legacy databases  WWW … and accumulated in a data warehouse for long periods of time (several months or sometimes years) 9 What is a Data Warehouse? Defined in many different ways, but not rigorously.  A decision support database that is maintained separately from the organization’s operational database  Support information processing by providing a solid platform of consolidated, historical data for analysis. Knowledge Pattern Evaluation Data Mining “A data warehouse is a Taskrelevant Data subject-oriented, integrated, time-variant, W. H. Inmon and nonvolatile collection of data in support of management’s decision-making process.” Data Warehouse Data Cleanin g Selection Data Integration Databases Data warehousing: The process of constructing and using data warehouses 10 Knowledge What is a data warehouse? Pattern Evaluation Data Mining Data Warehouse (1/2) Taskrelevant Data Data Warehouse Data Cleanin g Subject Oriented Selection Data Integration Databases  Organized around major subjects, such as customer, product, sales.  Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.  Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process. Integrated  Integrate multiple, heterogeneous data sources - relational databases, flat files, on-line transaction records  Data cleaning and data integration techniques are applied.  Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources (E.g., Hotel price: currency, tax, breakfast covered, etc.)  When data is moved to the warehouse, it is converted. 11 Knowledge What is a data warehouse? Pattern Evaluation Data Mining Taskrelevant Data Data Warehouse (2/2) Time Variant Data Warehouse Data Cleanin g Selection Data Integration Databases  The time horizon for the data warehouse is significantly longer than that of operational systems.  Operational database: current value data.  Data Warehouse data: provide information from a historical perspective (e.g., past 5-10 years)  Every key structure in the data warehouse  Contains an element of time, explicitly or implicitly  But the key of operational data may or may not contain “time element”. Non-Volatile  A physically separate store of data transformed from the operational environment.  Operational update of data does not occur in the data warehouse environment.  Does not require transaction processing, recovery, and concurrency control mechanisms  Requires only two operations in data accessing: initial loading of data and access of data. 12 What is a data warehouse? Data Warehouse vs. Heterogeneous DBMS Traditional heterogeneous DB integration: Build wrappers/mediators on top of heterogeneous databases Query driven approach: When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for individual heterogeneous sites involved, and the results are integrated into a global answer set. Complex information filtering, compete for sources Data warehouse: update-driven, high performance. 13 What is a data warehouse? Data Warehouse vs. Operational DB Systems OLTP (on-line transaction processing) Major task of traditional relational DBMS - Most database operations are of a type called OLTP. Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc. OLAP (on-line analytical processing) Major task of data warehouse system Data analysis and decision making Distinct features (OLTP vs. OLAP):      User and system orientation: customer vs. market Data contents: current, detailed vs. historical, consolidated Database design: ER + application vs. star + subject View: current, local vs. evolutionary, integrated Access patterns: update vs. read-only but complex queries 14 What is a data warehouse? The most complex OLAP queries are often referred to as data mining OLTP vs. OLAP Common architecture  Local databases, say one per branch store, handle OLTP,  while a warehouse integrating information from all branches handles OLAP. OLTP OLAP users Clerk, IT professional Knowledge worker function day to day operations decision support application-oriented current, up-to-date detailed, flat relational isolated repetitive subject-oriented historical, summarized, multidimensional integrated, consolidated ad-hoc DB design data usage access read/write index/hash on prim. key unit of work lots of scans short, simple transaction complex query tens millions #users thousands hundreds DB size 100MB-GB 100GB-TB metric transaction throughput query throughput, response # records accessed 15 Knowledge What is a data warehouse? Pattern Evaluation Data Mining Taskrelevant Data Why Separate Data Warehouse?  High performance for both systems: Data Warehouse Data Cleanin g Selection Data Integration Databases  DBMS — tuned for OLTP: access methods, indexing, concurrency control, recovery  Warehouse — tuned for OLAP: complex OLAP queries, multidimensional view, consolidation.  Different functions and different data: missing data: Decision support requires historical data which operational DBs do not typically maintain data consolidation: Decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled. 16 A multi-dimensional data model Conceptual Modeling of Data Warehouses Modeling data warehouses: dimensions & measurements Star schema: A single object (fact table) in the middle connected to a number of objects (dimension tables) Snowflake schema: A refinement of star schema where the dimensional hierarchy is represented explicitly by normalizing the dimension tables. Fact constellations: Multiple fact tables share dimension tables. 17 Knowledge Conceptual Modeling of Data Warehouses Pattern Evaluation Data Mining Taskrelevant Data Example of Star Schema Data Warehouse Data Cleanin g Date Day Month Year Selection Data Integration Databases Product Sales Fact Table Date Store Product StoreID City State Country Region Store Customer unit_sales dollar_sales Yen_sales ProductNo ProdName ProdDesc Category QOH Cust CustId CustName CustCity CustCountry Measurements 18 Knowledge Conceptual Modeling of Data Warehouses Example of Snowflake Pattern Evaluation Data Mining Taskrelevant Data Schema Data Warehouse Data Cleanin g Selection Data Integration Databases Year Year Product Month Month Year Date Day Month Store City City State State Country Country Region StoreID City State Country Sales Fact Table Date Product Store Customer unit_sales dollar_sales Yen_sales ProductNo ProdName ProdDesc Category QOH Cust CustId CustName CustCity CustCountry Measurements 19 Conceptual Modeling of Data Warehouses A Data Mining Query Language: DMQL Language Primitives Cube Definition ( Fact Table ) Knowledge Pattern Evaluation Data Mining Taskrelevant Data Data Warehouse Data Cleanin g Selection Data Integration Databases define cube <cube_name> [<dimension_list>]: <measure_list> Dimension Definition ( Dimension Table ) define dimension <dimension_name> as (<attribute_or_subdimension_list>) Special Case (Shared Dimension Tables) First time as “cube definition” define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time> 20 Conceptual Modeling of Data Warehouses Defining a Snowflake Schema in DMQL define cube sales [date, product, store, customer]: dollar_sales = sum(sales_in_dollars), yen_sales = sum(sales_in_yens), unit_sales = count(*) define dimension product as (product_no, prod_name,prod_desc, category, QOH) define dimension cust as (custID, cust_name, cust_city, cust_country) define dimension date as (day, month (month_key, year (year_key) ) ) define dimension store as ( storeID, city ( city_key, state( state_key, country(country_key, region) ))) 21 A multi-dimensional data model A Sample Data Cube 2Qtr 3Qtr 4Qtr sum U.S.A Canada Mexico Country TV PC VCR sum 1Qtr Date Total annual sales of TV in U.S.A. sum 22 Data warehousing Typical OLAP Operations  Roll up (drill-up): summarize data; by climbing up hierarchy or by dimension reduction  Drill down (roll down): reverse of roll-up; from higher level summary to lower level summary or detailed data, or introducing new dimensions  Slice and dice: project and select  Pivot (rotate): reorient the cube, visualization, 3D to series of 2D planes.  Other operations  drill across: involving (across) more than one fact table.  drill through: through the bottom level to its back-end relational tables. 23 Data warehouse software architecture Multi-Tiered Architecture Other sources Operational DBs Metadata Extract Transform Load ETL Refresh Monitor & Integrator Data Warehouse OLAP Server Serve Analysis Query Reports Data mining Data Marts Data Sources Data Storage OLAP Engine Front-End Tools 24 Data warehouse software architecture Three-Tier Data Warehouse Architecture  Enterprise warehouse: collects all of the information about subjects spanning the entire organization.  Data Mart: Other sourc es Operational DBs Metadata Extract Transform Load ETL Refresh Monitor & Integrator Data Warehouse OLAP Server Serv e Analysis Query Reports Data mining Data Marts a subset of corporate-wide data that is of value to a specific groups of users.  Its scope is confined to specific, selected groups, such as marketing data mart.  Independent vs. dependent (directly from warehouse) data mart  Virtual warehouse:  A set of views over operational databases.  Only some of the possible summary views may be materialized. 25 Building Data Warehouses Approaches to Building Warehouses … OLAP Server Architectures Relational OLAP (ROLAP): relational database system tuned for star schemas, e.g., using special index structures such as:  “Bitmap indexes” (for each key of a dimension table, e.g., bar name, a bit-vector telling which tuples of the fact table have that value).  Materialized views = answers to general queries from which more specific queries can be answered with less work than if we had to work from the raw data. Use relational or extended-relational DBMS to store and manage warehouse data and OLAP middle ware to support missing pieces. Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and services Multidimensional OLAP (MOLAP) - A specialized model based on a “cube” view of data. Array-based multidimensional storage engine (sparse matrix techniques) fast indexing to pre-computed summarized data Hybrid OLAP (HOLAP) - User flexibility, e.g., low level: relational, high-level: array. Specialized SQL servers - specialized support for SQL queries over star, snowflake schemas 26 Building Data Warehouses Indexing OLAP Data: Bitmap Index Index on a particular column Each value in the column has a bit vector: bit-op is fast The length of the bit vector: # of records in the base table The i-th bit is set if the i-th row of the base table has the value for the indexed column not suitable for high cardinality domains Base table Cust C1 C2 C3 C4 C5 Region Asia Europe Asia America Europe Type Retail Dealer Dealer Retail Dealer Index on Region Join Indices  Join index: JI(R-id, S-id) where R (R-id, …)  S (S-id, …)  Traditional indices map the values to a list of record ids  In data warehouses, join index relates the values of the dimensions of a start schema to rows in the fact table.  E.g. fact table: Sales and two dimensions city and product ; A join index on city maintains for each distinct city a list of RIDs of the tuples recording the Sales in the city  Join indices can span multiple dimensions RecID Asia 1 1 2 0 3 1 4 0 5 0 Europe 0 1 0 1 0 America 0 0 0 0 1 Index on Type RecID 1 2 3 4 5 Retail 1 0 0 1 0 Dealer 0 1 1 0 1 27 Building Data Warehouses Metadata Repository Meta data are the data defining warehouse objects  Description the structure of the warehouse - schema, view, dimensions, hierarchies, derived data defn, data mart locations and contents  Operational meta-data - data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)  The algorithms used for summarization  The mapping from operational environment to the data warehouse  Data related to system performance - warehouse schema, view and derived data definitions  Business data - business terms and definitions, ownership of data, charging policies 28 Building Data Warehouses Data Warehouse Back-End Tools and Utilities Other sourc es Operational DBs Metadata Extract Transform Load ETL Refresh Monitor & Integrator Data Warehouse OLAP Server Serv e Analysis Query Reports Data mining Data Marts Data Extraction: get data from multiple, heterogeneous, and external sources Data cleaning: detect errors in the data and rectify them when possible Data Transformation: convert data from legacy or host format to warehouse format Load: sort, summarize, consolidate, compute views, check integrity, and build indices and partitions Refresh: propagate the updates from the data sources to the warehouse 29 From data warehousing to data mining Data Warehouse Usage Three kinds of data warehouse applications Information processing - supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs Analytical processing  multidimensional analysis of data warehouse data  supports basic OLAP operations, slice-dice, drilling, pivoting Data mining  knowledge discovery from hidden patterns  supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools. 30 From data warehousing to data mining From On-Line Analytical Processing OLAP to On Line Analytical Mining OLAM  Why online analytical mining?  High quality of data in data warehouses DW contains integrated, consistent, cleaned data Available information processing structure surrounding data warehouses ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools OLAP-based exploratory data analysis mining with drilling, dicing, pivoting, etc. On-line selection of data mining functions integration and swapping of multiple mining functions, algorithms, and tasks.  Architecture of OLAM 31 From data warehousing to data mining An OLAM Architecture Mining query On Line Analytical Mining Mining result Layer4 User Interface User GUI API OLAM Engine Data Cube API OLAP Engine MDDB Meta Data Database API Filtering & Integration Databases Data cleaning Data integration Filtering Data Warehouse Layer3 OLAP/OLAM Layer2 MDDB Layer1 Data Repository 32 Data Mining   Data mining is a popular term for queries that summarize big data sets in useful ways. Large-scale queries designed to extract patterns from data. Examples: Clustering all Web pages by topic. Finding characteristics of fraudulent credit-card use. Knowledge Pattern Evaluation Data Mining Taskrelevant Data Data Mining Functionalities  Concept description: Characterization and Comparison Generalize, summarize, and possibly contrast data characteristics. Data Warehouse Data Cleanin g Selection Data Integration Databases  Association  From association, correlation, to causality.  finding rules like “inside(x, city) à near(x, highway)”.  Classification and Prediction  Classify data based on the values in a classifying attribute, e.g., classify countries based on climate, or classify cars based on gas mileage.  Predict some unknown or missing attribute values based on other information.  Cluster analysis – Group data to form new classes, e.g., cluster houses to find distribution patterns.  Outlier and exception data analysis  Time-series analysis (trend and deviation)  Trend and deviation analysis: regression, sequential pattern, similar sequences, trend and deviation, e.g., stock analysis.  Similarity-based pattern-directed analysis  Full vs. partial periodicity analysis  Other pattern-directed or statistical analysis 34 Knowledge Pattern Evaluation Are All the “Discovered” Patterns Interesting? Data Mining Taskrelevant Data Data Warehouse Data Cleanin g Selection Data Integration  A data mining system/query may generate thousands of patterns, not all of them are interesting - Suggested approach: Human-centered, query-based, focused mining Databases  Interestingness measures: A pattern is interesting if it is     easily understood by humans valid on new or test data with some degree of certainty. potentially useful novel, or validates some hypothesis that a user seeks to confirm  Objective vs. subjective interestingness measures:  Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.  Subjective: based on user’s beliefs in the data, e.g., unexpectedness, novelty, etc. Can It Find All and Only Interesting Patterns?  Find all the interesting patterns: Completeness - Can a data mining system find all the interesting patterns?  Search for only interesting patterns: Optimization.  Can a data mining system find only the interesting patterns?  Approaches  First general all the patterns and then filter out the uninteresting ones.  Generate only the interesting patterns --- mining query optimization 35 Market-Basket Data; typical case An important form of mining from relational data involves market baskets = sets of “items” that are purchased together as a customer leaves a store. Summary of basket data is frequent itemsets itemsets = sets of items that often appear together in baskets. Example: Market Baskets If people often buy hamburger and ketchup together, the store can: 1. Put hamburger and ketchup near each other and put potato chips between. 2. Run a sale on hamburger and raise the price of ketchup . Finding Frequent Pairs  The simplest case is when we only want to find “frequent pairs” of items.  Assume data is in a relation Baskets(basket, item).  The support threshold s is the minimum number of baskets in which a pair appears before we are interested. 36 Frequent Pairs in SQL SELECT b1.item, b2.item FROM Baskets b1, Baskets b2 WHERE b1.basket = b2.basket AND b1.item < b2.item GROUP BY b1.item, b2.item HAVING COUNT(*) >= s; Look for two Basket tuples with the same basket and different items. First item must precede second, so we don’t count the same pair twice. Create a group for each pair of items that appears in at least one basket. Throw away pairs of items that do not appear at least s times. A-Priori Trick Above query is prohibitively expensive for large data.   A-priori algorithm uses the fact that a pair (i, j) cannot have support s unless i and j both have support s by themselves.  More efficient implementation uses an intermediate relation Baskets1(a materialized view to hold only information about frequent items). INSERT INTO Baskets1(bid, item) SELECT * FROM Baskets WHERE item IN ( SELECT Items that appear in item FROM Baskets at least s baskets. GROUP BY item HAVING COUNT(*) >= s ); Then run the query for pairs on Baskets1 instead of Baskets. 37 A-Priori Algorithm Materialize the view Baskets1. Run the obvious query, but on Baskets1 instead of Baskets. Baskets1 is cheap, since it doesn’t involve a join. Baskets1 probably has many fewer tuples than Baskets. Running time shrinks with the square of the number of tuples involved in the join. Example: A-Priori Suppose: 1. A supermarket sells 10,000 items. 2. The average basket has 10 items. 3. The support threshold is 1% of the baskets.  At most 1/10 of the items can be frequent.  Probably, the minority of items in one basket are frequent -> factor 4 speedup. 38

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download What is a Data Warehouse?