Chapter 1: Introduction to Data Mining, Warehousing, and Visualization
Modern Data Warehousing, Mining, and Visualization: Core Concepts, by George M. Marakas
Spring 2011
Modern Data Warehousing, Mining & Visualization, 2003, George Marakas

Exercise #5
Due: Mar 24
Points: 20 points
Pratt & Adamski: Premiere Products [PP] and Henry Books [HB] Databases

Assignments must have a cover sheet, table of contents, and index tabs. Use a 3-hole punch notebook (1/2” or smaller). Put your name on the spine of the notebook. Use a tab for PP and a tab for HB.
Use ACCESS, PP, and HB databases.
Redesign both the PP and HB databases as they would be for a data warehouse, as described in Adamson & Venables [Chapters 1 & 2] and Marakas [Chapters 1 & 2]. Use the Star diagram as the basis for their design. Be sure to include a meaningful Time dimension table.
Turn in printouts of the REVISED relationship diagrams, i.e., the Star Diagrams, for both databases.
On separate page(s), clearly identify for each database: fact tables, dimension tables, primary keys, foreign keys, alternate keys, etc. Use relational notation from Pratt & Adamski (SEE CHAP 9, ALSO). Indicate the normal form [1NF, 2NF, 3NF, etc.] of each table.
NOTE: Use the ORIGINAL copy of the Premiere Products and Henry Books databases for this assignment.

Homework #5 Scoresheet

Objectives
What is the purpose and motivation for developing a Data Warehouse (DW)?
Position of the DW within the IT infrastructure
Relationship between the DW and business data marts
What can a DW do?
Foundations for Data Mining
Steps in a typical data mining project
What is a “Correlation”?
KEY CONCEPT
History of Data Visualization vis-à-vis DW

1-1: The Modern Data Warehouse
A data warehouse is a copy of transaction data specifically structured for querying, analysis and reporting.
Note that the data warehouse contains a copy of the transactions. These are not updated or changed later by the transaction system.
Also note that this data is specially structured, and may have been transformed when it was placed in the warehouse.

1-2: Data Warehouse Roles and Structures
The DW has the following primary functions:
It is a direct reflection of the business rules of the enterprise.
It is the collection point for strategic information.
It is the historical store of strategic information.
It is the source of information later delivered to data marts.
It is the source of stable data regardless of how the business processes may change.

Elements of a DW
Extract
Transform
Store/Load [ETS or ETL]

Position of the Data Warehouse Within the Organization – Figure 1-2

Data Marts
A data mart is a smaller, more focused data warehouse. It reflects the business rules of a specific business unit.
The data mart does not need to cleanse its data because that was done when it went into the warehouse.
It is a set of tables for direct access by users. These tables are designed for aggregation.

Data Marts and the Data Warehouse – Figure 1-6
Legacy systems feed data to the warehouse. The warehouse feeds specialized information to departments and data marts, and vice versa.
[Figure 1-6 diagram labels: Operational Data Stores; Organizational Data Warehouse; Finance, Sales, Marketing, and Accounting Data Marts]

The Data Mart is More Specialized – Figure 1-7
The data mart serves the needs of one business unit, not the organization.

Organizational Data Warehouse      Data Marts
Corporate                          Departmentalized
Highly granular data               Summarized, aggregated data
Normalized design                  Star join design
Robust historical data             Limited historical data
Large data volume                  Limited data volume
Data model driven data             Requirements driven data
Versatile                          Focused on departmental needs
General purpose DBMS technologies  Multi-dimensional DBMS technologies

[Figure 1-7 diagram labels: Organizational Data Warehouse; Finance, Sales, Marketing, and Accounting Data Marts]

1-3: What Can a Data Warehouse Do?
Some of the benefits of a DW are:
Fast information delivery
Data integration from across and even outside the organization
Future vision from historical trends
Additional tools for looking at data in new ways
Freedom from IS department resource limitations (you don’t need programmers, but rather data analysts, to use a data warehouse)
Customer Relationship Management [CRM]
Customer Service Relationships [CRS]
Mining or auditing for accounting irregularities [Fraud]

Data Mining Example
Service Quality vs. Training
Courtesy: MicroStrategy (2005)

Examples of Common DW Applications – Table 1-1

Sales Analysis
Determine real-time product sales to make vital pricing and distribution decisions.
Analyze historical product sales to determine success or failure attributes.
Evaluate successful products and determine key success factors.
Use corporate data to understand the margin as well as the revenue implications of a decision.
Rapidly identify preferred customer segments based on revenue and margin.
Quickly isolate past preferred customers who no longer buy.
Identify daily what product is in the manufacturing and distribution pipeline.
Instantly determine which salespeople are performing, on both a revenue and margin basis, and which are behind.

Financial Analysis
Compare actuals to budgets on an annual, monthly and month-to-date basis.
Review past cash flow trends and forecast future needs.
Identify and analyze key expense generators.
Instantly generate a current set of key financial ratios and indicators.
Receive near-real-time, interactive financial statements.

Human Resource Analysis
Evaluate trends in benefit program use.
Identify the wage and benefits costs to determine company-wide variation.
Review compliance levels for EEOC and other regulated activities.

Other Areas
Warehouses have also been applied to areas such as: logistics, inventory, purchasing, detailed transaction analysis and load balancing.

Comparison of Typical DW Costs and Benefits – Table 1-2

Costs
Hardware, software, development personnel and consultant costs.
Operational costs like ongoing systems maintenance.

Benefits
Added revenue
Will the new (business objective) process generate new customers (what is the estimated value?)
Will the new (business objective) process increase the buying propensity of existing customers (by how much?)
Is the new process necessary to ensure that the competition doesn't offer a demanded service that you can't match?
Reduced costs
What costs of current systems will be eliminated?
Is the new process intended to make some operation more efficient? If so, how and what is the dollar value?
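
Once dollar estimates are attached to the questions in Table 1-2, the comparison reduces to simple arithmetic. The following is a minimal sketch of that arithmetic, not a method from the text: the class name `WarehouseCase`, its fields, and every dollar figure are hypothetical placeholders chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class WarehouseCase:
    """Hypothetical yearly dollar estimates for a DW business case."""
    added_revenue: float    # new customers plus increased buying propensity, per year
    reduced_costs: float    # eliminated systems plus efficiency gains, per year
    one_time_costs: float   # hardware, software, development personnel, consultants
    recurring_costs: float  # ongoing systems maintenance, per year

    def net_benefit(self, years: int) -> float:
        """Total benefits minus total costs over the given number of years."""
        benefits = (self.added_revenue + self.reduced_costs) * years
        costs = self.one_time_costs + self.recurring_costs * years
        return benefits - costs


# All figures below are made-up placeholders, not data from the text.
case = WarehouseCase(
    added_revenue=400_000,
    reduced_costs=150_000,
    one_time_costs=1_000_000,
    recurring_costs=200_000,
)

for years in (1, 3, 5):
    print(f"{years} year(s): net benefit = {case.net_benefit(years):,.0f}")
```

With these placeholder figures the net benefit is negative in the first year and turns positive by the third, which is why the horizon chosen for the comparison matters as much as the individual estimates.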
1-4: The Cost of DW
Expenditures can be categorized as one-time initial costs or as recurring, ongoing costs. The initial costs can further be identified as hardware or software costs.
Expenditures can also be categorized as capital costs (associated with acquisition of the warehouse) or as operational costs (associated with running and maintaining the warehouse).
Cost of a Data Warehouse: Rule of Thumb: $1 million per 1 Terabyte of data. Courtesy Walmart Corporation.

Expenditures Associated with Building a DW – Table 1-3

Capital, One-Time Costs
Hardware: Disk; CPU; Network; Terminal analysis
Software: DBMS; Terminal analysis; Middleware; Log utility; Processing; Metadata infrastructure

Capital, Recurring Costs
Hardware maintenance
Software maintenance
Terminal analysis
Middleware

Operational, One-Time Costs
Integration/transformation processing specification
Metadata infrastructure population
System of record definition
Data dictionary language definition
Network transfer definition
CASE/Repository interface
Initial data warehouse population
Data model definition
Database design definition

Operational, Recurring Costs
Ongoing refreshment
Integration transformation
Data model maintenance
Record identification maintenance
Metadata infrastructure maintenance
Archival of data
Data aging within the DW

1-5: Data Mining: Farmers and Explorers
Every corporation has two types of DW users.
Farmers [Traditional Statistical Hypothesis testing] know what they want before they set out to find it. They submit small queries and retrieve small nuggets of information.
Explorers [Data Mining] are quite unpredictable. They often submit large queries. Sometimes they find nothing, sometimes they find priceless “golden” nuggets.
Cost justification for the DW is usually done on the basis of the results obtained by farmers, since explorers are unpredictable.

1-6: Foundations of Data Mining
Data mining is the process of using raw data to infer important business relationships.
Despite a consensus on the value of data mining, a great deal of confusion exists about what it is.
It is a collection of powerful techniques intended for analyzing large datasets.
There is no single data mining approach, but rather a set of techniques that can be used in combination with each other.

1-6 & -7: The Foundations of Data Mining
Data mining has roots in practice dating back over 30 years using standard statistics [e.g., bio-statistics; BIOMED software and mainframe computers (1960’s)].
In the early 1960s, data mining was called statistical analysis, and the pioneers were statistical software companies such as SAS and SPSS.
By the 1980s, the traditional techniques had been augmented by new methods such as fuzzy logic, heuristics and neural networks.
Also, DSS tools came into popular use in the 1980’s with tools such as Lotus 1-2-3 & EXCEL.

Data Mining – A General Approach
Although all data mining endeavors are unique, they possess a common set of process steps:
1. Infrastructure preparation – choice of hardware platform, the database system and one or more mining tools
2. Exploration – looking at summary data, sampling and applying intuition [Data visualization useful here]
3. Analysis – each discovered pattern is analyzed for significance and trends

A General Approach (continued)
4. Interpretation – once patterns have been discovered and analyzed, the next step is to interpret them. Considerations include business cycles, seasonality and the population the pattern applies to.
5. Exploitation – this is both a business and a technical activity. One way to exploit a pattern is to use it for prediction. Others are to package, price or advertise the product in a different way.

1.8: The Approach to Data Exploration and Data Mining
The basis for all data mining activities is CORRELATION.
[Figure: pairs of elements A and B illustrating a perfect correlation, a strong correlation, and a weak correlation]

The Spectrum of Correlation
1/(-1)    Perfect (Inverse) Correlation
.5/(-.5)  Moderate Correlation
0         No Correlation
In general, a correlation coefficient has a magnitude between 0 and 1 that shows the strength of a relationship. Some types of correlation are signed (±) to also show the direction of the relationship.
Even a weak correlation can be interesting, however, if it shows a trend over time.

Perfect positive or negative correlations
[Figure: perfect positive (+1) and perfect negative (-1) correlations]

Methods to Determine Correlation
The method used depends on the type of elements being correlated.
Data element vs. data element, e.g., Sales of Digital Cameras vs. Sales of 35mm film
Data element vs. unit of time, e.g., Sales of Digital Cameras vs. Months; TIME SERIES
Data element vs. data element groups, e.g., Sales of Digital Cameras vs. Public or Private Schools
Data element vs. geography, e.g., Sales of Digital Cameras vs. Region
Data element vs. external trends, e.g., Sales of Digital Cameras vs. Tax cuts
Data element vs. demographics, e.g., Sales of Digital Cameras vs. Age

The Data Warehouse and Data Mining
Data mining does not require the use of a data warehouse (DW); however, DWs are designed with data mining in mind.
The data in the DW is integrated and stable (non-volatile).
Data changes continuously in an operational database. If multiple analyses are run in sequence, the data need to be held constant (as in a DW).

Volumes of Data – The Biggest Challenge
The largest challenge a “data miner” may face is the sheer volume of data (number of rows vs. the number of bytes) in the warehouse.
It is quite important, then, that summary data also be available to get the analysis started.
A major problem is that this sheer volume may mask the important relationships the analyst is interested in.
The ability to overcome the volume and visualize the data becomes quite important.

RFID Technology
http://www.pbs.org/newshour/bb/science/july-dec06/rfid_08-17.html

1.9: Foundations of Data Visualization [DV]
One of the earliest known examples of data visualization was in London during the 1854 cholera epidemic. A map (next slide) helped to identify the source of the disease.
Modern visualization techniques grew from the twin technologies of computer graphics and high performance computing in the 1970s and 1980s.

Dr. John Snow used a map to show the source of cholera was a water pump, thus proving the disease was water borne.
[Map label: Broad Street Pump]

DV: Opportunity and Timing
Alternative input devices (light pen, sketch pad and mouse) began to appear in the 1960s.
In the 1970s, flight simulators became much more realistic when graphics replaced film.
In the same decade, special effects computers became entrenched in the entertainment industry.
In the 1980s, visualization grew more dynamic with applications like the animation of weather patterns.

One of today’s more useful types of visualization is in simulators (both in games and in practice). This is the only way most of us will ever fly a Boeing 747 [Note: Instrument panel or Dashboard].

Data Visualization – Sales by Region
Typical Spreadsheet Graphic
[Figure: spreadsheet chart of East, West, and North values by quarter, 1st Qtr through 4th Qtr; vertical axis from 0 to 90]

Data Visualization – Total Precipitation
[Figure]

DV & DM: Future Success Drivers
In the 1990s, rapid advances in chip technology, both at the CPU and the graphics processor, put data visualization everywhere. – Moore’s Law!
Ongoing reduced costs of computing. Each new generation brings a 10X-100X performance-cost improvement, approximately every 18 months [Moore’s Law].
Web-based E-commerce: Business to Consumer Commerce [B to C; and C:C] generates billions and even trillions of characters per reporting period.
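
The 18-month figure can be turned into rough arithmetic. The following is a minimal sketch, assuming the popular reading of Moore's Law as one doubling of price-performance every 18 months. The doubling assumption and the function name `growth_factor` are illustrative and do not come from the slides.

```python
def growth_factor(years: float, doubling_months: float = 18.0) -> float:
    """Compound improvement after `years`, assuming one doubling per period.

    The 18-month doubling period is an assumption, not a measured value.
    """
    return 2.0 ** (years * 12.0 / doubling_months)


for years in (1.5, 3, 5, 10):
    print(f"{years:>4} years -> about {growth_factor(years):.1f}x")
```

Under this assumption, an improvement of roughly 10X accumulates over about five years and roughly 100X over about ten, so a 10X-100X range would correspond to several doubling periods rather than a single 18-month step.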