Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Data Warehouses Kathy S. Schwaig Outline Data Explosion Data Warehouses Multi-dimensional databases Portions of this presentation are adapted from @ J. Han, Simon Fraser University, Canada, 2000 Now that we have gathered so much data, What do we do with it? “ I never waste memory on things that can easily be stored and retrieved from elsewhere.” - Albert Einstein Data Explosion Problem Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases. We are drowning in data, but starving for knowledge! What is a Data Warehouse? An integrated and consistent store of subject-oriented data, structured for query and retrieval in order to support management decision making. A data warehouse is where the information systems department puts data to be turned into information. One cannot just dump masses of data into a disk drive and expect it to be usable. Goal of Data Warehousing Resolve enormous data access difficulties: Unavailable data hidden in transaction systems Delays as underpowered systems try to perform huge, complex queries Complex, user-hostile interfaces Difficulties in discovering patterns in large amounts of data Competition for computer resources between transaction systems and decision support systems On-line Transaction Processing (OLTP) Traditional database management systems (DBMS) used for on-line transaction processing (OLTP). Order entry: update status field of order 445522 Banking: transfer $100 from account 55779 to account 99321 Characteristics: detailed up-to-date data structured, repetitive tasks short transactions read and/or update a few records An OLTP example You call retailer Land’s End, where you have done business before. The exchange might be: “Hi, this is Mr. Smith. I’d like to place an order” “Your phone number, please?” “555-555-1212” (Pulls up your file) “Yes, Mr. Smith. What can I help you with? “I’d like to order merchandise number 2222” “I see you were a little late last year in getting your Christmas presents ordered. Would you like some suggestions to get the process started earlier?” “Sure, Why not?” “Last year, you bought your Aunt Jennifer a scarf. We have a lovely pair of gloves to match --they are on special for only $19. Should I add those to your order?” “Uh...sure.” “And would you like the card to say the same as last year?” “Yes, please.” Decision Support versus Transaction Processing Characteristics and usage patterns of operational systems (transaction processing systems) used to automate business processes and those of a Decision Support System are fundamentally different but linked. Why? What is a Data Warehouse? Facility for integrating data Organizes and stores data for analytical processing from historical perspective Maintained separately from organization’s operational database Data Warehouse Architecture DSS Server metadata other sources Extract Transform Load Refresh Serve Data Warehouse Analysis Query Reports Data mining Operational DBs Tools Data Sources Data Marts Characteristics of a Data Warehouse Subject-oriented Integrated Non-volatile Time-varying 1. Subject Oriented Oriented to the major subject areas of the corporation E.g. insurance company: customer, product, transaction, policy, claim, account Operational database and applications may be organized differently E.g. based on type of insurance's: auto, life, medical, fire. 2. Integrated Inconsistencies in encoding and naming conventions exist among data sources. Why? Data converted 3. Non-Volatile Operational data regularly accessed and manipulated a record at a time. Update performed in operational environment. Warehouse data loaded and accessed. Update of data does not occur in the data warehouse environment. 4. Time Variant... A data warehouse is a “time-variant” collection of data, meaning time is a variable in accessing the data. Time Variant Time horizon for data longer than that of operational systems. Operational database contains current value data. Data warehouse data is a sophisticated series of snapshots. The key structure of operational data may or may not contain some element of time. The key structure of the data warehouse always contains some element of time. Data Mart A data mart is a smaller version of a data warehouse, typically containing data related to a single functional area of the firm or having limited scope in some other way. It can be a useful first step to a full-scale data warehouse. Data: The Critical Issue Users need to gather, analyze, report on business information to help organizations gain competitive advantage. Most companies have a wealth of legacy data. Worthless if: existence unknown cannot be found cannot be understood incorrect Data Transformation Simple transformation -- e.g. change data type of field from integer to character Cleansing & scrubbing -- consistent format, valid values Integration -- data from multiple sources and map field by field into data warehouse. Aggregation / summarization Sample Operations Roll up -- summarize data total sales volume last year by product category by region Roll down, drill down, drill through -- go from higher level summary to lower level summary or detailed data For a particular product category, find the detailed sales data for each salesperson by date Slice and dice Sales of beverages in the West over the last 6 months Why Multi-Dimensional Databases? No single "best" data structure for all applications within an enterprise. Need good conceptual fit with the way end-users visualize business data Most business people already think about their businesses in multidimensional terms Managers tend to ask questions about product sales in different markets over specific time periods Adapted from Arun Rai 1999 What is a Multi-Dimensional Database? A multidimensional database (MDD) is a computer software system designed for the efficient and convenient storage and retrieval of large volumes of data that are: (1) intimately related (2) stored, viewed and analyzed from different perspectives. Perspectives called dimensions. Contrasting Relational and Multi-Dimensional Models: An Example SALES VOLUMES FOR GLEASON DEALERSHIP MODEL MINI VAN MINI VAN MINI VAN SPORTS COUPE SPORTS COUPE SPORTS COUPE SEDAN SEDAN SEDAN COLOR BLUE RED WHITE BLUE RED WHITE BLUE RED WHITE SALES VOLUME 6 5 4 3 5 5 4 3 2 Sales Volumes M O D E L Mini Van 6 5 4 Coupe 3 5 5 Sedan 4 3 2 Blue Red White COLOR Mutlidimensional Representation Sales Volumes M O D E L Mini Van Coupe Carr Gleason Clyde Sedan Blue Red White COLOR DEALERSHIP View Data – An Example Sales Volumes M O D E L DEALERSHIP COLOR •Assume that each dimension has 10 positions, as shown in the cube above •How many records would be there in a relational table? •Implications for viewing data from an end-user standpoint? Data Warehousing and The World Wide Web •Access and transfer large numbers of data relatively easily and economically • Integration of external data into data warehouse •Issues of data integrity, accuracy, quality •Quality rating versus price Applications •Data Mining •Data Visualization (Coming Next) Summary •Data versus Information • Data Warehouse Architecture •Characteristics •Applications Appendix: Operational Data Store and Data Warehouse Characteristic Characteristic Operational Data Store How is it built? One application or subject area at a time. Typically multiple subject areas at a time. User requirements Well defined prior to logical design. Often vague and conflicting. Area of support Day-to-day business operations. Decision support for managerial activities. Type of access Relatively small number of records retrieved via a single query. Tuned for frequent access to small amounts of data. Large data sets scanned to retrieve results from either single or multiple queries. Tuned for infrequent access to large amounts of data. Frequency of access Volume of data Data Warehouse Similar to typical daily volume of Much larger than typical daily operational transactions. transaction volume. Appendix: Operational Data Store and Data Warehouse Characteristic (cont’d) Characteristic Operational Data Store Data Warehouse Retention period Retained as necessary to meet Retention period is indeterminate daily operating requirements. and must support historical reporting, comparison, and analysis Currency of data Up-to-the-minuet; real time. Typically represents a static point in time. Availability of data High and immediate availability may be required. Immediate availability is less critical. Typical unit of analysis Small, manageable, transaction-level units. Large, unpredictable,variable units. Design focus High-performance, limited flexibility. High flexibility, high-performance. Appendix: Characteristics of a Data Warehouse Subject orientation. Integrated. Nonvolatile. Time variant. Summarized. Large volume. Not normalized. Metadata. Data sources. Data are organized based on how the users to them. All inconsistencies regarding naming convention and value representations are removed. Data are stored in read-only format and do not change over time. Data are not current but normally time-series. Operational data are mapped into a decision-usable format. Time-series data sets are normally quite large. DW data can be, and often are, redundant. Data about data are stored. Data come from internal and external unintegrated operational systems