Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
ACCTG 6910 Building Enterprise & Business Intelligence Systems (e.bis) Data Staging Olivia R. Liu Sheng, Ph.D. Emma Eccles Jones Presidential Chair of Business 1 The Business Dimensional Lifecycle Technical Architecture Design Product Selection & Installation Business Project Planning Requirement Dimensional Modeling Physical Design Data Staging Design & Development Deployment Maintenance and Growth Definition End-User Application Specification End-User Application Development Project Management 2 Data Staging DB2 Access Data Staging Data Warehouse (Oracle) Excel Legacy System 3 Data Staging • • • • • • Extraction Data Cleansing Data Integration Transformation Transportation (Loading) Maintenance 4 Extraction • Extract source data from legacy systems and place it in a staging area. • To reduce the impact on the performance of legacy systems, source data is extracted without any cleansing, integration and transformation operations. 5 Extraction • A variety of file formats exist in legacy systems – Relational database: DB2, Oracle, SQL Server, Informix, Access … – Flat file: Excel file, text file • Commercial data extraction tools are very helpful in data extraction. – Ex: Oracle Data Mart Builder 6 Data Preparation (Cleansing) It’s all about data quality!!! 7 Outline • • • • • • • Measures for Data Quality Causes for data errors Common types of data errors Common error checks Correcting missing values Timing for error checks and corrections Steps of data preparation 8 Measures for Data Quality • Correctness/Accuracy - w.r.t. the real data • Consistency/Uniqueness – data values, references, measures and interpretations • Completeness - scope of data & values • Relevancy – w.r.t. the requirements • Current data – relevant to the requirements 9 Causes for Data Errors • Data entry errors • Correct data not available at the time of data entries • By different users same time or same users overtime – Inconsistent or incorrect use of “codes” – Inconsistent or incorrect interpretation of “fields” • Transaction processing errors • System and recovery errors • Data extract/transformation errors 10 Common Data Errors • • • • Missing (null) values Incorrect use of default values (e.g., zero) Data domain integrity violation (e.g., 0/1) Data value (dependency) integrity violation (e.g., if mm=02 then DD<30) • Data referential integrity violation (e.g., a customer’s order record cannot exist unless the customer record already exists) 11 Common Data Errors, Cont’d • Data retention integrity violation (e.g., old inventory snapshots should not be stored) • Data Derivation/Transformation/Aggregation Integrity Violation (e.g., profit not = sales – costs) • Inconsistent data values of the same data (M versus m for male) • Inconsistent use of the same data value (DM for Data Mining and Data Marts) 12 Error Checks • • • • • • • • Domain value validation Value dependency validation Referential integrity validation Identify missing-value or default-value records Identify outliers Cross-footing -Check aggregates and derivations across different levels and against common sense Eyeballs! 13 Process validation Data Cleaning: Missing Values 1. Exclude the record 2. Exclude the attribute/field 3. Replaced by a global constant 4. Replaced by the attribute mean 5. Replaced by the most probable value 6. Apply 4 – 6 by class/segments of records 7. Manual correction 8. Application specific algorithm 1-6 are less practical for OLAP bound data 14 Timing for Error Checking • During Data Staging • During Data Loading • Others – Before data extraction (data entries, transaction processing, recovery, audits, etc.) – After data loading 15 Steps of Data Preparation • • • • • • • • • Identify data sources Extract and analyze source data Standardize data Correct and complete data Match and consolidate data Analyze data defect types Transform and enhance data into target Calculate derivations and summary data Audit and control data extract, transformation and loading 16 Data Integration • Data from different data sources with different formats need to be integrated into one data warehouse – Ex: 3 customer table in sales department, marketing department and an acquired company Customer (cid, cname, city …) Customer (customerid, customername,city…) Customer (custid, custname, cname,…) 17 Data Integration • Same attribute with different name: cid, customerid, custid • Different attribute with same name: – cname -> customer name – cname -> city name • Same attribute with different formats 18 Data Integration • How to integrate – Get the schemas of all data sources – Get the schema of the data warehouse – Integrate source schemas with the help from commercial tools and domain experts 19 Transformation • Prepare data for loading into the data warehouse – Change the data format – Create derived attributes and tables – Aggregate – Create warehouse keys 20 Transportation • Using bulk load tools, such as Oracle SQL Loader, instead of SQL command • Create indexes 21 Maintenance • Maintenance frequency: daily, weekly, monthly • Identify change records and new records in legacy systems – Create timestamps for changes and new records in legacy systems – Compare data between legacy systems and DW • Load changes and new records into DW 22