* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Populating the Data Warehouse (ETL)
Open Database Connectivity wikipedia , lookup
Microsoft SQL Server wikipedia , lookup
Extensible Storage Engine wikipedia , lookup
Clusterpoint wikipedia , lookup
Entity–attribute–value model wikipedia , lookup
Functional Database Model wikipedia , lookup
Relational model wikipedia , lookup
Extract, Transform, Load 1 Agenda  Review     Analysis Logical Design Physical Design Implementation (Bus Matrix, Info Package) (Dimensional Modeling) (Spreadsheet) (Data Mart Relational Tables)  ETL Process Overview  ETL Components     Staging Area Extraction Transformation Loading  Documenting High-Level ETL Requirements  Documenting Detailed ETL Flows  Example ETL 2 Review: Dimensional Modeling 3 Review: DM Implementation DimStudent CREATE TABLE DimStudent( student_sk int identity(1,1), student_id varchar(9), firstname varchar(30), lastname varchar(30), city varchar(20), state varchar(2), major varchar(6), classification varchar(25), gpa numeric(3, 2), club_name varchar(25), undergrad_school varchar(25), gmat int, undergrad_or_grad varchar(10), CONSTRAINT dimstudent_pk PRIMARY KEY (student_sk)); GO FactEnrollment CREATE TABLE FactEnrollment( student_sk int, class_sk int, date_sk int, professor_sk int, course_grade numeric(2, 1), CONSTRAINT factenrollment_pk PRIMARY KEY (student_sk, class_sk, date_sk, professor_sk), CONSTRAINT factenrollment_student_fk FOREIGN KEY (student_sk) REFERENCES dimstudent(student_sk), CONSTRAINT factenrollment_class_fk FOREIGN KEY(class_sk) REFERENCES dimclass (class_sk), CONSTRAINT factenrollment_date_fk FOREIGN KEY(date_sk) REFERENCES dimtime (date_sk), CONSTRAINT factenrollment_professor_fk FOREIGN KEY(professor_sk) REFERENCES dimprofessor (professor_sk)); GO 4 Review: Physical DW Design 5 ETL Overview  Reshaping relevant data from source systems into useful information stored in the DW  Extract  Copying and integrating data from OLTP and other data sources in preparation for cleansing and loading into the DW  Transform  Cleaning and converting data to prepare it for loading into the DW  Load  Putting cleansed and converted data into the DW 6 ETL Process  Not Really New, BUT…  Much more data  Includes rearranging, summarizing  Data used for strategic decision-making  Characteristics:     Process AND technology Detailed, highly-dependent tasks Consumes average 75% of DW development An on-going process for life of DW  Requirements:  Well-documented  Automated  Flexible 7 ETL Process 1. Determine target data 2. Determine data sources 3. Prepare data mapping 4. Organize data staging area 5. Establish data extraction rules 6. Establish data transformation rules 7. Plan aggregate tables 8. Establish data load procedures 9. Load dimension tables 10. Load fact tables 8 ETL Process Flow 3, Spreadsheet 1, Dim Model 2, Spreadsheet 6, 7, Map & SSIS 5, SSIS 8, 9, 10, SSIS 4 9 ETL Staging Area  Information hub, facilitating the enriching stages that data goes through to populate a DW  Advantages:  Separates source systems and DW  Minimizes ETL impact on source AND DW systems  Can consist of multiple “hubs”  “upload” area  “staging” area  “DW load images” 10 ETL Staging Area, cont… 11 High Level Design of ETL Process  Initial documentation of:  What data do we need and where is it coming from?  Physical DW Design Spreadsheet shown previously  What are the major transformation/cleansing needs?  “Extend” Physical DW Design Spreadsheet OR  ETL Map  What’s the sequence of activities for ETL?  ETL Map 12 Common Transformations  Format Revisions  Key Restructuring, Lookup  Handling of Null Values  Decoding fields  Calculated, Derived values  Merging of Data 13 Common Transformations, cont…  Splitting of single fields  Character set conversion  Units of measurement conversion  Date/time conversion  Summarization  Deduplication 14 Common Transformations, cont…  Other Data Quality Issues  Standardize values  Validate values  Identifying mismatches, misspellings  Etc…  Suggestions:  Appoint “Data Stewards”  Ensure ETL programs have control checks  Data Profiling… 15 Comparison of Models 16 Transformations Example DimTime DimProfessor DimClass DimStudent FactEnrollment Create table Generate SK Generate SK Generate SK Add SKs: student, section, prof (join registration to student, time, and section dims; left join them to prof) Insert row w/SK = -1 Insert row w/SK = -1 Insert row w/SK = -1 Insert row w/SK = -1 Expand rank values (use SQL case) Get coursename & cred hrs from section tbl (join section to course) Expand classification values (use SQL case) Expand department values (join prof to departments) Expand state values (needs lookup table but use SQL case instead) Get gmat, undergrad school from grad table (join student to grad) Get club name from club (join student to undergrad; Left join them to club) Create undergrad_or_grad values (if stud_id in undergrad or stud_id in grad) 17 Data Profiling  Systematic analysis of the content of a data source  Goals:  Anticipate potential data quality issues upfront  Build quality corrections and controls into ETL process  Manual and/or Tool-assisted 18 Profiling Example: Manual Account CustID Number Customer First Type Title Name AW000110 11000 00 I AW000110 11001 01 I AW000110 11002 02 Last Name Gender Email Phone Address Line1 Address Line2 State Postal Code Country Yang F [email protected]. 1(11) 500 5550162 3761 N. 14th St Queensland 4700 AU Eugene Huang F [email protected]. 500-555-0110 2243 W St. Victoria 3198 AU I Ruben Torres F [email protected]. 1(11) 500 5550184 5844 Linden Dr New South Wales 7001 AU AW000110 11003 03 I Christy Zhu F [email protected]. 1(11) 500 5550162 1825 Village Pl. Queensland 2113 AW000110 11004 04 I F [email protected]. 7553 Harness (500) 555-0131 Circle AW000110 11005 05 I M [email protected]. 1(11) 500 5550151 Mr. Jon Mrs. Elizabeth Johnson Julio Ruiz 7305 Humphrey Drive New South Wales 2500 AU 4169 OZ 19 Profiling Example: SSIS 20 Documenting ETL High Level Design  Add to existing DW Physical Design Spreadsheet 21 Documenting ETL High Level Design 22 Low Level Design of ETL Process  Detailed documentation of:  What data do we need and where is it coming from?  What are the major transformation/cleansing needs?  What’s the sequence of activities for ETL?  Can use tool like SSIS 23 Extracting Source Data  Two forms: 1. Static Data Capture   Point-in-time snapshot Initial Loads and periodic refreshes 2. Revised Data Capture    Only data that has been added, updated, deleted since last load Ongoing incremental loads Two timeframes   Immediate Deferred 24 Static Data Capture  (T)SQL Scripts  e.g., small number of tables/rows  Export/Import Tables  e.g., database or non-database sources  Backup/Restore Database  e.g., copying sqlserver source database for initial load ETL  Detach/Attach Database  e.g., copying older sqlserver version to newer sqlserver version for initial load ETL 25 Revised Data Capture  Immediate / Real-time  ETL side:  OLTP side:  OLTP side: procs get changed data from log real-time and update ETL staging tables triggers update ETL staging tables apps write to OLTP AND ETL staging tables  Deferred  ETL side:  ETL side:  OLTP side: procs get changed data from OLTP tables based on timestamps procs do file comparison changed data capture (SS 2008) 26 Documenting ETL Low Level Design: SSIS  Comes with SQL Server  Helps document and automate ETL process  Based on defining  Packages  Tasks  One approach  A package for each target table  A "master" package 27 SSIS Package Examples: Master 28 SSIS Package Examples: Extract All 29 SSIS Package Examples: Extract Changed using CDC Eg, SELECT * from cdccustomer WHERE cdc_chg_date > etl_last_capture_date; 30 SSIS Package Examples: Transforms 31 SSIS Package Examples: Load 32 Class Performance DW Example  Create ClassPerformanceDW database  Using ClassPerformanceDW database…  Create ClassPerformanceDW tables using SQL Script  http://business.baylor.edu/gina_green/teaching/sqlserver/scripts/generate_class_performance_d w_tables/create_class_performance_dw_tables.sql 33 ETL Example using SQL Scripts  One "Master Script"  Calls five "table" scripts 34 "Master" Script --be sure to turn on Query, SQLCMD mode in order to run this script Use ClassPerformanceDW print 'loading dimclass table' Go :r "C:\Documents and Settings\Gina\Desktop\generate_class_performance_dw_tables\load_dimclass.sql" print 'loading dimprofessor table' Go :r "C:\Documents and Settings\Gina\Desktop\generate_class_performance_dw_tables\load_dimprofessor.sql" print 'loading dimstudent table' Go :r "C:\Documents and Settings\Gina\Desktop\generate_class_performance_dw_tables\load_dimstudent.sql" print 'loading dimtime table' Go :r "C:\Documents and Settings\Gina\Desktop\generate_class_performance_dw_tables\load_dimtime.sql" print 'loading factenrollment table' Go :r "C:\Documents and Settings\Gina\Desktop\generate_class_performance_dw_tables\load_factenrollment.sql" Print 'class performance DW data transformation and loading is complete' Go 35 Load "DimProfessor" Script (pg. 1 of 3) set nocount on print 'remove existing data from dimprofessor' delete from dimprofessor; go print 'reseeding SK identity value back to 1' dbcc checkident ('dimprofessor', reseed, 0); go print 'adding oltp prof data to dimprofessor' print 'professor_sk will be automatically inserted' insert into dimprofessor ( professor_id, firstname, lastname, rank, department) select prof_id, firstname, lastname, rank, dept from regnOLTP.dbo.prof ; go 36 Load "DimProfessor" Script (pg. 2 of 3) print 'decoding rank field' UPDATE dimprofessor SET dimprofessor.rank = case dimprofessor.rank when 'asst' then 'assistant prof' when 'assc' then 'associate prof' when 'prof' then 'full prof' end ; Go print 'decoding department field using imported excel spreadsheet' UPDATE dimprofessor SET dimprofessor.department = regnOLTP.dbo.departments.department FROM dimprofessor, regnOLTP.dbo.departments WHERE dimprofessor.department = regnOLTP.dbo.departments.prefix ; Go 37 Load "DimProfessor" Script (pg. 3 of 3) print 'adding SK -1 row' set identity_insert dimprofessor on Go insert into dimprofessor ( professor_sk, professor_id, firstname, lastname, rank, department) Values (-1, -1, 'unknown', 'unknown', 'unknown', 'unknown'); GO set identity_insert dimprofessor off Go Set nocount off 38 Load "FactEnrollment" Script print 'adding oltp registration data to fact_enrollment' INSERT INTO factenrollment ( student_sk, class_sk, date_sk, professor_sk, course_grade) SELECT student_sk, class_sk, datekey, professor_sk, final_grade FROM ((((regnOLTP.dbo.registration INNER JOIN dimstudent ON registration.stud_id = dimstudent.student_id) INNER JOIN dimclass ON regnOLTP.dbo.registration.callno = dimclass.crn) INNER JOIN dimtime ON CONVERT(varchar(10),regnOLTP.dbo.registration.regn_date,101) = actualdatekey) INNER JOIN regnOLTP.dbo.section ON dimclass.crn = regnOLTP.dbo.section.callno) LEFT JOIN dimprofessor ON regnOLTP.dbo.section.prof_id = dimprofessor.professor_id ; Go 39 Entire Transform/Load "Package" http://business.baylor.edu/gina_green/teaching/sqlserver/scripts/generate_class_performance_d w_tables.zip 40
 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
									 
                                             
                                             
                                             
                                             
                                             
                                             
                                             
                                             
                                            