ISQS 6339, Business Intelligence
Extraction, Transformation, and Loading
Zhangxi Lin, Texas Tech University

Outline
• Introduction to SSIS
• Learn by doing – Exercise 4
• More about SSIS features – package development tools

YouTube Videos
• Introduction to SQL Server Integration Services
◦ Part 1 10'19", Part 2 8'12"
• Create a Basic SSIS Package with SSIS 7'55"
• How to create a simple SSIS package
◦ Part 1 3'19", Part 2 7'54", Part 3 6'41"
• More videos
◦ An Overview of SSIS: Part 1 6'12", Part 2 6'13", Part 3 8'20"
◦ ETL Demo 10'40"
◦ ETL Tools 4'56"
◦ SSIS 2008 Package Deployment: Part I 8'04", Part II 5'09"
◦ Introduction to SSIS 2008: Part I 9'57", Part II 9'57", Part III 9'55", Part IV 9'59", Part V 6'11"
◦ ETL Strategies with SSIS 21'56"

Introduction to SSIS

Structure and Components of Business Intelligence
[Diagram: MS SQL Server 2008 components – SSMS, SSIS, SSAS, SSRS, BIDS – alongside SAS EG and SAS EM]

ETL Topics
• Dimension processing
◦ Extract changed rows from the operational database
◦ Handling slowly changing dimensions
◦ De-duplication and fuzzy transforms
• Fact processing
◦ Extract fact data from the operational database
◦ Extract fact updates and deletes
◦ Cleaning fact data
◦ Checking data quality and halting package execution
◦ Transform fact data
◦ Surrogate key pipeline
◦ Loading fact data
◦ Analysis Services processing
• Integrating all tasks

Automating Your Routine Information Processing Tasks
• Your routine information processing tasks
◦ Read online news at 8:00a and collect a few of the most important pieces
◦ Retrieve data from a database to draft a short daily report at 10a
◦ View and reply to emails and take notes that are saved in a database
◦ View 10 companies' webpages for updates and enter the summaries into a database
◦ Browse three popular magazines twice a week and enter the summaries into a database
◦ Generate a few one-way and two-way frequency tables and put them on the web
◦ Merge datasets collected by other people into a main database
◦ Prepare a weekly report from the database at 4p every Monday and publish it to the internal portal site
◦ Prepare a monthly report at 11a on the first day of each month, converted into a PDF file and uploaded to the website
• There seem to be many things going on. How can they all be handled properly and on time?
◦ An organizer – yes
◦ But what about the regular data processing tasks?

SQL Server Integration Services (SSIS)
• The data in data warehouses and data marts is usually updated frequently, and the data loads are typically very large.
• Integration Services includes a task that bulk loads data directly from a flat file into SQL Server tables and views, and a destination component that bulk loads data into a SQL Server database as the last step in a data transformation process (see the sketch below).
• An SSIS package can be configured to be restartable. This means you can rerun the package from a predetermined checkpoint, either a task or a container in the package. The ability to restart a package can save a lot of time, especially if the package processes data from a large number of sources.
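As a concrete illustration of the bulk-load path mentioned above, here is a minimal T-SQL sketch of loading a comma-delimited flat file into a staging table. The table name, file path, and delimiters are assumptions for illustration only; they are not part of the course materials.

-- Minimal sketch: bulk load a comma-delimited flat file into a staging table.
-- dbo.StagingOrders and the file path are hypothetical; adjust to the actual source.
BULK INSERT dbo.StagingOrders
FROM 'C:\ETL\orders.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n',
    FIRSTROW = 2          -- skip the header row
);

In an SSIS package the same effect is usually achieved with the Bulk Insert task or a Data Flow task whose destination uses fast load, rather than hand-written SQL.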
What Can You Do with SSIS?
• Load the dimension and fact tables in the database. If the source data for a dimension table is stored in multiple data sources, the package can merge the data into one dataset and load the dimension table in a single process, instead of using a separate process for each data source.
• Update data in data warehouses and data marts. The Slowly Changing Dimension Wizard automates support for slowly changing dimensions by dynamically creating the SQL statements that insert and update records, update related records, and add new columns to tables.
• Process Analysis Services cubes and dimensions. When the package updates tables in the database that a cube is built on, you can use Integration Services tasks and transformations to automatically process the cube and its dimensions as well.
• Compute functions before the data is loaded into its destination. If your data warehouses and data marts store aggregated information, the SSIS package can compute functions such as SUM, AVERAGE, and COUNT. An SSIS transformation can also pivot relational data and transform it into a less-normalized format that is more compatible with the table structure in the data warehouse.

SQL Server Integration Services
• The hierarchy of SSIS
◦ Project -> Package -> Control flow -> Data flow
• Package structure
◦ Control flow
◦ Data flow
◦ Event handler
◦ Package explorer
◦ Connection tray
• Features
◦ Event driven
◦ Layered
◦ Drag-and-drop programming
◦ Data I/O definitions are done using connection managers

SSIS Architecture
[Diagram: SSIS architecture]

Control Flow
• Bulk Insert task: performs a fast load of data from flat files into a target table. Good for loading clean data.
• Execute SQL task: performs database operations such as creating views, tables, or even databases. Good for querying data or metadata.
• File Transfer Protocol and File System tasks: transfer files or sets of files.
• Execute Package, Execute DTS 2000 Package, and Execute Process tasks: break a complex workflow into smaller ones, and define a parent or master package to execute them.
• Send Mail task: sends an email message.

Control Flow (cont'd)
• Script and ActiveX Script tasks: perform an endless array of operations that are beyond the scope of the standard tasks.
• Data Mining and Analysis Services Processing tasks: launch processing on SSAS dimensions and databases. Use the SSAS Execute DDL task to create new Analysis Services partitions, or to perform any data definition language operation.
• XML and Web Services tasks
• Message Queue, WMI Data Reader, and WMI Event Watcher tasks: help to build an automated ETL system.
• ForEach Loop, For Loop, and Sequence containers: execute a set of tasks multiple times.
• Data Flow tasks

Data Flow Task
• The Data Flow task is a pipeline in which data is picked up, processed, and written to a destination.
• It avoids intermediate staging I/O, which provides excellent performance.
• Concepts
◦ Data sources
◦ Data destinations
◦ Data transformations
◦ Error flows

Frequently Used Data Transformation Steps
• Sort and Aggregate transforms
• Conditional Split and Multicast transforms
• Union All, Merge Join, and Lookup transforms
• Slowly Changing Dimension transform (see the SQL sketch below)
• OLE DB Command transform
• Row Count and Audit transforms
• Pivot and Unpivot transforms
• Data Mining Model Training and Data Mining Query transforms
• Term Extraction and Term Lookup transforms
• File Extractor and File Injector transforms
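To make the slowly changing dimension idea concrete, here is a minimal T-SQL sketch of the Type 2 pattern that the SCD Wizard generates statements for: expire the current dimension row and insert a new row carrying the changed attribute. The DimCustomer table, its columns, and the variable values are hypothetical illustrations, not the wizard's actual output or the exercise database.

-- Minimal Type 2 SCD sketch against a hypothetical DimCustomer table.
DECLARE @BusinessKey int = 1001, @NewCity nvarchar(50) = N'Lubbock';

-- 1. Expire the current row for the customer whose City changed.
UPDATE DimCustomer
SET    RowIsCurrent = 0,
       RowEndDate   = GETDATE()
WHERE  CustomerBusinessKey = @BusinessKey
  AND  RowIsCurrent = 1;

-- 2. Insert a new current row with the changed attribute value.
INSERT INTO DimCustomer (CustomerBusinessKey, City, RowIsCurrent, RowStartDate, RowEndDate)
VALUES (@BusinessKey, @NewCity, 1, GETDATE(), NULL);

In SSIS the Slowly Changing Dimension transform wires this logic into the data flow for you; the sketch only shows the shape of the resulting inserts and updates.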
Dynamic Packaging
• Modifying the actions that a package takes while it is executing.
• SSIS implements a rich expression language that is used in control flow and also in data flow transforms.
• Concepts
◦ Expressions: use a simple expression language.
◦ Variables: can be defined within a package and scoped to any object – package-wide, within a container, a single task, etc.
◦ Configurations: can override most of the settings for SSIS objects by supplying a configuration file at runtime.

Decision Issues in ETL System Design
• Source-to-target mapping
• Load frequency
• How much history is needed

Strategies for Extracting Data
• Extracting data from packaged source systems – self-contained data sources
◦ It may not be a good idea to use their APIs
◦ It may not be a good idea to use their add-on analytic system
• Extracting directly from the source databases
◦ Strategies vary depending on the nature of the source database
• Extracting data for incremental loads
◦ Depends on how the source database records changes to rows
• Extracting historical data

De-Duplication
• Two common situations: persons and organizations
• SSIS provides two general-purpose transforms that help address data quality and de-duplication
◦ Fuzzy Lookup
◦ Fuzzy Grouping

LEARN BY DOING – EXERCISE 4

Exercise 4: Populate Maximum Miniatures Manufacturing Data Mart Dimensions
• Preparation: data sources and destination definition
◦ Source database: AccountingSystemDatabase
• Loading dimensions
◦ ProductType
◦ ProductSubType
◦ Product
◦ Country
◦ Plant (using SQL Command)
◦ Material (using SQL Command, Aggregate item)
◦ MachineType (copied from the Material loading task)
◦ Machine (copied from the MachineType loading task)
• Note: DimBatch and the fact table will be loaded in the next exercise.
• Debugging
◦ Step by step
◦ Understand the error messages
◦ Watch the database loading status
• See the more detailed guidelines for this exercise
• Submit the screenshots of "green" results of the ETL flow to [email protected] by February 20 before 5p.

Snowflake Schema of the Data Mart
[Diagram: snowflake schema – ManufacturingFact with DimBatch, DimMachine, DimMachineType, DimPlant, DimCountry, DimMaterial, DimProduct, DimProductSubType, and DimProductType; tables are numbered 1–10; annotations: Aggregate, SQL Coding]

Codes for Data Flows
• The following queries are used to selectively retrieve data from the source for the destination database.
• Code for DimPlant loading:
SELECT LocationCode, LocationName, CountryCode
FROM Locations
WHERE LocationType = 'Plant Site'
• Code for DimMaterial loading:
SELECT AssetCode, AssetName, AssetClass, LocationCode, Manufacturer, DateOfPurchase, RawMaterial
FROM CapitalAssets
WHERE AssetType = 'Molding Machine'

Package Items
• Data Flow Task – main task
• Control Flow Items
◦ For Loop Container, Foreach Loop Container, Sequence Container
• Data Preparation Tasks
◦ File System Task, FTP Task, Web Service Task, XML Task
• Work Flow Tasks
◦ Execute Package Task, Execute DTS 2000 Package Task, Execute Process Task, Message Queue Task, Send Mail Task, WMI Data Reader Task, WMI Event Watcher Task
• SQL Server Tasks
◦ Bulk Insert Task, Execute SQL Task
• Scripting Tasks
◦ ActiveX Script Task, Script Task
• Analysis Services Tasks
◦ Analysis Services Processing Task, Analysis Services Execute DDL Task, Data Mining Query Task
• Transfer Tasks
◦ Transfer Database Task, Transfer Error Messages Task, Transfer Logins Task, Transfer Objects Task, Transfer Stored Procedures Task
• Maintenance Tasks
• Custom Tasks

Exploring Features of SQL Server ETL System
• Data sources and data destinations
◦ SQL Server (OLE DB) file
◦ Flat file
◦ Excel file
• Data flow transformations
◦ Aggregate (see the SQL sketch below)
◦ Derived Column
◦ Data Conversion
◦ Sort
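For the Aggregate transformation listed above, the following is a minimal sketch of the equivalent GROUP BY query. It reuses the CapitalAssets source from the DimMaterial query, but the grouping and aggregate columns are only illustrative assumptions, not the exercise solution.

-- Illustrative only: the Aggregate transform behaves like a GROUP BY.
-- Count the molding-machine assets associated with each raw material.
SELECT RawMaterial,
       COUNT(*) AS AssetCount
FROM   CapitalAssets
WHERE  AssetType = 'Molding Machine'
GROUP BY RawMaterial;

In the data flow, the same grouping is configured in the Aggregate transformation editor instead of being written as SQL, which keeps the logic inside the pipeline.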
More About SSIS Features

ETL System Debugging
• Most frequently encountered errors
◦ Data format error: the database table's data type does not match the input data's format
• Reason 1: a flat text file uses the varchar(50), or string [DT_STR], format, while an Excel file uses the nvarchar format
• Reason 2: you defined the database using different formats, which could be caused by the imported data set
• Solution: a Data Conversion transformation node can be used to change the format
◦ SQL Server system error: even though you did things correctly, you cannot get through
• Solution: the easiest way to solve this problem is to redo the ETL flow

ETL How-to Problems
• How to use the Merge function of a data transformation to join datasets from two tables into one
• How to split a dataset into two tables
• How to remove duplicated rows in a table (see the SQL sketch at the end of these notes)
• How to detect changes to the rows in the data sources and extract the updated rows into a table in the data warehouse
• How to load multiple datasets with a similar structure into one table
• Reference: SQL Server 2005 Integration Services, McGraw-Hill Osborne, 2007

Connection Managers
• Excel Connection Manager
• File Connection Manager
• Flat File Connection Manager
• FTP Connection Manager
• HTTP Connection Manager
• ODBC Connection Manager
• OLE DB Connection Manager
• ADO Connection Manager – for legacy applications using earlier versions of programming languages, such as VB 6.0
• ADO.NET Connection Manager – access to Microsoft SQL Server and to data sources exposed through OLE DB and XML by using a .NET provider
• Microsoft .NET Data Provider for mySAP Business Suite – access to an SAP server; enables executing RFC/BAPI commands and SELECT queries against SAP tables
• Design-time data source objects can be created in SSIS, SSAS, and SSRS projects

Containers
• Foreach Loop Container
• For Loop Container
• Sequence Container

Different Types of ETL Control Flows
• With data flows, e.g.
◦ Importing data
◦ Database updates
◦ Loading SCD
◦ Database cleansing
◦ Aggregating data
• Without data flows, e.g.
◦ Downloading zipped files
◦ Archiving downloaded files
◦ Reading application logs
◦ Mailing opportunities
◦ Consolidating workflow packages

Data Flow for Updating Database
[Diagram]

Data Flow for Loading Slowly Changing Dimension
[Diagram]

Control Flow for Importing Expanded Files
[Diagram]

Exploring Features of SQL Server ETL System (TBD)
• Data set
◦ Source: Commrex_2011, D5.txt (in the shared directory under \OtherDatasets)
◦ Destination: flat file, Excel file, OLE DB file
• Data flow transformations
◦ Aggregate (use D5.txt, and aggregate the data with regard to UserID)
◦ Derived Column (use Commrex_2011, and create a new column "NewID")
◦ Data Conversion (use Commrex_2011, and convert the data type of some columns, such as UserID and Prop_ID)
◦ Sort (use D5.txt, sort ascending by ID, Date, Time)

Data Source
• Lin.AccountingSystemDatabase
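One of the ETL how-to problems listed above, removing duplicated rows, can also be handled directly in SQL when a Sort transform with "remove rows with duplicate sort values" is not convenient. Below is a minimal T-SQL sketch using ROW_NUMBER(); the StagingCustomers table and its key columns are hypothetical assumptions for illustration.

-- Minimal de-duplication sketch against a hypothetical staging table:
-- keep the most recently loaded row per business key and delete the rest.
WITH Ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY CustomerCode
                              ORDER BY LoadDate DESC) AS rn
    FROM dbo.StagingCustomers
)
DELETE FROM Ranked
WHERE rn > 1;

Inside a package, the same query could be run from an Execute SQL task after staging, or replaced by a Sort or Aggregate transform in the data flow.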