ISQS 3358, Business Intelligence
Extraction, Transformation, and Loading
Zhangxi Lin
Texas Tech University
Outline
• Data Integration
• Introduction to SSIS
• Learn by doing – Exercise 4
• More about SSIS features – Package development tools
Data Integration

Data Integration
• Integration that comprises three major processes:
  ◦ data access,
  ◦ data federation, and
  ◦ change capture.
• When these three processes are correctly implemented, data can be accessed and made accessible to an array of ETL and analysis tools and data warehousing environments.
• ETL Tools 4’56”
Data Integration
• Enterprise application integration (EAI): A technology that provides a vehicle for pushing data from source systems into a data warehouse, including application functionality integration. Recently, service-oriented architecture (SOA) has been applied.
• Enterprise information integration (EII): An evolving tool space that promises real-time data integration from a variety of sources, such as relational databases, Web services, and multidimensional databases.
• Extraction, transformation, and load (ETL): A data warehousing process that consists of extraction (i.e., reading data from a database), transformation (i.e., converting the extracted data from its previous form into the form it needs to take so that it can be placed into a data warehouse or simply another database), and load (i.e., putting the data into the data warehouse). A one-statement sketch follows below.
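When the source and the warehouse sit on the same SQL Server instance, the whole extract-transform-load round trip can be compressed into a single INSERT ... SELECT. A minimal sketch, assuming hypothetical SourceDB and WarehouseDB tables (none of these names come from the course exercise):

-- Extract: read rows from the source table.
-- Transform: clean the name and normalize the country code.
-- Load: insert the result into the warehouse dimension table.
INSERT INTO WarehouseDB.dbo.DimCustomer (CustomerKey, CustomerName, CountryCode)
SELECT CustomerID,
       LTRIM(RTRIM(CustomerName)),   -- trim stray whitespace
       UPPER(CountryCode)            -- normalize the code's case
FROM   SourceDB.dbo.Customers
WHERE  IsActive = 1;                 -- extract only current customers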
Transformation Tools: To Purchase or to Build In-House
• Issues that affect whether an organization will purchase data transformation tools or build the transformation process itself:
  ◦ Data transformation tools are expensive
  ◦ Data transformation tools may have a long learning curve
  ◦ It is difficult to measure how the IT organization is doing until it has learned to use the data transformation tools
• Important criteria in selecting an ETL tool:
  ◦ Ability to read from and write to an unlimited number of data source architectures
  ◦ Automatic capturing and delivery of metadata
  ◦ A history of conforming to open standards
  ◦ An easy-to-use interface for the developer and the functional user
Introduction to SSIS

ETL Topics
• Dimension Processing
  ◦ Extract changed rows from the operational database
  ◦ Handling slowly changing dimensions
  ◦ De-duplication and fuzzy transforms
• Fact Processing
  ◦ Extract fact data from the operational database
  ◦ Extract fact updates and deletes
  ◦ Cleaning fact data
  ◦ Checking data quality and halting package execution
  ◦ Transform fact data
  ◦ Surrogate key pipeline (see the T-SQL sketch after this list)
  ◦ Loading fact data
  ◦ Analysis Services processing
  ◦ Integrating all tasks
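In SSIS the surrogate key pipeline is usually built from Lookup transforms; expressed in plain T-SQL it amounts to joining staged fact rows to the dimensions on their business keys. A hedged sketch – the dimension names echo this lecture's data mart, while the staging table and all column names are assumptions:

-- Swap the business keys in staged fact rows for dimension surrogate keys.
INSERT INTO dbo.ManufacturingFact (ProductKey, MachineKey, BatchKey, AcceptedProducts)
SELECT p.ProductKey,                  -- surrogate key looked up in DimProduct
       m.MachineKey,                  -- surrogate key looked up in DimMachine
       b.BatchKey,                    -- surrogate key looked up in DimBatch
       s.AcceptedProducts
FROM   dbo.StagingManufacturing AS s
JOIN   dbo.DimProduct AS p ON p.ProductCode = s.ProductCode
JOIN   dbo.DimMachine AS m ON m.MachineCode = s.MachineCode
JOIN   dbo.DimBatch   AS b ON b.BatchNumber = s.BatchNumber;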
Different Types of ETL Control Flows
• With data flows, e.g.
  ◦ Import data
  ◦ Database updates
  ◦ Loading SCD
  ◦ Database cleansing
  ◦ Aggregating data
• Without data flows, e.g.
  ◦ Downloading zipped files
  ◦ Archiving downloaded files
  ◦ Reading application logs
  ◦ Mailing opportunities
  ◦ Consolidating workflow packages
Structure and Components of Business Intelligence
[Diagram: MS SQL Server 2008 tools and services – SSMS and BIDS as the tools; SSIS, SSAS, and SSRS as the services – alongside SAS EG and SAS EM]
What can you do with SSIS?
• To load the dimension and fact tables in the database. If the source data for a dimension table is stored in multiple data sources, the package can merge the data into one dataset and load the dimension table in a single process, instead of using a separate process for each data source.
• To update data in data warehouses and data marts. The Slowly Changing Dimension Wizard automates support for slowly changing dimensions by dynamically creating the SQL statements that insert and update records, update related records, and add new columns to tables (a hand-written sketch of this pattern follows below).
• To process Analysis Services cubes and dimensions. When the package updates tables in the database that a cube is built on, you can use Integration Services tasks and transformations to automatically process the cube and to process dimensions as well.
• To compute functions before the data is loaded into its destination. If your data warehouses and data marts store aggregated information, the SSIS package can compute functions such as SUM, AVERAGE, and COUNT. An SSIS transformation can also pivot relational data and transform it into a less-normalized format that is more compatible with the table structure in the data warehouse.
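For intuition, here is a hand-written Type 2 slowly-changing-dimension sketch of the kind of SQL the wizard generates. The housekeeping columns (RowIsCurrent, RowStartDate, RowEndDate) and the staging table are assumptions, not the wizard's actual output:

-- Step 1: expire the current row when a tracked attribute has changed.
UPDATE d
SET    d.RowIsCurrent = 0,
       d.RowEndDate   = GETDATE()
FROM   dbo.DimProduct AS d
JOIN   dbo.StagingProduct AS s ON s.ProductCode = d.ProductCode
WHERE  d.RowIsCurrent = 1
  AND  d.ProductName <> s.ProductName;

-- Step 2: insert a fresh current row for every source row that now lacks one
-- (covers both brand-new products and the rows expired in step 1).
INSERT INTO dbo.DimProduct (ProductCode, ProductName, RowStartDate, RowIsCurrent)
SELECT s.ProductCode, s.ProductName, GETDATE(), 1
FROM   dbo.StagingProduct AS s
WHERE  NOT EXISTS (SELECT 1 FROM dbo.DimProduct AS d
                   WHERE d.ProductCode = s.ProductCode
                     AND d.RowIsCurrent = 1);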
YouTube Videos
• Introduction to SQL Server Integration Services
  ◦ Part 1 10’19”, Part 2 8’12”
• Create A Basic SSIS Package with SSIS 7’55”
• How to create a simple SSIS package
  ◦ Part 1 3’19”, Part 2 7’54”, Part 3 6’41”
• More videos
  ◦ An Overview of SSIS
    Part 1 6’12”, Part 2 6’13”, Part 3 8’20”
  ◦ ETL Demo 10’40”
  ◦ ETL Tools 4’56”
  ◦ SSIS 2008 Package Deployment
    Part I 8’04”, Part II 5’09”
  ◦ Introduction to SSIS 2008
    Part I 9’57”, Part II 9’57”, Part III 9’55”, Part IV 9’59”, Part V 6’11”
  ◦ ETL Strategies with SSIS 21’56”
SQL Server Integration Services
• The hierarchy of SSIS
  ◦ Project -> Package -> Control flow -> Data flow
• Package structure
  ◦ Control flow
  ◦ Data flow
  ◦ Event handler
  ◦ Package explorer
  ◦ Connection tray
• Features
  ◦ Event driven
  ◦ Layered
  ◦ Drag-and-drop programming
  ◦ Data I/O definitions are done using Connection Managers
[Diagram: a package’s Control Flow, Data Flow, and Event Handler layers]
Control Flow
• Bulk Insert task: Performs a fast load of data from flat files into a target table. Good for loading clean data.
• Execute SQL task: Performs database operations, such as creating views, tables, or even databases. Good for querying data or metadata (an example statement follows this list).
• File Transfer Protocol and File System tasks: Transfer files or sets of files.
• Execute Package, Execute DTS2000 Package, and Execute Process tasks: Break a complex workflow into smaller ones, and define a parent or master package to execute them.
• Send Mail task: Sends an email message.
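As an example, an Execute SQL task in this course’s exercise might run a statement like the following (the view name is made up; Locations is the exercise’s source table):

-- Create a view that an Execute SQL task could set up before the data flow runs.
CREATE VIEW dbo.vw_PlantLocations AS
SELECT LocationCode, LocationName, CountryCode
FROM   dbo.Locations
WHERE  LocationType = 'Plant Site';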
Control Flow (cont’d)
• Script and ActiveX Script tasks: Perform an endless array of operations that are beyond the scope of the standard tasks.
• Data Mining and Analysis Services Processing tasks: Launch processing on SSAS dimensions and databases. Use the SSAS DDL task to create new Analysis Services partitions, or perform any data definition language operation.
• XML and Web Services tasks
• Message Queue, WMI Data Reader, and WMI Event Watcher tasks: Help to build an automatic ETL system.
• ForEach Loop, For Loop, and Sequence containers: Execute a set of tasks multiple times.
• Data Flow tasks
Data Flow Task
• The Data Flow task is a pipeline in which data is picked up, processed, and written to a destination.
• It avoids intermediate disk I/O, which provides excellent performance.
• Concepts
  ◦ Data sources
  ◦ Data destinations
  ◦ Data transformations
  ◦ Error flows
Frequently Used Data Transformation Steps
• Sort and Aggregate transforms
• Conditional Split and Multicast transforms
• Union All, Merge Join, and Lookup transforms
• Slowly Changing Dimension transform
• OLE DB Command transform
• Row Count and Audit transforms
• Pivot and Unpivot transforms
• Data Mining Model Training and Data Mining Query transforms
• Term Extraction and Term Lookup transforms
• File Extractor and File Injector transforms
Demo: Exploring Features of SSIS
• Data Set:
  ◦ Source: Commrex_2011, D5.txt (in the shared directory under ~\Datasets)
  ◦ Destination: Flat file, Excel file, OLE DB file
• Data flow transformations
  ◦ Aggregate (use D5.txt, and aggregate the data with regard to UserID; a rough SQL analogue follows below)
  ◦ Derived Column (use Commrex_2011, and create a new column “NewID”)
  ◦ Data Conversion (use Commrex_2011, and convert the data type of some columns, such as UserID and Prop_ID)
  ◦ Sort (use D5.txt; sort ascending by ID, Date, Time)
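As a point of reference, the Aggregate step above is roughly equivalent to a GROUP BY, assuming D5.txt has first been staged into a hypothetical table dbo.D5:

-- Group the staged rows by UserID and count them per user.
SELECT UserID,
       COUNT(*) AS RowsPerUser
FROM   dbo.D5
GROUP  BY UserID;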
LEARN BY DOING – EXERCISE 4

Exercise 4: Populate Maximum Miniatures Manufacturing Data Mart Dimensions
• Preparation: Data sources and destination definition
  ◦ Source database: AccountingSystemDatabase
• Loading dimensions
  ◦ ProductType
  ◦ ProductSubType
  ◦ Product
  ◦ Country
  ◦ Plant (using SQL Command)
  ◦ Material (using SQL Command, Aggregate item)
  ◦ MachineType (copied from the Material loading task)
  ◦ Machine (copied from the MachineType loading task)
• Note: DimBatch and the fact table will be loaded in the next exercise.
• Debugging
  ◦ Step by step
  ◦ Understand the error messages
  ◦ Watch the database loading status
• See the more detailed Guidelines for this exercise.
• Submit screenshots of the “green” results of the ETL flow to [email protected] by February 20, before 5 pm.
Review: A Primary Key
• A primary key is a candidate key chosen to be the main key for the relation.
• If you know the value of the primary key, you will be able to uniquely identify a single row.
Review: A Foreign Key
• To preserve relationships, you may need to create a foreign key.
• A foreign key is a primary key from one table placed into another table.
• The key is called a foreign key in the table that receives it.
Review: Referential Integrity
• Referential integrity states that every value of a foreign key must match a value of an existing primary key.
• Example (see previous slide; illustrative DDL follows below):
  ◦ If EmpID = 4 in EMPLOYEE has a DeptID = 7 (a foreign key), a Department with DeptID = 7 must exist in DEPARTMENT.
  ◦ The primary key value must exist before the foreign key value is entered.
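Illustrative DDL for the EMPLOYEE/DEPARTMENT example (the table and column names follow the slides; the data types and everything else are assumptions):

CREATE TABLE DEPARTMENT (
    DeptID   INT         NOT NULL PRIMARY KEY,   -- primary key
    DeptName VARCHAR(50) NOT NULL
);

CREATE TABLE EMPLOYEE (
    EmpID  INT NOT NULL PRIMARY KEY,
    DeptID INT NOT NULL
        REFERENCES DEPARTMENT (DeptID)           -- foreign key: enforces referential integrity
);

-- Fails unless DEPARTMENT already contains DeptID = 7:
INSERT INTO EMPLOYEE (EmpID, DeptID) VALUES (4, 7);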
Snowflake Schema of the Data Mart
[Diagram: the snowflake schema, with each table numbered in loading order]
1. DimProductType
2. DimProductSubType
3. DimProduct
4. DimCountry
5. DimPlant (SQL coding)
6. DimMaterial (SQL coding, Aggregate)
7. DimMachineType
8. DimMachine
9. DimBatch
10. ManufacturingFact
Why follow this order? Referential integrity: each dimension must be loaded before the tables whose foreign keys reference it, and the fact table comes last.
Data Source
• Lin.AccountingSystemDatabase

Codes for Data Flows
• The following code is used to selectively retrieve data from the source for the destination database.
• Code for DimPlant loading:

SELECT LocationCode, LocationName, CountryCode
FROM Locations
WHERE LocationType = 'Plant Site'

• Code for DimMaterial loading:

SELECT AssetCode, AssetName, AssetClass, LocationCode,
       Manufacturer, DateOfPurchase, RawMaterial
FROM CapitalAssets
WHERE AssetType = 'Molding Machine'
More About SSIS Features
ETL System Debugging
• Most frequently encountered errors
  ◦ Data format error: The database table’s data type does not match the input data’s format.
    - Reason 1: A flat text file uses varchar(50), or string [DT_STR], format; an Excel file uses nvarchar format.
    - Reason 2: You defined the database using different formats, which could be caused by the imported data set.
    - Solution: A Data Conversion transformation node can be used to change the format (one alternative is sketched below).
  ◦ SQL Server system error: Even though you did everything correctly, the package still cannot get through.
    - Solution: The easiest way to solve this problem is to redo the ETL flow.
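One alternative to adding a Data Conversion node is to cast the columns in the source query so their types already match the destination. A sketch using the demo’s Commrex_2011 columns (the target types are assumptions):

-- Cast in the source query so the pipeline receives the destination's types.
SELECT CAST(UserID  AS INT) AS UserID,
       CAST(Prop_ID AS INT) AS Prop_ID
FROM   dbo.Commrex_2011;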
ETL How-to Problems
• How to use the Merge function of Data Transformation to join datasets from two tables into one
• How to split a dataset into two tables
• How to remove duplicated rows in a table (see the sketch after this list)
• How to detect changes to the rows in the data sources and extract the updated rows into a table in the data warehouse
• How to load multiple datasets with a similar structure into one table
Reference: SQL Server 2005 Integration Services, McGraw-Hill Osborne, 2007
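For the de-duplication item above, a common T-SQL pattern (SQL Server 2005 and later) numbers the copies of each business key and deletes all but the first; the table and columns here are hypothetical:

-- Keep one row per CustomerCode (the most recently loaded) and delete the rest.
WITH Ranked AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY CustomerCode
                              ORDER BY LoadDate DESC) AS rn
    FROM dbo.Customers
)
DELETE FROM Ranked
WHERE rn > 1;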
Strategies for Extracting Data
• Extracting data from packaged source systems – self-contained data sources
  ◦ May not be good to use their APIs
  ◦ May not be good to use their add-on analytic system
• Extracting directly from the source databases
  ◦ Strategies vary depending on the nature of the source database
• Extracting data for incremental loads (see the sketch below)
  ◦ How the source database records the changes of the rows
• Extracting historical data
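A minimal incremental-extract sketch, assuming the source table keeps a LastModified timestamp and the ETL system records a watermark of how far it has read (every name here is hypothetical):

-- Look up how far the previous run got.
DECLARE @LastExtract DATETIME;

SELECT @LastExtract = ExtractedThrough
FROM   dbo.EtlWatermark
WHERE  TableName = 'Orders';

-- Pull only the rows changed since the last run.
SELECT *
FROM   SourceDB.dbo.Orders
WHERE  LastModified > @LastExtract;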
Decision Issues in ETL System Design
• Source-to-target mapping
• Load frequency
• How much history is needed
De-Duplication
• Two common situations: person, and organization
• SSIS provides two general-purpose transforms that help address data quality and de-duplication (a crude SQL analogue follows below)
  ◦ Fuzzy Lookup
  ◦ Fuzzy Grouping
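Fuzzy Lookup and Fuzzy Grouping run inside SSIS itself; as a much cruder T-SQL analogue for the person case, SOUNDEX can flag candidate duplicate names (the table and columns are hypothetical):

-- Pair up rows whose last names sound alike and whose first names match.
SELECT a.CustomerID, b.CustomerID AS PossibleDuplicateOf
FROM   dbo.Customers AS a
JOIN   dbo.Customers AS b
       ON  SOUNDEX(a.LastName) = SOUNDEX(b.LastName)
       AND a.FirstName = b.FirstName
       AND a.CustomerID < b.CustomerID;   -- skip self-pairs and mirrored pairs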
Connection Managers
• Excel Connection Manager
• File Connection Manager
• Flat File Connection Manager
• FTP Connection Manager
• HTTP Connection Manager
• ODBC Connection Manager
• OLE DB Connection Manager
• ADO Connection Manager – for legacy applications using earlier versions of programming languages, such as VB 6.0
• ADO.NET Connection Manager – access to Microsoft SQL Server and data sources exposed through OLE DB and XML by using a .NET provider
• Microsoft .NET Data Provider for mySAP Business Suite – access to an SAP server; enables executing RFC commands and SELECT queries against SAP tables
• Design-time data source objects can be created in SSIS, SSAS, and SSRS projects
Container Managers
• Foreach Loop Container
• For Loop Container
• Sequence Container
Data Flow for Updating Database
[Screenshot of the package]

Data Flow for Loading Slowly Changing Dimension
[Screenshot of the package]

Control Flow for Importing Expanded Files
[Screenshot of the package]
Automating Your Routine Information Processing Tasks
• Your routine information processing tasks
  ◦ Read online news at 8:00 am and collect a few of the most important pieces
  ◦ Retrieve data from a database to draft a short daily report at 10 am
  ◦ View and reply to emails, and take some notes that are saved in a database
  ◦ View 10 companies’ webpages to see the updates; input the summaries into a database
  ◦ Browse three popular magazines twice a week; input the summaries into a database
  ◦ Generate a few one-way frequency and two-way frequency tables and put them on the web
  ◦ Merge datasets collected by other people into a main database
  ◦ Prepare a weekly report using the database at 4 pm every Monday, and publish it to the internal portal site
  ◦ Prepare a monthly report at 11 am on the first day of each month, which must be converted into a PDF file and uploaded to the website
• It seems there are many things going on. How can they be handled properly and at the right time?
  ◦ An organizer – yes
  ◦ How about the regular data processing tasks?
Package Items
• Data Flow Task – the main task
• Control Flow Items
  ◦ Containers: For Loop Container, Foreach Loop Container, Sequence Container
  ◦ Data Preparation Tasks: File System Task, FTP Task, Web Service Task, XML Task
  ◦ Work Flow Tasks: Execute Package Task, Execute DTS 2000 Package Task, Execute Process Task, Message Queue Task, Send Mail Task, WMI Data Reader Task, WMI Event Watcher Task
  ◦ SQL Server Tasks: Bulk Insert Task, Execute SQL Task
  ◦ Scripting Tasks: ActiveX Script Task, Script Task
  ◦ Analysis Services Tasks: Analysis Services Processing Task, Analysis Services Execute DDL Task, Data Mining Query Task
  ◦ Transfer Tasks: Transfer Database Task, Transfer Error Messages Task, Transfer Logins Task, Transfer Objects Task, Transfer Stored Procedures Task
  ◦ Maintenance Tasks
  ◦ Custom Tasks