Data warehousing Project
1. Synopsis
2. Software and Database Specification
3. Features of MS SQL Server 2008
4. Project Design
5. Software Installation
6. Creation of Fact and Dimension Tables
7. Report Generation using the SSRS wizard of SQL Server 2008
8. Data Extraction from a Flat File using the SSIS tool
9. Populating the Data Warehouse using the SSIS Integration Services tool
10. Data Normalization
11. Assuring Data Quality
12. Books and references
1. Synopsis:
The main aim of this project is to understand the basic concepts of how a data warehouse is built from scratch using an external data file. The project is done using MS SQL Server 2008, a commercial product used for building and maintaining data warehouses. The tool can also be used to generate reports for analysis purposes.
2. Software and Database Specification:
Software used:
MS SQL Server 2008 was downloaded from the following site:
http://www.microsoft.com/sqlserver/2008/en/us/trial-software.aspx
The tool is a free download, and the trial version is valid for a period of 180 days.
Database tables and schema:
Microsoft also offers free sample database tables with data, so these samples can be used as the source database for building the data warehouse and for reporting purposes. The databases are queried using Transact-SQL (T-SQL). The samples can be downloaded from the following link:
http://sqlserversamples.codeplex.com/Release/ProjectReleases.aspx?ReleaseId=18648
3. Features of MS SQL Server 2008:
This is a commercial tool used in industry, and it has all the features required for OLAP activities (for example data mining, data warehousing, reporting, etc.). Some of its main features are as follows:
Server Management Studio: a graphical tool used for all data management activities; it also helps build queries in a very user-friendly development environment.
Analysis Services: makes it easy to create sophisticated online analytical processing (OLAP) and data mining solutions. The Analysis Services tools provide the capability to design, create, and manage cubes and data-mining models from data warehouses, and to provide client access to OLAP and data mining data.
Integration Services: makes it easy to create robust and complex solutions for extracting, transforming, and loading data (ETL). The Integration Services tools let you design, create, deploy, and manage packages that address everyday business requirements. This is the main tool used by the data warehouse developer for designing the data warehouse.
Reporting Services: used to configure a report server, manage report server content and operations, and create and publish reports and report models.
4. Project Design:
Create a small data warehouse with the following tables:
Employee table: contains the employee information with the attributes employee id, First name, and Last name. Empid is the primary key of this table.
Department table: contains the name of the department and the department id. Dept_id is the primary key of this table.
Salary table: has the dno column, a foreign key from the Department table, along with the empid. This table holds details such as the rate the employee charges per day and the frequency of how many times per month he/she is paid.
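As a sketch, the three tables described above might be created with T-SQL like the following (the data types, and any column names not mentioned above, are assumptions):

```sql
-- Sketch of the three source tables; names follow the descriptions above,
-- data types are assumptions.
CREATE TABLE Employee (
    Empid      INT         NOT NULL PRIMARY KEY,
    First_name VARCHAR(50) NOT NULL,
    Last_name  VARCHAR(50) NOT NULL
);

CREATE TABLE Department (
    Dept_id   INT         NOT NULL PRIMARY KEY,
    Dept_name VARCHAR(50) NOT NULL
);

CREATE TABLE Salary (
    empid     INT            NOT NULL REFERENCES Employee (Empid),
    dno       INT            NOT NULL REFERENCES Department (Dept_id),
    rate      DECIMAL(10, 2) NOT NULL,  -- rate charged per day
    frequency INT            NOT NULL,  -- payments per month
    PRIMARY KEY (empid, dno)
);
```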
Steps for the creation of the data warehouse:
Step 1: Data analysis.
Step 2: Data extraction from an external source and transformation of the data into a data source (NDS/ODS). In our case it is an NDS, as the data is already normalized. The granularity is at the lowest level because the amount of data is very small.
Step 3: Creation of the dimension and fact tables and population of the data.
Step 4: Quality assurance and data checks done using reporting.
Project execution
In our case, as we do not have raw data, we extract the data from our available database using the reporting tool, and then upload the data from the flat file into the fact and dimension tables using the SSIS tool of MS SQL Server, which is used to extract data from the operational environment and upload it into the data source.
We will also manipulate the columns to show how data is integrated in a data warehouse using the SSIS tool. Hence, all the ETL operations, i.e. the extraction, transformation, and loading of data, are done with this SSIS tool.
5. Software Installation:
Once the tools and databases are downloaded, install them on your PC following the MSDN help, so that the PC acts as both the server and the database host. Once the installation is done, connect the database to the SQL server and test the connectivity. Once the connection is established, all the reporting and integration activities can be done on the same PC.
Testing data connectivity:
Open SQL Server Management Studio. To check the database engine connectivity, select Database Engine under server type and give the server name. If all the credentials are correct, the database gets connected and a confirmation screen is displayed.
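Once connected, a trivial query can also confirm that the session works, for example:

```sql
-- Quick connectivity check: returns the server name and current database.
SELECT @@SERVERNAME AS server_name,
       DB_NAME()    AS current_database;
```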
6. Creation of Fact and Dimension Tables:
The fact and dimension tables form the basis of the data warehouse. In our scenario, we have three dimension tables, namely dim_emp, dim_dept, and dim_salary.
We have only one fact table, with columns empid, dno, rate, Asalary, and Msalary, where empid is the key from the employee table, dno is the key from the department table, and rate is a column from the salary table. Asalary and Msalary are calculated as the annual and monthly salary for each employee.
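One plausible T-SQL shape for this fact table is sketched below (the data types are assumptions; the document only states that Msalary and Asalary are derived measures):

```sql
-- Fact table sketch: empid and dno are dimension keys,
-- Msalary and Asalary are derived measures.
CREATE TABLE fact_salary (
    empid   INT            NOT NULL,  -- key from dim_emp
    dno     INT            NOT NULL,  -- key from dim_dept
    rate    DECIMAL(10, 2) NOT NULL,  -- daily rate from dim_salary
    Msalary DECIMAL(12, 2) NOT NULL,  -- derived monthly salary
    Asalary DECIMAL(12, 2) NOT NULL   -- derived annual salary
);
```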
7. Report Generation using the SSRS wizard of SQL Server 2008
Using this tool we can create reports and perform the necessary analysis with them. The wizard is very efficient from an end-user perspective, since the user does not need to know the complete schema of the tables; he just needs to know the tables and what type of data they contain.
Follow the steps given below to generate a report.
Click on Start -> Programs -> MS SQL Server 2008 -> SQL Server Business Intelligence Development Studio.
On the next screen, select the project type. In our case, select the "Report Server Project Wizard" option. There is also a "Report Server Project" option that can be used if we do not want to use the wizard.
After we give all the necessary server and database details, a connection is established. Also check that the details are right by clicking on the Test Connection button. The pop-up "Test connection successful" confirms that the connection has been established.
When we click on OK, the Query Builder page opens.
Build the query by selecting the table option on the left-hand side.
Specify the columns which need to be displayed in the report; the query is automatically generated by the Query Builder. The report type also needs to be selected; this decides whether the report is in tabular or matrix form. After all the necessary details have been given, the report is generated.
To display the data from the database in the table format of the report, click on the table option in the toolbox and match each cell of the table with the individual columns of the report. If the mapping is not done correctly, the report will not be displayed.
When we click on Preview, we get the report view. Click on the Save button and export the data into a TXT file so that we can use it as an external flat file to build the data warehouse. Save the TXT file on the PC so that it can be used later to build the data warehouse.
8. Data Extraction from a Flat File using the SSIS tool:
In our case, we extract data from the report which we obtained from the database and insert it into our new database.
This covers the ETL operations (Extract, Transform, and then Load) and is therefore done by the Integration Services of SQL Server. This data extraction forms the base for constructing the data warehouse.
Click on Start -> Programs -> MS SQL Server 2008 -> SQL Server Business Intelligence Development Studio.
Select the "Integration Services" wizard option from the project types and give the name of the Integration Services project.
Initially the Integration Services tool asks us to specify the type of data source; we select the flat file option, as the external data is a flat file. Also select the "OLE DB/SQL Server" option, as the destination is a table in the database (i.e. the data is loaded from the flat file into a table). Specify the necessary database and connection details, and click on Test Connection to check the connectivity.
Once the connection is tested, we have to select which of the two is the source and which is the destination. In our case the source is the flat file and the destination is the OLE DB connection. After selecting these options, click on OK. The data flow objects appear on the screen.
First, when the source and destination appear on the screen, click on the green arrow and connect the source to the destination to indicate that a connection will be established between them.
Double-click on the flat file source and select the flat file to be loaded from the PC. Also change the column values to their respective data types so that they do not cause any data type discrepancy when the data is inserted into the database.
Then double-click on the OLE DB destination task and give the specific server and database details. Also test the connection by clicking on the Test Connection button.
Then we need to map the table column names to the column names of the flat file. This is done by clicking on the Mappings tab. Select the appropriate matches so that the mappings are made accordingly, and then click on OK.
Once the page is reloaded, click on the small green triangle, which runs the package. If the package runs without any error, the data gets inserted into the specified tables.
The next operation is to check whether the records were inserted into the database. Connect to the database server and execute a SELECT query against the table. The query should return records, which proves that the data is loaded into the database.
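For example, a quick check against one of the loaded tables (the table and column names are assumed from the design above) might look like:

```sql
-- Row count should match the number of rows in the flat file.
SELECT COUNT(*) AS loaded_rows
FROM dbo.Employee;

-- Spot-check a few rows as well.
SELECT TOP 10 Empid, First_name, Last_name
FROM dbo.Employee;
```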
In this manner, all the tables are loaded from a flat file and populated as the NDS (normalized data source) in the database. This is the normalized data source which is used to create the fact and dimension tables of the data warehouse.
9. Populating the Data Warehouse using the SSIS Integration Services tool:
Create a new SSIS package in the same data integration project:
Toolbox -> Control Flow Items -> Data Flow Task.
Double-click on the data flow task to edit it, and then create an OLE DB source from the toolbox. Then give the connectivity details for the OLE DB source so that it points to the master database of our server. Then, from the toolbox, select the Derived Column task.
When we double-click on the derived column, its editor opens.
Specify the columns which need to be modified/transformed or inserted. In our case, we modify two columns and insert one column.
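Expressed in T-SQL rather than SSIS Derived Column expressions, the transformation might look like the following (the salary formulas are assumptions based on the rate-per-day and payments-per-month columns described earlier):

```sql
-- T-SQL equivalent of the Derived Column step: the existing columns
-- pass through and one new derived column (Msalary) is added.
SELECT empid,
       dno,
       rate,
       rate * frequency      AS Msalary,  -- assumed monthly derivation
       rate * frequency * 12 AS Asalary   -- annual = monthly * 12
FROM Salary;
```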
Then connect the Derived Column task to the Slowly Changing Dimension task, which is also taken from the toolbox. Double-click on the Slowly Changing Dimension task and specify the details of the attributes.
Select the business key for the table.
Also add a new column Msalary to our table, so that it can be mapped to the new column which has been created.
Once we debug the package, it runs to completion.
Check the database; we can see the changes applied to the table.
In this manner the NDS, or normalized data source, is populated. In our case, as it is a very small database, normalization techniques are not required. Using the SSIS tool we can also normalize the data, so that there is no redundancy in the data and the database is also accurate.
10. Data Normalization:
Once the data is populated, it needs to be normalized. There is a separate method for data normalization which is also done using the SSIS tool. In our case no data normalization is required.
11. Assuring Data Quality:
The data quality process includes the activities that make sure the data in the data warehouse is correct and complete. This process includes the mechanisms to report on the data, check the quality of the report, and perform the checking and modification needed so that consistency is maintained throughout the data warehouse.
The main rules of data quality are given below. These rules essentially act as filters to prevent dirty data from getting into the data warehouse. Based on the data location, we have three kinds of data quality rules:
Incoming data validation: validation of incoming data is usually applied on the fly while the data is being loaded into the data warehouse. It verifies whether the data from the source system is valid. If there is any discrepancy with the data type, etc., the data does not even get inserted into the database.
Cross-reference validation: cross verification is the checking/comparison done between the incoming data and the data already present in the data warehouse. The main aim is to see whether the incoming data is within the range of the data already present in the data warehouse.
Data warehouse internal validation: this validation is done on the data already inside the data warehouse. It is basically done to maintain the quality of the data in the data warehouse.
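As a sketch, the three kinds of rules might be expressed as queries like the following (the staging table, its columns, and the rate threshold are assumptions; the dimension and fact table names follow the design above):

```sql
-- Incoming data validation: flag staging rows with an invalid daily rate.
SELECT *
FROM staging_salary
WHERE rate <= 0 OR rate > 10000;

-- Cross-reference validation: incoming rows whose empid does not exist
-- in the dimension table already in the warehouse.
SELECT s.*
FROM staging_salary AS s
LEFT JOIN dim_emp AS e ON e.empid = s.empid
WHERE e.empid IS NULL;

-- Internal validation: fact rows whose derived annual salary disagrees
-- with the monthly salary.
SELECT *
FROM fact_salary
WHERE Asalary <> Msalary * 12;
```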
12. Books and references
- Wikipedia
- MSDN for MS SQL Server 2008
- Building a Data Warehouse by Vincent Rainardi
- Building the Data Warehouse by W. H. Inmon