Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
CO6002 Advance Database Management March 30 2011 The AWISDM project can be found by utilizing the short cut on the desktop or by following the path C:\Users\0817804\Documents\Visual Studio 2008\Projects\aj_AWISDM\aj_AWISDM the user name for the remote desktop is 0817804 and the password is P@55word [Type the document subtitle] Contents Table of Figures ....................................................................................................................................... 2 Table of Tables ........................................................................................................................................ 2 Pilot InternetSales Data Mart Design...................................................................................................... 3 Starting with dimension tables ........................................................................................................... 3 Database design as a star schema model ........................................................................................... 3 Date ..................................................................................................................................................... 5 ProductSubcategory............................................................................................................................ 5 Territory .............................................................................................................................................. 5 Customer............................................................................................................................................. 5 Surrogate Keys ................................................................................................................................ 6 Date Attributes................................................................................................................................ 6 Normalisation and de-normalisation .............................................................................................. 6 Comparing the terms star schema and snow-flake schema ............................................................... 7 Schemas in Data Warehouses ......................................................................................................... 8 Slowly Changing Dimensions ........................................................................................................ 10 Design the fact table ..................................................................................................................... 10 Fact Table considerations and Granularity ................................................................................... 11 Estimate the size of the Data Mart ............................................................................................... 11 Design, develop and save an SSIS Package ........................................................................................... 12 Integrate data from the four data sources into a staging database ................................................. 12 Load data to the data mart ........................................................................................................... 14 The key features of the ETL process ............................................................................................. 14 Business Intelligence Applications. ....................................................................................................... 15 Slicing and Dicing the Cube ............................................................................................................... 16 Reports .............................................................................................................................................. 17 Ad hoc Queries .................................................................................................................................. 18 OLAP .................................................................................................................................................. 18 Data Mining....................................................................................................................................... 19 Decision Support ............................................................................................................................... 20 Table of Figures Figure 1 Four types of dimension tables ................................................................................................. 3 Figure 2 Star Schema structure for database design .............................................................................. 3 Figure 3 Entity relationship model of fact and dimension tables ........................................................... 4 Figure 4 Starter Dimensions.................................................................................................................... 4 Figure 5 Date table with the addition of surrogate keys ........................................................................ 5 Figure 6 ProductSubCategory dimension with additional surrogate keys.............................................. 5 Figure 7 The Territory Dimension ........................................................................................................... 5 Figure 8 The Customer dimension .......................................................................................................... 5 Figure 9 Star schema model.................................................................................................................... 7 Figure 10 AWISDM Data Mart as a Star Schema .................................................................................... 8 Figure 11 The benefits of a Snowflake Schema ...................................................................................... 9 Figure 12 Snowflake Schema .................................................................................................................. 9 Figure 13 AWISDM as Snowflake Schema .............................................................................................. 9 Figure 14 Fact Table .............................................................................................................................. 10 Figure 15 Entity Relationship Diagram for AWISDM ............................................................................ 11 Figure 16 Concept design for integrating the four data sources .......................................................... 12 Figure 17 Data is imported from the four sources ............................................................................... 13 Figure 18 the ETL Process ..................................................................................................................... 13 Figure 19 Loading the data into the Data Mart .................................................................................... 14 Figure 20 A flow diagram to deal with error checking.......................................................................... 14 Figure 21 Data flow to check for and correct errors............................................................................. 15 Figure 22 Output to Fact Table ............................................................................................................. 15 Figure 23 BI Platform ............................................................................................................................ 15 Figure 24 Microsoft Office Integration ................................................................................................. 16 Figure 25 Single Data Point ................................................................................................................... 16 Figure 26 one Dimension of Data ......................................................................................................... 16 Figure 27 Two Dimensions of Data ....................................................................................................... 17 Figure 28 Three Dimensional Data........................................................................................................ 17 Figure 29 Decision Support ................................................................................................................... 20 Table of Tables Table 1 Snowflake verses Star Schema ................................................................................................... 7 Table 2 AWISDM Dimension Tables Sizes ............................................................................................. 12 Pilot InternetSales Data Mart Design In this section we will be exploring the dimensional modelling of the pilot data mart for AdventureWorks Ltd. The broad agreement in the fields of data warehousing and business intelligence is that the dimensional model is the preferred structure. Data warehousing centres on facts and the fact table. Facts are Figure 1 Four types of dimension tables usually numeric and are a measurement of an event. Starting with dimension tables Dimensions are the nouns of the business world; they describe the surrounding measurement events. A single dimension that is shared across all of these facts is called a conformed dimension. A junk dimension is a convenient way of grouping flags and indicators. A degenerate dimension are common when the grain is of the fact table is at a unit level. Role-playing dimensions are recycled from multiple applications within the same database. Database design as a star schema model Figure 2 Star Schema structure for database design In its initial state, the database consists of four tables with no relationships, shown in Figure 2 Star Schema structure for database design. Dimension: ProductSubCategory a Star Schema structure Dimension: Customer fact table Dimension: Territory Dimension: Date DimDate PK,I2 DateID I1 Date key Date Day MonthID MonthDescrition QuarterID QuarterDescription YearID YearDescription MonthlyTimeSpan QuarterTimeSpan MonthEndDate QuarterEndDate YearEndDate FullDate Day of the week Day number in month Day number overall Week number in year Week number overall Quarter FiscalPeriod Last day in the month flag EventKey Holiday Weekend I4 I5 I6 DimProductSubCategory DimTerritory PK,I1 PK,I2 TerritoryID FK1,I1 TerritoryRegion State Province City ProductID ProductSubCategoryName ProductCategoryName ProductName Brand ProductCategory ProductDescrition I3 DimCustomer PK CustomerKey MaritalStatus BirthDate Education Occupation AddressLine1 City StateProvince CountryRegionName PostalCode FirstName MiddleName LastName Phone hasDimProductSubCategoryFact Table / is of Fact Table hasDimTerritoryFact Table / is of PK PK,FK1,I1 PK,FK4,I4 PK,FK2,I2 PK,FK3,I3 PK ID CustomerKey TerritoryID DateID ProductID ProdSubCategory hasDimDateFact Table / is of hasDimCustomerFact Table / is of OrderValue Figure 3 Entity relationship model of fact and dimension tables The first steps in developing our dimension tables are to add surrogate keys to them. Figure 4 Starter Dimensions This basic starting point requires us to add surrogate keys to the dimension tables. The addition of surrogate keys Date Figure 5 Date table with the addition of surrogate keys DimDate PK,FK1 PK,I2 CustomerKey DateID I1 Date key Date Day MonthID MonthDescrition QuarterID QuarterDescription YearID YearDescription MonthlyTimeSpan QuarterTimeSpan MonthEndDate QuarterEndDate YearEndDate FullDate Day of the week Day number in month Day number overall Week number in year Week number overall Quarter FiscalPeriod Last day in the month flag EventKey Holiday Weekend ID ProdSubCategory I4 I5 I6 I3 FK1 FK1 We can see from the data table shown in Figure 5 Date table with the addition of surrogate keys, we have added a number of surrogate, or non-natural keys to the data dimension. ProductSubcategory Figure 6 ProductSubCategory dimension with additional surrogate keys DimProductSubCategory I1 ProductID ProductSubCategoryName ProductCategoryName ProductName Brand ProductCategory ProductDescrition The ProductSubCategory dimension with additional keys, as show in Figure 6 ProductSubCategory dimension with additional surrogate keys. Territory The territory dimension with additional surrogate keys, shown in Figure 7 The Territory Dimension. Figure 7 The Territory Dimension DimTerritory Customer I1 TerritoryID TerritoryRegion State Province City Figure 8 The Customer dimension DimCustomer PK CustomerKey MaritalStatus BirthDate Education Occupation AddressLine1 City StateProvince CountryRegionName PostalCode FirstName MiddleName LastName Phone Finally we have the customer dimension with its surrogate keys, as shown in Figure 8 The Customer dimension. Surrogate Keys A surrogate key is usually created to simplify the key structure. That is to say it is an artificial substitution for a natural primary key, data held within this key may change with time, known as a slowly changing dimension. These surrogate keys are neither intelligent nor business specific and most commonly system-generated. Their purpose is to ensure that rows are unique within a database table, typically they are numeric fields. Date Attributes Data mart solutions often include a role-playing dimension to record the passage of time; the “Date” dimension. It is necessary to provide additional attributes to this dimension other than just the date field, which would only allow users to view sales for a specific date, or specify a range of dates, this provides a limiting scope for data analysis. Adding fields such as: the day of the week, month name, fiscal calendar quarter or a public holiday flag fields enables much more flexibility. For example a user would be able to compare sales data for a particular day in a particular month to the same day in the previous year, or compare sales totals from one week to the next. This additional functionality allows business decision makers to generate considerably more useful conclusions from the data. Normalisation and de-normalisation The dimension tables of a data mart contain de-normalised data, whilst the fact tables will contain fully normalised data. Whilst this does not optimise the databases’ file size, it does increase the performance that queries on the data will execute at, often seen as the If the database was fully normalised – for example, if the customer table was normalised, additional slow lookups for marital status, profession and education would have to be performed. The fact tables are fully normalised; they contain only time-period data and the relevant surrogate key for each dimension. The dimensional tables are de-normalised, usually to the second normal form, and contain surrogate keys to facilitate query speeds. Comparing the terms star schema and snow-flake schema “Star and snowflake schema designs are mechanisms to separate facts and dimensions into separate tables. Snowflake schemas further separate the different levels of a hierarchy into separate tables.” (IBM, 2011) Snowflake Schema Star Schema Joins Higher number of Joins Fewer Joins Ease of Use More complex queries and hence less easy Less complex queries and easy to to understand understand More foreign keys-and hence more query Less no. of foreign keys and hence execution time lesser query execution time Ease of maintenance or No redundancy and hence more easy to Has redundant data and hence less easy change maintain and change to maintain/change Type of data warehouse Good to use for small data warehouses/data Has redundant data and hence less easy marts to maintain/change Query Performance Dimensional table It may have more than one dimension Contains only single dimension table for table for each dimension each dimension Dimension Table Normalisation 3 Normal Form 2 Normal De-normalised Form Table 1 Snowflake verses Star Schema Primary keys Foreign keys Fact tables Dimension Tables Star schemas Snowflake schemas Figure 9 Star schema model A hierarchy is a set of levels having many-to-one relationships between each other, and the set of levels, collectively makes up a dimension. In a relational database, the different levels of a hierarchy can be stored in a single table (as in Figure 9 Star schema model) or in separate tables (as in a snowflake schema). Many-to-one relationships Balanced and unbalanced hierarchies Ragged hierarchies Schemas in Data Warehouses Date Customer Fact Product Subcategory Territory Figure 10 AWISDM Data Mart as a Star Schema A star schema model consists of one or more fully normalised fact tables,(3NF), that reference any number of de-normalised dimension tables. The dimension tables feature redundant data in the most granular form and are in second normal form, (2NF). This increases the simplicity of the database and reduces the complexity of queries made upon it. A snow-flake schema is a variation on this and features fully normalised dimension tables, (3NF). This reduces the amount of file space needed to store the database and reduces the number of places data would need to be altered if an update is required; but this come at a cost which is a reduction of query performance. The facts that the data warehouse aids in analyse are classified along different dimensions. The fact table holds the main data. It includes a large amount of aggregated data, such as price and units sold. There may be multiple fact tables in a star schema. Dimension tables, which are usually smaller than fact tables, include the attributes that describe the facts. Most often we see a segregation of these dimensions into a separate table for each dimension. Dimension tables are be joined to the fact table(s) as needed. Dimension tables have a simple primary key, while fact tables have a set of foreign keys which make up a compound primary key consisting of a combination of relevant dimension keys. It is common for dimension tables to consolidate redundant data in the most granular column, and thus rendered in second normal form. Fact tables are usually in third normal form because all data depends on either one dimension or all of them, not on combinations of a few dimensions. Figure 11 The benefits of a Snowflake Schema The benefit of using the snowflake schema as shown in Figure 11 The benefits of a Snowflake Schema, is that the storage requirements are lower since the snowflake schema eliminates many duplicate values from the dimensions themselves. Figure 12 Snowflake Schema The main design scheme is star but the ProductCategory can be granulated by adding more depth through using a snowflake schema for this element of the design. Extending the granularity of the DimProductSubCategory dimension DimDate I2 I1 I4 I5 I6 I3 DateID Date key Date Day MonthID MonthDescrition QuarterID QuarterDescription YearID YearDescription MonthlyTimeSpan QuarterTimeSpan MonthEndDate QuarterEndDate YearEndDate FullDate Day of the week Day number in month Day number overall Week number in year Week number overall Quarter FiscalPeriod Last day in the month flag EventKey Holiday Weekend DimCustomer PK CustomerKey MaritalStatus BirthDate Education Occupation AddressLine1 City StateProvince CountryRegionName PostalCode FirstName MiddleName LastName Phone DimTerritory I1 TerritoryID TerritoryRegion State Province City hasDimCustomerFact Table / is of Fact Table PK PK,FK1,I1 PK,FK4,I5 PK,FK2,I2 PK,FK3,I4 PK hasDimDateFact Table / is of DimProductSubCategory I1 I3 ProductID hasDimProductSubCategoryFact Table / is of ProductSubCategoryName ProductCategoryName ProductName Brand ProductCategory ProductDescrition ID CustomerKey TerritoryID DateID ProductID ProdSubCategory OrderDate Birthdate OrderValue SalesAmount SalesTerritoty SalesRegion MS Addr1 City Country Postcode ProductSubCategoryName AWISProjDate TerritoryRegion FirstName LastName DimProductCategory has / is of PK ProductCategoryID I2 I1 FK1 ProductSubCategoryID ProductCategoryName ProductID Figure 13 AWISDM as Snowflake Schema hasDimTerritoryFact Table / is of Slowly Changing Dimensions This is a dimensional problem that occurs with time, the attributes for a record changes over time. This can be seen in particular in the Customer table of the AWISDM. For example, if the customer number was used as a natural key and a customer changes their address it would cause numerous problems. Firstly, if the record was simply updated it would not be possible to track where previous items had been delivered; it would be as though the customer had only a single address and had always lived there. By using a surrogate key, a new record for the customer can be added with the new address in the table and all new queries will use this new record; the old data remains intact. It is possible by creating a composite natural key (customer number and address), if not all parts of this natural key exist in the fact table it may not be possible to do a join on this new enlarged key – therefore requiring another ID field; the surrogate key. Although there is only one customer, they would be treated dimensionally as two separate customers. The problem of slowly changing dimensions is one that needs addressing to avoid the deletion of data that may be required for statistical analysis. For example: if the company wanted to analyse sales by a certain region and the customer’s address has changed, all sales made before the alteration would become assigned to the new region. Design the fact table Fact Table PK PK,FK1,I1 PK,FK4,I4 PK,FK2,I2 PK,FK3,I3 PK ID CustomerKey TerritoryID DateID ProductID ProdSubCategory OrderValue Figure 14 Fact Table DimDate PK,I2 DateID I1 Date key Date Day MonthID MonthDescrition QuarterID QuarterDescription YearID YearDescription MonthlyTimeSpan QuarterTimeSpan MonthEndDate QuarterEndDate YearEndDate FullDate Day of the week Day number in month Day number overall Week number in year Week number overall Quarter FiscalPeriod Last day in the month flag EventKey Holiday Weekend I4 I5 I6 DimProductSubCategory DimTerritory PK,I1 PK,I2 TerritoryID FK1,I1 TerritoryRegion State Province City ProductID ProductSubCategoryName ProductCategoryName ProductName Brand ProductCategory ProductDescrition I3 DimCustomer PK CustomerKey MaritalStatus BirthDate Education Occupation AddressLine1 City StateProvince CountryRegionName PostalCode FirstName MiddleName LastName Phone hasDimProductSubCategoryFact Table / is of Fact Table hasDimTerritoryFact Table / is of PK PK,FK1,I1 PK,FK4,I4 PK,FK2,I2 PK,FK3,I3 PK ID CustomerKey TerritoryID DateID ProductID ProdSubCategory hasDimDateFact Table / is of hasDimCustomerFact Table / is of OrderValue Figure 15 Entity Relationship Diagram for AWISDM Fact Table considerations and Granularity The fact table in the AWISDM requires a daily sales total for a customer in each product sub category and region. The ETL process is currently unaffected by the source granularity as the incoming data is all of the same level of detail. If one of the data sources listed its sales on an individual order basis rather than a daily total for example, this data would have to be aggregated by date to enable a courser level detail in the other data. All subsequent sales will need to at least match this granularity. If one of the data sources begins to produce a finer level of detail of data; for example grouping sales data by product category instead of sub-category, all of the data being imported to the data mart would have to be brought to the finer level of granularity to ensure consistency. Estimate the size of the Data Mart The fact table in the AWISDM currently has seven fields with an average field length of 38.9 bytes (1 field of 4 byte, 5 fields with a size of 52 bytes, and one of 8bytes). Assuming that there are 500,000 rows in the table this gives a total table size of 7 fields x 38.9 bytes/field x 500,000 rows = ≥ 136,150,000 bytes (≥ 129.84 Megabytes, or ≥ 132,959 Kilobytes) The dimension tables are much smaller and typically contain a fraction of the number of rows found in the fact table. The dimension tables in the AWISDM are of the following sizes, shown in Table 2 AWISDM Dimension Tables Sizes. Table Name Average Field Size 41.4 bytes DimCustomer 14.4 bytes DimDate 52 bytes DimTerritory 52 bytes DimProductSubCategory Field Count Row Count 14 27 5 7 18,484 1,158 10 37 File Size ≥ 10.22 Megabytes ≥ 0.43 Megabytes ≥ 2.54 Kilobytes ≥ 13.15 Kilobytes Table 2 AWISDM Dimension Tables Sizes As shown above, the Data mart fact table is exponentially larger than its dimension tables and will continue to grow at a much faster rate; as more and more sales data is added each day the dimension tables will remain relatively unchanged. Design, develop and save an SSIS Package Design an ETL Process to: Integrate data from the four data sources into a staging database Australia D1 UK D2 Sales data Australia UK Data flow Aj_AWISDM US Germany D3 D4 US Germany Figure 16 Concept design for integrating the four data sources The four data sources are uploaded into the transact process, for the UK sales a replacement takes place, replacing tyre with tire. All four data sources are then formatted to ensure that the data types are consistent throughout the process; the output from these conversions is then joined into a union all function to align the data being imported. See Figure 17 Data is imported from the four sources. Figure 17 Data is imported from the four sources AdventureWorks Ltd InternetSales Data Mart (AWISDM) US Sales compatability PK UK Sales schema PK,FK1 PK,FK1 ProdSubCategory OrderDate SalesRegion MS Birthdate Addr1 City Country Postcode SalesAmount CustomerKey ProductSubCategoryName AWISProjDate TerritoryRegion BirthDate FirstName MiddleName LastName SalesAmount australian sales compatibility CustomerKey ProdSubCategory OrderDate SalesTerritory OrderValue CustomerKey has / is of German sales has / is of Uk sales XML PK PK ProductSubCat OrderDate SalesRegion MS Birthdate US Sales Addr1 City Country Postcode SalesAmount Austrailian sales PK CustomerKey ProdSubCategory OrderDate SalesTerritory OrderValue US Sales CustomerKey PK CustomerKey ProductSubCategoryName AWISProjDate TerritoryRegion BirthDate FirstName LastName SalesAmount Tra n sa ct Ex tra ct Load Staging Database MS Access database Figure 18 the ETL Process CustomerKey ProdSubCategory OrderDate SalesRegion MS Birthdate Addr1 City Country Postcode SalesAmount Load data to the data mart Import Australia Sales Compatibility Import Australia Sales Import German Sales Import UK Sales UK Import US Sales Compatibility Import US Sales Germany US Austrailia Consolidate and clean UK detail Consolidate and clean Australia detail UK Prime Australia Prime Consolidate and clean German detail Consolidate and clean US detail German Prime US Prime Aj_AWISDM Starter Dimensions Figure 19 Loading the data into the Data Mart The key features of the ETL process The key features of the ETL process are in the acronym, we firstly extract the data from the disparate data sources, we transact this data by converting the data types and then error check the data, and finally we load the data to the data mart. Tasks designed to validate the data Start Read data from data stores Australia Data Store German Data Store Is data validated UK Data Store US Data Store No Figure 20 A flow diagram to deal with error checking Figure 21 Data flow to check for and correct errors Figure 22 Output to Fact Table Business Intelligence Applications. Figure 23 BI Platform As we can see from Figure 23 BI Platform, the Business Integration system is a collection of Microsoft SQL Server Services. The BI model is used to draw data from disparate sources and present a single unified view of the data in the form of a data mart. The BI toolkit allows for complex and demanding data process, such as data conversions and error validation to be undertaken simply. The output data mart would then be interrogated by the user utilising the Microsoft Office suite of applications, some of which are shown in Figure 24 Microsoft Office Integration. visio doc Excel Word PPT mdb, accdb xls XML Project Access Outlook Figure 24 Microsoft Office Integration The use and application of the Office suite provides business decision makers with a familiar interface from which to build questions and queries of the data mart. Slicing and Dicing the Cube We start with a data point D1 Figure 25 Single Data Point We then build a Figure 26 one Dimension of Data Extension of data point • Customer ID Region ID Product ID Sales Value Date ID D1 D2 • • • • • Customer ID Region ID Product ID Sales Value Date ID • • • • • Customer ID Region ID Product ID Sales Value Date ID D3 Figure 27 Two Dimensions of Data Then we constrict a Cube, an example of this would be: Cubic on Date X Date Y Customer Z Product Cubic on Amount X Customer Y Product Z Amount Figure 28 Three Dimensional Data Finally we build our cube Date Geographic Product Customer • Date ID • Date • Region • Country • Product ID • Product Name • Customer ID • Customer Name Reports Reports are simply to produce utilising Excel, Access or even Word. Ad hoc Queries Ad-hoc queries become simple to create as the star schema data mart is specifically designed with these in mind. OLAP As already stated a number of times (Saeed, 2011), “in a Data Warehouse decision support environment we are interested in the big picture”, we want to look at the data from a macro level instead of the micro level. For a macroscopic view aggregates are used. In this example we look at the sales volume i.e. number of items sold as a function of: o product o time o geography Note that all three of them are dimensions. The proceed with the analysis, a cube structure will be first created such that each dimension of the cube will correspond to each identified dimension, and within each dimension will be the corresponding hierarchy. The example further shows how the dimensions are “rolled-out” i.e. Province into divisions, then division into district, then district into city and finally cities into zones. Note that weeks could be rolled into a year and at the same time months can be rolled into quarters and quarters rolled into years. Based on these three dimensions a cube is created and shown in Cube Operations Rollup: summarize data e.g., given sales data, summarize sales for last year by product category and region Drill down: get more details e.g., given summarized sales as above, find breakup of sales by city within each region, or within province or county Slice and dice: select and project e.g.: Sales of soft-bicycles in Chester during last quarter Pivot: change the view of data How we view the data stored in a cube. There are four fundamental cubes operations which are: (i) (ii) (iii) (iv) rollup drill down slice and dice pivoting Data Mining Knowledge-Discovery and Data Mining, is the process of automatically searching large volumes of data for patterns using tools such as classification, association rule mining, clustering, etc. Data Mining commonly involves four classes of task: (i) (ii) (iii) (iv) Classification Arranges the data into predefined groups. For example an email program might attempt to classify an email as legitimate or spam. Common algorithms include Nearest neighbor, Naive Bayes classifier and Neural network. Clustering Is like classification but the groups are not predefined, so the algorithm will try to group similar items together. Regression Attempts to find a function which models the data with the least error. A common method is to use Genetic Programming. Association rule learning Searches for relationships between variables. For example a store might gather data of what each customer buys. Using association rule learning, AdventureWorks Ltd can work out what products are frequently bought together, which is useful for marketing purposes. This is sometimes referred to as “market basket analysis”. Decision Support Knowledge Cloud Cloud Cloud In te r pr e ta tio n Cloud A1 C1 Mo de l D1 Co ns tru c tio n Patterns B1 D C A B Processed Data Original Data Target Data D1 Austrailia D2 Germany D3 UK D4 US e nt aI t Da n io at gr & c le se n tio Figure 29 Decision Support Pr ce ro e-p ng ssi