Data Warehousing Course Outline for Intro to Business Intelligence Unit 1 by Don Krapohl

• Overview
  – Basic Definitions
  – Normalization
    • Entity Relationship Diagrams (ERDs)
    • Normal Forms
    • Many-to-Many relationships
  – Warehouse Considerations
    • Dimension Tables
    • Fact Tables
    • Star Schema
    • Snowflake Schema
  – Further Warehouse Design Considerations
    • Changing Dimensions
    • Conformed Dimensions

• Data warehouse
  – A data warehouse is a copy of transaction data specifically structured for querying and reporting.
  – “a collection of computerized data that is organized to most optimally support reporting and analysis activity”

• OLTP - On-Line Transaction Processing
  – OLTP describes a type of processing that databases are designed to support.
  – OLTP applications need to support a high number of transactions per unit of time.
  – A transaction is a set of Insert, Update, and sometimes Delete statements that must succeed or fail as a unit. Transactions typically perform such functions as recording orders, depleting inventory, etc.
  – Electronic banking and order processing are common OLTP applications.

• OLAP - On-Line Analytical Processing
  – In its broadest usage, the term “OLAP” is a synonym for “data warehousing”.
  – The term “On-Line Analytical Processing” was coined to distinguish data warehousing activities from On-Line Transaction Processing.
  – In a narrower usage, OLAP refers to the tools used for multidimensional analysis…

• Sample Star Schema
  – …but when people speak of OLAP they may properly be referring to a schema like this one in a relational database.

• Database Normalization
  – Normalization reduces redundant data storage by organizing data efficiently.
  – There are many ways to normalize a database consistently within a set of business requirements.
  – Normalization reduces the potential for anomalies during data manipulation operations.
  – Non-normalized databases are vulnerable to data anomalies when they store data redundantly.
    • If data is stored in two locations but is later updated in only one of them, the data becomes inconsistent; this is referred to as an update anomaly.
    • To avoid data anomalies, non-primary-key data in a normalized database are stored in only one location.
  – If you need a Department’s physical location, you should only need to look in the Department table.

• Unnormalized Table
  – We could design a database so that each record about a specific type of business object holds all the information we’d typically need about objects of that type.

• But this schema is more typical of a normalized database.
  – We could generate the information on the previous page with this query:

    Select e.EmployeeID, e.LastName, e.FirstName, d.DeptID, d.Name, d.Location
    From Department d
    Inner Join Employee e On d.DeptID = e.DeptID

  – When we normalize, we’re building a logical hierarchy.

• Entities
  – Classes of objects that are of interest from a business standpoint, about which information needs to be maintained.
  – In the process of modeling, they evolve into database tables.
  – Entities are always nouns in business narratives (but not all nouns in business narratives are entities).
    • Examples: Employee, Department, Project
  – Entities must have attributes, or properties, that need to be known; these become columns.
    • Employee: Name, Birth Date, Salary
    • Department: Name, Number, Location
  – Each entity represents a class of objects, and each instance of a well-formed entity will map to a row in a table (a hypothetical DDL sketch follows below).
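As a minimal sketch of how these two entities could be realized as tables, here is hypothetical DDL consistent with the query above. The column names are taken from that query; the data types and constraint style are assumptions, not part of the original slides:

    -- Department: physical location is stored here, and only here.
    Create Table Department (
        DeptID   Int         Primary Key,
        Name     VarChar(50) Not Null,
        Location VarChar(50) Not Null
    );

    -- Employee: each attribute becomes a column; DeptID links each
    -- employee to exactly one department.
    Create Table Employee (
        EmployeeID Int         Primary Key,
        LastName   VarChar(50) Not Null,
        FirstName  VarChar(50) Not Null,
        DeptID     Int         Not Null References Department (DeptID)
    );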
• Each instance of an entity must be uniquely distinguishable from other instances of the same entity.
  – An attribute or set of attributes that uniquely identifies an entity is called a Unique Identifier (UID).

• Relationship
  – A bi-directional, significant association between two entities, or between an entity and itself.
  – Each (direction of a) relationship has:
    • Name
    • Optionality: either “must be” or “may be”
    • Degree/Cardinality/Ordinality: 1:1 or 1:M (or M:M)
      » Degree = 0 is expressed as “may be.”
  – Each employee must be assigned to one and only one department.
  – Each department may be responsible for one or more employees.
  – Our definitions for entities, attributes, and relationships must have equal validity for each instance, not just for the normal case. This point is critically important.

• First Normal Form
  – Each cell, the intersection of a row and a column, can contain only one value.
  – 1NF requires that each attribute store only one value.
  – There can be no repeating groups (= no “multivalued attributes”).
  – Each attribute of the table is said to be “atomic”.
  – For example, each record in the Home table below should have only one owner. (Note the primary key convention.)
  – Unnormalized entity: What if some homes have more than three owners? How would we write stored procs to read from this table?
  – To support multiple owners we need another entity.
    • This will always be the case when an entity has a 1:M relationship with one of its attributes.
  – Both entities are now in 1NF.

• Second Normal Form
  – To be in 2NF, a table must be in 1NF.
  – In addition, each non-key attribute must depend on all parts of its primary key.
    • There must be no “partial key dependencies”.
  – In the previous example:
    • The Home entity is not in 2NF.
      » The Mayor attribute doesn’t depend on the entire primary key.
      » We need a new entity.
    • The Owner entity is not in 2NF.
      » The Price of Tea does not depend on the Owner.
      » We decide not to track this attribute.
  – In normalizing to 2NF, we attempt to reduce the amount of redundant data in a table by extracting it, placing it in new tables, and creating relationships between those tables.
  – The tables are now in 2NF.

• Third Normal Form
  – To be in 3NF, a table must be in 2NF.
  – Additionally, all attributes that are not wholly dependent upon the primary key must be remodeled.
    • Each table attribute can depend on nothing other than its primary key.
    • 3NF = “Every non-key attribute must depend on the key, the whole key, and nothing but the key.”
  – In the previous example, Sun Sign depends on Birth Date, so it should be stored in a different table.
  – A general modeling principle we see here: when an attribute depends on another attribute, a new table is necessary to model the relationship.
  – The entities are now in 3NF.

• Modeling the M:M relationship
  – How do we record the owners of individual homes?
  – We need an intermediate table that has a M:1 relationship with each of its parent tables.
  – The query below shows the name of each home’s owner(s).

• General Remarks
  – The definitions of normal forms provide guidelines for relational database design. Occasionally it is necessary to stray from them to meet practical business requirements in an OLTP environment.
  – There is no single best way to normalize a database to conform to a specific set of business requirements.
  – Insert, Update, and Delete operations run more quickly in a normalized database.
  – Complex Select statements run more slowly.
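A minimal sketch of the M:M model and the owners query described above. The Home, Owner, and HomeOwner table and column names are illustrative assumptions, not from the original slides:

    Create Table Home  (HomeID  Int Primary Key, Address VarChar(100));
    Create Table Owner (OwnerID Int Primary Key, Name    VarChar(50));

    -- The intermediate table has a M:1 relationship with each parent table.
    Create Table HomeOwner (
        HomeID  Int References Home  (HomeID),
        OwnerID Int References Owner (OwnerID),
        Primary Key (HomeID, OwnerID)
    );

    -- The name of each home's owner(s):
    Select h.Address, o.Name
    From Home h
    Inner Join HomeOwner ho On ho.HomeID = h.HomeID
    Inner Join Owner o On o.OwnerID = ho.OwnerID;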
• Reasons to denormalize
  – The fundamental reason to denormalize is to improve query performance.
  – Consider the case of City, State, and CityStateZip tables.
    • These tables can be designed to conform to the third normal form.
    • But each time you need to write a query to extract Customer data, you will need to join data from four tables.
  – If no valid business reason exists to divide city, state, and ZIP Code information into separate tables, then it may make sense to denormalize.
  – Dimension tables in a star schema are intentionally denormalized.

• Normalized database:
  – Many “narrow” tables (i.e., fewer columns)
  – Optimized for Insert, Update, and Delete operations
  – Slower Select statements because of the need for frequent join operations
  – Few indexes
  – Necessary for large OLTP applications

• Non-normalized database:
  – Fewer (but “wider”) tables
  – Faster Select statements because we don’t need to join as often
  – Transactions are more problematic because of the need to maintain redundant instances of data during Insert, Update, and Delete operations
  – Many indexes because data is relatively static
  – Necessary for large relational OLAP applications

• Data Warehouses
  – Data warehouses and data marts are storage mechanisms for read-only, historical, aggregated data.
  – Consider this example: we sell two products, dog food and cat food. Each day, we record the sales of each product. Here is some sample OLTP data for a couple of days:
  – Our data warehouse would usually not record this level of detail.
    • Instead, in a warehouse we would summarize, or aggregate, the data to daily totals. Our records in the data warehouse might look something like this:
    • Here we have reduced the number of records by aggregating the individual transaction records into daily records that show the number of each product purchased each day.
    • We can certainly generate this data set from the OLTP system by running a query (a sketch follows at the end of this slide)…
  – …but if we want to view our data as aggregated numbers broken down along a series of criteria (so-called “by” conditions), then query performance will improve if we store data in a denormalized format.
    • That’s exactly what we do when implementing a star schema.
  – It’s important to realize that OLTP is not meant to be the basis of a decision support system. OLTP applications are optimized for activities such as recording (high numbers of) orders, etc.
  – A system optimized for processing transactions is not optimized to perform complex analyses designed to uncover hidden trends.
  – Therefore, rather than tie up our OLTP system with expensive queries, we should build a less normalized structure that conforms better to our query needs.

• The Warehouse
  – Typical business questions that drive warehouse design:
    • How many units did we sell last week?
    • Are overall sales of individual products or product categories higher or lower this year than in previous years?
    • On a quarterly or monthly basis, are sales for some products/categories cyclical?
    • In what regions are sales down this year? What products/categories in those regions account for the greatest percentage of the decrease?
  – Some characteristics of warehouse business questions:
    • Many concern the element of time.
    • Many require the aggregation of data; sums and counts are important in an OLAP environment, whereas individual transactions are important in an OLTP environment.
    • Each question looks at data in terms of “by” conditions.
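To make the aggregation step concrete, here is a hedged sketch of the query that would roll the OLTP transaction records up into the daily totals described above. The OLTP table name and columns (Sales, Product, Quantity, SaleDate) are assumptions for illustration:

    -- Aggregate individual transactions into one record per product per day.
    Select Product, SaleDate, Sum(Quantity) As UnitsSold
    From Sales
    Group By Product, SaleDate;

This is itself a “by” condition query: units sold by product, by day. Running it once during a warehouse load is cheap; running queries like it constantly against the OLTP system is what the star schema lets us avoid.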
  – “On a quarterly and then monthly basis, are Dairy Product sales cyclical?” = “We need to see total sales of Dairy Products by quarter and by month.”
  – These “by” conditions drive the design of our star schema. Each “by” condition is represented by a Dimension table.

• Dimension Tables - General Remarks
  – Product and Geography are common dimensions.
  – Date/Time information is almost always stored in a Dimension table.
  – If our data happen to start on a particular date, do we care what sales have been since that date, or do we care more about how one year’s sales compare to other years’?
    • Comparing one year to another is a common form of trend analysis accomplished through the use of a star schema.

• Dimension Table Structure
  – Dimension tables should have a single-field primary key. This key is often an identity column.
    • The value of the primary key is irrelevant; our information is stored in the other fields in the table.
  – Because those fields hold full descriptions, dimension tables are often “wide”, i.e., they contain many large fields. For example, a Product dimension will have fields containing the description, the category name, the sub-category name, etc. These fields do not contain codes that link us to other tables.
  – Dimension tables are often small in terms of row count relative to Fact tables.

• Dimensional Hierarchies (Denormalization)
  – In a star schema, the entire hierarchy for a dimension is stored in its corresponding Dimension table in the data warehouse.
  – The Product dimension, for example, contains individual products.
    • Products are normally grouped into categories, and these categories may contain sub-categories.
    • For example, a product with product number M1652 may be a refrigerator. Thus it belongs in the Major Appliance category and in the Refrigerator sub-category.
    • We may have more levels of sub-categories to further classify each product.
  – In an OLAP environment, it is preferable to maintain the product hierarchy in a single table, although this hierarchy would certainly be distributed among Product, Category, and SubCategory tables in an OLTP environment.
  – This hierarchy allows us to perform “drill-down” functions on the data. We can run a query that sums by category, then drill down into that category by calculating sums for its subcategories, and then calculate the sums for the individual products in a particular subcategory (see the sketch after this slide).
    • The actual sums we are calculating are based on numbers stored in the fact table.

• Fact Tables
  – When we talk about the way we want to look at data, we usually want to see some sort of aggregated data. These data are called measures.
  – Measures are numeric values that are measurable and additive.
    • Sales dollars are a very common measure.
    • The number of customers we have is also a typical measure.
    • We’d probably track both of these by day.
  – Fact tables are used to store measures, or facts, which are numeric and additive across some or all dimensions.
  – In the following star schema, sales dollars are numeric, and we can examine total sales in terms of product, category, and time period.
  – Fact tables are “narrow” in the sense that they contain few (and numeric) columns, but they contain large numbers of rows.
    • Fact tables are responsible for most of the disk space used in a warehouse.

• Fact Table Granularity
  – Granularity refers to the level of detail in a fact table and is one of the most important design decisions in data warehouse planning.
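A minimal sketch of the drill-down described above, reusing the SalesFact and ProductDimension names from the sample schema on the next slide; the SubCategory column and the literal category value are assumptions:

    -- Level 1: total sales by category.
    Select pd.Category, Sum(sf.SalesDollars) As TotalSales
    From SalesFact sf
    Inner Join ProductDimension pd On pd.ProductID = sf.ProductID
    Group By pd.Category;

    -- Level 2: drill down into one category by summing its subcategories.
    Select pd.SubCategory, Sum(sf.SalesDollars) As TotalSales
    From SalesFact sf
    Inner Join ProductDimension pd On pd.ProductID = sf.ProductID
    Where pd.Category = 'Major Appliance'
    Group By pd.SubCategory;

Because the whole hierarchy lives in the one denormalized ProductDimension table, each level of drill-down changes only the Where and Group By clauses; no additional joins are needed.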
• Granularity is often determined by the time dimension.
  – For example, you may elect to store only weekly or monthly totals for sales dollars.
• Granularity determines how far we can drill down without recourse to the source OLTP data.
  – Many if not most OLAP systems have daily grain in the Time dimension.
• Selecting a finer grain results in more records in the fact table.
• Choose data types for fact table columns that keep the table as small as possible.

• Aggregations
  – Fact table data consists of aggregations based on the fact table’s granularity.
  – Frequently we’ll want to aggregate to a higher level.
    • We may choose to keep total sales dollars at a quarterly or monthly level.
    • We may be interested in only a particular product or category in this case.
  – A better alternative is to build a cube structure…

• Simple Star Schema
  – To obtain total sales for all major appliances during March of 1999:

    Select Sum(sf.SalesDollars) As TotalSales
    From SalesFact sf
    Inner Join TimeDimension td On td.TimeID = sf.TimeID
    Inner Join ProductDimension pd On pd.ProductID = sf.ProductID
    Where pd.Category = 'Major Appliance'
      And td.Month = 3
      And td.Year = 1999

• Snowflake Schemas
  – Sometimes dimension tables have their hierarchies broken out into separate tables. This results in a different schema type known as a snowflake.
    • This is a more normalized structure, but it leads to more difficult queries and slower response times.
    • It does conserve more disk space than a star schema containing the same data.
  – Graphical comparison of Star and Snowflake schemas (diagrams: Star Schema, Snowflake Schema)

• Further Warehouse Design Considerations - Changing Dimensions
  – In the schema below, consider a scenario in which we have realigned some of our stores, placing them in different territories and regions.
  – In the StoreDimension table, we have each store in a particular region, territory, and zone.
  – If we simply update the StoreDimension table with new territory/region information and then examine historical sales for a region, the numbers will no longer be accurate.
  – To address this issue, consider creating new records for the affected stores (a sketch follows this slide).
    • Each new record contains the store’s new region but leaves the old store records intact, along with the old regional sales data.
    • This approach, however, prevents us from comparing a store’s current sales to its historical sales unless we keep track of its previous StoreID. This may require an extra field called PreviousStoreID or something similar.
  – There are no right and wrong answers; each case may require a different solution.

• Conformed Dimensions
  – When building an enterprise warehouse from local data marts:
    • It is necessary to produce a set of conformed dimensions.
    • It will also be necessary to standardize the definitions of facts.
  – A conformed dimension is a dimension that means the same thing with every possible fact table to which it can be joined.
  – Generally, this means that a conformed dimension is identical in each data mart.
  – The conformed Product dimension is the enterprise’s agreed-upon master list of products, including all product attributes and all product rollups such as category, subcategory, and department.
  – The conformed Calendar dimension will almost always be a table of individual days, spanning a decade or more. Each day will have many useful attributes drawn from the legal calendars of the various states and countries the enterprise deals with, as well as special fiscal calendar periods and marketing seasons relevant only to internal managers.
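A hedged sketch of the new-record approach to the changing Store dimension described above. The StoreDimension columns, the PreviousStoreID field, and all literal values are assumptions for illustration:

    -- Realignment: add a new dimension record rather than updating the old one.
    -- Old facts keep pointing at StoreID 1017, so historical regional totals stay accurate.
    Insert Into StoreDimension (StoreID, StoreName, Zone, Territory, Region, PreviousStoreID)
    Values (2041, 'Store 17', 'Zone 3', 'Territory 9', 'Northeast', 1017);

    -- Comparing the store's full history to its current sales means
    -- following PreviousStoreID:
    Select Sum(sf.SalesDollars) As AllTimeSales
    From SalesFact sf
    Where sf.StoreID = 2041
       Or sf.StoreID = (Select PreviousStoreID
                        From StoreDimension
                        Where StoreID = 2041);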
  – Most conformed dimensions will naturally be defined at the most granular level possible.
    • The grain of the Customer dimension will be the individual customer.

• Simplified Star Schema with Conformed Dimensions

• Permissible Variations of Conformed Dimensions
  – It is possible to create a subset of a conformed dimension table for certain data marts if you know that the domain of the associated fact table contains only that subset.
  – For example, the Product table for a specific data mart may be restricted to include only those products manufactured at that location, if the data mart in question pertains to that location only. (A sketch of this appears after the links below.)

• Links:
  – Wikipedia page on normalization
  – Databases.About.Com page on normalization
  – MSDN Glossary
  – Oracle-specific site where I got some schema diagrams
  – Ralph Kimball's Data Warehousing site
  – Kimball on Fact and Dimension Tables
  – BI and Data Warehouse Glossary
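Returning to the permissible-variations point above: a minimal sketch of a subset dimension for a location-specific data mart, assuming the conformed ProductDimension carries a hypothetical ManufacturingPlant attribute:

    -- The plant-level data mart sees only its own products; the rows it does
    -- see are identical to the enterprise's conformed Product dimension.
    Create View PlantProductDimension As
    Select ProductID, ProductName, Category, SubCategory, Department
    From ProductDimension
    Where ManufacturingPlant = 'Plant 42';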