19.09.15

Data from a DW

Data Warehouse: Subject-Oriented
■ Organized around major subjects, such as customer, product, sales
■ Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
■ Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process

Data Warehouse: Integrated
■ Constructed by integrating multiple, heterogeneous data sources
  • Relational databases, flat files, on-line transaction records
■ Data cleaning and data integration techniques are applied
  • Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
  • E.g., hotel price: currency, tax, breakfast covered, etc.
■ When data is moved to the warehouse, it is converted

Data Warehouse: Nonvolatile
■ A physically separate store of data transformed from the operational environment
■ Operational update of data does not occur in the data warehouse environment
  • Does not require transaction processing, recovery, and concurrency control mechanisms
  • Requires only two operations in data accessing: initial loading of data and access of data

What is OLAP?
■ The term OLAP ("online analytical processing") was coined in a white paper written for Arbor Software Corp. in 1993
■ Interactive process of creating, managing, analyzing and reporting on data
■ Analyzing large quantities of data in real time

OLTP vs. OLAP

                     OLTP                                   OLAP
users                clerk, IT professional                 knowledge worker
function             day to day operations                  decision support
DB design            application-oriented                   subject-oriented
data                 current, up-to-date, detailed,         historical, summarized, multidimensional,
                     flat relational, isolated              integrated, consolidated
usage                repetitive                             ad-hoc
access               read/write, index/hash on prim. key    lots of scans
unit of work         short, simple transaction              complex query
# records accessed   tens                                   millions
# users              thousands                              hundreds
DB size              100MB-GB                               100GB-TB
metric               transaction throughput                 query throughput, response

Conceptual Modeling of Data Warehouses
■ Modeling data warehouses: dimensions & measures instead of the relational model
■ Subject-oriented, facilitates on-line data analysis
■ Most popular model is the multidimensional model
■ Most common modeling paradigm: star schema
  • The data warehouse contains a large central table (fact table), which contains the data without redundancy
  • A set of dimension tables (one for each dimension)

■ For two dimensions
  • Spreadsheet (Excel) with spreadsheet formula calculations
■ For more than two dimensions
  • We will require several spreadsheet tables
  • -> Data explosion
■ We will look for one "Excel" table with several dimensions
■ How do we represent an Excel table in a computer?
■ Multidimensional model
■ For Excel, two dimensions (pointers to the data) and the data itself

Example of Star Schema

Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Dimension tables:
  time:     time_key, day, day_of_the_week, month, quarter, year
  item:     item_key, item_name, brand, type, supplier_type
  branch:   branch_key, branch_name, branch_type
  location: location_key, street, city, state_or_province, country

Snowflake schema
■ Snowflake schema: a refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake

Example of Snowflake Schema

Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Dimension tables:
  time:     time_key, day, day_of_the_week, month, quarter, year
  item:     item_key, item_name, brand, type, supplier_key
  supplier: supplier_key, supplier_type
  branch:   branch_key, branch_name, branch_type
  location: location_key, street, city_key
  city:     city_key, city, state_or_province, country

Fact constellations
■ Fact constellations: multiple fact tables share dimension tables; viewed as a collection of stars, therefore called a galaxy schema or fact constellation

Example of Fact Constellation

Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
Dimension tables:
  time:     time_key, day, day_of_the_week, month, quarter, year
  item:     item_key, item_name, brand, type, supplier_type
  branch:   branch_key, branch_name, branch_type
  location: location_key, street, city, province_or_state, country
  shipper:  shipper_key, shipper_name, location_key, shipper_type

OLAP
■ Data is perceived and manipulated as though it were stored in a "multidimensional array"
■ Ideas are explained in terms of conventional SQL-styled tables

Data aggregation
■ Data can be aggregated in many different ways
■ The number of possible groupings quickly becomes large
■ The user has to consider all groupings
■ Analytical processing problem

Queries for the suppliers-and-parts database
1) Get the total shipment quantity
2) Get total shipment quantities by supplier
3) Get total shipment quantities by part
4) Get the shipment quantities by supplier and part

SP
  S#  P#  QTY
  S1  P1  300
  S1  P2  200
  S2  P1  300
  S2  P2  400
  S3  P2  200
  S4  P2  200

1.  SELECT SUM(QTY) AS TOTQTY
    FROM SP
    GROUP BY () ;

  TOTQTY
  1600

2.  SELECT S#, SUM(QTY) AS TOTQTY
    FROM SP
    GROUP BY (S#) ;

  S#  TOTQTY
  S1  500
  S2  700
  S3  200
  S4  200

3.  SELECT P#, SUM(QTY) AS TOTQTY
    FROM SP
    GROUP BY (P#) ;

  P#  TOTQTY
  P1  600
  P2  1000

4.  SELECT S#, P#, SUM(QTY) AS TOTQTY
    FROM SP
    GROUP BY (S#, P#) ;

  S#  P#  TOTQTY
  S1  P1  300
  S1  P2  200
  S2  P1  300
  S2  P2  400
  S3  P2  200
  S4  P2  200

Drawbacks
■ Formulating so many similar but distinct queries is tedious
■ Executing the queries is expensive
■ Make life easier, with more efficient computation
■ Single query
■ GROUPING SETS, ROLLUP, CUBE options
■ Added to the SQL standard in 1999

GROUPING SETS
■ Execute several queries simultaneously

    SELECT S#, P#, SUM(QTY) AS TOTQTY
    FROM SP
    GROUP BY GROUPING SETS ( (S#), (P#) ) ;

  S#    P#    TOTQTY
  S1    null  500
  S2    null  700
  S3    null  200
  S4    null  200
  null  P1    600
  null  P2    1000

■ Single results table
■ Not a relation!
■ null -> missing information

    SELECT CASE GROUPING (S#) WHEN 1 THEN '??' ELSE S# END AS S#,
           CASE GROUPING (P#) WHEN 1 THEN '!!' ELSE P# END AS P#,
           SUM(QTY) AS TOTQTY
    FROM SP
    GROUP BY GROUPING SETS ( (S#), (P#) ) ;

  S#  P#  TOTQTY
  S1  !!  500
  S2  !!  700
  S3  !!  200
  S4  !!  200
  ??  P1  600
  ??  P2  1000

ROLLUP

    SELECT S#, P#, SUM(QTY) AS TOTQTY
    FROM SP
    GROUP BY ROLLUP (S#, P#) ;

  S#    P#    TOTQTY
  S1    P1    300
  S1    P2    200
  S2    P1    300
  S2    P2    400
  S3    P2    200
  S4    P2    200
  S1    null  500
  S2    null  700
  S3    null  200
  S4    null  200
  null  null  1600

Equivalent to GROUP BY GROUPING SETS ( (S#, P#), (S#), () )

ROLLUP
■ The quantities have been "rolled up" for each supplier
■ Rolled up "along the supplier dimension"
■ GROUP BY ROLLUP (A, B, ..., Z) groups by
  (A, B, ..., Z)
  (A, B, ...)
  (A, B)
  (A)
  ()
■ GROUP BY ROLLUP (A, B) is not symmetric in A and B!

CUBE

    SELECT S#, P#, SUM(QTY) AS TOTQTY
    FROM SP
    GROUP BY CUBE (S#, P#) ;

  S#    P#    TOTQTY
  S1    P1    300
  S1    P2    200
  S2    P1    300
  S2    P2    400
  S3    P2    200
  S4    P2    200
  S1    null  500
  S2    null  700
  S3    null  200
  S4    null  200
  null  P1    600
  null  P2    1000
  null  null  1600

Equivalent to GROUP BY GROUPING SETS ( (S#, P#), (S#), (P#), () )

Cross Tabulations
■ Display query results as cross tabulations
■ More readable way
■ Formatted as a simple array
■ Example: two dimensions (supplier and parts)

         P1    P2    Total
  S1     300   200   500
  S2     300   400   700
  S3     0     200   200
  S4     0     200   200
  Total  600   1000  1600

CUBE
■ Confusing term CUBE (?)
■ Derived from the fact that in multidimensional terminology, data values are stored in cells of a multidimensional array or a hypercube
  • The actual physical storage may differ
■ In our example
  • The cube has just two dimensions (supplier, part)
  • The two dimensions have different sizes (the array is a rectangle, not a square)
■ Means "group" by all possible subsets of the set {A, B, ..., Z}
■ $M = \{A, B, \ldots, Z\}$, $|M| = N$
■ Power set (algebra): $P(M) := \{U \mid U \subseteq M\}$, $|P(M)| = 2^N$
  • Proof by induction
■ Each subset represents a different degree of summarization
■ Data mining: such a subset is called a cuboid

■ For a cube with n dimensions, there are $2^n$ cuboids in total
■ A cube operator was first proposed by Gray et al., 1997:
  • J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, H. Pirahesh: Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals. Data Mining and Knowledge Discovery 1(1), 1997, 29-53. http://research.microsoft.com/~Gray/
■ For three dimensions the total number of data cuboids is $2^3 = 8$
  • {(city, item, year), (city, item), (city, year), (item, year), (city), (item), (year), ()}
  • (): the dimensions are not grouped
■ These group-bys form a lattice of cuboids for the data cube
■ The base cuboid contains all three dimensions

■ Hasse diagram (Helmut Hasse, 1898-1979, did fundamental work in algebra and number theory)

  ()
  (city)  (item)  (year)
  (city, item)  (city, year)  (item, year)
  (city, item, year)

Cuboid (Data Mining Definition)
■ Names in data warehousing literature:
  • The n-D cuboid, which holds the lowest level of summarization, is called the base cuboid .. {{A},{B},..}
  • The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid .. {∅}
■ The lattice of cuboids forms a data cube

Cube: A Lattice of Cuboids (Power Set)

  0-D (apex) cuboid:  all
  1-D cuboids:        time; item; location; supplier
  2-D cuboids:        time, item; time, location; time, supplier; item, location; item, supplier; location, supplier
  3-D cuboids:        time, item, location; time, item, supplier; time, location, supplier; item, location, supplier
  4-D (base) cuboid:  time, item, location, supplier

  $2^4 = 16$ cuboids

Hierarchies
■ Independent variables are often related in hierarchies (taxonomy)
  • Determine ways in which dependent data can be aggregated
■ Temporal hierarchy
  • Seconds, minutes, hours, days, weeks, months, years
■ Same data can be aggregated in many different ways
■ Same independent variable can belong to different hierarchies

Hierarchy - Location
[Figure: location hierarchy with the levels all > region > country > city > office. Regions: Europe, North_America. Countries: Germany, Spain, Canada, Mexico. Cities: Frankfurt (Germany), Vancouver and Toronto (Canada). Offices: L. Chan, M. Wind.]

Storage space may explode...
■ If there are no hierarchies, the total number of cuboids for an n-dimensional cube is $2^n$
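The statement that CUBE groups by every subset of the dimensions, which gives 2^n cuboids, can be checked on the SP table from the earlier slides. The following is a minimal Python sketch, not how a database engine computes a cube; the function names (`grouping_sets_for_cube`, `aggregate`, `cube`) are hypothetical, and `None` stands in for the SQL null of a dimension that is not grouped.

```python
from collections import defaultdict
from itertools import combinations

# Shipments table SP from the slides: (S#, P#, QTY).
SP = [
    ("S1", "P1", 300), ("S1", "P2", 200),
    ("S2", "P1", 300), ("S2", "P2", 400),
    ("S3", "P2", 200), ("S4", "P2", 200),
]
DIMENSIONS = ("S#", "P#")


def grouping_sets_for_cube(dims):
    """Return every subset of dims, largest first: one grouping set per cuboid."""
    return [subset
            for size in range(len(dims), -1, -1)
            for subset in combinations(dims, size)]


def aggregate(rows, dims, group_by):
    """Sum QTY per group; dimensions outside group_by are reported as None."""
    totals = defaultdict(int)
    for *values, qty in rows:
        key = tuple(v if d in group_by else None for d, v in zip(dims, values))
        totals[key] += qty
    return dict(totals)


def cube(rows, dims):
    """Union of the aggregates over all grouping sets (assumes no None in the data)."""
    result = {}
    for group_by in grouping_sets_for_cube(dims):
        result.update(aggregate(rows, dims, group_by))
    return result


cuboids = grouping_sets_for_cube(DIMENSIONS)
totals = cube(SP, DIMENSIONS)

for key, qty in totals.items():
    print(key, qty)
```

With two dimensions the sketch produces four grouping sets and the same 13 rows as the CUBE result table; calling `grouping_sets_for_cube(("city", "item", "year"))` yields the eight cuboids of the three-dimensional example.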
■ Many dimensions have hierarchies, for example time
  • day < week < month < quarter < year
■ For an n-dimensional data cube, where $L_i$ is the number of levels of dimension i (for time, $L_{time} = 5$; the added 1 accounts for the top level "all"), the total number of cuboids that can be generated is

  $$T = \prod_{i=1}^{n} (L_i + 1)$$

View of Warehouses and Hierarchies
Specification of hierarchies
■ Schema hierarchy
  day < {month < quarter; week} < year
■ Set_grouping hierarchy
  {1..10} < inexpensive

Multidimensional Data
■ Sales volume as a function of product, month, and region
■ Dimensions: Product, Location, Time
■ Hierarchical summarization paths
  • Product:  Industry > Category > Product
  • Location: Region > Country > City > Office
  • Time:     Year > Quarter > Month > Day, with Week > Day as a second path

Drill up and down
■ Drill up: going from a lower level of aggregation to a higher one
■ Drill down: the opposite
■ Difference between drill up and roll up
  • Roll up: creating the desired groupings or aggregations
  • Drill up: accessing the aggregations
■ Example for drill down:
  • Given the total shipment quantity, get the total quantities for each individual supplier

Typical OLAP Operations
■ Roll up (drill-up): summarize data
  • By climbing up a hierarchy or by dimension reduction
■ Drill down (roll down): reverse of roll-up
  • From a higher-level summary to a lower-level summary or detailed data, or introducing new dimensions
■ Slice and dice: project and select
■ Pivot (rotate):
  • Reorient the cube, visualization, 3D to a series of 2D planes
■ Other operations
  • Drill across: involving (across) more than one fact table
  • Drill through: through the bottom level of the cube to its back-end relational tables (using SQL)

[Figure 3.10: Typical OLAP Operations]

Discovery-Driven Data Cubes
[Figure: discovery-driven data cube with the indicators SelExp and InExp]

Measures of Data Cube: Three Categories
(Depending on the aggregate functions)
■ Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function to all the data without partitioning
  • E.g., count(), sum(), min(), max()
■ Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function
  • E.g., avg(), min_N(), standard_deviation()
■ Holistic: if there is no constant bound on the storage size needed to describe a subaggregate
  • E.g., median(), mode(), rank()

■ Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function

Statistics for one Variable
■ Sample size
  • The sample size, denoted by N, is the number of data items in a sample
■ Mean
  • The arithmetic mean is the average value: the sum of all values in the sample divided by the number of values

  $$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$$

■ Maximum, minimum, range
  • The range is the difference between maximum and minimum

Standard Deviation and Variance
■ Square root of the variance, which is the sum of squared distances between each value and the mean, divided by the population size (finite population)

  $$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2}$$

■ Example
  • 1, 2, 15; mean = 6
  • $\frac{(1-6)^2 + (2-6)^2 + (15-6)^2}{3} \approx 40.67$, so $\sigma \approx 6.38$

Sample Standard Deviation and Sample Variance
■ Square root of the sample variance, which is the sum of squared distances between each value and the mean, divided by the sample size minus one

  $$s = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2}$$

■ Example
  • 1, 2, 15; mean = 6
  • $\frac{(1-6)^2 + (2-6)^2 + (15-6)^2}{3-1} = 61$, so $s \approx 7.81$

■ Holistic: if there is no constant bound on the storage size needed to describe a subaggregate
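The difference between algebraic and holistic measures can be illustrated with a short Python sketch. It assumes the sample 1, 2, 15 from the standard deviation example is split into two partitions; the helper names (`summarize`, `merge`, `mean_and_std`) are hypothetical. Mean and population standard deviation are recovered from three distributive sub-aggregates per partition (count, sum, sum of squares), whereas the median of the partition medians does not give the median of the whole sample.

```python
import math
from statistics import median


def summarize(values):
    """Distributive sub-aggregates of one partition: count, sum, sum of squares."""
    return (len(values), sum(values), sum(v * v for v in values))


def merge(a, b):
    """Sub-aggregates of the union of two partitions: component-wise sums."""
    return tuple(x + y for x, y in zip(a, b))


def mean_and_std(summary):
    """Mean and population standard deviation from (count, sum, sum of squares)."""
    n, total, total_sq = summary
    mean = total / n
    variance = total_sq / n - mean * mean  # E[x^2] - (E[x])^2
    return mean, math.sqrt(variance)


part_a, part_b = [1, 2], [15]

# Algebraic: a fixed number of sub-aggregates per partition is enough.
mean, sigma = mean_and_std(merge(summarize(part_a), summarize(part_b)))

# Holistic: combining partition medians does not recover the overall median.
median_of_medians = median([median(part_a), median(part_b)])
true_median = median(part_a + part_b)

print(mean, round(sigma, 2), median_of_medians, true_median)
```

The one-pass variance formula used here is chosen for brevity; it can lose precision when the values are large relative to their spread, so it should be read as an illustration of the bounded-storage idea rather than as a recommended numerical method.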
Statistics for one Variable
■ Median
  • If the values in the sample are sorted into non-decreasing order, the median is the value that splits the distribution in half
  • (1 1 1 2 3 4 5): the median is 2
  • If N is even, the sample has two middle values, and the median can be found by interpolating between them or by selecting one of them arbitrarily

Mode
■ The mode is the most common value in the distribution
  • (1 2 2 3 4 4 4): the mode is 4
■ If the data are real numbers, the mode carries nearly no information
  • Low probability that two or more data items will have exactly the same value
■ Solution: map into discrete numbers, by rounding or by sorting into bins for frequency histograms
■ We often speak of a distribution having two or more modes
  • The distribution has two or more values that are common

Outliers
■ Because they are averages, both the mean and the variance are sensitive to outliers
■ Big effects that can wreck our interpretation of data
■ For example:
  • The presence of a single outlier in a distribution of over 200 values can render some statistical comparisons insignificant

The Problem of Outliers
■ One cannot do much about outliers except find them and, sometimes, remove them
■ Removing requires judgment and depends on one's purpose

Trimmed mean
■ Another robust alternative to the mean is the trimmed mean
■ Lop off a fraction of the upper and lower ends of the distribution, and take the mean of the rest
  • 0, 0, 1, 2, 5, 8, 12, 17, 18, 18, 19, 19, 20, 26, 86, 116
■ Lop off the two smallest and two largest values and take the mean of the rest
  • The trimmed mean is 165/12 = 13.75
  • The arithmetic mean is 367/16 ≈ 22.94

Interquartile Range
■ The interquartile range is found by dividing a sorted distribution into four contiguous parts, each containing the same number of values
■ Each part is called a quartile
■ The difference between the highest value in the third quartile and the lowest value in the second quartile is the interquartile range

Quartile example
■ 1, 1, 2, 3, 3, 5, 5, 5, 5, 6, 6, 100
■ The quartiles are
  • (1 1 2), (3 3 5), (5 5 5), (6 6 100)
■ Interquartile range: 5 - 3 = 2
■ Range: 100 - 1 = 99
■ The interquartile range is robust against outliers
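The trimmed mean and the interquartile range described above can be sketched in a few lines of Python. The function names are hypothetical, and `interquartile_range` follows the definition on these slides (highest value of the third quartile minus lowest value of the second quartile) under the assumption that the number of values is divisible by four; statistical libraries usually use interpolated quartiles and can give different numbers.

```python
def trimmed_mean(values, k):
    """Mean of the values left after dropping the k smallest and k largest."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must be non-negative and leave at least one value")
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def interquartile_range(values):
    """Slide definition: max of the third quartile minus min of the second.

    Assumes len(values) is divisible by 4 so the four parts have equal size.
    """
    if len(values) == 0 or len(values) % 4 != 0:
        raise ValueError("number of values must be a positive multiple of 4")
    ordered = sorted(values)
    q = len(ordered) // 4
    second_quartile = ordered[q:2 * q]
    third_quartile = ordered[2 * q:3 * q]
    return third_quartile[-1] - second_quartile[0]


# Data sets taken from the slides.
data = [0, 0, 1, 2, 5, 8, 12, 17, 18, 18, 19, 19, 20, 26, 86, 116]
quartile_data = [1, 1, 2, 3, 3, 5, 5, 5, 5, 6, 6, 100]

print(trimmed_mean(data, 2))               # two values dropped at each end
print(sum(data) / len(data))               # plain arithmetic mean
print(interquartile_range(quartile_data))  # robust spread
print(max(quartile_data) - min(quartile_data))  # range, dominated by the outlier
```

Dropping the two smallest and two largest values moves the mean from about 22.94 down to 13.75, and the outlier 100 inflates the range to 99 while the interquartile range stays at 2.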