Data Warehouse [ Example ]
J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001, ISBN 1558604898

OLTP
• It is designed for optimal transaction processing.

OLAP
• It is designed to give an overview analysis of what happened.
• It is used to build reports that answer questions such as:
  – Q1: Which supervisor gave the most discount?
  – Q2: In which zip code did product A sell the most?
• To answer such questions, OLAP cubes are created.

Example
• Assume we have recorded the weather conditions during a two-week period, along with a tennis player's decision whether or not to play tennis on each particular day.
• We have values of four independent variables (outlook, temperature, humidity, windy) and one dependent variable (play).
• Consider our data stored in a relational table as follows:

Day  Outlook   Temperature  Humidity  Windy  Play
1    sunny     85           85        false  no
2    sunny     80           90        true   no
3    overcast  83           86        false  yes
4    rainy     70           96        false  yes
5    rainy     68           80        false  yes
6    rainy     65           70        true   no
7    overcast  64           65        true   yes
8    sunny     72           95        false  no
9    sunny     69           70        false  yes
10   rainy     75           80        false  yes
11   sunny     75           70        true   yes
12   overcast  72           90        true   yes
13   overcast  81           75        false  yes
14   rainy     71           91        true   no

• By querying a DBMS containing the above table we may answer questions like:
  – What was the temperature on the sunny days? {85, 80, 72, 69, 75}
  – On which days was the humidity less than 75? {6, 7, 9, 11}
  – On which days was the temperature greater than 70? {1, 2, 3, 8, 10, 11, 12, 13, 14}
  – On which days was the temperature greater than 70 and the humidity less than 75? The intersection of the above two: {11}
• OLAP: Using OLAP we can create a multidimensional model of our data (a data cube).
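The relational queries listed in this example can be checked with a short sketch. It is only an illustration in plain Python, not the DBMS the slides have in mind; the names ROWS, sunny_temps, dry_days, warm_days and warm_and_dry are assumptions introduced here.

```python
# Weather table from the example:
# (day, outlook, temperature, humidity, windy, play).
ROWS = [
    (1, "sunny", 85, 85, False, "no"),
    (2, "sunny", 80, 90, True, "no"),
    (3, "overcast", 83, 86, False, "yes"),
    (4, "rainy", 70, 96, False, "yes"),
    (5, "rainy", 68, 80, False, "yes"),
    (6, "rainy", 65, 70, True, "no"),
    (7, "overcast", 64, 65, True, "yes"),
    (8, "sunny", 72, 95, False, "no"),
    (9, "sunny", 69, 70, False, "yes"),
    (10, "rainy", 75, 80, False, "yes"),
    (11, "sunny", 75, 70, True, "yes"),
    (12, "overcast", 72, 90, True, "yes"),
    (13, "overcast", 81, 75, False, "yes"),
    (14, "rainy", 71, 91, True, "no"),
]

# Temperatures on the sunny days (a selection followed by a projection).
sunny_temps = [temp for _, outlook, temp, _, _, _ in ROWS if outlook == "sunny"]

# Days satisfying a single condition each.
dry_days = {day for day, _, _, hum, _, _ in ROWS if hum < 75}
warm_days = {day for day, _, temp, _, _, _ in ROWS if temp > 70}

# Both conditions at once: the intersection of the two day sets.
warm_and_dry = dry_days & warm_days

print(sunny_temps)
print(sorted(dry_days))
print(sorted(warm_days))
print(sorted(warm_and_dry))
```

Each query reads individual rows; none of them aggregates over groups of rows, which is the part a data cube adds.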
For example, using the dimensions time, outlook and play, we can create the following data cube:

Yes/No  sunny  rainy  overcast
Week 1  0/2    2/1    2/0
Week 2  2/1    1/1    2/0

Here time represents the days grouped in weeks (week 1: days 1 to 7; week 2: days 8 to 14) along the vertical axis. The outlook is shown along the horizontal axis, and the third dimension, play, is shown in each individual cell as a pair of values corresponding to the two values along this dimension: yes / no. The upper-left corner of the cube corresponds to the total over all weeks and all outlook values (9 yes / 5 no for this data); in the table above it carries the yes/no cell label.

• We apply "drill-down" to our data cube over the time dimension.
• This assumes the existence of a concept hierarchy for this attribute. We can show it as a tree as follows:

Time
  week 1: day 1, day 2, day 3, day 4, day 5, day 6, day 7
  week 2: day 1, day 2, day 3, day 4, day 5, day 6, day 7

• The drill-down operation is based on climbing down the concept hierarchy, so that we get the following data cube:

Yes/No  sunny  rainy  overcast
1       0/1    0/0    0/0
2       0/1    0/0    0/0
3       0/0    0/0    1/0
4       0/0    1/0    0/0
5       0/0    1/0    0/0
6       0/0    0/1    0/0
7       0/0    0/0    1/0
8       0/1    0/0    0/0
9       1/0    0/0    0/0
10      0/0    1/0    0/0
11      1/0    0/0    0/0
12      0/0    0/0    1/0
13      0/0    0/0    1/0
14      0/0    0/1    0/0

Multidimensional data model
Using the same example, we change some values:
• play has just two values, yes and no, so we can replace them with 1 and 0. This allows us to add up values and thus get the total number of days when tennis was played and, at the same time, the number of days tennis was not played.
• Rename the day attribute to time, which is more general and will allow us to use other time units (e.g. weeks).
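Recoding play as 1/0 is what lets cube cells be filled by plain addition. As a minimal sketch under that recoding, the week-by-outlook cube and its drill-down to days could be built as follows; FACTS, week_of and build_cube are hypothetical names chosen for this illustration, not part of any OLAP product.

```python
from collections import defaultdict

# (time, outlook, play) with play recoded as 1 = yes, 0 = no.
FACTS = [
    (1, "sunny", 0), (2, "sunny", 0), (3, "overcast", 1), (4, "rainy", 1),
    (5, "rainy", 1), (6, "rainy", 0), (7, "overcast", 1), (8, "sunny", 0),
    (9, "sunny", 1), (10, "rainy", 1), (11, "sunny", 1), (12, "overcast", 1),
    (13, "overcast", 1), (14, "rainy", 0),
]


def week_of(day):
    """Concept hierarchy for time: days 1-7 -> week 1, days 8-14 -> week 2."""
    return (day - 1) // 7 + 1


def build_cube(time_level):
    """Return {(time value, outlook): [yes count, no count]}.

    time_level maps a day to the chosen level of the time hierarchy.
    Only non-empty cells are stored; a missing cell stands for 0/0.
    """
    cube = defaultdict(lambda: [0, 0])
    for day, outlook, play in FACTS:
        cell = cube[(time_level(day), outlook)]
        cell[0] += play        # days on which tennis was played
        cell[1] += 1 - play    # days on which it was not played
    return dict(cube)


by_week = build_cube(week_of)          # week-level cube
by_day = build_cube(lambda day: day)   # drill-down: one row per day

total_played = sum(play for _, _, play in FACTS)
print(by_week[(1, "sunny")], by_week[(2, "rainy")], total_played)
```

Passing a different time_level function is all that distinguishes the week-level cube from its drill-down, which mirrors moving along the concept hierarchy.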
Thus we get the following relational table:

time  outlook   temperature  humidity  windy  play
1     sunny     85           85        false  0
2     sunny     80           90        true   0
3     overcast  83           86        false  1
4     rainy     70           96        false  1
5     rainy     68           80        false  1
6     rainy     65           70        true   0
7     overcast  64           65        true   1
8     sunny     72           95        false  0
9     sunny     69           70        false  1
10    rainy     75           80        false  1
11    sunny     75           70        true   1
12    overcast  72           90        true   1
13    overcast  81           75        false  1
14    rainy     71           91        true   0

Concept hierarchies
1- For the attributes day, temperature and humidity we can group values into subsets and name these subsets as follows:

Day:
all
  week 1: 1, 2, 3, 4, 5, 6, 7
  week 2: 8, 9, 10, 11, 12, 13, 14

Temperature:
all
  hot:  80, 81, 83, 85
  mild: 70, 71, 72, 75
  cool: 64, 65, 68, 69

Humidity:
all
  high:   85, 86, 90, 91, 95, 96
  normal: 65, 70, 75, 80

• We may also extend the sets of numbers or replace them with intervals, which makes the hierarchy complete (covering all possible values). For example, humidity may look like this:

all
  high:   [85, 96]
  normal: [65, 84]

2- For the nominal (non-numeric) attributes outlook and windy we define one-level hierarchies, as their values cannot be ordered or grouped.

outlook:
all
  sunny, rainy, overcast
windy:
all
  true, false

Data cube
• The number of dimensions determines the total number of data cubes that can be created: with N attributes there are 2^N of them, one for each subset of the attributes.

To create a data cube we have to:
1- Select dimensions, that is, select a subset of attributes. For example, select time and temperature. Thus we will create a two-dimensional data cube.
2- Select levels in the concept hierarchies. For example, let us select weeks for time and degrees for temperature.
3- Select a measure to populate the cube. This is the attribute whose values will be aggregated across the dimensions (so it has to be numeric). For example, let us select play.

• By placing the time values in the rows and the temperature values in the columns we get the following cube:

        64  65  68  69  70  71  72  75  80  81  83  85
week 1  1   0   1   0   1   0   0   0   0   0   1   0
week 2  0   0   0   1   0   0   1   2   0   1   0   0

The numbers in the internal cells are obtained by adding up the values of the play attribute over the rows where the time and temperature attributes equal the values in the corresponding row and column.
• For example, the value 2 (row 2, column 8) means that tennis was played on two days during week 2 when the temperature was 75.

OLAP operations
Rollup:
• Assume we want to change the level that we selected for the temperature hierarchy to the intermediate level (hot, mild, cool).
• Roll-up produces the following cube:

        cool  mild  hot
week 1  2     1     1
week 2  1     3     1

Drill-down:
• The drill-down of the previous data cube over the time dimension produces the following:
        cool  mild  hot
day 1   0     0     0
day 2   0     0     0
day 3   0     0     1
day 4   0     1     0
day 5   1     0     0
day 6   0     0     0
day 7   1     0     0
day 8   0     0     0
day 9   1     0     0
day 10  0     1     0
day 11  0     1     0
day 12  0     1     0
day 13  0     0     1
day 14  0     0     0

Lattice of cubes, slice and dice operations
Lattice: there are five dimensions: Time, Outlook, Temperature, Humidity, Windy.

0-D (apex) cuboid: {all}
1-D cuboids: {Time}, {Outlook}, {Temperature}, {Humidity}, {Windy}
2-D cuboids: {Time, Outlook}, {Time, Temperature}, {Time, Humidity}, {Time, Windy}, {Outlook, Temperature}, {Outlook, Humidity}, {Outlook, Windy}, {Temperature, Humidity}, {Temperature, Windy}, {Humidity, Windy}
3-D cuboids: {Time, Outlook, Temperature}, {Time, Outlook, Humidity}, {Time, Outlook, Windy}, {Time, Temperature, Humidity}, {Time, Temperature, Windy}, {Time, Humidity, Windy}, {Outlook, Temperature, Humidity}, {Outlook, Temperature, Windy}, {Outlook, Humidity, Windy}, {Temperature, Humidity, Windy}
4-D cuboids: {Time, Outlook, Temperature, Humidity}, {Time, Outlook, Temperature, Windy}, {Time, Outlook, Humidity, Windy}, {Time, Temperature, Humidity, Windy}, {Outlook, Temperature, Humidity, Windy}
5-D cuboid: {Time, Outlook, Temperature, Humidity, Windy}

• There are two other OLAP operations that are related to the selection of a cube: slice and dice.
Slice: performs a selection on one dimension of the given cube, thus resulting in a subcube. For example, if we make the selection (temperature = cool) we reduce the dimensions of the cube from two to one, resulting in just a single column from the previous table.
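The lattice of cuboids over the five dimensions can also be generated mechanically, since each cuboid is just a subset of the dimensions. The sketch below uses itertools.combinations for this; the names DIMENSIONS and lattice are assumptions made for the illustration, and concept-hierarchy levels are ignored.

```python
from itertools import combinations

DIMENSIONS = ["Time", "Outlook", "Temperature", "Humidity", "Windy"]

# Group every subset of the dimensions by its size k (the k-D cuboids).
# For k = 0 the only subset is the empty set, i.e. the apex cuboid {all}.
lattice = {
    k: [set(subset) for subset in combinations(DIMENSIONS, k)]
    for k in range(len(DIMENSIONS) + 1)
}

for k, cuboids in lattice.items():
    print(f"{k}-D: {len(cuboids)} cuboid(s)")

# The lattice holds 2^N cuboids in total for N dimensions.
total = sum(len(cuboids) for cuboids in lattice.values())
print(total == 2 ** len(DIMENSIONS))
```

The per-level counts (1, 5, 10, 10, 5, 1) match the lists of cuboids on the lattice slides and add up to 2^5 = 32.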
So, the result will be as follows:

        cool
day 1   0
day 2   0
day 3   0
day 4   0
day 5   1
day 6   0
day 7   1
day 8   0
day 9   1
day 10  0
day 11  0
day 12  0
day 13  0
day 14  0

• The dice operation works similarly and performs a selection on two or more dimensions. For example, applying the selection (time = day 3 OR time = day 4) AND (temperature = cool OR temperature = hot) to the original cube (before the slice) we get the following subcube (still two-dimensional):

        cool  hot
day 3   0     1
day 4   0     0
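The operations from the last slides (building the time-by-temperature cube, roll-up, drill-down, slice and dice) can be tied together in one small sketch. It is a plain-Python illustration under stated assumptions, not how an OLAP engine works internally: the helper names are hypothetical, and the temperature hierarchy is written as intervals (cool below 70, mild 70 to 79, hot 80 and above), which extends the value sets on the slides.

```python
from collections import defaultdict

# (time, temperature, play) with play recoded as 1 = yes, 0 = no.
FACTS = [
    (1, 85, 0), (2, 80, 0), (3, 83, 1), (4, 70, 1), (5, 68, 1),
    (6, 65, 0), (7, 64, 1), (8, 72, 0), (9, 69, 1), (10, 75, 1),
    (11, 75, 1), (12, 72, 1), (13, 81, 1), (14, 71, 0),
]


def week(day):
    """Time hierarchy: days 1-7 -> week 1, days 8-14 -> week 2."""
    return f"week {(day - 1) // 7 + 1}"


def day_label(day):
    return f"day {day}"


def degrees(temp):
    """Lowest level of the temperature hierarchy: the raw value."""
    return temp


def band(temp):
    """Intermediate temperature level; interval bounds are an assumption."""
    if temp >= 80:
        return "hot"
    return "mild" if temp >= 70 else "cool"


def cube(time_level, temp_level):
    """Sum the play measure over the chosen hierarchy levels.

    Cells are stored sparsely; a missing cell stands for 0.
    """
    cells = defaultdict(int)
    for day, temp, play in FACTS:
        cells[(time_level(day), temp_level(temp))] += play
    return dict(cells)


base = cube(week, degrees)        # weeks x degrees
rolled_up = cube(week, band)      # roll-up on the temperature dimension
drilled = cube(day_label, band)   # drill-down on the time dimension

# Slice: fix one dimension (temperature = cool), leaving a 1-D result.
cool_slice = {t: v for (t, b), v in drilled.items() if b == "cool"}

# Dice: select values on both dimensions; the result stays 2-D.
dice = {
    (t, b): drilled.get((t, b), 0)
    for t in ("day 3", "day 4")
    for b in ("cool", "hot")
}

print(base[("week 2", 75)], rolled_up[("week 2", "mild")], dice[("day 3", "hot")])
```

Roll-up and drill-down differ only in which level function is passed to cube, while slice and dice are selections applied to an already built cube.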