Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Design of DW Remember item, city, year, and sales_in_Euro () (city) (city, item) (item) (city, year) (year) (item, year) (city, item, year) 1 2 leveles of hierarchies for each dimension Item(part,color) City(downtown,suburb) Year(good_year,bad_year) i1,i2 c1,c2 y1,y2 For a 3-dimensional data cube, where Li is the number of all levels (L1,2,3=2), the total number of cuboids that can be generated is 3 " (2 + 1) = 3 3 = 27 i=1 ! {(), (c1),(c2),(i1),(i2),(y1),(y2), (c1,i1),(c1,i2),(c2,i1),(c2,i2), (c1,y1),(c1,y2),(c2,y1),(c2,y2), (i1,y1),(i1,y2),(i2,y1),(i2,y2), (c1,i1,y1),(c1,i1,y2),(c1,i2,y1),(c1,i2,y2), (c2,i1,y1),(c2,i1,y2),(c2,i2,y1),(c2,i2,y2)} 2 DMQL Data Mining Query Language Relational database schema Translation into SQL query Example, star schema, and relational data base MDX Multifeature cubes Design of a Data Warehouse Lifecycle models Data Warehouse models DMQL DMQL: A Data Mining Query Language for Relational Databases (Han et al, Simon Fraser University) Data warehouses and data marts can be defined by cube definition and dimension definition 3 DMQL Create and manipulate data mining models through a SQL-based interface (“Commanddriven” data mining) Abstract away the data mining particulars Data mining should be performed on data in the database (should not need to export to a special-purpose environment) Approaches differ on what kinds of models should be created, and what operations we should be able to perform Cube Definition Syntax in DMQL Cube Definition (Fact Table) define cube <cube_name> [<dimension_list>]: <measure_list> Dimension Definition (Dimension Table) define dimension <dimension_name> as (<attribute_or_subdimension_list>) 4 time Example of Star Schema item time_key day day_of_the_week month quarter year Sales Fact Table time_key item_key branch_key branch branch_key branch_name branch_type location_key units_sold dollars_sold avg_sales item_key item_name brand type supplier_type location location_key street city state_or_province country Measures Defining Star Schema in DMQL define cube sales_star [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), units_sold = count(*) define dimension time as (time_key, day, day_of_week, month, quarter, year) define dimension item as (item_key, item_name, brand, type, supplier_type) define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city, province_or_state, country) 5 The star schema contains two measures dollars_sold and units_sold How are the DMQL commands interpreted to generate a specified data cube? Relational database schema time(time_key,day_of_week,month,quater,year) item(item_key,item_name,brand,type,supplier_type) branch(branch_key,branch_name,branch_type) location(location_key,street,city,province_or_state,country) sales(time_key,item_key,branch_key,location_key,number_of _units_sold,price) 6 The DMQL specification is translated into the following SQL query which generates the base cuboid SELECT s.time_key,s.item_key,s.branch_key,s.location_key, SUM(s.number_of_units_sold*s.price), SUM(s.number_of_units_sold) FROM time t, item i, branch b, location l, sales s, WHERE s.time_key=t.time_key AND s.item_key=i.item_key AND s.branch_key=b.branch_key AND s.location_key=l.location_key GROUP BY (s.time_key,s.item_key,s.branch_key,s.location_key); The granularity (resolution) of each dimension is at the join key level A join key is the key that links a fact table and the dimension table The fact table associated with a base cuboid is sometimes referred as base fact table 7 By changing GROUP BY we may generate other cuboids The apex cuboid representing the total sum of dollars_sold and total count of units_sold is generated by GROUP BY (); Other cuboids may be generated by applying selection and projection operations on the base cuboid To generate a data cube we may as well use GROUP BY CUBE (s.time_key,s.item_key,s.branch_key,s.location_key); Defining Snowflake Schema in DMQL define cube sales_snowflake [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*) define dimension time as (time_key, day, day_of_week, month, quarter, year) define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type)) define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city(city_key, province_or_state, country)) 8 Defining Fact Constellation in DMQL define cube sales [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*) define dimension time as (time_key, day, day_of_week, month, quarter, year) define dimension item as (item_key, item_name, brand, type, supplier_type) define dimension branch as (branch_key, branch_name, branch_type) define dimension location as (location_key, street, city, province_or_state, country) define cube shipping [time, item, shipper, from_location, to_location]: dollar_cost = sum(cost_in_dollars), unit_shipped = count(*) define dimension time as time in cube sales define dimension item as item in cube sales define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type) define dimension from_location as location in cube sales define dimension to_location as location in cube sales Example (Exercício 7) Suponha um datawarehouse que contém as seguintes quatro dimensões: Data, Espectador, Localizaçaõ e Jogo e o facto “preço” que consiste no valor, em euros Valor pago por um espectador quando assiste a um determinado jogo numa data Espectadores podem ser estudantes, adultos ou séniores, e que cada uma destas categorias tem o seu preço de bilhete Dimensão data contenha o dia, mês e ano; que a dimensão Localização contenha o nome do estádio, e que a dimensão Jogo contenha o nome das duas equipas defrontadas 9 Diagrama em estrela para este DW jogo(jogoId, equipa1, equipa2) data (dataId, dia, mes, ano) espectador (espId, nome, categoria) localizacao(localId, estadio) factos(jogoId, dataId, espId, localId, preco) Escreva em SQL a interrogaçã o que devolve o preço total pago por espectadores estudantes para assistir ao jogo que se realizou no Estádio da Luz no dia 1 de Março de 2005 10 The corresponding SQL querry SELECT SUM(preco) FROM Factos F, Data D, Localizacao L, Espectador E WHERE F.dataId = D.dataId AND F.espId = E.espId AND F.localId = L.localId AND D.dia = 1 AND D.mes = 3 AND D.ano = 2005 AND L.estadio = ‘Estadio da Luz’ AND E.categoria= ‘Estudante’; Esquema relacional que modele a mesma informação e uma interrogaçã o SQL que devolva a mesma informação: jogo(jogoId, equipa1, equipa2, localId, data) localizacao(localId, estadio) espectador(espId, nome, categoriaId) categoria (categoriaId, nomeC, preco) jogoEspectador(jogoId, espId) 11 The corresponding SQL querry SELECT SUM (C.preco) FROM Categoria C, Espectador E, JogoEspectador JE, Jogo, J, Localizacao L WHERE C.nomeC = ‘Estudante’ AND J.data = 1/3/2005 AND L.estadio = ‘Estadio da Luz’ AND C.categoriaId = E.categoriaId AND E.espId = JE.espId AND JE.jogoId = J.jogoId AND J.localId = L.localId; Difference between both approaches Com o modelo em estrela, existe três joins, da tabela Factos com cada uma das 3 dimensões relevantes Com o esquema relacional, existem 4 joins In the star schema has less joins, corresponding to the relevant dimensions In multidimensional model the base cuboid is already precomputed 12 MDX Multidimensional Expressions (MDX) as a Language MDX emerged circa 1998, when it first began to appear in commercial applications. MDX was created to query OLAP databases, and has become widely adopted within the realm of analytical applications Provide the total sales and total cost amounts for the years 1997 and 1998 individually for all USA-based stores (including all products) We are asked, moreover, to provide the information in a twodimensional grid, with the sales and cost amounts (called measures in our data warehouse) in the rows and the years (1997 and 1998) in the columns SELECT {[Time].[1997],[Time].[1998]} ON COLUMNS, {[Measures].[Warehouse Sales],[Measures].[Warehouse Cost]} ON ROWS FROM Warehouse WHERE ([Store].[All Stores].[USA]) 13 The cube that is targeted by the query (the query scope) appears in the FROM clause of the query The FROM clause in MDX works much as it does in SQL, where it stipulates the tables used as sources for the query The query syntax also uses other keywords that are common in SQL, such as SELECT and WHERE. Important difference is that the output of an MDX query, which uses a cuboid as a data source, is another cuboid, whereas the output of an SQL query (which uses a columnar table as a source) is typically columnar A query has one or more dimensions. The query above has two. (The first three dimensions (=axes) that are found in MDX queries are known as rows, columns and pages) SELECT{[Time].[1997].[Q1],[Time].[1997].[Q2]}ON COLUMNS,{[Warehouse].[All Warehouses].[USA]} ON ROWSFROM WarehouseWHERE ([Measures].[Warehouse Sales]) Curled brackets "{}" are used in MDX to represent a set of members of a dimension or group of dimensions 14 Complex Aggregation at Multiple Granularities Multifeature cubes compute complex queries involving multiple aggregates at multiple granularities (resolution) Example item is purchased in a sales region on a business day (year,month,day) The shelf life in months of a given item is stored in shelf The item price and sales is stored in price and sales Find the total sales in 2000, broken down by item, region, and month with subtotal for each dimension A data cube is constructed {(item,region,month),(item,region),(item,month), (month,region),(item),(month),(region,()} Simple data cube, since it does not involve any dependent aggregates What are dependent aggregates? 15 Example Grouping by all subsets (cuboids) {item,region,month} (=data cube) Find maximum price for each group (cuboid) in 2000 Among the maximum price tuples find the minimum and maximum shelf lives Multifeature cube graph for the example query R2 cube {=MIN(R1.shelf)} R3 cube {=MAX(R1.shelf)} R1 cube {=MAX(price} R0 cube 16 The multifeature graph illustrates the aggregate dependencies R0,R1,R2,R3 are the grouping variables The grouping variables R2,R3 are dependent on R1 In extended SQL R2 in R1 R3 in R1 Query in extended SQL SELECT item,region,month,MAX(price),MIN(R1.shelf),MAX(R1.shelf) FROM Purchases WHERE year=2000 CUBE BY item,region,month:R1,R2,R3 SUCH THAT R1.price=MAX(price) AND R2 IN R1 and R2.shelf=MIN(R1.shelf) AND R3 IN R1 and R3.shelf=MAX(R1.shelf); 17 Design of Data Warehouse: A Business Analysis Framework Four views regarding the design of a data warehouse Top-down view • allows selection of the relevant information necessary for the data warehouse Data source view • exposes the information being captured, stored, and managed by operational systems Data warehouse view • consists of fact tables and dimension tables Business query view • sees the perspectives of data in the warehouse from the view of enduser Data Warehouse Design Process Top-down, bottom-up approaches or a combination of both Top-down: Starts with overall design and planning (mature) Bottom-up: Starts with experiments and prototypes (rapid) From software engineering point of view Waterfall: structured and systematic analysis at each step before proceeding to the next Spiral: rapid generation of increasingly functional systems, short turn around time, quick turn around 18 Lifecycle planning Translation from user requirements into software requirements Transformation of the software requirements into software design Implementation of the design into programming code The sequence of this steps is defined by the lifecyle model A software lifecycle model must be defined for every project! The lifecycle model you choose has as much influence over your project’s success as any other planning decision you make! 19 Pure Waterfall model Software Concept Requirements Analysis Architecural Design Detailed Design Coding and Debugging System Testing Pure Waterfal model Document driven model which means that the main work products that are carried from phase to phase are documents The disadvantage of the pure waterfall model arise from the difficulty of fully specifying requirements at the beginning of the project, before any design works has been done and before any code has been written 20 Salmon model Software Concept Requirements Analysis Architecural Design Detailed Design Coding and Debugging System Testing Code-and-Fix I System Specification (maybe) Code-And-Fix Release 21 Code-and-Fix II Advantages No overhead, you don’t spend any time on planning, documentation, quality assurance, enforcement, or other activities than pure coding Since you jump right into coding, you can show signs of progress immediately It requires little expertise Code-and-Fix III Maintainability and reliability decrease with the complexity and the time For any kind of project other than a tiny project, this model is dangerous. It might have no overhead, but it also provides no means of assessing progress, you just code until you’re done 22 Spiral I Determine objectives, Alternatives, and constraints Review Risk analysis Start I II IV Evaluate III Prototypes Simulation models Reqirements Development Plan Realse code test Spiral II The basic idea behind the diagram is that you start on a small scale in the middle of the spine, explore the risks, make a plan to handle the risks, and then commit to an approach of the next iteration. Each iteration moves your project to a larger scale 23 Spiral III The spiral model is a risk-oriented lifecycle model that breaks a software project up into mini projects. Each mini project addresses one or more major risks until all the major risks have been addressed Spiral IV Determine objectives, and constraints Identify and resolve risks Evaluate alternatives Develop the deliverables for that iteration, and verify that they are correct Plan next iteration Commit to an approach for the next iteration One of the most important advantages of the spiral model is that as costs increase, risk decrease. The more time and money you spend, the less risk your’re taking 24 Data Warehouse: A Multi-Tiered Architecture Monitor & Integrator Metadata Other sources Operational DBs Extract Transform Load Refresh Data Warehouse OLAP Server Serve Analysis Query Reports Data mining Data Marts Data Sources Data Storage OLAP Engine Front-End Tools Three Data Warehouse Models Enterprise warehouse collects all of the information about subjects spanning the entire organization Data Mart a subset of corporate-wide data that is of value to a specific groups of users. Its scope is confined to specific, selected groups, such as marketing data mart • Independent vs. dependent (directly from warehouse) data mart Virtual warehouse A set of views over operational databases Only some of the possible summary views may be materialized 25 Data Warehouse Usage Three kinds of data warehouse applications Information processing • supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs Analytical processing • multidimensional analysis of data warehouse data • supports basic OLAP operations, slice-dice, drilling, pivoting Data mining • knowledge discovery from hidden patterns • supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools Data Warehouse Back-End Tools and Utilities Data extraction Data cleaning Data transformation detect errors in the data and rectify them when possible convert data from legacy or host format to warehouse format Load get data from multiple, heterogeneous, and external sources sort, summarize, consolidate, compute views, check integrity, and build indicies and partitions Refresh propagate the updates from the data sources to the warehouse 26 DMQL Data Mining Query Language Relational database schema Translation into SQL query Example, star schema, and relational data base MDX Multifeature cubes Design of a Data Warehouse Lifecycle models Data Warehouse models Data Cleaning (De)normalization(?) Missing Values ... 27