Analytical data bases
Data cubes
Database lectures for mathematics students
Zbigniew Jurkiewicz, Institute of Informatics UW
May 22, 2016

Decision support systems
From the perspective of time span, all decisions in an organization can be divided into three categories:
- operational decisions, within the scope of days or weeks;
- tactical decisions, whose effects range from a few months to one year;
- strategic decisions, which affect the development of the organization for the next few years.
It has been observed that, moving from operational decisions towards strategic ones, the procedures used become less and less algorithmic and formalized.

Decision support systems
Initially, computer systems in business were used mostly for operational data processing, in applications such as sales order management, invoicing, or warehouse inventory.
Gradually computers came to be used for less routine activities as well, in what are called Decision Support Systems (DSS).
They are also known under other popular names:
- BIS/BIT (Business Intelligence System/Technology)
- EIS (Executive Information System).
In addition to “mechanical” data processing, they provide various mechanisms for deducing new information from the facts contained in the database.
This led to a division of “database” applications into operational (transactional) and analytical.

Requirements for decision support systems
- Information should usually be presented in a summarized form.
- No standard access path: the methods of selecting and formatting the information to be presented are highly varied and dynamic.
- Associating the selected information with other computational resources (spreadsheets, specialized statistical packages).

Analytical data processing
- Commonly known as On-Line Analytical Processing, usually abbreviated to OLAP.
- Typical applications: trend analysis, looking for patterns of behavior, looking for anomalies.
- Used interactively, so efficiency is very important, especially time efficiency.
- If a user observes that some queries (e.g. those based on 5 or more joins) are executed very slowly, she will try to avoid them.
- It is assumed that the answers to 90% of queries should be available within 10 seconds.

Analytical data bases
- Also called On-Line Analytical Processing (OLAP) databases.
- Growing in importance.
- From personal computers to large client-server configurations.
- Many buzzwords: roll-up and drill-down, drill-through, MOLAP, pivoting.

Main issues
- What is an analytical database?
- Models and operations
- Implementing an analytical database
- Development trends

Matrix reports
Data analysis was first supported by matrix reports. Matrix reports look like spreadsheets. They are often concerned with finances or management.
A sales system should, for example, contain a report about customers and their buying patterns, divided by country regions. However, instead of analysing buying patterns for each product, we divide products into categories. The report would then have product categories as columns, country regions as rows, and each report cell would show the number of items sold in this category in this region.
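The matrix report described above can be sketched in a few lines of Python. This is only an illustration: the category names, region names and quantities are invented, and the helper names `matrix_report` and `format_report` are hypothetical, not part of any reporting tool.

```python
from collections import defaultdict

# Hypothetical sales records: (product category, country region, items sold).
# All names and numbers are invented for illustration.
sales = [
    ("beverages", "north", 12),
    ("beverages", "south", 7),
    ("snacks", "north", 5),
    ("snacks", "south", 9),
    ("beverages", "north", 3),
]


def matrix_report(records):
    """Return {region: {category: total items sold}}."""
    report = defaultdict(lambda: defaultdict(int))
    for category, region, items in records:
        report[region][category] += items
    return report


def format_report(report):
    """Render regions as rows and product categories as columns."""
    categories = sorted({c for row in report.values() for c in row})
    lines = ["region".ljust(10) + "".join(c.rjust(12) for c in categories)]
    for region in sorted(report):
        cells = "".join(
            str(report[region].get(c, 0)).rjust(12) for c in categories
        )
        lines.append(region.ljust(10) + cells)
    return "\n".join(lines)


report = matrix_report(sales)
print(format_report(report))
```

Each cell is a sum over all records that share the same region and category, which is exactly the aggregation a matrix report performs.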

Data Mart
- Small data warehouse, sometimes called a thematic database.
- Covers only some areas (themes) of the enterprise, e.g. marketing: customers, products, sales.
- Model adapted to the needs of a department.
- Usually the information is initially preaggregated.
- Elimination of unnecessary details.
- Some critical level of detail is selected.

Tools for querying and analysis
- Query builders
- Report generators
  - comparisons: growth, decrease
  - trends, graphs
- Spreadsheets
- WWW interface
- Data mining

Other operations
- Functions over time, e.g. averages over different periods
- Computed attributes, e.g. profit = sales * rate
- Textual queries, e.g.
  - find all documents containing words A and B
  - order documents by frequency of occurrence of words X, Y and Z

Data models and operators
- Data models
  - relation
  - star and snowflake
  - cube: extension of the spreadsheet idea (multidimensional tables, dimensions indexed by database values)
- Operators
  - slice & dice
  - roll-up, drill-down
  - pivoting
  - other

Multidimensional data model
Multidimensional databases are popular mostly because of their analytical data model, which has the form of a multidimensional cube containing:
- facts (also called measures), e.g. the number of cars sold;
- dimensions, e.g. months, regions of sale.

Dimensions
Dimensions usually form hierarchies, e.g. for the time dimension the hierarchy will be year-quarter-month-day.
Hierarchies enable an interactive change of the detail level (granularity) of the information presented.
In more complex models the hierarchies can branch, e.g. division into weeks is incompatible with division into months.

Time
- The time dimension needs special treatment.
- It is hidden: there is no separate table for time.
- Time is unique as a dimension because it is sequential in character. We might ask to see the sales for May, or the sales for the first three months of 2007, but we would rarely ask to see the sales for the first five goods (even assuming they are ordered by name).
- The method of aggregation over time depends on the meaning of the measure.
  If a company sold 10 computers in January, 15 in February, and 10 in March, a typical query would ask for the total number (i.e. the sum) sold in the first quarter.
  On the other hand, if a company employed 10 people in January, 7 in February, and 10 again in March, we would usually ask about the average count for the quarter.

Database
- The data is usually taken from a data warehouse (real or virtual).
- Storing the information for all facts at all levels of detail directly in the database could be very costly in terms of space, so:
  - only the data for the most often used levels of the hierarchies is stored;
  - other data is computed from the stored data on the fly when needed.
- When aggregating measures it is important to take into account the various rules of aggregation, e.g.
  - a sales amount is usually summed;
  - a temperature or a price will rather be averaged.
- As a rule, the analytical database stores only aggregated data.
- To see the detail data (drill-through), it must be fetched from the data warehouse or the operational database. Because this takes a lot of time, such a need should not occur too often.
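The point that the aggregation rule depends on the meaning of the measure can be made concrete with the two quarterly examples from the Time slide. The sketch below is illustrative only; the rule table `AGGREGATION_RULE` and the function `roll_up_quarter` are hypothetical names introduced here, not part of any OLAP product.

```python
from statistics import mean

# Monthly values taken from the examples in the text.
computers_sold = {"January": 10, "February": 15, "March": 10}
people_employed = {"January": 10, "February": 7, "March": 10}

# Assumed rule table: which aggregate is meaningful for which measure.
AGGREGATION_RULE = {
    "computers_sold": sum,    # quantities sold add up over time
    "people_employed": mean,  # a head count is averaged over the period
}


def roll_up_quarter(measure_name, monthly_values):
    """Aggregate monthly values to one quarterly value using the measure's rule."""
    aggregate = AGGREGATION_RULE[measure_name]
    return aggregate(list(monthly_values.values()))


q1_sold = roll_up_quarter("computers_sold", computers_sold)
q1_employed = roll_up_quarter("people_employed", people_employed)
print(q1_sold, q1_employed)  # 35 computers sold, 9 people employed on average
```

Summing the head counts instead would give 27, a number with no sensible interpretation, which is why the rule has to be stored with the measure.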

Operations on data
- Cutting and projecting on the cross-section surface (slice and dice)
- Change of detail level (drill-down and roll-up)
- Turning (pivot): changes the visible dimensions on the “image”.

Approaches to building the OLAP database
1. ROLAP = “Relational OLAP”: we adapt a relational DBMS to a star or snowflake schema.
2. MOLAP = “Multidimensional OLAP”: we use a specialized DBMS based on the “datacube” model.

Star schema
A star schema is a typical method of data organization in a relational database for OLAP. It is composed of:
- Fact table: a large set of facts, such as information about sales amounts.
- Dimension tables: smaller, static information about the objects that the facts deal with.
Generalization: the snowflake model.
- Hierarchies of tables for particular dimensions: normalization of the dimension tables.

Example star schema
We want the OLAP database to hold information about beer sales: pub, beer name, the drinker who bought it, day, hour, and price.
We take the following relation as our fact table:
    Sales(pub, beer, drinker, day, hour, price)

Example, cont.
Dimension tables contain information about pubs, beers and drinkers:
    Pubs(pub, address, licence)
    Beers(beer, prod)
    Drinkers(drinker, address, phone)

Dimension attributes and dependent attributes
Two kinds of attributes exist in the fact table:
- Dimension attributes: the keys of the dimension tables.
- Dependent attributes: the values associated with particular combinations of dimension attribute values.

Example: dependent attribute
price is a dependent attribute in the example relation Sales.
Its value is determined by the combination of dimension attributes: pub, beer, drinker and time (the combination of day and hour).

ROLAP optimization techniques
- Bitmap indexes: for each value of the index key in a dimension table (e.g. for each beer in the Beers table) we create a bit vector showing which tuples in the fact table contain this value.
- Materialized views: the OLAP database (or even the data warehouse) stores precomputed answers for some useful queries (perspectives).

Typical OLAP query
An OLAP query often starts with a “star join”: the natural join of the fact table with all or most of the dimension tables.
Example:
    SELECT *
    FROM Sales, Pubs, Beers, Drinkers
    WHERE Sales.pub = Pubs.pub
      AND Sales.beer = Beers.beer
      AND Sales.drinker = Drinkers.drinker;

Typical OLAP query
- Starts with a star join.
- Selects interesting tuples using data from the dimension tables.
- Groups on one or more dimensions.
- Aggregates some attributes of the result.

Example OLAP query
For each pub in Poznań, show the total sales of each beer produced by the Anheuser-Busch brewery.
- Filter: address = “Poznań” and prod = “Anheuser-Busch”.
- Grouping: by pub and beer.
- Aggregation: sum over price.

Example: SQL
    SELECT pub, beer, SUM(price)
    FROM Sales NATURAL JOIN Pubs NATURAL JOIN Beers
    WHERE address = 'Poznań' AND prod = 'Anheuser-Busch'
    GROUP BY pub, beer;

Materialized views
- Direct execution of our query on the table Sales and the dimension tables may take much more time than is acceptable.
- If we created a materialized view containing the appropriate information, we could give the answer much faster.

Example: materialized view
Which view could help us? Basic requirements:
1. Must join at least Sales, Pubs and Beers.
2. Must group at least by pub and beer.
3. Does not need to select pubs in Poznań or beers from Anheuser-Busch (the query can apply that filter later).
4. Does not need to omit the columns address and prod (keeping them lets the query filter on them).

Example
Here is a useful view:
    CREATE VIEW PuBeS(pub, address, beer, prod, sale) AS
    SELECT pub, address, beer, prod, SUM(price) AS sale
    FROM Sales NATURAL JOIN Pubs NATURAL JOIN Beers
    GROUP BY pub, address, beer, prod;
Because pub → address and beer → prod, part of the grouping is superfluous, but it is necessary because address and prod occur in the SELECT clause.
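The effect of the view can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch under stated assumptions: the rows are invented, the address column holds only the town name (as the filter in the example query implies), and since SQLite has no materialized views, a table created once with `CREATE TABLE ... AS SELECT` stands in for the materialized PuBeS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Pubs(pub TEXT PRIMARY KEY, address TEXT, licence TEXT);
CREATE TABLE Beers(beer TEXT PRIMARY KEY, prod TEXT);
CREATE TABLE Sales(pub TEXT, beer TEXT, drinker TEXT,
                   day TEXT, hour INTEGER, price REAL);

-- Invented sample rows, for illustration only.
INSERT INTO Pubs VALUES
  ('Pod Żaglem', 'Poznań', 'full'),
  ('Joe''s Pub', 'Warszawa', 'full');
INSERT INTO Beers VALUES
  ('Bud', 'Anheuser-Busch'),
  ('Okocim', 'Okocim');
INSERT INTO Sales VALUES
  ('Pod Żaglem', 'Bud',    'Jim',  '2016-05-20', 18, 5.0),
  ('Pod Żaglem', 'Bud',    'Bob',  '2016-05-20', 19, 5.0),
  ('Pod Żaglem', 'Okocim', 'Jim',  '2016-05-21', 20, 4.0),
  ('Joe''s Pub', 'Bud',    'Mary', '2016-05-21', 21, 6.0);

-- Stand-in for the materialized view: computed once, stored as a table.
CREATE TABLE PuBeS AS
  SELECT pub, address, beer, prod, SUM(price) AS sale
  FROM Sales NATURAL JOIN Pubs NATURAL JOIN Beers
  GROUP BY pub, address, beer, prod;
""")

# Direct execution: star join, filter, group, aggregate.
direct = conn.execute("""
    SELECT pub, beer, SUM(price)
    FROM Sales NATURAL JOIN Pubs NATURAL JOIN Beers
    WHERE address = 'Poznań' AND prod = 'Anheuser-Busch'
    GROUP BY pub, beer
""").fetchall()

# The same question answered from the precomputed table: only a filter remains.
from_view = conn.execute("""
    SELECT pub, beer, sale
    FROM PuBeS
    WHERE address = 'Poznań' AND prod = 'Anheuser-Busch'
""").fetchall()

print(direct)
print(from_view)
```

Both queries return the same rows; the second one skips the join and the grouping, at the price of storing PuBeS and refreshing it whenever Sales changes.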

Example: finale
The reformulated query (it now uses the materialized view PuBeS):
    SELECT pub, beer, sale
    FROM PuBeS
    WHERE address = 'Poznań' AND prod = 'Anheuser-Busch';

Materialization aspects
- Type and frequency of queries
- Computing time for queries
- Storage costs
- Updating costs

MOLAP and datacubes
- The (keys of) dimension tables become the dimensions of a hypercube.
- Example: for the data from the Sales table we have 4 dimensions: pub, beer, drinker and time.
- Dependent attributes (e.g. price) are located in the points (cells) of the hypercube.

Visualization: hypercubes
[Figure: hypercubes]

Borders
- Often a cube should also contain aggregations (usually SUM or AVG) along the hyperedges of the cube.
- Borders contain one-dimensional, two-dimensional, ... aggregations.

Example: borders
- Our 4-dimensional hypercube Sales may contain the sums of price for each pub, each beer, each drinker (ummm... sensitive personal data) and each time unit (probably days).
- It could also contain the sums of price for all pairs pub-beer, triples pub-drinker-day, ...

Structure of the cube
- We extend each dimension with one additional value *.
- A cell with one or more coordinates equal to * contains the aggregate over all values of the dimensions marked with *.
- Example: Sales('Pod Żaglem', 'Bud', *, *) contains the total price of all Bud beer sold in the pub “Pod Żaglem”, to all drinkers and at any time.

Drill-down
- Drill-down = “deaggregation”: decompose an aggregation into its components.
- Example: after finding that “Pod Żaglem” sells little Okocim beer, one may try to decompose these sales into the particular kinds of Okocim.

Roll-up
- Roll-up = additional aggregation on one or more dimensions.
- Example: given a table showing how much Okocim beer is drunk by each drinker in each pub, we roll it up into a table giving the total amount of Okocim beer drunk by each drinker.

Roll-Up and Drill-Down
Anheuser-Busch by drinker and pub:

        Joe’s Pub   Nut-House   Blue Chalk
Jim         45          50          38
Bob         33          36          31
Mary        30          42          40

Rolling up by pubs (A-B per drinker):

Jim     133
Bob     100
Mary    112

Roll-Up and Drill-Down
Drilling down by beers (A-B beers per drinker):

        Bud   M’lob   Bud Light
Jim      40     45       48
Bob      29     31       40
Mary     40     37       35

Materialized views for datacubes
- Useful materialized views for datacubes should aggregate by one or more dimensions.
- A dimension should not be aggregated away completely; it may instead be grouped by some attribute from its dimension table.

Example
A materialized view for our Sales hypercube could:
1. Aggregate totally by drinker.
2. Not aggregate at all by beer.
3. Aggregate by time using weeks.
4. Aggregate by town for pubs.
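The cube structure with the extra value * and the roll-up by pubs can be reproduced from the drinker/pub table above. The following is a small sketch, not a description of how a MOLAP engine stores its data; the function `build_cube` and the constant `ALL` are hypothetical names, and only two dimensions (drinker, pub) are used to keep it short.

```python
from itertools import product

# Base cells copied from the drinker/pub table in the text.
cells = {
    ("Jim", "Joe's Pub"): 45, ("Jim", "Nut-House"): 50, ("Jim", "Blue Chalk"): 38,
    ("Bob", "Joe's Pub"): 33, ("Bob", "Nut-House"): 36, ("Bob", "Blue Chalk"): 31,
    ("Mary", "Joe's Pub"): 30, ("Mary", "Nut-House"): 42, ("Mary", "Blue Chalk"): 40,
}

ALL = "*"  # the additional value added to every dimension


def build_cube(base_cells):
    """Return the base cells plus all border cells (SUM aggregation)."""
    cube = {}
    for coords, value in base_cells.items():
        # For every dimension keep the real coordinate or replace it by ALL;
        # each base cell therefore contributes to 2**n cells of the cube.
        for key in product(*[(c, ALL) for c in coords]):
            cube[key] = cube.get(key, 0) + value
    return cube


cube = build_cube(cells)

# Roll-up by pubs: read the border cells (drinker, *).
rolled_up_by_pub = {d: cube[(d, ALL)] for d in ("Jim", "Bob", "Mary")}
print(rolled_up_by_pub)
print(cube[(ALL, ALL)])  # grand total over both dimensions
```

Drill-down is the reverse lookup: starting from `cube[("Jim", ALL)]`, one reads the cells `("Jim", pub)` for each pub to see the components of the total.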

Indexes
- Traditional techniques
  - B-trees, hash tables, R-trees, grids, ...
- Specific
  - inverted lists
  - bitmap indexes
  - join indexes

Using inverted lists
- Query: find people with age = 20 and name = “Fred”
- List for age = 20: r4, r18, r34, r35
- List for name = “Fred”: r18, r52
- The answer is obtained as the intersection: r18

MDX
- Multidimensional Expressions (MDX): a query language for MOLAP, initially part of OLE DB for OLAP (Microsoft, 1997).
- Then used by Microsoft OLAP Services 7.0 and Microsoft Analysis Services.
- XML for Analysis contains MDX as its query language.
- Supported by Applix, Oracle, SAS, SAP, Panorama Software, Cognos, Hyperion Solutions and others.
- In 2001 the XMLA Council (www.xmla.org) published the standard for XML for Analysis, with the query language mdXML (MDX enclosed in an XML <Statement> tag).

Example query in MDX
    SELECT { [Measures].[Sales in shops] } ON COLUMNS,
           { [Date].[2002], [Date].[2003] } ON ROWS
    FROM Sales
    WHERE ( [Shop].[Europe].[Poland] )
- The SELECT clause determines the “axes” of the query: Sales in shops from the Measures dimension, and 2002 plus 2003 from the Date dimension.
- The FROM clause indicates that the data source is the hypercube Sales.
- The WHERE clause defines the “cross-section” as the element Poland of the dimension Shop.

Trends
- Oracle: Essbase (after taking over Hyperion), BI Server.
- IBM: Cognos 8 BI (together with ‘PowerPlay Studio’), the TM1 database (Applix).
- Microsoft: the Panorama database (included in SQL Server 7), two analysis tools (Maximal and ProClarity), integration with Excel, SharePoint and Visio. A planned in-memory tool, Gemini.

Data Mining
- Automatic search for “interesting” patterns and trends in data.
- The term data mining is mostly used to describe the summarization of large data sets in a useful way.
- Showing regularities, often written as rules.
- Inductive methods are used, as opposed to deductive data bases (like Datalog).
- Consequence: the results are never universally guaranteed; they could be an effect of the momentary contents of the data base.

Examples
- Grouping all WWW pages according to subject.
- Preventing credit fraud: finding characteristic properties of illegal credit card transactions.
- Searching for associations, e.g. finding goods often bought together.
- Finding similar sequences of behavior, e.g. shares with similar oscillations of quotations.

Characteristics
- Basically a nontrivial automatic extraction of unknown and potentially useful information contained implicitly in a database.
- Based on searching for patterns in data, without constructing hypotheses beforehand.
- This differs from the classical statistical approach, where the analyst builds hypotheses and tries to verify them on a sample from the data base.
- More trouble arises in situations where patterns are discovered in a recursive decision process.
- Information in a database is often noisy and incomplete, so some statistical knowledge is necessary anyway.

Technology
- Generally artificial intelligence, machine learning, neural networks, association rules, rough sets.
- Classification and forecasting: building a classifier for categories given in advance.
- Cluster analysis: the categories are defined during the analysis.
- Pattern recognition and searching.
- Decision trees.

Clustering: issues
- Partitioning data into automatically generated categories.
- Do we have the expected number of groups?
- How to find the “best” groups?
- Are the groups semantically meaningful? e.g. “yuppies”

Market-basket analysis
- Market baskets = sets of goods which are bought together by a customer during one visit to the shop.
- Summary of market baskets: frequent sets of items, i.e. sets of goods often found together.

Tools
- Weka: New Zealand
- Rses and Rses-lib: MIMUW
- SAS
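The summary of market baskets by frequent sets of items, described under market-basket analysis, can be sketched by plain counting. This is a brute-force illustration on invented baskets, not the algorithm used by any of the tools listed; the function `frequent_itemsets` and the threshold `min_support` are names assumed here.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets, invented for illustration.
baskets = [
    {"beer", "chips", "bread"},
    {"beer", "chips"},
    {"bread", "milk"},
    {"beer", "chips", "milk"},
    {"bread", "chips"},
]


def frequent_itemsets(baskets, size, min_support):
    """Count every itemset of the given size and keep those that occur
    in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # Sorting makes each itemset a canonical tuple, e.g. ("beer", "chips").
        for itemset in combinations(sorted(basket), size):
            counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


frequent_pairs = frequent_itemsets(baskets, size=2, min_support=3)
print(frequent_pairs)
```

With these baskets only the pair beer and chips reaches the threshold of three baskets. Counting all combinations grows quickly with basket size, which is why practical tools prune candidate sets instead of enumerating them all.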