Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Lecture 6 Data management Technology of information systems Contents • Different types of data • Data processing • Data warehouse systems Technology of information systems Different types of data (1) • In an enterprise we distinguish the following business processes: – Primary processes: delivery of products and services – Secondary processes: taking care of resources – Tertiary processes: management over the other processes, i.e. planning, control and coordination. • Different terms: – Primary processes: production processes – Secondary processes: support processes – Tertiary processes: management processes – Operational processes: primary + secundary • Each of these type of processes generates its own data and has different information needs Technology of information systems Different types of data (2) • Primary processes generate information about: – Orders generated by • A client, in case of Make-To-Order • The (master) production plan in case of Make-To-Stock – Execution of tasks for the processing of the orders, such as: • When the task was performed • Which durable resources were involved • How much or which consumable resources were involved • Primary processes need information about the clients, the (master) production plan and the availablity of resources. Note: For us “information” and ”data” are synonyms; others use data for basic facts and information for transformed facts that have a meaning to people. Technology of information systems Different types of data (3) • Secondary processes generate information about: – Status of resources: • • • • Quality Availablity Maintenance Purchase and removal – Use of resources in primary business processes • They need information about: – Need for resources in the future – Availability of resources in the market (external info) Technology of information systems Different types of data (4) • Tertiary processes generate periodically or on request for external stakeholders (e.g. share holders and government) information about status and forecasts of, such as: – – – – – – – Financial resources Turnover (earnings) Costs (expenditures) Profit Efficiency Energy consumption and pollution Diversity • For internal stakeholders (employees) they generate plans, procedures and decisions. • They need information about: – The primary and secundary processes Technology of information systems Different types of data (5) • Operational information (from primary and secondary processes) is about actual facts • The processing of operational information is called: On Line Transaction Processing (OLTP) • The management information is aggregated historical facts. • The processing of management information is called: On Line Analytical Processing (OLAP) • Note that OLTP needs the actual data, while OLAP can use “old data” (since aggregates change slowly) Technology of information systems Different types of data (6) • Discovery of patterns in this data is called: Data mining (process mining is a special subset) • Management information is transformed into – Forecasts – Plans – Decisions • These transformations are performed by decision support systems. Technology of information systems Different types of data (7) Each kind of information has its own preferred data type. This is not a “law”, but a “guideline”: • Order-related information-------Document type • Resources-related information----Relational type • Management information--------Hyper-cube type For each type we have specfic COTS: • Document typeXML databases or dedicated document managers • Relational type relational database systems • Hyper-cube type data warehouse systems Technology of information systems Different types of data (8) • For order-related data use the document type: – Each case has its own document, which is normally a tree structure. – Only actual data, if the case is handled there is no use of the information in the primary processes. – An XML-database is a good candidate for this. – Storing this type of data in a relational database means that all case data is divided over several tables – In the operations we seldom have to relate cases, we work on one case at a time – So retrieval of a case means a query over many tables and taking out one or a few tuple(s) (tuple = record) and for updating the case data has to be divided into small pieces and stored in different tables. – Disadvantage: redundancy, if a clients name changes, we have to update all documents. Technology of information systems Different types of data (9) • For resource-related data use the relational type: – There are many relationships among resources and between resources and orders. – Resources (e.g. some stock item) have a status that can be changed and they perform activities. Both are recorded and they are related. – For instance if one replenish order is made for a supplier, the inventory of all resource types supplied by this supplier are retrieved. – Sometimes it is also case oriented data, e.g. • Personnel files • Maintenance of equipment Technology of information systems Different types of data (10) • For management infomation use the hyper-cube • Aggregation means that we are not interested in simple facts but in statistics, i.e. quantification over attributes of event types. • So events have one or more attributes, some of them are numerical. For instance a sales event might have the following attributes: – Sales date – Sales person – Client – Product type – Volume of the order – Price of the order Technology of information systems Different types of data (11) • Management is not interested in this fact but wants to know for instance: What is the total value (or volume) of the orders: – Of a certain product type – In a particular period – In a particular region • Dataware house terminology: – Events with attributes are called facts – The different attributes are called dimensions, typically: time, geography and organization structure – The attributes with numerical values, used as statistics, are called measures Technology of information systems Different types of data (12) • The most important statistics are: – Frequencies: How often did an event of a special type occur? Often a histogram is a good presentation, but a pie chart will also do. – Totals: What is the sum of all values associated with these events? – Averages: What is the average (mean) value of an event type? – Variances: What is the spread of a value of an event type? (spread is square root of variance) – Coefficient of variation = spread divided by mean value. Give some usefull examples for each of them Technology of information systems Data processing(1) There is a natural direction of data processing: 1. Operational processes generate actual facts and the are collected in data manager (either document managers or relational data bases). 2. As soon as the transactions are closed, the data is not of interest anymore and it should be archived in a historical data manager. 3. Concurrently with this archiving some useful information is collected in a data warehouse for management infomation (this will be elaborated) 4. However a data warehouse can also become full, so we need a strategy to refresh a data warehouse as well. Technology of information systems Data processing(2) There is a natural direction of data processing: New facts Operational Data Managers Old facts Extract, Transform, Load Old facts Archive Extracted and transformed update Technology of information systems Data warehouse Data processing (3) Use ideas from statistics: • • • • Sequence of measurements: X1,X2,X3,…. Assume n>1 Mean estimator: Mn = (i=1…n Xi)/n So the update is: Mn+1 = (n.Mn + Xn+1)/(n + 1) • Variance estimator: Sn = (i=1…n Xi2 – n.Mn2)/(n - 1) • The update is: Sn+1 =((n - 1)/n).Sn + (Mn - Xn+1)2/(n + 1) • So we have to memorize only: n, Mn, Sn (M0=S0=0) Technology of information systems Data processing (4) Updating statistics: • Sequence of measurements: X1,X2,X3,…. • The moving average (history length k): Mn= (i=n-k+1..n Xi)/k • Weighted moving average: Mn= (i=n-k+1..n ai.Xi)/k • Usual case: exponential smoothing (0<a<1) Mn= (i=n-k+1..n a(n-i).Xi)/k so Xn-k+1 obtains weight ak-1 and X1 weight 1 Technology of information systems Data warehousing (1) Three-Tier Architecture other Metadata sources Operational Data Bases Extract Transform Load Refresh OLAP Server Monitor & Integrator Analysis Query/Reporting Operational Data Store Serve Data Warehouse Data Mining ROLAP Server Data Marts Data Sources Data Storage OLAP Engine Front-End Tools Technology of information systems Data warehousing (2) About the architecture: • ETL-tools take care of: – Extraction: take the relevant data out of the Operational Data Bases (ODB) – Transformation: • Cleansing: resolving inconsistencies • Conversion: format unification • Consolidation: different sources have different levels of detail – Loading: uploading in the data warehouse • Often a data warehouse has a Operational Data Store (ODS). This is not an ODB but a intermediate layer between ODB and the data ware house. It contains detailed information, including history. • Data marts: specialized subsets of a data warehouse for a specific application domain. Technology of information systems Data warehousing (3) •Multidimensional view on facts (events with attributes) •In each dimension we may have an aggregation hierarchy, e.g. second, minute, hour, day, week,... Value or volume sales location client product Technology of information systems Data warehousing (4) Data Cubes 2Qtr 3Qtr 4Qtr sum Ireland France Germany sum Technology of information systems Country TV PC VCR sum 1Qtr Date Data warehousing (5) Hierarchy provides a lattice of cubes all product date country date, country product, date product, country product, date, country Technology of information systems Datawarehousing (6) • Result of a query is: – One cell – Two dimensional slice (or cross table) – Multi dimensional sub cube Product: color Date:month, JuneAugust 2006 Blue Red Orange Total June 51 25 158 234 July 58 20 120 198 August 65 22 51 138 Total 174 67 329 570 Technology of information systems Data warehousing (7) Cross tabulation Dimensions June Blue Red Orange Total 51 25 158 234 August Data Points/ 58 st 20 120 1 level of aggregation 65 22 51 Total 174 July 67 329 Aggregated w.r.t.ofY-dim Technology information systems 198 Aggregated w.r.t. X-dim 138 570 Aggregated w.r.t. X and Y Data warehousing (8) Operations with Data Cubes (1) • Roll up, aggregation – store city – cities country – product product type • Drill down city store – country cities – product type product • Drill through: go back to the original, individual data records Technology of information systems Data warehousing (9) Operations with Data Cubes (2) • Pivoting: change the dimensions that are “displayed”; select a cross-tab. – look at the cross-table for product-date – display cross-table for date-customer • Slice and dice: select a part of the cube by restricting one or more dimensions – restrict analysis to “product type = television” and “region = France” Technology of information systems Data warehousing (10) Implementation Three categories: – ROLAP (Relational OLAP) • OLAP supported on top of a relational database (SQL has special OLAP features since 1999) – MOLAP (Multi-Dimensional OLAP) • Use of special multi-dimensional data structures, special index techniques for sparse matrices – HOLAP: (Hybrid) • combination of previous two Technology of information systems Data warehousing (11) ROLAP • Cubes can easily be represented in relational tables: special value “all” Month Jan Jan Jan Feb Prod. p1 p2 p1 p1 all Jan Jan all p1 all p1 all all all Cust. c1 c1 c2 c1 … c1 c1 all c1 … all Price 10 8 10 9 102 18 1 230 4 235 1 253 458 Technology of information systems Datawarehousing (12) Data models for ROLAP data warehousing • Star schema • Snowflake schema: further normalization of dimensions • Extension: more stars (i.e. more facts!) with common dimensions. Technology of information systems Datawarehousing (13) Time Example of a Star Schema Geography Time code Quarter code Quarter name Month code Month name Sales Date Account Account code KeyAccount code Geography code Region code Region manager State code City code Geography code Time code Account code KeyAccount name Product code Account name Value Account type Volume Account market Technology of information systems Product Product code Product name Brand code Brand name Product type code Product type name Datawarehousing (14) Example of a Snowflake Schema Quarter Geography Time Quarter code Quarter name Geography code Region code Region manager State code City code Time code Quarter code Month Month codeDate Month code Month name Account Account code KeyAccount code Sales Geography code Time code Product type Product type code Product type name Brand Brand code Brand name Account code KeyAccount name Product code Account name Value Account type Volume Product code Product name Brand code Product type code Account market Technology of information systems Product