Data Warehouses
Brief Overview
Copyright © 2011 Curt Hill

Introduction
• Data warehouses have much in common with databases
  – This may include using the same software
• What distinguishes them is the main purpose
• The database is about transactions
• The data warehouse is about decision support

Example (1)
• Consider the operational database for a large retailer
• The database is used to monitor day-to-day operations
  – How many stock items of any type are present?
  – How much was sold yesterday?
  – What will the next payroll need?
• Quick response and ACID are very important
• Used by lower management

Example (2)
• Then there is the corporate data warehouse for the same retailer
• This stores any type of data for a long period of time
  – Tickets from sales
  – Almost everything the operational database has had at one time, with years of history
• Trends are very important
• This is used to guide upper management

Operational Databases
• Operational databases have:
  – Strict performance requirements
  – Predictable workloads
  – Small units of work
  – High utilization
• Data warehouses are contrary in every respect
Copyright © Curt Hill 2003-2011

Some definitions
• A data warehouse
  – Subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management decisions
• Data warehouses support
  – OLAP (OnLine Analytical Processing)
  – DSS/EIS (Decision Support Systems or Executive Information Systems)
  – Data mining

Operational Database
• The operational database needs to retain very little history
• The retailer’s ticket will:
  – Update stock quantities
  – Generate a credit card request
• Once this is successful there is very little need to retain it
• Similarly, once a product is no longer stocked, there is no need to retain information on it

Cleaning
• Once data is of no value to the operational database it is transferred to the
data warehouse
• It may need some reformatting
• Data is frequently deleted from the operational database
• Once in the data warehouse it becomes a permanent addition
  – Hence the expression non-volatile

Data Models
• We want to organize the data in the way that will be most useful
• One common model is a data cube
• Example
  – Region is one dimension
  – Product is a second
  – Month of sale is the third
• Hypercubes contain more dimensions
• Prior to data warehouses this was often done with spreadsheets
  – Dimensionality gets in the way

Figure
[Figure: data cube; surviving labels P1, P2, P3]

Viewing
• The cube presents three distinct faces:
  – Product vs. Region
  – Region vs. Quarter
  – Product vs. Quarter
• A hypercube would present more
• Each of these looks like a spreadsheet display
• To pivot or rotate the cube is to present another face

Viewing Again
• It is also desirable to condense the dimensions in a roll-up display
  – Condensing days into weeks into months into quarters
  – Condensing single stores into larger and larger groupings
  – Condensing single products into related products by functionality or brand
• The opposite of this is the drill-down

Operations
• What common operations exist besides pivot, roll-up, and drill-down?
• Slice and dice
  – Take sectional slices in the hypercube
• Sorting
  – Arrange the data in an order, not necessarily that of the dimensions of the hypercube
• Compute attributes
  – Arithmetic results based on existing values

Multiple Tables
• As you might think, the data cube is more complicated than it first appears
• Two tables
• Fact table
  – Tuples that have the actual data
• Dimension table
  – Tuples of the attributes with selection criteria into the fact table

Said another way
• The fact table contains the data that is aggregated into the cube entries
  – Pretty directly extracted from the operational
database
  – This table may be enormous
• The dimension table contains the selection criteria needed to condense the facts into a cube entry
  – How is that data summarized into the cube?
  – Usually display sized, so small to medium

Fact Table Granularity
• Since the fact table may be enormous, what is the smallest fact worth recording?
• The minimum in retail may be every item sold in one day in one store
  – Straight off of ticket information
• It may also be already aggregated
  – How many items of this product number sold in one day in one store

Fact/Dimension Schemas
• Two typical ways to connect the two types of tables
• Generally there is only one fact table but several dimension tables
• Star
  – Each dimension is a single table
• Snowflake
  – Each dimension is a hierarchy of tables

Dimension Example
• Consider the product dimension
• Each tuple specifies a range of products
  – As few as one
  – As many as an entire brand or type
• In the star model all the accessible data is in these tuples
• In the snowflake model these tuples may reference further tables with more extensive data

Building the Warehouse
• To build a warehouse the following steps are often used
  – Extraction
  – Formatting
  – Cleaning
  – Fitting into the model
  – Loading

Non Warehouse
• We see the same process in moving data from one database to another
• The acronym is ETL
  – Extract
  – Transform
  – Load
• Typically the cleaning step is not needed

Extraction
• Obtain data from one or more sources
• Often, but not always, one or more operational databases
• Any data stream of interest may also be used
  – Sensors at the Large Hadron Collider at CERN generate about a terabyte per second and store about 15 petabytes a year
  – Financial market data

Formatting
• There are often multiple sources to the data which means that we have a variety of fields and
meanings
• Mapping different data sources into a common meaning and format
• Reconciling different dates, such as differing fiscal-year ranges
• Making the data conform to the table formats required so that every field has the same meaning and units

Cleaning
• The data must be checked for validity before entering the warehouse
• The most labor-intensive portion of the build
• The size of the incoming data requires an automated approach
• Each data source may require a different approach
• Backflushing: returning cleaned data to the original sources so they can update their own tables

Fitting
• Putting the data into a form suitable for the data model of the warehouse
• Usually converted from the form of the source database into the cube or hypercube model of the warehouse

Loading
• Insert into the warehouse
• The ability to check that the load completed properly is needed
• The ability to remove incomplete loads and try again is also required

Software
• Data warehouses may use traditional RDBMS or so-called NoSQL database software
• The multidimensional hypercube format does not favor a normal DBMS
• Once data is loaded it is retained
  – SQL INSERT, DELETE, and UPDATE statements are never used on data after it is successfully loaded

Warehouse vs. DBMS
• Operational databases are crisp and up to date
• Warehouses do not need the transactional ACID of an operational database
• Warehouses may also lag operational databases by days to weeks

Knowledge Workers
• Have a different skill set than many others
• Business analyst
  – Understands the business processes of the organization
• Programming skills
  – The organization of complicated queries is often much more than simple SQL
  – Usually involves considerable programmed search and aggregation

Finally
• Much contrast between an operational database and a data warehouse
• The warehouse is used to support managerial decisions
  – Usually at a much higher level than the operational database
• There is another presentation on NoSQL databases
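
Worked sketch
The fact table, dimension tables, and cube operations (roll-up, slice, pivot) from the Data Models, Operations, and Fact/Dimension Schemas slides can be illustrated with a small in-memory star schema. This is only a minimal sketch in plain Python, not how a real warehouse product is implemented. Every table, name, and number in it (STORES, PRODUCTS, DATES, FACTS, the cities, the unit counts) is made up for illustration.

```python
from collections import defaultdict

# Hypothetical dimension tables (star schema: one flat table per dimension).
STORES = {
    1: {"city": "Fargo", "region": "North"},
    2: {"city": "Austin", "region": "South"},
}
PRODUCTS = {
    10: {"name": "P1", "brand": "Acme"},
    11: {"name": "P2", "brand": "Acme"},
    12: {"name": "P3", "brand": "Zenith"},
}
DATES = {
    "2011-01-15": {"month": "2011-01", "quarter": "2011-Q1"},
    "2011-02-03": {"month": "2011-02", "quarter": "2011-Q1"},
    "2011-04-20": {"month": "2011-04", "quarter": "2011-Q2"},
}

# Hypothetical fact table: (store_id, product_id, date, units_sold).
# Granularity: one product in one store on one day.
FACTS = [
    (1, 10, "2011-01-15", 5),
    (1, 11, "2011-01-15", 3),
    (2, 10, "2011-02-03", 7),
    (2, 12, "2011-02-03", 2),
    (1, 12, "2011-04-20", 4),
    (2, 11, "2011-04-20", 6),
]


def describe(fact):
    """Join one fact row with its three dimension rows."""
    store_id, product_id, date, units = fact
    row = {"units": units}
    row.update(STORES[store_id])
    row.update(PRODUCTS[product_id])
    row.update(DATES[date])
    return row


def roll_up(facts, *attrs):
    """Sum units sold, grouped by the chosen dimension attributes."""
    totals = defaultdict(int)
    for fact in facts:
        row = describe(fact)
        totals[tuple(row[a] for a in attrs)] += row["units"]
    return dict(totals)


def slice_facts(facts, attr, value):
    """Keep only the facts whose dimension attribute equals value."""
    return [f for f in facts if describe(f)[attr] == value]


# One face of the cube: product vs. region.
by_product_region = roll_up(FACTS, "name", "region")
# Pivot to another face: region vs. quarter.
by_region_quarter = roll_up(FACTS, "region", "quarter")
# Roll up further: condense everything into quarters.
by_quarter = roll_up(FACTS, "quarter")
# Slice: restrict to one quarter, then total by product.
q1_by_product = roll_up(slice_facts(FACTS, "quarter", "2011-Q1"), "name")

print(by_quarter)
print(q1_by_product)
```

Drill-down is the same call with a finer attribute, for example roll_up(FACTS, "region", "month") instead of "quarter". A snowflake variant would replace the flat dimension dictionaries with entries that reference further tables.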