Data Warehouses
Brief Overview
Copyright © 2011 Curt Hill

Introduction
• Data warehouses have much in common with databases
  – This may include using the same software
• What distinguishes them is the main purpose
• The database is about transactions
• The data warehouse is about decision support

Example (1)
• Consider the operational database for a large retailer
• The database is used to monitor day-to-day operations
  – How many stock items of any type are present?
  – How much was sold yesterday?
  – What will the next payroll need?
• Quick response and ACID are very important
• Used by lower management

Example (2)
• Then there is the corporate data warehouse for the same retailer
• This stores any type of data for a long period of time
  – Tickets from sales
  – Almost everything the operational database has had at one time, with years of history
• Trends are very important
• This is used to guide upper management

Operational Databases
• Operational databases have:
  – Strict performance requirements
  – Predictable workloads
  – Small units of work
  – High utilization
• Data warehouses are contrary in every respect
Copyright © Curt Hill 2003-2011

Some Definitions
• A data warehouse
  – Subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management decisions
• Data warehouses support
  – OLAP (OnLine Analytical Processing)
  – DSS/EIS (Decision Support Systems or Executive Information Systems)
  – Data mining

Operational Database
• The operational database needs to retain very little history
• The retailer's ticket will:
  – Update stock quantities
  – Generate a credit card request
• Once this is successful there is very little need to retain it
• Similarly, once a product is no longer stocked, there is no need to retain information on it

Cleaning
• Once data is of no value to the operational database it is transferred to the data warehouse
• It may need some reformatting
• Data is frequently deleted from the operational database
• Once in the data warehouse it becomes a permanent addition
  – Hence the expression nonvolatile

Data Models
• We want to organize the data in the way that will be most useful
• One common model is a data cube
• Example
  – Region is one dimension
  – Product is a second
  – Month of sale is the third
• Hypercubes contain more dimensions
• Prior to data warehouses this was often done with spreadsheets
  – Dimensionality gets in the way

Figure
[Figure: data cube; surviving labels P1, P2, P3]

Viewing
• The cube presents three distinct faces:
  – Product vs. Region
  – Region vs. Quarter
  – Product vs. Quarter
• A hypercube would present more
• Each of these looks like a spreadsheet display
• To pivot or rotate the cube is to present another face

Viewing Again
• It is also desirable to condense the dimensions in a roll-up display
  – Condensing days into weeks into months into quarters
  – Condensing single stores into larger and larger groupings
  – Condensing single products into related products by functionality or brand
• The opposite of this is the drill-down

Operations
• What common operations exist besides pivot, roll-up, and drill-down?
• Slice and dice
  – Take sectional slices in the hypercube
• Sorting
  – Arrange the data in an order, not necessarily that of the dimensions of the hypercube
• Compute attributes
  – Arithmetic results based on existing values

Multiple Tables
• As you might think, the data cube is more complicated than it first appears
• Two tables
• Fact table
  – Tuples that have the actual data
• Dimension table
  – Tuples of the attributes with selection criteria into the fact table

Said Another Way
• The fact table contains the data that is aggregated into the cube entries
  – Pretty directly extracted from the operational database
  – This table may be enormous
• The dimension table contains the selection criteria needed to condense the facts into a cube entry
  – How that data is summarized into the cube
  – Usually display sized, so small to medium

Fact Table Granularity
• Since the fact table may be enormous, what is the smallest fact worth recording?
• The minimum in retail may be every item sold in one day in one store
  – Straight from the ticket information
• It may also be already aggregated
  – How many items of this product number sold in one day in one store

Fact/Dimension Schemas
• Two typical ways to connect the two types of tables
• Generally there is only one fact table but several dimension tables
• Star
  – Each dimension is a single table
• Snowflake
  – Each dimension is a hierarchy of tables

Dimension Example
• Consider the product dimension
• Each tuple specifies a range of products
  – As few as one
  – As many as an entire brand or type
• In the star model all the accessible data is in these tuples
• In the snowflake model these tuples may reference further tables with more extensive data

Building the Warehouse
• To build a warehouse the following steps are often used
  – Extraction
  – Formatting
  – Cleaning
  – Fitting into the model
  – Loading

Non Warehouse
• We see the same process in moving data from one database to another
• The acronym is ETL
  – Extract
  – Transform
  – Load
• The cleaning step is typically not needed

Extraction
• Obtain data from one or more sources
• Often, but not always, one or more operational databases
• Any data stream of interest may also be used
  – Sensors at the Large Hadron Collider at CERN generate about a terabyte per second and store about 15 petabytes a year
  – Financial market data

Formatting
• There are often multiple sources for the data, which means that we have a variety of fields and meanings
• Mapping different data sources into a common meaning and format
• Reconciling different dates, such as the range of the fiscal year
• Making the data conform to the table formats required so that every field has the same meaning and units

Cleaning
• The data must be checked for validity before entering the warehouse
• Most labor-intensive portion of the build
• The size of the incoming data requires an automated approach
• Each data source may require a different approach
• Backflushing: returning cleaned data to the original source so it can update its own tables

Fitting
• Putting the data into a form suitable for the data model of the warehouse
• Usually converted from the form of the source database into the cube or hypercube model of the warehouse

Loading
• Insert into the warehouse
• The ability to check that the load completed properly is needed
• The ability to remove incomplete loads and try again is also required

Software
• Data warehouses may use a traditional RDBMS or so-called NoSQL database software
• The multidimensional hypercube format does not favor a normal DBMS
• Once data is loaded it is retained
  – SQL INSERT, DELETE, and UPDATE statements are never used on data after it is successfully loaded

Warehouse vs. DBMS
• Operational databases are crisp and up to date
• Warehouses do not need the transactional ACID of an operational database
• Warehouses may also lag operational databases by days to weeks

Knowledge Workers
• Have a different skill set than many others
• Business analyst
  – Understands the business processes of the organization
• Programming skills
  – The organization of complicated queries is often much more than simple SQL
  – Usually involves considerable programmed search and aggregation

Finally
• Much contrast between an operational database and a data warehouse
• The warehouse is used to support managerial decisions
  – Usually at a much higher level than the operational database
• There is another presentation on NoSQL databases
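
Sketch
The fact table, dimension tables, and cube operations described in the slides can be illustrated with a small example. This is only a minimal sketch in plain Python, not how any particular warehouse product works. Every table, value, and function name here (STORES, PRODUCTS, FACTS, build_cube, face, slice_cube, roll_up_to_brand) is a made-up assumption for illustration. It follows the star model, with one small table per dimension.

```python
from collections import defaultdict

# Hypothetical dimension tables (star model): key -> descriptive attributes.
STORES = {1: {"region": "North"}, 2: {"region": "North"}, 3: {"region": "South"}}
PRODUCTS = {"P1": {"brand": "Acme"}, "P2": {"brand": "Acme"}, "P3": {"brand": "Zenith"}}

# Hypothetical fact table, already aggregated per product, store, and month.
# Columns: (product_id, store_id, month, units_sold)
FACTS = [
    ("P1", 1, 1, 10), ("P1", 2, 1, 5), ("P1", 3, 2, 7),
    ("P2", 1, 2, 3), ("P2", 3, 4, 8),
    ("P3", 2, 5, 4), ("P3", 3, 6, 6),
]

# Axis positions inside a cube key (product, region, quarter).
PRODUCT, REGION, QUARTER = 0, 1, 2


def quarter(month):
    """Roll a month number (1-12) up to its quarter label."""
    return f"Q{(month - 1) // 3 + 1}"


def build_cube(facts):
    """Aggregate fact rows into cells keyed by (product, region, quarter)."""
    cube = defaultdict(int)
    for product_id, store_id, month, units in facts:
        region = STORES[store_id]["region"]  # roll-up: store -> region
        cube[(product_id, region, quarter(month))] += units
    return dict(cube)


def face(cube, row_axis, col_axis):
    """Sum out the remaining axis to get one two-dimensional face of the cube."""
    table = defaultdict(int)
    for key, units in cube.items():
        table[(key[row_axis], key[col_axis])] += units
    return dict(table)


def slice_cube(cube, axis, value):
    """Keep only the cells whose coordinate on the given axis equals value."""
    return {key: units for key, units in cube.items() if key[axis] == value}


def roll_up_to_brand(cube):
    """Condense the product axis into brands using the product dimension."""
    rolled = defaultdict(int)
    for (product_id, region, qtr), units in cube.items():
        rolled[(PRODUCTS[product_id]["brand"], region, qtr)] += units
    return dict(rolled)


cube = build_cube(FACTS)
product_by_region = face(cube, PRODUCT, REGION)
region_by_quarter = face(cube, REGION, QUARTER)  # pivot: present another face
north_only = slice_cube(cube, REGION, "North")   # slice on one region
by_brand = roll_up_to_brand(cube)                # roll-up: product -> brand
```

Drill-down is the reverse direction: going back to FACTS to see the individual stores or months behind a cell. In a snowflake model the brand would live in a further table referenced from PRODUCTS rather than in the product tuple itself.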