Data Warehouses
Brief Overview
Copyright © 2011 Curt Hill
Introduction
• Data warehouses have much in
common with databases
– This may include using the same
software
• What distinguishes them is the main
purpose
• The database is about transactions
• The data warehouse is about
decision support
Example(1)
• Consider the operational database
for a large retailer
• The database is used to monitor day
to day operations
– How many stock items of any type are
present?
– How much was sold yesterday?
– What will the next payroll need?
• Quick response and ACID are very
important
• Used by lower management
Example(2)
• Then there is the corporate data
warehouse for the same retailer
• This stores any type of data for a
long period of time
– Tickets from sales
– Almost everything the operational
database has had at one time with
years of history
• Trends are very important
• This is used to guide upper
management
Operational Databases
• Operational databases have:
– Strict performance requirements
– Predictable workloads
– Small units of work
– High utilization
• Data warehouses are contrary in
every respect
Copyright © Curt Hill 2003-2011
Some definitions
• A data warehouse
– Subject-oriented, integrated, nonvolatile, time-variant collection of data
in support of management decisions
• Data warehouses support
– OLAP (OnLine Analytical Processing)
– DSS/EIS (Decision Support Systems or
Executive Information Systems)
– Data mining
Operational Database
• The operational database needs to
retain very little history
• The retailer’s ticket will:
– Update stock quantities
– Generate a credit card request
• Once this is successful there is very
little need to retain it
• Similarly, once a product is no
longer stocked, there is no need to
retain information on it
Cleaning
• Once data is of no value to the
operational database it is
transferred to the data warehouse
• It may need some reformatting
• Data is frequently deleted from the
operational database
• Once in the data warehouse it
becomes a permanent addition
– Hence the expression non-volatile
Data Models
• We want to organize the data in the
way that will be most useful
• One common model is a data cube
• Example
– Region is one dimension
– Product is a second
– Month of sale is the third
• Hypercubes contain more dimensions
• Prior to data warehouses this was
often done with spreadsheets
– Dimensionality gets in the way
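As a rough illustration of the three-dimensional example above, a cube can be sketched as a mapping from coordinates to a total. The sales rows and the name build_cube are assumptions made up for this sketch, not part of the original slides.

```python
from collections import defaultdict

# Hypothetical sales facts: (region, product, month, units sold).
sales = [
    ("East", "P1", "Jan", 10),
    ("East", "P2", "Jan", 4),
    ("West", "P1", "Jan", 7),
    ("West", "P1", "Feb", 3),
    ("East", "P1", "Jan", 5),
]

def build_cube(rows):
    """Sum units into one cell per (region, product, month) coordinate."""
    cube = defaultdict(int)
    for region, product, month, units in rows:
        cube[(region, product, month)] += units
    return dict(cube)

cube = build_cube(sales)
print(cube[("East", "P1", "Jan")])  # 15
```

Adding a fourth dimension (a hypercube) would only lengthen the key tuple, which is where a flat spreadsheet starts to struggle.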
Figure
[Figure: data cube diagram. Only the labels P1, P2 and P3 survive from the original image.]
Viewing
• The cube presents three distinct
faces:
– Product vs. Region
– Region vs. Quarter
– Product vs. Quarter
• A hypercube would present more
• Each of these looks like a
spreadsheet display
• To pivot or rotate the cube is to
present another face
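One way to picture a face of the cube is as a two-dimensional table obtained by summing over the dimension that is not shown. The sketch below assumes a small made-up cube; the helper name face and the DIMS tuple are hypothetical.

```python
from collections import defaultdict

# Hypothetical cube cells keyed by (product, region, quarter).
cube = {
    ("P1", "East", "Q1"): 15,
    ("P1", "West", "Q1"): 10,
    ("P2", "East", "Q1"): 4,
    ("P1", "East", "Q2"): 6,
}
DIMS = ("product", "region", "quarter")

def face(cells, row_dim, col_dim):
    """Return {(row, col): total}, summing over the remaining dimension."""
    r, c = DIMS.index(row_dim), DIMS.index(col_dim)
    view = defaultdict(int)
    for key, value in cells.items():
        view[(key[r], key[c])] += value
    return dict(view)

product_by_region = face(cube, "product", "region")
# Pivoting is just asking for a different pair of dimensions.
region_by_quarter = face(cube, "region", "quarter")
```

Each returned mapping corresponds to one spreadsheet-like display of the same underlying data.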
Viewing Again
• It is also desirable to condense the
dimensions in a roll-up display
– Condensing days into weeks into
months into quarters
– Condensing single stores into larger
and larger groupings
– Condensing single products into
related products by functionality or
brand
• The opposite of this is the drill-down
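A minimal sketch of roll-up and drill-down along the time dimension, assuming a made-up month-to-quarter hierarchy. The function names roll_up and drill_down are illustrative only.

```python
from collections import defaultdict

# Hypothetical monthly totals for one product in one region.
monthly = {"Jan": 10, "Feb": 12, "Mar": 8, "Apr": 5, "May": 7, "Jun": 9}

# Assumed calendar hierarchy: each month belongs to one quarter.
MONTH_TO_QUARTER = {
    "Jan": "Q1", "Feb": "Q1", "Mar": "Q1",
    "Apr": "Q2", "May": "Q2", "Jun": "Q2",
}

def roll_up(values, hierarchy):
    """Condense fine-grained members into their parent groups."""
    totals = defaultdict(int)
    for member, amount in values.items():
        totals[hierarchy[member]] += amount
    return dict(totals)

def drill_down(values, hierarchy, parent):
    """Expand one parent group back into its fine-grained members."""
    return {m: v for m, v in values.items() if hierarchy[m] == parent}

quarterly = roll_up(monthly, MONTH_TO_QUARTER)
q2_detail = drill_down(monthly, MONTH_TO_QUARTER, "Q2")
```

The same pattern would apply to stores grouped into regions or products grouped into brands, given a suitable hierarchy mapping.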
Operations
• What common operations exist
besides pivot, roll-up, and drill-down?
• Slice and dice
– Take sectional slices in the hypercube
• Sorting
– Arrange the data in an order, not
necessarily that of the dimensions of
the hypercube
• Compute attributes
– Arithmetic results based on existing
values
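The operations listed above can be sketched on a small in-memory cube. All data and helper names here (slice_cube, dice_cube) are assumptions for illustration. In this sketch "slice" fixes one dimension at a single value and "dice" restricts several dimensions at once, which is one common reading of the terms.

```python
# Hypothetical cells keyed by (region, product, month) -> (units, revenue).
cube = {
    ("East", "P1", "Jan"): (15, 150.0),
    ("East", "P2", "Jan"): (4, 100.0),
    ("West", "P1", "Jan"): (7, 77.0),
    ("West", "P1", "Feb"): (3, 30.0),
}

def slice_cube(cells, month):
    """Slice: fix the month dimension at a single value."""
    return {k: v for k, v in cells.items() if k[2] == month}

def dice_cube(cells, regions, products):
    """Dice: keep a sub-cube by restricting two dimensions to chosen sets."""
    return {k: v for k, v in cells.items()
            if k[0] in regions and k[1] in products}

january = slice_cube(cube, "Jan")
east_p1 = dice_cube(cube, {"East"}, {"P1"})

# Computed attribute: average price per unit, derived from existing values.
avg_price = {k: revenue / units for k, (units, revenue) in cube.items()}

# Sorting: order cells by units sold, independent of the dimension order.
by_units = sorted(cube, key=lambda k: cube[k][0], reverse=True)
```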
Multiple Tables
• As you might expect, the data cube is
more complicated than it first
appears
• It is built from two kinds of tables
• Fact table
– Tuples that have the actual data
• Dimension table
– Tuples of the attributes with selection
criteria into the fact table
Said another way
• The fact table contains the data that
is aggregated into the cube entries
– Pretty directly extracted from the
operational database
– This table may be enormous
• The dimension table contains the
selection criteria needed to
condense the facts into a cube entry
– How that data is summarized into the
cube
– Usually display sized, so small to
medium
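To make the split concrete, here is a small sketch using Python's built-in sqlite3 module. The table names sales_fact and product_dim, their columns, and the rows are all invented for this example. The dimension table supplies the grouping (brand), and the fact table supplies the numbers that get aggregated into a cube entry.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension table: small, describes how products are grouped.
    CREATE TABLE product_dim (
        product_id INTEGER PRIMARY KEY,
        name       TEXT,
        brand      TEXT
    );
    -- Fact table: one row per sale, potentially enormous.
    CREATE TABLE sales_fact (
        product_id INTEGER REFERENCES product_dim(product_id),
        store      TEXT,
        sale_date  TEXT,
        units      INTEGER
    );
""")
conn.executemany(
    "INSERT INTO product_dim VALUES (?, ?, ?)",
    [(1, "Cola", "Acme"), (2, "Lemonade", "Acme"), (3, "Tea", "Zenith")])
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
    [(1, "S1", "2011-01-03", 10), (2, "S1", "2011-01-03", 4),
     (3, "S2", "2011-01-04", 6), (1, "S2", "2011-01-04", 5)])

# Condense the facts into one entry per brand via the dimension table.
rows = conn.execute("""
    SELECT d.brand, SUM(f.units)
    FROM sales_fact AS f
    JOIN product_dim AS d ON d.product_id = f.product_id
    GROUP BY d.brand
    ORDER BY d.brand
""").fetchall()
print(rows)  # [('Acme', 19), ('Zenith', 6)]
```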
Fact Table Granularity
• Since the fact table may be
enormous, what is the smallest fact
worth recording?
• The minimum in retail may be every
item sold in one day in one store
– Straight off of ticket information
• It may also be already aggregated
– How many items of this product number
sold in one day in one store
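The two granularities can be compared with a short sketch: raw ticket lines versus one pre-aggregated fact per product per store per day. The ticket data and the name aggregate_daily are hypothetical.

```python
from collections import Counter

# Hypothetical ticket lines: (store, date, product number, quantity).
ticket_lines = [
    ("S1", "2011-01-03", 1001, 1),
    ("S1", "2011-01-03", 1001, 2),
    ("S1", "2011-01-03", 2002, 1),
    ("S2", "2011-01-03", 1001, 4),
]

def aggregate_daily(lines):
    """Collapse ticket lines into one fact per (store, date, product)."""
    totals = Counter()
    for store, day, product, qty in lines:
        totals[(store, day, product)] += qty
    return dict(totals)

daily = aggregate_daily(ticket_lines)
print(len(ticket_lines), "->", len(daily))  # 4 -> 3
```

The coarser grain shrinks the fact table but gives up the ability to drill down to individual tickets.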
Fact/Dimension Schemas
• Two typical ways to connect the two
types of tables
• Generally there is only one fact table
but several dimension tables
• Star
– Each dimension is a single table
• Snowflake
– Each dimension is a hierarchy of
tables
Dimension Example
• Consider the product dimension
• Each tuple specifies a range of
products
– As few as one
– As many as an entire brand or type
• In the star model all the accessible
data is in these tuples
• In the snowflake model these tuples
may reference further tables with
more extensive data
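A sketch of the snowflake variant, again using sqlite3 with invented tables (brand_dim, product_dim, sales_fact). Here the product dimension references a separate brand table, so a query that groups by a brand attribute needs one extra join. In a star schema the brand columns would simply sit in product_dim.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Snowflake: the product dimension is split into a hierarchy of tables.
    CREATE TABLE brand_dim (
        brand_id INTEGER PRIMARY KEY,
        brand    TEXT,
        country  TEXT
    );
    CREATE TABLE product_dim (
        product_id INTEGER PRIMARY KEY,
        name       TEXT,
        brand_id   INTEGER REFERENCES brand_dim(brand_id)
    );
    CREATE TABLE sales_fact (
        product_id INTEGER REFERENCES product_dim(product_id),
        units      INTEGER
    );
    INSERT INTO brand_dim   VALUES (1, 'Acme', 'US'), (2, 'Zenith', 'UK');
    INSERT INTO product_dim VALUES (10, 'Cola', 1), (11, 'Tea', 2);
    INSERT INTO sales_fact  VALUES (10, 7), (11, 3), (10, 5);
""")

# Reaching the brand's country takes two joins from the fact table.
by_country = conn.execute("""
    SELECT b.country, SUM(f.units)
    FROM sales_fact AS f
    JOIN product_dim AS p ON p.product_id = f.product_id
    JOIN brand_dim   AS b ON b.brand_id = p.brand_id
    GROUP BY b.country
    ORDER BY b.country
""").fetchall()
```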
Building the Warehouse
• To build a warehouse the following
steps are often used
• Extraction
• Formatting
• Cleaning
• Fitting into the model
• Loading
Non Warehouse
• We see the same process in moving
data from one database to another
• The acronym is ETL
• Extract
• Transform
• Load
• The cleaning step is typically not needed
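A toy version of the three ETL steps, assuming the source system can export CSV text and the target is an SQLite table named stock. Both the data and the names are invented for this sketch.

```python
import csv
import io
import sqlite3

# Hypothetical export from a source database, as CSV text.
SOURCE_CSV = "sku,qty\n1001,2\n2002,5\n"

def extract(text):
    """Extract: read the source rows as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: rename fields and convert strings to integers."""
    return [(int(r["sku"]), int(r["qty"])) for r in rows]

def load(conn, rows):
    """Load: insert the converted rows into the target table."""
    with conn:
        conn.executemany(
            "INSERT INTO stock (product_id, units) VALUES (?, ?)", rows)

target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE stock (product_id INTEGER, units INTEGER)")
load(target, transform(extract(SOURCE_CSV)))
loaded = target.execute(
    "SELECT product_id, units FROM stock ORDER BY product_id").fetchall()
```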
Extraction
• Obtain data from one or more
sources
• Often, but not always, one or more
operational databases
• Any data stream of interest may also
be used
– Sensors at the Large Hadron Collider at
CERN generate about a terabyte per second
and store about 15 petabytes a year
– Financial market data
Formatting
• There are often multiple sources to
the data which means that we have a
variety of fields and meanings
• Mapping different data sources into
a common meaning and format
• Reconciling different date conventions,
such as the range of the fiscal year
• Making the data conform to the table
formats required so that every field
has the same meaning and units
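The mapping step might look like the sketch below, which assumes two invented sources: source A uses US-style dates and dollar prices, source B uses ISO dates and prices in cents. The field names on both sides are hypothetical. Each adapter produces the same target record.

```python
from datetime import datetime

def from_source_a(rec):
    """Source A (assumed): MM/DD/YYYY dates, price in dollars."""
    return {
        "sale_date": datetime.strptime(rec["date"], "%m/%d/%Y").date().isoformat(),
        "product_id": int(rec["sku"]),
        "price_cents": round(float(rec["price"]) * 100),
    }

def from_source_b(rec):
    """Source B (assumed): ISO dates, price already in cents."""
    return {
        "sale_date": rec["sold_on"],
        "product_id": int(rec["item_no"]),
        "price_cents": int(rec["cents"]),
    }

a = from_source_a({"date": "01/03/2011", "sku": "1001", "price": "2.50"})
b = from_source_b({"sold_on": "2011-01-03", "item_no": "1001", "cents": "250"})
print(a == b)  # True
```

Storing the price as integer cents is one way to give the field a single unit regardless of source.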
Cleaning
• The data must be checked for validity
before entering the warehouse
• This is the most labor-intensive portion of the build
• The size of the incoming data requires
an automated approach
• Each data source may require a
different approach
• Backflushing: returning cleaned data to
the original source so that it can update
its own tables
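An automated validity check could be sketched as a set of rules applied to every incoming record, with failures set aside for review or for backflushing to the source. The rules, field names, and function names below are assumptions for illustration.

```python
from datetime import date

def validate(record):
    """Return a list of problems found in one record; empty means valid."""
    problems = []
    if record.get("units") is None or record["units"] <= 0:
        problems.append("units must be positive")
    if not record.get("product_id"):
        problems.append("missing product_id")
    try:
        date.fromisoformat(str(record.get("sale_date", "")))
    except ValueError:
        problems.append("bad sale_date")
    return problems

def clean(records):
    """Split records into those that pass every rule and those that do not."""
    accepted, rejected = [], []
    for rec in records:
        problems = validate(rec)
        if problems:
            rejected.append((rec, problems))
        else:
            accepted.append(rec)
    return accepted, rejected

incoming = [
    {"product_id": 1001, "sale_date": "2011-01-03", "units": 2},
    {"product_id": 1001, "sale_date": "2011-13-40", "units": 2},
    {"product_id": None, "sale_date": "2011-01-03", "units": -1},
]
accepted, rejected = clean(incoming)
```

A different data source would typically get its own rule set, which is part of why this step is labor-intensive.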
Fitting
• Putting the data into a form suitable
for the data model of the warehouse
• Usually converted from the form of
the source database into the cube or
hypercube model of the warehouse
Loading
• Insert into the warehouse
• The ability to check that the load
completed properly is needed
• The ability to remove incomplete
loads and try again is also required
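One way to get both abilities is to run each load inside a single transaction and compare row counts before committing. The sketch below uses sqlite3; the table sales_fact and the function load_batch are invented for the example, and a real warehouse loader would likely use its own bulk-load tooling.

```python
import sqlite3

def load_batch(conn, rows):
    """Insert rows in one transaction; roll back unless every row arrived."""
    before = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.executemany(
                "INSERT INTO sales_fact (product_id, units) VALUES (?, ?)",
                rows)
            after = conn.execute(
                "SELECT COUNT(*) FROM sales_fact").fetchone()[0]
            if after - before != len(rows):
                raise RuntimeError("row count mismatch")
        return True
    except (sqlite3.Error, RuntimeError):
        return False

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_fact ("
    "product_id INTEGER NOT NULL, units INTEGER NOT NULL)")

ok = load_batch(conn, [(1, 10), (2, 4)])
bad = load_batch(conn, [(3, 6), (None, 1)])  # second row violates NOT NULL
total = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
```

After the failed batch the table still holds only the two rows from the first load, so the batch can be corrected and retried.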
Software
• Data warehouses may use
traditional RDBMS or so-called
NoSQL database software
• The multidimensional hypercube
format does not favor a normal
DBMS
• Once data is loaded it is retained
– SQL INSERT, DELETE, and UPDATE
statements are never used on data
after it is successfully loaded
Warehouse vs. DBMS
• Operational databases are crisp and
up to date
• Warehouses do not need the
transactional ACID of an operational
database
• Warehouses may also lag
operational databases by days to
weeks
Knowledge Workers
• Have a different skill set than many
others
• Business analyst
– Understands the business processes of
the organization
• Programming skills
– The organization of complicated
queries is often much more than simple
SQL
– Usually involves considerable
programmed search and aggregation
Finally
• Much contrast between an
operational database and a data
warehouse
• The warehouse is used to support
managerial decisions
– Usually at a much higher level than the
operational database
• There is another presentation on
NoSQL databases