Handout 12
CS-605 – Spring’17
Page 1 of 6
Handout 12
Data Warehousing and Analytics.
Operational (aka transactional) system – a system that is used to run a business in real
time, based on current data; also called a system of record
Informational (analytical) system – a system designed to support decision making based on
historical point-in-time and prediction data for complex queries or data-mining applications
Collect business operational data
Reduce it to a form that can be used to analyze the behavior of the business.
Not limited to databases, but often uses database technology.
Data warehouse (simple definition) – an archival database for decision support.
Operational databases
– Support day-to-day business operations
– Read/writeable: records may be inserted, updated, deleted
– Not as big as the ones used for decision support
Decision support databases
– Hold historical information integrated from multiple sources
– Primarily read-only; updating limited to
o Load
o Refresh
(i.e. inserts, some deletes, almost never updates)
– Include a temporal component
– Tend to be very large (especially when storing transaction data)
– Integrity not a big concern
– Usually designed in an ad hoc manner
– Often involve complex logical expressions in
– Require access to many kinds of facts/business objects, i.e. may require many
– Functionally complex: may involve complex statistical computations
– Analytically complex: rarely answered in one
Data Warehouse:
A subject-oriented, integrated, time-variant, non-updatable collection of data used in support of
management decision-making processes
– Subject-oriented: e.g. customers, patients, students, products
– Integrated: Consistent naming conventions, formats, encoding structures; from multiple and
heterogeneous organizational data sources
– Time-variant: Can study trends and changes
– Nonupdatable (nonvolatile): Read-only, periodically refreshed
Data Mart:
– A data warehouse that is limited in scope. Intended for use by a smaller, more specialized group
of people
Creating a Data Warehouse - ETL (Extract, Transform, Load)
Need to integrate uncoordinated and
inconsistent multiple databases in
Need to separate operational and
informational systems and data to improve
performance of data management
Static extract = capturing a snapshot
of the source data at a point in time
Incremental extract = capturing
changes that have occurred since the
last static extract
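The two extract styles can be sketched in Python with the standard-library sqlite3 module. This is a minimal illustration, not a prescribed method: the `orders` table, its `updated_at` column, and both function names are assumptions made for the example, and a timestamp filter like this one does not capture rows deleted from the source.

```python
import sqlite3

def static_extract(conn):
    """Static extract: a snapshot of every source row at this point in time."""
    return conn.execute(
        "SELECT id, amount, updated_at FROM orders ORDER BY id"
    ).fetchall()

def incremental_extract(conn, last_extract_time):
    """Incremental extract: only rows changed since the previous extract."""
    return conn.execute(
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY id",
        (last_extract_time,),
    ).fetchall()

# Hypothetical operational source with a last-modified timestamp per row.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)"
)
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "2017-01-05"), (2, 25.0, "2017-01-20")],
)

snapshot = static_extract(source)  # taken at the end of January

# Later, the operational system updates one row and inserts another.
source.execute(
    "UPDATE orders SET amount = 12.0, updated_at = '2017-02-03' WHERE id = 1"
)
source.execute("INSERT INTO orders VALUES (3, 40.0, '2017-02-10')")

# ISO-formatted dates compare correctly as strings.
changes = incremental_extract(source, "2017-01-31")
```

Here `snapshot` holds the two original rows, while `changes` holds only the updated row 1 and the new row 3.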
Data cleansing uses pattern recognition and AI techniques to upgrade data quality
Problems: misspellings, erroneous dates, incorrect field usage,
mismatched addresses, missing data, duplicate data, inconsistencies
Figure 9-1 (from MDM): Examples of heterogeneous data
Establishing standard abbreviations and identifiers, replacing synonyms.
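A small rule-based sketch of standardizing abbreviations, replacing synonyms, and dropping duplicate records is shown here in Python. It uses a fixed lookup table rather than the pattern recognition or AI techniques mentioned above; the `STANDARD_STATE` table, the `cleanse` function, and the sample records are all hypothetical.

```python
# Hypothetical lookup table mapping synonyms and variants to one standard code.
STANDARD_STATE = {
    "mass.": "MA", "massachusetts": "MA", "ma": "MA",
    "n.y.": "NY", "new york": "NY", "ny": "NY",
}

def cleanse(records):
    """Standardize names and state codes, then drop exact duplicates."""
    cleaned, seen = [], set()
    for name, state in records:
        # Collapse repeated whitespace and normalize capitalization.
        name = " ".join(name.split()).title()
        # Replace known synonyms; fall back to the upper-cased input.
        key = state.strip().lower()
        state = STANDARD_STATE.get(key, state.strip().upper())
        if (name, state) not in seen:
            seen.add((name, state))
            cleaned.append((name, state))
    return cleaned

raw = [("ann  lee", "Mass."), ("Ann Lee", "MA"), ("bob ray", "new york")]
result = cleanse(raw)
```

The first two raw records differ only in spacing, capitalization, and the state abbreviation, so after standardization they collapse into a single record.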
Transform and consolidate
– convert data from format of operational system to format of data warehouse
– split/combine source records
– synchronize time information, e.g. if customer-revenue data is stored by fiscal quarter
but customer-salesperson data is stored by calendar quarter, one can’t tell which
salesperson is responsible for what part of the customer revenue
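The mismatch between fiscal and calendar quarters can be made concrete with a short Python sketch. The fiscal year start month (February) is an assumption chosen purely for illustration, as are the function names.

```python
from datetime import date

FISCAL_YEAR_START_MONTH = 2  # assumption for illustration only

def calendar_quarter(d):
    """Quarter 1-4 of the calendar year (January to December)."""
    return (d.month - 1) // 3 + 1

def fiscal_quarter(d):
    """Quarter 1-4 of a fiscal year starting in FISCAL_YEAR_START_MONTH."""
    return ((d.month - FISCAL_YEAR_START_MONTH) % 12) // 3 + 1

jan_sale = date(2017, 1, 15)
feb_sale = date(2017, 2, 15)

same_calendar_quarter = calendar_quarter(jan_sale) == calendar_quarter(feb_sale)
same_fiscal_quarter = fiscal_quarter(jan_sale) == fiscal_quarter(feb_sale)
```

Under this assumption the two sales share a calendar quarter but fall in different fiscal quarters, so totals kept per fiscal quarter cannot be lined up with assignments kept per calendar quarter without going back to finer-grained dates.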
Place transformed data into the warehouse and create indexes
Move the data
Initial / Refresh mode: bulk rewriting of target data at periodic intervals
Check uniqueness constraints
Loading is a CPU-intensive process, especially if many indices are present – dropping the indices before the load and rebuilding them afterwards could help.
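A refresh-mode load that drops an index, rewrites the target table in bulk, and then rebuilds the index might look like the following Python sketch using sqlite3. The `sales` table, the index name, and the `refresh_load` function are hypothetical; whether dropping indices actually pays off depends on the DBMS and the data volume.

```python
import sqlite3

def refresh_load(conn, rows):
    """Refresh mode: bulk rewrite of the target table with fresh data."""
    # Drop the secondary index so it is not maintained row by row.
    conn.execute("DROP INDEX IF EXISTS idx_sales_product")
    conn.execute("DELETE FROM sales")
    # The PRIMARY KEY on sale_id still checks uniqueness during the load.
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    # Rebuild the index once, after all rows are in place.
    conn.execute("CREATE INDEX idx_sales_product ON sales (product_id)")
    conn.commit()

warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales "
    "(sale_id INTEGER PRIMARY KEY, product_id INTEGER, amount REAL)"
)

refresh_load(warehouse, [(1, 100, 9.5), (2, 100, 3.0), (3, 200, 7.25)])
# A later refresh replaces the previous contents entirely.
refresh_load(warehouse, [(1, 100, 9.5), (4, 300, 1.0)])

row_count = warehouse.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
index_names = [
    r[0]
    for r in warehouse.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'index' AND name = 'idx_sales_product'"
    )
]
```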
Several Common Data Warehouse Architectures
– Generic Two-Level Architecture
– Independent Data Mart
– Dependent Data Mart and Operational Data Store
– Logical Data Mart and @ctive Warehouse
Generic Two-Level Architecture
Operational Databases / One company-wide Warehouse
Benefit: single integrated view of organizational data
Problem: Periodic extraction => data is not completely current in warehouse
Independent Data Mart
Multiple Data marts - mini-warehouses, limited in scope
No single consolidated warehouse.
Benefits: easier to create than one integrated warehouse
Problems: redundancy, extra work in ETL for each data mart, potential lack of consistency,
complex querying across multiple data marts
users of individual marts must themselves provide an integrated view – this is
difficult and does not add up to having a single warehouse with well-defined known
Dependent Data Mart and Operational Data Store
Data loaded
– from Operational Data Store to single Data Warehouse
– from Data Warehouse to Data Marts
Benefits: single ETL – no redundancy
Logical Data Mart and @ctive Warehouse
Data marts are logical views of the warehouse.
Works well when data warehouse is not too large.
Used in e-commerce applications.
Problems: performance degrades with increasing size of the warehouse
Benefits: Data in marts always current, no redundancy in storage/ETL
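A logical data mart can be illustrated as a SQL view over the warehouse table. The following Python sketch uses sqlite3; the `sales` table, the `east_mart` view, and the sample rows are hypothetical.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)
warehouse.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "East", 10.0), (2, "West", 20.0)],
)

# The mart is only a view: no data is copied, so no separate ETL is needed.
warehouse.execute(
    "CREATE VIEW east_mart AS "
    "SELECT sale_id, amount FROM sales WHERE region = 'East'"
)

before = warehouse.execute("SELECT COUNT(*) FROM east_mart").fetchone()[0]

# A new warehouse row is visible through the mart immediately.
warehouse.execute("INSERT INTO sales VALUES (3, 'East', 5.0)")
after = warehouse.execute("SELECT COUNT(*) FROM east_mart").fetchone()[0]
```

Because every query against the view runs against the full warehouse table, this also hints at why performance degrades as the warehouse grows.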
Data Warehouse Structure
Dimension tables – (often de-normalized for performance reasons) describe major business subjects
+ Time Period.
Fact table – an associative entity of the dimensions. Contains factual and quantitative summary data.
Examples (From MDM)
Fact table provides statistics for sales broken
down by product, period and store dimensions
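A fact table with product, period, and store dimensions can be sketched in Python with sqlite3 as follows. The table names, column names, and sample rows are assumptions made for this example and are not taken from the MDM figure.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE product (product_key INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE period  (period_key  INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
CREATE TABLE store   (store_key   INTEGER PRIMARY KEY, city TEXT);

-- The fact table is an associative entity: its key combines the dimension keys.
CREATE TABLE sales_fact (
    product_key  INTEGER REFERENCES product (product_key),
    period_key   INTEGER REFERENCES period (period_key),
    store_key    INTEGER REFERENCES store (store_key),
    units_sold   INTEGER,
    dollars_sold REAL,
    PRIMARY KEY (product_key, period_key, store_key)
);

INSERT INTO product VALUES (1, 'Turkey'), (2, 'Stuffing');
INSERT INTO period  VALUES (1, 2016, 4), (2, 2017, 1);
INSERT INTO store   VALUES (1, 'Boston'), (2, 'Albany');
INSERT INTO sales_fact VALUES
    (1, 1, 1, 30, 450.0),
    (1, 1, 2, 20, 300.0),
    (1, 2, 1,  5,  75.0),
    (2, 1, 1, 12,  36.0);
""")

# Units sold per product and quarter, summed over all stores.
totals = db.execute("""
    SELECT p.description, t.year, t.quarter, SUM(f.units_sold)
    FROM sales_fact AS f
    JOIN product AS p ON p.product_key = f.product_key
    JOIN period  AS t ON t.period_key  = f.period_key
    GROUP BY p.description, t.year, t.quarter
    ORDER BY p.description, t.year, t.quarter
""").fetchall()
```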
Dimension table keys must be surrogate (non-intelligent and non-business related) for the
following reasons
– Object descriptions may change over time
e.g.: decided to change size of product with business number 20.
– Length/format consistency
Across multiple organizational databases, the same product may have
different identification numbers/primary keys
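One way to picture surrogate keys is a dimension that hands out its own meaningless integers and keeps the business number as an ordinary attribute. In this Python sketch the `ProductDimension` class and its `upsert` method are hypothetical names, and creating a new row on every description change is only one possible policy.

```python
from itertools import count

class ProductDimension:
    """Toy dimension table keyed by warehouse-generated surrogate keys."""

    def __init__(self):
        self._next_key = count(1)  # surrogate keys carry no business meaning
        self.rows = {}             # surrogate key -> (business number, size)
        self.current = {}          # business number -> latest surrogate key

    def upsert(self, business_number, size):
        key = self.current.get(business_number)
        if key is not None and self.rows[key][1] == size:
            return key  # description unchanged: reuse the existing row
        # New product, or its description changed: add a row under a new key
        # so that facts recorded earlier still point at the old description.
        key = next(self._next_key)
        self.rows[key] = (business_number, size)
        self.current[business_number] = key
        return key

dim = ProductDimension()
k1 = dim.upsert(20, "12 oz")
k2 = dim.upsert(20, "12 oz")  # same description, same key
k3 = dim.upsert(20, "16 oz")  # size of business number 20 was changed
```

Both versions of product 20 remain in `dim.rows`, which would not be possible if the business number itself were the primary key.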
Granularity of Fact Table – what level of detail do you want?
– Transactional grain – finest level – enter every transaction into warehouse
– Aggregated grain – more summarized – enter just summary data
– Finer grain => better analysis capability
more dimension tables => more rows in fact table
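The effect of grain on fact table size can be estimated with simple arithmetic: if every combination of dimension values could occur, the number of fact rows is bounded by the product of the dimension sizes. The figures in this Python sketch are invented for illustration, and `math.prod` requires Python 3.8 or later.

```python
from math import prod

def max_fact_rows(dimension_sizes):
    """Upper bound on fact rows: one row per combination of dimension values."""
    return prod(dimension_sizes.values())

# Hypothetical dimension sizes for one year of sales data.
daily_grain = max_fact_rows({"product": 1000, "store": 50, "day": 365})
monthly_grain = max_fact_rows({"product": 1000, "store": 50, "month": 12})
```

With these assumed sizes the daily grain allows up to 18,250,000 rows per year against 600,000 for the monthly grain, which is the price paid for the finer analysis capability.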
Modeling dates:
Data Mining
Knowledge discovery using a blend of statistical, AI, and computer graphics techniques
– Explain observed events or conditions
why sudden increase in turkey sales?
– Confirm hypotheses
do turkey sales increase in November?
do more students take Literature courses as sophomores than juniors?
– Explore data for new or unexpected relationships
what else are the customers that buy turkeys in November likely to buy?
which group of customers is likely to be interested in a product?
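The question of what else turkey buyers are likely to buy can be approximated by counting co-occurrences in purchase baskets. This Python sketch is plain counting rather than a full data mining algorithm; the `bought_with` function and the basket data are hypothetical.

```python
from collections import Counter

def bought_with(baskets, item):
    """Fraction of baskets containing `item` that also contain each other item."""
    counts = Counter()
    containing = 0
    for basket in baskets:
        if item in basket:
            containing += 1
            counts.update(set(basket) - {item})
    if containing == 0:
        return {}
    return {other: n / containing for other, n in counts.items()}

baskets = [
    {"turkey", "stuffing", "cranberries"},
    {"turkey", "stuffing"},
    {"milk", "bread"},
    {"turkey", "bread"},
]
likelihood = bought_with(baskets, "turkey")
```

In this toy data, stuffing appears in two of the three baskets that contain turkey, so it is the strongest candidate.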
Data visualization – representing data in graphical/multimedia formats for analysis. Often used in
conjunction with data mining. Helps identify trends and patterns.
Big Data
- evolving term
- usually refers to voluminous amount of structured, semi-structured and unstructured data
- can be mined for information
o Systematic analysis and interpretation of data—typically using mathematical,
statistical, and computational tools—to improve our understanding of a real-world
Big data characteristics
The Five Vs of Big Data
– Volume – much larger quantity of data than typical for relational databases
– Variety – lots of different data types and formats
– Velocity – data comes at very fast rate (e.g. mobile sensors, web click stream)
– Veracity – traditional data quality methods don’t apply; how to judge the data’s
accuracy and relevance?
– Value – big data is valuable to the bottom line, and for fostering good organizational
actions and decisions
- Schema on Read, rather than Schema on Write
Schema on Write – preexisting data model, how traditional databases are
designed (relational databases)
Schema on Read – data model determined later, depends on how you want to
use it
Capture and store the data, and worry about how you want to use it later
- Data Lake
o A large integrated repository for internal and external data that does not follow a
predefined schema
o Capture everything, dive in anywhere, flexible access
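Schema on read can be sketched in a few lines of Python: events are captured as raw JSON with no predefined model, and a structure is imposed only when a particular question is asked. The in-memory list standing in for a data lake, the function names, and the event fields are all assumptions for illustration.

```python
import json

raw_store = []  # stands in for a data lake: raw records, no predefined schema

def capture(event):
    """Capture everything as-is; nothing is validated or rejected on write."""
    raw_store.append(json.dumps(event))

def read_clicks():
    """Apply a 'click' schema at read time, skipping records that do not fit."""
    for line in raw_store:
        record = json.loads(line)
        if "url" in record:
            yield {"user": record.get("user", "unknown"), "url": record["url"]}

capture({"user": "u1", "url": "/home", "ts": 1})
capture({"sensor": "s9", "temp_c": 21.5})  # a completely different shape
capture({"url": "/cart"})                  # missing the user field

clicks = list(read_clicks())
```

All three events are stored, but only the two that fit the click schema are returned. A schema-on-write design would instead have to reject or reshape the sensor reading at insert time.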
NoSQL = Not Only SQL databases
• A category of recently introduced data storage and retrieval technologies not based on the
relational model
• Supports schema on read
• Largely open source
• BASE – basically available, soft state, eventually consistent