Handout 12
CS-605 – Spring’17
Data Warehousing and Analytics.
• Operational (aka transactional) system – a system that is used to run a business in real
  time, based on current data; also called a system of record
• Informational (analytical) system – a system designed to support decision making based on
  historical point-in-time and prediction data for complex queries or data-mining applications
  o Collect business operational data
  o Reduce it to a form that can be used to analyze the behavior of the business.
  o Not limited to databases, but often built on database technology.
Data warehouse (simple definition) – an archival database for decision support.
Operational Databases
• Support day-to-day business operations
• Read/writeable: records may be inserted, updated, deleted.
• Not as big as ones used for Decision Support

Decision Support Databases
• Hold historical information integrated from multiple sources
• Primarily read-only
• Updating limited to
  o Load
  o Refresh
  o (i.e. Inserts, some Deletes, almost never Updates)
• Include a temporal component.
• Tend to be very large (especially when storing transaction data)
• Integrity not a big concern
• Usually designed in ad hoc manner
• Queries
  o Often involve complex logical expressions in WHERE
  o Require access to many kinds of facts/business objects, i.e. may require many joins.
  o Functionally complex: may involve complex statistical computations
  o Analytically complex: rarely answered in one query.

Data Warehouse:
A subject-oriented, integrated, time-variant, non-updatable collection of data used in support of
management decision-making processes
– Subject-oriented: e.g. customers, patients, students, products
– Integrated: Consistent naming conventions, formats, encoding structures; from multiple and
heterogeneous organizational data sources
– Time-variant: Can study trends and changes
– Nonupdatable (nonvolatile): Read-only, periodically refreshed
Data Mart:
– A data warehouse that is limited in scope. Intended for use by a smaller, more specialized group
of people
Creating a Data Warehouse - ETL (Extract, Transform, Load )
• Need to integrate uncoordinated and inconsistent multiple databases in organizations.
• Need to separate operational and informational systems and data to improve performance of
  data management
Extract
• Static extract = capturing a snapshot of the source data at a point in time
• Incremental extract = capturing changes that have occurred since the last static extract
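The difference between the two kinds of extract can be sketched in a few lines of Python. This is only an illustration: it assumes every source row carries a last-modified timestamp (the `modified` field and the sample rows are hypothetical), which real source systems do not always provide, and it does not capture deleted rows.

```python
from datetime import datetime

# Hypothetical source table: each row carries a last-modified timestamp.
source_rows = [
    {"id": 1, "name": "Ann", "modified": datetime(2017, 1, 5)},
    {"id": 2, "name": "Bob", "modified": datetime(2017, 2, 10)},
    {"id": 3, "name": "Cy", "modified": datetime(2017, 3, 1)},
]

def static_extract(rows):
    """Snapshot of all source rows at this point in time."""
    return [dict(r) for r in rows]

def incremental_extract(rows, last_extract_time):
    """Only the rows changed after the previous extract."""
    return [dict(r) for r in rows if r["modified"] > last_extract_time]

snapshot = static_extract(source_rows)
changes = incremental_extract(source_rows, datetime(2017, 2, 1))
print(len(snapshot), [r["id"] for r in changes])
```

With the last extract taken on 2017-02-01, only rows 2 and 3 are picked up by the incremental extract.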
Scrub/Cleanse
• Uses pattern recognition and AI techniques to upgrade data quality
• Problems: misspellings, erroneous dates, incorrect field usage, mismatched addresses,
  missing data, duplicate data, inconsistencies
[Figure 9-1 from MDM: Examples of heterogeneous data]
• Establishing standard abbreviations and identifiers, replacing synonyms.
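A very small cleansing step might look like the sketch below. It uses plain lookup tables rather than the pattern recognition and AI techniques mentioned above; the tables, field names, and sample records are all hypothetical.

```python
# Hypothetical lookup tables; a real cleansing step would use much larger ones.
ABBREVIATIONS = {"st.": "Street", "st": "Street", "ave.": "Avenue", "ave": "Avenue"}
SYNONYMS = {"client": "customer", "patron": "customer"}

def clean_address(address):
    """Collapse extra whitespace and expand known abbreviations."""
    words = address.split()
    return " ".join(ABBREVIATIONS.get(w.lower(), w) for w in words)

def clean_role(role):
    """Replace synonyms with a single standard term."""
    role = role.strip().lower()
    return SYNONYMS.get(role, role)

def scrub(records):
    """Standardize each record, then drop exact duplicates."""
    seen, result = set(), []
    for rec in records:
        cleaned = (
            rec["name"].strip().title(),
            clean_address(rec["address"]),
            clean_role(rec["role"]),
        )
        if cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result

raw = [
    {"name": "ann lee ", "address": "12  Main St.", "role": "Client"},
    {"name": "Ann Lee", "address": "12 Main Street", "role": "customer"},
    {"name": "Bob Ray", "address": "4 Oak ave", "role": "patron"},
]
cleaned = scrub(raw)
```

The first two raw records differ only in spelling conventions, so after standardization they collapse into one row.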
Transform and consolidate
• Convert data from format of operational system to format of data warehouse
• Split/combine source records
• Synchronize time information:
  e.g. customer - revenue data stored by fiscal quarter,
  customer - salesperson data stored by calendar quarter:
  can’t tell which salesperson is responsible for what part of the customer revenue
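The time mismatch can be made concrete with a short sketch. It assumes, purely for illustration, a fiscal year that starts in February; the handout does not say when the fiscal year begins.

```python
from datetime import date

FISCAL_YEAR_START_MONTH = 2  # assumption: fiscal year starts in February

def calendar_quarter(d):
    """Calendar quarter (1-4) of a date."""
    return (d.month - 1) // 3 + 1

def fiscal_quarter(d):
    """Fiscal quarter (1-4), counting months from the fiscal year start."""
    offset = (d.month - FISCAL_YEAR_START_MONTH) % 12
    return offset // 3 + 1

for d in (date(2017, 3, 15), date(2017, 4, 15)):
    print(d, "calendar Q", calendar_quarter(d), "fiscal Q", fiscal_quarter(d))
```

Under this assumption both dates fall in the same fiscal quarter but in different calendar quarters, so revenue summarized by fiscal quarter cannot be split between salespeople assigned by calendar quarter unless the data is brought to a common time grain first.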
Load/Index
Place transformed data into the warehouse and create indexes
• Move the data
• Initial / Refresh mode: bulk rewriting of target data at periodic intervals
• Check uniqueness constraints
• CPU-intensive process, especially if many indexes are present – dropping the indexes before
  the load and rebuilding them afterwards could help.
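The drop-and-rebuild idea can be sketched with Python's built-in `sqlite3` module. The table, index name, and generated rows are hypothetical, and whether this actually speeds up a load depends on the database system and the data volume.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, product TEXT, amount REAL)"
)
conn.execute("CREATE INDEX idx_sales_product ON sales (product)")

# Hypothetical batch of transformed rows to load.
rows = [(i, f"product-{i % 10}", float(i)) for i in range(1, 1001)]

# Drop the secondary index so the bulk insert does not maintain it row by row.
conn.execute("DROP INDEX idx_sales_product")
with conn:  # one transaction for the whole load
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
# Rebuild the index once, after the data is in place.
conn.execute("CREATE INDEX idx_sales_product ON sales (product)")

loaded = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
index_names = [r[1] for r in conn.execute("PRAGMA index_list('sales')")]
```

The primary key on `sale_id` stays in place during the load, so the uniqueness check still happens while the rows are inserted.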
Several Common Data Warehouse Architectures
• Generic Two-Level Architecture
• Independent Data Mart
• Dependent Data Mart and Operational Data Store
• Logical Data Mart and @ctive Warehouse
Generic Two-Level Architecture
Operational Databases / One company-wide Warehouse
Benefit: single integrated view of organizational data
Problem: Periodic extraction => data is not completely current in warehouse
Independent Data Mart
Multiple Data marts - mini-warehouses, limited in scope
No single consolidated warehouse.
Benefits: easier to create than one integrated warehouse
Problems:
redundancy, extra work in ETL for each data mart, potential lack of consistency,
complex querying across multiple data marts
users of individual marts must themselves provide an integrated view – this is
difficult and does not add up to having a single warehouse with well-defined known
structure.
Dependent Data Mart and Operational Data Store
Data loaded
– from Operational Data Store to single Data Warehouse
– from Data Warehouse to Data Marts
Benefits: single ETL – no redundancy
Logical Data Mart and @ctive Warehouse
Data marts are logical views of the warehouse.
Works well when data warehouse is not too large.
Used in e-commerce applications.
Problems: performance degrades with increasing size of the warehouse
Benefits: Data in marts always current, no redundancy in storage/ETL
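A logical data mart can be illustrated as an ordinary SQL view over a warehouse table. The sketch below uses Python's built-in `sqlite3` module; the table, view, and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE warehouse_sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?)",
    [("East", "turkey", 100.0), ("West", "turkey", 80.0), ("East", "ham", 50.0)],
)

# The "mart" is only a view: no data is copied out of the warehouse.
conn.execute(
    "CREATE VIEW east_mart AS "
    "SELECT product, amount FROM warehouse_sales WHERE region = 'East'"
)

before = conn.execute("SELECT SUM(amount) FROM east_mart").fetchone()[0]
conn.execute("INSERT INTO warehouse_sales VALUES ('East', 'pie', 25.0)")
after = conn.execute("SELECT SUM(amount) FROM east_mart").fetchone()[0]
print(before, after)
```

The new warehouse row is visible through the view immediately, which is why data in logical marts is always current. Every query against the view is still a query against the full warehouse table, which is where the performance problem comes from as the warehouse grows.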
Data Warehouse Structure
Star-schema:
Dimension tables – (often de-normalized for performance reasons) describe major business subjects
+ Time Period.
Fact table – an associative entity of the dimensions. Contains factual and quantitative summary data.
Examples (From MDM)
Fact table provides statistics for sales broken
down by product, period and store dimensions
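A minimal star schema along these lines can be sketched with Python's built-in `sqlite3` module. The product, period, and store dimensions come from the example above; every table name, column name, and data value is an assumption made for illustration and is not taken from MDM.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_key INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE period  (period_key  INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
CREATE TABLE store   (store_key   INTEGER PRIMARY KEY, city TEXT);
-- Fact table: one row per product/period/store combination.
CREATE TABLE sales (
    product_key  INTEGER REFERENCES product(product_key),
    period_key   INTEGER REFERENCES period(period_key),
    store_key    INTEGER REFERENCES store(store_key),
    units_sold   INTEGER,
    dollars_sold REAL,
    PRIMARY KEY (product_key, period_key, store_key)
);
INSERT INTO product VALUES (1, 'Turkey'), (2, 'Ham');
INSERT INTO period  VALUES (1, 2016, 4), (2, 2017, 1);
INSERT INTO store   VALUES (1, 'Boston'), (2, 'Albany');
INSERT INTO sales VALUES
    (1, 1, 1, 40, 800.0), (1, 1, 2, 25, 500.0),
    (1, 2, 1, 5, 100.0),  (2, 1, 1, 10, 150.0);
""")

# Typical star query: join the fact table to the dimensions it needs,
# then group by dimension attributes.
query = """
SELECT p.description, t.year, t.quarter, SUM(f.dollars_sold)
FROM sales f
JOIN product p ON p.product_key = f.product_key
JOIN period  t ON t.period_key  = f.period_key
GROUP BY p.description, t.year, t.quarter
ORDER BY p.description, t.year, t.quarter
"""
totals = conn.execute(query).fetchall()
```

The fact table holds only keys and numeric measures; all descriptive attributes live in the dimension tables.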
Issues:
• Dimension table keys must be surrogate (non-intelligent and non-business related) for the
  following reasons
  – Object descriptions may change over time
    e.g.: decided to change size of product with business number 20.
  – Length/format consistency
    Across multiple organizational databases, the same product may have
    different identification numbers/primary keys
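The first reason can be sketched as follows. This is only one possible way to handle a changed description (adding a new dimension row under a new surrogate key); the product descriptions and amounts are hypothetical.

```python
import itertools

next_key = itertools.count(1)
product_dim = {}  # surrogate key -> (business number, description)

def add_version(business_number, description):
    """Add a new dimension row and return its surrogate key."""
    key = next(next_key)
    product_dim[key] = (business_number, description)
    return key

old_key = add_version(20, "Widget, 10 oz")
# The size of product 20 changes: add a new row instead of overwriting.
new_key = add_version(20, "Widget, 12 oz")

# Fact rows reference surrogate keys, so earlier sales keep the old description.
facts = [(old_key, 100.0), (new_key, 120.0)]
report = [(product_dim[k][1], amount) for k, amount in facts]
```

If the business number 20 were used as the key, the dimension could hold only one description for it, and older facts would silently take on the new size.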

• Granularity of Fact Table – what level of detail do you want?
  – Transactional grain – finest level – enter every transaction into warehouse
  – Aggregated grain – more summarized – enter just summary data
  – Finer grain => better analysis capability
  – More dimension tables => more rows in fact table
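The effect of grain on fact table size can be seen by rolling the same transactions up to different sets of dimensions. The transactions below are hypothetical.

```python
from collections import defaultdict

# Transactional grain: one fact row per sale (hypothetical data).
transactions = [
    {"product": "turkey", "store": "Boston", "month": "2016-11", "amount": 30.0},
    {"product": "turkey", "store": "Boston", "month": "2016-11", "amount": 45.0},
    {"product": "turkey", "store": "Albany", "month": "2016-11", "amount": 20.0},
    {"product": "ham", "store": "Boston", "month": "2016-12", "amount": 15.0},
]

def aggregate(rows, dimensions):
    """Roll fact rows up to one row per combination of the given dimensions."""
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[d] for d in dimensions)] += row["amount"]
    return dict(totals)

by_product_month = aggregate(transactions, ["product", "month"])
by_product_store_month = aggregate(transactions, ["product", "store", "month"])
print(len(transactions), len(by_product_store_month), len(by_product_month))
```

Four transactions become three rows at the product/store/month grain and two rows at the product/month grain. The coarser table is smaller, but it can no longer answer questions about individual stores.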
Modeling dates:
Technologies
Data Mining
Knowledge discovery using a blend of statistical, AI, and computer graphics techniques
– Explain observed events or conditions
  • why sudden increase in turkey sales?
– Confirm hypotheses
  • do turkey sales increase in November?
  • do more students take Literature courses as sophomores than juniors?
– Explore data for new or unexpected relationships
  • what else are the customers that buy turkeys in November likely to buy?
  • which group of customers is likely to be interested in a product?
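The "what else do turkey buyers buy" question can be approached, in its simplest form, by counting co-occurrences in purchase baskets. The sketch below is a plain count over hypothetical data, not a full association-rule algorithm.

```python
from collections import Counter

# Hypothetical November baskets; each set is one customer's purchases.
baskets = [
    {"turkey", "cranberries", "stuffing"},
    {"turkey", "stuffing"},
    {"ham", "cranberries"},
    {"turkey", "cranberries", "pie"},
]

def bought_with(item, baskets):
    """Count how often each other item appears in baskets containing `item`."""
    counts = Counter()
    for basket in baskets:
        if item in basket:
            counts.update(basket - {item})
    return counts

with_turkey = bought_with("turkey", baskets)
```

Here cranberries and stuffing each appear in two of the three turkey baskets, and pie in one. A real analysis would also compare these counts with how often each item is bought overall.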
Data visualization – representing data in graphical/multimedia formats for analysis. Often used in
conjunction with data mining. Helps identify trends and patterns.
Big Data
- evolving term
- usually refers to voluminous amount of structured, semi-structured and unstructured data
- can be mined for information
Analytics
o Systematic analysis and interpretation of data—typically using mathematical,
statistical, and computational tools—to improve our understanding of a real-world
domain.
Big data characteristics
• The Five Vs of Big Data
– Volume – much larger quantity of data than typical for relational databases
– Variety – lots of different data types and formats
– Velocity – data comes at very fast rate (e.g. mobile sensors, web click stream)
– Veracity – traditional data quality methods don’t apply; how to judge the data’s
accuracy and relevance?
– Value – big data is valuable to the bottom line, and for fostering good organizational
actions and decisions
- Schema on Read, rather than Schema on Write
  o Schema on Write – preexisting data model, how traditional databases are
    designed (relational databases)
  o Schema on Read – data model determined later, depends on how you want to
    use it
  o Capture and store the data, and worry about how you want to use it later
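Schema on read can be sketched in a few lines: raw records are stored untouched, and a structure is imposed only when a particular use requires one. The event format and field names below are hypothetical.

```python
import json

# Capture step: raw events are stored as-is, with no schema enforced.
raw_store = [
    '{"user": "u1", "page": "/home", "ms": 120}',
    '{"user": "u2", "device": "phone", "lat": 42.3}',
    '{"user": "u1", "page": "/cart"}',
]

def read_page_views(lines):
    """Apply a 'page view' schema at read time; skip records that do not fit."""
    for line in lines:
        record = json.loads(line)
        if "page" in record:
            yield {
                "user": record["user"],
                "page": record["page"],
                "ms": record.get("ms"),  # missing fields become None
            }

page_views = list(read_page_views(raw_store))
```

Under schema on write, the second record would have been rejected or reshaped when it was stored. Here it stays in the store and remains available to a different reader, for example one interested in device locations.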
- Data Lake
o A large integrated repository for internal and external data that does not follow a
predefined schema
o Capture everything, dive in anywhere, flexible access
NoSQL = Not Only SQL databases
• A category of recently introduced data storage and retrieval technologies not based on the
relational model
• Supports schema on read
• Largely open source
• BASE – basically available, soft state, eventually consistent