Download Introduction to Data Warehouse and Data Mining MIS 2502 Data Analytics 1

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
Introduction to Data Warehouse and Data Mining
MIS 2502
Data Analytics
1
Importance of data
What organizations do with data?
 Transaction Processing
– (E)commerce: Amazon.com; PNC Bank
– B2B systems: Supply Chain Management
– Web Search

Decision Making
–
–
–
–
–
–
–
Financial reporting
Inventory management
Budget allocation
Customer Relationship Management
Target Marketing
Product Design and Promotions
Fraud Detection
2
Unnormalized data set
Patient
ID
Name
Address
DOB
Doc
Appt
Date
Location
DX
111111
Cindy
Marselis
2320 Edge
Hill Road
1/11/64
Armstrong
9/1/09
11:00 AM
Alter 2011
Herniated
Disc
Flu
111111
Cindy
Marselis
9331
Rising Sun
Avenue
1/11/64
Morningstar
9/1/09
11:00 AM
Alter 2011
Herniated
Disc
111111
Cindy
Marselis
2320 Edge
Hill Road
1/11/64
Allen
11/1/09
10:00 AM
Alter 2012
Psoriasis
222222
Kathryn
Marselis
2320 Edge
Hill Road
11/3/04
Dershaw
8/1/09
11:00 AM
Speakman
105
Well baby
check
111111
Cindy
Schwartz
9331
Rising Sun
Avenue
1/11/64
Armstrong
8/11/09
3:00 PM
Alter 105
Psoriasis
Herniated
Disc
3
Normalized db - before
4
Normalized db - after
5
Decision Making with Databases
 Databases
are used for transaction
processing
 Data from transaction processing is used for
tactical decision making
– Database provides basic reporting function

But…
6
The Need for Data Analysis





Different managers require different data and data may
come from other part of the organization or outside the
organization
External and internal forces require tactical and
strategic decisions
Search for competitive advantage
Business environments are dynamic
Decision-making cycle time is reduced
7
Some Questions Analysts Need to Answers
Sales analysis:
What are the sales by quarter and geography?
How do sales compare in two different stores in the
same state?
Profitability analysis:
Which is the most profitable store in the state CA?
Which product lines are the highest revenue
producers this year?
Which products and product lines are the most
profitable this quarter?
Sale force analysis
Which salesperson is the best revenue producer this
year?
Do salesperson X meet his sale target this quarter?
8
From transaction processing to supporting
decision making
9
Operational vs. Decision Support Data



Operational data
– Relational, normalized database
– Optimized to support transactions
– Real time updates
DSS
– Snapshot of operational data
– Summarized
– Large amounts of data
Data analyst viewpoint
– Timespan
– Granularity
– Dimensionality
10
Data Warehouse




Integrated
– Centralized
– Holds data retrieved from entire organization
Subject-Oriented
– Optimized to give answers to diverse questions
– Used by all functional areas
Time Variant
– Flow of data through time
– Projected data
Non-Volatile
– Data never removed
– Always growing
11
Data Warehouse


Extraction, transformation, and loading (ETL) – a process
that extracts information from internal and external databases,
transforms the information using a common set of enterprise
definitions, and loads the information into a data warehouse
Data mart – contains a subset of data warehouse information
12
ETL – Extraction, Transformation,
Load
Extract
data from
source
systems
Transform:
cleanse data for consistency and
output exceptions
o Apply business rules
o Selecting certain columns to load
(not null records)
o Translating coded values (1, M, male
=0)
o Derive new calculated value
(sale_amount = qty * unit_price)
o Join data from multiple sources
(lookup, merge)
o Aggregate (rollup/summarize data)
o Transpose/pivot (turning columns
into rows)
o Data validation.
Load:
data into
repository
13
Data in a Data Warehouse
 Data
for a data
warehouse is obtained
from a variety of
databases
–
E.g. customer
database, transaction
database, accounts
database
 Data
in data
warehouse is
multidimensional
14
Multidimensional Analysis

Cube – common term for the representation of
multidimensional information
15
Star Schema




Data-modeling technique
Maps multidimensional decision support into relational
database
Yield model for multidimensional data analysis while
preserving relational structure of operational DB
Four Components:
– Facts
– Dimensions
– Attributes
– Attribute hierarchies
16
Simple Star Schema
Figure 13.12
17
Slice and Dice View of Sales
Figure 13.14
18
Star Schema Representation




Facts and dimensions represented by physical tables
in data warehouse DB
Fact table related to each dimension table (M:1)
Fact and dimension tables related by foreign keys
Subject to the primary/foreign key constraints
19
Star Schema for Sales
Sales fact table and its four dimensions: location, time,
product, and customer .
Allows sales to be aggregated by time, geographic location,
product, and by customer.
20
Data Warehouse to Data Marts
 Given
the large
size of a data
warehouse,
organizations
create data marts
–
–
–
Subject oriented
data
Subset of data in
a data warehouse
Used for focused
decision-making
21
Online Analytical Processing (OLAP)



Advanced data analysis environment
Supports decision making, business modeling, and
operations research activities
Characteristics of OLAP
–
–
–
–
Use multidimensional data analysis techniques
Provide advanced database support
Provide easy-to-use end-user interfaces
Support client/server architecture
22
OLAP Client/Server Architecture
Figure 13.6
23
Data Mining
 Seeks
to discover patterns or relationships
within the data
 Data mining tools automatically search data for
patterns and relationships
 Data mining tools
–
–
–
–
–
Analyze data
Uncover problems or opportunities
Form computer models based on findings
Predict business behavior with models
Require minimal end-user intervention
24
What Are Data-Mining Tools?
25
Data Mining Process
26
Business Intelligence
AB113 - Information Technology
27
MS SQL 2008 Architecture
Relational Model
Dimensional Model/Star Schema
28
Back Room—Data prepared from many sources
Front Room—Information presented
29
Multidimensional Analysis
and Data Mining

Differences between databases and data warehouse/data
mart?

Data mining – the process of analyzing data to extract
information not offered by the raw data alone
– To perform data mining users need data-mining tools
» Data-mining tool – uses a variety of techniques to find patterns and
relationships in large volumes of information and infers rules that predict
future behavior and guide decision making

Business intelligence – taking data from multiple sources
and turn it into useful and easy to understand information to
support decision-making efforts for various kinds of people.
30