Download PPT

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
TEACHING THE DATA
WAREHOUSE COURSE
Paul Gray
ISECON 2001
1
Outline



Overview of what is Data Warehousing
The 5 and 10 week courses
The Indiana University of Pennsylvania
course
2
ORIGINS

Data warehouses are the results of two
software solutions needing and finding one
another:



Data base firms developed data warehouses
and were looking for applications
EIS and DSS software developers and vendors
needed to deal with ever-increasing data
bases
About 10 years ago, the two groups
started interacting with the results
described here.
3
ORIGINS


Database developers long understood that
their software was required for both
transactional and analytic processing
However, their principal developments were
directed to ever-larger transactional data
bases. This process occurred even through
operational and analytic data are separate
with different requirements and different user
communities.
4
ORIGINS


Once these differences were understood, new
data bases were created specifically for
analysis use.
Today, data warehouses have 3 major
applications



On-line analytic processing for business
intelligence
Data Mining
Customer Relationship Management
5
WHAT IS DATA
WAREHOUSING




A data warehouse is typically a dedicated data base
system for decision making that is separate from the
production data base(s) used operationally. It differs
from production system in that:
it covers a much longer time horizon than transaction
systems
it includes multiple data bases that have been
processed so that the warehouse’s data are defined
uniformly (i.e., ‘clean’ data)
it is optimized for answering complex queries from
managers and analysts.
6
WHAT IS DATA
WAREHOUSING?


In the last 5 years, data warehousing has become a
major industry within computing which has brought
together the ideas of data bases and decision
support. It has also been the foundation for efforts in
data mining and in CRM
Data mining refers to finding answers about an
organization from the information in the data
warehouse that the executive or the analyst had not
thought to ask. Data mining is made possible by the
very presence of large databases in the data
warehouse. It provides techniques that allow
managers to obtain managerial information from their
legacy systems. Its objective is to identify valid,
novel, potentially useful, and understandable patterns
in data.
7
WHAT IS DATA
WAREHOUSING?


The objective of a data warehouse is to
create a “single truth”
Data warehousing is a major new
application area. It rates extremely high
salaries (up to $100,000 for specialists,
$300,000 for consultants).
8
DEFINITION

A data warehouse is a:
Subject oriented
 Integrated
 Time-variant
 Non-volatile
Collection of data in support of management
decision processes

9
NOTE:


Data warehouse is physically separated
from operational systems and
operational data bases
Data warehouses hold both aggregated
and detailed data for management
separate from the databases used for
On-Line Transaction Processing (OLTP)
10
CHARACTERISTICS
Subject oriented Data are organized by how
users refer to it
Integrated
Non-volatile
Time series
Inconsistencies are removed
in both nomenclature and
conflicting information; (i.e.
data are ‘clean’)
Read-only data. Data do not
change over time.
Data are time series, not
current status
11
CHARACTERISTICS
Summarized
Non normalized
Operational data are mapped
into decision usable form
Time series implies much
more data is retained
Data can be redundant
Metadata
=Data about data
Input
Unintegrated, operational environment (‘legacy systems’)
Larger
12
SUBJECT ORIENTATION


Data is organized around major
subjects of the enterprise
Example:
OPERATIONAL
•Loans
•Savings
•Bank card
•Trust
An application
orientation
DATA WAREHOUSE
Customer
Vendor
Product
Activity
A subject
orientation
13
USING THE WAREHOUSE



The higher the level of summarization,
the more the data is used
The more summarized the data, the
quicker it is to retrieve
However, the higher the level of
summarization, the lower the level of
detail
14
DATA MARTS




Data Mart: A scaled-down version of the
data warehouse
A data mart is a small warehouse designed
for the SBU or department level.
It is often a way to gain entry and provide
an opportunity to learn
Major problem: if they differ from
department to department, they can be
difficult to integrate enterprise-wide
15
COST




Data warehouses are not cheap
Median cost to create (does not include
operating cost) = $2.2M
Multimillion dollar costs are common
Their design and implementation is still an
art and they require considerable time to
create
16
SIZE


Being designed for the enterprise so
that everyone has a common data set,
they are large and increase in size with
time.
Typical storage sizes run from 50
Gigabytes to several Terabytes
17
SIZE OF INDUSTRY



Data warehouses are a major industry
within information systems. 6B$/year
Estimates vary but it is clear that many
more than 90% of Fortune 1000 have data
warehouse projects
Major players include:


Oracle IBM+Informix Sybase NCR
+ BI companies (Brio, Cognos, Pilot,….)
18
MARKET COMPONENTS






High end business intelligence (OLAP)
Low end query tools
Data cleansing
Data marts
Data mining
Customer relationship management
(CRM)
19
APPLICATION – OLAP/BI



OLAP = On Line analytic processing
Basic idea of OLAP: managers should be
able to manipulate enterprise data models
across many dimensions to understand
changes that are occurring
Vendors claim they are OLAP compliant even
if they are not
20
APPLICATION - DATA MINING


Also known as Knowledge Data
Discovery (KDD)
Mining terminology refers to finding
answers about a business from the data
warehouse that the executive or analyst
had not thought to ask
21
APPLICATION - DATA MINING



KDD applies techniques mostly from
artificial intelligence and statistics to
discover new information.
It is designed to find information that
queries and reports don’t reveal
effectively
KDD uses AI and statistics to find
pattern in data and to infer rules.
22
APPLICATION - DATA MINING

Some successes:





People who buy scuba gear take Australian
vacations
Fraud detection, consumer loan analysis
Optimizing production lines
IBM’s SCOUT
(apocryphal?) Men who buy diapers buy beer
23
APPLICATION - DATA MINING
Associations
Sequences
Classifications
Clusters
Forecasting
Things done together
(buy groceries)
Events over time
(house.refrigerator)
Pattern recognition
(rules)
Define new groups
Predictions from time
series
24
CUSTOMER RELATIONSHIP
MANAGEMENT (CRM)




Successor to data base marketing
Implies marketing to customers on a
1:1 basis
Requires data granularity at the level of
the individual customer
large
amounts of data
Data warehouse is only a part of the
CRM concept.
25
History of Course I




First given as PhD Seminar on
Management of Information Systems
1996
Overview lecture by instructor
Students present topics each week –
typically 2 or 3
Great source of material!
26
History of Course II



Attend Data Warehouse Institute
conference in San Diego. Agree to
write book with Watson. January 1997
Complete book in late 1997. Desk-top
published. Used sabbatical at UCI to do
the writing.
Move course to regular MS course in
Spring 1997.
27
Two Versions:




14 week for Claremont Graduate
University
5 Week for UC Irvine
Taught 1/year for last 5 years
WHY DO STUDENTS TAKE COURSE?

High salaries (100K specialist, $300K for
consultant)
28
Both Versions




Text: Decision Support in the Data
Warehouse by P. Gray and H.J. Watson
Prentice Hall 1998
PLUS readings
PLUS hands-on exercises
PLUS term paper
29
5 WEEK VERSION
Wk Part 1
1
Overview of DW
2
Demo Startracker
software
3
Data sources,
cleansing,metadata
4
Data marts,ODS
5
Mining, BI
Part 2
OLAP, applications
Strategic use,
Framework
Planning and
operating the DW
Industry, economics
CRM, Developments
30
Date
14-WEEK COURSE
9-9
Introduction to DW
9-16
Architecture, Data Input
9-23
Metadata, Multiple Dimensions
9-30
Interface, Strategic Uses
10-7
Design Methods; Data Marts
10-14
Progress Report; ODS; Maintaining the DW
10-21
Midterm
10-28
Economics of DW; Personnel
11-7
Querying, Data Mining
11-14
Progress Report; CRM
11-21
Business Intelligence ; Future Developments
11-28
Student Reports
12-2
Student Reports
12-9
Final Exam
31
Using Software—Star Tracker



Simple DW
The Data Warehouse Toolkit : Practical
Techniques for Building Dimensional
Data Warehouses by Ralph Kimball
(Wiley 1996)
Commercial version “Synchrony” from
If… no longer available
32
Using Software – Commercial



Red Brick (now part of Informix which is
now owned by IBM)
Gift from Red Brick
Mostly used with student projects
33
Data Sources


Journal of Data Warehousing
DM Review



On-line (dmreview.com)
Hard copy
Lots and lots on Internet e.g.,



www.Dw-institute.com
Pwp.starnetinc.com/larryg
www.datawarehouse.org
34
Assignments

Find 10 articles on Internet



teaches search techniques, nomenclature,
rapid course immersion
Exercise with software
The DW industry
35
Term Projects for DW

Next chart shows 22 of the 30
suggested topics. Last topic allows
student to choose topic subject to
instructor approval.
36
DW Architecture
Data sources, loading, cleaning, summarizing, granularity
Metadata in a DW
Data Modeling and the DW
Multiple Dimensions in DWs via ROLAP and MOLAP
Indexing in DWs
DW Interfaces –conventional, window, browser
Design principles for DWs
Strategic use of the DW
Justification of the DW including C/B analysis
DW Security
Data Marts
Operational Data Stores
Maintaining the DW
Economics of DWs
Querying and reporting in a DW
OLAP and the DW
Decision Support Systems and the DW
Database marketing and the DW
Data Mining and the DW
Personnel considerations for building the DW and for maintaining the DW
Organizing the DW Project (Project management etc.)
37
Elizabeth Pierce’s Course at
Indiana University of Penn.



Paper published in Communications of
AIS (CAIS) Sept. 1999 (Vol. 2 Article
16)
Developing and Delivering a Data
Warehousing and Mining Course
“Introduces students to the strategies,
technologies, and techniques”
38
Elizabeth Pierce’s Course at
Indiana University of Penn.

“Students learn what is involved in planning,
designing, building, using, and managing a
data warehouse. Students also learn about
how a data warehouse must fit into an overall corporate data architecture that may
include legacy systems, operational data
stores, enterprise data warehouses, and data
marts. In addition, students are exposed to
the different data mining techniques used by
organizations to derive information from the
data warehouse for strategic and long-term
business decision making.” (abstract)
39
U. Of Indiana Course


14 weeks –7 weeks on DW
Challenges:



Evolving nature of the subject
Hands on experience for students
Textbook selection—mostly tradebooks
40
Useful Tables in Pierce




List of Internet sources
Available software demos
Some on-line data warehouses with
public access
Schools that offered data mining
courses in 1999
41
Conclusions




Students come because it is a job source
A good 2nd course for data base (can be
taken w/o database course)
Allows covering the new topics of the
90’s and 00’s—OLAP,CRM,Mining, in more
depth than Intro to IS course.
It’s a fun topic and you can even do
research on it.
42