Chapter 1: Introduction to Data Mining, Warehousing, and Visualization
Modern Data Warehousing, Mining, and Visualization: Core Concepts
by George M. Marakas
Spring 2012
Modern Data Warehousing, Mining & Visualization, 2003, George Marakas
Objectives
- What is the purpose and motivation for developing a Data Warehouse (DW)?
- Position of DW within IT infrastructure
- Relationship between DW and business data mart
- What can a DW do?
- Foundations for Data Mining
- Steps in a typical data mining project
- What is a “Correlation”? KEY CONCEPT
- History of Data Visualization vis-à-vis DW
1-1: The Modern Data Warehouse
- A data warehouse is a copy of transaction data specifically structured for querying, analysis and reporting.
- Note that the data warehouse contains a copy of the transactions. These are not updated or changed later by the transaction system.
- Also note that this data is specially structured, and may have been transformed when it was placed in the warehouse.
1-2: Data Warehouse Roles and Structures
The DW has the following primary functions:
- It is a direct reflection of the business rules of the enterprise.
- It is the collection point for strategic information.
- It is the historical store of strategic information.
- It is the source of information later delivered to data marts.
- It is the source of stable data regardless of how the business processes may change.
Elements of a DW
- Extract
- Transform
- Store
[ETS]
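The three elements above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the textbook's method: the sample transaction rows, the table name `sales_fact`, and the function names are all hypothetical, and Python's built-in `sqlite3` module stands in for a real warehouse DBMS.

```python
import sqlite3

# Hypothetical transaction records, standing in for an operational system.
TRANSACTIONS = [
    {"id": 1, "region": "east ", "amount": "120.50"},
    {"id": 2, "region": "West", "amount": "80.00"},
    {"id": 3, "region": "EAST", "amount": "42.25"},
]

def extract(source):
    """Extract: take a copy of the transaction rows, leaving the source untouched."""
    return [dict(row) for row in source]

def transform(rows):
    """Transform: standardize region names and convert amounts to numbers."""
    return [
        (row["id"], row["region"].strip().title(), float(row["amount"]))
        for row in rows
    ]

def store(conn, rows):
    """Store: load the cleaned rows into a table structured for querying."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_fact "
        "(id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", rows)
    conn.commit()

def sales_by_region(conn):
    """Example analytical query against the warehouse copy."""
    cur = conn.execute(
        "SELECT region, SUM(amount) FROM sales_fact "
        "GROUP BY region ORDER BY region"
    )
    return cur.fetchall()

warehouse = sqlite3.connect(":memory:")  # in-memory database for the sketch
store(warehouse, transform(extract(TRANSACTIONS)))
print(sales_by_region(warehouse))
```

The source rows keep their inconsistent region spellings; only the warehouse copy is cleaned, which mirrors the idea that warehouse data is a transformed copy rather than the live transaction data.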
Position of the Data Warehouse Within the Organization – Figure 1-2
Data Mining Example
Service Quality vs. Training
Courtesy: MicroStrategy (2005)
Examples of Common DW Applications (Table 1-1)

Sales Analysis
- Determine real-time product sales to make vital pricing and distribution decisions.
- Analyze historical product sales to determine success or failure attributes.
- Evaluate successful products and determine key success factors.
- Use corporate data to understand the margin as well as the revenue implications of a decision.
- Rapidly identify preferred customer segments based on revenue and margin.
- Quickly isolate past preferred customers who no longer buy.
- Identify daily what product is in the manufacturing and distribution pipeline.
- Instantly determine which salespeople are performing, on both a revenue and margin basis, and which are behind.

Financial Analysis
- Compare actuals to budgets on an annual, monthly and month-to-date basis.
- Review past cash flow trends and forecast future needs.
- Identify and analyze key expense generators.
- Instantly generate a current set of key financial ratios and indicators.
- Receive near-real-time, interactive financial statements.

Human Resource Analysis
- Evaluate trends in benefit program use.
- Identify the wage and benefits costs to determine company-wide variation.
- Review compliance levels for EEOC and other regulated activities.

Other Areas
- Warehouses have also been applied to areas such as logistics, inventory, purchasing, detailed transaction analysis and load balancing.
Table 1-2: Comparison of Typical DW Costs and Benefits

Costs
- Hardware, software, development personnel and consultant costs.
- Operational costs like ongoing systems maintenance.

Benefits
Added revenue
- Will the new (business objective) process generate new customers (what is the estimated value?)
- Will the new (business objective) process increase the buying propensity of existing customers (by how much?)
- Is the new process necessary to ensure that the competition doesn't offer a demanded service that you can't match?
Reduced costs
- What costs of current systems will be eliminated?
- Is the new process intended to make some operation more efficient? If so, how and what is the dollar value?
1-4: The Cost of DW
- Expenditures can be categorized as one-time initial costs or as recurring, ongoing costs.
- The initial costs can be further divided into hardware costs and software costs.
- Expenditures can also be categorized as capital costs (associated with acquisition of the warehouse) or as operational costs (associated with running and maintaining the warehouse).
- Cost of a Data Warehouse:
  - Rule of Thumb: $1 million per 1 terabyte of data
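The rule of thumb above is simple proportional arithmetic, sketched here in Python. The rate comes from the slide; the function name `estimate_dw_cost` and the 2.5 TB example size are hypothetical, and the result is only as reliable as the rule of thumb itself.

```python
# Rule of thumb from the slide: about $1 million per terabyte of warehouse data.
COST_PER_TERABYTE_USD = 1_000_000

def estimate_dw_cost(terabytes):
    """Return a rough cost estimate in US dollars for a warehouse of this size."""
    if terabytes < 0:
        raise ValueError("size must be non-negative")
    return terabytes * COST_PER_TERABYTE_USD

# Hypothetical example: a 2.5 TB warehouse.
print(f"${estimate_dw_cost(2.5):,.0f}")
```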
Expenditures Associated with Building a DW (Table 1-3)

Capital, one-time costs
- Hardware: disk, CPU, network, terminal analysis
- Software: DBMS, terminal analysis, middleware, log utility, processing, metadata infrastructure

Capital, recurring costs
- Hardware maintenance
- Software maintenance
- Terminal analysis
- Middleware

Operational, one-time costs
- Integration/transformation processing specification
- Metadata infrastructure population
- System of record definition
- Data dictionary language definition
- Network transfer definition
- CASE/Repository interface
- Initial data warehouse population
- Data model definition
- Database design definition

Operational, recurring costs
- Ongoing refreshment
- Integration transformation
- Data model maintenance
- Record identification maintenance
- Metadata infrastructure maintenance
- Archival of data
- Data aging within the DW
1-5: Data Mining: Farmers and Explorers
- Every corporation has two types of DW users.
  - Farmers [traditional statistical hypothesis testing] know what they want before they set out to find it. They submit small queries and retrieve small nuggets of information.
  - Explorers [data mining] are quite unpredictable. They often submit large queries. Sometimes they find nothing, sometimes they find priceless “golden” nuggets.
- Cost justification for the DW is usually done on the basis of the results obtained by farmers, since explorers are unpredictable.
1-6: Foundations of Data Mining
- Data mining is the process of using raw data to infer important business relationships.
- Despite a consensus on the value of data mining, a great deal of confusion exists about what it is.
- It is a collection of powerful techniques intended for analyzing large datasets.
- There is no single data mining approach, but rather a set of techniques that can be used in combination with each other.
1-6 & 1-7: The Foundations of Data Mining
- Data mining has roots in practice dating back over 30 years, using standard statistics [e.g., biostatistics].
- In the 1960s and 1970s, data mining was called statistical analysis, and the pioneers were statistical software companies such as SAS and SPSS.
- By the 1980s, the traditional techniques had been augmented by new methods such as fuzzy logic, heuristics and neural networks.
- Also, DSS tools came into popular use in the 1980s, with tools such as Lotus 1-2-3 and Excel.
Data Mining – A General Approach
Although all data mining endeavors are unique, they possess a common set of process steps:
1. Infrastructure preparation – choice of hardware platform, the database system and one or more mining tools
2. Exploration – looking at summary data, sampling and applying intuition [data visualization is useful here]
3. Analysis – each discovered pattern is analyzed for significance and trends
A General Approach (continued)
4. Interpretation – once patterns have been discovered and analyzed, the next step is to interpret them. Considerations include business cycles, seasonality and the population the pattern applies to.
5. Exploitation – this is both a business and a technical activity. One way to exploit a pattern is to use it for prediction. Others are to package, price or advertise the product in a different way.
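The exploration and analysis steps can be illustrated with a small Python sketch. The data are invented purely for illustration (training hours and a service-quality score for six hypothetical stores), and the Pearson correlation coefficient is used here as one common way to quantify a relationship; it is an assumption that this is the kind of "correlation" the chapter has in mind.

```python
from math import sqrt
from statistics import mean

# Invented illustrative data: one value per hypothetical store.
training_hours = [2, 4, 6, 8, 10, 12]
quality_score = [55, 60, 62, 70, 74, 81]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    # Note: raises ZeroDivisionError if either sequence is constant.
    return cov / sqrt(var_x * var_y)

# Exploration: look at summary data first.
print("mean hours:", mean(training_hours), "mean score:", mean(quality_score))

# Analysis: quantify how strongly the two measures move together.
r = pearson(training_hours, quality_score)
print(f"correlation r = {r:.3f}")
```

A value of r close to +1 or -1 indicates a strong linear relationship. Whether that relationship reflects cause and effect, a seasonal artifact, or only a particular population is a question for the interpretation step.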
The Data Warehouse and Data Mining
- Data mining does not require the use of a data warehouse (DW); however, DWs are designed with data mining in mind.
- The data in the DW is integrated and stable (non-volatile).
- Data changes continuously in an operational database.
- If multiple analyses are run in sequence, the data need to be held constant (as in a DW).
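The point about holding data constant can be shown with a short Python sketch. All names and figures here (`operational_db`, `load_snapshot`, the order amounts) are hypothetical, and a deep copy stands in for the warehouse load.

```python
import copy

# Hypothetical operational data that keeps changing as transactions arrive.
operational_db = {"orders": [100, 250, 75]}

def load_snapshot(source):
    """Copy the operational data so later updates do not affect the analysis."""
    return copy.deepcopy(source)

def total_orders(db):
    return sum(db["orders"])

def average_order(db):
    return sum(db["orders"]) / len(db["orders"])

warehouse = load_snapshot(operational_db)

first = total_orders(warehouse)        # first analysis
operational_db["orders"].append(900)   # the operational system changes meanwhile
second = average_order(warehouse)      # second analysis still sees the same data

# The two results are mutually consistent because the snapshot did not move.
print(first, second, first / len(warehouse["orders"]) == second)
```

Had both analyses run directly against `operational_db`, the total and the average would describe different sets of orders.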
Volumes of Data – The Biggest Challenge
- The largest challenge a “data miner” may face is the sheer volume of data in the warehouse.
- It is quite important, then, that summary data also be available to get the analysis started.
- A major problem is that this sheer volume may mask the important relationships the analyst is interested in.
- The ability to overcome the volume and visualize the data becomes quite important.
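As a rough illustration of what "summary data" means in practice, the Python sketch below rolls detail rows up to one total and one count per region and quarter. The rows, the grouping keys, and the function name `summarize` are hypothetical; a real warehouse would typically maintain such summaries inside the DBMS.

```python
from collections import defaultdict

# Hypothetical detail rows: (region, quarter, sale amount).
detail_rows = [
    ("East", "Q1", 20.0), ("East", "Q1", 15.0), ("East", "Q2", 30.0),
    ("West", "Q1", 10.0), ("West", "Q2", 25.0), ("West", "Q2", 5.0),
]

def summarize(rows):
    """Roll detail rows up to (total, count) per (region, quarter)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for region, quarter, amount in rows:
        totals[(region, quarter)] += amount
        counts[(region, quarter)] += 1
    return {key: (totals[key], counts[key]) for key in sorted(totals)}

summary = summarize(detail_rows)
for (region, quarter), (total, count) in summary.items():
    print(region, quarter, total, count)
```

Six detail rows collapse to four summary rows here; at warehouse scale the same roll-up gives the analyst a manageable starting point before drilling into the detail.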
1-9: Foundations of Data Visualization [DV]
- One of the earliest known examples of data visualization was in London during the 1854 cholera epidemic. A map (next slide) helped to identify the source of the disease.
- Modern visualization techniques grew from the twin technologies of computer graphics and high-performance computing in the 1970s and 1980s.
Dr. John Snow used a map to show that the source of cholera was a water pump, thus proving the disease was waterborne.
[Figure: map with the Broad Street Pump marked]
DV: Opportunity and Timing
- Alternative input devices (light pen, sketch pad and mouse) began to appear in the 1960s.
- In the 1970s, flight simulators became much more realistic when graphics replaced film.
- In the same decade, special effects computers became entrenched in the entertainment industry.
- In the 1980s, visualization grew more dynamic with applications like the animation of weather patterns.
Data Visualization – Sales by Region
[Figure: typical spreadsheet graphic. Series: East, West, North. X-axis: 1st Qtr to 4th Qtr. Y-axis: 0 to 90.]
Data Visualization – Total Precipitation
DV & DM: Future Success Drivers
- In the 1990s, rapid advances in chip technology, both at the CPU and the graphics processor, put data visualization everywhere.
- Ongoing reduction in the cost of computing.
  - Each new generation brings a 10X-100X performance-cost improvement.
  - Approximately every 18 months [Moore’s Law].
- Web-based e-commerce
  - Business-to-consumer commerce [B to C, and C to C]
  - Generates billions and even trillions of characters per reporting period
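To give a feel for how the 18-month cadence mentioned above compounds, here is a small Python sketch. It assumes the popular reading of Moore's Law as a doubling roughly every 18 months; that doubling factor is an assumption of this sketch, not the 10X-100X per-generation figure quoted on the slide, and the function name `improvement_factor` is hypothetical.

```python
def improvement_factor(years, doubling_period_months=18):
    """Compound improvement after `years`, doubling once per period."""
    if doubling_period_months <= 0:
        raise ValueError("doubling period must be positive")
    periods = years * 12 / doubling_period_months
    return 2 ** periods

# One period, two periods, and ten periods of doubling.
for years in (1.5, 3, 15):
    print(years, "years ->", improvement_factor(years), "x")
```

Under this assumption, fifteen years spans ten doubling periods, a factor of 2^10 = 1024, which is why even a modest per-period gain changes what is economical to store and visualize.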
The End