A Look at Data Mining
Presented by:
Charles Hollingsworth
Flavia Peynado
Ritch Overton
DSc8020, Group Presentation, July 31, 2002
What is Data Mining?
It may be described as the process of
extracting previously unidentified, valid, and
actionable information from large databases
and then using the information to make crucial
business decisions.
Why the need for data mining?
 Business environment is constantly changing.
   Customer behavior patterns
   Market saturation
   New niche markets
   Increased commoditization
   Time to market
   Shorter product life cycles
   Increased competition and business risks
 Drivers
   The Customer
   Products
   Competition
   Operations/Data Assets
 Enablers
   Data flood
   Growth of data warehousing
   New IT solutions
   New research in machine learning
Process overview contd.
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Data Transformation
5. Data Mining
6. Analysis of results
7. Assimilation of results
Effort needed at each stage of data mining
[Figure: bar chart; y-axis: Effort (0 to 60); x-axis: stages of data mining]
Visualization
 Goal is to provide a summary and overview of a dataset
 Promotes Understanding: Deconstructive process
 Promotes Trust: Constructive process
 Narrows the gap between human and computer during
data analysis
Types of Visualization Tools
 Histograms
 Bar Charts
 Time-Series Plots
 Decision Trees
 Scatter plots
 Coxcomb Plots
 Pie Charts
 Stereograms
 Line Plots
 Moseley’s X-rays
Histogram
Graphically illustrates how many
observations fall in various categories
[Figure: Histogram for Diameter; x-axis: Category (diameter bins from <=0.445 to >0.545); y-axis: 0 to 100]
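As a rough illustration of how a histogram counts observations per category, here is a minimal Python sketch. The function name `histogram_counts`, the bin edges, and the sample diameters are all made up for illustration and are not taken from the slide's data.

```python
from bisect import bisect_left


def histogram_counts(values, edges):
    """Count how many observations fall in each category.

    With sorted edges e0 < e1 < ... < ek:
      category 0      holds values <= e0
      category i      holds values in (e[i-1], e[i]]
      category k + 1  holds values > ek
    """
    counts = [0] * (len(edges) + 1)
    for v in values:
        # bisect_left returns the index of the first edge that is >= v
        counts[bisect_left(edges, v)] += 1
    return counts


# Hypothetical diameters and bin edges (assumed, not the slide's data)
diameters = [0.44, 0.451, 0.452, 0.47, 0.55]
edges = [0.445, 0.455, 0.465]
print(histogram_counts(diameters, edges))
```

Each count would become the height of one bar in the histogram.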
Bar Chart
Categories are placed on the vertical axis,
rather than on the horizontal axis as in a histogram
Scatter Plot
Graphical representation of the relationship
between two variables.
[Figure: Scatter Plot; x-axis: Domestic Gross (0 to 200); y-axis: Salary (0 to 25)]
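One common way to put a number on the relationship a scatter plot shows is the Pearson correlation coefficient. The sketch below is an illustration only: the slides do not mention correlation, the function name `pearson_r` is an assumption, and the gross and salary figures are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross products
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / sqrt(sxx * syy)


# Hypothetical points (assumed, not the values plotted on the slide)
domestic_gross = [10, 50, 100, 150]
salary = [2, 6, 12, 15]
print(round(pearson_r(domestic_gross, salary), 3))
```

A value near +1 or -1 suggests the points lie close to a straight line; a value near 0 suggests no linear relationship.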
Pie Chart
Radii are used to divide a circle into wedges. The
resulting angles represent the values of the wedges.
[Figure: pie chart, Spring 2000 Salary Survey. Wedges: <$30,000; $30,000 to $39,999; $40,000 to $49,999; $50,000 to $59,999; $60,000 to $69,999; More than $70,000; No Answer]
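The wedge angles follow directly from each value's share of the total. A minimal sketch, assuming a helper named `wedge_angles` and made-up response counts (the survey's actual counts are not given in the slides):

```python
def wedge_angles(values):
    """Return the angle in degrees of each pie wedge."""
    total = sum(values)
    if total <= 0:
        raise ValueError("values must sum to a positive number")
    # Each wedge gets the same fraction of 360 degrees as its share of the total
    return [360.0 * v / total for v in values]


# Hypothetical response counts for three salary bands (assumed data)
print(wedge_angles([1, 1, 2]))
```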
Line Plot
Connects consecutive data points to
enhance visualization
Time-Series Plot: Playfair’s
•Helpful in forecasting future values
•Time variable is placed on the
horizontal axis
•Makes patterns in data more
apparent
•The area between two time-series
curves was emphasized to show the
difference between them,
representing the balance of trade.
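The gap between the two curves can be computed period by period. This is a small sketch with hypothetical export and import figures; the function name `balance_of_trade` and the numbers are assumptions, not data from Playfair's chart.

```python
def balance_of_trade(exports, imports):
    """Difference between two time series, one value per period."""
    if len(exports) != len(imports):
        raise ValueError("both series must cover the same periods")
    # Positive values mean exports exceed imports in that period
    return [e - i for e, i in zip(exports, imports)]


# Hypothetical yearly figures (assumed, for illustration only)
exports = [5, 7, 9]
imports = [6, 4, 9]
print(balance_of_trade(exports, imports))
```

Plotting this difference against time shows the same information as the shaded area between the two curves.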
Decision Trees
Conventions for Decision Trees:
1. Composed of nodes (points in time) and branches (possible decisions).
2. Squares represent decision nodes, circles represent probability nodes, triangles represent end nodes.
3. Probabilities are listed on probability branches.
4. Monetary values are listed on the branches where they occur.
5. Decision maker has no control over probability branches.
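A tree drawn with these conventions is usually evaluated by working backward from the end nodes and computing an expected monetary value (EMV) at each node. The sketch below is one way to do that under simplifying assumptions: the dictionary layout, the function name `rollback`, and the payoffs are hypothetical, and monetary values are attached only to end nodes rather than to intermediate branches.

```python
def rollback(node):
    """Return the expected monetary value (EMV) of a decision-tree node."""
    kind = node["kind"]
    if kind == "end":
        # End node (triangle): the payoff is known
        return node["value"]
    if kind == "chance":
        # Probability node (circle): weight each outcome by its probability,
        # since the decision maker has no control over which branch occurs
        return sum(p * rollback(child) for p, child in node["branches"])
    if kind == "decision":
        # Decision node (square): the decision maker picks the best branch
        return max(rollback(child) for _label, child in node["branches"])
    raise ValueError(f"unknown node kind: {kind!r}")


# Hypothetical tree: launch a product (uncertain payoff) or hold (sure payoff)
tree = {
    "kind": "decision",
    "branches": [
        ("launch", {
            "kind": "chance",
            "branches": [
                (0.5, {"kind": "end", "value": 100}),
                (0.5, {"kind": "end", "value": -40}),
            ],
        }),
        ("hold", {"kind": "end", "value": 20}),
    ],
}
print(rollback(tree))
```

In this made-up example the "launch" branch has an EMV of 30, which beats the sure payoff of 20 from "hold".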
Coxcomb Plot
 In 1858, Florence Nightingale
constructed graphs of her own design,
which she called “Coxcombs”.
 In a Coxcomb, the radius of each wedge varies with the value,
whereas in a pie chart the angle of the wedge varies.
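To make the contrast with a pie chart concrete, the sketch below gives every wedge the same angle and varies only the radius. It assumes each wedge's area should be proportional to its value, so the radius grows with the square root of the value; that scaling choice, the function name `coxcomb_radii`, and the sample values are assumptions for illustration.

```python
from math import pi, sqrt


def coxcomb_radii(values):
    """Radius of each equal-angle wedge so that wedge area equals its value."""
    if not values or any(v < 0 for v in values):
        raise ValueError("need a non-empty list of non-negative values")
    theta = 2 * pi / len(values)  # every wedge spans the same angle
    # Sector area is 0.5 * r**2 * theta, so r = sqrt(2 * value / theta)
    return [sqrt(2 * v / theta) for v in values]


# Hypothetical monthly counts (assumed data)
radii = coxcomb_radii([1.0, 4.0, 9.0])
print([round(r, 3) for r in radii])
```

Because area grows with the square of the radius, a value four times larger gets a radius only twice as long.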
Stereogram
Luigi Perozzo, from the Annali di Statistica,
1880
The population of Sweden from 1750-1875
by age groups
Moseley’s X-rays
The plot led Henry Moseley to discover that
the atomic number is more than a serial
number: it has a physical basis.
Moseley proposed that the atomic
number is the number of electrons in
a neutral atom of the element.
Other Visualization Tools
 Doughnut
 Area Chart
 Box Plot
 Radar
Algorithms
Predictive
 Regression
 Classification
Descriptive
 Parallel Formulation of Classification
 Association Rule Discovery
 Sequential Pattern Discovery Analysis
 Clustering
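As one concrete example from this list, association rule discovery is commonly described in terms of two measures, support and confidence. The slides do not define them, so the sketch below is an illustration under that standard framing; the function names and the shopping baskets are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    if not transactions:
        raise ValueError("need at least one transaction")
    items = set(itemset)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing antecedent, the share that also contain consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        raise ValueError("antecedent never occurs")
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / base


# Hypothetical shopping baskets (assumed data)
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
print(support(baskets, {"bread", "milk"}))
print(confidence(baskets, {"bread"}, {"milk"}))
```

A rule such as "bread implies milk" is typically reported only when both measures exceed thresholds chosen by the analyst.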
Applying
 Relevance to managers
   Decreasing Costs
   Valuing Appropriately
   Effective Implementation
Conclusion
 Converging Developments
   Data compilation
   Processing power
   Maturing Algorithms
   Visualization
 Accessible Resources