Download Craig A. Stevens Presentation Slides on Data Mining Intro

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Understanding Data Mining
Craig A. Stevens, PMP, CC
[email protected]
www.westbrookstevens.com
Examples of
Classical Statistical
Methods
Latitude 36.19N and Longitude -86.78W
Nashville, TN, USA
Yi = a + bxi + e
Multiple Regression
http://www.ats.ucla.edu/stat/sas/faq/spplot/reg_int_cont.htm
Multiple Regression
http://www.ats.ucla.edu/stat/sas/faq/spplot/reg_int_cont.htm
Multiple Regression
http://www.ats.ucla.edu/stat/sas/faq/spplot/reg_int_cont.htm
Multiple Regression
http://www.ats.ucla.edu/stat/sas/faq/spplot/reg_int_cont.htm
Multiple Regression
http://www.ats.ucla.edu/stat/sas/faq/spplot/reg_int_cont.htm
Data Mining
http://datamining.typepad.com/photos/uncat
egorized/livejournal.png
What is Data Mining?
• The process of identifying hidden patterns, trends,
and relationships in large quantities of data.
Why Do Data Mining?
• To discover useful information for making
decisions.
• Too many variables for Classical Statistical methods
to work.
– Large Number of Records 108 - 1012
• Gigabyte – Terabyte
– High Dimensional Data
• Lots of Variables (10 – 104 attributes)
The Huber-Wegman Taxonomy of Data Set Sizes
Descriptor
Tiny
Small
Data Set Size in
Bytes
10^2
10^4
Medium
10^6
Large
Huge
Massive
10^8
10^10
10^12
Super Massive
10^15
Storage Mode
Piece of Paper
A few Pieces of
Paper
A Floppy Disk
Hard Disk
Multiple Hard Disks
Robotic Magnetic
Tape
Storage Silos
Distributed Data
Archives
Name
BAD
Model
Role
Target
Measurement
Level
Binary
Description
CLAGE
Input
Interval
Age of oldest trade line in
months
CLNO
Input
Interval
Number of trade lines
DEBTINC
Input
Interval
Debt-to-income ratio
DELINQ
Input
Interval
Number of trade lines
DEROG
Input
Interval
Number of major
derogatory reports
JOB
Input
Nominal
LOAN
Input
Interval
MORTDUE
Input
Interval
NINQ
Input
Interval
REASON
Input
Binary
VALUE
Input
Interval
Six occupational
categories
Amount of the loan
request
Amount due on existing
mortgage
Number of recent credit
inquiries
DebtCon=debt
consolidation,
HomeImp=home
improvement
Value of current property
YOJ
Input
Interval
Years at present job
1=client defaulted on loan
0=loan repaid
SAS Enterprise Miner Objects
Shows the Cut off Point is 6 Variables
Small Number of Useful Variables
Comparing Methods and Profit vs
Marketing Cost
Decision Trees for Predictive Modeling
Padraic G. Neville SAS Institute Inc. 4 August
1999
Clustering As in Different Brands
2
0
1
- 1
0
1
0
- 1
0
1
- 1
0
1
0
1
H
4
H
4
P CR2 _ 1
M
O
IS
_I 9B
O 0
- 1
J
- 1
_
1
2
O 0
L
J
A
L
- 1
0
1
- 1
0
1
2
3
- 1
0
1
2
- 1
0
1
_
C
A
1
C
- 1
0
0
2
Z
Z
M
O
IS
_I 9B
S
S
- 1
B
B
_
R
R
0
A
A
_
C
C
1
M
O
IS
_I 9B
Q
G
- 1
H
Q
_
G
I
H
D
_
O
S
I
1
M
O
IS
_I 9B
D
2
3
- 1
6
O
S
6
D
O
O
0
J
D
_
J
H
S
_
1
S
H
A
A
2
M
O
IS
_I 9B
J
- 1
L
C
J
F
C
L
_
F
0
_
T
2
F
A
1
T
3
- 1
1
A
F
0
M
O
IS
_I 9B
2
- 1
3
R 0
3
T
_
O 1
T
R
P
P CR3 _ 1
P CR1 _ 1
0
0
0
0
0
P
R
O
T_TR
3
1
P
R
O
T_TR
3
1
P
R
O
T_TR
3
1
P
R
O
T_TR
3
1
P
R
O
T_TR
3
1
2
2
2
2
2
M
O
IS
_I 9B P
R
O
T_TR
3 FA
T_FC
LJ A
S
H
_JO
D
6 S
O
D
I _H
G
QC
A
R
B
_S
Z0 C
A
L_JO
H
4
1
2
- 1
0
1
- 1
0
1
2
3
- 1
0
1
2
4
H
- 1
O 0
J
_
L
A
C
0
Z
S
_
B
R
A
C
Q
G
H
_
I
D
O
S
6
D
O
J
_
H
S
A
- 1
- 1
- 1
- 1
0
1
1
1
1
FA
T_FC
LJ
0
FA
T_FC
LJ
0
FA
T_FC
LJ
0
FA
T_FC
LJ
2
2
2
2
3
3
3
3
1
2
- 1
0
1
- 1
0
1
2
4
H
- 1
O 0
J
_
L
A
C
0
Z
S
_
B
R
A
C
Q
G
H
_
I
D
O
S
3
- 1
- 1
- 1
1
1
1
A
S
H
_JO
D
6
0
A
S
H
_JO
D
6
0
A
S
H
_JO
D
6
0
2
2
2
1
2
- 1
0
1
4
H
- 1
O 0
J
_
L
A
C
0
Z
S
_
B
R
A
C
- 1
- 1
0
0
1
S
O
D
I _H
G
Q
1
S
O
D
I _H
G
Q
2
2
3
3
1
2
4
H
- 1
O 0
J
_
L
A
C
- 1
0
C
A
R
B
_S
Z0
1
Data Mining Art found at http://datamining.typepad.com/data_mining/dataviz/page/2/
Data Mining Art found at http://datamining.typepad.com/data_mining/dataviz/page/2/
National Energy Research Scientific Computing Center
SurfStat
A Matlab toolbox for the statistical analysis of
univariate and multivariate surface and
volumetric data using linear mixed effects
models and random field theory
Keith J. Worsley
Latitude 36.19N and Longitude -86.78W
Nashville, TN, USA
Genealogical Tree
On You Tube
http://www.youtube.com/watch?v=CnniJR5Ah7g
Related documents