Data Modelling in SAS
How SAS is Used for Research and Teaching to
Enable Students to Become More Marketable
Iveta Stankovičová
Comenius University
Faculty of Management
Bratislava, Slovakia
[email protected]
Data
• The current age is characterized by an information explosion
• Data are generated:
– for research purposes (historically, for data analysis): experimental data
– as operational data (today, in business): opportunistic data (Huber 1977)
2
Data
             Experimental Data     Opportunistic Data
Purpose      Research              Operational
Value        Scientific            Commercial
Generation   Actively controlled   Passively observed
Size         Small                 Massive
Hygiene      Clean                 Dirty
State        Static                Dynamic
3
Data → Information
• Managers need information extracted from massive amounts of operational data to support their decisions (business decision support)
• It is necessary to explore and model relationships in the data: predictive modelling (the fundamental task)
• Data Modelling = Data Mining (circa 1963)
4
Data Mining - Definition
• The process of selecting, exploring and modelling large volumes of data in order to uncover previously unknown patterns of information that give an advantage in a competitive environment
• Multidisciplinary lineage
• Uses statistical methods as well as methods bordering on artificial intelligence
5
Data Mining – SAS definition
• Advanced methods for exploring and modelling relationships in large amounts of data
Characteristics:
1. data – massive, operational, opportunistic
2. users and sponsors – non-researchers, business oriented
3. methodology – multidisciplinary, computer-based
6
Data Mining – Analytical tools
• Statistics
• Artificial intelligence (AI)
• Knowledge discovery in databases (KDD)
• Machine learning
• Pattern recognition methodology
• Neurocomputing
7
Data Mining – Steps, Cycle
1. Identifying the business problem
2. Transforming data into actionable results
3. Acting on the results achieved
4. Measuring the results
[Diagram: steps 1 to 4 arranged in a cycle]
8
Data Mining - Activities
• Classification
• Affinity grouping or association rules
• Clustering, segmentation
• Estimation
• Prediction
• Description and visualization
9
Data Mining - People
• Domain experts
• Data experts
• Analytical experts
10
Data Mining - Processes
1. Building the model
• historical data, split into:
  1. training
  2. test
  3. validation
2. Applying the model
• new data
• prediction
[Diagram: Data Mining System. Labels: Algorithm, Training, Test, Eval, Model, Score Model, Prediction, Results]
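The split of historical data into training, test and validation sets can be sketched in a few lines of Python. This is only an illustration, not the SAS workflow itself: the 60/20/20 proportions, the fixed seed and the function name `partition` are assumptions made for the example.

```python
import random

def partition(rows, train=0.6, test=0.2, seed=42):
    """Randomly split historical records into training, test and validation sets.

    The 60/20/20 proportions are an illustrative assumption; whatever is left
    after the training and test shares goes to validation.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = int(len(shuffled) * train)
    n_test = int(len(shuffled) * test)
    training = shuffled[:n_train]
    testing = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]
    return training, testing, validation

# Hypothetical historical data: 396 record identifiers
training, testing, validation = partition(range(396))
print(len(training), len(testing), len(validation))
```

The model would be fitted on `training`, tuned on `testing`, checked on `validation`, and only then applied to new data.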
11
Data Mining – Practice
1. Goal definition
2. Selection of data sources
3. Preparation of data for modelling
4. Selection and transformation of variables
5. Processing and evaluation of the model
6. Model verification
7. Implementation and model maintenance
12
Data Mining – SAS solution
SEMMA methodology:
1. Sample – identify input data sets, sample from a large data set (training, test and validation data sets)
2. Explore – explore the data set statistically and graphically
3. Modify – prepare the data for analysis (data manipulation and transformation)
4. Model – fit a predictive model
5. Assess – compare competing models
13
Data Mining - Methods
• Statistical methods: linear and logistic regression, multidimensional methods, time series analysis ...
• Non-statistical methods: neural networks, genetic algorithms ...
• Mixed methods: classification and regression trees ...
14
SAS System at Comenius
University Bratislava (CU)
• November 1999 – licence agreement signed between CU Bratislava and SAS Institute GmbH for 50 licences of the SAS System
• November 2001 – amendment to the licence agreement adding Enterprise Guide
15
SAS System at Faculty of
Management Bratislava (FM)
• Faculty of Management: 25 licences
• SAS teaching began (with V6.12) in the summer term of academic year 1999/2000
• Currently: SAS V8.2 and Enterprise Guide V2.0
16
Subjects of Statistics
3 compulsory subjects:
• Introduction to Statistics (1st year, summer term – 4 hours/week)
• Statistics on PC (2nd year, winter term – 2 hours/week)
• Statistical Methods (2nd year, summer term – 4 hours/week)
2 elective subjects:
• Quantitative Methods (in SAS System) (3rd year, summer term – 2 hours/week)
• Time Series Analysis (in SAS System) (3rd year, summer term – 2 hours/week)
17
Subjects contents
Contents of compulsory subjects:
– methods of mathematical statistics, covered with the basic modules (SAS/BASE, SAS/STAT, SAS/ETS)
Contents of elective subjects:
– logistic regression, principal components analysis (PCA), cluster analysis, factor analysis, discriminant analysis (SAS/STAT, SAS/EG)
– time series analysis: ARIMA models (SAS/EG)
18
Example – Logistic model
• Sample of 396 applicants for credit
• Independent variables Xi (categorical):
  Age (classes) = vek                     8 values
  Gender = pohl (0 = male, 1 = female)    2 values
  Income (classes) = plat                 8 values
  Number of dependants = vyz_os           4 values
  Job duration (classes) = trv_zam        6 values
• Dependent variable Y (binary):
  Credit (1 = assigned, 0 = not assigned) 2 values
19
Logistic regression model
• Conditional probability: $p = P(Y = 1 \mid X)$
  $p = \dfrac{1}{1 + e^{-(\alpha + \beta' X)}}$
• Odds: $p/(1-p)$
  $\dfrac{p}{1-p} = e^{\alpha + \beta' X}$
• Logarithm of the odds = logit
– the transformation that makes the model linear in X
  $\operatorname{logit}(p) = \log\dfrac{p}{1-p} = \alpha + \beta' X$
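The three forms of the model (probability, odds, logit) are related by simple transformations, which a short Python sketch can make concrete. The function names and the numeric values of alpha, beta and x are hypothetical, chosen only to exercise the formulas.

```python
import math

def probability(alpha, beta, x):
    """p = 1 / (1 + exp(-(alpha + beta'x))) for a coefficient list beta and inputs x."""
    eta = alpha + sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-eta))

def odds(p):
    """Odds p / (1 - p)."""
    return p / (1.0 - p)

def logit(p):
    """Logarithm of the odds; recovers the linear predictor alpha + beta'x."""
    return math.log(odds(p))

# Hypothetical values, not taken from the slides
alpha, beta, x = -1.0, [0.5, 0.25], [2.0, 4.0]
p = probability(alpha, beta, x)
print(p, odds(p), logit(p))
```

Applying `logit` to the probability returns the linear predictor, here -1.0 + 0.5*2.0 + 0.25*4.0.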
20
Significance of Variables
Analysis of Effects Not in the Model
Effect    DF   Score Chi-Square   Pr > ChiSq
pohl      1    0.8121             0.3675
vek       1    48.0791            <.0001
vyz_os    1    39.6707            <.0001
plat      1    41.4197            <.0001
trv_zam   1    33.9234            <.0001
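The "Pr > ChiSq" column can be checked by hand, because a chi-square statistic with 1 degree of freedom is the square of a standard normal variable. The sketch below uses only the Python standard library; the function name is an assumption, and the statistics are copied from the table above.

```python
import math

def chi2_pvalue_1df(statistic):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom.

    For DF = 1: P(chi2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(statistic / 2.0))

# Score chi-square values copied from the table above
score_chi2 = {"pohl": 0.8121, "vek": 48.0791, "vyz_os": 39.6707,
              "plat": 41.4197, "trv_zam": 33.9234}

for effect, stat in score_chi2.items():
    print(f"{effect:8s} {stat:8.4f} {chi2_pvalue_1df(stat):.4f}")
```

Gender (pohl) is the only effect whose p-value exceeds the usual 0.05 level, which is consistent with its absence from the fitted model.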
21
Estimates of model’s parameters
Analysis of Maximum Likelihood Estimates
Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept   1    -6.3538    0.7073           80.6885           <.0001
vek         1    0.3916     0.0871           20.2308           <.0001
vyz_os      1    0.8109     0.1539           27.7692           <.0001
plat        1    0.7182     0.1264           32.2918           <.0001
trv_zam     1    0.6000     0.1155           27.0075           <.0001
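For a single parameter, the Wald chi-square is the squared ratio of the estimate to its standard error. The following sketch recomputes that column from the printed estimates; the function name is an assumption, and small differences from the table are expected because the inputs are rounded to four decimals.

```python
# Estimates and standard errors copied from the table above
estimates = {
    "Intercept": (-6.3538, 0.7073),
    "vek": (0.3916, 0.0871),
    "vyz_os": (0.8109, 0.1539),
    "plat": (0.7182, 0.1264),
    "trv_zam": (0.6000, 0.1155),
}

def wald_chi_square(estimate, std_error):
    """Wald chi-square for one parameter (1 DF): (estimate / standard error) squared."""
    return (estimate / std_error) ** 2

for name, (est, se) in estimates.items():
    print(f"{name:10s} {wald_chi_square(est, se):8.4f}")
```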
22
Odds Ratio Estimates
Odds Ratio Estimates
Effect    Point Estimate   95% Wald Confidence Limits
vek       1.479            1.247   1.755
vyz_os    2.250            1.664   3.042
plat      2.051            1.601   2.627
trv_zam   1.822            1.453   2.285
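The odds ratios and their Wald limits follow directly from the parameter estimates: the point estimate is exp(b) and the limits are exp(b ± z·SE). A minimal Python sketch, assuming the estimates and standard errors from the "Analysis of Maximum Likelihood Estimates" slide; the function name is hypothetical.

```python
import math

Z_975 = 1.959964  # 97.5th percentile of the standard normal distribution

# (estimate, standard error) as printed on the parameter estimates slide
parameters = {
    "vek": (0.3916, 0.0871),
    "vyz_os": (0.8109, 0.1539),
    "plat": (0.7182, 0.1264),
    "trv_zam": (0.6000, 0.1155),
}

def odds_ratio_with_limits(estimate, std_error, z=Z_975):
    """Return (point estimate, lower limit, upper limit) of the odds ratio."""
    point = math.exp(estimate)
    lower = math.exp(estimate - z * std_error)
    upper = math.exp(estimate + z * std_error)
    return point, lower, upper

for effect, (est, se) in parameters.items():
    point, lower, upper = odds_ratio_with_limits(est, se)
    print(f"{effect:8s} {point:.3f} {lower:.3f} {upper:.3f}")
```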
23
Logistic model - final
• Logit function:
  $\log\dfrac{p}{1-p} = -6.35 + 0.39\,\text{vek} + 0.81\,\text{vyz\_os} + 0.72\,\text{plat} + 0.60\,\text{trv\_zam}$
• Probability function:
  $p = \dfrac{1}{1 + e^{-(-6.35 + 0.39\,\text{vek} + 0.81\,\text{vyz\_os} + 0.72\,\text{plat} + 0.60\,\text{trv\_zam})}}$
• Odds function:
  $\dfrac{p}{1-p} = e^{-6.35 + 0.39\,\text{vek} + 0.81\,\text{vyz\_os} + 0.72\,\text{plat} + 0.60\,\text{trv\_zam}}$
• Interpretation (example):
  The odds that a client is granted the credit roughly double with each step up to a higher income class,
– because $e^{0.72} \approx 2.05$, where 0.72 is the coefficient of the income variable (plat) in the logit function
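The final model can be turned into a small scoring function. The sketch below is an illustration under assumptions: it uses the rounded coefficients from this slide, treats each class variable as a numeric code (the slides do not state how the classes are coded), and scores an invented applicant.

```python
import math

# Rounded coefficients of the final model as given on the slide
INTERCEPT = -6.35
COEFFICIENTS = {"vek": 0.39, "vyz_os": 0.81, "plat": 0.72, "trv_zam": 0.60}

def linear_predictor(applicant):
    """Value of the logit function for one applicant."""
    return INTERCEPT + sum(COEFFICIENTS[name] * applicant[name] for name in COEFFICIENTS)

def credit_probability(applicant):
    """Estimated probability that the credit is assigned."""
    return 1.0 / (1.0 + math.exp(-linear_predictor(applicant)))

def credit_odds(applicant):
    """Estimated odds p / (1 - p) that the credit is assigned."""
    return math.exp(linear_predictor(applicant))

# Hypothetical applicant: these class codes are invented for illustration
applicant = {"vek": 4, "vyz_os": 2, "plat": 3, "trv_zam": 3}
richer = dict(applicant, plat=applicant["plat"] + 1)  # one income class higher

ratio = credit_odds(richer) / credit_odds(applicant)
print(credit_probability(applicant), ratio)
```

Whatever the other values are, moving one income class up multiplies the odds by the same factor, exp(0.72), which is the interpretation given above.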
24
Measures of association
Association of Predicted Probabilities and Observed Responses
Percent Concordant   82.0     Somers' D   0.647
Percent Discordant   17.3     Gamma       0.652
Percent Tied         0.7      Tau-a       0.316
Pairs                38180    c           0.824
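The four rank-correlation indexes are simple functions of the concordant, discordant and tied percentages. The sketch below recomputes them from the printed figures, using the sample size of 396 from the example slide; the function name is an assumption, and agreement is only approximate because the percentages are rounded to one decimal.

```python
def association_measures(pct_concordant, pct_discordant, pct_tied, pairs, n_obs):
    """Recompute Somers' D, Gamma, Tau-a and c from the pair percentages."""
    conc = pct_concordant / 100.0
    disc = pct_discordant / 100.0
    tied = pct_tied / 100.0
    return {
        # (concordant - discordant) / all pairs
        "somers_d": conc - disc,
        # ties are left out of the denominator
        "gamma": (conc - disc) / (conc + disc),
        # denominator is the number of all pairs of observations, N(N-1)/2
        "tau_a": (conc - disc) * pairs / (0.5 * n_obs * (n_obs - 1)),
        # concordant share plus half of the ties (area under the ROC curve)
        "c": conc + 0.5 * tied,
    }

measures = association_measures(82.0, 17.3, 0.7, pairs=38180, n_obs=396)
for name, value in measures.items():
    print(f"{name:9s} {value:.3f}")
```

The number of pairs is the count of applicants with credit assigned multiplied by the count without, which is why it is much smaller than N(N-1)/2.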
25
Logistic S-curve
• x-axis = income classes
• y-axis = probability of credit assignment
[Figure: logistic S-curve; x-axis: income classes (0 to 8), y-axis: probability of credit assignment (0 to 1)]
26
SAS System – menu-driven tools
• Overview of modules and applications of SAS System V8.2 for carrying out statistical analyses in menu mode (knowledge of SAS code is not required):
– SAS/ASSIST software
– SAS/INSIGHT software
– SAS Analyst
– SAS Enterprise Guide
27
Activities
Outputs from SAS education:
• Projects – output from each subject
• Student Research Activity Competition – 3rd year, approx. 15 works per year
• Theses:
– information system (module AF)
– data analysis (modules BASE, STAT, QC, ...)
– scorecard (Enterprise Guide, Enterprise Miner)
• SAS Forum conference – participation of teachers and students
28
Plans
Planned extension of SAS use to the following subjects:
• Multidimensional Methods of Analysis
• Time Series Analysis
• Marketing Research
• Data Mining
• Financial Analysis
• Quality Control
• Operational Management
29
Thanks for your attention!
30