Data Modelling in SAS
How SAS is Used for Research and Teaching to Enable Students to Become More Marketable

Iveta Stankovičová
Comenius University, Faculty of Management
Bratislava, Slovakia
[email protected]

Data

The current age is characterised by an information explosion.
Data are generated:
- for research purposes (historically, for data analysis): experimental data
- as operational data (today, in business): opportunistic data (Huber 1977)

Data

             Experimental data      Opportunistic data
Purpose      Research               Operational
Value        Scientific             Commercial
Generation   Actively controlled    Passively observed
Size         Small                  Massive
Hygiene      Clean                  Dirty
State        Static                 Dynamic

Data → Information

- Managers need information extracted from massive amounts of operational data in order to make decisions (business decision support).
- This requires exploring and modelling relationships in the data; predictive modelling is the fundamental task.
- Data Modelling = Data Mining (ca. 1963)

Data Mining – Definition

- The process of selecting, exploring and modelling large volumes of data in order to detect previously unknown patterns that give an advantage in a competitive environment.
- Multidisciplinary lineage.
- Uses statistical methods and further methods bordering on artificial intelligence.

Data Mining – SAS definition

Advanced methods for exploring and modelling relationships in large amounts of data.
Characteristics:
1. Data: massive, operational, opportunistic
2. Users and sponsors: non-researchers, business oriented
3. Methodology: multidisciplinary, computer-based

Data Mining – Analytical tools

- Statistics
- Artificial intelligence (AI)
- Knowledge discovery in databases (KDD)
- Machine learning
- Pattern recognition methodology
- Neurocomputing

Data Mining – Steps, Cycle

1. Identifying the business problem
2. Transforming data into actionable results
3. Acting on the results achieved
4. Measuring the results
[Diagram: the four steps arranged in a cycle]

Data Mining – Activities

- Classification
- Affinity grouping or association rules
- Clustering, segmentation
- Estimation
- Prediction
- Description and visualization

Data Mining – People

- Domain experts
- Data experts
- Analytical experts

Data Mining – Processes

1. Building the model on historical data:
   1. training
   2. test
   3. validation
2. Applying the model to new data to obtain predictions
[Diagram labels: Data Mining System, Algorithm, Training, Test, Eval, Model, Score Model, Prediction, Results]

Data Mining – Practice

1. Goal definition
2. Selection of data sources
3. Preparation of data for modelling
4. Selection and transformation of variables
5. Processing and evaluation of the model
6. Model verification
7. Implementation and model maintenance

Data Mining – SAS solution

SEMMA methodology:
1. Sample: identify input data sets and sample from a large data set (training, test and validation data sets)
2. Explore: explore the data set statistically and graphically
3. Modify: prepare the data for analysis (data manipulation and transformation)
4. Model: fit a predictive model
5. Assess: compare competing models

Data Mining – Methods

- Statistical methods: linear and logistic regression, multidimensional methods, time series analysis ...
- Non-statistical methods: neural networks, genetic algorithms ...
- Mixed methods: classification and regression trees ...
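
The Sample step of SEMMA and the training, test and validation split named under Processes can be illustrated with a short sketch. It is written in Python purely for illustration; it is not SAS code and not the procedure used in the course. The function name `partition`, the 60/20/20 fractions and the fixed seed are all assumptions made for this example; the 396 records simply mirror the size of the credit sample used later in the deck.

```python
import random


def partition(records, fractions=(0.6, 0.2, 0.2), seed=42):
    """Shuffle records and split them into training, validation and test sets.

    The fractions and the seed are illustrative assumptions, not values
    taken from the presentation.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(records)
    # A seeded generator makes the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * fractions[0])
    n_valid = int(len(shuffled) * fractions[1])
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    # Whatever remains after rounding goes to the test set.
    test = shuffled[n_train + n_valid:]
    return train, valid, test


# Hypothetical usage: 396 record identifiers standing in for real rows.
train, valid, test = partition(range(396))
print(len(train), len(valid), len(test))
```

The model is fitted on the training set, competing models are compared on the held-out data, and only the chosen model is then used to score new data.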

SAS System at Comenius University Bratislava (CU)

- November 1999: licence contract signed between CU Bratislava and SAS Institute GmbH, providing 50 licences of the SAS System
- November 2001: addendum to the licence contract covering Enterprise Guide

SAS System at Faculty of Management Bratislava (FM)

- Faculty of Management: 25 licences
- SAS teaching began with V 6.12 in the summer term of academic year 1999/2000
- Currently: SAS V8.2 and Enterprise Guide V2.0

Subjects of Statistics

3 compulsory subjects:
- Introduction to Statistics (1st year, summer term, 4 hours/week)
- Statistics on PC (2nd year, winter term, 2 hours/week)
- Statistical Methods (2nd year, summer term, 4 hours/week)

2 elective subjects:
- Quantitative methods (in SAS System) (3rd year, summer term, 2 hours/week)
- Time series analysis (in SAS System) (3rd year, summer term, 2 hours/week)

Subjects contents

Contents of compulsory subjects:
- methods of mathematical statistics, covered by the basic modules (SAS/BASE, SAS/STAT, SAS/ETS)

Contents of elective subjects:
- logistic regression, principal components analysis (PCA), cluster analysis, factor analysis, discriminant analysis (SAS/STAT, SAS/EG)
- time series analysis: ARIMA models (SAS/EG)

Example – Logistic model

Sample of 396 applicants for credit.

Independent variables Xi (categorical):

Variable                 Name      Values
Age (classes)            vek       8
Gender                   pohl      2 (0 = male, 1 = female)
Income (classes)         plat      8
Number of dependants     vyz_os    4
Job duration (classes)   trv_zam   6

Dependent variable Y (binary):

Variable   Values
Credit     2 (1 = assigned, 0 = not assigned)

Logistic regression model

Conditional probability $P(Y=1 \mid X) = p$:
$p = \dfrac{1}{1 + e^{-(\alpha + \beta' X)}}$

Odds $p/(1-p)$:
$\dfrac{p}{1-p} = e^{\alpha + \beta' X}$

Log odds (logit), a linear function of the predictors:
$\operatorname{logit}(p) = \log\dfrac{p}{1-p} = \alpha + \beta' X$

Significance of Variables

Analysis of Effects Not in the Model

Effect    DF   Score Chi-Square   Pr > ChiSq
pohl       1         0.8121         0.3675
vek        1        48.0791        <.0001
vyz_os     1        39.6707        <.0001
plat       1        41.4197        <.0001
trv_zam    1        33.9234        <.0001

Estimates of the model's parameters

Analysis of Maximum Likelihood Estimates

Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept    1    -6.3538       0.7073           80.6885         <.0001
vek          1     0.3916       0.0871           20.2308         <.0001
vyz_os       1     0.8109       0.1539           27.7692         <.0001
plat         1     0.7182       0.1264           32.2918         <.0001
trv_zam      1     0.6000       0.1155           27.0075         <.0001

Odds Ratio Estimates

Effect    Point Estimate   95% Wald Confidence Limits
vek           1.479            1.247    1.755
vyz_os        2.250            1.664    3.042
plat          2.051            1.601    2.627
trv_zam       1.822            1.453    2.285

Logistic model – final

Logit function:
$\log\dfrac{p}{1-p} = -6.35 + 0.39\,\mathrm{vek} + 0.81\,\mathrm{vyz\_os} + 0.72\,\mathrm{plat} + 0.6\,\mathrm{trv\_zam}$

Probability function:
$p = \dfrac{1}{1 + e^{-(-6.35 + 0.39\,\mathrm{vek} + 0.81\,\mathrm{vyz\_os} + 0.72\,\mathrm{plat} + 0.6\,\mathrm{trv\_zam})}}$

Odds function:
$\dfrac{p}{1-p} = e^{-6.35 + 0.39\,\mathrm{vek} + 0.81\,\mathrm{vyz\_os} + 0.72\,\mathrm{plat} + 0.6\,\mathrm{trv\_zam}}$

Interpretation (example):
The odds that a client is assigned credit roughly double with each higher income class, because $e^{0.72} \approx 2.05$, where 0.72 is the coefficient of the income variable (plat) in the logit function.

Measures of association

Association of Predicted Probabilities and Observed Responses

Percent Concordant    82.0     Somers' D   0.647
Percent Discordant    17.3     Gamma       0.652
Percent Tied           0.7     Tau-a       0.316
Pairs                38180     c           0.824

Logistic S-curve

[Figure: logistic S-curve. x-axis: income classes (0 to 8); y-axis: probability of credit assignment (0 to 1)]

SAS System – Menu-driven tools

Overview of the modules and applications of SAS System V8.2 for building statistical analyses in menu mode (knowledge of SAS code is not required):
- SAS/ASSIST software
- SAS/INSIGHT software
- SAS Analyst
- SAS/Enterprise Guide

Activities

Outputs from SAS education:
- Projects: output from each subject
- Student Research Activity Competition: 3rd year, ca. 15 works per year
- Theses:
  - information system (module AF)
  - data analysis (modules BASE, STAT, QC, ...)
  - scorecard (Enterprise Guide, Enterprise Miner)
- SAS Forum conference: participation of teachers and students

Plans

Planned extension of SAS use to the following subjects:
- Multidimensional Methods of Analysis
- Time Series Analysis
- Marketing Research
- Data Mining
- Financial Analysis
- Quality Control
- Operational Management

Thanks for your attention!
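
As a supplementary sketch, the fitted logistic model from the credit example can be evaluated numerically. The coefficients below are copied from the maximum likelihood estimates table; everything else is an assumption made for illustration: the use of Python rather than SAS, the function names, and the applicant's class codes (the deck does not state how the classes are numbered, so integer codes are assumed).

```python
import math

# Estimates copied from the "Analysis of Maximum Likelihood Estimates" table.
INTERCEPT = -6.3538
COEFFICIENTS = {"vek": 0.3916, "vyz_os": 0.8109, "plat": 0.7182, "trv_zam": 0.6000}


def logit(applicant):
    """Linear predictor alpha + beta'X for one applicant."""
    return INTERCEPT + sum(beta * applicant[name] for name, beta in COEFFICIENTS.items())


def odds(applicant):
    """Odds p/(1-p) = exp(alpha + beta'X)."""
    return math.exp(logit(applicant))


def probability(applicant):
    """Probability p = 1 / (1 + exp(-(alpha + beta'X)))."""
    return 1.0 / (1.0 + math.exp(-logit(applicant)))


def odds_ratios():
    """Odds ratio per one-class increase of each variable: exp(beta)."""
    return {name: math.exp(beta) for name, beta in COEFFICIENTS.items()}


# Hypothetical applicant; the integer class codes are assumed, not from the source.
applicant = {"vek": 4, "vyz_os": 2, "plat": 5, "trv_zam": 3}
higher_income = dict(applicant, plat=applicant["plat"] + 1)

print("probability:", probability(applicant))
print("odds ratio for one income class:", odds(higher_income) / odds(applicant))
print("odds ratios:", odds_ratios())
```

Moving the applicant up one income class multiplies the odds by exp(0.7182) regardless of the other values, which is the "approximately 2-times" interpretation given for plat and matches the point estimate in the odds ratio table.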