OFFICE HOURS: Thursday
9.30-11.30
[email protected]
CLASSES
SGI+SSE: 21 September 2015 - 21 October 2015, LAB 717
Monday 10.30-13.30
Wednesday 10.30-13.30
Thursday 14.30-17.30
SGI: 22 October 2015 - 5 November 2015
Thursday 22 October 2015 14.30-17.30
Monday 26 October 2015 14.30-17.30
Tuesday 3 November 2015 10.30-13.30
Tuesday 3 November 2015 14.30-17.30
Wednesday 4 November 2015 10.30-13.30
Wednesday 4 November 2015 14.30-17.30
Thursday 5 November 2015 14.30-17.30
NB: make-up week
Distribution of topics across the weeks (estimate)
SGI+SSE
Weeks 1-2-3: introduction to DM, introduction to classification
(qualitative targets)
Weeks 4-5: binary/polytomous logistic regression, decision trees,
nearest neighbour, naive Bayes
Week 6: assessment
SGI
Weeks 7-8-9: neural networks, PCR, PLS and an outline of
quantitative targets (association rules if time allows)
SYLLABUS: files provided
DATA MINING, MODELS FOR DM,
http://www.statistica.unimib.it/utenti/lovaglio
course files
P0 I1: part 0, SEC 1
P0 I1: part 0, SEC 2
P1 I2: part 1, SEC 1
P1 I2: part 1, SEC 2
…etc.
The files are ordered by part and section
Exam topics
For everyone
SGI only
Software:
SAS Base/STAT and SAS Enterprise Miner Workstation 7.1
SAS AT HOME
http://www.unimib.it/go/47940/Home/Italiano/Servizi-informatici/Softwarecampus/SAS
EXAM: ORAL (THEORY + analysis output)
Bring to the oral exam the output of an application (data sets chosen from the
instructor's website or from other repositories) in which the following
analysis is required: with a binary QUALITATIVE TARGET, define profits, priors,
etc., do the pre-processing (missing values, difchi, covariate transformation,
collinearity and separation), then run and specify SEVERAL models: binary
LOGISTIC REGRESSION, CLASSIFICATION TREES, Nearest Neighbour* and other
options (e.g. on the original or cleaned data set, with the original X,
transformed X, or components of the X). Compare performance across models and
use the best one to score new cases.
NB: remember to specify which validation/robustness technique you use:
cross-validation or a validation data set
NB2: to do the scoring, if no score data set exists, randomly take 10% of the
observations of the original data set, remove the target, and use that as the score set.
The starting data set (to be split into training and validation) will then contain 90%
of the observations.
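The 90/10 hold-out described in NB2 can be sketched in plain Python; the function name and the toy "bad" target below are illustrative, not part of the course material:

```python
import random

def make_score_set(rows, target_key, frac=0.10, seed=42):
    """Hold out `frac` of the rows as a score set (target dropped);
    the remaining 90% is what later gets split into training/validation."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_score = max(1, int(len(shuffled) * frac))
    # The score set must not contain the target variable.
    score = [{k: v for k, v in r.items() if k != target_key}
             for r in shuffled[:n_score]]
    modeling = shuffled[n_score:]
    return modeling, score

# Hypothetical toy data: 20 observations with a binary target "bad".
data = [{"x": i, "bad": i % 2} for i in range(20)]
modeling, score = make_score_set(data, "bad")
```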
*Among the various models, SGI students also implement a neural network.
SAS ENTERPRISE MINER
EXAMPLE: binary target
SAS offers six internationally recognized certifications, including:
SAS Certified Predictive Modeler using SAS Enterprise Miner 7 Credential
Designed for SAS Enterprise Miner users who perform predictive analytics
During this performance-based examination, candidates will use SAS Enterprise Miner to perform the examination tasks. It is essential that the candidate
have a firm understanding and mastery of the functionalities for predictive modeling available in SAS Enterprise Miner 7.
Successful candidates should have the ability to: prepare data, build predictive models, assess models, score new data sets, implement models.
Required Exam
Exam: candidates will use SAS Enterprise Miner to perform this exam
61 multiple-choice questions (a score of 70% correct is required to pass)
3 hours to complete the exam
Exam topics include:
Data Preparation, Starting a new Enterprise Miner project, Missing values , Initial data exploration including data, visualization/measurement levels or
scales/variable reduction, Transformation/recoding/binning
Predictive Models: Data splitting/balancing/overfitting/oversampling, Logistic/linear regression, Artificial neural networks (MLP), Decision trees, Variable
importance/odds ratio, Profit/loss/prior probabilities
Model Assessment: Comparison between models/lift chart/ROC/profit & loss, Assessment of a single model
Scoring and Implementation: Score a data set, Model implementation
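The cumulative lift listed under Model Assessment can be computed by hand; a minimal Python sketch on hypothetical scores and labels (not from any course data set):

```python
def lift_at_depth(scores, labels, depth=0.1):
    """Cumulative lift: response rate in the top `depth` fraction of cases,
    ranked by predicted probability, divided by the overall response rate."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    n_top = max(1, int(len(ranked) * depth))
    top_rate = sum(y for _, y in ranked[:n_top]) / n_top
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate

# Toy scored validation data: a perfect ranking of 10 cases, 2 events.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```

With a perfect ranking, the lift at 20% depth is 1.0 / 0.2 = 5, and at 100% depth it is always 1.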
R
Overview of the R package Rattle:
install.packages("RGtk2")
install.packages("rattle")
library(rattle)
rattle()
EXAM
http://datamining.togaware.com/survivor/Getting_Started.html
Classification Models in Rattle
Assessment tool in Rattle
http://orange.biolab.si/getting-started/
Data Sets (in addition to those provided by me)
Data Repositories
1. Open Gov. Data: www.data.gov, www.data.gov.uk, www.data.gov.fr,
http://opengovernmentdata.org/data/catalogues/
2. Kaggle: www.kaggle.com
3. KDnuggets: http://www.kdnuggets.com/datasets/
4. UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/
5. StatLib: http://lib.stat.cmu.edu
6. TwitteR: http://cran.r-project.org/web/packages/twitteR/index.html
7. rfigshare: http://figshare.com, http://cran.r-project.org/web/packages/rfigshare/index.html
IDS data sets
Data Sets for Data Mining
Competition Data Set
UCI Machine learning repository
Quest data repository
KDNuggets
DATA MINING TUTORIAL
http://www.autonlab.org/tutorials/
Datasets for "The Elements of Statistical Learning"
http://www-bcf.usc.edu/~gareth/ISL/data.html
http://statweb.stanford.edu/~tibs/ElemStatLearn/
14-cancer microarray data: Info Training set gene expression , Training set class labels , Test set gene
expression , Test set class labels .
The indices in the cross-validation folds used in Sec 18.3 are listed in CV folds.
Bone Mineral Density: Info Data
Countries: Info Data
Galaxy: Info Data
Los Angeles Ozone: info Data
Marketing: Info Data
Mixture Simulation: Info Data
NCI (microarray): Info Data
Ozone: Info Data
Phoneme: Info Data
Prostate: Info Data
Protein flow cytometry data: Info Data
Covariance matrix
Radiation sensitivity data: Info gene expression data
outcome
SRBCT microarray data: Info Training set gene expression , Training set class labels , Test set gene
expression , Test set class labels
Signatures data: Info Data
Skin of the Orange (Section 12.3.4): Info Data
South African Heart Disease: Info Data
Vowel: Info, Training and Test data.
Waveform: Info, Training and Test data, and a generating function waveform.S (Splus or R).
ZIP code: Info, gzipped Training and Test data.
Spam: Info Data and test set Indicator
For more information, see the UCI spambase directory.
Data set from the Spanish course "Mineria de datos"
http://www.lsi.upc.edu/~belanche/Docencia/mineria/mineria.html
The SAMPSIO library
contains both real and fictitious data sets. To see this library, submit:
libname sampsio 'C:\Program Files\SASHome\SASFoundation\9.4\dmine\sample'; run;
1. You can use the DMA[xxxx] data sets (e.g. dmafish) and a Data Partition node to create your training,
validation, and test data.
OR
2. The DML[xxxx] data sets (e.g. dmlfish) contain input and target values for training.
You can use the DMT[xxxx] data sets as test data for comparing models.
A few data sets are also available as validation (DMV[xxxx]) and score (DMS[xxxx]) data sets
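The training/validation/test split that the Data Partition node performs can be mimicked in plain Python; the 60/20/20 fractions and names below are illustrative, not SAS defaults:

```python
import random

def partition(rows, fracs=(0.6, 0.2, 0.2), seed=1):
    """Randomly split rows into training, validation, and test sets,
    mirroring what a Data Partition node does with a DMA* data set."""
    assert abs(sum(fracs) - 1.0) < 1e-9
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * fracs[0])
    n_valid = int(n * fracs[1])
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test

rows = list(range(100))
train, valid, test = partition(rows)
```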
Data set names:
DMAHMEQ, DMLHMEQ, DMVHMEQ, DMTHMEQ, HMEQ
Variable | Model Role | Measurement | Description
bad      | target     | binary      | default or seriously delinquent
clage    | input      | interval    | age of oldest trade line in months
clno     | input      | interval    | number of trade (credit) lines
debtinc  | input      | interval    | debt to income ratio
delinq   | input      | interval    | number of delinquent trade lines
derog    | input      | interval    | number of major derogatory reports
job      | input      | nominal     | job category
loan     | input      | interval    | amount of current loan request
mortdue  | input      | interval    | amount due on existing mortgage
ninq     | input      | interval    | number of recent credit inquiries
reason   | input      | binary      | home improvement or debt consolidation
value    | input      | interval    | value of current property
yoj      | input      | interval    | years on current job
Mushrooms: characteristics predicting whether a mushroom is poisonous or edible
DMAMUSH, DMLMUSH, DMTMUSH
Variable | Model Role | Measurement | Description
bruises  | input  | nominal | bruises
capcolor | input  | nominal | cap color
capshape | input  | nominal | cap shape
capsurf  | input  | nominal | cap surface
gillatta | input  | nominal | gill attachment
gillcolo | input  | nominal | gill color
gillsize | input  | nominal | gill size
gillspac | input  | nominal | gill spacing
habitat  | input  | nominal | habitat
odor     | input  | nominal | odor
populat  | input  | nominal | population
ringnumb | input  | nominal | ring number
ringtype | input  | nominal | ring type
sporepc  | input  | nominal | spore print color
stalkcar | input  | nominal | stalk color above ring
stalkcbr | input  | nominal | stalk color below ring
stalkroo | input  | nominal | stalk root
stalksar | input  | nominal | stalk surface above ring
stalksbr | input  | nominal | stalk surface below ring
stalksha | input  | nominal | stalk shape
target   | target | binary  | poisonous or edible
veilcolo | input  | nominal | veil color
ceiltype | input  | nominal | veil type
Vessels: heart attack, narrowing of the blood vessels
DMAHART, DMLHART, DMTHART
Variable | Model Role | Measurement | Description
age      | input  | interval | age
bpress   | input  | interval | resting blood pressure
bsugar   | input  | binary   | fasting blood sugar > 120 mg/dl
ca       | input  | ordinal  | number of major vessels (0-3) colored by fluoroscopy
chol     | input  | interval | serum cholesterol in mg/dl
ekg      | input  | nominal  | resting electrocardiographic results
exang    | input  | binary   | exercise induced angina
oldpeak  | input  | interval | ST depression induced by exercise relative to rest
pain     | input  | nominal  | chest pain type
sex      | input  | binary   | sex
slope    | input  | ordinal  | slope of the peak exercise ST segment
target   | target | ordinal  | number of major vessels (0-4) reduced in diameter by more than 50%
thal     | input  | nominal  | thal
thalach  | input  | interval | maximum heart rate achieved
SAS at the UCLA Statistics website:
http://www.ats.ucla.edu/stat/
http://www.ats.ucla.edu/stat/sas/
http://www.ats.ucla.edu/stat/dae/ Data examples
Learning SAS base
http://www.biostat-edu.com/ProgramNotes.html
http://web.utk.edu/sas/OnlineTutor/1.2/en/60476/paths.htm
Data mining case studies
http://megaputer.com/site/success_stories.php
Analysis and Forecasting of House Price Indices
Customer Response Prediction and Profit Optimization
Predictive Modeling of Big Data with Limited Memory
See http://www.rdatamining.com/docs chapter 12-13-14
Scandinavian Airlines Modernize Business Intelligence Capabilities
http://www.alsharif.info/#!iom530/c21o7
Other useful material (well done)
http://zlin.ba.ttu.edu/6347/notes13.htm
http://www.lsi.upc.edu/~belanche/Docencia/mineria/mineria.html
http://psi.cse.tamu.edu/teaching/lecture_notes/
http://www.autonlab.org/tutorials/
Papers and further reading
1. Evaluating Performance of Classifiers
 Compare the bias and variance of models generated using different evaluation methods (leave one out,
cross validation, bootstrap, stratification, etc.)
 References:
a. Kohavi, R., A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection
(1995)
b. Efron, B. and Tibshirani, R., Cross-Validation and the Bootstrap: Estimating the Error Rate of a
Prediction Rule (1995)
c. Martin, J.K., and Hirschberg, D.S., Small Sample Statistics for Classification Error Rates I: Error Rate
Measurements (1996)
d. Dietterich, T.G., Approximate Statistical Tests for Comparing Supervised Classification Learning
Algorithms (1998)
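The two error-estimation schemes being compared in topic 1 (k-fold cross-validation and the out-of-bag bootstrap of Efron and Tibshirani) can be sketched on a trivial majority-class classifier; everything below is a toy illustration, not code from the referenced papers:

```python
import random

def majority_label(labels):
    """The constant classifier: always predict the most frequent label."""
    return max(set(labels), key=labels.count)

def cv_error(labels, k=5, seed=0):
    """k-fold cross-validation error of the majority-class classifier."""
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    errs = []
    for fold in folds:
        train = [labels[i] for i in idx if i not in fold]
        pred = majority_label(train)
        errs.append(sum(labels[i] != pred for i in fold) / len(fold))
    return sum(errs) / k

def bootstrap_error(labels, reps=50, seed=0):
    """Out-of-bag bootstrap error estimate of the same classifier."""
    rng = random.Random(seed)
    n = len(labels)
    errs = []
    for _ in range(reps):
        sample = [rng.randrange(n) for _ in range(n)]
        oob = [i for i in range(n) if i not in set(sample)]
        if not oob:
            continue
        pred = majority_label([labels[i] for i in sample])
        errs.append(sum(labels[i] != pred for i in oob) / len(oob))
    return sum(errs) / len(errs)

# Hypothetical 70/30 binary labels; the true error of "always predict 0" is 0.30.
labels = [0] * 70 + [1] * 30
```

Comparing the two estimates against the known 0.30 error rate is exactly the kind of bias/variance experiment the topic asks for.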
2. Support Vector Machine (SVM)
 Present an overview of SVM or apply Support Vector Machines to various application domains.
 References:
a. Mangasarian, O.L., Data Mining via Support Vector Machines (2001)
b. Burges, C.J.C., A Tutorial on Support Vector Machines for Pattern Recognition (1998)
c. Joachims, T., Text Categorization with Support Vector Machines: Learning with Many Relevant Features (1998)
d. Salomon, J., Support Vector Machines for Phoneme Classification (2001)
3. Cost-sensitive learning
 A comparative study and implementation of different techniques for ensemble learning such as bagging,
boosting, etc.
 References:
a. Freund Y. and Schapire, R.E., A short introduction to boosting (1999)
b. Joshi, M.V., Kumar, V., Agrawal, R., Predicting Rare Classes: Can Boosting Make Any Weak Learner
Strong? (2002)
c. Quinlan, J.R., Boosting, Bagging and C4.5 (1996)
d. Bauer, E., Kohavi, R., An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting,
and Variants (1999)
4. Semi-supervised learning (classification with labeled and unlabeled data)
 Applying different semi-supervised learning techniques to UCI data sets.
 References:
a. Nigam, K., Using Unlabeled Data to Improve Text Classification (2001)
b. Seeger, M., Learning with labeled and unlabeled data (2001)
c. Nigam, K. and Ghani, R., Analyzing the Effectiveness and Applicability of Co-training (2000)
d. Vittaut, J.N., Amini, M-R., Gallinari, P., Learning Classification with Both Labeled and Unlabeled Data (2002)
5. Classification for rare-class problems
 A comparative study and/or implementation of different classification techniques to analyze rare class
problems
 References:
a. Joshi, M.V., and Agrawal, R., PNrule: A New Framework for Learning Classifier Models in Data Mining
(A Case-study in Network Intrusion Detection) (2001)
b. Joshi, M.V., Agrawal, R., and Kumar, V., Mining Needles in a Haystack: Classifying Rare Classes via
Two-Phase Rule Induction (2001)
c. Joshi, M.V., Kumar, V., Agrawal, R., Predicting Rare Classes: Can Boosting Make Any Weak Learner
Strong? (2002)
d. Joshi, M.V., Kumar, V., Agrawal, R., On Evaluating Performance of Classifiers for Rare Classes (2002)
6. Time Series Prediction/Classification
 A comparative study and/or implementation of time series prediction/classification techniques
 References:
a. Geurts, P., Pattern Extraction for Time Series Classification (2001)
b. Kadous, M.W., A General Architecture for Supervised Classification of Multivariate Time Series (1998)
c. Giles, C.L., Lawrence, S. and Tsoi, A.C., Noisy Time Series Prediction using a Recurrent Neural Network and Grammatical Inference (2001)
d. Keogh, E.J. and Pazzani, M.J., An enhanced representation of time series which allows fast and accurate
classification, clustering and relevance feedback (1998)
e. Chatfield, C., The Analysis of Time Series, Chapman & Hall (1989)
7. Sequence Prediction
 A comparative study and implementation of sequence prediction techniques
 References:
a. Laird, P.D., Saul, R., Discrete Sequence Prediction and Its Applications. Machine Learning, 15(1): 43-68 (1994)
b. Sun, R. and Lee Giles, C., Sequence Learning: From Recognition and Prediction to Sequential Decision
Making (2001)
c. Lesh, N., Zaki, M.J., and Ogihara, M., Mining features for Sequence Classification (1999)
8. Association Rules for Classification
 A comparative study and implementation of classification using association patterns (rules and itemsets)
 References:
a. Liu, B., Hsu, W., and Ma, Y., Integrating Classification and Association Rule Mining (1998)
b. Liu, B., Ma, Y. and Wong, C-K, Classification Using Association Rules: Weaknesses and Enhancements (2001)
c. Li, W., Han, J. and Pei, J., CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules (2001)
d. Deshpande, M. and Karypis, G., Using Conjunction of Attribute Values for Classification (2002)
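The support and confidence measures by which class-association-rule methods such as CBA (reference a) rank rules can be sketched directly; the toy transactions below, with the class encoded as an item, are hypothetical:

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent,
    the two measures used to rank class-association rules."""
    a = frozenset(antecedent)
    c = frozenset(consequent)
    n = len(transactions)
    n_a = sum(1 for t in transactions if a <= t)        # antecedent matches
    n_ac = sum(1 for t in transactions if (a | c) <= t)  # whole rule matches
    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    return support, confidence

# Hypothetical mushroom-style transactions; "class=..." is the target item.
tx = [frozenset(t) for t in (
    {"odor=foul", "class=poisonous"},
    {"odor=foul", "class=poisonous"},
    {"odor=none", "class=edible"},
    {"odor=none", "class=edible"},
    {"odor=foul", "class=edible"},
)]
sup, conf = rule_metrics(tx, {"odor=foul"}, {"class=poisonous"})
```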
9. Spatial Association Rule Mining
 A comparative study on spatial association rule mining.
 References:
a. Koperski, K., and Han, J., Discovery of Spatial Association Rules in Geographic Information Databases
(1995)
b. Shekhar, S. and Huang, Y., Discovering Spatial Co-location Patterns: A Summary of Results (2001)
c. Malerba, D., Esposito, F. and Lisi, F., Mining Spatial Association Rules in Census Data (2001)
10. Temporal Association Rule Mining
 A comparative study and/or implementation of temporal association rule mining techniques
 References:
a. Li, Y., Ning, P., Wang, and S., Jajodia, S., Discovering Calendar-based Temporal Association Rules
(2001)
b. Chen, X. and Petrounias, Mining temporal features in association rules
c. Lee, C.H., Lin, C.R. and Chen, M.S., On Mining General Temporal Association Rules in a Publication
Database (2001)
d. Ozden, B., Ramaswamy, Silberschatz, Cyclic Association Rules (1998)
e. Literature on Sequential Association Rule Mining below
11. Sequential Association Rule Mining
 A comparative study and/or implementation of sequential association rule mining techniques
 References:
a. Srikant, R. and Agrawal, R., Mining Sequential Patterns: Generalizations and Performance Improvements
(1996)
b. Mannila, H. and Toivonen, H., Verkamo, A.I., Discovery of Frequent Episodes in Event Sequences
(1997)
c. Joshi, M., Karypis, G., and Kumar, V., A Universal Formulation of Sequential Patterns (1999)
d. Borges J., and Levene, M., Mining Association Rules in Hypertext Databases (1998)
12. Outlier Detection
 A comparative study and/or implementation of outlier detection techniques.
 References:
a. Knorr, Ng, A Unified Notion of Outliers: Properties and Computation (1997)
b. Knorr, Ng, Algorithms for Mining Distance-Based Outliers in Large Datasets (1998)
c. Breunig, Kriegel, Ng, Sander, LOF: Identifying Density-Based Local Outliers (2000)
d. Aggarwal, Yu, Outlier Detection for High Dimensional Data (2001)
e. Tang, Chen, Fu, Cheung, A Robust Outlier Detection Scheme for Large Data Sets (2001)
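Knorr and Ng's distance-based DB(p, D) outlier definition from references (a) and (b) of topic 12 is simple enough to sketch directly; the 1-D toy data is hypothetical:

```python
def db_outliers(points, p=0.95, D=2.0):
    """Knorr & Ng's DB(p, D) definition: a point is an outlier if at least
    a fraction p of the other points lie farther than distance D from it."""
    outliers = []
    for i, a in enumerate(points):
        others = [b for j, b in enumerate(points) if j != i]
        far = sum(1 for b in others if abs(a - b) > D)
        if far / len(others) >= p:
            outliers.append(a)
    return outliers

# Hypothetical 1-D data: a tight cluster plus one isolated value.
data = [1.0, 1.1, 0.9, 1.2, 0.8, 10.0]
```

This naive version is O(n^2); the 1998 paper is about making the same test scale to large data sets.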
13. Parallel Formulations of Clustering
 Study and possible implementation of parallel formulations of clustering techniques.
 References:
a. Olson, Parallel Algorithms for Hierarchical Clustering (1993)
b. Nagesh, High Performance Subspace Clustering for Massive Data Sets (1999)
c. Skillicorn, Strategies for Parallel Data Mining (1999)
d. Dhillon, Modha, A Data-Clustering Algorithm On Distributed Memory Multiprocessors (2000)
14. Clustering of Time Series
 Study and possible implementation of time series clustering techniques on actual NASA time series data.
 References:
a. Oates, Clustering Time Series with Hidden Markov Models and Dynamic Time Warping - 1999
b. Konstantinos Kalpakis, Dhiral Gada, and Vasundhara Puttagunta, Distance Measures for Effective
Clustering of ARIMA Time Series
c. Tim, Identifying Distinctive Subsequences in Multivariate Time Series by Clustering – 1999
15. Scalable clustering algorithms
 A comparative study of scalable data mining techniques.
 References:
a. Tian Zhang, BIRCH: An Efficient Data Clustering Method for Very Large Databases (1996)
b. Ganti, Ramakrishnan, Clustering Large Datasets in Arbitrary Metric Spaces (1998)
c. Bradley, Fayyad, Reina, Scaling Clustering Algorithms to Large Databases (1998)
d. Farnstrom, Lewis, Elkan, Scalability for Clustering Algorithms Revisited (2000)
16. Clustering association rules and frequent item sets
 A comparative study of techniques for clustering association rules.
 References:
a. Toivonen, Klemettinen, Pruning and Grouping Discovered Association Rules (1995)
b. Lent, Swami, Widom, Clustering Association Rules (1997)
c. Gunjan K. Gupta, Alexander Strehl and Joydeep Ghosh, Distance Based Clustering of Association Rules