OFFICE HOURS: Thursday 9:30-11:30, [email protected]

LECTURES SGI+SSE: 21 September 2015 - 21 October 2015, LAB 717
Monday 10:30-13:30
Wednesday 10:30-13:30
Thursday 14:30-17:30

SGI: 22 October 2015 - 5 November 2015
Thursday 22 October 2015, 14:30-17:30
Monday 26 October 2015, 14:30-17:30
Tuesday 3 November 2015, 10:30-13:30
Tuesday 3 November 2015, 14:30-17:30
Wednesday 4 November 2015, 10:30-13:30
Wednesday 4 November 2015, 14:30-17:30
Thursday 5 November 2015, 14:30-17:30
NB: make-up week.

Distribution of topics across the weeks (estimate)
SGI+SSE
Weeks 1-3: introduction to Data Mining, introduction to classification (qualitative targets)
Weeks 4-5: binary/polytomous logistic regression, decision trees, Nearest Neighbour, Naive Bayes
Week 6: assessment
SGI
Weeks 7-9: neural networks, PCR, PLS and an outline of quantitative targets (association rules if time allows)

SYLLABUS: files provided
DATA MINING, MODELS FOR DM: http://www.statistica.unimib.it/utenti/lovaglio
Course files:
P0 I1: part 0, section 1
P0 I1: part 0, section 2
P1 I2: part 1, section 1
P1 I2: part 1, section 2
.....etc.
The files are ordered by part and section.

Exam topics: for everyone / SGI only

SOFTWARE: SAS Base/STAT and SAS Enterprise Miner Workstation 7.1
SAS at home: http://www.unimib.it/go/47940/Home/Italiano/Servizi-informatici/Softwarecampus/SAS

EXAM: oral (theory + analysis output). Bring to the oral the outputs of an application (datasets chosen from the instructor's site or from other repositories) in which the following analysis is required: with a binary QUALITATIVE TARGET, define profits, priors, etc., do the pre-processing (missing values, difchi, covariate transformations, collinearity and separation), then run and specify SEVERAL models:
- binary LOGISTIC REGRESSION, CLASSIFICATION TREES, Nearest Neighbour* and other options (e.g. on the initial or cleaned dataset, with the original X, transformed X, or components of the X).
Compare performance across the models and use the best one to score new cases.
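Among the required models, Nearest Neighbour is the easiest to illustrate outside SAS Enterprise Miner. A minimal pure-Python sketch of k-nearest-neighbour classification by majority vote (the feature values, labels and k below are invented for the example, not taken from the course datasets):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training points (Euclidean distance)."""
    # Sort training points by distance to the query, keep the k closest.
    neighbours = sorted(train, key=lambda xy: math.dist(xy[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy training set: (features, binary target), invented for illustration.
train = [((1.0, 1.0), 0), ((1.5, 2.0), 0), ((3.0, 4.0), 1),
         ((5.0, 7.0), 1), ((3.5, 5.0), 1), ((2.0, 1.0), 0)]

print(knn_predict(train, (1.2, 1.1), k=3))  # near the 0-cluster -> 0
print(knn_predict(train, (4.0, 5.5), k=3))  # near the 1-cluster -> 1
```

In SAS Enterprise Miner the same idea sits behind the Memory-Based Reasoning node; the sketch only shows the mechanics.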
NB: remember to specify which validation/robustness technique you use: cross-validation or a validation dataset.
NB2: for the scoring, if no score dataset exists, randomly take 10% of the observations of the initial dataset, delete the target and use them as the score set. The starting dataset (to be split into training and validation) will then contain 90% of the observations.
*SGI students also implement a neural network among the various models.

SAS ENTERPRISE MINER: binary target example
SAS offers six internationally recognised certifications, among them the SAS Certified Predictive Modeler using SAS Enterprise Miner 7 credential, designed for SAS Enterprise Miner users who perform predictive analytics. During this performance-based examination, candidates will use SAS Enterprise Miner to perform the examination tasks. It is essential that the candidate have a firm understanding and mastery of the predictive-modelling functionality available in SAS Enterprise Miner 7. Successful candidates should have the ability to: prepare data, build predictive models, assess models, score new data sets, implement models.
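The recipe in NB2 can be sketched in a few lines. A minimal pure-Python illustration (the field name "bad" and the toy records are invented for the example, not taken from the course datasets):

```python
import random

def make_score_split(rows, target="bad", score_frac=0.10, seed=42):
    """Hold out `score_frac` of the rows as a score set with the target
    removed; the remaining 90% is what gets split into training/validation."""
    rng = random.Random(seed)
    rows = rows[:]            # don't mutate the caller's list
    rng.shuffle(rows)
    n_score = round(len(rows) * score_frac)
    # Score set: same inputs, target column dropped.
    score = [{k: v for k, v in r.items() if k != target} for r in rows[:n_score]]
    rest = rows[n_score:]
    return rest, score

# Toy data: 20 records with a binary target, invented for illustration.
data = [{"x1": i, "x2": i % 3, "bad": i % 2} for i in range(20)]
rest, score = make_score_split(data)
print(len(rest), len(score))   # 18 2
print("bad" in score[0])       # False: target removed from the score set
```

In Enterprise Miner the equivalent step is done with a Sample or Data Partition node; the sketch only mirrors the 90/10 logic described above.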
Required Exam: candidates will use SAS Enterprise Miner to perform this exam.
- 61 multiple-choice questions (must achieve a score of 70% correct to pass)
- 3 hours to complete the exam

Exam topics include:
Data Preparation: starting a new Enterprise Miner project; missing values; initial data exploration including data visualization, measurement levels or scales, variable reduction; transformation/recoding/binning.
Predictive Models: data splitting/balancing/overfitting/oversampling; logistic/linear regression; artificial neural networks (MLP); decision trees; variable importance/odds ratios; profit/loss/prior probabilities.
Model Assessment: comparison between models, lift chart, ROC, profit & loss; assessment of a single model.
Scoring and Implementation: score a data set; model implementation.

R
A brief look at the Rattle package for R:
install.packages("RGtk2")
install.packages("rattle")
library(rattle)
rattle()

EXAM
http://datamining.togaware.com/survivor/Getting_Started.html
Classification models in Rattle
Assessment tools in Rattle
http://orange.biolab.si/getting-started/

Data Sets (in addition to those I provide)
Data Repositories:
1. Open Gov. Data: www.data.gov, www.data.gov.uk, www.data.gov.fr, http://opengovernmentdata.org/data/catalogues/
2. Kaggle: www.kaggle.com
3. KDnuggets: http://www.kdnuggets.com/datasets/
4. UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/
5. StatLib: http://lib.stat.cmu.edu
6. twitteR: http://cran.r-project.org/web/packages/twitteR/index.html
7. rfigshare: http://figshare.com, http://cran.r-project.org/web/packages/rfigshare/index.html

IDS data sets
Data Sets for Data Mining
Competition Data Set
UCI Machine Learning repository
Quest data repository
KDnuggets

DATA MINING TUTORIAL: http://www.autonlab.org/tutorials/

Datasets for "The Elements of Statistical Learning"
http://www-bcf.usc.edu/~gareth/ISL/data.html
http://statweb.stanford.edu/~tibs/ElemStatLearn/
14-cancer microarray data: Info, Training set gene expression, Training set class labels, Test set gene expression, Test set class labels. The indices in the cross-validation folds used in Sec 18.3 are listed in CV folds.
Bone Mineral Density: Info, Data
Countries: Info, Data
Galaxy: Info, Data
Los Angeles Ozone: Info, Data
Marketing: Info, Data
Mixture Simulation: Info, Data
NCI (microarray): Info, Data
Ozone: Info, Data
Phoneme: Info, Data
Prostate: Info, Data
Protein flow cytometry data: Info, Data, Covariance matrix
Radiation sensitivity data: Info, gene expression data, outcome
SRBCT microarray data: Info, Training set gene expression, Training set class labels, Test set gene expression, Test set class labels
Signatures data: Info, Data
Skin of the Orange (Section 12.3.4): Info, Data
South African Heart Disease: Info, Data
Vowel: Info, Training and Test data
Waveform: Info, Training and Test data, and a generating function waveform.S (S-PLUS or R)
ZIP code: Info, gzipped Training and Test data
Spam: Info, Data and test set indicator. For more information, see the UCI spambase directory.

Datasets from the Spanish course "Mineria de datos": http://www.lsi.upc.edu/~belanche/Docencia/mineria/mineria.html

The SAMPSIO library contains both real and fictitious data sets. To see this library, type:

libname sampsio 'C:\Program Files\SASHome\SASFoundation\9.4\dmine\sample';
run;

1. You can use the DMA[xxxx] data sets (example: dmafish) and a Data Partition node to create your training, validation, and test data.
OR
2. The DML[xxxx] data sets (ex.
dmlfish) contain input and target values for training. You can use the DMT[xxxx] data sets as test data for comparing models. A few data sets are also available as validation (DMV[xxxx]) and score (DMS[xxxx]) data sets.

Dataset names: DMAHMEQ, DMLHMEQ, DMVHMEQ, DMTHMEQ, HMEQ

Variable   Model Role   Measurement   Description
bad        target       binary        default or seriously delinquent
clage      input        interval      age of oldest trade line in months
clno       input        interval      number of trade (credit) lines
debtinc    input        interval      debt to income ratio
delinq     input        interval      number of delinquent trade lines
derog      input        interval      number of major derogatory reports
job        input        nominal       job category
loan       input        interval      amount of current loan request
mortdue    input        interval      amount due on existing mortgage
ninq       input        interval      number of recent credit inquiries
reason     input        binary        home improvement or debt consolidation
value      input        interval      value of current property
yoj        input        interval      years on current job

Mushroom characteristics: poisonous or edible
DMAMUSH, DMLMUSH, DMTMUSH

Variable   Model Role   Measurement   Description
bruises    input        nominal       bruises
capcolor   input        nominal       cap color
capshape   input        nominal       cap shape
capsurf    input        nominal       cap surface
gillatta   input        nominal       gill attachment
gillcolo   input        nominal       gill color
gillsize   input        nominal       gill size
gillspac   input        nominal       gill spacing
habitat    input        nominal       habitat
odor       input        nominal       odor
populat    input        nominal       population
ringnumb   input        nominal       ring number
ringtype   input        nominal       ring type
sporepc    input        nominal       spore print color
stalkcar   input        nominal       stalk color above ring
stalkcbr   input        nominal       stalk color below ring
stalkroo   input        nominal       stalk root
stalksar   input        nominal       stalk surface above ring
stalksbr   input        nominal       stalk surface below ring
stalksha   input        nominal       stalk shape
target     target       binary        poisonous or edible
veilcolo   input        nominal       veil color
ceiltype   input        nominal       veil type

Vessels: heart attack, narrowing of the blood vessels
DMAHART, DMLHART, DMTHART

Variable   Model Role   Measurement   Description
age        input        interval      age
bpress     input        interval      resting blood pressure
bsugar     input        binary        fasting blood sugar > 120 mg/dl
ca         input        ordinal       number of major vessels (0-3) colored by fluoroscopy
chol       input        interval      serum cholesterol in mg/dl
ekg        input        nominal       resting electrocardiographic results
exang      input        binary        exercise induced angina
oldpeak    input        interval      ST depression induced by exercise relative to rest
pain       input        nominal       chest pain type
sex        input        binary        sex
slope      input        ordinal       slope of the peak exercise ST segment
target     target       ordinal       number of major vessels (0-4) reduced in diameter by more than 50%
thal       input        nominal       thal
thalach    input        interval      maximum heart rate achieved

SAS at the UCLA Statistics website
http://www.ats.ucla.edu/stat/
http://www.ats.ucla.edu/stat/sas/
http://www.ats.ucla.edu/stat/dae/ (data examples)

Learning SAS Base
http://www.biostat-edu.com/ProgramNotes.html
http://web.utk.edu/sas/OnlineTutor/1.2/en/60476/paths.htm

Data mining case studies
http://megaputer.com/site/success_stories.php
- Analysis and Forecasting of House Price Indices
- Customer Response Prediction and Profit Optimization
- Predictive Modeling of Big Data with Limited Memory
See http://www.rdatamining.com/docs, chapters 12-13-14
Scandinavian Airlines Modernize Business Intelligence Capabilities
http://www.alsharif.info/#!iom530/c21o7

Other useful material (well made)
http://zlin.ba.ttu.edu/6347/notes13.htm
http://www.lsi.upc.edu/~belanche/Docencia/mineria/mineria.html
http://psi.cse.tamu.edu/teaching/lecture_notes/
http://www.autonlab.org/tutorials/

Papers and further reading

1. Evaluating Performance of Classifiers
Compare the bias and variance of models generated using different evaluation methods (leave-one-out, cross-validation, bootstrap, stratification, etc.). References:
a. Kohavi, R., A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection (1995)
b. Efron, B.
and Tibshirani, R., Cross-Validation and the Bootstrap: Estimating the Error Rate of a Prediction Rule (1995)
c. Martin, J.K., and Hirschberg, D.S., Small Sample Statistics for Classification Error Rates I: Error Rate Measurements (1996)
d. Dietterich, T.G., Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms (1998)

2. Support Vector Machines (SVM)
Present an overview of SVMs, or apply Support Vector Machines to various application domains. References:
a. Mangasarian, O.L., Data Mining via Support Vector Machines (2001)
b. Burges, C.J.C., A Tutorial on Support Vector Machines for Pattern Recognition (1998)
c. Joachims, T., Text Categorization with Support Vector Machines: Learning with Many Relevant Features (1998)
d. Salomon, J., Support Vector Machines for Phoneme Classification (2001)

3. Cost-sensitive learning
A comparative study and implementation of different techniques for ensemble learning, such as bagging, boosting, etc. References:
a. Freund, Y. and Schapire, R.E., A Short Introduction to Boosting (1999)
b. Joshi, M.V., Kumar, V., Agrawal, R., Predicting Rare Classes: Can Boosting Make Any Weak Learner Strong? (2002)
c. Quinlan, J.R., Boosting, Bagging and C4.5 (1996)
d. Bauer, E., Kohavi, R., An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants (1999)

4. Semi-supervised learning (classification with labeled and unlabeled data)
Applying different semi-supervised learning techniques to UCI data sets. References:
a. Nigam, K., Using Unlabeled Data to Improve Text Classification (2001)
b. Seeger, M., Learning with Labeled and Unlabeled Data (2001)
c. Nigam, K. and Ghani, R., Analyzing the Effectiveness and Applicability of Co-training (2000)
d. Vittaut, J.N., Amini, M-R., Gallinari, P., Learning Classification with Both Labeled and Unlabeled Data (2002)

5. Classification for rare-class problems
A comparative study and/or implementation of different classification techniques to analyze rare-class problems. References:
a. Joshi, M.V., and Agrawal, R., PNrule: A New Framework for Learning Classifier Models in Data Mining (A Case-Study in Network Intrusion Detection) (2001)
b. Joshi, M.V., Agrawal, R., and Kumar, V., Mining Needles in a Haystack: Classifying Rare Classes via Two-Phase Rule Induction (2001)
c. Joshi, M.V., Kumar, V., Agrawal, R., Predicting Rare Classes: Can Boosting Make Any Weak Learner Strong? (2002)
d. Joshi, M.V., Kumar, V., Agrawal, R., On Evaluating Performance of Classifiers for Rare Classes (2002)

6. Time Series Prediction/Classification
A comparative study and/or implementation of time series prediction/classification techniques. References:
a. Geurts, P., Pattern Extraction for Time Series Classification (2001)
b. Kadous, M.W., A General Architecture for Supervised Classification of Multivariate Time Series (1998)
c. Giles, C.L., Lawrence, S. and Tsoi, A.C., Noisy Time Series Prediction using a Recurrent Neural Network and Grammatical Inference (2001)
d. Keogh, E.J. and Pazzani, M.J., An Enhanced Representation of Time Series which Allows Fast and Accurate Classification, Clustering and Relevance Feedback (1998)
e. Chatfield, C., The Analysis of Time Series, Chapman & Hall (1989)

7. Sequence Prediction
A comparative study and implementation of sequence prediction techniques. References:
a. Laird, P.D., Saul, R., Discrete Sequence Prediction and Its Applications, Machine Learning, 15(1): 43-68 (1994)
b. Sun, R. and Lee Giles, C., Sequence Learning: From Recognition and Prediction to Sequential Decision Making (2001)
c. Lesh, N., Zaki, M.J., and Ogihara, M., Mining Features for Sequence Classification (1999)

8. Association Rules for Classification
A comparative study and implementation of classification using association patterns (rules and itemsets). References:
a. Liu, B., Hsu, W., and Ma, Y., Integrating Classification and Association Rule Mining (1998)
b. Liu, B., Ma, Y. and Wong, C-K., Classification Using Association Rules: Weaknesses and Enhancements (2001)
c. Li, W., Han, J. and Pei, J., CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules (2001)
d. Deshpande, M. and Karypis, G., Using Conjunction of Attribute Values for Classification (2002)

9. Spatial Association Rule Mining
A comparative study on spatial association rule mining. References:
a. Koperski, K., and Han, J., Discovery of Spatial Association Rules in Geographic Information Databases (1995)
b. Shekhar, S. and Huang, Y., Discovering Spatial Co-location Patterns: A Summary of Results (2001)
c. Malerba, D., Esposito, F. and Lisi, F., Mining Spatial Association Rules in Census Data (2001)

10. Temporal Association Rule Mining
A comparative study and/or implementation of temporal association rule mining techniques. References:
a. Li, Y., Ning, P., Wang, S., Jajodia, S., Discovering Calendar-based Temporal Association Rules (2001)
b. Chen, X. and Petrounias, Mining Temporal Features in Association Rules
c. Lee, C.H., Lin, C.R. and Chen, M.S., On Mining General Temporal Association Rules in a Publication Database (2001)
d. Ozden, B., Ramaswamy, Silberschatz, Cyclic Association Rules (1998)
e. See also the literature on Sequential Association Rule Mining below

11. Sequential Association Rule Mining
A comparative study and/or implementation of sequential association rule mining techniques. References:
a. Srikant, R. and Agrawal, R., Mining Sequential Patterns: Generalizations and Performance Improvements (1996)
b. Mannila, H., Toivonen, H. and Verkamo, A.I., Discovery of Frequent Episodes in Event Sequences (1997)
c. Joshi, M., Karypis, G., and Kumar, V., A Universal Formulation of Sequential Patterns (1999)
d. Borges, J., and Levene, M., Mining Association Rules in Hypertext Databases (1998)

12. Outlier Detection
A comparative study and/or implementation of outlier detection techniques. References:
a. Knorr, Ng, A Unified Notion of Outliers: Properties and Computation (1997)
b. Knorr, Ng, Algorithms for Mining Distance-Based Outliers in Large Datasets (1998)
c. Breunig, Kriegel, Ng, Sander, LOF: Identifying Density-Based Local Outliers (2000)
d. Aggarwal, Yu, Outlier Detection for High Dimensional Data (2001)
e. Tang, Chen, Fu, Cheung, A Robust Outlier Detection Scheme for Large Data Sets (2001)

13. Parallel Formulations of Clustering
Study and possible implementation of parallel formulations of clustering techniques. References:
a. Olson, Parallel Algorithms for Hierarchical Clustering (1993)
b. Nagesh, High Performance Subspace Clustering for Massive Data Sets (1999)
c. Skillicorn, Strategies for Parallel Data Mining (1999)
d. Dhillon, Modha, A Data-Clustering Algorithm on Distributed Memory Multiprocessors (2000)

14. Clustering of Time Series
Study and possible implementation of time series clustering techniques on actual NASA time series data. References:
a. Oates, Clustering Time Series with Hidden Markov Models and Dynamic Time Warping (1999)
b. Kalpakis, K., Gada, D., and Puttagunta, V., Distance Measures for Effective Clustering of ARIMA Time Series
c. Tim, Identifying Distinctive Subsequences in Multivariate Time Series by Clustering (1999)

15. Scalable clustering algorithms
A comparative study of scalable data mining techniques. References:
a. Zhang, T., BIRCH: An Efficient Data Clustering Method for Very Large Databases (1999)
b. Ganti, Ramakrishnan, Clustering Large Datasets in Arbitrary Metric Spaces (1998)
c. Bradley, Fayyad, Reina, Scaling Clustering Algorithms to Large Databases (1998)
d. Farnstrom, Lewis, Elkan, Scalability for Clustering Algorithms Revisited (2000)

16. Clustering association rules and frequent item sets
A comparative study of techniques for clustering association rules. References:
a. Toivonen, Klemettinen, Pruning and Grouping Discovered Association Rules (1995)
b. Lent, Swami, Widom, Clustering Association Rules (1997)
c. Gupta, G.K., Strehl, A. and Ghosh, J., Distance Based Clustering of Association Rules
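Model assessment (lift charts, ROC) recurs both in the exam topics and in topic 1 above. A minimal pure-Python sketch of computing ROC AUC and the lift in the top-scored slice from a classifier's predicted probabilities (the labels and scores below are invented for the example):

```python
def roc_auc(labels, scores):
    """ROC AUC via the Mann-Whitney formulation: the probability that a
    random positive case gets a higher score than a random negative one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def lift_at(labels, scores, frac=0.2):
    """Lift in the top `frac` of cases ranked by score: the response rate
    in that slice divided by the overall response rate."""
    ranked = [y for y, _ in sorted(zip(labels, scores), key=lambda t: -t[1])]
    k = max(1, round(len(ranked) * frac))
    top_rate = sum(ranked[:k]) / k
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate

# Toy predictions from some classifier, invented for illustration.
y = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
s = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]
print(round(roc_auc(y, s), 3))   # 0.833
print(lift_at(y, s, 0.2))        # 2.5
```

This is what the Model Comparison node's ROC and lift charts summarise graphically; computing the numbers once by hand makes those charts easier to read.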