ITM 326: DATA MINING (Spring 2010)
Jianmin Liu, Ph.D.
Golden Gate University
San Francisco

Overview

Data mining is becoming a major analytical tool for corporations and institutions that need to analyze large amounts of data to achieve their specific objectives. It enables companies to explore large amounts of data and discover relationships and patterns that can assist in more profitable, proactive decision-making. Advances in hardware and software have propelled data mining beyond traditional analytics by making compute-intensive activities more feasible and usable.

The data mining software used in this course is Enterprise Miner from SAS Institute, Inc. The SAS System is widely used in business, banking, insurance, biotechnology, government, and education for database management and statistical analysis. Enterprise Miner (version 4.3) is the leading data mining software for business applications. It is a natural evolution of the SAS System and complements its traditional quantitative analysis tools.

This course provides students with the basic technical background of data mining techniques. The instructor will demonstrate different data mining techniques using real-life data and SAS/Enterprise Miner software with a hands-on approach.

Instructor Information

I received my Ph.D. and M.S. in economics from UC Berkeley, and an M.S. in applied statistics and a B.S. in electrical engineering in China. I have 8 years of experience using SAS and have been a frequent speaker at international data mining conferences.
I have contributed a number of articles to data mining conferences and SAS user conferences and have worked as a SAS consultant for both the government and private sectors. I currently work as a senior vice president in risk management in the business banking division of Wells Fargo Bank, responsible for model development and data infrastructure. Previously I was a senior vice president in the Home Equity Lending division of Wells Fargo Bank, responsible for statistical/econometric modeling and data mining. Prior to that I was a vice president of credit risk management at Bank of America Mortgage. I also worked at Wells Fargo Bank as Assistant VP/Senior Project Manager in on-line banking, and at a financial engineering firm in Berkeley as a Senior Financial Economist. I have been an adjunct faculty member at Golden Gate University since 1996, teaching Data Mining, SAS Programming, Derivatives Markets and Econometrics.

11/13/09

Learning Objectives

This class is intended to provide hands-on experience in applying data mining techniques using SAS Enterprise Miner. Although it is not a statistics class, various practical issues that are statistical in nature will be discussed in class. For this reason, students are highly encouraged to interact with me and participate in class discussion.

Upon successful completion of this course, students will be able to:
• Understand what data mining is as a general concept and how it differs from traditional statistical analysis
• Understand some widely used data mining techniques
• Know how to conduct a data mining analysis using SAS Enterprise Miner on real data
• Understand the strengths and weaknesses of data mining techniques vs. traditional statistical analysis
• Understand how to interpret the results of a data mining analysis and how to implement them

Prerequisites

No programming experience is required, but some statistical background helps.

Textbooks

Getting Started with Enterprise Miner Software, Release 4.3, by SAS Institute, 2000
Mastering Data Mining, by M. Berry, John Wiley & Sons, Inc., 2000
Classification and Regression Trees, by L. Breiman, J. Friedman, R. Olshen, C. Stone, Wadsworth International Group, Belmont, California, 1984 (reference)
Neural Networks for Pattern Recognition, by Christopher Bishop, Oxford University Press, 1995 (reference)
Pattern Recognition and Neural Networks, by Brian Ripley, Cambridge University Press, 1996 (reference)

Instruction Method

This course combines lectures, discussions and computer lab hours. Fifty percent of the course will focus on the basic theoretical background of various data mining techniques. The rest of the course will focus on the application of data mining techniques and SAS Enterprise Miner.

Grading

There is a midterm exam, which counts for 50% of the class grade. The remaining 50% will be based on class participation and the course project.

Content Outline

Session 1 Introduction
• Introduction of the course, syllabus, etc.
• What is Data Mining?
• Why Data Mining?
• Situations where Data Mining techniques are being utilized
• Data Mining and Statistical Modeling
• Basic Data Mining Terminology
• Background
    • growth in computing power and operational databases
    • challenges presented by massive, opportunistic data
    • prediction and understanding of business outcomes
    • contributing disciplines: statistics, machine learning, pattern recognition
• Problem Formulation
    • formulating business objectives that can be translated into suitable analytical methods
    • applying predictive modeling to database marketing, credit scoring and fraud detection
    • applying cluster analysis and association rule discovery, and recognizing their pitfalls

Session 2-3 Overview of Some Basic Statistical Techniques
• Data Difficulties
    • data structure and organization
    • errors, outliers, and missing values
    • sampling and oversampling
    • dimension reduction and the curse of dimensionality
• Linear Regression
• Logistic Regression
• Multinomial Logit Model (optional, more detailed statistics required)
• Variable Selection in Regression Analysis
• Variable Transformation
• Hypothesis Test in Regression Analysis
• Goodness-of-fit of Models
• Variable Selection in Data Mining
• Sampling
• Variable Transformation
• Outlier Filter
• Prior Probability Specification
• Measuring Effectiveness of Data Mining Techniques
    • Lift Chart
    • Profit Matrix (Cost Matrix)
    • Misclassification Rate

Session 4-6 Data Mining Techniques
• Decision Trees
    • constructing a decision tree using a credit scoring example
    • examining the functionality of the Decision Tree node
    • constructing decision trees with binary and multiway splits
    • pruning and assessing decision trees
• CART (Classification and Regression Trees)
    • Gini Index and Impurity
• CHAID (Chi-Square Automatic Interaction Detection)
• C4.5
• Neural Networks
    • constructing multilayer perceptrons
    • visualizing network complexity
    • performing stopped training
• Curse of Dimensionality
• Multi-Layer Perceptron (Feed Forward NN)
• Radial Activation Function NN

Session 7 The Data Mining Process
• Data Preparation
• Defining a Study
• SEMMA (Sample, Explore, Modify, Model, Assess)
• Reading the Data and Building a Model
• Understanding Your Model and Testing the Results

Session 9-10 Using SAS/Enterprise Miner to Conduct Data Mining Analysis
• Using an example data set to test logit model (binary output)
• Using an example data set to test CART (binary output)
• Using an example data set to test CHAID (binary output)
• Using an example data set to test NN (Neural Network) (binary output)
• Using an example data set to test C4.5 (binary output)
• Ensemble Node

Session 11
• Using a real-life data set to test logit model (binary output)
• Using a real-life data set to test CART (binary output)
• Using a real-life data set to test CHAID (binary output)
• Using a real-life data set to test NN (Neural Network) (binary output)
• Using a real-life data set to test C4.5 (binary output)

Session 12
• Using a real-life data set to test CART (multiple output)
• Using a real-life data set to test CHAID (multiple output)
• Using a real-life data set to test NN (Neural Network) (multiple output)
• Using a real-life data set to test C4.5 (multiple output)

Session 13 Practical Issues in Data Mining Applications
• Sampling Issues
• Variable Type Issues
• Variable Transformation Issues
• Model Stability Issues
• Score Scale Issues
• Score Implementation Issues
• Exporting Score Code from Enterprise Miner to a different platform
A Case Study of Database Marketing
A Case Study of Credit Risk Management in Mortgage Portfolio

Session 14-15
• Hands-on Project
• Course Summary
• TBA

Session 16 Project Presentation