ITM 326: DATA MINING
(Spring 2010)
Jianmin Liu, Ph.D.
Golden Gate University
San Francisco
Overview
Data Mining is becoming a major analytical tool for corporations and institutions that
need to analyze large amounts of data to achieve their specific objectives. It enables
companies to explore large amounts of data and discover relationships and patterns that
can assist in more profitable, proactive decision-making. Advances in hardware and
software have propelled data mining beyond traditional analytics by making
compute-intensive activities more feasible and usable.
The Data Mining software used in this course is Enterprise Miner from SAS
Institute, Inc. The SAS System is widely used in business, banking, insurance, biotechnology,
government, and education for database management and statistical analysis. Enterprise
Miner (version 4.3) is the leading data mining software in business applications. It is a
natural evolution of the SAS System and complements traditional
quantitative analysis.
This course provides students with the basic technical background of Data Mining
techniques. The instructor will demonstrate different data mining techniques using
real-life data and SAS/Enterprise Miner software with a hands-on approach.
Instructor Information
I received my Ph.D. and M.S. in economics from UC Berkeley, and an M.S. in applied statistics
and a B.S. in electrical engineering in China. I have 8 years of experience using SAS and
have been a frequent speaker at international data mining conferences. I have contributed
a number of articles to data mining and SAS user conferences and have
worked as a SAS consultant in both the government and private sectors. I currently work as
a senior vice president in risk management in the business banking division of Wells Fargo
Bank, responsible for model development and data infrastructure. Previously I was a
senior vice president in the Home Equity Lending division of Wells Fargo Bank, responsible
for statistical/econometric modeling and data mining. Prior to that I was a vice president
of credit risk management at Bank of America Mortgage. I also worked at Wells Fargo
Bank as Assistant VP/Senior Project Manager in online banking, and at a
financial engineering firm in Berkeley as a Senior Financial Economist. I have been an
11/13/09
2
adjunct faculty member at Golden Gate University since 1996, teaching Data Mining,
SAS Programming, Derivatives Markets and Econometrics.
Learning Objectives
This class is intended to provide hands-on experience in applying Data Mining
techniques using SAS Enterprise Miner. Although it is not a statistics class, various
practical issues that are statistical in nature will be discussed in class. For this reason,
students are highly encouraged to interact with me and participate in class discussion.
Upon successful completion of this course, students will be able to:
• Understand what data mining is as a general concept and how it differs from
  traditional statistical analysis
• Understand some widely used data mining techniques
• Know how to conduct a data mining analysis using SAS Enterprise Miner on real
  data
• Understand the strengths and weaknesses of data mining techniques vs. traditional
  statistical analysis
• Understand how to interpret the results of a data mining analysis and how to implement
  the results
Prerequisites
No programming experience is required, but some statistical background helps.
Textbooks
Getting Started with Enterprise Miner Software, Release 4.3, by SAS Institute, 2000
Mastering Data Mining, by M. Berry and G. Linoff, John Wiley & Sons, Inc., 2000
Classification and Regression Trees, by L. Breiman, J. Friedman, R. Olshen, and C. Stone,
Wadsworth International Group, Belmont, California, 1984 (reference)
Neural Networks for Pattern Recognition, by Christopher Bishop, Oxford
University Press, 1995 (reference)
Pattern Recognition and Neural Networks, by Brian Ripley, Cambridge
University Press, 1996 (reference)
Instruction Method
This course combines lectures, discussions, and computer lab hours. Half of the
course will focus on the basic theoretical background of various data mining techniques.
The rest of the course will focus on the application of data mining techniques and SAS
Enterprise Miner.
Grading
There is a midterm exam, which counts for 50% of the class grade. The remaining 50% will be
based on class participation and the course project.
Content Outline
Session 1 Introduction
• Introduction of the course, syllabus, etc.
• What is Data Mining?
• Why Data Mining?
• Situations where Data Mining techniques are being utilized
• Data Mining and Statistical Modeling
• Basic Data Mining Terminology
• Background
  • growth in computing power and operational databases
  • challenges presented by massive, opportunistic data
  • prediction and understanding of business outcomes
  • contributing disciplines: statistics, machine learning, pattern recognition
• Problem Formulation
  • formulating business objectives that can be translated into suitable analytical
    methods
  • applying predictive modeling to database marketing, credit scoring and fraud
    detection
  • applying and recognizing the pitfalls of cluster analysis and association rule
    discovery
Session 2-3 Overview of Some Basic Statistical Techniques
• Data Difficulties
  • data structure and organization
  • errors, outliers, and missing values
  • sampling and oversampling
  • dimension reduction and the curse of dimensionality
• Linear Regression
• Logistic Regression
• Multinomial Logit Model (optional, more detailed statistics required)
• Variable Selection in Regression Analysis
• Variable Transformation
• Hypothesis Test in Regression Analysis
• Goodness-of-fit of Models
• Variable Selection in Data Mining
• Sampling
• Variable Transformation
• Outlier Filter
• Prior Probability Specification
• Measuring Effectiveness of Data Mining Techniques
  ▪ Lift Chart
  ▪ Profit Matrix (Cost Matrix)
  ▪ Misclassification Rate
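As a rough illustration of the effectiveness measures listed above, the following Python sketch computes a misclassification rate, the cumulative lift for the top-scored fraction of cases (one point on a lift chart), and the expected profit of a decision from one row of a profit matrix. It is not how Enterprise Miner computes these quantities; the function names, the 0.5 cutoff, and the toy data are all assumptions made for the example.

```python
# Illustrative sketch only: hypothetical helper names and toy data,
# not taken from SAS Enterprise Miner.

def misclassification_rate(actual, predicted):
    """Share of cases whose predicted class differs from the actual class."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and the same length")
    errors = sum(1 for a, p in zip(actual, predicted) if a != p)
    return errors / len(actual)


def cumulative_lift(actual, scores, fraction):
    """Response rate among the top-scored `fraction` of cases, divided by
    the response rate in the whole data set (assumes at least one event)."""
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    top_rate = sum(a for _, a in ranked[:top_n]) / top_n
    overall_rate = sum(actual) / len(actual)
    return top_rate / overall_rate


def expected_profit(prob_event, profit_if_event, profit_if_non_event):
    """Expected profit of one decision (one row of a profit matrix)."""
    return prob_event * profit_if_event + (1.0 - prob_event) * profit_if_non_event


# Toy data: 1 = event (e.g. a response), 0 = non-event.
actual = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
predicted = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cutoff

print("misclassification rate:", misclassification_rate(actual, predicted))
print("lift in top 20%:", cumulative_lift(actual, scores, 0.2))
print("expected profit of soliciting:", expected_profit(0.1, 10.0, -1.0))
```

Repeating the lift calculation for several fractions (10%, 20%, and so on) gives the points that would be plotted on a cumulative lift chart.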
Session 4 – 6 Data Mining Techniques
• Decision Trees
  • constructing a decision tree using a credit scoring example
  • examining the functionality of the Decision Tree node
  • constructing decision trees with binary and multiway splits
  • pruning and assessing decision trees
• CART (Classification and Regression Trees)
  ▪ Gini Index and Impurity
• CHAID (Chi-Square Automatic Interaction Detection)
• C4.5
• Neural Networks
  • constructing multilayer perceptrons
  • visualizing network complexity
  • performing stopped training
  ▪ Curse of Dimensionality
  ▪ Multi-Layer Perceptron (Feed-Forward NN)
  ▪ Radial Activation Function NN
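To make the Gini index topic above concrete, here is a minimal Python sketch of Gini impurity and the impurity reduction of a binary split, the quantity a CART-style tree tries to maximize when choosing a split. The function names and the "good"/"bad" credit labels are hypothetical, and this is not the Enterprise Miner Decision Tree node's implementation.

```python
from collections import Counter

# Illustrative sketch only: hypothetical function names and toy labels.


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_impurity(left, right):
    """Size-weighted average impurity of the two child nodes."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


def gini_gain(parent, left, right):
    """Reduction in impurity achieved by splitting parent into left and right."""
    return gini_impurity(parent) - split_impurity(left, right)


# Toy credit scoring node: two good and two bad accounts.
parent = ["good", "good", "bad", "bad"]
perfect_left, perfect_right = ["good", "good"], ["bad", "bad"]
useless_left, useless_right = ["good", "bad"], ["good", "bad"]

print("parent impurity:", gini_impurity(parent))
print("gain of a perfect split:", gini_gain(parent, perfect_left, perfect_right))
print("gain of an uninformative split:", gini_gain(parent, useless_left, useless_right))
```

A split that separates the classes completely removes all impurity, while a split that leaves each child with the same class mix as the parent yields no gain.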
Session 7 The Data Mining Process
• Data Preparation
• Defining a Study
• SEMMA (Sample, Explore, Modify, Model, Assess)
• Reading the Data and Building a Model
• Understanding Your Model and Testing the Results
Session 9-10 Using SAS/Enterprise Miner to Conduct Data Mining Analysis
• Using an example data set to test logit model (binary output)
• Using an example data set to test CART (binary output)
• Using an example data set to test CHAID (binary output)
• Using an example data set to test NN (Neural Network) (binary output)
• Using an example data set to test C4.5 (binary output)
• Ensemble Node
Session 11
• Using a real-life data set to test logit model (binary output)
• Using a real-life data set to test CART (binary output)
• Using a real-life data set to test CHAID (binary output)
• Using a real-life data set to test NN (Neural Network) (binary output)
• Using a real-life data set to test C4.5 (binary output)
Session 12
• Using a real-life data set to test CART (multiple output)
• Using a real-life data set to test CHAID (multiple output)
• Using a real-life data set to test NN (Neural Network) (multiple output)
• Using a real-life data set to test C4.5 (multiple output)
Session 13 Practical Issues in Data Mining Applications
• Sampling Issues
• Variable Type Issues
• Variable Transformation Issues
• Model Stability Issues
• Score Scale Issues
• Score Implementation Issues
• Exporting Score Code from Enterprise Miner to a different platform
  ▪ A Case Study of Database Marketing
  ▪ A Case Study of Credit Risk Management in Mortgage Portfolio
Session 14-15
• Hands-on Project
• Course Summary
• TBA
Session 16 Project Presentation