January 23, 2002

92.6180-01 DATA MINING AND KNOWLEDGE DISCOVERY

Instructor: Prof. Mark J. Embrechts (x 4009 or 371-4562)
Office hrs: CII 5217, Tuesday 10:00-12:00
Class Time: Monday/Thursday 10:00-11:20 am (Low 3112)
Book: Michael J. A. Berry and Gordon Linoff, Data Mining Techniques: For Marketing, Sales, and Customer Support, John Wiley (1997). ISBN 0-471-17980-9

Lectures #2&3: Typical Data Mining Problems

These two lectures expose typical classes of data mining problems (asymmetric market basket analysis, clustering, association rules, database marketing, the standard data mining problem). The lectures highlight the conditions for classifying a problem as a data mining problem and give an example of data strip mining and of scientific data mining. Special emphasis is placed on the correct terminology for training set, validation set and test set. The concept of bootstrapping will be introduced. Emphasis will also be placed on data preparation, data presentation, data cleaning and data visualization.

Lecture 3 will emphasize non-linear association rules, feature reduction, correlation matrices, sensitivity analysis and 2-D sensitivity analysis for data mining. Data prediction problems will be introduced as stripminer problems, and measures for model quality and prediction quality (q² and r²) will also be defined.

Handouts: Jorge Luis Borges, “The Library of Babel,” from Labyrinths, pp. 51-58, Modern Library (1983).

Homework 1: Download the two datasets posted on the class website (TBA). The last column is a data label, and the next-to-last column is the variable for which a predictive model is required. Describe what you see in the data and try to predict the variable of interest using the second dataset.

Deadlines:
1. January 24, HW #0 (Web browsing)
2. January 28, Project Proposal
3. February 31, HW #1

DATA MINING PROBLEMS

Market basket analysis
Database marketing (e.g., telemarketing)
Outlier detection (e.g., manufacturing errors)
Fraud detection (e.g., insurance, taxes)
Unusual/interesting pattern detection (e.g., credit cards)
Medical diagnosis (e.g., pap smear interpretation, medical image interpretation)
Uni-variate and multi-variate time series analysis
Text and web mining problems (e.g., taxonomy trees, also www.encartia.com)
The standard data mining problem:
    (i) Predict a variable
    (ii) Identify important features
    (iii) Interpret features in domain
The standard data clustering problem (generally unsupervised, but now with a calibration phase)
Data stripmining problems (lots of features/relatively few data)
Genome mining

DATA MINING TERMINOLOGY AND ISSUES

Fields and records
Descriptors or features or attributes
Training set/validation set/test set
Calibration (for unsupervised clustering)
Gauging, bagging
N-fold cross-validation
The jack-knife and the bootstrap
Robust statistics
Data cleansing (outliers vs. false values)
Missing data and data quality (e.g., multiple entries for same field, different quality)

DATA MINING DEFINITIONS

Large data sets
Data do not fit in memory
Multiple datasets
Missing and false data
Non-obviousness of the problem
Vastness of the problem
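
The terminology items training set/validation set/test set, N-fold cross-validation and the bootstrap all come down to different ways of dividing or resampling the record indices of a dataset. The following is a minimal sketch in Python, not course material: the function names, the 60/20/20 split fractions and the fixed random seeds are assumptions chosen for illustration only.

```python
import random


def split_indices(n, frac_train=0.6, frac_val=0.2, seed=0):
    """Shuffle 0..n-1 and cut into training, validation and test index lists.

    The 60/20/20 default is an illustrative assumption, not a course rule.
    """
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_train = int(n * frac_train)
    n_val = int(n * frac_val)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


def n_fold_indices(n, n_folds=5, seed=0):
    """Yield (train, held_out) index lists for N-fold cross-validation.

    Every record is held out exactly once across the N folds.
    """
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, folds[k]


def bootstrap_sample(n, seed=0):
    """Draw n indices with replacement; indices never drawn are 'out of bag'."""
    rng = random.Random(seed)
    drawn = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(drawn))
    return drawn, out_of_bag
```

In this sketch a model would be fitted on the training indices only, tuned on the validation (or held-out) indices, and assessed once on the test indices. The bootstrap differs from the other two in that records can appear more than once in a sample, and the records left out of a given sample can serve as its check set.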