
January 23, 2002
92.6180-01 DATA MINING AND KNOWLEDGE DISCOVERY
Instructor:
Prof. Mark J. Embrechts (x 4009 or 371-4562)
Office hrs: CII 5217 Tuesday 10:00-12:00
Class Time:
Monday/Thursday 10:00-11:20 am (Low 3112)
Book: Michael J. A. Berry and Gordon Linoff, Data Mining Techniques: For Marketing,
Sales, and Customer Support, John Wiley (1997). ISBN 0-471-17980-9
Lectures #2&3: Typical Data Mining Problems
These two lectures survey typical classes of data mining problems (asymmetric market
basket analysis, clustering, association rules, database marketing, and the standard data
mining problem). They set out the conditions under which a problem qualifies as a data
mining problem and give an example of data strip mining and of scientific data mining.
Special emphasis is placed on the correct terminology for the training set, validation set
and test set. The concept of bootstrapping will be introduced. Emphasis will also be placed
on data preparation, data presentation, data cleaning and data visualization. Lecture 3 will
emphasize non-linear association rules, feature reduction, correlation matrices, sensitivity
analysis and 2-D sensitivity analysis for data mining. Data prediction problems will be
introduced as stripminer problems, and measures of model quality and prediction quality
(q2 and r2) will also be defined.
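The distinction between training, validation and test sets can be made concrete with a short sketch. The function names, the 60/20/20 proportions and the fixed seed below are illustrative assumptions, not part of the course material. The handout does not define q2 and r2 at this point, so the second function assumes one common reading of r2, the squared Pearson correlation between actual and predicted values; q2 is deliberately left out.

```python
import random


def three_way_split(rows, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle rows, then cut them into training, validation and test sets.

    The fractions and the seed are assumptions chosen for illustration.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = round(len(rows) * train_frac)
    n_valid = round(len(rows) * valid_frac)
    train = rows[:n_train]
    valid = rows[n_train:n_train + n_valid]
    test = rows[n_train + n_valid:]  # whatever is left over
    return train, valid, test


def squared_correlation(actual, predicted):
    """Squared Pearson correlation (an assumed definition of r2)."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_a * var_p)
```

In this arrangement the model is fitted on the training set, tuned against the validation set, and scored once on the test set, which is never used for fitting or tuning.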
Handouts:
Jorge Luis Borges, “The Library of Babel,” from Labyrinths, pp. 51-58, Modern Library
(1983).
Homework 1:
Download the two datasets posted on the class website (TBA). The last column is a data
label, and the next-to-last column is the variable for which a predictive model is required.
Describe what you see in this dataset and try to make a prediction for the variable of
interest with the second dataset.
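As a starting point, the column layout described above could be unpacked as follows. This is only a sketch: the file format is not stated in the handout, so whitespace-separated fields with numeric features are assumed, and `split_columns` is a hypothetical helper name.

```python
def split_columns(lines):
    """Separate each record into features, response and label.

    Assumes whitespace-separated fields (the real file format is unknown):
    the last field is the data label, the next-to-last field is the
    variable to predict, and everything before that is a numeric feature.
    """
    features, targets, labels = [], [], []
    for line in lines:
        fields = line.split()
        if not fields:  # skip blank lines
            continue
        labels.append(fields[-1])
        targets.append(float(fields[-2]))
        features.append([float(value) for value in fields[:-2]])
    return features, targets, labels
```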
Deadlines:
1. January 24, HW#0 (Web browsing).
2. January 28, Project Proposal
3. February 31, HW #1
DATA MINING PROBLEMS
• Market basket analysis
• Database marketing (e.g., telemarketing)
• Outlier detection (e.g., manufacturing errors)
• Fraud detection (e.g., insurance, taxes)
• Unusual/interesting pattern detection (e.g., credit cards)
• Medical diagnosis (e.g., pap smear interpretation, medical image interpretation)
• Uni-variate and multi-variate time series analysis
• Text and web mining problems (e.g., taxonomy trees, also www.encartia.com)
• The standard data mining problem:
  (i) Predict a variable
  (ii) Identify important features
  (iii) Interpret features in domain
• The standard data clustering problem (generally unsupervised, but now with a calibration phase)
• Data stripmining problems (lots of features/relatively few data)
• Genome mining
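To illustrate the market basket item in the list above, and why such analysis is asymmetric, here is a minimal sketch of the two usual association-rule measures, support and confidence. The basket contents and function names are invented for illustration and are not course data.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for basket in baskets if items <= set(basket))
    return hits / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = support(baskets, set(antecedent) | set(consequent))
    return both / support(baskets, antecedent)


# Hypothetical baskets, invented for this example.
baskets = [
    {"bread", "milk"},
    {"bread"},
    {"bread", "milk"},
    {"bread", "butter"},
]

print(confidence(baskets, {"bread"}, {"milk"}))  # 0.5
print(confidence(baskets, {"milk"}, {"bread"}))  # 1.0
```

The rule "milk implies bread" holds in every basket containing milk, while "bread implies milk" holds in only half of the bread baskets, so the direction of a rule matters.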
DATA MINING TERMINOLOGY AND ISSUES
• Fields and Records
• Descriptors or Features or Attributes
• Training set/validation set/test set
• Calibration (for unsupervised clustering)
• Gauging, bagging
• N-fold cross-validation
• The jack-knife and the bootstrap
• Robust statistics
• Data cleansing (outliers vs. false values)
• Missing data and data quality (e.g., multiple entries for the same field, different quality)
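Three of the resampling terms above (N-fold cross-validation, the bootstrap and the jack-knife) can be sketched in a few lines. This follows the textbook definitions rather than any specific procedure from the lectures; the function names, default seed and resample count are assumptions.

```python
import random


def n_fold_indices(n_rows, n_folds, seed=0):
    """Yield (train_idx, test_idx) pairs; each row is held out exactly once."""
    order = list(range(n_rows))
    random.Random(seed).shuffle(order)
    folds = [order[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        held_out = folds[k]
        kept = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield kept, held_out


def bootstrap_means(values, n_resamples=1000, seed=0):
    """Means of resamples drawn with replacement, each as large as the data."""
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(n_resamples):
        sample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    return means


def jackknife_means(values):
    """Leave-one-out means: entry i is the mean with values[i] removed."""
    total = sum(values)
    n = len(values)
    return [(total - v) / (n - 1) for v in values]
```

The spread of the bootstrap or jack-knife means gives a rough idea of how stable an estimate is, while the cross-validation folds give an estimate of how a model performs on rows it has not seen.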
DATA MINING DEFINITIONS
• Large Data Sets
• Data do not fit in Memory
• Multiple Datasets
• Missing and False Data
• Non-Obviousness of the Problem
• Vastness of the problem