Download Class Slides - Pitt Department of Biomedical Informatics

Data Mining
Joyeeta Dutta-Moscato
July 10, 2013
Data Mining
Wherever we have large amounts of data, we have the need for
building systems capable of learning information from the data
– predictions in medicine
– text and web page classification
– speech recognition
Learning underlying patterns is useful to
– predict the presence of a disease in future patients,
– describe the dependencies between diseases and
symptoms
Data Mining focuses on the discovery of (previously)
unknown properties from data, using techniques from
Machine Learning.
Data
• 4 attributes / features
• Each attribute has values
• 3 × 3 × 2 × 2 = 36 possible combinations
• 14 combinations present in this example
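The count of 36 is the product of the number of values each attribute can take. A minimal sketch, assuming the attribute names and values below: the slide gives only the counts (3, 3, 2, 2), and the temperature values in particular are an assumption made for illustration.

```python
from itertools import product

# Assumed attribute values; the slide states only how many each attribute has.
attributes = {
    "outlook": ["sunny", "overcast", "rainy"],
    "temperature": ["hot", "mild", "cool"],  # assumption, not from the slides
    "humidity": ["high", "normal"],
    "windy": [True, False],
}

# Every possible instance is one choice of value per attribute.
combinations = list(product(*attributes.values()))
n_combinations = len(combinations)
print(n_combinations)  # 3 * 3 * 2 * 2 = 36
```

Only 14 of these 36 combinations appear in the example data, so any learned model has to generalize to combinations it has never seen.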
Data → Prediction
A set of rules to predict whether we will get to play
could look like this:
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
→ A decision list
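The rules above can be sketched as an ordered sequence of checks, where the first matching rule decides the outcome. The function name `play` and its argument names are chosen here for illustration.

```python
def play(outlook, humidity, windy):
    """Apply the rules in order; the first rule that matches decides."""
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity == "normal":
        return "yes"
    return "yes"  # none of the above


print(play("sunny", "high", False))    # no  (rule 1)
print(play("rainy", "normal", True))   # no  (rule 2 fires before rule 4)
print(play("sunny", "normal", False))  # yes (rule 4)
```

Order matters in a decision list: a rainy, windy day with normal humidity matches both the second and the fourth rule, and the earlier one wins.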
Decision Tree Learning
F = { <Outlook, Humidity, Wind, Temp> → Play Tennis? }
The goal is to create a model that predicts the
value of a target variable based on several
input variables.
Decision Tree Learning
Problem Setting
• Set of possible instances X
Each instance x in X is a feature vector x = < x1, x2, ... xn>
• Unknown target function f: X → Y
Y is discrete valued
• Set of function hypotheses H = { h | h : X → Y }
Each hypothesis h is a decision tree
Input:
• Training examples {<x(i),y(i)>} of unknown target function f
Output:
• Hypothesis h ∈ H that best approximates target function f
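To make the problem setting concrete, here is a minimal sketch of one hypothesis h represented as a decision tree, plus a check of how well it agrees with some training examples. The tree shape, the feature names, and the three training examples are all hypothetical; a learning algorithm would search H for a tree like this rather than have it written by hand.

```python
# A hypothetical hypothesis h: internal nodes test one feature,
# leaves hold a value of the discrete target Y.
tree = ("outlook", {
    "sunny": ("humidity", {"high": "no", "normal": "yes"}),
    "overcast": "yes",
    "rainy": ("windy", {True: "no", False: "yes"}),
})


def predict(node, x):
    """Walk from the root to a leaf, following the branch that matches x."""
    while not isinstance(node, str):
        feature, branches = node
        node = branches[x[feature]]
    return node


# Hypothetical training examples <x(i), y(i)>.
training = [
    ({"outlook": "sunny", "humidity": "high", "windy": False}, "no"),
    ({"outlook": "overcast", "humidity": "high", "windy": True}, "yes"),
    ({"outlook": "rainy", "humidity": "normal", "windy": True}, "no"),
]

# Count the training examples on which h disagrees with the known label.
errors = sum(predict(tree, x) != y for x, y in training)
print(errors)
```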
Supervised Learning
Given a set of training examples of the form:
{(x1, y1), … (xn, yn)}
a learning algorithm seeks a function:
g:XY
Where X is the input space and Y is the output space.
Example:
- Classify the universe of music into ‘like’ & ‘dislike’ for
one person
- Training set: A list of songs that the person heard,
and marked as ‘like’ or ‘dislike’
- Task: Infer a function of features (of these songs) to
predict what other songs the person will like
Supervised Learning
Given a model family, we are interested in finding the
best model parameters, such that the misfit (measured by
an error function) between the data and the model is
minimized.
In the optimal scenario, the algorithm correctly
determines the class labels for unseen instances.
Supervised Learning
Considerations:
• The learning algorithm must generalize from the
training data to unseen situations in a "reasonable"
way: Avoid overfitting
• Bias-variance tradeoff
• Number of training examples versus model
complexity
Supervised Learning
Common methods of supervised learning:
• Regression
X discrete or continuous → Y continuous
Examples:
– debt, equity, orders, sales → stock price
– age, height, weight, race, VKORC1 genotype, CYP2C9 genotype
→ warfarin dose
• Classification
X discrete or continuous → Y discrete
Examples:
- family history, history of head trauma, age, gender, race,
APOE status → Alzheimer’s disease
- arrangement of pixels in handwritten digit → “3”
Supervised Learning
• Linear Regression
Fitting the model to the data
Objective: Minimize mean squared error
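A minimal sketch of fitting a straight line y = a·x + b by least squares, which is the choice of a and b that minimizes the mean squared error. The data points are hypothetical, and the closed-form formulas used are the standard ones for a single input variable.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a * x + b (minimizes mean squared error)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def mse(xs, ys, a, b):
    """Mean squared error between the line's predictions and the targets."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: roughly y = x with some noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.5, 0.5, 2.5, 2.5, 4.5]

a, b = fit_line(xs, ys)
print(a, b, mse(xs, ys, a, b))
```

Any other choice of slope or intercept gives a larger mean squared error on these points, which is what "best parameters" means for this error function.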
Regression
Does a mean square error of 0 (i.e. no difference
between prediction and target) mean this is the best
model?
→ Overfitting
The real test of the ‘best model’ is its performance on
data it has not been trained on
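A small sketch of this point, under stated assumptions: the training and held-out points are hypothetical, and the "overfit" model is a polynomial that passes exactly through every training point (Lagrange interpolation), compared against a least-squares line.

```python
def interpolate(xs, ys, x):
    """Lagrange polynomial through every training point (zero training error)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def fit_line(xs, ys):
    """Least-squares fit of y = a * x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    return a, mean_y - a * mean_x


def mse(model, xs, ys):
    """Mean squared error of a prediction function on a data set."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: noisy training points and held-out points from y = x.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [0.5, 0.5, 2.5, 2.5, 4.5]
test_x = [0.5, 1.5, 2.5, 3.5]
test_y = [0.5, 1.5, 2.5, 3.5]

a, b = fit_line(train_x, train_y)
line = lambda x: a * x + b
poly = lambda x: interpolate(train_x, train_y, x)

train_mse_line, test_mse_line = mse(line, train_x, train_y), mse(line, test_x, test_y)
train_mse_poly, test_mse_poly = mse(poly, train_x, train_y), mse(poly, test_x, test_y)
print("line:", train_mse_line, test_mse_line)
print("poly:", train_mse_poly, test_mse_poly)
```

The polynomial reaches a training error of 0 by chasing the noise, yet it does worse than the line on the held-out points.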
Regression
What does this mean about the relationship
between x and y?
Classification
• Linear classifier
Hard threshold
• Logistic regression
Uses the logistic function,
which ranges between 0 and 1
Soft threshold
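The difference between the two thresholds can be sketched as follows. Both start from the same linear score; the weights, bias, and inputs are hypothetical values chosen for illustration.

```python
import math


def linear_score(weights, bias, x):
    """Weighted sum of the inputs plus a bias term."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def hard_threshold(score):
    """Linear classifier: output jumps from 0 to 1 at score = 0."""
    return 1 if score >= 0 else 0


def logistic(score):
    """Logistic function: smooth output strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-score))


weights, bias = [1.0, -1.0], 0.0  # hypothetical parameters
for x in ([2.0, 1.0], [1.0, 1.0], [0.0, 3.0]):
    s = linear_score(weights, bias, x)
    print(x, hard_threshold(s), round(logistic(s), 3))
```

The hard threshold only reports which side of the boundary a point falls on, while the logistic output can be read as how confident the classifier is, with 0.5 exactly on the boundary.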
Other common methods in
Supervised Learning
More sophisticated algorithms are needed for data
that are not linearly separable
• Support Vector machines
• Artificial Neural Networks (can also be unsupervised)
• K-nearest neighbor
• Graphical models, Bayesian models
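As one example from this list, here is a minimal sketch of a k-nearest neighbor classifier on data that no single straight line can separate. The XOR-like training points, the Euclidean distance, and k = 3 are assumptions made for illustration.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Label a query point by majority vote among its k nearest neighbors."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical XOR-like data: class A sits in two opposite corners,
# class B in the other two.
train = [
    ((0.0, 0.0), "A"), ((0.1, 0.1), "A"),
    ((1.0, 1.0), "A"), ((0.9, 0.9), "A"),
    ((0.0, 1.0), "B"), ((0.1, 0.9), "B"),
    ((1.0, 0.0), "B"), ((0.9, 0.1), "B"),
]

print(knn_predict(train, (0.05, 0.0)))
print(knn_predict(train, (0.95, 0.05)))
print(knn_predict(train, (0.95, 1.0)))
```

The method stores the training set and defers all work to prediction time, so it needs no assumption about the shape of the decision boundary.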
Unsupervised Learning
Learn relationships among the inputs, x1, …, xn.
No y is given.
Clustering
– Group inputs based on some measure of
similarity
- Common “first pass” exploratory data mining
technique
Hierarchical Clustering
A method of cluster analysis which aims to build a
hierarchy of clusters, grouping together observations
that are “close” to each other according to some
distance metric.
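One common way to do this is bottom-up (agglomerative): start with every point in its own cluster and repeatedly merge the two closest clusters. The sketch below assumes Euclidean distance and single linkage (cluster distance = distance between the closest pair of members); the slides do not specify either, and the points are hypothetical.

```python
import math


def single_linkage(points, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    clusters = [[p] for p in points]
    merge_distances = []
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merge_distances.append(d)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merge_distances


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (10.0, 0.0)]
clusters, merge_distances = single_linkage(points, 3)
print(clusters)
print(merge_distances)
```

The sequence of merge distances is what a dendrogram plots; stopping at a different number of clusters corresponds to cutting the hierarchy at a different height.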
k-means Clustering
A method of cluster analysis which aims to partition
the data into k clusters in which each observation
belongs to the cluster with the nearest mean.
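A minimal sketch of the usual iterative procedure for this (often called Lloyd's algorithm): assign each point to the nearest mean, recompute the means, and repeat until nothing changes. Using the first k points as initial means is a simplifying assumption made here so the example is deterministic; practical implementations typically initialize randomly. The points are hypothetical.

```python
import math


def k_means(points, k, n_iter=100):
    """Alternate between assigning points to the nearest mean and updating means."""
    centroids = [points[i] for i in range(k)]  # naive initialization (assumption)
    clusters = []
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster with the nearest mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: each mean moves to the average of its assigned points.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[c]
            for c, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```

The result depends on the initial means and on the choice of k, both of which have to be supplied by the user.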
Acknowledgments
Shyam Visweswaran, Dept. of Biomedical Informatics
Tom Mitchell, Dept. of Machine Learning, CMU
“Data Mining: Practical Machine Learning Tools and
Techniques” Ian H. Witten, Eibe Frank, Mark A. Hall