Dealing with data,
crunching numbers
Aim of this lecture
• Get acquainted with the building blocks of
data-mining
• Get acquainted with the difficulties or issues
in data mining
• Decent data-mining = knowledge + work
• Top data-mining = knowledge, work, sweat,
and a good handful of luck
Simple or complex models
The Super Crunchers book considers two ways of
using data:
• Human vs. computer ->
often (relatively) simple models (so when models outperform people,
it is not because the models are so complex)
• Using data to make (surprising or better)
predictions -> often more complex models
Data mining (super crunching, sort of)
“the process of extracting patterns from data”
or: Knowledge Discovery in Data (KDD)
Basically, multiple regression (from the 1800s) is
one way to mine data. Usually, the term data
mining is used when the approach is more ‘data
driven’.
Examples: see the Super Crunchers book
Sources of (artificial) intelligence
• Reasoning versus learning (or deduction vs induction)
• Learning from data
– Patient data
– Customer records
– Stock prices
– Piano music
– Criminal mug shots
– Websites
– Robot perceptions
– Etc.
Data mining tasks
• Undirected, explorative, descriptive,
‘unsupervised’ data mining
– Matching & search
– Profile & rule extraction
– Clustering & segmentation; dimension reduction
• Directed, predictive, ‘supervised’ data mining
– Predictive modeling
Data mining - history
• Bayesian inference (1700s)
• Multiple regression (1800s)
• Instance based learning (1900 [?])
• Neural networks, genetic algorithms (1950s)
• Decision trees (1960s)
• Support vector machines (1980s)
• Random forests (1990s)
• (and many other methods that sometimes combine the previous ones)
General course of events
in data mining
• Getting the data … (representativity!)
• Preprocessing – this is also where part of the
magic is
• Analysis
• Validation / prediction
Preprocessing data
(and why is this necessary anyway?)
• Creating data sets, i.e. setting up the data (think about creating a data file from a server log file)
– Reducing computer time
– Feature extraction
– Flattening
– Longitudinal data set-up
– …
• On inputs (choosing X-variables): creating variables with substance (e.g. factor analysis)
– Attribute selection
– Attribute construction
– …
• On input values (massaging X-variables)
– Outlier removal / clipping
– Normalization
– Creating dummies
– Missing values imputation
– …
THIS IS A LOT OF WORK:
Based on knowledge of DM…
… AND of your topic
(and a bit of luck won’t hurt)
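A minimal sketch of a few of these steps in Python (pandas and scikit-learn are assumed; the data and column names are made up for illustration):

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up data set with an outlier, missing values, and a categorical attribute
df = pd.DataFrame({
    "age": [23, 45, np.nan, 31, 120],           # 120 looks like an outlier
    "income": [2100, 3500, 2800, np.nan, 4100],
    "region": ["north", "south", "south", "east", None],
})

# Outlier clipping: cap numeric values at the 1st/99th percentile
df["age"] = df["age"].clip(lower=df["age"].quantile(0.01),
                           upper=df["age"].quantile(0.99))

# Missing values imputation: fill numeric columns with the column mean
for col in ["age", "income"]:
    df[col] = df[col].fillna(df[col].mean())

# Normalization: zero mean, unit variance
df[["age", "income"]] = StandardScaler().fit_transform(df[["age", "income"]])

# Creating dummies (a missing category gets its own dummy column)
df = pd.get_dummies(df, columns=["region"], dummy_na=True)
print(df.head())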
Software (mainly analysis, not much pre-processing [yet])
(and lots of others; see www.kdnuggets.com)
Kaggle
< show Kaggle website here>
Some analytical procedures
(that can be used instead
of multiple regression)
1. Other regression flavors
2. Decision trees
3. Instance based learning
4. Neural networks
Sideline - defining model fit
(it need not be R²)
• For classifying: % correct, perhaps weighted because some errors are worse than others, area under the ROC curve, …
• For predicting: R², least squares, least absolute difference, etc.
For instance, what if errors of a certain magnitude are acceptable, and others are not?
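For illustration, a small sketch of such fit measures in Python (scikit-learn assumed; the labels, scores, and the 5-to-1 cost ratio are made up):

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true  = np.array([0, 0, 1, 1, 1, 0])
y_score = np.array([0.2, 0.6, 0.8, 0.4, 0.9, 0.1])   # predicted P(y = 1)
y_pred  = (y_score >= 0.5).astype(int)

# % correct
print("% correct:", accuracy_score(y_true, y_pred))

# Weighted error: here a false negative is assumed to be 5x worse than a false positive
cost = np.where((y_true == 1) & (y_pred == 0), 5.0,
                np.where((y_true == 0) & (y_pred == 1), 1.0, 0.0))
print("weighted error:", cost.mean())

# Area under the ROC curve
print("AUC:", roc_auc_score(y_true, y_score))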
Alternatives to regression:
Regression variants
Regression variants
All: use a weighted average of X-vars to predict Y.
- Optimize something else: e.g. “median regression” minimizes |y − yp| instead of (y − yp)²
- Robust regression (create weights w on the basis of fit, re-estimate with w; repeat until convergence)
- MARS (multivariate adaptive regression splines)
- And many others …
Note: this is something different from using logistic regression or multi-level regression...
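A hedged sketch of two of these variants in Python (statsmodels assumed; the data are simulated with heavy-tailed noise to give the robust methods something to do):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 + 1.5 * x + rng.standard_t(df=2, size=200)   # heavy-tailed noise
X = sm.add_constant(x)

# Median regression: minimize sum of |y - yp| (quantile q = 0.5)
median_fit = sm.QuantReg(y, X).fit(q=0.5)

# Robust regression: iteratively reweighted least squares with Huber weights
robust_fit = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()

print("median regression:", median_fit.params)
print("robust regression:", robust_fit.params)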
Alternatives to regression:
Decision trees
Decision trees
(Perhaps an example on
a blackboard helps)
Decision trees
• Very popular: easy to understand, implement,
use, and calculate
• Risk: overfitting. There is no obvious “best”
tree.
• Used for classification mainly, but can be
extended to do regression analysis
(“regression trees”)
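A minimal sketch of such a tree in Python (scikit-learn assumed; max_depth is capped as a crude guard against overfitting):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A shallow tree: easy to read, less prone to overfitting
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
print(export_text(tree))   # the splits can be read off directly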
Decision trees: issues
• How to choose the splitting variable, and where to split?
• Why not just calculate the optimal tree? -> not obvious which one this is
• If x1 and x2 are the top splitting variables, are these also the most important predictors?
• If x3 does not end up as a splitting variable, should we conclude that x3 is not important?
• Sensitivity to outliers
http://www.geocities.com/adotsaha/CTree/CtreeinExcel.html
AnswerTree / CHAID
Alternatives to regression:
Instance based learning
Instance based learning
Slides taken from
http://www.autonlab.org/tutorials/
by Andrew Moore
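A minimal sketch of instance based learning as k-nearest neighbours in Python (scikit-learn assumed; a new case is classified by a vote of its k closest training cases):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each test case is classified by a majority vote of its 5 nearest training cases
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))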
Alternatives to regression:
Neural networks
Neural networks
Inspired by neuronal computation in the brain (McCulloch & Pitts 1943 (!)).
Inputs (attributes) are coded as activations on the input-layer neurons; activation feeds forward through a network of weighted links between neurons and causes activations on the output neurons.
The algorithm learns to find optimal weights using the training instances and a general learning rule.
(MANY different variants exist)
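A minimal sketch of one such variant in Python (scikit-learn's MLPClassifier assumed; one small hidden layer, with inputs scaled first):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale inputs, then learn the weights of a small feed-forward network
net = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(20,), max_iter=1000,
                                  random_state=0))
net.fit(X_train, y_train)
print("test accuracy:", net.score(X_test, y_test))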
Tricks of the trade
(many of which we have no clue why they work [!])
Some tricks of the trade (1)
“Bagging”:
sample with replacement from your data to generate new data sets, and weigh the different models you estimate on them
Amazingly, even simple estimators can lead to very good predictions
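A minimal bagging sketch in Python (scikit-learn assumed; 100 bootstrap samples, a shallow tree fitted on each, predictions combined by voting):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bootstrap samples + simple base estimators, predictions averaged
bag = BaggingClassifier(DecisionTreeClassifier(max_depth=2),
                        n_estimators=100, random_state=0)
bag.fit(X_train, y_train)
print("test accuracy:", bag.score(X_test, y_test))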
Some tricks of the trade (2)
“Boosting”
• Apply the estimator
• Calculate weights per case: higher if the case is badly predicted
• Go back to the first step, until the estimator is only just as good as random
• Predict using the stored weights
Amazingly, even simple estimators can lead to very good predictions
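A minimal boosting sketch in Python (scikit-learn's AdaBoost assumed; each "stump" is fitted with higher weights on the cases the previous ones got wrong):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Weak learners (depth-1 "stumps") fitted in sequence on reweighted cases
boost = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                           n_estimators=200, random_state=0)
boost.fit(X_train, y_train)
print("test accuracy:", boost.score(X_test, y_test))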
Combining methods (“meta-learning”)
Having lots of methods gives rise to new ones…
More weird stuff
You can show that, in principle, different kinds of models are equivalent. But this does not mean that it does not matter which one you choose.
Weighted averages of (good and bad) prediction models often outperform the best prediction models (“ensembling”).
-> There are several strange tricks around, and we do not really know why they work ...
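A minimal ensembling sketch in Python (scikit-learn assumed; the weights are an arbitrary illustration, not tuned values):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Weighted average of several prediction models (soft voting on probabilities)
ensemble = VotingClassifier(
    estimators=[("logit", LogisticRegression(max_iter=5000)),
                ("tree", DecisionTreeClassifier(max_depth=3)),
                ("knn", KNeighborsClassifier())],
    voting="soft", weights=[3, 1, 2])   # illustrative weights only
ensemble.fit(X_train, y_train)
print("test accuracy:", ensemble.score(X_test, y_test))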
Validation and prediction
The risk of overfitting
(Slides borrowed from Andrew Moore)
… because our predictions on new data will be terrible!
First way out: create a test set
• Randomly choose, say, 30% of your data to be
the test set, remainder is training set
• Run regressions on training set
• Estimate future performance on test set
Verdict:
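A minimal hold-out sketch in Python (scikit-learn assumed; 30% test set, a linear regression fitted on the remaining 70%):

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30,
                                                    random_state=0)

# Fit on the training set, estimate future performance on the unseen test set
model = LinearRegression().fit(X_train, y_train)
print("R^2 on training set:", model.score(X_train, y_train))
print("R^2 on test set:    ", model.score(X_test, y_test))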
Second way out: k-fold cross validation
- How to decide on the k-value?
- What is the eventual model?
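A minimal k-fold sketch in Python (scikit-learn assumed; k = 10 here, which is a common but not mandatory choice):

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# Each of the k folds serves once as the test set; the k scores are averaged
scores = cross_val_score(LinearRegression(), X, y, cv=10)
print("R^2 per fold:", scores.round(2))
print("mean R^2:", scores.mean().round(2))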
A thriving community…
Art or science
(or luck, or
sweat)?
Example Results
Predicting Survival for Head & Neck Cancer
Can this be generalized at all?
Hmm … other than some basic
rules this is not really
understood yet.
A winner’s tale:
“We normalized the numerical variables by range, keeping the sparsity. For the categorical
variables, we coded them using at most 11 binary columns for each variable. For each
categorical variable, we generated a binary feature for each of the ten most common values,
encoding whether the instance had this value or not. The eleventh column encoded whether
the instance had a value that was not among the top ten most common values. We removed
constant attributes, as well as duplicate attributes.
We replaced the missing values by mean for numerical attributes, and coded them as a
separate value for discrete attributes. We also added a separate column for each numeric
attribute with missing values, indicating whether the value was missing or not. We also tried
another approach for imputing missing values based on KNN.
On the large data set we discretized the 100 numerical variables that had the highest mutual
information with the target into 10 bins, and added them as extra features.
Because we noticed that some of the most predictive attributes were not linearly correlated
with the targets, we built shallow decision trees (2-4 levels deep) using single numerical
attributes and used their predictions as extra features. We also built shallow decision trees
using two features at a time and used their prediction as an extra feature in the hope of
capturing some non-additive interactions among features.
We also generated additional features using a co-clustering based technique.”
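As an illustration of the categorical coding described in the quote, a minimal sketch in Python (pandas assumed; the "city" variable is made up, and this is one reading of the description, not the winners' actual code):

import pandas as pd

def top10_binary_encode(series: pd.Series) -> pd.DataFrame:
    """Encode a categorical series as 11 binary columns: top-10 values + other."""
    top10 = series.value_counts().index[:10]
    out = pd.DataFrame({f"{series.name}={v}": (series == v).astype(int)
                        for v in top10})
    # Eleventh column: value not among the ten most common values
    out[f"{series.name}=OTHER"] = (~series.isin(top10)).astype(int)
    return out

city = pd.Series(["a", "b", "a", "c", "d"] * 5 + ["z"], name="city")
print(top10_binary_encode(city).sum())   # how often each binary column fires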