Download Current Trends in Machine Learning and Data Mining

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Cluster analysis wikipedia , lookup

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
Current Trends in Machine Learning and
Data Mining
Dr Marcus Gallagher
School of Information Technology and Electrical
Engineering
University of Queensland 4072 Australia
[email protected]
Talk Overview
Machine Learning
Data mining

Where machine learning fits into data mining
Scientific Data Mining


Machine learning and the scientific process
Current topics & future directions
Summary
Australian Virtual Observatory
Workshop 2003
2
Machine Learning
“…the study of computer algorithms
capable of learning to improve their
performance on a task on the basis of their
own experience.”

Often this is “learning from data”.
A sub-discipline of artificial intelligence,
with large overlaps into statistics, pattern
recognition, visualization, robotics, control,
…
Australian Virtual Observatory
Workshop 2003
3
Data Mining
“…the analysis of (often large)
observational data sets to find
unsuspected relationships and to
summarize the data in novel ways that are
both understandable and useful to the
data owner.”
Australian Virtual Observatory
Workshop 2003
4
Data Mining

Modern science is driven by data analysis like never
before. We have an ability to collect and process data
that is increasing exponentially!
Australian Virtual Observatory
Workshop 2003
5
Data mining: increase in CPU performance (1/execution time)
Australian Virtual Observatory
Workshop 2003
6
Data mining: trends in memory capacity and cost
Australian Virtual Observatory
Workshop 2003
7
Data mining
General trends:

Increase in the size of datasets (in terms of
observations and dimensionality).
What to do when #dimensions > # observations?


Interest in data mining for non-numerical data.
“We are drowning in information and starving
for knowledge”. – Rutherford D. Roger
Australian Virtual Observatory
Workshop 2003
8
Data Mining
Define
problem
Data
collection
Data
preparation
Data
modelling
Interpretation/
Evaluation
Implement/
Deploy model
Australian Virtual Observatory
Workshop 2003
9
Data Mining
Define
problem
Machine
Learning
Data
collection
Data
preparation
Data
modelling
Interpretation/
Evaluation
Implement/
Deploy model
Australian Virtual Observatory
Workshop 2003
10
Data modelling and the scientific process
Data modelling plays an important role at
several stages in the scientific process:
1.
2.
3.
4.
5.
Observe and explore interesting phenomena.
Generate hypotheses.
Formulate model to explain phenomena.
Test predictions made by the theory.
Modify theory and repeat (at 2 or 3).
The explosion of data suggests that we need
to (partially) automate numerous aspects of
the scientific process.
Australian Virtual Observatory
Workshop 2003
11
1. Observe and explore interesting phenomena.
Problems here are typically referred to as unsupervised
learning in the ML community:

Given a set of d-dimensional data vectors (X1,…,Xn), Xi =
(x1,…,xd), build a model of the data to infer properties of the
underlying distribution (process) that generated the data.
Key problems:


Dimensionality reduction: developing algorithms that can reduce
a dataset of hundreds or thousands of dimensions to just a few
for visualization, while retaining as much of the “information” as
possible in the original dataset.
Clustering of data – outlier detection: Identifying trends and/or
anomalies in datasets.
Australian Virtual Observatory
Workshop 2003
12
1. Observe and explore interesting phenomena
Current ML approaches:

Independent Component Analysis –
decompose multivariate data with the aim of
producing components that are as
“statistically independent” as possible.
Related to PCA and factor analysis.

Gaussian mixture models for clustering –
uses a semi-parametric probability density
estimator that is trained iteratively on data.
Implements a “soft” version of k-means.
Australian Virtual Observatory
Workshop 2003
13
1. Observe and explore interesting phenomena

Self-organizing maps and topographic
mapping – similar to clustering but where the
cluster-centres are constrained to lie in a lowdimensional manifold (and so have a spatial
relationship).
Australian Virtual Observatory
Workshop 2003
14
3. Formulate model to explain phenomena
Problems here are typically referred to as
supervised learning in the ML community:

Given a training set of pairs of input and
output data vectors {(Xi,Yi),…,(Xn,Yn)}, where
the input values are thought to have some
influence on the corresponding output values,
build a model of the data that can predict the
outputs of unseen (test) inputs.
Key problems:

Regression, classification, forecasting.
Australian Virtual Observatory
Workshop 2003
15
3. Formulate model to explain phenomena
Established ML approaches:


Neural networks
Decision-trees and rule-based classifiers
Example applications in astronomy:



Star/galaxy classification – on the basis of
optical data.
Photometric redshift evaluation.
Noise identification and removal in
gravitational waves detectors.
Australian Virtual Observatory
Workshop 2003
16
Current/New Machine Learning Models
Current ML approaches:


Support vector machines – uses insights from
computation learning theory and geometry to produce
predictors with powerful discrimination and good
generalization.
Ensemble methods – improving the accuracy of
predictions by using multiple models and bootstrap
sampling.
Example: boosting – incrementally constructs an ensemble of
“weak” models, where each model is forced to concentrate
on the mistakes made by previous models.
Australian Virtual Observatory
Workshop 2003
17
Current/New Machine Learning Models

Gaussian Processes – use the machinery of
Bayesian inference to model data using
stochastic processes.
All information is represented as a probability
distribution.
Incorporates uncertainty associated with prior
information and predictions made.

Probabilistic graphical models – the model is
a probability distribution where dependencies
are explicitly encoded.
Generative models.
Australian Virtual Observatory
Workshop 2003
18
Current/New Problems in Machine Learning
Active learning:


Assume that data can be generated
(measured or labelled) on demand – build a
learning algorithm that learns on the basis of
self-selected data points (queries).
Aims to reduce the amount of data required,
training time of the model, or amount of data
that must be manually labelled.
Australian Virtual Observatory
Workshop 2003
19
Current/New Problems in Machine Learning
Semi-supervised learning:


Learning from both labelled (expensive, scarce) and
unlabelled (abundant, cheap) data.
Aims are similar to active learning.
Transductive learning:


All unlabelled data points belong to the test set.
The algorithm is able to take advantage of the spatial
distribution of the data that the real-world generates
(and that it will be tested on).
Australian Virtual Observatory
Workshop 2003
20
Current/New Problems in Machine Learning
Non-numerical (unreal!) data

Examples:
Strings – text, DNA, computer programs.
Trees – parse trees, XML, phylogenetic trees.
Graphs – organic molecules, go board positions.
Embedding unreal data in continuous space and
applying standard ML techniques has limitations.
Need to define appropriate kernels, distance
metrics over these data spaces.
Australian Virtual Observatory
Workshop 2003
21
Summary
Automated data mining plays an increasingly
important role in the scientific process.
Machine Learning techniques are driven by the
problems in data mining and provide some
effective solutions.
Machine Learning is a highly active area of
research – new & evolving algorithms.
The scope of learning problems is widening to
deal with important real-world data.
Australian Virtual Observatory
Workshop 2003
22
References
1.
2.
3.
4.
5.
6.
7.
E. Mjolsness and D. DeCoste. Machine Learning for Science:
State of the Art and Future Prospects. Science (293):2051-2055,
2001.
D. Hand et al. Principles of Data Mining. MIT Press, 2001.
T. Hastie et al. The Elements of Statistical Learning. Springer,
2001.
R. Tagliaferri et al. Neural networks in astronomy. Neural
Networks (16):297-319, 2003.
S. Harmeling et al. Kernel-based nonlinear blind source
separation. Neural Computation v15, n5, 1089-1124, 2003.
T. Graepel. Getting Real with Unreal Data: Lessons Learned and
the Way Ahead. NIPS Workshop on Unreal Data, 2002.
S. J. Hong and S. Weiss: Advances in predictive models for data
mining. Pattern Recognition Letters 22(1): 55-61 (2001).
Australian Virtual Observatory
Workshop 2003
23