Challenges in Machine Learning and Data Mining
Tu Bao Ho, JAIST
Based on materials from:
1. 9 challenges in ML (Caruana & Joachims)
2. 10 challenging problems in DM (Yang & Wu)

What is machine learning?

The goal of machine learning is to build computer systems that
can adapt and learn from their experience (Tom Dietterich).

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E (Tom Mitchell's book, p. 2).

ML problems can be formulated as:
- Given: (x1, y1), (x2, y2), ..., (xn, yn)
  - xi is a description of an object, phenomenon, etc.
  - yi is some property of xi; if yi is not available, the learning is unsupervised
- Find: a function f(x) such that f(xi) = yi
- That is, find a hypothesis f in a huge hypothesis space F by narrowing the search with constraints (bias)
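The "find f such that f(xi) = yi" formulation can be made concrete with the simplest hypothesis class. A minimal 1-nearest-neighbor sketch (illustrative code, not from the slides; the data and function names are made up):

```python
import math

def nearest_neighbor_fit_predict(train, x):
    """Return the label y_i of the training pair (x_i, y_i) whose x_i is
    closest to x in Euclidean distance. This is 1-nearest-neighbor: the
    hypothesis f is defined implicitly by the training set itself."""
    def dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    xi, yi = min(train, key=lambda pair: dist(pair[0], x))
    return yi

# Toy data: (x1, y1), ..., (xn, yn) with yi a class label of xi.
data = [((0.0, 0.0), "blue"), ((0.1, 0.2), "blue"),
        ((1.0, 1.0), "red"), ((0.9, 1.1), "red")]
print(nearest_neighbor_fit_predict(data, (0.2, 0.1)))  # blue
print(nearest_neighbor_fit_predict(data, (0.8, 0.9)))  # red
```

Here the "bias" narrowing the search is the assumption that nearby points share labels.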
Overview of ML challenges
1. Generative vs. discriminative learning
2. Learning from non-vectorial data
3. Beyond classification and regression
4. Distributed data mining
5. Machine learning bottlenecks
6. Intelligible models
7. Combining learning methods
8. Unsupervised learning comes of age
9. More informed information access
1. Generative vs. discriminative methods

Training classifiers involves estimating f: X → Y, or P(Y|X).
Examples: P(apple | red ∧ round), P(noun | "cá") (cá: fish, to bet)

Generative classifiers:
- Assume some functional form for P(X|Y), P(Y)
- Estimate the parameters of P(X|Y), P(Y) directly from training data, and use Bayes rule to calculate P(Y|X = xi)
- Examples: HMM, Markov random fields, Bayesian networks, Gaussians, Naïve Bayes, etc.

Discriminative classifiers:
- Assume some functional form for P(Y|X)
- Estimate the parameters of P(Y|X) directly from training data
- Examples: SVM, logistic regression, traditional neural networks, nearest neighbors, boosting, MEMM, conditional random fields, etc.
The Bayes rule used by generative classifiers:

  P(Y | X) = P(X | Y) P(Y) / P(X)

  P(apple | red ∧ round) = P(red ∧ round | apple) P(apple) / P(red ∧ round)
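As a sketch of the generative route, the following toy Naive Bayes classifier (not from the slides; the data, counts, and class names are invented) estimates P(feature|Y) and P(Y) from counts and applies Bayes rule to obtain P(Y|X):

```python
from collections import Counter

# Toy labeled data: (set of observed features, class label).
data = [({"red", "round"}, "apple"), ({"red", "round"}, "apple"),
        ({"red"}, "apple"), ({"round"}, "tomato"),
        ({"red", "round"}, "tomato"), ({"red"}, "cherry")]

def generative_posterior(data, features):
    """Estimate P(Y) and P(feature|Y) from counts (Naive Bayes style),
    then apply Bayes rule: P(y|x) is proportional to P(x|y) P(y)."""
    classes = Counter(y for _, y in data)
    n = len(data)
    scores = {}
    for y, cy in classes.items():
        like = 1.0
        for f in features:
            cf = sum(1 for x, yy in data if yy == y and f in x)
            like *= (cf + 1) / (cy + 2)          # Laplace smoothing
        scores[y] = like * cy / n                # P(x|y) P(y)
    z = sum(scores.values())                     # P(x), total probability
    return {y: s / z for y, s in scores.items()} # P(y|x)

post = generative_posterior(data, {"red", "round"})
print(max(post, key=post.get))  # apple
```

Note the conditional-independence assumption across features, which is what makes the model "naive".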
Generative vs. discriminative methods

Generative approach:
- Tries to build models for the underlying patterns
- Can be learned, adapted, and generalized with small data

Discriminative approach:
- Tries to learn to minimize a utility function (e.g. classification error), but not to model, represent, or "understand" the pattern explicitly (it may detect 99.99% of faces in real images yet not "know" that a face has two eyes)
- Often needs large training data, say 100,000 labeled examples, and can hardly be generalized
Generative vs. discriminative learning
- Objective: determine which is better for what, and why
- Current:
  - Discriminative learning (ANN, DT, KNN, SVM) is typically more accurate, better with larger data, and faster to train
  - Generative learning (graphical models, HMM) is typically more flexible: more complex problems, more flexible predictions
  - Vapnik: "When solving a problem, don't first solve a harder problem"
- Key challenges:
  - Making generative methods computationally more feasible
  - When to prefer discriminative vs. generative

2. Kernel methods
- Objective: learning from non-vectorial input data
- Current:
  - Most learning algorithms work on flat, fixed-length feature vectors
  - Each new data type requires a new learning algorithm
  - Difficult to handle strings, gene/protein sequences, natural language parse trees, graph structures, pictures, plots, ...
- Key challenges:
  - One data-interface for multiple learning methods
  - One learning method for multiple data types
  - Research already in progress
Kernel methods: the basic ideas
- A map Φ: X → F takes the input space X (points x1, x2, ..., xn) to a feature space F, with inverse map Φ^(-1). Example: Φ: X ⊆ R^2 → H ⊆ R^3 with Φ(x1, x2) = (x1^2, x2^2, √2·x1·x2)
- A kernel function k: X × X → R computes inner products in the feature space: k(xi, xj) = Φ(xi)·Φ(xj)
- The Gram (kernel) matrix is K_{n×n} = {k(xi, xj)}; a kernel-based algorithm works on K (all computation is done on the kernel matrix)

Kernel methods: math background
- Linear algebra, probability/statistics, functional analysis, optimization
- Mercer theorem: any positive definite function can be written as an inner product in some feature space
- Kernel trick: use the kernel matrix instead of inner products in the feature space
- Representer theorem: every minimizer of min_{f∈H} C(f, {xi, yi}) + Ω(‖f‖_H) admits a representation of the form f(·) = Σ_{i=1}^m αi k(·, xi)
1
3. Beyond classification and regression
- Objective: learning to predict complex objects
- Current:
  - Most machine learning focuses on classification and regression
  - Discriminative methods often outperform generative methods
  - Generative methods are used for learning complex objects (e.g. language parsing, protein sequence alignment, information extraction)
- Key challenges:
  - Extend discriminative methods (ANN, DT, KNN, SVM, ...) to more general settings
  - Examples: ranking functions (e.g. Google top-ten, ROC), natural language parsing, finite-state models
  - Find ways to directly optimize desired performance criteria (e.g. ranking performance vs. error rate)

What is structured prediction? (Daume)
- Structured prediction is a fundamental machine learning task involving classification or regression in which the output variables are mutually dependent or constrained.
- Such dependencies and constraints reflect sequential, spatial or combinatorial structure in the problem domain, and capturing these interactions is often as important for the purposes of prediction as capturing input-output dependencies.
- Structured prediction (SP) is thus the machine learning task of generating outputs with complex internal structure.
What is structured prediction? (Lafferty)
- Text, sound, event logs, biological data, handwriting, gene networks, and linked data structures like the Web can be viewed as graphs connecting basic data elements.
- Important tasks involving structured data require the computation of a labeling for the nodes or the edges of the underlying graph. E.g., POS tagging of natural language text can be seen as the labeling of nodes representing the successive words with linguistic labels.
- A good labeling will depend not just on individual nodes but also on the contents and labels of nearby nodes, that is, the preceding and following words; thus, the labels are not independent. This is sequential structure.
[Figure: handwriting recognition example, input x an image of the handwritten word "brace", output y the character sequence]
Labeling sequence data problem (structured learning / structured prediction)
- X is a random variable over data sequences
- Y is a random variable over label sequences whose labels are assumed to range over a finite label alphabet A
- Problem: learn how to give labels from a closed set Y to a data sequence X
  Example: X: Thinking | is | being → Y: noun | verb | noun
- (a) In an unstructured model, each label yt depends only on its own input xt; (b) in a linear-structured model, neighboring labels yt-1, yt, yt+1 are linked as well
- Applications: POS tagging, phrase types, etc. (NLP); named entity recognition (IE); modeling protein sequences (CB); image segmentation, object recognition (PR); etc.
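Linear-structured models of this kind are typically decoded with dynamic programming. A minimal Viterbi sketch for the "Thinking is being" example (all probabilities below are invented for illustration; a real tagger would estimate them from a corpus):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable label sequence y for data sequence x under a
    first-order (linear-chain) model. Dynamic programming, O(T*|S|^2)."""
    V = [{y: start_p[y] * emit_p[y].get(obs[0], 1e-9) for y in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for y in states:
            best_prev = max(states, key=lambda yp: V[t - 1][yp] * trans_p[yp][y])
            V[t][y] = (V[t - 1][best_prev] * trans_p[best_prev][y]
                       * emit_p[y].get(obs[t], 1e-9))
            back[t][y] = best_prev
    last = max(states, key=lambda y: V[-1][y])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

states = ["noun", "verb"]
start_p = {"noun": 0.7, "verb": 0.3}
trans_p = {"noun": {"noun": 0.3, "verb": 0.7},
           "verb": {"noun": 0.8, "verb": 0.2}}
emit_p = {"noun": {"Thinking": 0.4, "being": 0.4, "is": 0.01},
          "verb": {"Thinking": 0.1, "is": 0.8, "being": 0.2}}
print(viterbi(["Thinking", "is", "being"], states, start_p, trans_p, emit_p))
# -> ['noun', 'verb', 'noun']
```

The transition probabilities are what couple neighboring labels; dropping them recovers the unstructured model.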
4. Distributed learning
- Objective: DM/ML with distributed data
- Current:
  - Most ML algorithms assume random access to all data
  - Often data comes from decentralized sources (e.g. sensor networks, multiple organizations, learning across firewalls, different security systems)
  - Many projects are infeasible (e.g. an organization not allowed to share data)
- Key challenges:
  - Develop methods for distributing data while preserving privacy
  - Develop methods for distributed learning without distributing the data

5. Full auto: ML for the masses
- Objective: make ML easier to apply to real problems
- Current:
  - ML applications require detailed knowledge about the algorithms
  - Preparing/preprocessing takes at least 75% of the effort
- Key challenges:
  - Automatic selection of the machine learning method
  - Tools for preprocessing the data: reformatting, sampling, filling in missing values, outlier detection
  - Robust performance estimation and model selection
  - "Data Scoup"

6. Interpretable models
- Objective: make learning results more understandable
- Current:
  - Methods often achieve good prediction accuracy
  - The prediction rule appears complex & is difficult to verify
  - Lack of trust in the rule; lack of insight
- Key challenges:
  - Machine learning methods that are understandable & generate accurate rules
  - Methods for generating explanations
  - Model verification for user acceptance

7. Ensemble methods
- Objective: combining learning methods automatically
- Current:
  - We do not have a single DM/ML method that "does it all"
  - Results indicate that combining models results in large improvements in performance
  - Focus on boosting and bagging
- Key challenges:
  - Develop methods that combine the best of different learning algorithms
  - Searching for good combinations might be more efficient than designing one "global" learning algorithm
  - Theoretical explanation for why and when ensemble methods help
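A minimal bagging sketch (illustrative only; the stump learner, data, and number of models are made up) showing how bootstrap resampling plus majority voting combines weak models:

```python
import random

def decision_stump(train):
    """Fit the best 1-feature threshold classifier (a very weak learner)."""
    best = None
    for f in range(len(train[0][0])):
        for thresh in {x[f] for x, _ in train}:
            for sign in (1, -1):
                correct = sum(1 for x, y in train
                              if (1 if sign * (x[f] - thresh) > 0 else -1) == y)
                if best is None or correct > best[0]:
                    best = (correct, f, thresh, sign)
    _, f, thresh, sign = best
    return lambda x: 1 if sign * (x[f] - thresh) > 0 else -1

def bagging(train, n_models=11, seed=0):
    """Train stumps on bootstrap resamples, predict by majority vote."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(train) for _ in train]
        models.append(decision_stump(sample))
    return lambda x: 1 if sum(m(x) for m in models) > 0 else -1

train = [((0.1, 0.9), -1), ((0.2, 0.8), -1), ((0.3, 0.2), -1),
         ((0.8, 0.3), 1), ((0.9, 0.7), 1), ((0.7, 0.6), 1)]
vote = bagging(train)
print(vote((0.85, 0.5)), vote((0.15, 0.5)))
```

Bagging mainly reduces variance; boosting, by contrast, reweights the training data to reduce bias as well.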
8. Unsupervised learning
- Objective: improve the state-of-the-art in unsupervised learning
- Current:
  - The research focus in the 90's was supervised learning
  - Much progress on supervised learning methods like neural networks, support vector machines, boosting, etc.
  - Unsupervised learning needs to "catch up"
- Key challenges:
  - More robust and stable methods for clustering
  - Include user feedback into unsupervised learning (e.g. clustering with constraints, semi-supervised learning, transduction)
  - Automatic distance metric learning
  - Clustering as an interactive data analysis process

9. Information access
- Objective: more informed information access
- Current:
  - Bag-of-words representations
  - Retrieval functions exploit document structure and link structure
  - Information retrieval is a process without memory
- Key challenges:
  - Develop methods for exploiting usage data
  - Learn from the query history of a user / user group
  - Preserving privacy while mining access patterns
  - Exploiting common access patterns and finding "groups" of users
  - Web Expert: an agent that learns the web (beyond Google)

What is data mining?
"Data-driven discovery of models and patterns from massive observational data sets"
[Diagram: data mining draws on statistics/inference, languages/representations, applications, and data management]
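The clustering challenges (more robust and stable methods) start from baselines like k-means. A minimal Lloyd's-algorithm sketch (illustrative; the 2-D points and seed are made up):

```python
import random

def kmeans(points, k, iters=20, seed=1):
    """Plain Lloyd's k-means: alternate assigning each point to its
    nearest centroid and re-estimating centroids as cluster means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        for i, cl in enumerate(clusters):
            if cl:  # keep old centroid if a cluster goes empty
                centroids[i] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centroids, clusters

points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),   # one tight group
          (5.0, 5.1), (5.2, 5.0), (5.1, 4.9)]   # another
centroids, clusters = kmeans(points, 2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

Its sensitivity to initialization and to the choice of k is precisely what the "robust and stable clustering" challenge targets.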
Overview of DM challenges (ICDM'05)
1. Developing a Unifying Theory of Data Mining
2. Scaling Up for High Dimensional Data/High Speed Streams
3. Mining Sequence Data and Time Series Data
4. Mining Complex Knowledge from Complex Data
5. Data Mining in a Network Setting
6. Distributed Data Mining and Mining Multi-agent Data
7. Data Mining for Biological and Environmental Problems
8. Data-Mining-Process Related Problems
9. Security, Privacy and Data Integrity
10. Dealing with Non-static, Unbalanced and Cost-sensitive Data

1. Developing a unifying theory of DM
- The current state of the art of data-mining research is too "ad-hoc": techniques are designed for individual problems, and there is no unifying theory
- Deep research is needed, e.g. on exploration vs. explanation
- Long standing theoretical issues:
  - How to avoid spurious correlations?
  - Knowledge discovery on hidden causes? Similar to the discovery of Newton's laws?
- Example: VC dimension. In statistical learning theory, or sometimes computational learning theory, the VC dimension (for Vapnik-Chervonenkis dimension) is a measure of the capacity of a statistical classification algorithm, defined as the cardinality of the largest set of points that the algorithm can shatter. (The VC dimension of the perceptron is 3.)

2. Scaling up for high dimensional data and high speed streams
- Ultra-high dimensional classification problems (millions or billions of features, e.g., bio data)
- Ultra-high speed data streams: a continuous, online process, e.g. how to monitor network packets for intruders?
- Concept drift and environment drift?
- RFID network and sensor network data
- Scaling up is needed; new problems raise new questions, and large scale problems especially so

3. Sequential and time series data
- How to efficiently and accurately cluster, classify and predict the trends?
- Time series data used for predictions are contaminated by noise:
  - How to do accurate short-term and long-term predictions?
  - Signal processing techniques introduce lags in the filtered data, which reduces accuracy
  - Key issues lie in source selection, domain knowledge in rules, and optimization methods
[Figure: real time series data obtained from wireless sensors in the Hong Kong UST CS department hallway]
(Excerpt from Jian Pei's tutorial, http://www.cs.sfu.ca/~jpei/)
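High-speed streams force single-pass, constant-memory computation. A sketch of online mean/variance tracking (Welford's algorithm; the packet-size numbers are invented) of the kind a stream monitor might use to flag anomalies:

```python
class StreamingStats:
    """Single-pass (online) mean/variance for data streams: each item
    is seen once, O(1) memory, the stream itself is never stored
    (Welford's algorithm)."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

s = StreamingStats()
for packet_size in [100, 102, 98, 101, 500]:  # 500: a possible anomaly
    s.update(packet_size)
print(round(s.mean, 2), round(s.variance(), 2))
```

A monitor could flag any arriving value many standard deviations from the running mean, though concept drift (a shifting traffic baseline) would require windowed or decayed statistics instead.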
4. Mining complex knowledge from complex data (complexly structured data)
- Mining graphs
- Data that are not i.i.d. (independent and identically distributed): many objects are not independent of each other, and are not of a single type
- Mine the rich structure of relations among objects, e.g.: interlinked Web pages, social networks, metabolic networks in the cell
- Integration of data mining and knowledge inference
  - The biggest gap: being unable to relate the results of mining to the real-world decisions they affect; all the systems can do is hand the results back to the user
- More research on the interestingness of knowledge

5. Data mining in a network setting
- Community and social networks: linked data between emails, Web pages, blogs, citations, sequences and people
- Static and dynamic structural behavior
- Mining in and for computer networks: detect anomalies (e.g., sudden traffic spikes due to DoS (Denial of Service) attacks)
- Need to handle 10Gig Ethernet links: (a) detect, (b) trace back, (c) drop packet
[Figure: an example of packet streams (data courtesy of NCSA, UIUC)]

6. Distributed data mining and mining multi-agent data
- Need to correlate the data seen at the various probes (such as in a sensor network)
- Adversary data mining: adversaries deliberately manipulate the data to sabotage the miners (e.g., make them produce false negatives)
- Game theory may be needed for help. Example payoff matrix for a matching-pennies-style game between Player 1 (the miner, actions H/T) and Player 2:

      Player 1 \ Player 2      H          T
      H                     (-1, 1)    (1, -1)
      T                     (1, -1)    (-1, 1)

  (Picture from Matthew Pirretti's slides, Penn State)

7. Data mining for biological and environmental problems
- Biological data mining, such as HIV vaccine design
- DNA, chemical properties, 3D structures, and functional properties need to be fused
- Environmental data mining; mining for solving the energy crisis
[Figure: genomics (25,000 genes) → proteomics (2,000,000 proteins) → metabolomics (3,000 metabolites)]
8. Data-mining-process related problems
- How to automate the mining process?
  - The composition of data mining operations
  - Data cleaning, with logging capabilities
  - Visualization and mining automation
- Need a methodology: help users avoid many data mining mistakes
- What is a canonical set of data mining operations?

Data mining is a step in the KDD process, consisting of methods that produce useful patterns or models from the data; maybe 70-90% of the effort and cost in KDD lies elsewhere. KDD is inherently interactive and iterative (in many cases KDD is simply viewed as data mining):
1. Understand the domain and define problems
2. Collect and preprocess data
3. Data mining: extract patterns/models
4. Interpret and evaluate the discovered knowledge
5. Put the results into practical use

9. Security, privacy and data integrity
- How to ensure the users' privacy while their data are being mined?
- How to do data mining for the protection of security and privacy?
- Knowledge integrity assessment
- Perturbation methods: e.g. add a random number to Alice's age, so age 30 becomes 65 (30 + 35); each record passes through a randomizer (30 | 70K | ... → 65 | 20K | ...), the distributions of age and salary are then reconstructed, and only afterwards is the classification algorithm run
- Secure Multi-Party Computation (SMC) methods
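A simplified sketch of the additive-perturbation idea (illustrative only; randomization-based privacy-preserving mining reconstructs full distributions, while here only the mean is recovered, and all data are synthetic):

```python
import random

def perturb(values, rng, spread=50.0):
    """Additive randomization: publish x + r with r ~ Uniform(-spread,
    spread). Individual records are masked (age 30 might appear as 65)."""
    return [x + rng.uniform(-spread, spread) for x in values]

noise_rng = random.Random(42)
data_rng = random.Random(7)
ages = [data_rng.gauss(35, 10) for _ in range(2000)]   # private values

published = perturb(ages, noise_rng)                   # what the miner sees

# The added noise has mean 0, so aggregate statistics survive:
true_mean = sum(ages) / len(ages)
est_mean = sum(published) / len(published)             # miner's estimate
print(abs(est_mean - true_mean) < 5.0)  # True: the mean is recoverable
```

The tension is visible even here: a larger `spread` protects individuals better but makes the miner's aggregate estimates noisier.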
10. Dealing with non-static, unbalanced and cost-sensitive data
- The UCI datasets are small and not highly unbalanced
- Real world data are large (10^5 features), but the useful classes (+'ve) make up less than 1%
- There is much information on costs and benefits, but no overall model of profit and loss
- Data may evolve with a bias introduced by sampling
[Example: medical diagnosis with tests such as temperature (39°C), blood pressure, blood test assay, cardiogram; each test incurs a cost, the data are extremely unbalanced, and the data change with time]
Some papers can be found at http://www.jaist.ac.jp/~bao/K417-2008
How can machines understand the meaning of texts? (original slide in Vietnamese)
- How can a computer understand the content of documents so that search becomes effective? (Làm sao máy tính hiểu được nội dung văn bản để tìm kiếm cho hiệu quả?)
- Idea: through the topics of the documents (thông qua chủ đề của văn bản)
- Latent semantic analysis (Deerwester et al., 1990; Hofmann, 1999): represent documents in a Euclidean space in which each dimension is a linear combination of words (similar to PCA).

Term-document matrix (documents D1-D6 and query Q1):

             D1  D2  D3  D4  D5  D6  Q1
  rock        2   1   0   2   0   1   1
  granite     1   0   1   0   0   0   0
  marble      1   2   0   0   0   0   1
  music       0   0   0   1   2   0   0
  song        0   0   0   1   0   2   0
  band        0   0   0   0   1   0   0

The same documents and query projected onto two latent dimensions:

         dim1     dim2
  D1   -0.888    0.460
  D2   -0.759    0.652
  D3   -0.615    0.789
  D4   -0.961   -0.276
  D5   -0.388   -0.922
  D6   -0.851   -0.525
  Q1   -0.845    0.534
( ik1 i ) 1 1
1  k k 1
k

(

)
 i 1
i
Dirichlet prior on the per-document topic distributions
 
z
wN
d
M
N
p ( , z, w  ,  )  p (  ) p ( zn  ) p ( wn zn ,  )
n 1
Joint distribution of topic mixture θ, a set of N topic z, a set of N words w
 N

p (w  ,  )   p (  )    p ( zn  ) p ( wn zn ,  ) d k
 n 1 zn

Marginal distribution of a document by integrating over θ and summing over z
M
 Nd

p ( D  ,  )    p ( d  )    p ( zdn  d ) p ( wdn zdn ,  ) d k d
d 1
 n 1 zdn

P b bilit off collection
Probability
ll ti b
by product
d t off marginal
gi l probabilities
b biliti off single
i gl d
documents
t

T
documents
mỗi văn bản là một mixture
của các chủ đề
mỗi chủ đề là một phân bố
xác suất trên các từ.

“thực phẩm” = {an toàn, rau,
thịt, cá, không ngộ độc,
không đau bụng …}
“mắm
ắ tôm” = {tôm, mặn, đậu
phụ, thịt chó, lòng lợn, …}
“dịch bệnh” = {nhiều người,
cấp cứu
cứu, bệnh viện
viện, thuốc
thuốc,
vacine, mùa hè, … }
D1= {thực phẩm 0.6, mắm
ị bệnh
ệ 0.8}}
tôm 0.35,, dịch
C
 From 16000
documents
of AP corpus
 100-topic
100 t i
LDA model.
 Each color
codes a
different
factor from
which the
word is
putatively
generated
documents


Latent Dirichlet Allocation (LDA)
Per-document
topic proportions

Topic
hyperparameter
yp p
Per-word
topic
p assignment
g
Dirichlet
parameter
d
Observed
word
Zd,n
dn
Example
p of topics
p learned

topics
Normalized cooccurrence matrix
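The generative story behind topic models can be sketched directly. A toy simulation (the topics, words, and proportions are invented; a full LDA would also draw the per-document proportions θ from the Dirichlet prior rather than fixing them):

```python
import random

# Hypothetical topics as word distributions (echoing the slide's
# food / epidemic examples, in English for brevity).
topics = {
    "food":     {"safe": 0.4, "vegetable": 0.3, "meat": 0.3},
    "epidemic": {"hospital": 0.5, "vaccine": 0.3, "summer": 0.2},
}

def sample_document(topic_props, length, rng):
    """LDA's generative story for one document: for each word position,
    draw a topic z_n from the document's topic proportions theta, then
    draw the word w_n from that topic's word distribution beta."""
    words = []
    for _ in range(length):
        z = rng.choices(list(topic_props), weights=topic_props.values())[0]
        dist = topics[z]
        w = rng.choices(list(dist), weights=dist.values())[0]
        words.append(w)
    return words

rng = random.Random(0)
doc = sample_document({"food": 0.7, "epidemic": 0.3}, 10, rng)
print(doc)
```

Inference in LDA runs this story in reverse: given only the words, recover the topic proportions and topic-word distributions (e.g. by variational inference or Gibbs sampling).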
Example (Thí dụ)
- Searching Google for documents related to the three topics "thực phẩm" (food), "mắm tôm" (shrimp paste) and "dịch bệnh" (epidemic) returns very many documents, with low precision and recall.
- Latent semantic analysis, matrix view: the words × documents count matrix C is approximated by a low-rank product (U: words × dims; V: dims × documents), as in SVD/PCA.
- Topic modeling, key idea (LDA; Blei et al., JMLR 2003): factor the words × documents matrix through an explicit topics layer (words × topics and topics × documents), where each topic is interpretable as a distribution over words.
Dirichlet-Lognormal (DLN) topic model (Than Quang Khoat, Ho Tu Bao, 2010)

Model description:

  θ | α ~ Dirichlet(α)
  β | μ, Σ ~ Lognormal(μ, Σ)
  z | θ ~ Multinomial(θ)
  w | β ~ Multinomial(f(β))

where the multivariate lognormal density is

  Pr(x1, ..., xn) = 1 / ((2π)^(n/2) |Σ|^(1/2) x1⋯xn) · exp(−(1/2) (log x − μ)^T Σ^(−1) (log x − μ)),
  with log x = (log x1, ..., log xn)^T.

Results (accuracy):
- Spam classification: DLN 0.5937, LDA 0.4984, SVM 0.4945
- Predicting crime: DLN 0.2442, LDA 0.1035, SVM 0.2261

Traditional ML vs. TL (P. Langley 06)
- Traditional ML in multiple domains: a separate learning system is trained on the training items of each domain and applied to that domain's test items.
- Transfer of learning across domains: knowledge learned in one domain is carried over to help learning in other domains.
- Humans can learn in many domains; humans can also transfer from one domain to other domains.
Traditional ML vs. TL
- Learning process of traditional ML: in each domain, a learning system is trained on that domain's training items.
- Learning process of transfer learning: knowledge extracted by the learning system in the source domain is reused by the learning system in the target domain.
- Domain: consists of two components, a feature space X and a marginal distribution P(X). In general, if two domains are different, then they may have different feature spaces or different marginal distributions.
- Task: given a specific domain and a label space Y, for each x in the domain, predict its corresponding label y. In general, if two tasks are different, then they may have different label spaces or different conditional distributions P(Y|X).
- Notation: for simplicity, we only consider at most two domains and two tasks: a source domain with its task, and a target domain with its task.