When Computing Meets
Statistics
Trần Thế Truyền
Department of Computing
Curtin University of Technology
[email protected]
http://truyen.vietlabs.com
Content
• Introduction
• Probabilistic graphical models
• Statistical machine learning
• Applications
• Collaboration
Data as a starting point
• The ultimate goal is to make sense of data
– “It is a capital mistake to theorize before one has data.” (Sir Arthur Conan Doyle)
How big is the data?
• Google currently indexes 10^12 Web pages
  – At NIPS’09 they have shown how to estimate logistic regression for 10^8 documents
• The MIT dataset has 10^8 images
• 10^6 sentence pairs for machine translation
• The Netflix data has 10^8 entries
• Dimensions for language: typically 10^7; for bioinformatics: up to 10^12
Mathematics for data processing
• Statistics: probabilistic graphs, exponential family, kernels, Bayesian methods, non-parametrics, random processes, high-dimensional and abstract spaces, projection
• Linear algebra: Hilbert spaces, metric spaces, topology, differential geometry
• Information theory: entropy, mutual information, divergence, data compression, differential entropy, channel capacity
• Optimization: duality, sparsity, sub-modularity, linear programming, integer programming, non-convexity, combinatorics
Why does computing need statistics?
• The world is uncertain
• Making sense of data, e.g. sufficient statistics, clustering
• Convergence proofs
• Performance bounds
• Consistency
• Bayes optimality
• Confidence estimates
• Most probable explanation
• Symmetry breaking
• Randomness as a solution to NP-hard problems
What computing has to offer
• Massive data and computing power
• Computational algorithms
– Less memory
– Fast processing
– Dynamic programming
• Parallel processing
– Clusters
– GPUs
Conferences and Journals
• Most important and current results in computing are published in conferences, some followed by journal versions
• Relevant conferences:
  – AAAI/IJCAI
  – COLT/ICML/NIPS/KDD
  – UAI/AISTATS
  – CVPR/ICCV
  – ACL/COLING
• Relevant journals:
  – Machine Learning
  – Journal of Machine Learning Research
  – Neural Computation
  – Pattern Analysis and Machine Intelligence
  – Pattern Recognition
Content
• Introduction
• Probabilistic graphical models
• Statistical machine learning
• Applications
• Collaboration
Probabilistic graphical models
• Non-i.i.d. (not identically and independently distributed) data
• Variable dependencies
• Graph theory + probability theory
• Directed models
  – Markov chains
  – Hidden Markov models
  – Kalman filters
  – Bayesian networks
  – Dynamic Bayesian networks
  – Probabilistic neural networks
• Undirected models
  – Ising models
  – Markov random fields
  – Boltzmann machines
  – Factor graphs
  – Relational Markov networks
  – Markov logic networks
Representing variable dependencies using graphs
[Figure: causes linked to effects, directly and via hidden factors]
Directed graphs: decomposition
• Suitable to encode causality
• Domain knowledge can be expressed in conditional probability tables
• Graph must be acyclic
[Figure: example DAG with nodes A, B, C, D]
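For concreteness, assuming the figure's DAG has edges A→B, A→C, B→D and C→D (an assumption, since only the node labels survive extraction), the joint distribution decomposes into local conditionals:

$P(A, B, C, D) = P(A)\, P(B \mid A)\, P(C \mid A)\, P(D \mid B, C)$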
DAG examples: Markov chains
(a) Markov chain
(b) Hidden Markov model
(c) Hidden semi-Markov model
(d) Factorial hidden Markov model
DAG examples: Abstract hidden Markov models (Bui et al., 2002)
Some more DAG examples
(some borrowed from Bishop’s slides)
• Hidden Markov models
• Kalman filters
• Factor analysis
• Probabilistic principal component analysis
• Independent component analysis
• Probabilistic canonical correlation analysis
• Mixtures of Gaussians
• Probabilistic expert systems
• Sigmoid belief networks
• Hierarchical mixtures of experts
• Probabilistic Latent Semantic Indexing
• Latent Dirichlet Allocation
• Chinese restaurant processes
• Indian buffet processes
Undirected graphs: factorisation
• Suitable to encode correlation
• More flexible than directed graphs
• But lose the notion of causality
[Figure: example undirected graph with nodes A, B, C, D]
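As a matching sketch for the undirected case, assuming the same four nodes with edges A–B, A–C, B–D and C–D, the joint factorises into non-negative potentials over the edges, normalised by a partition function Z:

$P(A, B, C, D) = \frac{1}{Z}\, \psi(A, B)\, \psi(A, C)\, \psi(B, D)\, \psi(C, D)$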
Undirected graph examples: Markov random fields
• Label set: {‘Sky’, ‘Water’, ‘Animal’, ‘Car’, ‘Tree’, ‘Building’, ‘Street’}
[Figure: scene labelling on an image from LabelMe]
Undirected graph examples: Restricted Boltzmann machines
[Figure: hidden units h_1, h_2, h_3 fully connected to visible rating units r_1, ..., r_4 via weights w_{ik}]

$P(h_k = 1 \mid r) = \frac{1}{1 + \exp(-w_k - \sum_i w_{iks})}$, where $s = r_i$
$P(r_i = s \mid h) = \frac{1}{Z(i, h)} \exp\big(w_{i,s} + \sum_k w_{iks} h_k\big)$

• Useful to discover hidden aspects
• Can theoretically represent all binary distributions
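A minimal NumPy sketch of the two conditionals above, assuming binary hidden units and softmax (multinomial) visible rating units; the array names W, b_hidden and b_visible are illustrative, not taken from the slides:

```python
import numpy as np

def p_hidden_given_ratings(r, W, b_hidden):
    """P(h_k = 1 | r) for binary hidden units.

    r: (n_items,) integer ratings in {0, ..., S-1}
    W: (n_items, S, n_hidden) weights, one slice per rating value
    b_hidden: (n_hidden,) hidden biases
    """
    # Sum the weight slice selected by each observed rating value.
    activation = b_hidden + sum(W[i, r[i], :] for i in range(len(r)))
    return 1.0 / (1.0 + np.exp(-activation))      # element-wise sigmoid

def p_rating_given_hidden(i, h, W, b_visible):
    """P(r_i = s | h) as a softmax over the S possible rating values of item i."""
    logits = b_visible[i] + W[i] @ h              # (S,) unnormalised scores
    logits -= logits.max()                        # numerical stability
    e = np.exp(logits)
    return e / e.sum()                            # normalise by Z(i, h)
```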
Conditional independence
• Separator
• Markov blanket
Bad news
• Inference in general graphs is intractable
• Some problems reduce to combinatorial optimization
• Model selection is really hard!
  – There are exponentially many graphs of a given size
  – Each of them is likely to be intractable
Good news
• Chains and trees are easy to compute
• There exist good approximate algorithms
• Approximate methods are still very useful
Approximate inference
• Belief propagation
• Variational methods
• MCMC
Belief propagation
• Introduced by J. Pearl (1980s)
• A major breakthrough
– Guaranteed to converge for trees
– Good approximation for non-trees
• Related to statistical physics (Bethe & Kikuchi free-energies)
• Related to Turbo decoding
• Local operation, global effect
[Figure: messages from neighbours k and k' of node i are combined and passed on to node j]
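A minimal sketch of sum-product message passing on a chain (where it is exact), with illustrative unary and pairwise potential arrays; this is the tree-structured special case, not the general loopy algorithm:

```python
import numpy as np

def chain_marginals(unary, pairwise):
    """Sum-product belief propagation on a chain x_1 - x_2 - ... - x_T.

    unary: list of T arrays, unary[t][s] = psi_t(s)
    pairwise: list of T-1 matrices, pairwise[t][s, s'] = psi(x_t = s, x_{t+1} = s')
    Returns per-node marginals (exact on chains/trees).
    """
    T = len(unary)
    fwd = [None] * T   # messages passed left -> right
    bwd = [None] * T   # messages passed right -> left
    fwd[0] = np.ones_like(unary[0])
    bwd[T - 1] = np.ones_like(unary[T - 1])
    for t in range(1, T):
        m = (unary[t - 1] * fwd[t - 1]) @ pairwise[t - 1]
        fwd[t] = m / m.sum()                      # normalise for stability
    for t in range(T - 2, -1, -1):
        m = pairwise[t] @ (unary[t + 1] * bwd[t + 1])
        bwd[t] = m / m.sum()
    marginals = []
    for t in range(T):
        b = unary[t] * fwd[t] * bwd[t]
        marginals.append(b / b.sum())
    return marginals
```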
Variational methods
MCMC
• Metropolis-Hastings
• Gibbs/importance/slice sampling
• Rao-Blackwellisation
• Reversible jump MCMC
• Contrastive divergence
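A minimal Metropolis-Hastings sketch with a Gaussian random-walk proposal, assuming a user-supplied unnormalised log-density log_p; a toy illustration, not the samplers used in the talk's applications:

```python
import numpy as np

def metropolis_hastings(log_p, x0, n_samples, step=0.5, rng=None):
    """Random-walk Metropolis-Hastings for a 1-D unnormalised log-density."""
    rng = rng or np.random.default_rng(0)
    x, samples = x0, []
    lp_x = log_p(x)
    for _ in range(n_samples):
        x_new = x + step * rng.standard_normal()   # symmetric proposal
        lp_new = log_p(x_new)
        # Accept with probability min(1, p(x_new)/p(x)); the symmetric proposal cancels.
        if np.log(rng.random()) < lp_new - lp_x:
            x, lp_x = x_new, lp_new
        samples.append(x)
    return np.array(samples)

# Example: sample from a standard Gaussian (log-density up to a constant).
draws = metropolis_hastings(lambda x: -0.5 * x**2, x0=0.0, n_samples=5000)
```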
Content
• Introduction
• Probabilistic graphical models
• Statistical machine learning
• Applications
• Collaboration
Statistical machine learning
• (Mitchell, 2006):
  – How can we build computer systems that automatically improve with experience, and
  – What are the fundamental laws that govern all learning processes?
• More concerned with prediction performance on unseen data
  – Needs consistency guarantees
  – Needs error bounds
Statistical machine learning
• Inverse problems
• Supervised learning: regression/classification
• Unsupervised learning: density estimation/clustering
• Semi-supervised learning
• Manifold learning
• Transfer learning & domain adaptation
• Multi-task learning
• Gaussian processes
• Non-parametric Bayesian
Classifier example: naïve Bayes
[Figure: class node with labels {‘Sport’, ‘Social’, ‘Health’} generating observed word nodes]
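In equation form (notation illustrative: class c, observed words w_1, ..., w_n), the naïve Bayes decision rule is:

$\hat{c} = \arg\max_{c \in \{\text{Sport},\, \text{Social},\, \text{Health}\}} P(c) \prod_{j=1}^{n} P(w_j \mid c)$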
Classifier example: MaxEnt
• Maximum entropy principle: out of all distributions which are consistent with the data, select the one that has the maximum entropy (Jaynes, 1957)
• The solution (sketched below)
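The solution equation on the original slide did not survive extraction; the standard maximum-entropy (log-linear) solution, with features f_k and multipliers λ_k as generic notation, is:

$P(y \mid x) = \frac{1}{Z(x)} \exp\big(\sum_k \lambda_k f_k(x, y)\big), \qquad Z(x) = \sum_{y'} \exp\big(\sum_k \lambda_k f_k(x, y')\big)$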
Gaussian and Laplace priors
• Parameter estimation is an ill-posed problem
  – Needs regularisation theory
• Gaussian prior
• Laplace prior
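The prior formulas on the slide were lost; as a standard reconstruction (with parameters λ and illustrative scale constants σ and b), a zero-mean Gaussian prior corresponds to an L2 penalty and a Laplace prior to an L1 (sparsity-inducing) penalty:

$\text{Gaussian: } \log p(\lambda) = -\frac{\|\lambda\|_2^2}{2\sigma^2} + \text{const}, \qquad \text{Laplace: } \log p(\lambda) = -\frac{\|\lambda\|_1}{b} + \text{const}$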
Transfer learning
• Moving from one domain to another
  – There may be distribution shifts
• The goal is to use as little data as possible to estimate the second task
Multitask learning
• Multiple predictions based on a single dataset
• E.g., for each image, we want to do:
– Object recognition
– Scene classification
– Human and car detection
Open problems
• Many learning algorithms are not consistent
• Many performance bounds are not tight
• The dimensions are high, so feature selection is important
• Most data is unlabelled
• Structured data is pervasive, but most statistical methods assume i.i.d. data
Dealing with unlabelled data?
Content
• Data as a starting point
• Probabilistic graphical models
• Statistical machine learning
• Applications
• Collaboration
Applications
• Computational linguistics
  – Accent restoration
  – Language modelling
  – Statistical machine translation
  – Speech recognition
• Multimedia & computer vision
• Information filtering
  – Named Entity Recognition
  – Collaborative filtering
  – Web/Text classification
  – Ranking in search engines
Accent restoration
http://vietlabs.com/vietizer.html
Accented text: "Chiến thắng Real trong trận siêu kinh điển cuối tuần qua cũng như phong độ ấn tượng mùa này khiến HLV trẻ của Barca nhận được những lời tán tụng từ người nhà cũng như đông đảo các cổ động viên."
Accentless text: "Chien thang Real trong tran sieu kinh dien cuoi tuan qua cung nhu phong do an tuong mua nay khien HLV tre cua Barca nhan duoc nhung loi tan tung tu nguoi nha cung nhu dong dao cac co dong vien."
(Roughly: "The win over Real in last weekend's clásico, together with his impressive form this season, has earned Barca's young coach praise from inside the club as well as from a large number of fans.")
[Figure: chain-structured model linking accentless terms s_t (input) to accented words v_t (output)]

$P(v \mid s) = \frac{1}{Z(s)} \exp\Big(\sum_c \sum_k \lambda_k f_k(v_c, s)\Big)$
$Z(s) = \sum_{v \in V(s)} \exp\Big(\sum_c \sum_k \lambda_k f_k(v_c, s)\Big)$
Decoding using Nth-order hidden Markov models
• Accentless input: "cong hoa xa hoi chu nghia viet nam"
• Each syllable expands to its accented candidates, e.g. "cong" → {cong, còng, cóng, cõng, cọng, công, cồng, cống, cổng, cộng}
• The Viterbi path over the candidate lattice: "cộng hòa xã hội chủ nghĩa việt nam"
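A minimal Viterbi sketch over such a candidate lattice (first-order case), with hypothetical log_trans and log_emit scoring functions rather than the talk's actual model:

```python
def viterbi(candidates, log_trans, log_emit):
    """First-order Viterbi over a lattice of per-position candidates.

    candidates: list of lists, candidates[t] = accented options for syllable t
    log_trans(prev, cur): log transition score between adjacent choices
    log_emit(t, cur): log emission/unary score of choice `cur` at position t
    Returns the highest-scoring sequence of choices.
    """
    T = len(candidates)
    # delta[t][j]: best log-score ending in candidates[t][j]; back[t][j]: argmax pointer
    delta = [[log_emit(0, c) for c in candidates[0]]]
    back = [[None] * len(candidates[0])]
    for t in range(1, T):
        row, ptrs = [], []
        for cur in candidates[t]:
            scores = [delta[t - 1][i] + log_trans(prev, cur)
                      for i, prev in enumerate(candidates[t - 1])]
            best = max(range(len(scores)), key=scores.__getitem__)
            row.append(scores[best] + log_emit(t, cur))
            ptrs.append(best)
        delta.append(row)
        back.append(ptrs)
    # Trace back from the best final state.
    j = max(range(len(delta[-1])), key=delta[-1].__getitem__)
    path = [candidates[-1][j]]
    for t in range(T - 1, 0, -1):
        j = back[t][j]
        path.append(candidates[t - 1][j])
    return list(reversed(path))

# Toy usage with hypothetical scores (uniform transitions, preferring two candidates).
cands = [["cong", "công", "cộng"], ["hoa", "hòa", "hóa"]]
best = viterbi(cands,
               log_trans=lambda a, b: 0.0,
               log_emit=lambda t, c: 1.0 if c in ("cộng", "hòa") else 0.0)
print(best)   # ['cộng', 'hòa']
```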
Accent restoration (cont.)
• Online news corpus
– 426,000+ sentences for training
– 28,000+ sentences for testing
– 1,400+ accentless terms (compared to 10,000+ accented terms)
– 7,000+ unique unigrams
– 842,000+ unique bigrams
– 3,137,000+ unique trigrams
Language modelling
• This is the key to all linguistics problems

$P_n(v) = \prod_t P(v_t \mid v_{t-1}, \ldots, v_{t-n+1})$

• Most useful models are N-grams
  – Equivalent to (N-1)th-order Markov chains
  – Usually N = 3
  – Google offers N = 5 with multiple billions of entries
  – Smoothing is the key to dealing with data sparseness (see the sketch below)
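A minimal bigram language model sketch with add-one (Laplace) smoothing, purely illustrative; real systems use stronger smoothing (e.g. Kneser-Ney) and N up to 5:

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Count unigrams and bigrams from tokenised sentences."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        words = ["<s>"] + words + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def bigram_prob(w_prev, w, unigrams, bigrams):
    """P(w | w_prev) with add-one smoothing over the observed vocabulary."""
    V = len(unigrams)
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

# Toy usage with illustrative data.
uni, bi = train_bigram_lm([["when", "computing", "meets", "statistics"]])
print(bigram_prob("computing", "meets", uni, bi))
```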
Statistical machine translation
• Estimate P(Vietnamese unit | English unit)
– Usually, unit = sentence
• Current training size: 10^6 sentence pairs
• Statistical methods are the state of the art
  – Followed by major labs
  – Google translation services
SMT: source-channel approach
• P(V) is the language model of Vietnamese
• P(E|V) is the translation model from Vietnamese to English
• Subcomponents:
  – Translation table: from Vietnamese phrases to English phrases
  – Alignment: position distortion, syntax, idioms
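The decoding rule implied by this source-channel decomposition (Bayes' rule applied to P(V | E), using the slide's P(V) and P(E | V)):

$\hat{V} = \arg\max_{V} P(V \mid E) = \arg\max_{V} P(V)\, P(E \mid V)$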
SMT: maximum conditional entropy approach
• f is called a feature
  – f may be the estimate from the source-channel approach
Speech recognition
• Estimate P(words| sound signals)
• Usually the source-channel approach
– P(words) is the language model
  – P(sound|words) is called the ‘acoustic model’
• Hidden Markov models are the state of the art
[Figure: HMM with a start state, hidden states, an end state, and acoustic features]
Multimedia
• Mix of audio, video, text, user interaction, hyperlinks, context
• Social media
– Diffusion, random walks, Brownian motion
• Cross-modality
– Probabilistic canonical correlation analysis
Computer vision
• Scene labelling
• Face recognition
• Object recognition
• Video surveillance
Information filtering
Named entity recognition
Boltzmann machines for collaborative filtering
[Figure: hidden units h_1, h_2, h_3 connected to rating units r_1, ..., r_4 via weights w_{ik}, as in the earlier RBM slide]

$P(h_k = 1 \mid r) = \frac{1}{1 + \exp(-w_k - \sum_i w_{iks})}$, where $s = r_i$
$P(r_i = s \mid h) = \frac{1}{Z(i, h)} \exp\big(w_{i,s} + \sum_k w_{iks} h_k\big)$

• Boltzmann machines are one of the main methods in the $1M Netflix competition
• This is essentially the matrix completion problem
Ranking in search engines
Ranking in search engines (cont.)
• This is an object ordering problem
• We want to estimate the probability of a permutation
  – There are exponentially many permutations
  – Permutations are query-dependent
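One common way to define a tractable distribution over permutations (offered here as an illustration, not necessarily the model used in this work) is the Plackett-Luce model, which builds a ranking π of n items with positive scores φ_i by picking items one at a time:

$P(\pi) = \prod_{t=1}^{n} \frac{\phi_{\pi(t)}}{\sum_{u=t}^{n} \phi_{\pi(u)}}$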
Content
• Data as a starting point
• Probabilistic graphical models
• Statistical machine learning
• Applications
• Collaboration
Collaboration
• IMPCA: Institute for Multi-sensor Processing and Content Analysis
  – http://impca.cs.curtin.edu.au
  – Led by Prof. Svetha Venkatesh
• Some Vietnamese researchers
  – Phùng Quốc Định, [http://computing.edu.au/~phung/]
    • Probabilistic graphical models
    • Topic modelling
    • Non-parametric Bayesian
    • Multimedia
  – Phạm Đức Sơn, [http://computing.edu.au/~dsp/]
    • Statistical learning theory
    • Compressed sensing
    • Robust signal processing
    • Bayesian methods
  – Trần Thế Truyền, [http://truyen.vietlabs.com]
    • Probabilistic graphical models
    • Learning structured output spaces
    • Deep learning
    • Permutation modelling
Scholarships
• Master by research
  – 2+ years full-time, may upgrade to PhD
• PhD
  – 3+ years full-time
  – Strong background in maths and good programming skills
• Postdoc
  – 1-2 year contract
• Research fellows
  – 3-5 year contract
• Visiting scholars
  – 3-12 months
Discussion
• Collaboration mode