When Computing Meets Statistics
Trần Thế Truyền
Department of Computing, Curtin University of Technology
[email protected]
http://truyen.vietlabs.com

Content
• Introduction
• Probabilistic graphical models
• Statistical machine learning
• Applications
• Collaboration

Data as a starting point
• The ultimate goal is to make sense of data
  – “It is a capital mistake to theorize before one has data.” (Sir Arthur Conan Doyle)

How big is the data?
• Google currently indexes $10^{12}$ Web pages
  – At NIPS’09, they showed how to estimate logistic regression on $10^{8}$ documents
• The MIT dataset has $10^{8}$ images
• $10^{6}$ sentence pairs for machine translation
• The Netflix data has $10^{8}$ entries
• Dimensions: typically $10^{7}$ for language, up to $10^{12}$ for bioinformatics

Mathematics for data processing
• Statistics
  – Probabilistic graphs
  – Exponential family
  – Kernels
  – Bayesian
  – Non-parametric
  – Random processes
  – High dimensional
• Abstract spaces
  – Projection
  – Linear algebra
  – Hilbert spaces
  – Metric spaces
  – Topology
  – Differential geometry
• Information theory
  – Entropy
  – Mutual information
  – Divergence
  – Data compression
  – Differential entropy
  – Channel capacity
• Optimization
  – Duality
  – Sparsity
  – Sub-modularity
  – Linear programming
  – Integer programming
  – Non-convexity
  – Combinatorics

Why does computing need statistics?
• The world is uncertain
• Making sense of data, e.g.
sufficient statistics, clustering
• Convergence proof
• Performance bound
• Consistency
• Bayes optimal
• Confidence estimate
• Most probable explanation
• Symmetry breaking
• Randomness as a solution to NP-hard problems

What computing has to offer
• Massive data and computing power
• Computational algorithms
  – Less memory
  – Fast processing
  – Dynamic programming
• Parallel processing
  – Clusters
  – GPUs

Conferences and Journals
• The most important and current results in computing are published in conferences, some followed by journal versions
• Relevant conferences:
  – AAAI/IJCAI
  – COLT/ICML/NIPS/KDD
  – UAI/AISTATS
  – CVPR/ICCV
  – ACL/COLING
• Relevant journals:
  – Machine Learning
  – Journal of Machine Learning Research
  – Neural Computation
  – Pattern Analysis and Machine Intelligence
  – Pattern Recognition

Probabilistic graphical models
• For data that is not independent and identically distributed (non-i.i.d.)
• Variable dependencies
• Graph theory + Probability theory
• Directed models
  – Markov chains
  – Hidden Markov models
  – Kalman filters
  – Bayesian networks
  – Dynamic Bayesian networks
  – Probabilistic neural networks
• Undirected models
  – Ising models
  – Markov random fields
  – Boltzmann machines
  – Factor graphs
  – Relational Markov networks
  – Markov logic networks

Representing variable dependencies using graphs
[Figure labels: Causes, Effects, Hidden factors]

Directed graphs: decomposition
• Suitable to encode causality
• Domain knowledge can be expressed in conditional probability tables
• Graph must be acyclic
[Figure: example graph over nodes A, B, C, D]

DAG examples: Markov chains
[Figure panels: (a) Markov chain, (b) Hidden Markov model, (c) Hidden semi-Markov model, (d) Factorial hidden Markov model]

DAG examples: Abstract hidden Markov models (Bui et al, 2002)

Some more DAG examples (some borrowed from Bishop’s slides)
• Hidden Markov models
• Kalman filters
• Factor analysis
• Probabilistic principal component analysis
• Independent
component analysis
• Probabilistic canonical correlation analysis
• Mixtures of Gaussians
• Probabilistic expert systems
• Sigmoid belief networks
• Hierarchical mixtures of experts
• Probabilistic Latent Semantic Indexing
• Latent Dirichlet Allocation
• Chinese restaurant processes
• Indian buffet processes

Undirected graphs: factorisation
• Suitable to encode correlation
• More flexible than directed graphs
• But they lose the notion of causality
[Figure: example graph over nodes A, B, C, D]

Undirected graph examples: Markov random fields
[Figure: image from LabelMe with the label set {‘Sky’, ‘Water’, ‘Animal’, ‘Car’, ‘Tree’, ‘Building’, ‘Street’}; {124}]

Undirected graph examples: Restricted Boltzmann machines
[Figure: bipartite graph with hidden units h1, h2, h3 and visible units r1, r2, r3, r4]

$P(h_k = 1 \mid r) = \frac{1}{1 + \exp\left(-w_k - \sum_i w_{iks}\right)}$

$P(r_i \mid h) = \frac{1}{Z(i, h)} \exp\left(w_{is} + \sum_k w_{iks} h_k\right)$

where $s = r_i$
• Useful to discover hidden aspects
• Can theoretically represent all binary distributions

Conditional independence
• Separator
• Markov blanket

Bad news
• Inference in general graphs is intractable
• Some inference problems reduce to combinatorial optimization
• Model selection is really hard!
  – There are exponentially many graphs of a given size
  – Each of them is likely to be intractable

Good news
• Chains and trees are easy to compute
• Good approximate algorithms exist
• Approximate methods are still very useful

Approximate inference
• Belief propagation
• Variational methods
• MCMC

Belief propagation
• Introduced by J. Pearl (1980s)
• A major breakthrough
  – Guaranteed to converge for trees
  – Good approximation for non-trees
• Related to statistical physics (Bethe & Kikuchi free energies)
• Related to Turbo decoding
• Local operation, global effect
[Figure: message passing among nodes i, j, k, k’]

Variational methods

MCMC
• Metropolis-Hastings
• Gibbs/importance/slice sampling
• Rao-Blackwellisation
• Reversible jump MCMC
• Contrastive divergence ?
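To make the MCMC family listed above more concrete, here is a minimal sketch of random-walk Metropolis-Hastings. It is only an illustration under stated assumptions: the target (a one-dimensional standard normal), the Gaussian proposal, the step size and all function names are hypothetical choices, not taken from the slides.

```python
import math
import random


def metropolis_hastings(log_target, x0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings for a 1-D unnormalised log density.

    All names and defaults here are illustrative assumptions.
    """
    rng = random.Random(seed)
    x = x0
    log_p = log_target(x)
    samples = []
    for _ in range(n_samples):
        # Symmetric Gaussian proposal centred on the current state.
        proposal = x + rng.gauss(0.0, step)
        log_p_new = log_target(proposal)
        # Accept with probability min(1, p(proposal) / p(x)).
        # 1.0 - rng.random() lies in (0, 1], so the log is always defined.
        if math.log(1.0 - rng.random()) < log_p_new - log_p:
            x, log_p = proposal, log_p_new
        # A rejected proposal repeats the current state in the chain.
        samples.append(x)
    return samples


def log_standard_normal(x):
    # Log density of N(0, 1) up to an additive constant.
    return -0.5 * x * x


samples = metropolis_hastings(log_standard_normal, x0=0.0, n_samples=20000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"sample mean = {mean:.3f}, sample variance = {var:.3f}")
```

Only the ratio of target densities enters the acceptance test, so the normalising constant is never needed. This is what makes the method attractive for undirected models, where the partition function is intractable.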
Statistical machine learning
• (Mitchell, 2006):
  – How can we build computer systems that automatically improve with experience, and
  – What are the fundamental laws that govern all learning processes?
• More concerned with prediction performance on unseen data
  – Need consistency guarantees
  – Need error bounds

Statistical machine learning
• Inverse problems
• Supervised learning: regression/classification
• Unsupervised learning: density estimation/clustering
• Semi-supervised learning
• Manifold learning
• Transfer learning & domain adaptation
• Multi-task learning
• Gaussian processes
• Non-parametric Bayesian

Classifier example: naïve Bayes
[Figure: class variable over {‘Sport’, ‘Social’, ‘Health’} linked to Words]

Classifier example: MaxEnt
• Maximum entropy principle: out of all distributions that are consistent with the data, select the one that has the maximum entropy (Jaynes, 1957)
• The solution

Gaussian and Laplace priors
• Parameter estimation is an ill-posed problem
  – Needs regularisation theory
• Gaussian prior
• Laplace prior

Transfer learning
• Moving from one domain to another
  – There may be distribution shifts
• The goal is to use as little data as possible to estimate the second task

Multitask learning
• Multiple predictions based on a single dataset
• E.g., for each image, we want to do:
  – Object recognition
  – Scene classification
  – Human and car detection

Open problems
• Many learning algorithms are not consistent
• Many performance bounds are not tight
• The dimensions are high, thus feature selection is important
• Most data is unlabelled
• Structured data is pervasive, but most statistical methods assume i.i.d. data

Dealing with unlabelled data ?
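The naïve Bayes classifier example earlier in this section (a topic in {‘Sport’, ‘Social’, ‘Health’} predicted from words) can be sketched as follows. This is a minimal multinomial naïve Bayes with add-one smoothing. The toy documents, the smoothing constant and the function names are all hypothetical and serve only to illustrate the idea.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, alpha=1.0):
    """Count class and word frequencies from (label, words) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, words in docs:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab, alpha


def predict(model, words):
    """Return the label maximising log P(label) + sum_w log P(w | label)."""
    class_counts, word_counts, vocab, alpha = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / n_docs)
        for w in words:
            if w not in vocab:
                continue  # ignore words never seen in training
            # Add-alpha smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][w] + alpha) / (total + alpha * len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy corpus, for illustration only.
toy_docs = [
    ("Sport", ["match", "goal", "team"]),
    ("Sport", ["team", "coach", "win"]),
    ("Social", ["party", "friends", "family"]),
    ("Health", ["doctor", "diet", "exercise"]),
    ("Health", ["hospital", "doctor", "vaccine"]),
]
model = train_naive_bayes(toy_docs)
print(predict(model, ["team", "goal"]))     # Sport
print(predict(model, ["doctor", "diet"]))   # Health
```

The "naïve" assumption is that words are conditionally independent given the class, which is why the score is a plain sum of per-word log probabilities.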
Applications
• Computational linguistics
  – Accent restoration
  – Language modelling
  – Statistical machine translation
  – Speech recognition
• Multimedia & computer vision
• Information filtering
  – Named Entity Recognition
  – Collaborative filtering
  – Web/Text classification
  – Ranking in search engines

Accent restoration
http://vietlabs.com/vietizer.html
Accented text: Chiến thắng Real trong trận siêu kinh điển cuối tuần qua cũng như phong độ ấn tượng mùa này khiến HLV trẻ của Barca nhận được những lời tán tụng từ người nhà cũng như đông đảo các cổ động viên.
Accentless text: Chien thang Real trong tran sieu kinh dien cuoi tuan qua cung nhu phong do an tuong mua nay khien HLV tre cua Barca nhan duoc nhung loi tan tung tu nguoi nha cung nhu dong dao cac co dong vien.
[Figure: chain-structured model over $s_{t-1}, s_t$ and $v_{t-1}, v_t$; figure labels: accents, accentless terms]

$P(v \mid s) = \frac{1}{Z(s)} \exp\left(\sum_c \sum_k \lambda_k f_k(v_c, s)\right)$

$Z(s) = \sum_{v \in V(s)} \exp\left(\sum_c \sum_k \lambda_k f_k(v_c, s)\right)$

Decoding using Nth-order hidden Markov models
Candidates for each term of the accentless input “cong hoa xa hoi chu nghia viet nam”:
• cong: cong còng cóng cõng cọng công cồng cống cổng cộng
• hoa: hoa hòa hóa hỏa họa
• xa: xa xà xá xả xã xạ
• hoi: hoi hói hỏi hôi hồi hối hội hơi hời hới hỡi hợi
• chu: chu chú chủ chư chừ chứ chử chữ
• nghia: nghía nghĩa
• viet: việt viết
• nam: nam nám nạm năm nằm nắm nầm nấm nậm
The Viterbi path: “cộng hòa xã hội chủ nghĩa việt nam”

Accent restoration (cont.)
• Online news corpus
  – 426,000+ sentences for training
  – 28,000+ sentences for testing
  – 1,400+ accentless terms
    • compared to 10,000+ accented terms
  – 7,000+ unique unigrams
  – 842,000+ unique bigrams
  – 3,137,000+ unique trigrams

Language modelling
• This is the key to all linguistic problems

$P_n(v) = \prod_t P(v_t \mid v_{t-1}, \ldots, v_{t-n+1})$

• The most useful models are N-grams
  – Equivalent to (N-1)th-order Markov chains
  – Usually N=3
  – Google offers N=5 with billions of entries
  – Smoothing is the key to dealing with data sparseness

Statistical machine translation
• Estimate P(Vietnamese unit | English unit)
  – Usually, unit = sentence
• Current training size: $10^{6}$ sentence pairs
• Statistical methods are the state of the art
  – Pursued by major labs
  – Google translation services

SMT: source-channel approach
• P(V) is the language model of Vietnamese
• P(E|V) is the translation model from Vietnamese to English
• Subcomponents:
  – Translation table: from Vietnamese phrases to English phrases
  – Alignment: position distortion, syntax, idioms

SMT: maximum conditional entropy approach
• f is called a feature
  – f may be the estimate from the source-channel approach

Speech recognition
• Estimate P(words | sound signals)
• Usually the source-channel approach
  – P(words) is the language model
  – P(sound | words) is called the ‘acoustic model’
• Hidden Markov models are the state of the art
[Figure labels: Hidden states, Start state, End state, Acoustic features]

Multimedia
• Mix of audio, video, text, user interaction, hyperlinks, context
• Social media
  – Diffusion, random walks, Brownian motion
• Cross-modality
  – Probabilistic canonical correlation analysis

Computer vision
• Scene labelling
• Face recognition
• Object recognition
• Video surveillance

Information filtering

Named entity recognition

Boltzmann machines for collaborative filtering
[Figure: bipartite graph with hidden units h1, h2, h3 and visible units r1, r2, r3, r4]

$P(h_k = 1 \mid r) = \frac{1}{1 + \exp\left(-w_k - \sum_i w_{iks}\right)}$

$P(r_i \mid h) = \frac{1}{Z(i, h)} \exp\left(w_{is} + \sum_k w_{iks} h_k\right)$

where $s = r_i$
• Boltzmann machines are one of the main methods in the $1 million Netflix competition
• This is essentially the matrix completion problem

Ranking in search engines

Ranking in search engines (cont.)
• This is an object ordering problem
• We want to estimate the probability of a permutation
  – There are exponentially many permutations
  – Permutations are query-dependent

Collaboration
• IMPCA: Institute for Multi-sensor Processing and Content Analysis
  – http://impca.cs.curtin.edu.au
  – Led by Prof. Svetha Venkatesh
• Some Vietnamese researchers
  – Phùng Quốc Định, [http://computing.edu.au/~phung/]
    • Probabilistic graphical models
    • Topic modelling
    • Non-parametric Bayesian
    • Multimedia
  – Phạm Đức Sơn, [http://computing.edu.au/~dsp/]
    • Statistical learning theory
    • Compressed sensing
    • Robust signal processing
    • Bayesian methods
  – Trần Thế Truyền, [http://truyen.vietlabs.com]
    • Probabilistic graphical models
    • Learning structured output spaces
    • Deep learning
    • Permutation modelling

Scholarships
• Master’s by research
  – 2+ years full, may upgrade to PhD
• PhD
  – 3+ years full
  – Strong background in maths and good programming skills
• Postdoc
  – 1-2 year contract
• Research fellows
  – 3-5 year contract
• Visiting scholars
  – 3-12 months

Discussion
• Collaboration mode