Challenges in Machine Learning and Data Mining
Tu Bao Ho, JAIST
Based on materials from:
1. 9 challenges in ML (Caruana & Joachims)
2. 10 challenging problems in DM (Yang & Wu)

What is machine learning?
* The goal of machine learning is to build computer systems that can adapt and learn from their experience (Tom Dietterich).
* A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E (Tom Mitchell's book, p. 2).
* ML problems can be formulated as:
  Given: (x1, y1), (x2, y2), ..., (xn, yn), where xi is a description of an object, phenomenon, etc., and yi is some property of xi (if yi is not available, the learning is unsupervised).
  Find: a function f such that f(xi) = yi.
* Learning means finding a hypothesis f in a huge hypothesis space F by narrowing the search with constraints (bias).

Overview of ML challenges
1. Generative vs. discriminative learning
2. Learning from non-vectorial data
3. Beyond classification and regression
4. Distributed data mining
5. Machine learning bottlenecks
6. Intelligible models
7. Combining learning methods
8. Unsupervised learning comes of age
9. More informed information access

1. Generative vs. discriminative methods
Training classifiers involves estimating f: X -> Y, or P(Y|X).
Examples: P(apple | red, round), P(noun | "cá")  (cá: fish, or to bet)

Generative classifiers
* Assume some functional form for P(X|Y) and P(Y).
* Estimate the parameters of P(X|Y) and P(Y) directly from training data, then use Bayes' rule to calculate P(Y|X = xi).
* Examples: Naïve Bayes, Gaussians, HMMs, Markov random fields, Bayesian networks, etc.

Discriminative classifiers
* Assume some functional form for P(Y|X).
* Estimate the parameters of P(Y|X) directly from training data.
* Examples: SVM, logistic regression, traditional neural networks, nearest neighbors, boosting, MEMM, conditional random fields, etc.
For a generative classifier, Bayes' rule gives

  P(Y | X) = P(X | Y) P(Y) / P(X)

and, for the example above,

  P(apple | red, round) = P(red, round | apple) P(apple) / P(red, round)

Generative approach
* Tries to build models for the underlying patterns.
* Can be learned, adapted, and generalized with small data.

Discriminative approach
* Tries to minimize a utility function (e.g., classification error), but not to model, represent, or "understand" the pattern explicitly: a detector may find 99.99% of faces in real images and still not "know" that a face has two eyes.
* Often needs large training data, say 100,000 labeled examples, and can hardly be generalized.

Generative vs. discriminative learning
Objective: determine which is better for what, and why.
Current:
* Discriminative learning (ANN, DT, kNN, SVM) is typically more accurate.
* Generative learning handles more complex problems and gives more flexible predictions.
* Vapnik: "When solving a problem, don't first solve a harder problem."
Key Challenges:
* Making generative methods computationally more feasible.
* Understanding when to prefer discriminative vs. generative methods.
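The generative route above, estimating P(X|Y) and P(Y) and then applying Bayes' rule, can be sketched numerically. Every probability below is invented for illustration; only the apple example itself comes from the slides.

```python
# Generative classification in miniature: assume forms for P(X|Y) and P(Y),
# estimate them, and obtain P(Y|X) via Bayes' rule.

def bayes_posterior(prior_y, likelihood_x_given_y, evidence_x):
    """P(Y | X) = P(X | Y) * P(Y) / P(X)."""
    return likelihood_x_given_y * prior_y / evidence_x

p_apple = 0.3                  # hypothetical P(apple)
p_red_round_given_apple = 0.8  # hypothetical P(red, round | apple)
p_red_round = 0.4              # hypothetical P(red, round)

p_apple_given_red_round = bayes_posterior(
    p_apple, p_red_round_given_apple, p_red_round)
print(round(p_apple_given_red_round, 6))  # 0.8 * 0.3 / 0.4 = 0.6
```

A discriminative method would instead fit P(apple | red, round) directly, without ever modeling how apples generate their features.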
* Discriminative learning is better with larger data and faster to train; generative learning (graphical models, HMM) is typically more flexible.

2. Kernel methods
Objective: learning from non-vectorial input data.
Current:
* Most learning algorithms work on flat, fixed-length feature vectors.
* Each new data type requires a new learning algorithm.
* It is difficult to handle strings, gene/protein sequences, natural language parse trees, graph structures, pictures, plots, ...
Key Challenges:
* One data interface for multiple learning methods.
* One learning method for multiple data types.
* Research is already in progress.

Kernel methods: the basic ideas
* A map φ: X -> F takes each point x of the input space X to a point φ(x) in a (usually higher-dimensional) feature space F; e.g., φ(x1, x2) = (x1, x2, x1·x2) maps R^2 into R^3. An inverse map φ^{-1} goes back from feature space to input space.
* A kernel function k: X × X -> R computes inner products in the feature space: k(xi, xj) = φ(xi) · φ(xj).
* The Gram (kernel) matrix K_{n×n} = {k(xi, xj)} collects these values; a kernel-based algorithm does all its computation on K, without ever constructing φ(x) explicitly.

Kernel methods: mathematical background
* Builds on linear algebra, probability/statistics, functional analysis, and optimization.
* Mercer's theorem: any positive definite function can be written as an inner product in some feature space.
* Kernel trick: use the kernel matrix instead of inner products in the feature space.
* Representer theorem: every minimizer of min_{f in H} C(f, {x_i, y_i}) + Ω(||f||^2_H) admits a representation of the form f(·) = Σ_{i=1..m} α_i k(·, x_i).

3. Beyond classification and regression
Objective: learning to predict complex objects.
Current:
* Most machine learning focuses on classification and regression.
* Discriminative methods often outperform generative methods.
* Generative methods are used for learning complex objects (e.g., language parsing, protein sequence alignment, information extraction).
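The Gram-matrix idea above can be sketched in a few lines. The RBF kernel is my choice for illustration (the slides do not fix a particular kernel); the point is that a Mercer kernel yields a symmetric positive semidefinite Gram matrix on which a kernel-based algorithm can operate directly.

```python
import numpy as np

# Kernel trick sketch: compute K[i, j] = k(x_i, x_j) instead of explicit
# feature vectors phi(x). Data points are arbitrary toy values.

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def gram_matrix(X, kernel):
    n = len(X)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j])
    return K

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
K = gram_matrix(X, rbf_kernel)

# Mercer's theorem in action: K is symmetric positive semidefinite.
print(np.allclose(K, K.T), np.all(np.linalg.eigvalsh(K) >= -1e-10))
```

Any learner written purely in terms of inner products (SVMs, kernel PCA, kernel ridge regression) can now run on strings, trees, or graphs simply by swapping in a kernel defined on those objects.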
What is structured prediction? (Daume)
* Structured prediction is a fundamental machine learning task involving classification or regression in which the output variables are mutually dependent or constrained.
* Such dependencies and constraints reflect sequential, spatial, or combinatorial structure in the problem domain, and capturing these interactions is often as important for prediction as capturing input-output dependencies.

3. Beyond classification and regression (continued)
Key Challenges:
* Extend discriminative methods (ANN, DT, kNN, SVM, ...) to more general settings. Examples: ranking functions (e.g., Google's top ten, ROC), natural language parsing, finite-state models.
* Find ways to directly optimize the desired performance criteria (e.g., ranking performance vs. error rate).

Structured prediction (SP): the machine learning task of generating outputs with complex internal structure.

What is structured prediction? (Lafferty)
* Text, sound, event logs, biological data, handwriting, gene networks, and linked data structures like the Web can be viewed as graphs connecting basic data elements.
* Important tasks involving structured data require computing a labeling for the nodes or the edges of the underlying graph. E.g., POS tagging of natural language text can be seen as labeling the nodes representing successive words with linguistic labels.
* A good labeling depends not just on individual nodes but also on the contents and labels of nearby nodes (the preceding and following words); thus the labels are not independent.

Figure: handwriting recognition, mapping a sequence of letter images x to the letter sequence y = "brace".
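POS-style sequence labeling of the kind described above is usually decoded with the Viterbi algorithm over a linear-chain model. Below is a minimal sketch; the two-tag set and all emission/transition scores are invented for illustration, not taken from any real tagger.

```python
import numpy as np

# Viterbi decoding for a linear-chain model (the structure behind
# HMM/CRF-style sequence labelers such as POS taggers).

def viterbi(emissions, transitions):
    """emissions: (T, K) log-scores; transitions: (K, K) log-scores.
    Returns the highest-scoring tag index sequence of length T."""
    T, K = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + transitions   # cand[i, j]: come from i, go to j
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + emissions[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):             # follow backpointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]

tags = ["noun", "verb"]
# "Thinking is being": emissions favor noun-verb-noun, transitions alternate.
em = np.log(np.array([[0.7, 0.3], [0.2, 0.8], [0.6, 0.4]]))
tr = np.log(np.array([[0.3, 0.7], [0.7, 0.3]]))
print([tags[i] for i in viterbi(em, tr)])  # ['noun', 'verb', 'noun']
```

The joint decoding is what makes this "structured": the best tag at each position depends on its neighbors through the transition scores, not on the emission scores alone.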
Labeling sequence data (structured learning / structured prediction)
* X is a random variable over data sequences; Y is a random variable over label sequences, whose labels are assumed to range over a finite label alphabet A.
* Problem: learn how to assign labels from a closed set Y to a data sequence X.
  Example: X: "Thinking is being" / Y: noun verb noun
* Figure: (a) an unstructured model predicts each label yt from xt alone; (b) a linear-structured model also connects yt-1, yt, yt+1.
* Applications: POS tagging and phrase typing (NLP); named entity recognition (IE); modeling protein sequences (computational biology); image segmentation and object recognition (PR); etc.

4. Distributed learning
Objective: DM/ML with distributed data.
Current:
* Most ML algorithms assume random access to all data.
* Often data comes from decentralized sources (e.g., sensor networks, multiple organizations, learning across firewalls and different security systems).
* Many projects are infeasible (e.g., an organization is not allowed to share its data).
Key Challenges:
* Develop methods for distributing data while preserving privacy.
* Develop methods for distributed learning without distributing the data.

5. Full auto: ML for the masses
Objective: make ML easier to apply to real problems.
Current:
* ML applications require detailed knowledge about the algorithms.
* Preparing/preprocessing the data takes at least 75% of the effort.
Key Challenges:
* Automatic selection of the machine learning method.
* Tools for preprocessing the data: reformatting, sampling, filling in missing values, outlier detection.
* Robust performance estimation and model selection.
* "Data Scoup".

7. Ensemble methods
Objective: combining learning methods automatically.
Current:
* Methods often achieve good prediction accuracy, but we do not have a single DM/ML method that "does it all".
* Results indicate that combining models yields large improvements in performance.
* The focus has been on boosting and bagging.
Key Challenges:
* Develop methods that combine the best of different learning algorithms.
* Searching for good combinations might be more efficient than designing one "global" learning algorithm.
* A theoretical explanation for why and when ensemble methods help.
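Bagging, one of the two ensemble methods named above, can be sketched compactly: train the same base learner on bootstrap resamples and combine by majority vote. The base learner here is a one-feature decision stump and the data are a tiny separable toy set, both invented for illustration.

```python
import numpy as np

# Bagging sketch: bootstrap resampling + majority vote over decision stumps.
rng = np.random.default_rng(0)

def fit_stump(X, y):
    """Pick the (feature, threshold, sign) stump with best training accuracy."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(sign * (X[:, j] - thr) >= 0, 1, 0)
                acc = (pred == y).mean()
                if best is None or acc > best[0]:
                    best = (acc, j, thr, sign)
    return best[1:]

def stump_predict(model, X):
    j, thr, sign = model
    return np.where(sign * (X[:, j] - thr) >= 0, 1, 0)

def bagging(X, y, n_models=15):
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample
        models.append(fit_stump(X[idx], y[idx]))
    return models

def vote(models, X):
    votes = np.mean([stump_predict(m, X) for m in models], axis=0)
    return (votes >= 0.5).astype(int)

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])
models = bagging(X, y)
print((vote(models, X) == y).mean())  # training accuracy of the ensemble
```

The point of the vote is variance reduction: an occasional stump trained on an unlucky bootstrap sample is simply outvoted by the majority.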
6. Interpretable models
Objective: make learning results more understandable.
Current:
* The prediction rule often appears complex and is difficult to verify.
* Lack of trust in the rule; lack of insight.
Key Challenges:
* Machine learning methods that are understandable and still generate accurate rules.
* Methods for generating explanations.
* Model verification for user acceptance.

8. Unsupervised learning
Objective: improve the state of the art in unsupervised learning.
Current:
* The research focus in the 90's was supervised learning.
* Much progress on supervised methods such as neural networks, support vector machines, boosting, etc.
* Unsupervised learning needs to "catch up".
Key Challenges:
* More robust and stable methods for clustering.
* Include user feedback in unsupervised learning (e.g., clustering with constraints, semi-supervised learning, transduction).
* Automatic distance metric learning.
* Clustering as an interactive data analysis process.

9. Information access
Objective: more informed information access.
Current:
* Bag-of-words representations.
* Retrieval functions exploit document structure and link structure.
* Information retrieval is a process without memory.
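Clustering, the workhorse of the unsupervised methods discussed above, in its simplest form is Lloyd's k-means. The two well-separated blobs below are synthetic data invented for the sketch.

```python
import numpy as np

# Minimal k-means (Lloyd's algorithm): alternate assignment and re-centering.
rng = np.random.default_rng(1)

def kmeans(X, k, iters=50):
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(iters):
        # Assign each point to its nearest center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # Move each center to the mean of its assigned points.
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return centers, assign

# Two well-separated Gaussian blobs.
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(5.0, 0.3, (20, 2))])
centers, assign = kmeans(X, 2)
# Points from the same blob should share a cluster label.
print(len(set(assign[:20])) == 1 and len(set(assign[20:])) == 1)
```

The "user feedback" challenge above amounts to constraining this loop, e.g., forcing must-link pairs into the same cluster, or learning the distance used in the assignment step.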
Key Challenges:
* Develop methods for exploiting usage data.
* Learn from the query history of a user / user group.
* Preserve privacy while mining access patterns.
* Exploit common access patterns and find "groups" of users.
* Web Expert: an agent that learns the web (beyond Google); topic modelling.

What is data mining?
"Data-driven discovery of models and patterns from massive observational data sets."
Its pillars: statistics and inference; languages and representations; data management; applications.

Overview of DM challenges (ICDM'05)
1. Developing a Unifying Theory of Data Mining
2. Scaling Up for High Dimensional Data/High Speed Streams
3. Mining Sequence Data and Time Series Data
4. Mining Complex Knowledge from Complex Data
5. Data Mining in a Network Setting
6. Distributed Data Mining and Mining Multi-agent Data
7. Data Mining for Biological and Environmental Problems
8. Data-Mining-Process Related Problems
9. Security, Privacy and Data Integrity
10. Dealing with Non-static, Unbalanced and Cost-sensitive Data

1. Developing a unifying theory of DM
* The current state of the art of data-mining research is too "ad hoc": techniques are designed for individual problems, and there is no unifying theory.
* Needs unifying research: exploration vs. explanation, deep research on long-standing theoretical issues.
* Example of a unifying concept: the VC dimension. In statistical learning theory (sometimes computational learning theory), the VC (Vapnik-Chervonenkis) dimension is a measure of the capacity of a statistical classification algorithm, defined as the cardinality of the largest set of points that the algorithm can shatter. The VC dimension of a (two-input) perceptron is 3.

2. Scaling up for high dimensional data and high speed streams
* Ultra-high dimensional classification problems (millions or billions of features, e.g., bio data).
* Ultra-high speed data streams: a continuous, online process; e.g., how to monitor network packets for intruders?
* Concept drift and environment drift.
* RFID network and sensor network data.
* How to avoid spurious correlations? Knowledge discovery of hidden causes, similar to the discovery of Newton's laws?
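The notion of "shattering" behind the VC-dimension example above can be checked numerically: a two-input perceptron (a line plus bias) realizes all 8 labelings of 3 non-collinear points, so its VC dimension is at least 3. The random search over weights is my simplification for the sketch, not a standard algorithm.

```python
import itertools
import numpy as np

# Shattering check: for each of the 2^3 labelings of three non-collinear
# points, search for perceptron weights (w, b) that realize it.
rng = np.random.default_rng(0)
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

def separable(labels, trials=5000):
    for _ in range(trials):
        w = rng.normal(size=2)
        b = rng.normal()
        pred = (points @ w + b > 0).astype(int)
        if np.array_equal(pred, labels):
            return True
    return False

shattered = all(separable(np.array(lab))
                for lab in itertools.product([0, 1], repeat=3))
print(shattered)  # True: all 8 labelings are linearly separable
```

Four points cannot all be shattered by a line (e.g., the XOR labeling of a square fails), which is why the VC dimension of this classifier is exactly 3.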
3. Mining sequence data and time series data
* How to efficiently and accurately cluster, classify, and predict the trends?
* How to make accurate short-term and long-term predictions?
* Time series data used for predictions are contaminated by noise, and signal processing techniques introduce lags in the filtered data, which reduces accuracy.
* Keys: source selection, domain knowledge in rules, and optimization methods.
* Figure: real time series data obtained from wireless sensors in the Hong Kong UST CS department hallway (excerpt from Jian Pei's tutorial, http://www.cs.sfu.ca/~jpei/).

4. Mining complex knowledge from complex data (complexly structured data)
* Many objects are not independent of each other and are not of a single type; the data are not i.i.d. (independent and identically distributed).
* Mine the rich structure of relations among objects (mining graphs), e.g., interlinked Web pages, social networks, metabolic networks in the cell.
* Integration of data mining and knowledge inference. The biggest gap: systems are unable to relate the results of mining to the real-world decisions they affect; all they can do is hand the results back to the user.
* More research on the interestingness of knowledge.

6. Distributed data mining and mining multi-agent data
* Need to correlate data seen at the various probes (such as in a sensor network).
* Adversary data mining: adversaries deliberately manipulate the data to sabotage the miner (e.g., make it produce false negatives); game theory may be needed for help.
* Example payoff matrix (player 1 is the miner; a matching-pennies game; picture from Matthew Pirretti's slides, Penn State):

              Player 2: H   Player 2: T
  Miner: H      (-1, 1)       (1, -1)
  Miner: T      (1, -1)       (-1, 1)

Figure: an example of packet streams (data courtesy of NCSA, UIUC).
7. Data mining for biological and environmental problems
* Biological data mining, such as HIV vaccine design.
* DNA, chemical properties, 3D structures, and functional properties need to be fused.
* The scale grows along the pipeline: genomics (25,000 genes), proteomics (2,000,000 proteins), metabolomics (3,000 metabolites).
* Environmental data mining; mining for solving the energy crisis.

5. Data mining in a network setting
* Community and social networks: linked data between emails, Web pages, blogs, citations, sequences, and people; static and dynamic structural behavior.
* Mining in and for computer networks: detect anomalies (e.g., sudden traffic spikes due to DoS (denial-of-service) attacks); need to handle 10-gigabit Ethernet links: (a) detect, (b) trace back, (c) drop packets.
* New problems raise new questions, and large-scale problems especially so.

8. Data-mining-process related problems
* How to automate the mining process? What is a canonical set of data mining operations, and how should they be composed?
* Data cleaning, with logging capabilities.
* Visualization and mining automation.
* Need a methodology to help users avoid many data mining mistakes.
* Data mining is the step in the KDD process consisting of methods that produce useful patterns or models from the data; it may account for 70-90% of the effort and cost in KDD, and in many cases KDD is viewed simply as data mining.
* The KDD process is inherently interactive and iterative:
  1. Understand the domain and define the problems.
  2. Collect and preprocess the data.
  3. Data mining: extract patterns/models.
  4. Interpret and evaluate the discovered knowledge.
  5. Put the results into practical use.

9. Security, privacy and data integrity
* How to ensure the users' privacy while their data are being mined?
* How to do data mining for the protection of security and privacy?
Knowledge integrity assessment
* Perturbation methods: each user adds a random number to a sensitive value before release. Example: Alice's age 30 becomes 65 (30 + 35). The miner receives only randomized records (e.g., 30 | 70K | ... becomes 65 | 20K | ..., and 50 | 40K | ... becomes 25 | 60K | ...), reconstructs the distributions of age and of salary, and runs the classification algorithm on the reconstructed distributions.
* Secure multi-party computation (SMC) methods.

10. Dealing with non-static, unbalanced and cost-sensitive data
* The UCI datasets are small and not highly unbalanced; real-world data are large (10^5 features), and often fewer than 1% of the examples belong to the useful (positive) class.
* There is much information on costs and benefits, but no overall model of profit and loss.
* Data may evolve, with a bias introduced by sampling. Example: a patient record where the temperature (39°C) is known but blood pressure, blood test, and cardiogram values are missing; each test incurs a cost, the data are extremely unbalanced, and the data change with time.
* Some papers can be found at http://www.jaist.ac.jp/~bao/K417-2008

Topic modeling
* How can a computer understand the meaning of documents, so that search becomes effective? Through the topics of the documents.
* Latent semantic analysis (Deerwester et al., 1990; Hofmann, 1999): represent each document in a Euclidean space whose dimensions are linear combinations of words (similar to PCA).
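The additive-noise perturbation idea above can be sketched directly. The ±35 noise range mirrors the 30 + 35 = 65 example from the slide; the population of ages is synthetic.

```python
import numpy as np

# Randomization/perturbation sketch: users release value + noise; because
# the noise distribution is known (zero-mean here), the miner can still
# recover aggregate statistics without seeing any true individual value.
rng = np.random.default_rng(42)

true_ages = rng.integers(18, 80, size=10_000)        # hypothetical population
noise = rng.uniform(-35, 35, size=true_ages.shape)   # known, zero-mean noise
perturbed = true_ages + noise                        # what the miner receives

# Aggregate reconstruction: E[perturbed] = E[age] + E[noise] = E[age].
print(abs(perturbed.mean() - true_ages.mean()) < 1.0)
```

Full distribution reconstruction (as in the slide's "reconstruct distribution of Age/Salary" step) uses the same principle but deconvolves the known noise density from the observed histogram rather than matching only the mean.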
Example: searching Google for documents related to the three topics "thực phẩm" (food), "mắm tôm" (shrimp paste), and "dịch bệnh" (epidemics) returns very many documents, with low precision and recall.

Latent semantic analysis: term-document matrix (documents D1-D6, query Q1)

          D1  D2  D3  D4  D5  D6  Q1
  rock     2   1   0   2   0   1   1
  granite  1   0   1   0   0   0   0
  marble   1   2   0   0   0   0   1
  music    0   0   0   1   2   0   0
  song     0   0   0   1   0   2   0
  band     0   0   0   0   1   0   0

Projected onto two latent dimensions:

          D1      D2      D3      D4      D5      D6      Q1
  dim1   -0.888  -0.759  -0.615  -0.961  -0.388  -0.851  -0.845
  dim2    0.460   0.652   0.789  -0.276  -0.922  -0.525   0.534

LSA factorizes the (normalized co-occurrence) words-by-documents matrix C into U (words × dims), a diagonal core (dims × dims), and V^T (dims × documents).

Topic modeling: key ideas (LDA; Blei et al., JMLR 2003)
* Each document is a mixture of topics; each topic is a probability distribution over words.
* Examples: "food" = {safe, vegetables, meat, fish, no food poisoning, no stomach ache, ...}; "shrimp paste" = {shrimp, salty, tofu, dog meat, pork offal, ...}; "epidemic" = {many people, emergency, hospital, medicine, vaccine, summer, ...}; D1 = {food 0.6, shrimp paste 0.35, epidemic 0.8}.

Latent Dirichlet allocation (LDA) model
* Dirichlet prior on the per-document topic distributions:
  p(θ | α) = (Γ(Σ_{i=1..k} α_i) / Π_{i=1..k} Γ(α_i)) · θ_1^{α_1 - 1} ⋯ θ_k^{α_k - 1}
* Joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w:
  p(θ, z, w | α, β) = p(θ | α) Π_{n=1..N} p(z_n | θ) p(w_n | z_n, β)
* Marginal distribution of a document, by integrating over θ and summing over z:
  p(w | α, β) = ∫ p(θ | α) Π_{n=1..N} Σ_{z_n} p(z_n | θ) p(w_n | z_n, β) dθ
* Probability of a collection: the product of the marginal probabilities of its single documents:
  p(D | α, β) = Π_{d=1..M} ∫ p(θ_d | α) Π_{n=1..N_d} Σ_{z_dn} p(z_dn | θ_d) p(w_dn | z_dn, β) dθ_d
* Plate notation: α is the Dirichlet parameter; θ_d are the per-document topic proportions; z_{d,n} is the per-word topic assignment; w_{d,n} is the observed word; β_t are the per-topic word proportions, with topic hyperparameter η. The plates repeat over the N words of a document, the M documents, and the T topics.

Example of learned topics: from 16,000 documents of the AP corpus with a 100-topic LDA model; each color codes a different factor (topic) from which the word is putatively generated.
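The LSA projection above can be reproduced with a truncated SVD of the slide's term-document matrix. Note that SVD component signs are arbitrary, so the recovered coordinates may differ in sign (though not in pattern) from the slide's dim1/dim2 rows.

```python
import numpy as np

# LSA sketch: factor the term-document matrix and keep 2 latent dimensions.
terms = ["rock", "granite", "marble", "music", "song", "band"]
A = np.array([  # columns: documents D1..D6 and the query Q1
    [2, 1, 0, 2, 0, 1, 1],
    [1, 0, 1, 0, 0, 0, 0],
    [1, 2, 0, 0, 0, 0, 1],
    [0, 0, 0, 1, 2, 0, 0],
    [0, 0, 0, 1, 0, 2, 0],
    [0, 0, 0, 0, 1, 0, 0],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_coords = Vt[:2].T            # documents/query in the 2-d latent space
term_coords = U[:, :2] * s[:2]   # terms in the same latent space
print(doc_coords.shape, term_coords.shape)  # (7, 2) (6, 2)
```

Similarity between the query Q1 (last row of `doc_coords`) and each document is then measured by cosine in this 2-d space, which is what lets Q1 match D2 and D3 despite sharing few literal words with them.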
Dirichlet-Lognormal (DLN) topic model (Than Quang Khoat, Ho Tu Bao, 2010)
Model description:
  β | α ~ Dirichlet(α)
  θ | μ, Σ ~ Lognormal(μ, Σ)
  z | θ ~ Multinomial(θ)
  w | z, β ~ Multinomial(f(β_z))
where the lognormal density is
  Pr(x_1, ..., x_n) = 1 / ((2π)^{n/2} |Σ|^{1/2} x_1 ⋯ x_n) · exp(-(1/2) (log x - μ)^T Σ^{-1} (log x - μ)),
with log x = (log x_1, ..., log x_n)^T.

Results (accuracy):

  Task                  DLN      LDA      SVM
  Spam classification   0.5937   0.4984   0.4945
  Predicting crime      0.2442   0.1035   0.2261

Traditional ML vs. transfer learning (P. Langley 06)
* Traditional ML learns each domain separately: one learning system per domain, trained on that domain's training items and evaluated on its test items.
* Transfer learning transfers knowledge learned in source domains to help learning in a target domain. Humans can learn in many domains, and can also transfer what they learn from one domain to other domains.

Notation
* Domain: consists of two components, a feature space and a marginal distribution. In general, if two domains are different, they may have different feature spaces or different marginal distributions.
* Task: given a specific domain, a task consists of a label space and a function that predicts, for each item in the domain, its corresponding label; equivalently, a conditional distribution. In general, if two tasks are different, they may have different label spaces or different conditional distributions.
* For simplicity, we consider at most two domains and two tasks: the source domain and the task in the source domain, and the target domain and the task in the target domain.