Machine Learning for Information Visualization
Guy Lebanon, Fei Sha
VizWeek 2010 Tutorial, Part I

intro | dim-red | patterns | clustering | regression | classification | SSL+AL | VA

Tutorial Part I: Outline
Session 1 (Guy Lebanon)
- Introduction to Machine Learning
- Dimensionality Reduction
- Pattern Discovery
- Clustering
- Classification
- Regression
- Semisupervised and active learning
- Visual analytics: interactivity and domain knowledge

Lebanon and Sha, ML for Information Visualization

Tutorial Part II: Outline
Session 2 (Fei Sha)
- Parameter Estimation
  - maximum likelihood
  - Bayesian inference
- Model Evaluation and Selection
  - identify and prevent overfitting
  - validation, cross-validation and regularization
- Advanced Techniques
  - identify hidden patterns and structures
  - kernel PCA and manifold learning
  - Latent Dirichlet Allocation
  - models for sequential and temporal data

Introduction: Outline
Four questions
- What is machine learning?
- What are possible applications?
- What is its relation to other fields?
- How can it help visualization?

What is machine learning?
- A computer program whose behavior evolves based on empirical data (Wikipedia)
- A computer program that learns from experience E in order to improve its performance P on a task T (Tom Mitchell)
  - experience E: images, text, sensor measurements, biological data
  - task T: estimating probabilities, predicting object label, dimensionality reduction, clustering
  - performance P: probability of success, money/time saved
- Specific applications?

What are possible applications?
- Spam filtering in email
- Face recognition in images
- Fraud detection (credit card transactions)
- Web search (Google, Bing)
- Recommendation systems (Amazon, Netflix)
- Machine translation, e.g., English ⇒ Chinese
- Speech recognition
- Information visualization?
- What about statistics, AI?

What is its relation to other fields?
Closely related scientific disciplines:
- Statistics: emphasis on math, asymptotics
- AI: emphasis on computer systems designed by hand
- Data Mining: emphasis on large datasets, efficient computation, and practical applicability
- ML: applies statistics to large datasets using computers. It sits between data mining and statistics, and differs from AI in learning from experience/data rather than being taught by an expert
- Each area has its own community, conferences, journals
- Statistics ≺ Artificial Intelligence ≺ Machine Learning ≺ Data Mining
- Is it useful for visualization?

How can ML help visualization?
- Embed high dimensional data in two or three dimensions for easy visualization
- Discover unknown patterns between data attributes
- Reduce massive data to a small set of coherent clusters
- Identify irrelevant dimensions or features
- Model $P(Y \mid X)$ for (i) understanding dependencies between $X$, $Y$ and (ii) stratified visualization
- Theoretical framework for introducing interactivity and domain knowledge to visualization

Taxonomy of data
A data taxonomy helps abstract algorithm design from specific datasets.
type              | example                                  | distribution example
categoric atom    | a word in the English language           | multinomial
numeric atom      | temperature measurement                  | normal
ordinal atom      | preference A ≺ B ≺ C                     | Mallows model
unordered atoms   | vital signs (pulse, heart rate, etc.)    | multivariate normal
unordered atoms   | demographic information                  | loglinear models
1-D ordered atoms | financial time series                    | Gaussian process
2-D ordered atoms | binary image                             | MRF
Additional rows may be created recursively.

Dimensionality Reduction: Outline
- Overview
- Multidimensional Scaling
  - definition
  - metric vs. non-metric MDS
  - evaluation
  - Procrustes rotation
  - global vs. local
- Principal Component Analysis
- Non-negative Matrix Factorization
- Examples and case studies (throughout)

Dimensionality Reduction
Goal: Embed objects $x^{(1)}, \ldots, x^{(n)} \in \mathcal{X} \mapsto f(x^{(1)}), \ldots, f(x^{(n)}) \in \mathbb{R}^2$ while approximately preserving spatial geometry
- $x^{(1)}, \ldots, x^{(n)}$ may be high dimensional vectors (text, images), infinite dimensional (spatial-temporal data), or abstract (psychological perceptions)
- Precise preservation of spatial relationships is usually hopeless. Focus on the optimal (least bad) embedding.
- Geometric approaches (MDS, PCA)
- Factorization (NMF)
- Manifold (Isomap, LLE, local MDS)
- Probabilistic (LDA, pLSA)

Multidimensional Scaling
Given a dissimilarity $\rho$ on $\mathcal{X}$, find $f(x^{(1)}), \ldots, f(x^{(n)}) \in \mathbb{R}^2$ that minimize the distortion introduced by the embedding
$$S(f) = \sum_{i<j} R\big(\|f(x^{(i)}) - f(x^{(j)})\|,\ \rho(x^{(i)}, x^{(j)})\big)$$
Choices of $R$:
$$R(\alpha, \beta) = (\alpha^2 - \beta^2)^2, \qquad R(\alpha, \beta) = (\alpha - \beta)^2, \qquad R(\alpha, \beta) = (\alpha^2 - \beta^2)^2 / \alpha^2$$
- Generally, the minimization does not have a closed form and requires iterative optimization
- Local minima are possible, making the result depend on the initial guess (use multiple restarts)
- Computationally difficult for large $n$
- Scale and rotation are often meaningless
- Axis interpretation

Example: Crime Rates
Correlation of crime rates over 50 US states in the 1970 US Census (Wilkinson, 1990). Rows and columns: murder, rape, robbery, assault, burglary, larceny, autotheft.
$$\rho = \sqrt{1 - s}, \qquad s = \begin{pmatrix}
1.00 & 0.52 & 0.34 & 0.81 & 0.28 & 0.06 & 0.11 \\
0.52 & 1.00 & 0.55 & 0.70 & 0.68 & 0.60 & 0.44 \\
0.34 & 0.55 & 1.00 & 0.56 & 0.62 & 0.44 & 0.62 \\
0.81 & 0.70 & 0.56 & 1.00 & 0.52 & 0.32 & 0.33 \\
0.28 & 0.68 & 0.62 & 0.52 & 1.00 & 0.80 & 0.70 \\
0.06 & 0.60 & 0.44 & 0.32 & 0.80 & 1.00 & 0.55 \\
0.11 & 0.44 & 0.62 & 0.33 & 0.70 & 0.55 & 1.00
\end{pmatrix}$$
$$S(f) = \sum_{i<j} R\big(\|f(x^{(i)}) - f(x^{(j)})\|,\ \rho(x^{(i)}, x^{(j)})\big)$$

Objects $x^{(1)}, \ldots, x^{(n)}$ may be defined only implicitly
Survey of similarity between nations (1-10 scale) among students (Wish, 1971): Brazil, France, China, Congo, Cuba, Egypt, India, Israel, Japan, USSR, USA, Yugoslavia
Using $\rho = \sqrt{1 - D/10 - I}$
$$S(f) = \sum_{i<j} R\big(\|f(x^{(i)}) - f(x^{(j)})\|,\ \rho(x^{(i)}, x^{(j)})\big)$$

Example: Text Visualization
- Stage 1: Convert each document to a vector $x^{(i)}$ expressing phrase appearances
- Stage 2: Express dissimilarity using a vector distance, e.g.,
$$\rho(x^{(i)}, x^{(j)}) = \sqrt{\sum_w \big(x^{(i)}_w - x^{(j)}_w\big)^2}$$
- Computational challenge: for large $n$, subsample or replace individual documents by cluster centroids

[Figure: IN-SPIRE (NVAC & PNNL)]

Case Study: Votes and Movie Preferences
Visualize preferences A ≺ ··· ≺ C over $d$ items issued by "judges".
- Dataset 1: Election votes (APA presidential votes, $d = 5$ candidates)
- Dataset 2: Joke preferences (Jester dataset, $d = 100$ jokes)
- Dataset 3: Movie preferences (EachMovie dataset, $d = 1628$ movies)
Challenges:
- Some "judges" omit preferences concerning some items, e.g., unobserved movies
- Define a meaningful distance function between votes that is computationally tractable for large $d$

[Figure: embeddings for APA ($d = 5$), Jester ($d = 100$), and EachMovie ($d = 1628$) (Kidwell et al. 2008)]

Metric vs. Non-Metric MDS
- A variation of MDS approximates the original geometry by a monotonic transformation $t$ of the embedded distances (called disparities)
- The embedded distances have the right ordering (approximately) rather than precise values
$$\min_{t, f} \sum_{i<j} R\big(t(\|f(x^{(i)}) - f(x^{(j)})\|),\ \rho(x^{(i)}, x^{(j)})\big) \qquad (1)$$

Evaluation of MDS
Shepard's diagram plots the embedded distances as a function of the original dissimilarities
[Figure: Shepard's diagram]
- Embedded distances seem to be monotonically increasing in the original distances (undervalued for small distances and overvalued for larger distances)

[Figure: Shepard's diagram for non-metric MDS, with disparities vs. distances]

Procrustes Rotation
- The MDS objective function is invariant to rigid body transformations (reflection, rotation, translation)
- No reason to prefer one rotation+translation over another
- Procrustes aligns two MDS figures to improve side by side visualization
- Step 1: center both figures around 0 by subtracting the centroid
- Step 2: rotate by
$$T = \arg\min_{T T^\top = I} \operatorname{trace}\big((A - BT)(A - BT)^\top\big)$$

Global vs. Local MDS
- In some cases local structure (small distances) should be preserved more than global structure (large distances)
- Intuition: As distances grow, their ordering becomes nearly as important as their precise values
- Motivation: High dimensional data lies on a non-linear manifold in high dimensions. Approximating local distances in 2-D is much easier than the unreasonable task of approximating all distances
$$S(f) = \sum_{i<j} w\big(\rho(x^{(i)}, x^{(j)})\big)\, R\big(\|f(x^{(i)}) - f(x^{(j)})\|,\ \rho(x^{(i)}, x^{(j)})\big)$$
with $w$ a monotonically decreasing function

[Figures: (Chen and Buja 2008)]

Case Study: Search Engine Visualization (Sun et al. 2010)
Goal: visualize the relationship between eight search engines: Google, Bing, Yahoo, Ask, Altavista, Alltheweb, AOL, Lycos
- Step 1: Send a query or a list of queries to the search engines
- Step 2: Measure distances among them (including ground truth if possible)
- Web-App Example

Queries
Tourism: Times Square, Sydney Opera House, Eiffel Tower, ...
Celebrity Names: Michael Bolton, Michael Jackson, Jackie Chan, ...
Sports: Football, Acrobatics, Karate, Pole Vault, Butterfly Stroke, ...
University Names: Georgia Institute of Technology, University of Florida, ...
Company: Goldman Sachs, Facebook, Honda, Cisco Systems, ...
Questions: How are flying buttresses constructed, ...
Stage 1: Average queries within categories to create a per-category MDS embedding
Stage 2: Procrustes rotation

[Figure: MDS embeddings of the eight search engines (1. altavista, 2. alltheweb, 3. ask, 4. google, 5. lycos, 6. live, 7. yahoo, 8. aol) for the Questions and Sports categories, and for Questions after Procrustes rotation (Sun et al. 2010)]

Principal Component Analysis
- Same as MDS for Euclidean data: the original data $x^{(1)}, \ldots, x^{(n)}$ is in $\mathbb{R}^d$ and
$$\rho(x^{(i)}, x^{(j)}) = \sqrt{\sum_k \big(x^{(i)}_k - x^{(j)}_k\big)^2}$$
- The solution is given in terms of the SVD (eigenvalues, eigenvectors) of the empirical covariance matrix (efficient even for large $n$, $d$)
$$\Sigma = \frac{1}{n} \sum_{i=1}^{n} (x^{(i)} - \bar{x})(x^{(i)} - \bar{x})^\top = U \operatorname{diag}(\sigma_1, \sigma_2, \ldots, \sigma_d)\, U^\top \approx U \operatorname{diag}(\sigma_1, \sigma_2, 0, \ldots, 0)\, U^\top$$
- Single global solution

Non-Negative Matrix Factorization
$$\arg\min_{W, H} \sum_i \sum_j \big(X_{ij} - [WH]_{ij}\big)^2 \quad \text{s.t. } W_{ij}, H_{ij} \ge 0$$
- The $n$ rows of $X \in \mathbb{R}^{n \times d}$ are data vectors
- The $r$ rows of $H \in \mathbb{R}^{r \times d}$ are non-negative topics/code-words/factors
- The $n$ rows of $W \in \mathbb{R}^{n \times r}$ represent the non-negative degree of membership of data vectors in the codewords
- Maintaining non-negativity prevents one factor from removing content that another factor contributed
- Iterative optimization required; the factorization is not unique.
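The slide only states that NMF needs iterative optimization. As a minimal sketch, assuming the widely used multiplicative update rules for the squared-error objective (one standard choice, not necessarily the procedure used in the tutorial), the iteration could look as follows. The function names, the toy matrix, and the `eps` safeguard are illustrative assumptions.

```python
import numpy as np

def nmf(X, r, n_iter=200, eps=1e-9, seed=0):
    """Approximate a non-negative X (n x d) by W @ H with W (n x r), H (r x d) >= 0."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.random((n, r)) + eps      # random non-negative initialization
    H = rng.random((r, d)) + eps
    for _ in range(n_iter):
        # Multiplicative updates: every factor is non-negative, so W and H stay >= 0.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def squared_error(X, W, H):
    """The NMF objective: sum_ij (X_ij - [WH]_ij)^2."""
    return float(np.sum((X - W @ H) ** 2))

# Toy non-negative data of rank 2 (hypothetical, for illustration only).
rng = np.random.default_rng(1)
X = rng.random((6, 2)) @ rng.random((2, 5))

W1, H1 = nmf(X, r=2, n_iter=1)      # after a single update
W, H = nmf(X, r=2, n_iter=200)      # after many updates, same initialization
print(squared_error(X, W1, H1), squared_error(X, W, H))
```

Different seeds generally give different factors with similar error, which illustrates the non-uniqueness noted on the slide.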
NMF is extremely powerful in uncovering latent factors, with applications in clustering, compression/coding, recommendation systems, and visualization.

[Figures: (Lee and Seung, 1999)]

Pattern Discovery: Outline
- Entropy and conditional entropy
- Mutual information
- Association rule mining
- Example: census data
- Example: movie recommendation systems

Entropy
$$H(X) = \sum_x p(x) \log \frac{1}{p(x)}$$
- Measures uncertainty in knowing the value of $X$
- Expected number of yes/no questions needed to find out $X$
- Expected number of bits needed to compress $X$ (cannot do better)
- Maximum entropy is achieved for the uniform distribution
- Minimum entropy is achieved for constant or deterministic variables
Example (logarithms in base 2):
$p(X = a) = 1/2$, $p(X = b) = 1/4$, $p(X = c) = 1/8$, $p(X = d) = 1/8$
$p(Y = a) = 1/4$, $p(Y = b) = 1/4$, $p(Y = c) = 1/4$, $p(Y = d) = 1/4$
$$H(X) = \frac{1}{2} \log 2 + \frac{1}{4} \log 4 + 2 \cdot \frac{1}{8} \log 8 = 7/4$$
$$H(Y) = 4 \cdot \frac{1}{4} \log 4 = 2$$

Conditional Entropy
$$H(X \mid Y = y) = \sum_x p(x \mid Y = y) \log \frac{1}{p(x \mid Y = y)}$$
- Measures uncertainty in knowing the value of $X$ if you know $Y = y$
- Expected number of yes/no questions needed to find out $X$ if you know $Y = y$
- Expected number of bits needed to compress $X$ (cannot do better) if you know $Y = y$
$$H(X \mid Y) = \sum_y p(y)\, H(X \mid Y = y)$$
- Measures uncertainty in knowing the value of $X$ if you know $Y$
- Expected number of yes/no questions needed to find out $X$ if you know $Y$
- Expected number of bits needed to compress $X$ (cannot do better) if you know $Y$
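The entropy example and the conditional entropy definition can be checked numerically. The following is a small sketch assuming base-2 logarithms and a joint probability table stored as `joint[x, y]`; the function names and the independent joint table are illustrative assumptions, not part of the slides.

```python
import numpy as np

def entropy(p):
    """H = sum_x p(x) log2(1 / p(x)), in bits; zero-probability entries contribute 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

def conditional_entropy(joint):
    """H(X|Y) = sum_y p(y) H(X | Y = y) for a table with joint[x, y] = p(X = x, Y = y)."""
    joint = np.asarray(joint, dtype=float)
    p_y = joint.sum(axis=0)                       # marginal distribution of Y
    total = 0.0
    for y, py in enumerate(p_y):
        if py > 0:
            total += py * entropy(joint[:, y] / py)   # p(x | Y = y)
    return total

p_x = [1/2, 1/4, 1/8, 1/8]     # distribution of X from the entropy example
p_y = [1/4, 1/4, 1/4, 1/4]     # uniform distribution of Y
h_x = entropy(p_x)
h_y = entropy(p_y)

# Hypothetical joint table in which X and Y are independent:
# knowing Y then leaves the uncertainty about X unchanged.
joint_indep = np.outer(p_x, p_y)
h_x_given_y = conditional_entropy(joint_indep)
print(h_x, h_y, h_x_given_y)
```

The uniform variable attains the maximum entropy for four outcomes, while the skewed one needs fewer bits on average.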
Mutual Information
$$I(X, Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$$
- Symmetric
- Reduction in the number of yes/no questions needed to know $X$ as a result of knowing $Y$
- Reduction in the number of bits needed to compress $X$ as a result of knowing $Y$
- $I(X, Y) = 0$ for $X$, $Y$ independent, and $I(X, X) = H(X)$.

Association Rule Mining
Goal: Mine data (shopping transactions) to detect patterns of behavior
- Stage 0: Construct a probability estimate $\hat{p}$, say by maximum likelihood
- Stage 1: Define a set of candidate binary events $A_1, \ldots, A_k$
- Stage 2: Compute $I(A_i, A_j)$ for all $\binom{k}{2}$ combinations
- Stage 3: Order the pairs $A_i, A_j$ by mutual information and inspect the top
- Stage 4: Detect the precise rule shape, i.e., $A_i \Rightarrow A_j$ or $A_i^c \Rightarrow A_j$ or $A_i \Rightarrow A_j^c$, etc., by examining probabilities

Case Study: Census Questions
- number in household = 1, number of children = 0 ⇒ language in home = English
- language in home = English, householder status = own, occupation = {professional/managerial} ⇒ income ≥ $40,000
- language in home = English, income ≤ $40,000, marital status = not married, number of children = 0 ⇒ education ∉ {college graduate, graduate study}
(Hastie et al., 2009)

Case Study: Netflix Movie Preferences
- Shrek ≺ LOTR: The Fellowship of the Ring ⇒ Shrek 2 ≺ LOTR: The Return of the King
- Shrek ≺ LOTR: The Fellowship of the Ring ⇒ Shrek 2 ≺ LOTR: The Two Towers
- Shrek 2 ≺ LOTR: The Fellowship of the Ring ⇒ Shrek ≺ LOTR: The Return of the King
- Kill Bill 2 ≺ National Treasure ⇒ Kill Bill 1 ≺ I, Robot
- Shrek 2 ≺ LOTR: The Fellowship of the Ring ⇒ Shrek 2 ≺ LOTR: The Two Towers
- LOTR: The Fellowship of the Ring ≺ Monsters, Inc. ⇒ LOTR: The Two Towers ≺ Shrek
- National Treasure ≺ Kill Bill 2 ⇒ Pearl Harbor ≺ Kill Bill 1
- LOTR: The Fellowship of the Ring ≺ Monsters, Inc. ⇒ LOTR: The Return of the King ≺ Shrek
- How to Lose a Guy in 10 Days ≺ Kill Bill 2 ⇒ 50 First Dates ≺ Kill Bill 1
- I, Robot ≺ Kill Bill 2 ⇒ The Day After Tomorrow ≺ Kill Bill 1
(Sun et al. 2010)

Like A ⇒ Like B
- Kill Bill 1 ⇒ Kill Bill 2
- Maid in Manhattan ⇒ The Wedding Planner
- Two Weeks Notice ⇒ Miss Congeniality
- The Royal Tenenbaums ⇒ Lost in Translation
- The Royal Tenenbaums ⇒ American Beauty
- The Fast and the Furious ⇒ Gone in 60 Seconds
- Spider-Man ⇒ Spider-Man 2
- Anger Management ⇒ Bruce Almighty
- Memento ⇒ Pulp Fiction
Like A ⇒ Dislike B
- Maid in Manhattan ⇒ Pulp Fiction
- Maid in Manhattan ⇒ Kill Bill 1
- How to Lose a Guy in 10 Days ⇒ Pulp Fiction
- The Royal Tenenbaums ⇒ Pearl Harbor
- The Wedding Planner ⇒ The Matrix
- Pearl Harbor ⇒ Memento
- Lost in Translation ⇒ Pearl Harbor
- The Day After Tomorrow ⇒ American Beauty
- The Wedding Planner ⇒ Raiders of the Lost Ark
(Sun et al., 2010)

Clustering: Outline
- Motivation and setup
- K-Means algorithm
- Example: word clustering

Clustering
Goal: Partition data $D = \{x^{(1)}, \ldots, x^{(n)}\}$ into $k$ distinct clusters such that similar data vectors are assigned to the same cluster
$$\arg\min_{S_1 \cup \cdots \cup S_k} \sum_{j=1}^{k} \sum_{x \in S_j} \|x - \mu^{(j)}\|^2, \qquad \mu^{(j)} = \operatorname{average}(S_j)$$
k-means: Iterate to convergence
- Assignment: Assign each data vector to the cluster with the closest mean
- Update: Calculate the new mean of each cluster based on the revised assignment

Clustering: Visualization Applications
- Computational difficulties due to large $n$, e.g., MDS for large text archives: partition the data vectors $x^{(1)}, \ldots, x^{(n)}$ into $k \ll n$ clusters and proceed with MDS on the cluster centroids
- Computational difficulties due to large $d$, e.g., MDS for high dimensional data: partition the data dimensions into $k \ll d$ clusters and proceed with MDS on the clustered dimensions

Word Clustering in Reuters RCV1
Define similarity using the contextual distribution
$$\operatorname{sim}(w, v) = f\big(p(\cdot \mid v),\ p(\cdot \mid w)\big)$$
Example clusters of (stemmed) words:
- jan, feb, nov, dec, oct, aug, apr, mar, sep
- databas, intranet, server, softwar, internet, netscap, onlin, web, browser
- nbc, abc, cnn, hollywood, tv, viewer, movi, audienc, fox
- wang, chen, liu, beij, wu, china, chines, peng, hui
- ottawa, quebec, montreal, toronto, ontario, vancouv, canada, canadian, calgari
(Dillon et al. 2007)

Regression: Outline
- Motivation and setup
- Linear regression
- Regression trees

Regression and Visualization
- Predict $Y \in \mathbb{R}$ given $X \in \mathbb{R}^d$ based on training data $(x^{(i)}, y^{(i)})$, $i = 1, \ldots, n$
- Understand the relationship between $X$, $Y$ by inferring a simple functional dependency rather than by scatter plot visualization
- Variable selection tools enable detecting which data dimensions (or combinations of data dimensions) are relevant for predicting $Y$

Linear Regression
- Predict $Y \in \mathbb{R}$ given $X \in \mathbb{R}^d$ based on training data $(x^{(i)}, y^{(i)})$, $i = 1, \ldots, n$
- Linear regression assumption: $Y \mid X \sim N(\theta^\top X, \sigma^2)$. Recover $\theta$ by maximum likelihood or least squares:
$$\hat{\theta} = \arg\min_\theta \sum_{i=1}^{n} \big(y^{(i)} - \theta^\top x^{(i)}\big)^2$$
- Predict for new data: $\hat{y} = \hat{\theta}^\top x$.
- Closed form for the optimization problem (requires matrix inversion), single global optimum
- Computationally efficient even for large $n$, $d$ (millions, billions, ...)
- May be non-linear in $X_1, \ldots, X_d$ by regressing on non-linear features $f_1(X), \ldots, f_{d'}(X)$

[Figure: fitted surface of MPG as a function of Horsepower and Weight]
$$\widehat{\text{MPG}} = \hat{\theta}^\top (c,\ \text{weight},\ \text{horsepower},\ \text{weight} \cdot \text{horsepower})$$

Regression Trees
- The dependency of $Y$ on $X$ may have a different functional form in different regions of the data space
- Regression trees are non-parametric regression models where the leaves partition the input space and determine $\hat{y}$ as the data average
- Prediction in leaf/region $A$: $\hat{y}$ = average of $y^{(i)}$ over all $x^{(i)}$ in that region
- Tree construction: Starting with all training data in the root, iteratively construct the tree based on
$$R_1(j, s) = \{x : x_j \le s\}, \qquad R_2(j, s) = \{x : x_j > s\}$$
$$\min_{j, s} \left[ \min_{c_1} \sum_{x^{(i)} \in R_1(j, s)} \big(y^{(i)} - c_1\big)^2 + \min_{c_2} \sum_{x^{(i)} \in R_2(j, s)} \big(y^{(i)} - c_2\big)^2 \right]$$
- See next slide for tree diagrams (Hastie et al., 2009)

[Figure: partitions of the $(X_1, X_2)$ plane into regions $R_1, \ldots, R_5$ by split points $t_1, \ldots, t_4$, and the corresponding tree. Elements of Statistical Learning (2nd Ed.), © Hastie, Tibshirani & Friedman 2009, Chap. 9]

Classification: Outline
- Motivation and setup
- Generative approaches
- Discriminative approaches
- Classification trees

Classification for Visualization
Goal: Predict $Y \in \{1, 2, \ldots, k\}$ based on $X \in \mathbb{R}^d$
- Visualize dependencies between $X$ and $Y$ by examining classification rules or the decision boundary
- Pre-processing in text visualization: part of speech tagging, named entity recognition, word sense disambiguation
- Filter relevant data for visualization from a massive archive (face images, credit card fraud, articles concerning a certain topic)

Generative Classification
- Generative classification: fit $\hat{p}(x, y) = \hat{p}(x \mid y)\, \hat{p}(y)$ based on training data and classify $\hat{y} = \arg\max_y \hat{p}(y \mid x)$
- Fisher's LDA: estimate $\hat{p}(x \mid y)$ using MLE for a multivariate Gaussian: $p(x \mid y) = N(x;\ \theta_y, \Sigma_y)$
- Naive Bayes: $p(x \mid y) \approx \prod_i p(x_i \mid y)$ (estimate the right-hand side using 1-d MLE)
- In either case $p(y)$ is estimated using the empirical frequency

Discriminative Classification
Discriminative classification: $\hat{y} = \operatorname{sign}(\hat{\theta}^\top x)$ with
$$\hat{\theta} = \arg\min_\theta \frac{1}{n} \sum_{i=1}^{n} L\big(y^{(i)} \theta^\top x^{(i)}\big)$$
where
- $L_1(r) = \exp(-r)$: boosting
- $L_2(r) = \log(1 + \exp(-r))$: logistic regression
- $L_3(r) = \max(1 - r, 0)$: support vector machine
Properties:
- Typically perform better than generative classifiers
- The minimization problems are convex for the loss functions above
- Interpretation as maximum conditional likelihood for $p(y \mid x)$

Classification Trees
Similar to regression trees: iteratively split nodes to create a partition of the space $A_1, \ldots, A_r$ where
$$\hat{p}_{A_j}(y = 1) = \frac{1}{|A_j|} \sum_{x^{(i)} \in A_j} I\big(y^{(i)} = 1\big)$$
$$k(A_j) = \begin{cases} 1 & \hat{p}_{A_j}(y = 1) > \hat{p}_{A_j}(y = -1) \\ 0 & \text{otherwise} \end{cases}$$
$$\operatorname{splitScore}(A = B \cup C) = \frac{1}{|B|} H(\hat{p}_B) + \frac{1}{|C|} H(\hat{p}_C)$$
- Larger trees tend to overfit the training data and perform poorly on future unseen data
- A possible solution is to prefer shorter trees, e.g., choose the smallest tree that is not too much worse than the best tree

Semisupervised and Active Learning: Outline
Two important deviations from the standard ML setup
- Semisupervised learning: learn $X \to Y$ using a combination of labeled and unlabeled data. Particularly useful when massive archives of unlabeled data exist: language, speech, internet, images
- Active learning: learn $X \to Y$ by interactively choosing which datapoints are to be labeled. Useful for interactive visualization applications

Semisupervised Learning (SSL)
- Goal: Predict $Y$ based on $X$ (classification or regression) using labeled data $(x^{(i)}, y^{(i)})$, $i = 1, \ldots, n$ and unlabeled data $x^{(i)}$, $i = n + 1, \ldots, n + m$.
- Motivation: In many cases labeled data is much more expensive to obtain than unlabeled data, i.e., $n \ll m$.
- Prediction accuracy increases with both $n$ and $m$
- Approach 1, Generative SSL: maximize the likelihood of the observed data
$$\hat{\theta} = \arg\max_\theta \left\{ \sum_{i=1}^{n} \log p_\theta\big(x^{(i)}, y^{(i)}\big) + \sum_{i=n+1}^{n+m} \log \sum_y p_\theta\big(x^{(i)}, y\big) \right\}$$
- Approach 2, Discriminative SSL: use unlabeled data to bias selection towards smooth models
$$\hat{\theta} = \arg\min_\theta \frac{1}{n} \sum_{i=1}^{n} L\big(y^{(i)} \theta^\top x^{(i)}\big) + R\big(\theta, x^{(n+1)}, \ldots, x^{(n+m)}\big)$$

[Figures: (Zhou, 2005)]

Active Learning
- Goal: Predict $Y$ from $X$ based on training data $(x^{(i)}, y^{(i)})$, $i = 1, \ldots, n$ such that the $x^{(i)}$ are chosen rather than observed
- Motivation: Prediction $X \to Y$ is harder in some areas of the space of data vectors $X$. Choosing $x^{(i)}$, $i = 1, \ldots, n$ to concentrate on challenging areas makes better use of resources
- Approach: Choose $x^{(i)}$, $i = 1, \ldots, n$ sequentially in regions where the predictor is least certain, i.e., where $H(\hat{p})$ is highest
- Violates the iid assumption that is central to proving large sample consistency in many cases
- Useful way to insert user interaction into the modeling process (visual analytics)

Visual Analytics: Outline
- Motivation and setup
- Case study: text visualization

Visual Analytics?
Analyze complex data by displaying and examining visual cues
- dimensionality reduction
- user interaction
- domain knowledge
- iterative process
Related disciplines: machine learning, visualization, graphics, human computer interaction

[Diagram: visualization system. Components: raw data (text, images, video, etc.); raw data representation; high dimensional vector data; dimensionality reduction; low dimensional vector data; visualization evaluated by user; domain expert or user supplying domain knowledge and user feedback]

Application: Text Visualization
Embed a document collection in 2-D while preserving semantically coherent spatial structure. Challenges include:
- more than one interpretation of semantic coherence (topic, sentiment, author, interest)
- incorporating domain knowledge
- incorporating user feedback
- quantitative evaluation
(Mao et al. 2010)

Text Visualization as Metric Learning
- Standard dimensionality reduction methods (PCA, LLE, t-SNE, etc.) assume Euclidean geometry, which is inappropriate for text. Words are (a priori) orthogonal
- Adjust the methods to work on a non-Euclidean geometry
$$d_H(x, y) = \sqrt{(x - y)^\top H^\top H (x - y)} \qquad (2)$$
where $H$ reflects the relationship between words and the visualization goal (equivalent to composing the transformation $x \mapsto Hx$ with standard dimensionality reduction techniques)
- Problem: determine $H$ using domain knowledge and user interaction (no labeled data)

Method A: Manual Specification
Define a block-diagonal matrix $H = RD$ from manually constructed word clusters ($R$ is a stochastic translation matrix, $D$ is a diagonal weighting matrix)
$$R = \begin{pmatrix}
0.8 & 0.1 & 0.1 & 0 & 0 \\
0.1 & 0.8 & 0.1 & 0 & 0 \\
0.1 & 0.1 & 0.8 & 0 & 0 \\
0 & 0 & 0 & 0.9 & 0.1 \\
0 & 0 & 0 & 0.1 & 0.9
\end{pmatrix}, \qquad
D = \begin{pmatrix}
5 & 0 & 0 & 0 & 0 \\
0 & 5 & 0 & 0 & 0 \\
0 & 0 & 5 & 0 & 0 \\
0 & 0 & 0 & 3 & 0 \\
0 & 0 & 0 & 0 & 3
\end{pmatrix}$$

[Figure: manually constructed hierarchy of word clusters over the vocabulary, with topics such as Sci & Tech, Politics, Religion, and Sports, and example leaf words such as bible, gospel, amen, christians]

Method B: Mahalanobis Distance
$H^\top H = \Sigma^{-1}$ where $\Sigma$ is the covariance matrix of the underlying distribution (estimated from a large dataset)

Method C: Contextual Diffusion
Contextual distribution of word $v$:
$$q_v(w) = p(w \text{ appears in } x \mid v \text{ appears in } x) \qquad (3)$$
The matrix $H$ is determined by the similarity of contextual distributions
$$H(u, v) = \exp\left(-c \arccos^2\left(\sum_w \sqrt{q_u(w)\, q_v(w)}\right)\right)$$
Intuitively, the word $u$ will be translated or diffused into $v$ depending on the geometric diffusion between the distributions of likely contexts.
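The contextual diffusion formula for $H(u, v)$ can be sketched numerically. The sketch below assumes documents are given as a binary document-term matrix `B`, that every word occurs in at least one document, and that each $q_v$ is normalized to sum to one over $w$ so that the sum inside the arccos lies in $[0, 1]$. That normalization, the function names, and the toy matrix are assumptions made for illustration, not details taken from the slides.

```python
import numpy as np

def contextual_distributions(B):
    """Row v holds q_v: co-occurrence counts of word v with every word w,
    normalized to sum to 1 (the normalization is an assumption of this sketch)."""
    B = np.asarray(B, dtype=float)
    co = B.T @ B                          # co[v, w] = number of documents containing v and w
    return co / co.sum(axis=1, keepdims=True)

def diffusion_matrix(B, c=1.0):
    """H[u, v] = exp(-c * arccos(sum_w sqrt(q_u(w) q_v(w)))**2)."""
    root_q = np.sqrt(contextual_distributions(B))
    affinity = root_q @ root_q.T          # sum_w sqrt(q_u(w) q_v(w)) for all pairs u, v
    affinity = np.clip(affinity, 0.0, 1.0)    # guard against rounding slightly above 1
    return np.exp(-c * np.arccos(affinity) ** 2)

# Hypothetical corpus: 4 documents over 3 words.
# Words 0 and 1 share most contexts; word 2 mostly appears elsewhere.
B = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 1]])
H = diffusion_matrix(B, c=1.0)
print(np.round(H, 3))
```

Words with similar contexts receive entries close to 1, so the map $x \mapsto Hx$ spreads weight between them, while the constant `c` controls how quickly the entries decay for dissimilar contexts.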
Method D: Google n-Grams
Same as C but with the contextual distribution estimated from the Google n-gram dataset (n-gram counts, for n ≤ 5, obtained from the Google crawler based on processing over a trillion words of running text)

Method E: Word-Net
Define $H$ based on word similarity measures from the Word-Net dataset (tried several standard measures).

Measure (i), lower is better:
      | PCA (1) | PCA (2) | t-SNE (1) | t-SNE (2)
H = I | 1.5391  | 1.4085  | 1.1649    | 1.1206
B     | 1.9314  | 1.7126  | 1.6172    | 1.3008
C     | 1.2570  | 1.3036  | 1.2182    | 1.2331
D     | 1.2023  | 1.3407  | 0.7844    | 1.0723
E     | 1.4475  | 1.3352  | 1.1762    | 1.1362

Measure (iii) (k = 5), higher is better:
      | PCA (1) | PCA (2) | t-SNE (1) | t-SNE (2)
H = I | 0.8461  | 0.5630  | 0.9056    | 0.7281
B     | 0.6073  | 0.4614  | 0.8249    | 0.7207
C     | 0.7381  | 0.6815  | 0.9110    | 0.6724
D     | 0.8420  | 0.5898  | 0.9323    | 0.7359
E     | 0.8532  | 0.5868  | 0.9013    | 0.7728

Table 1: Quantitative evaluation of dimensionality reduction for visualization for two tasks in the news article domain. The numbers in the top five rows correspond to measure (i) (lower is better), and the numbers in the bottom five rows correspond to measure (iii) (k = 5) (higher is better). We conclude that contextual diffusion (C), Google n-gram (D), and Word-Net (E) tend to outperform the original H = I. The Mahalanobis distance performs poorly.
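The methods compared in Table 1 all compose a map $x \mapsto Hx$ with a standard reduction such as PCA or t-SNE. As a rough sketch of that composition for PCA only, the code below uses the example $R$ and $D$ values from Method A; the random term-count matrix and the function names are hypothetical, and the evaluation measures of Table 1 are not reproduced here.

```python
import numpy as np

def d_H(x, y, H):
    """Distance of equation (2): sqrt((x - y)^T H^T H (x - y))."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(diff @ H.T @ H @ diff))

def pca_embed(X, H=None, dim=2):
    """PCA scores of the rows of X, computed after the optional map x -> Hx."""
    Z = np.asarray(X, dtype=float)
    if H is not None:
        Z = Z @ np.asarray(H, dtype=float).T     # row i becomes H x^(i)
    Z = Z - Z.mean(axis=0)                       # center the data
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, :dim] * s[:dim]                  # coordinates along the top components

# Example matrices from Method A: two word clusters of sizes 3 and 2.
R = np.array([[0.8, 0.1, 0.1, 0.0, 0.0],
              [0.1, 0.8, 0.1, 0.0, 0.0],
              [0.1, 0.1, 0.8, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.9, 0.1],
              [0.0, 0.0, 0.0, 0.1, 0.9]])
D = np.diag([5.0, 5.0, 5.0, 3.0, 3.0])
H = R @ D

# Hypothetical term-count vectors for 20 documents over a 5-word vocabulary.
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(20, 5)).astype(float)

Y_plain = pca_embed(X)        # H = I
Y_H = pca_embed(X, H)         # same documents under the metric d_H
print(Y_plain.shape, Y_H.shape)
```

Because $d_H(x, y)$ equals the Euclidean distance between $Hx$ and $Hy$, any Euclidean method can be reused unchanged once the data has been transformed.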