Machine Learning for Information Visualization
Guy Lebanon
Fei Sha
VizWeek 2010 Tutorial Part I
Tutorial Part I: Outline
Session 1 (Guy Lebanon)
Introduction to Machine Learning
Dimensionality Reduction
Pattern Discovery
Clustering
Classification
Regression
Semisupervised and active learning
Visual analytics: interactivity and domain knowledge
Tutorial Part II: Outline
Session 2 (Fei Sha)
Parameter Estimation
maximum likelihood
Bayesian inference
Model Evaluation and Selection
identify and prevent overfitting
Validation, cross-validation and regularization
Advanced Techniques
Identify hidden patterns and structures
Kernel PCA and manifold learning
Latent Dirichlet Allocation
Models for sequential and temporal data
Introduction: Outline
Four questions
What is machine learning?
What are possible applications?
What is its relation to other fields?
How can it help visualization?
What is machine learning?
Computer program whose behavior evolves based on empirical data
(Wikipedia)
Computer program that learns from experience E in order to improve
its performance P on a task T (Tom Mitchell)
experience E : images, text, sensor measurements, biological data
task T : estimating probabilities, predicting object label,
dimensionality reduction, clustering
performance P : probability of success, money/time saved, etc.
Specific applications?
What are possible applications?
Spam filtering in email
Face recognition in images
Fraud detection (credit card transactions)
Web search (Google, Bing)
Recommendation systems (Amazon, Netflix)
Machine translation e.g., English ⇒ Chinese
Speech recognition
Information Visualization?
What about statistics, AI?
What is its relation to other fields?
Closely related scientific disciplines:
Statistics : emphasis on math, asymptotics
AI : emphasis on computer systems designed by hand
Data Mining : emphasis on large datasets, efficient computation, and
practical applicability
ML : applies statistics to large datasets using computers; sits between
data mining and statistics. Differs from AI in learning from
experience/data rather than being taught by an expert
each area has its own community, conferences, journals
Statistics ≺ Artificial Intelligence ≺ Machine Learning ≺ Data Mining
Is it useful for visualization?
How can ML help visualization?
Embed high dimensional data in two or three dimensions for easy
visualization
Discover unknown patterns between data attributes
Reduce massive data to a small set of coherent clusters
Identify irrelevant dimensions or features
Model P(Y | X) for (i) understanding dependencies between X and Y
and (ii) stratified visualization
Theoretical framework for introducing interactivity and domain
knowledge to visualization
Taxonomy of data
A data taxonomy helps abstract algorithm design from specific datasets.

type              | example                                | distribution example
------------------|----------------------------------------|---------------------
categoric atom    | a word in the English language         | multinomial
numeric atom      | temperature measurement                | normal
ordinal atom      | preference A ≺ B ≺ C                   | Mallows model
unordered atoms   | vital signs (pulse, heart rate, etc.)  | multivariate normal
unordered atoms   | demographic information                | loglinear models
1-D ordered atoms | financial time series                  | Gaussian process
2-D ordered atoms | binary image                           | MRF

Additional rows may be created recursively.
Dimensionality Reduction: Outline
Overview
Multidimensional Scaling
definition
Metric vs. non-Metric MDS
Evaluation
Procrustes rotation
Global vs. local
Principal Component Analysis
Non-negative Matrix Factorization
Examples and case studies (throughout)
Dimensionality Reduction
Goal: Embed objects
$x^{(1)}, \ldots, x^{(n)} \in \mathcal{X} \mapsto f(x^{(1)}), \ldots, f(x^{(n)}) \in \mathbb{R}^2$
while approximately preserving spatial geometry
$x^{(1)}, \ldots, x^{(n)}$ may be high dimensional vectors (text, images),
infinite dimensional (spatial-temporal data), or abstract
(psychological perceptions)
Precise preservation of spatial relationship is usually hopeless. Focus
on optimal (least worst) embedding.
Geometric approaches (MDS, PCA)
Factorization (NMF)
Manifold (Isomap, LLE, local MDS)
Probabilistic (LDA, pLSA)
Multidimensional Scaling
Given a dissimilarity ρ on $\mathcal{X}$, find $f(x^{(1)}), \ldots, f(x^{(n)}) \in \mathbb{R}^2$ that minimize
the distortion introduced by the embedding

$S(f) = \sum_{i<j} R\left(\|f(x^{(i)}) - f(x^{(j)})\|, \rho(x^{(i)}, x^{(j)})\right)$

with, for example,
$R(\alpha, \beta) = (\alpha^2 - \beta^2)^2$
$R(\alpha, \beta) = (\alpha - \beta)^2$
$R(\alpha, \beta) = (\alpha^2 - \beta^2)^2 / \alpha^2$

Generally, the minimization does not have a closed form and requires
iterative optimization (a sketch follows below)
Local minima are possible, making the result depend on the initial guess
(use multiple restarts)
Computationally difficult for large n
Scale and rotation are often meaningless
Axis interpretation
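As an illustration, a minimal metric-MDS sketch in Python on hypothetical data; scikit-learn's MDS accepts a precomputed dissimilarity matrix, and n_init provides the multiple restarts mentioned above.

```python
# Minimal metric-MDS sketch on hypothetical data.
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))                 # 50 hypothetical high-dimensional points
rho = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise dissimilarities

# n_init restarts the iterative optimization from several initial guesses.
mds = MDS(n_components=2, dissimilarity="precomputed", n_init=8, random_state=0)
embedding = mds.fit_transform(rho)            # 2-D coordinates f(x^(1)), ..., f(x^(n))
print(embedding.shape, mds.stress_)           # final value of the stress S(f)
```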
Example: Crime Rates
Correlations of crime rates over 50 US states from the 1970 US Census
(Wilkinson, 1990)
Crimes: murder, rape, robbery, assault, burglary, larceny, autotheft
$\rho = \sqrt{1 - s}$ (element-wise), where $s$ is the correlation matrix

    1.00 0.52 0.34 0.81 0.28 0.06 0.11
    0.52 1.00 0.55 0.70 0.68 0.60 0.44
    0.34 0.55 1.00 0.56 0.62 0.44 0.62
    0.81 0.70 0.56 1.00 0.52 0.32 0.33
    0.28 0.68 0.62 0.52 1.00 0.80 0.70
    0.06 0.60 0.44 0.32 0.80 1.00 0.55
    0.11 0.44 0.62 0.33 0.70 0.55 1.00

$S(f) = \sum_{i<j} R\left(\|f(x^{(i)}) - f(x^{(j)})\|, \rho(x^{(i)}, x^{(j)})\right)$
Objects $x^{(1)}, \ldots, x^{(n)}$ may be defined only implicitly
Survey of similarity between nations (1-10 scale) among students (Wish,
1971):
Brazil, China, Congo, Cuba, Egypt, France, India, Israel, Japan, USSR,
USA, Yugoslavia

Using $\rho = \sqrt{1 - D/10 - I}$,

$S(f) = \sum_{i<j} R\left(\|f(x^{(i)}) - f(x^{(j)})\|, \rho(x^{(i)}, x^{(j)})\right)$
Example: Text Visualization
Stage 1: Convert each document to a vector $x^{(i)}$ expressing phrase
appearances
Stage 2: Express dissimilarity using a vector distance, e.g.,

$\rho(x^{(i)}, x^{(j)}) = \sqrt{\sum_w \left(x^{(i)}_w - x^{(j)}_w\right)^2}$
Computational challenge: for large n, subsample or replace individual
documents by cluster centroids
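A small sketch of the two stages on a toy corpus; CountVectorizer stands in for the phrase-appearance representation, and the documents here are hypothetical.

```python
# Stage 1: bag-of-words vectors; Stage 2: pairwise Euclidean dissimilarities.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog sat", "stocks fell sharply"]   # hypothetical corpus
X = CountVectorizer().fit_transform(docs).toarray().astype(float)
rho = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # rho(x^(i), x^(j))
print(np.round(rho, 2))
```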
[Screenshot: IN-SPIRE (NVAC & PNNL)]
Case Study: Votes and Movie Preferences
Visualize preferences A ≺ · · · ≺ C over d items issued by “judges”.
Dataset 1: Election votes (APA presidential votes d = 5 candidates)
Dataset 2: Joke preferences (Jester dataset d = 100 jokes)
Dataset 3: Movie preferences (EachMovie dataset, d = 1628
movies)
Challenges:
Some “judges” omit preferences concerning some items e.g.,
unobserved movies
Define meaningful distance function between votes that is
computationally tractable for large d
[Figures: MDS embeddings of rankings for APA, d = 5 (Kidwell et al. 2008);
Jester, d = 100; and EachMovie, d = 1628]
Metric vs. Non-Metric MDS
A variation of MDS approximates the original geometry by a
monotonic transformation t of the embedded distances (called
disparities)
The embedded distances have the right ordering (approximately)
rather than the precise values

$\min_{t, f} \sum_{i<j} R\left(t(\|f(x^{(i)}) - f(x^{(j)})\|), \rho(x^{(i)}, x^{(j)})\right) \quad (1)$
Evaluation of MDS
Shepard’s diagram plots the embedded distances as a function of the
original dissimilarities
[Figure: Shepard's diagram]
The embedded distances appear to be monotonically increasing in the original
dissimilarities (undervalued for small distances and overvalued for larger distances)
Shepard’s diagram for non-metric MDS + disparities vs. distances
Procrustes Rotation
MDS objective function is invariant to rigid body transformation
(reflection, rotation, translation)
No reason to prefer one rotation+translation over another
Procrustes aligns two MDS figures to improve side by side
visualization
Step 1: center both figures around 0 by subtracting centroid
Step 2: rotate by

$T = \arg\min_{TT^\top = I} \operatorname{trace}\left((A - BT)(A - BT)^\top\right)$
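A NumPy sketch of the two steps on synthetic configurations; the optimal orthogonal T is obtained from the SVD of $B^\top A$, a standard solution of this minimization.

```python
# Procrustes alignment: center both figures, then rotate B onto A.
import numpy as np

def procrustes_rotate(A, B):
    A = A - A.mean(axis=0)                    # step 1: center both figures at 0
    B = B - B.mean(axis=0)
    U, _, Vt = np.linalg.svd(B.T @ A)
    T = U @ Vt                                # step 2: optimal orthogonal transform
    return A, B @ T

rng = np.random.default_rng(1)
A = rng.normal(size=(8, 2))
theta = 0.7                                   # B is A rotated by theta
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
A_centered, B_aligned = procrustes_rotate(A, A @ R)
print(np.allclose(A_centered, B_aligned))     # True: the rotation has been undone
```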
Global vs. Local MDS
In some cases local structure (small distances) should be preserved
more faithfully than global structure (large distances)
Intuition: as distances grow, their ordering becomes nearly as
important as their precise values
Motivation: high dimensional data often lies on a non-linear manifold.
Approximating local distances in 2-D is much easier than the
unreasonable task of approximating all distances
$S(f) = \sum_{i<j} w(\rho(x^{(i)}, x^{(j)}))\, R\left(\|f(x^{(i)}) - f(x^{(j)})\|, \rho(x^{(i)}, x^{(j)})\right)$

where $w$ is a monotonically decreasing function
[Figure (Chen and Buja 2008)]
[Figure (Chen and Buja 2008)]
Case Study: Search Engine Visualization
(Sun et al. 2010)
Goal: visualize the relationships among eight search engines: Google,
Bing, Yahoo, Ask, Altavista, Alltheweb, AOL, Lycos
Step 1: send a query or a list of queries to the search engines
Step 2: Measure distances among them (including ground truth if
possible)
Web-App Example
Queries
Tourism: Times Square, Sydney Opera House, Eiffel Tower, ...
Celebrity Names: Michael Bolton, Michael Jackson, Jackie Chan, ...
Sports: Football, Acrobatics, Karate, Pole Vault, Butterfly Stroke, ...
University Names: Georgia Institute of Technology, University of Florida, ...
Company: Goldman Sachs, Facebook, Honda, Cisco Systems, ...
Questions: How are flying buttresses constructed, ...

Stage 1: Average queries within categories to create per-category MDS
embeddings
Stage 2: Procrustes rotation
[Figures: per-category MDS embeddings for "Questions w3" and "Sports w3",
and "Questions w3" after Procrustes rotation; points: 1. altavista,
2. alltheweb, 3. ask, 4. google, 5. lycos, 6. live, 7. yahoo, 8. aol
(Sun et al. 2010)]
Principal Component Analysis
Same as MDS for Euclidean data: the original data $x^{(1)}, \ldots, x^{(n)}$ is in
$\mathbb{R}^d$ and $\rho(x^{(i)}, x^{(j)}) = \sqrt{\sum_k \left(x^{(i)}_k - x^{(j)}_k\right)^2}$
Solution is given in terms of the SVD (eigenvalues, eigenvectors) of the
empirical covariance matrix (efficient even for large n, d)

$\Sigma = \frac{1}{n} \sum_{i=1}^n (x^{(i)} - \bar{x})(x^{(i)} - \bar{x})^\top = U \operatorname{diag}(\sigma_1, \sigma_2, \ldots, \sigma_d)\, U^\top \approx U \operatorname{diag}(\sigma_1, \sigma_2, 0, \ldots, 0)\, U^\top$

Single global solution
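A sketch of PCA on hypothetical data: eigendecompose the empirical covariance and keep the top two eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))                 # hypothetical data: n = 200, d = 6
Xc = X - X.mean(axis=0)                       # center the data
Sigma = (Xc.T @ Xc) / len(Xc)                 # empirical covariance matrix
sigma, U = np.linalg.eigh(Sigma)              # eigenvalues in ascending order
U2 = U[:, ::-1][:, :2]                        # top-2 principal directions
embedding = Xc @ U2                           # (200, 2) coordinates
print(embedding.shape)
```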
Non-Negative Matrix Factorization
$\arg\min_{W, H} \sum_i \sum_j \left(X_{ij} - [WH]_{ij}\right)^2 \quad \text{s.t. } W_{ij}, H_{ij} \ge 0$
n rows of $X \in \mathbb{R}^{n \times d}$ are data vectors
r rows of $H \in \mathbb{R}^{r \times d}$ are non-negative topics/code-words/factors
n rows of $W \in \mathbb{R}^{n \times r}$ represent the non-negative degree of
membership of data vectors in the codewords.
Maintaining non-negativity prevents one factor from removing
content that another factor contributed
Iterative optimization required; Factorization not unique.
Extremely powerful in uncovering latent factors:
clustering
compression/coding
recommendation systems
visualization
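A sketch of the optimization using Lee and Seung's multiplicative updates for the squared-error objective, one standard iterative scheme; X here is hypothetical non-negative data.

```python
import numpy as np

def nmf(X, r, iters=200, eps=1e-9):
    rng = np.random.default_rng(0)
    n, d = X.shape
    W, H = rng.random((n, r)), rng.random((r, d))
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)  # update keeps H non-negative
        W *= (X @ H.T) / (W @ H @ H.T + eps)  # update keeps W non-negative
    return W, H

X = np.abs(np.random.default_rng(1).normal(size=(30, 12)))
W, H = nmf(X, r=4)
print(np.linalg.norm(X - W @ H))              # reconstruction error
```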
[Figure (Lee and Seung, 1999)]
[Figure (Lee and Seung, 1999)]
Pattern Discovery: Outline
Entropy and conditional entropy
Mutual information
Association rule mining
Example: census data
Example: movie recommendation systems
Entropy
$H(X) = \sum_x p(x) \log \frac{1}{p(x)}$
Measures uncertainty in knowing the value X
Expected number of yes/no questions needed to find out X
Expected number of bits needed to compress X (cannot do better)
Maximum entropy achieved for uniform distribution
Minimum entropy achieved for constant or deterministic variables
Example (logs base 2):
p(X = a) = 1/2, p(X = b) = 1/4, p(X = c) = 1/8, p(X = d) = 1/8
p(Y = a) = 1/4, p(Y = b) = 1/4, p(Y = c) = 1/4, p(Y = d) = 1/4

$H(X) = \frac{1}{2}\log 2 + \frac{1}{4}\log 4 + 2 \cdot \frac{1}{8}\log 8 = 7/4$
$H(Y) = 4 \cdot \frac{1}{4}\log 4 = 2$
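The example can be checked numerically; a minimal sketch with logs base 2:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                              # convention: 0 * log(1/0) = 0
    return float(np.sum(p * np.log2(1.0 / p)))

print(entropy([1/2, 1/4, 1/8, 1/8]))          # 1.75 = 7/4
print(entropy([1/4, 1/4, 1/4, 1/4]))          # 2.0, uniform = maximum entropy
```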
Conditional Entropy
$H(X \mid Y = y) = \sum_x p(x \mid Y = y) \log \frac{1}{p(x \mid Y = y)}$
Measures uncertainty in knowing the value X if you know Y = y
Expected number of yes/no questions needed to find out X if you
know Y = y
Expected number of bits needed to compress X (cannot do better) if
you know Y = y
$H(X \mid Y) = \sum_y p(y)\, H(X \mid Y = y)$
Measures uncertainty in knowing the value X if you know Y
Expected number of yes/no questions needed to find out X if you
know Y
Expected number of bits needed to compress X (cannot do better) if
you know Y
Mutual Information
$I(X, Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$

Symmetric
Reduction in the number of yes/no questions needed to know X as a
result of knowing Y
Reduction in the number of bits needed to compress X as a result of knowing Y
$I(X, Y) = 0$ for X, Y independent, and $I(X, X) = H(X)$.
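A sketch computing I(X, Y) for a hypothetical joint table, using the equivalent identity I(X, Y) = H(X) + H(Y) − H(X, Y):

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

pxy = np.array([[1/4, 1/4],                   # hypothetical joint p(x, y)
                [0.0, 1/2]])
I = H(pxy.sum(axis=1)) + H(pxy.sum(axis=0)) - H(pxy)
print(I)                                      # 0 iff X and Y are independent
```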
Association Rule Mining
Goal: Mine data (shopping transactions) to detect patterns of behavior
Stage 0: Construct a probability estimate $\hat p$, say by maximum
likelihood
Stage 1: Define a set of candidate binary events $A_1, \ldots, A_k$
Stage 2: Compute $I(A_i, A_j)$ for all $\binom{k}{2}$ combinations
Stage 3: Order the pairs $A_i, A_j$ by mutual information and inspect the
top pairs
Stage 4: Detect the precise rule shape, i.e., $A_i \Rightarrow A_j$ or $A_i^c \Rightarrow A_j$ or
$A_i \Rightarrow A_j^c$, etc., by examining probabilities
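A sketch of stages 1-3 on hypothetical 0/1 transaction data, with probabilities estimated by maximum likelihood as in stage 0:

```python
import numpy as np
from itertools import combinations

def pair_mi(a, b):
    I = 0.0
    for va in (0, 1):                          # mutual information of two
        for vb in (0, 1):                      # binary events
            p = np.mean((a == va) & (b == vb))
            pa, pb = np.mean(a == va), np.mean(b == vb)
            if p > 0:
                I += p * np.log2(p / (pa * pb))
    return I

rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(1000, 6))      # hypothetical binary events A_1..A_6
data[:, 1] = data[:, 0] ^ (rng.random(1000) < 0.1)  # make one pair dependent
scores = sorted(((pair_mi(data[:, i], data[:, j]), i, j)
                 for i, j in combinations(range(6), 2)), reverse=True)
print(scores[:3])                              # stage 3: inspect the top pairs
```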
Case Study: Census Questions
number in household = 1 AND number of children = 0
⇒ language at home = English

language in home = English AND householder status = own AND
occupation ∈ {professional/managerial}
⇒ income ≥ $40,000

language in home = English AND income ≤ $40,000 AND
marital status = not married AND number of children = 0
⇒ education ∉ {college graduate, graduate study}
(Hastie, 2009)
Case Study: Netflix Movie preferences
Shrek ≺ LOTR: The Fellowship of the Ring ⇒ Shrek 2 ≺ LOTR: The Return of the King
Shrek ≺ LOTR: The Fellowship of the Ring ⇒ Shrek 2 ≺ LOTR: The Two Towers
Shrek 2 ≺ LOTR: The Fellowship of the Ring ⇒ Shrek ≺ LOTR: The Return of the King
Kill Bill 2 ≺ National Treasure ⇒ Kill Bill 1 ≺ I, Robot
Shrek 2 ≺ LOTR: The Fellowship of the Ring ⇒ Shrek 2 ≺ LOTR: The Two Towers
LOTR: The Fellowship of the Ring ≺ Monsters, Inc. ⇒ LOTR: The Two Towers ≺ Shrek
National Treasure ≺ Kill Bill 2 ⇒ Pearl Harbor ≺ Kill Bill 1
LOTR: The Fellowship of the Ring ≺ Monsters, Inc. ⇒ LOTR: The Return of the King ≺ Shrek
How to Lose a Guy in 10 Days ≺ Kill Bill 2 ⇒ 50 First Dates ≺ Kill Bill 1
I, Robot ≺ Kill Bill 2 ⇒ The Day After Tomorrow ≺ Kill Bill 1
(Sun et al. 2010)
Like A ⇒ Like B:
Kill Bill 1 ⇒ Kill Bill 2
Maid in Manhattan ⇒ The Wedding Planner
Two Weeks Notice ⇒ Miss Congeniality
The Royal Tenenbaums ⇒ Lost in Translation
The Royal Tenenbaums ⇒ American Beauty
The Fast and the Furious ⇒ Gone in 60 Seconds
Spider-Man ⇒ Spider-Man 2
Anger Management ⇒ Bruce Almighty
Memento ⇒ Pulp Fiction

Like A ⇒ Dislike B:
Maid in Manhattan ⇒ Pulp Fiction
Maid in Manhattan ⇒ Kill Bill: 1
How to Lose a Guy in 10 Days ⇒ Pulp Fiction
The Royal Tenenbaums ⇒ Pearl Harbor
The Wedding Planner ⇒ The Matrix
Pearl Harbor ⇒ Memento
Lost in Translation ⇒ Pearl Harbor
The Day After Tomorrow ⇒ American Beauty
The Wedding Planner ⇒ Raiders of the Lost Ark

(Sun et al., 2010)
Clustering: Outline
Motivation and setup
K-Means algorithm
Example: word clustering
Clustering
Goal: Partition the data $D = \{x^{(1)}, \ldots, x^{(n)}\}$ into k distinct clusters such that
similar data vectors are assigned to the same cluster

$\arg\min_{S_1 \cup \cdots \cup S_k} \sum_{j=1}^k \sum_{x \in S_j} \|x - \mu^{(j)}\|^2, \quad \mu^{(j)} = \operatorname{average}(S_j)$
k-means: Iterate to convergence
Assignment: Assign each data vector to the cluster with the closest
mean
Update: Calculate the new means for each cluster based on revised
assignment
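A minimal k-means sketch on hypothetical 2-D data, alternating the assignment and update steps:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]    # initial means
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=-1)
        assign = d.argmin(axis=1)                        # assignment step
        new_mu = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                           else mu[j] for j in range(k)])  # update step
        if np.allclose(new_mu, mu):
            break                                        # converged
        mu = new_mu
    return assign, mu

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in (-3.0, 0.0, 3.0)])
assign, mu = kmeans(X, k=3)
print(np.round(mu, 1))                                   # one mean near each center
```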
Clustering: Visualization Applications
Computational difficulties due to large n, e.g., MDS for large text
archives:
partition the data vectors $x^{(1)}, \ldots, x^{(n)}$ into $k \ll n$ clusters and proceed
with MDS on the cluster centroids
Computational difficulties due to large d, e.g., MDS for high
dimensional data:
partition the data dimensions into $k \ll d$ clusters and proceed with MDS
on the clustered dimensions
Word Clustering in Reuters RCV1
Define similarity using the contextual distribution: $\operatorname{sim}(w, v) = f(p(\cdot \mid v), p(\cdot \mid w))$
Example clusters (stemmed words):
{jan, feb, nov, dec, oct, aug, apr, mar, sep}
{databas, intranet, server, softwar, internet, netscap, onlin, web, browser}
{nbc, abc, cnn, hollywood, tv, viewer, movi, audienc, fox}
{wang, chen, liu, beij, wu, china, chines, peng, hui}
{ottawa, quebec, montreal, toronto, ontario, vancouv, canada, canadian, calgari}
(Dillon et al. 2007)
Regression: Outline
Motivation and setup
Linear regression
Regression trees
Regression and Visualization
Predict $Y \in \mathbb{R}$ given $X \in \mathbb{R}^d$ based on training data $(x^{(i)}, y^{(i)})$,
$i = 1, \ldots, n$
Understand the relationship between X and Y by inferring a simple functional
dependency rather than relying on scatter plot visualization
Variable selection tools enable detecting which data dimensions (or
combinations of data dimensions) are relevant for predicting Y
Linear Regression
Predict $Y \in \mathbb{R}$ given $X \in \mathbb{R}^d$ based on training data $(x^{(i)}, y^{(i)})$,
$i = 1, \ldots, n$
Linear regression assumption: $Y \mid X \sim N(\theta^\top X, \sigma^2)$
Recover θ by maximum likelihood or least squares:
$\hat\theta = \arg\min_\theta \sum_{i=1}^n (y^{(i)} - \theta^\top x^{(i)})^2$
Predict for new data: $\hat y = \hat\theta^\top x$
Closed form for the optimization problem (requires matrix inversion),
single global optimum
Computationally efficient even for large n, d (millions, billions, ...)
May be non-linear in $X_1, \ldots, X_d$ by regressing on non-linear features
$f_1(X), \ldots, f_{d'}(X)$
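A sketch on synthetic data; lstsq solves the least-squares problem without forming an explicit inverse, and a constant column provides the intercept:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])  # [1, x1, x2]
theta_true = np.array([1.0, 2.0, -0.5])
y = X @ theta_true + 0.1 * rng.normal(size=100)   # Y|X ~ N(theta^T X, sigma^2)

theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None) # arg min sum (y - theta^T x)^2
y_new = np.array([1.0, 0.3, -0.2]) @ theta_hat    # prediction for a new x
print(np.round(theta_hat, 2), round(float(y_new), 2))
```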
[Figure: 3-D scatter plot of MPG against Weight and Horsepower with the fitted surface]
$\widehat{\text{MPG}} = \hat\theta^\top (c, \text{weight}, \text{horsepower}, \text{weight} \cdot \text{horsepower})$
Regression Trees
Dependency of Y on X may have a different functional form in different
regions of the data space
Regression trees are non-parametric regression models where the
leaves partition the input space and determine $\hat y$ as the data average
Prediction in leaf/region A: $\hat y = \operatorname{average}(y^{(i)} : x^{(i)} \in A)$
Tree construction: starting with all training data in the root,
iteratively construct the tree based on

$R_1(j, s) = \{x : x_j \le s\}, \qquad R_2(j, s) = \{x : x_j > s\}$

$\min_{j, s} \left[ \min_{c_1} \sum_{x^{(i)} \in R_1(j, s)} (y^{(i)} - c_1)^2 + \min_{c_2} \sum_{x^{(i)} \in R_2(j, s)} (y^{(i)} - c_2)^2 \right]$
A split-search sketch follows; see the figure below for tree diagrams (Hastie et al., 2009)
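A sketch of a single greedy split search on synthetic data; the inner minimizations over c1 and c2 are solved in closed form by the region means:

```python
import numpy as np

def best_split(X, y):
    best = (np.inf, None, None)                 # (score, feature j, threshold s)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j])[:-1]:       # candidate thresholds
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best[0]:
                best = (sse, j, s)
    return best

rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 2))
y = np.where(X[:, 0] <= 0.5, 1.0, 3.0) + 0.1 * rng.normal(size=100)
print(best_split(X, y))                         # picks j = 0, s near 0.5
```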
[Figure: recursive binary partition of the input space by splits $X_1 \le t_1$,
$X_2 \le t_2$, $X_1 \le t_3$, $X_2 \le t_4$ into regions $R_1, \ldots, R_5$, with the
corresponding tree and prediction surface (Hastie et al., 2009)]
Classification: Outline
Motivation and setup
Generative approaches
Discriminative approaches
Classification trees
Classification for Visualization
Goal: Predict $Y \in \{1, 2, \ldots, k\}$ based on $X \in \mathbb{R}^d$
Visualize dependencies between X and Y by examining classification
rules or decision boundary
Pre-processing in text visualization: part of speech tagging, named
entity recognition, word sense disambiguation
Filter relevant data for visualization from massive archive (face
images, credit card fraud, articles concerning a certain topic)
Generative Classification
Generative classification: fit $\hat p(x, y) = \hat p(x \mid y)\, \hat p(y)$ based on training data
and classify $\hat y = \arg\max_y \hat p(y \mid x)$
Fisher's LDA: estimate $\hat p(x \mid y)$ using the MLE for a multivariate Gaussian:
$p(x \mid y) = N(x; \theta_y, \Sigma_y)$
Naive Bayes: $p(x \mid y) \approx \prod_i p(x_i \mid y)$ (estimate the right-hand side using
1-D MLE)
In either case $p(y)$ is estimated using empirical frequencies
Discriminative Classification
Discriminative classification:

$\hat y = \operatorname{sign}(\hat\theta^\top x), \qquad \hat\theta = \arg\min_\theta \frac{1}{n} \sum_{i=1}^n L(y^{(i)} \theta^\top x^{(i)})$

where
$L_1(r) = \exp(-r)$ (boosting)
$L_2(r) = \log(1 + \exp(-r))$ (logistic regression)
$L_3(r) = \max(1 - r, 0)$ (support vector machine)
Typically perform better than generative classifiers
Minimization problems convex for the loss functions above
Interpretation as maximum conditional likelihood for p(y |x)
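A sketch minimizing the logistic loss $L_2$ by gradient descent on synthetic data with labels in {−1, +1}; the other losses slot in analogously:

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, iters=500):
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        margins = y * (X @ theta)                         # r = y * theta^T x
        # gradient of (1/n) sum log(1 + exp(-r)) with respect to theta
        grad = -(X.T @ (y / (1.0 + np.exp(margins)))) / len(y)
        theta -= lr * grad
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=200))
theta = fit_logistic(X, y)
print(np.mean(np.sign(X @ theta) == y))                   # training accuracy
```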
Classification Trees
Similar to regression trees: iteratively split nodes to create a partition of
the space $A_1, \ldots, A_r$, where

$\hat p_{A_j}(y = 1) = \frac{1}{|A_j|} \sum_{x^{(i)} \in A_j} I(y^{(i)} = 1)$

$k(A_j) = \begin{cases} 1 & \hat p_{A_j}(y = 1) > \hat p_{A_j}(y = -1) \\ 0 & \text{otherwise} \end{cases}$

$\operatorname{splitScore}(A = B \cup C) = \frac{1}{|B|}\, H(\hat p_B) + \frac{1}{|C|}\, H(\hat p_C)$
Larger trees tend to overfit the training data and perform poorly on
future unseen data
Possible solution is to prefer shorter trees e.g., choose the smallest
tree that is not too much worse than the best tree
Semisupervised and Active Learning: Outline
Two important deviations from standard ML setup
Semisupervised learning: learn X → Y using a combination of
labeled and unlabeled data.
Particularly useful when massive archives of unlabeled data exist:
language, speech, internet, images
Active learning: learn X → Y by interactively choosing which
datapoints are to be labeled.
Useful for interactive visualization applications
Semisupervised Learning (SSL)
Goal: Predict Y based on X (classification or regression) given
labeled data $(x^{(i)}, y^{(i)})$, $i = 1, \ldots, n$ and unlabeled data $x^{(i)}$,
$i = n+1, \ldots, n+m$.
Motivation: In many cases labeled data is much more expensive to
obtain than unlabeled data, i.e., $n \ll m$.
Prediction accuracy increases with both n, m
Approach 1, generative SSL: maximize the likelihood of the observed data

$\hat\theta = \arg\max_\theta \left\{ \sum_{i=1}^n \log p_\theta(x^{(i)}, y^{(i)}) + \sum_{i=n+1}^{n+m} \log \sum_y p_\theta(x^{(i)}, y) \right\}$
Approach 2, discriminative SSL: use unlabeled data to bias selection
towards smooth models

$\hat\theta = \arg\min_\theta \frac{1}{n} \sum_{i=1}^n L(y^{(i)} \theta^\top x^{(i)}) + R(\theta, x^{(n+1)}, \ldots, x^{(n+m)})$
[Figure (Zhou, 2005)]
[Figure (Zhou, 2005)]
Active Learning
Goal: Predict Y from X based on training data $(x^{(i)}, y^{(i)})$, $i = 1, \ldots, n$,
where the $x^{(i)}$ are chosen rather than observed
Motivation: prediction X → Y is harder in some areas of the space
of data vectors X. Choosing the $x^{(i)}$ to concentrate on
challenging areas makes better use of resources
Approach: choose the $x^{(i)}$ sequentially in regions where the
predictor is least certain, i.e., where $H(\hat p)$ is highest
Violates iid assumption that is central to proving large sample
consistency in many cases
Useful way to insert user interaction into the modeling process
(visual analytics)
Visual Analytics: Outline
Motivation and setup
Case study: text visualization
Visual Analytics?
Analyze complex data by displaying and examining visual cues
dimensionality reduction
user interaction
domain knowledge
iterative process
Related disciplines: machine learning, visualization, graphics, human
computer interaction
[Diagram: visual analytics pipeline. Raw data (text, images, video, etc.)
→ raw data representation (high dimensional vector data)
→ dimensionality reduction (low dimensional vector data)
→ visualization system, with the visualization evaluated by the user;
a domain expert or user feeds domain knowledge and feedback back into
the earlier stages]
Application: Text Visualization
Embed document collection in 2-D while preserving semantically coherent
spatial structure. Challenges include:
more than one interpretation of semantic coherence
(topic, sentiment, author, interest)
incorporating domain knowledge
incorporating user feedback
quantitative evaluation
(Mao et al. 2010)
Text Visualization as Metric Learning
Standard dimensionality reduction methods (PCA, LLE, t-SNE, etc.)
assume Euclidean geometry, which is inappropriate for text: words
are (a priori) orthogonal
Adjust methods to work on non-Euclidean geometry
$d_H(x, y) = \sqrt{(x - y)^\top H^\top H (x - y)} \quad (2)$
where H reflects the relationship between words and the visualization
goal (equivalent to composing the transformation $x \mapsto Hx$ with
standard dimensionality reduction techniques)
Problem: determine H using domain knowledge and user interaction (no
labeled data)
Method A: Manual Specification
Define a block-diagonal matrix H = RD from manually constructed word
clusters (R is a stochastic translation matrix, D is a diagonal weighting
matrix), e.g.,

$R = \begin{pmatrix} 0.8 & 0.1 & 0.1 & 0 & 0 \\ 0.1 & 0.8 & 0.1 & 0 & 0 \\ 0.1 & 0.1 & 0.8 & 0 & 0 \\ 0 & 0 & 0 & 0.9 & 0.1 \\ 0 & 0 & 0 & 0.1 & 0.9 \end{pmatrix}, \qquad D = \operatorname{diag}(5, 5, 5, 3, 3)$
[Figure: manually constructed vocabulary hierarchy with top-level nodes
Sci & Tech, Politics, Religion, Sports, and Others; sub-nodes include
Comp (HW, SW, GUI), Med, Mid East, People, Christianity (bible, gospel,
amen, christians, santa), Evol, History, Team Name, and sports terms
(canoeing, catch, boxing, innings, soccer)]
Method B: Mahalanobis Distance
$H^\top H = \Sigma^{-1}$, where Σ is the covariance matrix of the underlying
distribution (estimated from a large dataset)
Method C: Contextual Diffusion
Contextual distribution of word v:

$q_v(w) = p(w \text{ appears in } x \mid v \text{ appears in } x) \quad (3)$

The matrix H is determined by the similarity of the contextual distributions:

$H(u, v) = \exp\left(-c \arccos^2\left(\sum_w \sqrt{q_u(w)\, q_v(w)}\right)\right)$
Intuitively, the word u will be translated or diffused into v depending
on the geometric diffusion between the distributions of likely
contexts.
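A sketch of H(u, v) from hypothetical co-occurrence counts; the inner sum is the Bhattacharyya coefficient between the two contextual distributions, and c is a scale parameter:

```python
import numpy as np

rng = np.random.default_rng(0)
counts = rng.integers(0, 5, size=(8, 8))       # hypothetical word co-occurrence counts
counts = counts + 1                            # add-one smoothing avoids empty rows
q = counts / counts.sum(axis=1, keepdims=True) # q_v(w): each row sums to 1

def H_uv(u, v, c=1.0):
    bc = np.sum(np.sqrt(q[u] * q[v]))          # Bhattacharyya coefficient
    return np.exp(-c * np.arccos(np.clip(bc, 0.0, 1.0)) ** 2)

print(round(H_uv(0, 1), 3), round(H_uv(0, 0), 3))   # H(v, v) = 1
```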
Method D: Google n-Grams
Same as C but with contextual distribution estimated from the Google
n-gram dataset (n-gram counts, for n ≤ 5, obtained from the Google
crawler based on processing over a trillion words of running text)
Method E: Word-Net
Define H based on word similarity measures from the Word-Net dataset
(tried several standard measures).
Measure (i) (lower is better):

        PCA (1)   PCA (2)   t-SNE (1)   t-SNE (2)
H = I   1.5391    1.4085    1.1649      1.1206
B       1.9314    1.7126    1.6172      1.3008
C       1.2570    1.3036    1.2182      1.2331
D       1.2023    1.3407    0.7844      1.0723
E       1.4475    1.3352    1.1762      1.1362

Measure (iii), k = 5 (higher is better):

        PCA (1)   PCA (2)   t-SNE (1)   t-SNE (2)
H = I   0.8461    0.5630    0.9056      0.7281
B       0.6073    0.4614    0.8249      0.7207
C       0.7381    0.6815    0.9110      0.6724
D       0.8420    0.5898    0.9323      0.7359
E       0.8532    0.5868    0.9013      0.7728
Table 1: Quantitative evaluation of dimensionality reduction for visualization for two
tasks in the news article domain. The numbers in the top table correspond to
measure (i) (lower is better), and the numbers in the bottom table correspond to
measure (iii) with k = 5 (higher is better). We conclude that contextual diffusion (C),
Google n-gram (D), and Word-Net (E) tend to outperform the original H = I. The
Mahalanobis distance performs poorly.