A Two-Stage Approach to Domain
Adaptation for Statistical Classifiers
Jing Jiang & ChengXiang Zhai
Department of Computer Science
University of Illinois at Urbana-Champaign
What is domain adaptation?
Nov 7, 2007
CIKM2007
Example: named entity recognition
persons, locations, organizations, etc.
[Figure: the standard supervised NER setting. Train (labeled): New York Times; test (unlabeled): New York Times. Supervised learning yields a classifier scoring 85.5%.]
Example: named entity recognition
persons, locations, organizations, etc.
[Figure: a non-standard (realistic) NER setting. Labeled New York Times data is not available, so the classifier is trained on labeled Reuters text and tested on unlabeled New York Times text, scoring only 64.1%.]
Domain difference → performance drop

[Figure: ideal setting: an NER classifier trained on NYT and tested on New York Times scores 85.5%. Realistic setting: trained on Reuters and tested on New York Times, it scores 64.1%.]
Another NER example

[Figure: gene name recognition. Ideal setting: a recognizer trained on mouse data and tested on mouse data scores 54.1%. Realistic setting: trained on fly data and tested on mouse data, it scores only 28.1%.]
Other examples

- Spam filtering: public email collection → personal inboxes
- Sentiment analysis of product reviews: digital cameras → cell phones; movies → books

Can we do better than standard supervised learning?

Domain adaptation: to design learning methods that are aware of the training and test domain difference.
How do we solve the problem in general?
Observation 1: domain-specific features

wingless, daughterless, eyeless, apexless, …
- describing phenotype
- in fly gene nomenclature
- the feature "-less" is weighted high

Is the feature still useful for other organisms (e.g., CD38, PABPC5, …)? No!
Observation 2: generalizable features

"…decapentaplegic and wingless are expressed in analogous patterns in each…"

"…that CD38 is expressed by both neurons and glial cells…that PABPC5 is expressed in fetal brain and in a range of adult tissues."

The feature "X be expressed" appears in both domains.
General idea: two-stage approach

[Figure: the feature space is split into generalizable features and domain-specific features, shared between a source domain and a target domain.]
Goal

[Figure: learn a classifier over the feature space that works well on the target domain, not just the source domain.]
Regular classification

[Figure: regular classification trains on the source domain only, over the full feature space.]
Generalization: to emphasize generalizable features in the trained model (Stage 1)

[Figure: the model trained on the source domain leans on the generalizable features shared with the target domain.]
Adaptation: to pick up domain-specific features for the target domain (Stage 2)

[Figure: the model additionally learns the target domain's own domain-specific features.]
Regular semi-supervised learning

[Figure: regular semi-supervised learning uses source-domain labels and unlabeled target data over the full feature space, without distinguishing generalizable from domain-specific features.]
Comparison with related work

- We explicitly model generalizable features.
  - Previous work models them implicitly [Blitzer et al. 2006, Ben-David et al. 2007, Daumé III 2007].
- We do not need labeled target data, but we need multiple source (training) domains.
  - Some work requires labeled target data [Daumé III 2007].
- We have a 2nd stage of adaptation, which uses semi-supervised learning.
  - Previous work does not incorporate semi-supervised learning [Blitzer et al. 2006, Ben-David et al. 2007, Daumé III 2007].
Implementation of the two-stage approach with logistic regression classifiers
Logistic regression classifiers

[Figure: a weight vector w_y = (0.2, 4.5, 5, -0.3, 3.0, …, 2.1, -0.9, 0.4)^T paired with a vector x of p binary features (0, 1, 0, 0, 1, …, 0, 1, 0)^T. Features include "-less" and "X be expressed", e.g., firing on "…and wingless are expressed in…"; the score for class y is w_y^T x.]

p(y | x, w) = exp(w_y^T x) / Σ_{y'} exp(w_{y'}^T x)
Learning a logistic regression classifier

ŵ = argmin_w [ λ ||w||^2 − (1/N) Σ_{i=1}^N log ( exp(w_{y_i}^T x_i) / Σ_{y'} exp(w_{y'}^T x_i) ) ]

- regularization term: penalize large weights to control model complexity
- second term: log likelihood of the training data
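The objective above can be sketched numerically. Below is a minimal sketch for the binary-label special case of the slide's multiclass formula, trained by plain gradient descent; the function name, toy data, and hyperparameters are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def fit_logreg(X, y, lam=0.1, lr=0.5, steps=500):
    """Minimize lam*||w||^2 - (1/N) sum_i log p(y_i | x_i; w)
    for binary labels y in {0, 1} (the sigmoid is the 2-class softmax)."""
    N, p = X.shape
    w = np.zeros(p)
    for _ in range(steps):
        prob = 1.0 / (1.0 + np.exp(-X @ w))   # p(y=1 | x; w)
        # gradient of the penalized negative average log-likelihood
        grad = 2 * lam * w + X.T @ (prob - y) / N
        w -= lr * grad
    return w

# toy data: feature 0 is predictive, feature 1 is noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(float)
w = fit_logreg(X, y)
```

Raising `lam` shrinks ||w||, which is exactly the "penalize large weights" role the slide assigns to the regularization term.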
Generalizable features in weight vectors

[Figure: weight vectors w^1, w^2, …, w^K learned from K source domains D_1, D_2, …, D_K, e.g., w^1 = (0.2, 4.5, 5, -0.3, 3.0, …, 2.1, -0.9, 0.4)^T, w^2 = (3.2, 0.5, 4.5, -0.1, 3.5, …, 0.1, -1.0, -0.2)^T, w^K = (0.1, 0.7, 4.2, 0.1, 3.2, …, 1.7, 0.1, 0.3)^T. A few positions carry large weights in every domain (5 / 4.5 / 4.2 and 3.0 / 3.5 / 3.2): the generalizable features. The remaining positions vary across domains: the domain-specific features.]
We want to decompose w in this way

(0.2, 4.5, 5, -0.3, 3.0, …, 2.1, -0.9, 0.4)^T
  = (0, 0, 4.6, 0, 3.2, …, 0, 0, 0)^T        (h non-zero entries for h generalizable features)
  + (0.2, 4.5, 0.4, -0.3, -0.2, …, 2.1, -0.9, 0.4)^T
Feature selection matrix A

A is an h × p matrix that selects the h generalizable features; each row contains a single 1 picking out one feature:

A = [ 0 0 1 0 0 … 0
      0 0 0 0 1 … 0
      :
      0 0 0 0 0 … 1 ]

z = Ax maps the p-dimensional binary feature vector x, e.g., (0, 1, 0, 0, 1, …, 0, 1, 0)^T, to the h-dimensional vector of generalizable features, e.g., (0, 1, …, 0)^T.
Decomposition of w

w^T x = v^T z + u^T x

[Figure: the weight vector w = (0.2, 4.5, 5, -0.3, 3.0, …, 2.1, -0.9, 0.4)^T splits into v = (4.6, 3.2, …, 3.6)^T, the weights for generalizable features (applied to z), plus u = (0.2, 4.5, 0.4, -0.3, -0.2, …, 2.1, -0.9, 0.4)^T, the weights for domain-specific features (applied to x).]
Decomposition of w

w^T x = v^T z + u^T x
      = v^T (Ax) + u^T x
      = (A^T v)^T x + u^T x

⇒ w = A^T v + u
Decomposition of w

w = A^T v + u

[Figure: A^T v, shared by all domains, places the entries of v (4.6, 3.2, …, 3.6) at the generalizable feature positions and zeros elsewhere; u, which is domain-specific, holds the remaining weights (0.2, 4.5, 0.4, -0.3, -0.2, …, 2.1, -0.9, 0.4).]
Framework for generalization

Fix A, optimize:

(v̂, {û^k}) = argmin_{v, {u^k}} [ λ ||v||^2 + λ_s Σ_{k=1}^K ||u^k||^2
              − (1/K) Σ_{k=1}^K (1/N_k) Σ_{i=1}^{N_k} log p(y_i^k | x_i^k; A^T v + u^k) ]

- regularization term, with λ_s >> 1 to penalize domain-specific features
- second term: likelihood of the labeled data from the K source domains
- the weight vector for domain k is w^k = A^T v + u^k
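A runnable sketch of this stage-1 objective, simplified to binary labels and plain gradient descent; the toy data, sizes, and rates are illustrative assumptions, not the paper's setup. With λ_s >> λ, weight for a feature that is predictive in every source domain accumulates in the shared v rather than in the per-domain u^k:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def stage1_fit(domains, A, lam=0.01, lam_s=10.0, lr=0.04, steps=2000):
    """Fix A and minimize (binary-label sketch of the slide's objective):
       lam*||v||^2 + lam_s * sum_k ||u^k||^2
       - (1/K) sum_k (1/N_k) sum_i log p(y_i^k | x_i^k; A^T v + u^k)
    Note: lr must be small relative to lam_s for plain GD to converge."""
    K = len(domains)
    h, p = A.shape
    v, U = np.zeros(h), np.zeros((K, p))
    for _ in range(steps):
        gv = 2 * lam * v
        gU = 2 * lam_s * U
        for k, (X, y) in enumerate(domains):
            err = sigmoid(X @ (A.T @ v + U[k])) - y
            g_w = X.T @ err / (len(y) * K)   # gradient w.r.t. w^k = A^T v + u^k
            gv += A @ g_w
            gU[k] += g_w
        v -= lr * gv
        U -= lr * gU
    return v, U

# toy data: feature 0 predicts the label in every domain; the rest is noise
rng = np.random.default_rng(0)
A = np.zeros((1, 4)); A[0, 0] = 1.0          # pretend feature 0 is generalizable
domains = []
for k in range(3):
    X = rng.normal(size=(300, 4))
    domains.append((X, (X[:, 0] > 0).astype(float)))
v, U = stage1_fit(domains, A)
# v captures the shared signal; the heavily penalized u^k stay near zero
```

The large λ_s is what makes the model "emphasize generalizable features": any weight a domain-specific u^k could carry is cheaper to place in the lightly penalized shared v.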
Framework for adaptation

Fix A, optimize:

(v̂, û^t, {û^k}) = argmin_{v, u^t, {u^k}} [ λ ||v||^2 + λ_s Σ_{k=1}^K ||u^k||^2 + λ_t ||u^t||^2
                   − (1/(K+1)) ( Σ_{k=1}^K (1/N_k) Σ_{i=1}^{N_k} log p(y_i^k | x_i^k; A^T v + u^k)
                                 + (1/m) Σ_{i=1}^m log p(y_i^t | x_i^t; A^T v + u^t) ) ]

- λ_t = 1 << λ_s: to pick up domain-specific features in the target domain
- the last term is the likelihood of the m pseudo-labeled examples from the target domain
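The role of the small λ_t can be sketched in isolation: hold the shared part A^T v fixed (folded into a single vector below) and fit only the target-specific u^t on m pseudo-labeled target examples. Everything here (data, names, rates) is an illustrative assumption; a light penalty lets the model pick up a feature that matters only in the target domain, while a λ_s-sized penalty would suppress it:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_target_u(X, y, w_shared, lam_t=1.0, lr=0.04, steps=2000):
    """Stage-2 sketch (binary labels): with the shared weights fixed, solve
       min_u lam_t*||u||^2 - (1/m) sum_i log p(y_i | x_i; w_shared + u)
    on m pseudo-labeled target examples."""
    u = np.zeros(X.shape[1])
    for _ in range(steps):
        err = sigmoid(X @ (w_shared + u)) - y
        u -= lr * (2 * lam_t * u + X.T @ err / len(y))
    return u

# toy target domain: feature 1 (unused by the shared model) predicts the label
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = (X[:, 1] > 0).astype(float)          # pretend these are pseudo labels
w_shared = np.array([1.0, 0.0, 0.0])     # pretend stage 1 weighted feature 0

u_light = fit_target_u(X, y, w_shared, lam_t=0.1)   # small lam_t: adapts
u_heavy = fit_target_u(X, y, w_shared, lam_t=10.0)  # lam_s-sized: barely moves
```

Comparing the two fits shows why λ_t must be much smaller than λ_s: only the lightly penalized u^t acquires real weight on the target-specific feature.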
How to find A? (1)

Joint optimization:

(Â, v̂, {û^k}) = argmin_{A, v, {u^k}} [ λ ||v||^2 + λ_s Σ_{k=1}^K ||u^k||^2
                 − (1/K) Σ_{k=1}^K (1/N_k) Σ_{i=1}^{N_k} log p(y_i^k | x_i^k; A^T v + u^k) ]

Solved by alternating optimization.
How to find A? (2)

Domain cross validation

- Idea: train on (K − 1) source domains and test on the held-out source domain
- Approximation:
  - w_f^k: weight for feature f learned from domain k
  - w̄_f^k: weight for feature f learned from the other domains
  - rank features by Σ_{k=1}^K w_f^k · w̄_f^k
- See paper for details
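This ranking can be sketched with binary-label logistic fits; the data and all names below are illustrative. Feature 0 plays the "expressed" role (predictive in every domain), feature 1 the "-less" role (predictive only in domain 0), so only feature 0 should score high under Σ_k w_f^k · w̄_f^k:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit(X, y, lam=0.1, lr=0.3, steps=500):
    # plain L2-regularized binary logistic regression, gradient descent
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        err = sigmoid(X @ w) - y
        w -= lr * (2 * lam * w + X.T @ err / len(y))
    return w

def domain_cv_scores(domains):
    """score_f = sum_k w_f^k * wbar_f^k, with w^k fit on domain k alone and
    wbar^k fit on the other domains pooled; features whose usefulness
    transfers across domains score high."""
    K, p = len(domains), domains[0][0].shape[1]
    scores = np.zeros(p)
    for k in range(K):
        Xk, yk = domains[k]
        Xo = np.vstack([X for j, (X, _) in enumerate(domains) if j != k])
        yo = np.concatenate([y for j, (_, y) in enumerate(domains) if j != k])
        scores += fit(Xk, yk) * fit(Xo, yo)
    return scores

# toy domains: feature 0 predicts labels everywhere ("expressed"-like);
# feature 1 only matters in domain 0 ("-less"-like); feature 2 is noise
rng = np.random.default_rng(3)
domains = []
for k in range(3):
    X = rng.normal(size=(300, 3))
    y = (X[:, 0] + (X[:, 1] if k == 0 else 0.0) > 0).astype(float)
    domains.append((X, y))
scores = domain_cv_scores(domains)
```

A feature that is useful only in one domain gets a large weight there but a near-zero weight from the held-out fit, so its product, and hence its score, stays small.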
Intuition for domain cross validation

[Table: feature weights learned from different domains. In domains D_1, D_2, …, D_{K−1}, "expressed" consistently gets large weights (e.g., 2.0, 1.8, 1.5) and "-less" small ones (e.g., 0.1, 0.05). In the fly domain D_K, "-less" ranks high (1.2) while "expressed" does not dominate. So "expressed" transfers across domains, whereas "-less" is fly-specific.]
Experiments

- Data set
  - BioCreative Challenge Task 1B
  - Gene/protein name recognition
  - 3 organisms/domains: fly, mouse, and yeast
- Experiment setup
  - 2 organisms for training, 1 for testing
  - F1 as the performance measure
Experiments: Generalization

Method            F+M→Y   M+Y→F   Y+F→M
BL                0.633   0.129   0.416
DA-1 (joint-opt)  0.627   0.153   0.425
DA-2 (domain CV)  0.654   0.195   0.470

(F: fly, M: mouse, Y: yeast; "F+M→Y" means fly and mouse as source domains, yeast as target.)

- Using generalizable features is effective.
- Domain cross validation is more effective than joint optimization.
Experiments: Adaptation

Method      F+M→Y   M+Y→F   Y+F→M
BL-SSL      0.633   0.241   0.458
DA-2-SSL    0.759   0.305   0.501

(F: fly, M: mouse, Y: yeast)

- Domain-adaptive bootstrapping is more effective than regular bootstrapping.
Experiments: Adaptation

[Figure: performance vs. number of pseudo labels. Domain-adaptive SSL is more effective, especially with a small number of pseudo labels.]
Conclusions and future work

- Two-stage domain adaptation
  - Generalization: outperformed standard supervised learning
  - Adaptation: outperformed standard bootstrapping
- Two ways to find generalizable features
  - Domain cross validation is more effective
- Future work
  - Single source domain?
  - Setting the parameters h and m
References

- S. Ben-David, J. Blitzer, K. Crammer & F. Pereira. Analysis of representations for domain adaptation. NIPS 2007.
- J. Blitzer, R. McDonald & F. Pereira. Domain adaptation with structural correspondence learning. EMNLP 2006.
- H. Daumé III. Frustratingly easy domain adaptation. ACL 2007.
Thank you!