A Two-Stage Approach to Domain
Adaptation for Statistical Classifiers
Jing Jiang & ChengXiang Zhai
Department of Computer Science
University of Illinois at Urbana-Champaign
What is domain adaptation?
Nov 7, 2007
CIKM2007
Example: named entity recognition
persons, locations, organizations, etc.
[Figure: the standard supervised NER setting. Train (labeled): New York Times; test (unlabeled): New York Times. Supervised learning yields a classifier scoring 85.5%.]
Example: named entity recognition
persons, locations, organizations, etc.
[Figure: a non-standard (realistic) NER setting. Labeled New York Times data is not available, so the classifier is trained on labeled Reuters text and tested on unlabeled New York Times text, scoring only 64.1%.]
Domain difference → performance drop

[Figure: ideal setting: an NER classifier trained on NYT and tested on New York Times scores 85.5%. Realistic setting: trained on Reuters and tested on New York Times, it scores 64.1%.]
Another NER example

[Figure: gene name recognition. Ideal setting: a recognizer trained on mouse data and tested on mouse data scores 54.1%. Realistic setting: trained on fly data and tested on mouse data, it scores only 28.1%.]
Other examples

- Spam filtering: public email collection → personal inboxes
- Sentiment analysis of product reviews: digital cameras → cell phones; movies → books

Can we do better than standard supervised learning?

Domain adaptation: to design learning methods that are aware of the training and test domain difference.
How do we solve the problem in general?
Observation 1: domain-specific features

wingless, daughterless, eyeless, apexless, …
- describing phenotype
- in fly gene nomenclature
- the feature "-less" is weighted high

Is the feature still useful for other organisms (e.g., CD38, PABPC5, …)? No!
Observation 2: generalizable features

"…decapentaplegic and wingless are expressed in analogous patterns in each…"

"…that CD38 is expressed by both neurons and glial cells…that PABPC5 is expressed in fetal brain and in a range of adult tissues."

The feature "X be expressed" appears in both domains.
General idea: two-stage approach

[Figure: the feature space is split into generalizable features and domain-specific features, shared between a source domain and a target domain.]
Goal

[Figure: learn a classifier over the feature space that works well on the target domain, not just the source domain.]
Regular classification

[Figure: regular classification trains on the source domain only, over the full feature space.]
Generalization: to emphasize generalizable features in the trained model (Stage 1)

[Figure: the model trained on the source domain leans on the generalizable features shared with the target domain.]
Adaptation: to pick up domain-specific features for the target domain (Stage 2)

[Figure: the model additionally learns the target domain's own domain-specific features.]
Regular semi-supervised learning

[Figure: regular semi-supervised learning uses source-domain labels and unlabeled target data over the full feature space, without distinguishing generalizable from domain-specific features.]
Comparison with related work

- We explicitly model generalizable features.
  - Previous work models them implicitly [Blitzer et al. 2006, Ben-David et al. 2007, Daumé III 2007].
- We do not need labeled target data, but we need multiple source (training) domains.
  - Some work requires labeled target data [Daumé III 2007].
- We have a 2nd stage of adaptation, which uses semi-supervised learning.
  - Previous work does not incorporate semi-supervised learning [Blitzer et al. 2006, Ben-David et al. 2007, Daumé III 2007].
Implementation of the two-stage approach with logistic regression classifiers
Logistic regression classifiers

[Figure: a weight vector w_y = (0.2, 4.5, 5, -0.3, 3.0, …, 2.1, -0.9, 0.4)^T paired with a vector x of p binary features (0, 1, 0, 0, 1, …, 0, 1, 0)^T. Features include "-less" and "X be expressed", e.g., firing on "…and wingless are expressed in…"; the score for class y is w_y^T x.]

p(y | x, w) = exp(w_y^T x) / Σ_{y'} exp(w_{y'}^T x)
Learning a logistic regression classifier

ŵ = argmin_w [ λ ||w||^2 − (1/N) Σ_{i=1}^N log ( exp(w_{y_i}^T x_i) / Σ_{y'} exp(w_{y'}^T x_i) ) ]

- regularization term: penalize large weights to control model complexity
- second term: log likelihood of the training data
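The objective above can be sketched numerically. Below is a minimal sketch for the binary-label special case of the slide's multiclass formula, trained by plain gradient descent; the function name, toy data, and hyperparameters are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def fit_logreg(X, y, lam=0.1, lr=0.5, steps=500):
    """Minimize lam*||w||^2 - (1/N) sum_i log p(y_i | x_i; w)
    for binary labels y in {0, 1} (the sigmoid is the 2-class softmax)."""
    N, p = X.shape
    w = np.zeros(p)
    for _ in range(steps):
        prob = 1.0 / (1.0 + np.exp(-X @ w))   # p(y=1 | x; w)
        # gradient of the penalized negative average log-likelihood
        grad = 2 * lam * w + X.T @ (prob - y) / N
        w -= lr * grad
    return w

# toy data: feature 0 is predictive, feature 1 is noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(float)
w = fit_logreg(X, y)
```

Raising `lam` shrinks ||w||, which is exactly the "penalize large weights" role the slide assigns to the regularization term.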
Generalizable features in weight vectors

[Figure: weight vectors w^1, w^2, …, w^K learned from K source domains D_1, D_2, …, D_K, e.g., w^1 = (0.2, 4.5, 5, -0.3, 3.0, …, 2.1, -0.9, 0.4)^T, w^2 = (3.2, 0.5, 4.5, -0.1, 3.5, …, 0.1, -1.0, -0.2)^T, w^K = (0.1, 0.7, 4.2, 0.1, 3.2, …, 1.7, 0.1, 0.3)^T. A few positions carry large weights in every domain (5 / 4.5 / 4.2 and 3.0 / 3.5 / 3.2): the generalizable features. The remaining positions vary across domains: the domain-specific features.]
We want to decompose w in this way

(0.2, 4.5, 5, -0.3, 3.0, …, 2.1, -0.9, 0.4)^T
  = (0, 0, 4.6, 0, 3.2, …, 0, 0, 0)^T        (h non-zero entries for h generalizable features)
  + (0.2, 4.5, 0.4, -0.3, -0.2, …, 2.1, -0.9, 0.4)^T
Feature selection matrix A

A is an h × p matrix that selects the h generalizable features; each row contains a single 1 picking out one feature:

A = [ 0 0 1 0 0 … 0
      0 0 0 0 1 … 0
      :
      0 0 0 0 0 … 1 ]

z = Ax maps the p-dimensional binary feature vector x, e.g., (0, 1, 0, 0, 1, …, 0, 1, 0)^T, to the h-dimensional vector of generalizable features, e.g., (0, 1, …, 0)^T.
Decomposition of w

w^T x = v^T z + u^T x

[Figure: the weight vector w = (0.2, 4.5, 5, -0.3, 3.0, …, 2.1, -0.9, 0.4)^T splits into v = (4.6, 3.2, …, 3.6)^T, the weights for generalizable features (applied to z), plus u = (0.2, 4.5, 0.4, -0.3, -0.2, …, 2.1, -0.9, 0.4)^T, the weights for domain-specific features (applied to x).]
Decomposition of w

w^T x = v^T z + u^T x
      = v^T (Ax) + u^T x
      = (A^T v)^T x + u^T x

⇒ w = A^T v + u
Decomposition of w

w = A^T v + u

[Figure: A^T v, shared by all domains, places the entries of v (4.6, 3.2, …, 3.6) at the generalizable feature positions and zeros elsewhere; u, which is domain-specific, holds the remaining weights (0.2, 4.5, 0.4, -0.3, -0.2, …, 2.1, -0.9, 0.4).]
Framework for generalization

Fix A, optimize:

(v̂, {û^k}) = argmin_{v, {u^k}} [ λ ||v||^2 + λ_s Σ_{k=1}^K ||u^k||^2
              − (1/K) Σ_{k=1}^K (1/N_k) Σ_{i=1}^{N_k} log p(y_i^k | x_i^k; A^T v + u^k) ]

- regularization term, with λ_s >> 1 to penalize domain-specific features
- second term: likelihood of the labeled data from the K source domains
- the weight vector for domain k is w^k = A^T v + u^k
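A runnable sketch of this stage-1 objective, simplified to binary labels and plain gradient descent; the toy data, sizes, and rates are illustrative assumptions, not the paper's setup. With λ_s >> λ, weight for a feature that is predictive in every source domain accumulates in the shared v rather than in the per-domain u^k:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def stage1_fit(domains, A, lam=0.01, lam_s=10.0, lr=0.04, steps=2000):
    """Fix A and minimize (binary-label sketch of the slide's objective):
       lam*||v||^2 + lam_s * sum_k ||u^k||^2
       - (1/K) sum_k (1/N_k) sum_i log p(y_i^k | x_i^k; A^T v + u^k)
    Note: lr must be small relative to lam_s for plain GD to converge."""
    K = len(domains)
    h, p = A.shape
    v, U = np.zeros(h), np.zeros((K, p))
    for _ in range(steps):
        gv = 2 * lam * v
        gU = 2 * lam_s * U
        for k, (X, y) in enumerate(domains):
            err = sigmoid(X @ (A.T @ v + U[k])) - y
            g_w = X.T @ err / (len(y) * K)   # gradient w.r.t. w^k = A^T v + u^k
            gv += A @ g_w
            gU[k] += g_w
        v -= lr * gv
        U -= lr * gU
    return v, U

# toy data: feature 0 predicts the label in every domain; the rest is noise
rng = np.random.default_rng(0)
A = np.zeros((1, 4)); A[0, 0] = 1.0          # pretend feature 0 is generalizable
domains = []
for k in range(3):
    X = rng.normal(size=(300, 4))
    domains.append((X, (X[:, 0] > 0).astype(float)))
v, U = stage1_fit(domains, A)
# v captures the shared signal; the heavily penalized u^k stay near zero
```

The large λ_s is what makes the model "emphasize generalizable features": any weight a domain-specific u^k could carry is cheaper to place in the lightly penalized shared v.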
Framework for adaptation

Fix A, optimize:

(v̂, û^t, {û^k}) = argmin_{v, u^t, {u^k}} [ λ ||v||^2 + λ_s Σ_{k=1}^K ||u^k||^2 + λ_t ||u^t||^2
                   − (1/(K+1)) ( Σ_{k=1}^K (1/N_k) Σ_{i=1}^{N_k} log p(y_i^k | x_i^k; A^T v + u^k)
                                 + (1/m) Σ_{i=1}^m log p(y_i^t | x_i^t; A^T v + u^t) ) ]

- λ_t = 1 << λ_s: to pick up domain-specific features in the target domain
- the last term is the likelihood of the m pseudo-labeled examples from the target domain
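The role of the small λ_t can be sketched in isolation: hold the shared part A^T v fixed (folded into a single vector below) and fit only the target-specific u^t on m pseudo-labeled target examples. Everything here (data, names, rates) is an illustrative assumption; a light penalty lets the model pick up a feature that matters only in the target domain, while a λ_s-sized penalty would suppress it:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_target_u(X, y, w_shared, lam_t=1.0, lr=0.04, steps=2000):
    """Stage-2 sketch (binary labels): with the shared weights fixed, solve
       min_u lam_t*||u||^2 - (1/m) sum_i log p(y_i | x_i; w_shared + u)
    on m pseudo-labeled target examples."""
    u = np.zeros(X.shape[1])
    for _ in range(steps):
        err = sigmoid(X @ (w_shared + u)) - y
        u -= lr * (2 * lam_t * u + X.T @ err / len(y))
    return u

# toy target domain: feature 1 (unused by the shared model) predicts the label
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = (X[:, 1] > 0).astype(float)          # pretend these are pseudo labels
w_shared = np.array([1.0, 0.0, 0.0])     # pretend stage 1 weighted feature 0

u_light = fit_target_u(X, y, w_shared, lam_t=0.1)   # small lam_t: adapts
u_heavy = fit_target_u(X, y, w_shared, lam_t=10.0)  # lam_s-sized: barely moves
```

Comparing the two fits shows why λ_t must be much smaller than λ_s: only the lightly penalized u^t acquires real weight on the target-specific feature.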
How to find A? (1)

Joint optimization:

(Â, v̂, {û^k}) = argmin_{A, v, {u^k}} [ λ ||v||^2 + λ_s Σ_{k=1}^K ||u^k||^2
                 − (1/K) Σ_{k=1}^K (1/N_k) Σ_{i=1}^{N_k} log p(y_i^k | x_i^k; A^T v + u^k) ]

Solved by alternating optimization.
How to find A? (2)

Domain cross validation

- Idea: train on (K − 1) source domains and test on the held-out source domain
- Approximation:
  - w_f^k: weight for feature f learned from domain k
  - w̄_f^k: weight for feature f learned from the other domains
  - rank features by Σ_{k=1}^K w_f^k · w̄_f^k
- See paper for details
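This ranking can be sketched with binary-label logistic fits; the data and all names below are illustrative. Feature 0 plays the "expressed" role (predictive in every domain), feature 1 the "-less" role (predictive only in domain 0), so only feature 0 should score high under Σ_k w_f^k · w̄_f^k:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit(X, y, lam=0.1, lr=0.3, steps=500):
    # plain L2-regularized binary logistic regression, gradient descent
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        err = sigmoid(X @ w) - y
        w -= lr * (2 * lam * w + X.T @ err / len(y))
    return w

def domain_cv_scores(domains):
    """score_f = sum_k w_f^k * wbar_f^k, with w^k fit on domain k alone and
    wbar^k fit on the other domains pooled; features whose usefulness
    transfers across domains score high."""
    K, p = len(domains), domains[0][0].shape[1]
    scores = np.zeros(p)
    for k in range(K):
        Xk, yk = domains[k]
        Xo = np.vstack([X for j, (X, _) in enumerate(domains) if j != k])
        yo = np.concatenate([y for j, (_, y) in enumerate(domains) if j != k])
        scores += fit(Xk, yk) * fit(Xo, yo)
    return scores

# toy domains: feature 0 predicts labels everywhere ("expressed"-like);
# feature 1 only matters in domain 0 ("-less"-like); feature 2 is noise
rng = np.random.default_rng(3)
domains = []
for k in range(3):
    X = rng.normal(size=(300, 3))
    y = (X[:, 0] + (X[:, 1] if k == 0 else 0.0) > 0).astype(float)
    domains.append((X, y))
scores = domain_cv_scores(domains)
```

A feature that is useful only in one domain gets a large weight there but a near-zero weight from the held-out fit, so its product, and hence its score, stays small.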
Intuition for domain cross validation

[Table: feature weights learned from different domains. In domains D_1, D_2, …, D_{K−1}, "expressed" consistently gets large weights (e.g., 2.0, 1.8, 1.5) and "-less" small ones (e.g., 0.1, 0.05). In the fly domain D_K, "-less" ranks high (1.2) while "expressed" does not dominate. So "expressed" transfers across domains, whereas "-less" is fly-specific.]
Experiments

- Data set
  - BioCreative Challenge Task 1B
  - Gene/protein name recognition
  - 3 organisms/domains: fly, mouse, and yeast
- Experiment setup
  - 2 organisms for training, 1 for testing
  - F1 as the performance measure
Experiments: Generalization

Method            F+M→Y   M+Y→F   Y+F→M
BL                0.633   0.129   0.416
DA-1 (joint-opt)  0.627   0.153   0.425
DA-2 (domain CV)  0.654   0.195   0.470

(F: fly, M: mouse, Y: yeast; "F+M→Y" means fly and mouse as source domains, yeast as target.)

- Using generalizable features is effective.
- Domain cross validation is more effective than joint optimization.
Experiments: Adaptation

Method      F+M→Y   M+Y→F   Y+F→M
BL-SSL      0.633   0.241   0.458
DA-2-SSL    0.759   0.305   0.501

(F: fly, M: mouse, Y: yeast)

- Domain-adaptive bootstrapping is more effective than regular bootstrapping.
Experiments: Adaptation

[Figure: performance vs. number of pseudo labels. Domain-adaptive SSL is more effective, especially with a small number of pseudo labels.]
Conclusions and future work

- Two-stage domain adaptation
  - Generalization: outperformed standard supervised learning
  - Adaptation: outperformed standard bootstrapping
- Two ways to find generalizable features
  - Domain cross validation is more effective
- Future work
  - Single source domain?
  - Setting the parameters h and m
References

- S. Ben-David, J. Blitzer, K. Crammer & F. Pereira. Analysis of representations for domain adaptation. NIPS 2007.
- J. Blitzer, R. McDonald & F. Pereira. Domain adaptation with structural correspondence learning. EMNLP 2006.
- H. Daumé III. Frustratingly easy domain adaptation. ACL 2007.
Thank you!