Heterogeneous Cross Domain Ranking in Latent Space
Bo Wang (1), Jie Tang (2), Wei Fan (3), Songcan Chen (1), Zi Yang (2), Yanzhu Liu (4)
(1) Nanjing University of Aeronautics and Astronautics
(2) Tsinghua University
(3) IBM T.J. Watson Research Center, USA
(4) Peking University

Introduction
• The web is becoming more and more heterogeneous
• Ranking is the fundamental problem over the web
  – unsupervised vs. supervised
  – homogeneous vs. heterogeneous

Motivation
Main Challenges
1) How to capture the correlation between heterogeneous objects?
2) How to preserve the preference orders between objects across heterogeneous domains?
[Figure: heterogeneous academic network for the query “data mining”. Authors (e.g. Dr. Tang, Prof. Wang, Prof. Li, P. Yu, Limin), papers (e.g. “Data Mining: Concepts and Techniques”, “Principles of Data Mining”), and conferences (KDD, ICDM, SDM, PAKDD, ISWC, IJCAI, WWW) are linked by write, publish, cite, coauthor, and PC member relations; question marks denote the unknown rankings.]
[Figure label: Heterogeneous cross domain ranking]

Outline
• Related Work
• Heterogeneous cross domain ranking
• Experiments
• Conclusion

Related Work
• Learning to rank
  – Supervised: [Burges, 05] [Herbrich, 00] [Xu and Li, 07] [Yue, 07]
  – Semi-supervised: [Duh, 08] [Amini, 08] [Hoi and Jin, 08]
  – Ranking adaptation: [Chen, 08]
• Transfer learning
  – Instance-based: [Dai, 07] [Gao, 08]
  – Feature-based: [Jebara, 04] [Argyriou, 06] [Raina, 07] [Lee, 07] [Blitzer, 06] [Blitzer, 07]
  – Model-based: [Bonilla, 08]

Outline
• Related Work
• Heterogeneous cross domain ranking
  – Basic idea
  – Proposed algorithm: HCDRank
• Experiments
• Conclusion

Basic idea
[Figure: for the query “data mining”, conferences in the source domain (KDD, PKDD, SDM, PAKDD, ADMA) and experts in the target domain (Jiawei Han, Jie Tang, Bo Wang, Alice, Bob, Jerry, Tom) are mapped into a shared latent space; mis-ranked pairs are marked.]

The Proposed Algorithm: HCDRank
• How to optimize?
• How to define?
• Non-convex
• Dual problem

Algorithm steps and complexity
• Alternately optimize matrices M and D: O(2T * sN log N)
• Construct transformation matrix: O(d^3)
• Learning in latent space: O(sN log N)
• Total: O((2T+1) * sN log N + d^3)

Outline
• Related Work
• Heterogeneous cross domain ranking
• Experiments
  – Ranking on Homogeneous data
  – Ranking on Heterogeneous data
  – Ranking on Heterogeneous tasks
• Conclusion

Experiments
• Data sets
  – Homogeneous data set: LETOR_TR
    • 50/75/106 queries with 44/44/25 features for TREC2003_TR, TREC2004_TR, and OHSUMED_TR
  – Heterogeneous academic data set: ArnetMiner.org
    • 14,134 authors, 10,716 papers, and 1,434 conferences
  – Heterogeneous task data set:
    • 9 queries, 900 experts, 450 best supervisor candidates
• Evaluation measures
  – MAP
  – NDCG

Ranking on Homogeneous data
• LETOR_TR
  – We made a slight revision of LETOR 2.0 to fit the cross-domain ranking scenario
  – Three sub-datasets: TREC2003_TR, TREC2004_TR, and OHSUMED_TR
• Baselines

[Result plots: TREC2003_TR, TREC2004_TR, OHSUMED_TR. Panel annotations as extracted, in order of appearance: Cosine Similarity=0.01, Cosine Similarity=0.23, Cosine Similarity=0.18]

Training Time
[Figure]

Ranking on Heterogeneous data
• ArnetMiner data set (www.arnetminer.org): 14,134 authors, 10,716 papers, and 1,434 conferences
• Training and test data set:
  – 44 most frequently queried keywords from the log file
    • Author collection: Libra, Rexa, and ArnetMiner
    • Conference collection: Libra, ArnetMiner
• Ground truth:
  – Conference: online resources
  – Expert: two faculty members and five graduate students from CS provided human judgments for expert ranking

Feature Definition
Features   Description
L1-L10     Low-level language model features
H1-H3      High-level language model features
S1         How many years the conference has been held
S2         The sum of citation numbers of the conference during the recent 5 years
S3         The sum of citation numbers of the conference during the recent 10 years
S4         How many years have passed since his/her first paper
S5         The sum of citation numbers of all the publications of one expert
S6         How many papers have been cited more than 5 times
S7         How many papers have been cited more than 10 times

Expert Finding Results
[Figure]

Feature Correlation Analysis
[Figure]

Ranking on Heterogeneous tasks
• Expert finding task vs. best supervisor finding task
• Training and test data set:
  – Expert finding task: ranking lists from ArnetMiner or annotated lists
  – Best supervisor finding task: 9 most frequent queries from the log file of ArnetMiner
    • For each query, we collected 50 best supervisor candidates and sent emails to 100 researchers for annotation
• Ground truth:
  – Collection of feedback about the candidates (yes / no / not sure)

Feature Definition
Features          Description
L1-L10            Low-level language model features
H1-H3             High-level language model features
B1                The year he/she published his/her first paper
B2                The number of papers of an expert
B3                The number of papers in the recent 2 years
B4                The number of papers in the recent 5 years
B5                The number of citations of all his/her papers
B6                The number of papers cited more than 5 times
B7                The number of papers cited more than 10 times
B8                PageRank score
SumCo1-SumCo8     The sum of coauthors’ B1-B8 scores
AvgCo1-AvgCo8     The average of coauthors’ B1-B8 scores
SumStu1-SumStu8   The sum of his/her advisees’ B1-B8 scores
AvgStu1-AvgStu8   The average of his/her advisees’ B1-B8 scores

Best supervisor finding results
[Figure]

Experimental Results
[Figure]

Outline
• Related Work
• Heterogeneous cross domain ranking
• Experiments
• Conclusion

Conclusion
• We formally define the problem of heterogeneous cross domain ranking and propose a general framework
• We provide a preferred solution under the regularized framework by simultaneously minimizing two ranking loss functions in two domains
• The experimental results on three different genres of data sets verify the effectiveness of the proposed algorithm

Data Set

Ranking on Heterogeneous data
• A subset of ArnetMiner (www.arnetminer.org): 14,134 authors, 10,716 papers, and 1,434 conferences
• 44 most frequently queried keywords from the log file
• Author collection:
  – For each query, we gathered the top 30 experts from Libra, Rexa, and ArnetMiner
• Conference collection:
  – For each query, we gathered the top 30 conferences from Libra and ArnetMiner
• Ground truth:
  – Three online resources
    • http://www.cs.ualberta.ca/~zaiane/htmldocs/ConfRanking.html
    • http://www3.ntu.edu.sg/home/ASSourav/crank.htm
    • http://www.cs-conference-ranking.org/conferencerankings/alltopics.html
  – Two faculty members and five graduate students from CS provided human judgments

Ranking on Heterogeneous tasks
• For the expert finding task, we can use results from ArnetMiner or annotated lists as training data
• For the best supervisor task, the 9 most frequent queries from the log file of ArnetMiner are used
  – For each query, we sent emails to 100 researchers
    • Top 50 researchers by ArnetMiner
    • Top 50 researchers who started publishing papers only in recent years (91.6% of them are currently graduates or postdoctoral researchers)
  – Collection of feedback
    • 50 best supervisor candidates (yes / no / not sure)
    • Also add other candidates
  – Ground truth
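
Note on the evaluation measures
The Experiments slides name MAP and NDCG as the evaluation measures but do not spell them out. The sketch below shows one common way to compute them for ranked lists with relevance labels. It is an illustration, not the evaluation code used for the reported results. The function names are hypothetical. The gain 2^g - 1 and the log2 discount are one widely used NDCG convention, assumed here. Average precision is normalized by the number of relevant items found in the list, which assumes every relevant item appears in it.

```python
import math
from typing import Sequence


def average_precision(relevance: Sequence[int]) -> float:
    """AP for one ranked list; relevance[i] is 1 if the item at rank i+1 is relevant.

    Assumption: all relevant items appear somewhere in the list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(lists: Sequence[Sequence[int]]) -> float:
    """MAP: mean of AP over all queries."""
    return sum(average_precision(r) for r in lists) / len(lists)


def dcg_at_k(gains: Sequence[int], k: int) -> float:
    """Discounted cumulative gain over the top k positions (assumed convention)."""
    return sum(
        (2 ** g - 1) / math.log2(rank + 1)
        for rank, g in enumerate(gains[:k], start=1)
    )


def ndcg_at_k(gains: Sequence[int], k: int) -> float:
    """NDCG@k: DCG of the given order divided by DCG of the ideal order."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Two hypothetical queries with binary relevance labels in ranked order.
    print(mean_average_precision([[1, 0, 1, 0], [0, 1]]))
    # Graded relevance (0, 1, 2) for one hypothetical ranked list.
    print(ndcg_at_k([0, 1, 2], k=3))
```

A perfectly ordered list gives NDCG@k = 1, and each mis-ranked pair near the top of the list lowers the score more than one further down, which is why both measures are sensitive to the preference orders discussed in the slides.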