IE for Low-resource Languages
Heng Ji

Outline
• Name Translation Mining
• Bi-lingual Dictionary Induction
• Cross-lingual Projection

Why Name Translation
Our Goal: Break the Language Barrier
Online Language Populations (Total: 801.4 Million, Sept 2004)

Standard MT is Simply not Enough
• Source Text
  o 俄塔社援引紧急情况部莫斯科市总局新闻处处长博贝列夫 (Bo Bei Lie Fu)的话...
• Reference Translation
  o The Russian news agency Tass, quoting Director Bobylev of the news office of the Moscow city headquarters of the Emergency Situation Department...
• Various MT System Translations
  o Russia 's Tass news agency quoted the ministry for emergency situations of the Moscow city , Director of Information Services , German Gref...
  o Itar-Tass quoted the Emergency Situations Ministry in Moscow City Administration Director Bo , yakovlev...
  o Russia 's Tass news agency of the Ministry of Emergency Situations Moscow city administration of Addis Ababa , Director of Information Services...
  o Russian news agency quoted the ministry of emergency situations in Moscow city administration of the Director of Information Services , A. Kozyrev...
  o Itar-Tass quoted the Emergencies Ministry in Moscow , the Director of information in 1988 lev...

Name Translation Maze
Chinese to English examples, grouped by name type (Phonetic Name, Semantic Name, Semantic+Phonetic Name):
• Semantic Name
  o 花旗银行 ("Colorful-flag Bank"): Citibank
  o 解放之虎: Liberation Tiger
  o 长江 ("Long River"): Yangtze River
• Phonetic Name (need advanced transliteration model)
  o 尤申科 ("You shen ke"): Yushchenko
  o 可伶可俐 ("Ke Ling Ke Li"): Clean & Clear
  o 欧佩尔吧: Opal Bar
• Semantic+Phonetic Name
  o 华尔街 ("Hua Er Street"): Wall Street
  o 清华大学学报 ("The Journal of Tsinghua University"): Tsinghua Da Xue Xue Bao
  o 尤干斯克石油天然气公司: Yuganskneftegaz Oil and Gas Company
But not only these…

Name Translation Maze (continued)
• Context-Dependent Name (use global context)
  o 红军: Red Army (in China) or Liverpool Football Club (England)
  o 亚西尔·阿拉法特: Yasser Arafat (PLO Chairman) or Yasir Arafat (Cricketer)
  o 圣地亚哥市: Santiago City (in Chile) or San Diego City (in CA)
• No-Clue Name
  o 潘基文: Pan Jiwen (Chinese) or Ban Ki-Moon (Korean Foreign Minister)
  o 林一: Lin Yi (Chinese) or Hayashi Hajime (Japanese Writer)

Motivation
• Traditional methods use supervised transliteration and LM re-scoring
• To discover name pairs from comparable corpora
  o Comparable corpora cover similar topics but are not, in general, translations of each other
  o Naturally available; e.g. many news agencies release multi-lingual news articles on the same day
• Limitations of previous approaches
  o Require a supervised name transliteration module as baseline, and exploit the distributional evidence from comparable corpora only for re-scoring
  o Limited to names which are phonetically transliterated, while many organization names are rendered semantically
  o Cannot disambiguate names according to context
• Toward a transliteration-free approach: Constructing Information Networks
  o There are no document-level or sentence-level alignments, but names, relations and events in one language tend to co-occur with their counterparts in the other
  o Information extraction (IE) techniques are currently available for some non-English languages

Bilingual Information Networks (Ji, 2009)
Figure: a Chinese and an English information network over the same facts, linked by relations (Leader, Located, Capital, Birth-Place, Sibling) and an event (Arrest/2001-06-25). Aligned node pairs:
  1. 国家情报局 / National Intelligence Service
  2. 蒙特西诺斯 / Montesinos
  3. 卡西俄 / Callao
  4. 秘鲁 / Peru
Other nodes shown: 库瓦斯, 利马, 藤森 (Chinese side); Arequipa, Jorge Chavez Intl. Airport (English side)

(Lin et al., 2011)
• 3000 languages are endangered; it is important to provide cross-lingual access to a range of languages
• Goal: Mine name translation pairs from Wikipedia Infoboxes

Contextual Cue (Klementiev et al., 2011)

Temporal Cue

Orthographic and Phonetic Cue
• Transliteration based match
• Getting the phonetic representation of English and Chinese candidates
• For example, “father” would be transformed to “faDR”, “港” would be transformed to “gang3”.
• Splitting the phonetic representations into basic phoneme units.
  o Note: There are some questions about the original paper.
• Building a phoneme pronunciation similarity (PPS) table
• Treating the problem as a weighted longest common subsequence problem
• Finding the optimal longest common subsequence
• Normalizing the score of the optimal solution by dividing by the maximum length of the two sequences
• Using the normalized score as the phonetic similarity score of the two representations

Advanced Person Name Transliteration
• Averaged Perceptron Name Transliteration Model
  o Selects transcription from English name lists based on edit distance
  o Generates transcriptions if name not on the list
• Char-based MT Name Transliteration Model
  o No reordering model due to monotonicity of the task
  o Tune model scaling factors for maximum transliteration accuracy
  o Feed in both tokens and pinyins, generate N-best transliterations
• Combination of the two achieved 3.6% and 6% higher accuracy than each alone
(Freitag and Khadivi, 2007)

Global Name Selection with English Resources
Find the correct name translation by comparing contexts with:
• Large English corpus (Kalmar and Blume, 2007)
  o Multi-token names
  o Name frequency in the English corpus
  o Document context
    • Person titles (within-document co-reference resolution)
    • Co-occurring entities
    • Document date
    • Document topic
  o Edit distance models are not used for Asian target names (regardless of whether Mandarin, Cantonese, Korean, Japanese, etc.); the best candidate is selected based on context
• Large English name list and Gigaword
  o Re-score the N-best transliteration hypotheses
  o Build a large character-level LM
  o 16.7% relative error reduction over name transliteration only

Example of Using Document Context
Figure: a Chinese news article shown next to scored N-best transliteration hypotheses and a matching English context. The article opens: 据国际文传电讯社和伊塔塔斯社报道,格里戈里·帕斯科的律师詹利·雷兹尼克向俄最高法院提出上诉。 (According to Interfax and ITAR-TASS, Grigory Pasko's lawyer 詹利·雷兹尼克, "zhan li lei zi ni ke", appealed to the Russian Supreme Court.)
• N-best hypotheses for "zhan li": 24.11 amri, 23.09 obry, 22.57 zeri, 20.82 henri, 20.00 henry, 19.82 genri, 19.67 djari, 19.57 jafri
• N-best hypotheses for "lei zi ni ke": 28.31 reznik, 26.40 rezek, 25.24 linic, 23.95 riziq, 23.25 ryshich, 22.66 lysenko, 22.58 ryzhenko, 22.19 linnik
• English context: "Genri Reznik, Goldovsky's lawyer, asked Russian Supreme Court Chairman Vyacheslav Lebedev…"
• Shared context cues: Lawyer (律师), Grigory Pasko (格里戈里·帕斯科), Vyacheslav Lebedev (维亚切斯拉夫·列别捷夫)
• Candidates after context matching: Henry Reznik, Genri Reznik
• >90% accurate!

Mining from Code-switch Webpages (Lin et al., 2008)

Bilingual Information on the Web
• Searching the parallel data on the web (Resnik 2003)
• Searching the comparable corpus on the web (Fung 1998)
• Anchor texts pointing to the same page (Lu 2004)
Mining Key Phrase Translations from Web Corpora

Bilingual Information on the Web
• Limited bilingual resources available as parallel/comparable data on the web
  o STRAND: 3,500 English-Chinese document pairs and fewer than 2,500 for English-French. (Resnik 2003)
  o Comparable corpora: from 10 years of Xinhua Chinese and English stories (2GB), only 110K sentence pairs (44MB) are found as “parallel”. (Zhao & Vogel 2002)
  o Anchor text mining: from 2M web pages, 2.8MB Chinese text and 3.1MB English text found as potential translations.
• More bilingual information is on the web in the form of mixed-language webpages
  o Parallel text is not needed in most cases
  o The Chinese authors usually include the original English for the key phrases
    • For consistency
    • To give the readers more information
    • If they are not sure about the translation in Chinese

Web pages of mixed languages

Mining translations from mixed-lang. pages
• Crawling the Chinese web pages that contain English text. (Zhang and Vines, SIGIR 2004)
  o Use Google to locate the webpages containing the Chinese terms
  o English expressions that occur next to the Chinese terms are considered as their translations
  o Crawled 2GB web data, 1,168 distinct English terms found, 61% are correct translations
• Searching the Chinese terms among the English pages. (Cheng et al. SIGIR 2004)
  o Use Google to retrieve “English” pages containing the Chinese terms
  o Extract translations from the snippets
  o LiveTrans system

Pros and cons of these approaches
Web Resources                                  Crawling?   Available   Difficulty in Extraction?
Searching for parallel data                    Yes         Limited     Hard
Searching for comparable data                  Yes         Moderate    Harder
Mining Anchor Text                             “Yes”       Limited     Easy
Extracting translation from mixed-lang page    Yes         Abundant    Moderate
Search in English pages                        No          Small       Moderate

Cross-lingual Projection
1. Training Data Projection
2. Test Data Projection
3. Model (e.g., pivot features) projection

Training Data Projection
1. Find a large, parallel bilingual corpus
  o English/German (E/G) part of EUROPARL (25M words)
2. Assign semantic roles on the English side
  o Train automatic tagger on English data
3. Project semantics over to a low-resource incident language (IL)
  o Step 1: Find semantic equivalences via word alignment
  o Step 2: Project frame
  o Step 3: Project roles
Result: Large IL annotated corpus

Projection: Example
• Arriving: Peter comes home
• Arriving: Peter kommt nach Hause

Three assumptions to make this work
• Assumption 1: Semantic representation is parallel
• Assumption 2: There is always parallel lexical material that is semantically equivalent
• Assumption 3: Word alignment provides semantic equivalence

Word Alignment as Semantic Equivalence
• Current word alignment models use co-occurrence to determine alignment
  o But co-occurrence != semantic equivalence
  o Example: decide, insist vs. entscheiden, Entscheidung treffen, bestehen darauf
• Problems: Phrasal verbs, Idioms, Support Verbs (Funktionsverbgefuege), Noise proper
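
The three projection steps listed under Training Data Projection (find equivalences via word alignment, project the frame, project the roles) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of any system cited in these slides: the helper names `project_span` and `project_annotation`, the role labels `Theme` and `Goal`, and the hand-written alignment are all hypothetical. A real pipeline would take alignments from a word aligner and annotations from a trained tagger.

```python
from typing import Dict, List, Optional, Tuple

Span = Tuple[int, int]  # half-open token span [start, end)


def project_span(span: Span, alignment: Dict[int, List[int]]) -> Optional[Span]:
    """Step 1: map a source span to the smallest target span covering its aligned tokens."""
    targets = [t for s in range(span[0], span[1]) for t in alignment.get(s, [])]
    if not targets:
        return None  # nothing aligned, so this annotation cannot be projected
    return (min(targets), max(targets) + 1)


def project_annotation(frame, roles, alignment):
    """Steps 2 and 3: project the frame-evoking span, then each role span."""
    frame_name, predicate_span = frame
    projected_predicate = project_span(predicate_span, alignment)
    if projected_predicate is None:
        return None  # no counterpart for the predicate: give up on this sentence
    projected_roles = {}
    for role, span in roles.items():
        projected = project_span(span, alignment)
        if projected is not None:  # silently drop roles with no aligned material
            projected_roles[role] = projected
    return (frame_name, projected_predicate), projected_roles


# Example from the slides; the alignment and role labels are assumed for illustration.
english = ["Peter", "comes", "home"]
german = ["Peter", "kommt", "nach", "Hause"]
alignment = {0: [0], 1: [1], 2: [2, 3]}  # English index -> German indices
frame = ("Arriving", (1, 2))             # evoked by "comes"
roles = {"Theme": (0, 1), "Goal": (2, 3)}

projected_frame, projected_roles = project_annotation(frame, roles, alignment)
print(projected_frame[0], german[slice(*projected_frame[1])])
for role, (start, end) in projected_roles.items():
    print(role, german[start:end])
```

The sketch depends entirely on Assumption 3: if the aligner links a predicate to only part of a multi-word expression, or to nothing at all, `project_span` returns a partial span or `None`, and the projected annotation is wrong or missing.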