Discovering Semantic Relations (for Proteins and Digital Devices)
Barbara Rosario, Intel Research

Outline
• Semantic relations
  – Protein-protein interactions (joint work with Marti Hearst)
  – Digital devices (joint work with Bill Schilit, Google, and Oksana Yakhnenko, Iowa State University)
• Models to do text classification and information extraction
• Two new proposals for getting labeled data

Text mining
• Text mining is the discovery by computers of new, previously unknown information, via automatic extraction of information from text
• Example: a (human) analysis of titles of articles in the biomedical literature suggested a role of magnesium deficiency in migraines [Swanson]

Text mining
• Text:
  – Stress is associated with migraines
  – Stress can lead to loss of magnesium
  – Calcium channel blockers prevent some migraines
  – Magnesium is a natural calcium channel blocker
• Step 1: Extract semantic entities from text
  – Stress, Magnesium, Migraine, Calcium channel blockers
• Step 2: Classify relations between entities
  – Stress -> associated with -> Migraine
  – Stress -> lead to loss of -> Magnesium
  – Calcium channel blockers -> prevent -> Migraine
  – Magnesium -> subtype-of (is a) -> Calcium channel blockers

Text mining (cont.)
• Step 3: Do reasoning over the extracted entities and relations: find new correlations

Relations
• The identification and classification of semantic relations is crucial for the semantic analysis of text
• Two case studies:
  – Protein-protein interactions
  – Relations for digital devices

Protein-protein interactions
• Applications throughout biology
• There are several protein-protein interaction databases (BIND, MINT, ...), all manually curated
• Most biomedical research and new discoveries are available electronically, but only in free-text format
• Automatic mechanisms are needed to convert text into more structured forms
• Supervised systems require manually labeled training data, while purely unsupervised systems have yet to prove effective for these tasks.
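The database-driven labeling idea described on the following slides (extract the sentences that mention both proteins of a database row, and tag them with the row's interaction type) could be sketched roughly as follows. The function name, the toy database row, and the sentences are hypothetical illustrations, not the actual pipeline:

```python
import re

def label_sentences(sentences, protein1, protein2, interaction):
    """Distant labeling: keep every sentence mentioning both proteins,
    tagged with the interaction type recorded in the database."""
    labeled = []
    for sent in sentences:
        if re.search(re.escape(protein1), sent, re.IGNORECASE) and \
           re.search(re.escape(protein2), sent, re.IGNORECASE):
            labeled.append((sent, interaction))
    return labeled

# Hypothetical database row and paper sentences
row = {"protein1": "Tat", "protein2": "CXCR4", "interaction": "inhibits"}
sentences = [
    "Tat inhibits CXCR4-mediated entry.",
    "The assay used standard buffers.",
]
data = label_sentences(sentences, row["protein1"], row["protein2"],
                       row["interaction"])
print(data)  # [('Tat inhibits CXCR4-mediated entry.', 'inhibits')]
```

The same matching step would be applied to citation sentences, as the slides describe, with no change to the labeling logic.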
• We propose using resources developed in the biomedical domain to address the problem of gathering labeled data for the task of classifying interactions between proteins

HIV-1 protein interaction database
• "The goal of this project is to provide scientists a summary of all known interactions of HIV-1 proteins with host cell proteins, other HIV-1 proteins, or proteins from disease organisms associated with HIV/AIDS"
• There are 2224 interacting protein pairs and 51 types of interaction
• http://www.ncbi.nlm.nih.gov/RefSeq/HIVInteractions/
• Sample entries:

  Protein 1   Protein 2   Interaction    Paper ID
  10000       155871      activates      11156964
  10015       155030      binds          14519844, ...
  1017        155871      induces        9223324
  10197       155348      degraded by    10893419

Protein-protein interactions
• Idea: use the database to "label" data
  – Take a database row, e.g. Protein 1 = 10000, Protein 2 = 155871, Interaction = activates, Paper ID = 11156964
  – Extract from the paper all the sentences containing Protein 1 and Protein 2
  – Label them with the interaction given in the database (here, "activates")
• Also use citations
  – Find all the papers (e.g., IDs 9918876 and 9971769) that cite the papers in the database
  – From those papers, extract the citation sentences; from these, extract the sentences containing Protein 1 and Protein 2
  – Label them with the interaction given in the database

Protein-protein interactions: task
• Given the sentences extracted from a paper and/or its citation sentences:
  – Determine the interaction given in the HIV-1 database for that paper (interaction classification)
  – Identify the proteins involved in the interaction (protein name tagging, or role extraction)
• Sentence counts per interaction type:

  Interaction        Papers   Citances
  Degrades           60       63
  Synergizes with    86       101
  Stimulates         103      64
  Binds              98       324
  Inactivates        68       92
  Interacts with     62       100
  Requires           96       297
  Upregulates        119      98
  Inhibits           78       84
  Suppresses         51       99

The models
• (1) Naïve Bayes (NB) for interaction classification
• (2) Dynamic graphical model (DM) for protein interaction classification (and role extraction)

Dynamic graphical models
• Graphical models composed of repeated segments
• Example: HMMs (Hidden Markov Models), used for POS tagging, speech recognition, IE
  – Hidden states t1, t2, ..., tN (e.g., POS tags) emit observations w1, w2, ..., wN (words)

HMMs
• Joint probability distribution:
  – P(t1, ..., tN, w1, ..., wN) = P(t1) P(w1|t1) Π_{i=2..N} P(ti|ti-1) P(wi|ti)
• Estimate P(t1), P(ti|ti-1), P(wi|ti) from labeled data
• Inference: compute P(t1, t2, ..., tN | w1, w2, ..., wN)

Graphical model for role and relation extraction
• Markov sequence of states (roles)
• States generate multiple observations (features)
• The relation (interaction) generates the state sequence and the observations

Analyzing the results
• Hiding the protein names: "Selective CXCR4 antagonism by Tat" becomes "Selective PROT1 antagonism by PROT2"
  – To check whether the interaction types could be unambiguously determined by the protein names alone
• Compare results with a trigger-word approach

Results: interaction classification (accuracies)

  Model                                  All    Papers   Citances
  DM                                     60.5   57.8     53.4
  NB                                     58.1   57.8     55.7
  DM (no protein names)                  60.5   44.4     52.3
  NB (no protein names)                  59.7   46.7     53.4
  Trigger words                          25.8   40.0     26.1
  Baseline: most frequent interaction    21.8   11.1     26.1

Results: protein extraction

             Recall   Precision   F-measure
  All        0.74     0.85        0.79
  Papers     0.56     0.83        0.67
  Citances   0.75     0.84        0.79

Conclusions of the protein-protein interaction project
• Difficult and important problem: the classification of (ten) different interaction types between proteins in text
• The dynamic graphical model (DM) can simultaneously perform protein name tagging and relation identification
• High accuracy on both problems (well above the baselines)
• The results obtained by removing the protein names indicate that our models learn the linguistic context of the interactions
• Found evidence supporting the hypothesis that citation sentences are a good source of training data, most likely because they provide a concise and precise way of summarizing facts in the bioscience literature
• Used a protein-interaction database to automatically gather labeled data for this task

Relations for digital devices
• Identification of activities/relations between device pairs
• What can you do with a given device pair?
  – Digital camera and TV
  – Media server and computer
  – Media server and wireless router
  – Toshiba laptop and wireless audio adapter
  – PC and DVR
  – TV and DVR

Looking for relations
• You could search the Web (e.g., Google queries "TV DVR" and "PC DVR")
• Current search engines find co-occurrence of the query terms
• Often you need instead to find semantically related entities
• Useful for text mining, inference, and search (IR)

Looking for relations
• You could search the Web:
  – e.g., Google queries "PC DVR" and "TV DVR"
• You may want to see instead all the sentences in which the two devices are involved in an activity/relation, to get a sense of what you can do with these devices
• Activities_between(PC, DVR)
  – From which you learn, for example, that:
    » You can build a better DVR out of an old PC
    » Any modern Windows PC can be used for DVR duty
• Activities_between(TV, DVR)
  – From which you learn, for example, that:
    » A DVR allows you to pause live TV
    » You can watch Google Satellite TV through your "internet ready" Google DVR

Looking for relations
• We can frame this as a classification problem: given a sentence containing two digital devices, is a relation between them expressed in the sentence or not?

Looking for relations: examples
• Media server and computer
  – "The Allegro Media Server application reads the iTunes music library file to find the music stored on your computer" (YES)
  – "You will use the FTP software to transfer files from your computer to the media server" (YES)
  – "The media server has many functions and it needs to be a high-end computer with plenty of hard drive space to store the very large video files that get created" (YES)
  – "Sometimes you might want to play faster than your computer, or your Internet connection, or your media server, can handle" (NO)
  – "Anderson, George Homsy, A Continuous Media I/O Server and Its Synchronization Mechanism, Computer, v.24 n.10, p.51-57, October 1991" (NO)
  – "GSIC > Research Computer System > Obtaining Accouts > Media Server" (NO)
• Media server and wireless router
  – "For example, if you access a local media server in your house that is connected to a wireless router that has a port speed of only 100 Mbps [..]" (YES)
  – "Besides serving as a router, a wireless access point, and a four-port switch, the WRTSL54GS includes a storage link and a media server" (YES)
  – "It has a built-in video server, media server, home automation, wireless router, internet gateway" (NO)

Our system
• Set of 57 pairs of digital devices
• Searched the Web (Google) using the device pairs as queries
• From the Web pages retrieved, extracted the 3627 text excerpts containing both devices
• Labeled them (YES or NO)
• Trained a classification system

Our FUTURE system
• Will allow identifying the Web pages that contain relations
  – Could display only those pages
  – Could highlight only the sentences with relations
  – For digital devices this would enable, for example, useful queries for troubleshooting
• Searching the Web is one of the principal methods used to seek out information and to resolve problems involving digital devices for home networks

Our FUTURE system
• Possible extensions of the project to identify the activity types:
  – Look at the extracted sentences and come up with a set of possible activities; build a multi-class classification system to classify the different activities (supervised)
  – Extract the most indicative words for the activities; cluster them to get "activity clusters" (unsupervised)

Labeling with Mechanical Turk
• To train a classification system, we need labels
  – Labeling is time consuming, subjective, and different for each domain and task
  – (But unsupervised systems usually work worse)
• We used a web service, Mechanical Turk (MTurk, http://www.mturk.com), that lets one create and post a task requiring human intervention, offering a reward for its completion.
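The label-aggregation scheme reported on the next slide (keep unanimous labels, take the majority vote when two workers agree, discard segments where all three workers answered differently) can be sketched as below. Note an assumption: with only YES/NO answers three-way disagreement is impossible, so this sketch hypothesizes a third option such as "UNSURE"; the actual answer set is not stated in the slides.

```python
from collections import Counter

def aggregate(worker_labels):
    """Majority vote over the three MTurk worker labels for one segment.
    Returns None (discard the segment) when all three workers disagree."""
    label, votes = Counter(worker_labels).most_common(1)[0]
    if votes == 1:  # three different answers: no majority
        return None
    return label

# Hypothetical worker answers, assuming an "UNSURE" option exists
print(aggregate(["YES", "YES", "YES"]))    # YES  (perfect agreement)
print(aggregate(["YES", "NO", "YES"]))     # YES  (majority vote)
print(aggregate(["YES", "NO", "UNSURE"]))  # None (discarded)
```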
Mechanical Turk HIT for labeling relations (surveys)

Mechanical Turk
• We created a total of 121 surveys, each consisting of 30 questions.
• Our reward to users was between 15 and 30 cents per survey (< 1 cent per text segment)
  – We obtained labels for 3627 text segments for under $70.
• Each HIT was completed (by all 3 "workers") within a few minutes to half an hour
  – We had perfect agreement for 49% of all sentences
  – 5% received all three labels (discarded)
  – For 46%, two different labels were assigned (the majority vote was used to determine the final label)
• 1865 text segments were labeled YES
• 1485 text segments were labeled NO

Classification
• Now we have labeled data
• We need a (binary) classifier

Summary (from Lecture 17)
• Algorithms for classification
• Binary classification:
  – Perceptron
  – Winnow
  – Support Vector Machines (SVM)
  – Kernel methods
  – Multilayer neural networks
• Multi-class classification:
  – Decision trees
  – Naïve Bayes
  – K nearest neighbor

Support Vector Machine (SVM) (from Lecture 17)
• Large-margin classifier
• Linearly separable case: decision boundary w·x + b = 0, with margin boundaries w·x + b = 1 and w·x + b = -1; the support vectors lie on these boundaries
• Goal: find the hyperplane that maximizes the margin M
• (Figure from Gert Lanckriet, Statistical Learning Theory Tutorial)

Graphical models
• Directed (like Naïve Bayes and HMMs)
• Undirected (Markov networks)

Maximum Margin Markov Networks
• Large-margin classifier + (undirected) Markov networks [Taskar 03]
  – Combines the strengths of the two methods:
    » High-dimensional feature space, strong theoretical guarantees
    » Problem structure, ability to capture correlations between labels
• Benjamin Taskar, Carlos Guestrin, and Daphne Koller. 2003. Max-Margin Markov Networks. In NIPS.
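The large-margin idea behind the SVM slide above can be illustrated with a minimal sketch: a linear classifier trained on the hinge loss by subgradient descent, which pushes points outside the margin boundaries w·x + b = ±1. This is not the models compared in the slides (Weka's implementations, MMNB); the data and hyperparameters are toy assumptions.

```python
def train_linear_svm(data, epochs=100, lr=0.05, lam=0.01):
    """Minimize lam*||w||^2 + mean(max(0, 1 - y*(w.x + b)))
    by stochastic subgradient descent (primal linear SVM)."""
    dim = len(data[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # regularization shrinks w every step
            w = [wi - lr * 2 * lam * wi for wi in w]
            if margin < 1:  # point inside the margin: hinge subgradient
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

# Toy linearly separable data: label +1 when the first feature dominates
data = [([2.0, 0.0], 1), ([1.5, 0.5], 1), ([0.0, 2.0], -1), ([0.5, 1.5], -1)]
w, b = train_linear_svm(data)
preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
         for x, _ in data]
print(preds)  # [1, 1, -1, -1]
```

Kernel SVMs and Markov networks build on the same hinge-loss objective; M3N extends it to structured (sequence) labels.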
Directed Maximum Margin Model
• Large-margin classifier + (directed) graphical model (Naïve Bayes)
• MMNB: Maximum Margin Naïve Bayes
  – Essentially, combines the strengths of graphical models (better at interpreting data, weaker classification performance) with those of discriminative models (better performance, harder-to-interpret decisions)

Results
• Compared with Naïve Bayes and Perceptron (Weka)
• Classification accuracy:
  – MMNB: 79.98
  – Naïve Bayes: 75.62
  – Perceptron: 63.03

Conclusion
• Semantic relations
• Two projects: interactions between proteins and relations between digital devices
• Statistical models (dynamic graphical models, Maximum Margin Naïve Bayes)
• Creative ways of obtaining labeled data: a protein-interaction database and "paying" people (MTurk)

Thanks!
Barbara Rosario
[email protected]
Intel Research

Additional slides

All device pairs
• desktop wireless router
• PC stereo
• digital camera television
• pc wireless audio adapter
• digital camera tv set
• pc wireless router
• ibm laptop buffalo media player
• Phillips stereo pc
• ibm laptop linksys wireless router
• prismq media player wireless router
• ibm laptop squeezebox
• stereo laptop
• ibm laptop wireless audio adapter
• stereo toshiba laptop
• kodak camera television
• toshiba laptop buffalo media player
• laptop linksys wireless router
• toshiba laptop linksys wireless router
• laptop media server
• toshiba laptop netgear wireless router
• laptop squeezebox
• toshiba laptop squeezebox
• laptop stereo
• toshiba laptop wireless audio adapter
• laptop wireless audio adapter

All device pairs (cont.)
• buffalo media player wireless router
• laptop wireless router
• buffalo media server wireless router
• linkstation home server wireless router
• camera tv
• linkstation multimedia server wireless router
• computer linksys wireless router
• media player wireless router
• computer media server
• media server linksys wireless router
• computer stereo
• media server netgear wireless router
• computer wireless audio adapter
• media server wireless router
• computer wireless router
• network media player wireless router
• desktop media server
• nikon camera television
• desktop stereo
• pc media server
• desktop wireless audio adapter
• pc squeezebox