Flexible Text Mining using Interactive Information Extraction
David Milward
[email protected]

Text mining vs. Data Mining
• Data mining
  – getting new knowledge from databases
  – suggesting new relationships, trends, patterns
• Text mining
  – getting nuggets of information from text
  – extracting relationships
  – structured results to feed into data mining, visualisation or databases

  company   activity   company
  Sanofi    bid        Aventis
  Roche     partner    Antisoma

Text Data Mining
• Emphasizes finding new knowledge from text
• Typically knowledge that is implicit within multiple documents

What is the relationship to IR?
• IR finds the most relevant documents
• Text mining finds information from within documents, or across documents
  – What drugs are used for psoriasis treatment?
  – Who are associated directly or indirectly with the Board of Exxon?
• There is overlap …
  – we often search to answer a question, not to find a document

Traditional Information Extraction
• Uses natural language processing to distinguish
  – Sanofi bid for Aventis
  – Aventis bid for Sanofi
• Provides structured results for easy review and analysis

  company   activity   company
  Sanofi    bid        Aventis
  Roche     partner    Antisoma

• Uses normalised terminology to allow integration with databases e.g.
  – Preferred term: Sanofi
  – Synonyms: Sanofi Pasteur, Sanofi Synthelabo, Sanofi Synthélabo …
• But:
  – typically limited to patterns on a single sentence
  – constructing, testing and running queries can take days
• Appropriate if you always have the same question e.g. want to run over a newsfeed every night

I2E: Interactive Information Extraction
• A new concept
• Encompasses
  – keywords → documents
  – patterns → relationships (structured output)
• Queries ranging from:
  – General Motors
  – General Motors & acquisition in the same document
  – Automotive companies & acquisitions in the same sentence
  – What companies is General Motors associated with?
• Not limited to patterns within sentences e.g.
  – Merger and acquisition activity in documents mentioning Japan
• Fast, scalable, versatile

I2E
[Diagram labels: Information Extraction; NLP; Structured Output; Text Search; Taxonomies/Ontologies]

Linguistic Processing
• Groups words into meaningful units
• Morphology allows search for different forms of words
[Diagram labels: sentences; noun phrases (match entities); verb groups (match actions); morphology (different forms)]
Example sentences:
  We find that p42mapk phosphorylates c-Myb on serine and threonine.
  Purified recombinant p42 MAPK was found to phosphorylate Wee1.

Monitoring Merger and Acquisition Activity

Company Positions

Using I2E in the Life Sciences
• Good resources
  – Scientific abstracts are readily available in XML
  – Large number of existing taxonomies/terminologies
• Relatively large scale
  – 16 million abstracts relevant to life sciences
  – Large numbers of internal reports and full-text articles
  – Internal documents often > 1000 pages, may be PDF images
  – Taxonomies/terminologies are large, often deeply structured e.g. > 100K concepts, > 400K synonyms
  – Still need to augment terminology for specific areas

Examples of Pharma Questions
• R&D
  – Which proteins interact with metabolite X?
  – What are the reaction kinetics for canonical pathway Y?
  – What attributes are common to sets of biomarker genes?
  – What are the known associations between expressed genes and environmental factors?
  – What dosages of compound B cause adverse reactions?
• Competitive Intelligence
  – Which companies are working on technology C?
  – What compounds are available for in-licensing in a disease area?
  – Which research groups are my competitors collaborating with?
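
A question such as "Which proteins interact with metabolite X?" relies on the two ideas from the Linguistic Processing slide: terminology maps different names onto one entity, and morphology maps different verb forms onto one action. The sketch below is a toy illustration of that idea, not I2E's implementation. The `SYNONYMS` and `VERB_FORMS` tables, the preferred terms chosen, and the `extract` function are all assumptions made for this example; a real system would use full taxonomies and proper noun-phrase and verb-group analysis.

```python
import re

# Hypothetical mini-terminology: surface form (lower-cased) -> preferred term.
# Illustrative only; a real terminology holds hundreds of thousands of synonyms.
SYNONYMS = {
    "p42mapk": "MAPK1",
    "p42 mapk": "MAPK1",
    "c-myb": "MYB",
    "wee1": "WEE1",
}

# Hypothetical verb lexicon: inflected form -> base form.
# A lookup table stands in for real morphological analysis.
VERB_FORMS = {
    "phosphorylate": "phosphorylate",
    "phosphorylates": "phosphorylate",
    "phosphorylated": "phosphorylate",
}


def _alternation(keys):
    # Longest alternatives first, so "p42 mapk" is tried before shorter forms.
    return "|".join(re.escape(k) for k in sorted(keys, key=len, reverse=True))


ENTITY_RE = re.compile(
    r"(?<![\w-])(" + _alternation(SYNONYMS) + r")(?![\w-])", re.IGNORECASE
)
VERB_RE = re.compile(r"\b(" + _alternation(VERB_FORMS) + r")\b", re.IGNORECASE)


def extract(sentence):
    """Return (agent, action, target) using normalised terms, or None.

    Crude heuristic: the agent is the last known entity before the verb,
    the target is the first known entity after it.
    """
    verb = VERB_RE.search(sentence)
    if verb is None:
        return None
    before = ENTITY_RE.findall(sentence[:verb.start()])
    after = ENTITY_RE.findall(sentence[verb.end():])
    if not before or not after:
        return None
    agent = SYNONYMS[before[-1].lower()]
    target = SYNONYMS[after[0].lower()]
    return (agent, VERB_FORMS[verb.group(1).lower()], target)


if __name__ == "__main__":
    for s in (
        "We find that p42mapk phosphorylates c-Myb on serine and threonine.",
        "Purified recombinant p42 MAPK was found to phosphorylate Wee1.",
    ):
        print(extract(s))
```

Both example sentences yield rows with the same agent and action even though the source text writes the protein name and the verb differently, which is what makes the structured output easy to review.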
Linking Drugs to Adverse Events

Measurements
• Extraction of numerical parameters
  – e.g. amounts, dosages, concentrations

Benefits of Flexible Text Mining
• The ideal final query may use
  – co-occurrence of terms within a document or sentence
  – a precise linguistic pattern
  – a mixture of both
• It depends on
  – the nature of the task
  – the availability of terminologies
  – the kind of documents (news vs. science, abstract vs. full text)
  – the time available to check results
• Flexibility to mix different techniques is also critical for fast development of queries
  – e.g. start with broad queries to explore the “results space”, then home in

I2E: Better Results, Faster
[Chart: Count of Link (0 to 10) by [c] Gene2 (BCL2, CDKN1A, DMPK, EPHB2, INS, MAP2K1, MAPK1, MAPK3, MAPK7, RB1, STK3, VIM), broken down by [c] Reln (suppress, regulate, phosphorylate, mediate, interact, inhibit, induce, inactivate, co-express, block, bind, activate)]
Fast query creation
Fast return of results
Fast review and analysis

Impact of I2E
• Significant reduction in time spent searching/reading the literature
  – weeks reduced to days or hours
• Structure the unstructured to
  – provide systematic and comprehensive review of information content
  – enable integration with traditional structured data
  – allow complex analysis of literature derived information
  – generate hypotheses, gain insight
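
The "start broad, then home in" workflow from the Benefits of Flexible Text Mining slide can be sketched as three queries of increasing precision: document co-occurrence, sentence co-occurrence, and an ordered pattern. This is a minimal sketch under stated assumptions, not I2E's query language. The three documents in `DOCS` are invented for illustration, and the function names and the `BID_PATTERN` regular expression are hypothetical.

```python
import re

# Invented toy corpus, for illustration only.
DOCS = {
    "doc1": "Sanofi bid for Aventis. Analysts expect a response soon.",
    "doc2": "Aventis rejected the offer. Sanofi later raised its bid.",
    "doc3": "Roche will partner with Antisoma.",
}


def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", text.strip())


def contains_all(text, terms):
    lowered = text.lower()
    return all(t.lower() in lowered for t in terms)


def doc_cooccurrence(docs, terms):
    """Broadest query: documents that mention every term anywhere."""
    return [doc_id for doc_id, text in docs.items() if contains_all(text, terms)]


def sentence_cooccurrence(docs, terms):
    """Narrower query: every term must appear in the same sentence."""
    return [
        (doc_id, sent)
        for doc_id, text in docs.items()
        for sent in split_sentences(text)
        if contains_all(sent, terms)
    ]


# Precise pattern: word order distinguishes the bidder from the target.
BID_PATTERN = re.compile(r"\b(?P<bidder>[A-Z]\w+) bid for (?P<target>[A-Z]\w+)\b")


def bid_relationships(docs):
    """Most precise query: structured (doc, company, activity, company) rows."""
    return [
        (doc_id, m.group("bidder"), "bid", m.group("target"))
        for doc_id, text in docs.items()
        for m in BID_PATTERN.finditer(text)
    ]


if __name__ == "__main__":
    terms = ["Sanofi", "Aventis"]
    print(doc_cooccurrence(DOCS, terms))
    print(sentence_cooccurrence(DOCS, terms))
    print(bid_relationships(DOCS))
```

On this toy corpus the document-level query returns two documents, the sentence-level query keeps only one sentence, and the pattern query turns that sentence into a structured row that records which company is the bidder. Each step trades recall for precision, which is why the choice depends on the task and on the time available to check results.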