Download Presentation - British Computer Society

Flexible Text Mining using
Interactive Information Extraction
David Milward
[email protected]
Text mining vs. Data Mining
• Data mining
– getting new knowledge from databases
– suggesting new relationships, trends, patterns
• Text mining
– getting nuggets of information from text
– extracting relationships
– structured results to feed into data mining, visualisation or databases
company   activity   company
Sanofi    bid        Aventis
Roche     partner    Antisoma
Text Data Mining
• Emphasizes finding new knowledge from text
• Typically knowledge that is implicit within multiple documents
What is the relationship to IR?
• IR (information retrieval) finds the most relevant documents
• Text mining finds information from within documents, or across documents
– What drugs are used for psoriasis treatment?
– Who is associated directly or indirectly with the Board of Exxon?
• There is overlap …
– we often search to answer a question, not to find a document
Traditional Information Extraction
• Uses natural language processing to distinguish
– Sanofi bid for Aventis
– Aventis bid for Sanofi
• Provides structured results for easy review and analysis
company   activity   company
Sanofi    bid        Aventis
Roche     partner    Antisoma
• Uses normalised terminology to allow integration with databases, e.g.
– Preferred term: Sanofi
– Synonyms: Sanofi Pasteur, Sanofi Synthelabo, Sanofi Synthélabo …
• But:
– typically limited to patterns on a single sentence
– constructing, testing and running queries can take days
• Appropriate if you always have the same question, e.g. you want to run over a newsfeed every night
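The two ideas on this slide, direction-sensitive patterns and normalised terminology, can be illustrated with a toy sketch. It is not how I2E or any production extraction system works: a real system would use linguistic analysis rather than one regular expression. The `PREFERRED` table, the `BID_PATTERN` regex and the `extract_bids` function are hypothetical names introduced only for this example; the company names come from the slide.

```python
import re

# Hypothetical synonym table (assumption): surface form -> preferred term.
PREFERRED = {
    "sanofi": "Sanofi",
    "sanofi pasteur": "Sanofi",
    "sanofi synthelabo": "Sanofi",
    "sanofi synthélabo": "Sanofi",
    "aventis": "Aventis",
}

# Try longer names first so "Sanofi Synthelabo" is preferred over "Sanofi".
_names = sorted(PREFERRED, key=len, reverse=True)
_company = "(" + "|".join(re.escape(n) for n in _names) + ")"

# "<company> bid for <company>": word order decides who is the bidder.
BID_PATTERN = re.compile(_company + r"\s+bid\s+for\s+" + _company, re.IGNORECASE)


def extract_bids(sentence):
    """Return (bidder, 'bid', target) rows using preferred terms."""
    rows = []
    for match in BID_PATTERN.finditer(sentence):
        bidder = PREFERRED[match.group(1).lower()]
        target = PREFERRED[match.group(2).lower()]
        rows.append((bidder, "bid", target))
    return rows


print(extract_bids("Sanofi Synthélabo bid for Aventis"))  # [('Sanofi', 'bid', 'Aventis')]
print(extract_bids("Aventis bid for Sanofi"))             # [('Aventis', 'bid', 'Sanofi')]
```

A keyword search for both company names would return both sentences alike; the pattern keeps the direction of the bid, and the synonym table maps variant names to one preferred term so the rows can be joined with a database.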
I2E: Interactive Information Extraction
• A new concept
• Encompasses
– keywords → documents
– patterns → relationships (structured output)
• Queries ranging from:
– General Motors
– General Motors & acquisition in the same document
– Automotive companies & acquisitions in the same sentence
– What companies is General Motors associated with?
• Not limited to patterns within sentences, e.g.
– Merger and acquisition activity in documents mentioning Japan
• Fast, scalable, versatile
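The difference between "in the same document" and "in the same sentence" can be sketched in a few lines. This is a minimal illustration under simplifying assumptions, not I2E's query language: sentence splitting is a naive regex, matching is plain keyword matching, and the function names and the two example texts are made up for the sketch.

```python
import re


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def mentions(text, term):
    # Case-insensitive whole-word (or whole-phrase) match.
    return re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE) is not None


def same_document(doc, a, b):
    """True if both terms occur anywhere in the document."""
    return mentions(doc, a) and mentions(doc, b)


def same_sentence(doc, a, b):
    """Return the sentences in which both terms occur together."""
    return [s for s in split_sentences(doc) if mentions(s, a) and mentions(s, b)]


# Made-up example texts, for illustration only.
doc1 = "General Motors reported results. An acquisition was announced elsewhere."
doc2 = "General Motors announced the acquisition of a supplier."

print(same_document(doc1, "General Motors", "acquisition"))  # True
print(same_sentence(doc1, "General Motors", "acquisition"))  # []
print(same_sentence(doc2, "General Motors", "acquisition"))
```

The document-level query is broader and returns `doc1` even though the two terms are unrelated there; the sentence-level query is narrower and keeps only `doc2`.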
I2E
[Diagram labels: Information Extraction; NLP; Structured Output; Text Search; Taxonomies/Ontologies]
Linguistic Processing
• Groups words into meaningful units
• Morphology allows search for different forms of words
[Figure labels: sentences; noun phrases - match entities; verb groups - match actions; morphology - different forms]
We find that p42mapk phosphorylates c-Myb on serine and threonine .
Purified recombinant p42 MAPK was found to phosphorylate Wee1 .
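To see why morphology matters for the two sentences above, here is a small sketch that matches "phosphorylates" and "phosphorylate" as the same action. It assumes a crude suffix-stripping normaliser, which is only a stand-in for real morphological analysis; `verb_stem` and `matches_verb` are hypothetical helpers, not part of any tool named in the talk.

```python
import re


def verb_stem(word):
    """Crude normaliser (assumption): lower-case, strip one common verb
    ending, then drop a trailing 'e' so related forms share a stem."""
    w = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least four characters remain.
        if w.endswith(suffix) and len(w) - len(suffix) >= 4:
            w = w[: -len(suffix)]
            break
    return w[:-1] if w.endswith("e") else w


def matches_verb(sentence, verb):
    """True if any token in the sentence shares a stem with `verb`."""
    target = verb_stem(verb)
    tokens = re.findall(r"[A-Za-z0-9-]+", sentence)
    return any(verb_stem(t) == target for t in tokens)


s1 = "We find that p42mapk phosphorylates c-Myb on serine and threonine ."
s2 = "Purified recombinant p42 MAPK was found to phosphorylate Wee1 ."

print(matches_verb(s1, "phosphorylate"))  # True
print(matches_verb(s2, "phosphorylate"))  # True
```

An exact keyword search for "phosphorylate" would miss the first sentence; normalising word forms lets one query cover both.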
Monitoring Merger and Acquisition Activity
Company Positions
Using I2E in the Life Sciences
• Good resources
– Scientific abstracts are readily available in XML
– Large number of existing taxonomies/terminologies
• Very large scale
– 16 million abstracts relevant to life sciences (17 million in a second version of the slide), growing by ???? a year
– Large numbers of internal reports and full-text articles
– Internal documents can often be > 1000 pages, may be PDF images
– Taxonomies/terminologies are large, often deeply structured, e.g. > 100K concepts, 350K nodes, > 400K synonyms
– Still need to augment terminology for specific areas
Examples of Pharma Questions
• R&D
– Which proteins interact with metabolite X?
– What are the reaction kinetics for canonical pathway Y?
– What attributes are common to sets of biomarker genes?
– What are the known associations between expressed genes and environmental factors?
– What dosages of compound B cause adverse reactions?
• Competitive Intelligence
– Which companies are working on technology C?
– What compounds are available for in-licensing in a disease area?
– Which research groups are my competitors collaborating with?
Linking Drugs to Adverse Events
Measurements
• Extraction of numerical parameters
– e.g. amounts, dosages, concentrations
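As a rough illustration of extracting numerical parameters, the sketch below pulls number and unit pairs out of free text with a regular expression. The unit list, the `MEASUREMENT` pattern and the `extract_measurements` function are assumptions made for this example, and the input sentence is invented; a real system would need a much fuller unit vocabulary and would link each value to the compound it describes.

```python
import re

# Hypothetical unit list (assumption). Longer units come first so that
# "mg/kg" is tried before "mg", and "mg" before "g".
UNITS = ["mg/kg", "mg", "ml", "nM", "µM", "mM", "g"]

MEASUREMENT = re.compile(
    r"(\d+(?:\.\d+)?)\s*(" + "|".join(re.escape(u) for u in UNITS) + r")\b"
)


def extract_measurements(text):
    """Return (value, unit) pairs found in the text, in order of appearance."""
    return [(float(m.group(1)), m.group(2)) for m in MEASUREMENT.finditer(text)]


# Invented sentence, for illustration only.
print(extract_measurements("Patients received 50 mg daily, then 2.5 mg/kg."))
# [(50.0, 'mg'), (2.5, 'mg/kg')]
```

Returning the value as a number and the unit as a separate field gives structured output that can be sorted, filtered or charted, which plain keyword hits cannot.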
Benefits of Flexible Text Mining
• The ideal final query may use
– co-occurrence of terms within a document or sentence
– a precise linguistic pattern
– a mixture of both
• It depends on
– the nature of the task
– the availability of terminologies
– the kind of documents (news vs. science, abstract vs. full text)
– the time available to check results
• Flexibility to mix different techniques is also critical for fast development of queries
– e.g. start with broad queries to explore the “results space”, then home in
I2E: Better Results, Faster
[Chart: Count of Link (axis 0 to 10) per gene ([c] Gene2: BCL2, CDKN1A, DMPK, EPHB2, INS, MAP2K1, MAPK1, MAPK3, MAPK7, RB1, STK3, VIM), broken down by relation ([c] Reln: suppress, regulate, phosphorylate, mediate, interact, inhibit, induce, inactivate, co-express, block, bind, activate)]
• Fast query creation
• Fast return of results
• Fast review and analysis
Impact of I2E
• Significant reduction in time spent searching/reading the literature
– weeks reduced to days or hours
• Structure the unstructured to
– provide systematic and comprehensive review of information content
– enable integration with traditional structured data
– allow complex analysis of literature-derived information
– generate hypotheses, gain insight