Biological information extraction
from natural language text
Chitta Baral
Arizona State University
Goal
• Extract 'simple' information from text.
• This is somewhat simpler than complete natural
language understanding.
• Examples of 'simple' information (structure is
anticipated)
– John was in Phoenix in March
at( John, Phoenix, March)
– Protein-x in presence of enzyme y breaks down to components z
and w.
breaks_in_presence_of( x, y, [z , w] )
• Not so 'simple' information (meta-information,
unanticipated or untargeted structure)
– John only visits cities where he has a friend
Main approach
• Use extraction rules that can extract the targeted
information
– Extract P(X,Y,Z) from a sentence if in that sentence X is a
proper noun, Y is a verb that immediately follows the noun and Z
is a noun phrase that immediately follows Y.
• Coming up with extraction rules
– Manually
– Learning extraction rules
• Develop your own learning program
• Cast your problem appropriately so as to use existing learning
programs (such as Progol, FOIL, etc.)
• Take an existing information extraction system and adapt it
so that it applies to our case
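The example rule above (X a proper noun, Y a verb immediately after it, Z a noun phrase immediately after Y) can be sketched in Prolog roughly as follows. The chunk representation word(Tag, Text) and ng(Tokens), the predicate names sentence_chunks/2, verb_tag/1 and adjacent_triple/4, and the sample sentence are all assumptions made for illustration; none of them come from the slides.

```prolog
% Hypothetical chunked sentence: word(Tag, Text) is a single token,
% ng(Tokens) is a noun group.
sentence_chunks(s1, [word(nnp, 'John'),
                     word(vbd, visited),
                     ng([word(dt, the), word(nn, museum)])]).

% verb_tag(T): T is one of the Penn Treebank verb tags.
verb_tag(vb).
verb_tag(vbd).
verb_tag(vbp).
verb_tag(vbz).

% adjacent_triple(Chunks, A, B, C): A, B and C are three consecutive chunks.
adjacent_triple([A, B, C | _], A, B, C).
adjacent_triple([_ | Rest], A, B, C) :-
    adjacent_triple(Rest, A, B, C).

% p(X, Y, Z): X is a proper noun, Y a verb right after X,
% and Z a noun phrase right after Y.
p(X, Y, Z) :-
    sentence_chunks(_, Chunks),
    adjacent_triple(Chunks, word(nnp, X), word(VTag, Y), ng(Z)),
    verb_tag(VTag).
```

With these facts, the query `?- p(X, Y, Z).` binds X to 'John', Y to visited and Z to the noun group for "the museum".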
Learning extraction rules
• Mark the text that is to be extracted.
• Parse the marked text and do part-of-speech
tagging.
• Extract a pattern.
• Apply the pattern to other text, and add
conditions or modify the pattern to avoid false
positives.
• Repeat the above steps until acceptable
performance is achieved.
An example
• HMBA could inhibit the MEC-1 cell
proliferation by down-regulation of PCNA
expression, it could also induce apoptosis
effectively that might be through the way of up-regulation of bax and bcl-2 gene expression.
• Interaction(HMBA, inhibit, MEC-1 cell
proliferation)
• Interaction(HMBA, down-regulation, PCNA
expression)
Parsing and POS tagging
[
  word([tag = 'NNP', arg(1)], 'HMBA'),
  vg([word([tag = 'MD'], 'could'),
      word([tag = 'VB', arg(2)], 'inhibit')]),
  ng([arg(3)], [word([tag = 'DT'], 'the'),
                word([tag = 'NNP'], 'MEC-1'),
                word([tag = 'NN'], 'cell'),
                word([tag = 'NN'], 'proliferation')]),
  word([tag = 'IN'], 'by'),
  word([tag = 'NN'], 'down-regulation'),
  word([tag = 'IN'], 'of'),
  ng([word([tag = 'NNP'], 'PCNA'),
      word([tag = 'NN'], 'expression')]),
  word([tag = ','], ','),
  word([tag = 'PRP'], 'it'),
  vg([word([tag = 'MD'], 'could'),
      word([tag = 'RB'], 'also'),
      word([tag = 'VB'], 'induce')]),
  word([tag = 'NN'], 'apoptosis'),
  word([tag = 'RB'], 'effectively'),
  word([tag = 'WDT'], 'that'),
  vg([word([tag = 'MD'], 'might'),
      word([tag = 'VB'], 'be')]),
  word([tag = 'IN'], 'through'),
  ng([word([tag = 'DT'], 'the'),
      word([tag = 'NN'], 'way')]),
  word([tag = 'IN'], 'of'),
  word([tag = 'NN'], 'up-regulation'),
  word([tag = 'IN'], 'of'),
  word([tag = 'NN'], 'bax'),
  word([tag = 'CC'], 'and'),
  ng([word([tag = 'JJ'], 'bcl-2'),
      word([tag = 'NN'], 'gene'),
      word([tag = 'NN'], 'expression')])
]
An alternate way to code
sentence(s).
first(s, p1).
next(p1,p2). next(p2,p3). next(p3,p4). next(p4,p5).
next(p5,p6). next(p6,p7). next(p7,p8). next(p8,p9).
next(p9,p10). next(p10,p11). next(p11,p12). next(p12,p13).
next(p13,p14). next(p14,p15). next(p15,p16). next(p16,p17).
next(p17,p18). next(p18,p19). next(p19,p20). next(p20,empty).
type(p1, word). tag(p1, nnp). content(p1, hmba). marked(p1, arg1).
type(p2, vg). …
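One way an extraction rule might look over this position-based encoding is sketched below. Only the facts for p1 follow the slide; the facts for p2 and p3 (in particular, storing a chunk's words as a list in content/2), the shortened three-position sentence, and the predicates reachable/2, position_in/2 and interaction/4 are assumptions for illustration.

```prolog
% A shortened, three-position version of the sentence.
sentence(s).
first(s, p1).
next(p1, p2).
next(p2, p3).
next(p3, empty).

type(p1, word).
type(p2, vg).
type(p3, ng).

tag(p1, nnp).

% Assumption: a chunk's content is stored as the list of its words.
content(p1, hmba).
content(p2, [could, inhibit]).
content(p3, [the, 'mec-1', cell, proliferation]).

% reachable(P0, P): P is P0 or a later position in the same chain.
reachable(P, P).
reachable(P0, P) :-
    next(P0, P1),
    P1 \== empty,
    reachable(P1, P).

% position_in(S, P): P is a position of sentence S.
position_in(S, P) :-
    first(S, P0),
    reachable(P0, P).

% interaction(S, X, Y, Z): a proper noun X is immediately followed by
% a verb group Y, which is immediately followed by a noun group Z.
interaction(S, X, Y, Z) :-
    sentence(S),
    position_in(S, P1),
    tag(P1, nnp), content(P1, X),
    next(P1, P2), type(P2, vg), content(P2, Y),
    next(P2, P3), type(P3, ng), content(P3, Z).
```

The query `?- interaction(s, X, Y, Z).` yields X = hmba, Y = [could, inhibit] and Z = [the, 'mec-1', cell, proliferation]. Compared with the nested-list encoding, adjacency is stated explicitly through next/2, which is the flat, fact-based form that learners such as Progol or FOIL typically expect.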
POS tags
• NNP -- proper noun
• MD -- modal
• VB -- verb, base form
• DT -- determiner
• NN -- common noun
• IN -- preposition
• PRP -- personal pronoun
• RB -- adverb
• WDT -- wh-determiner
• CC -- coordinating conjunction
• JJ -- adjective
Extracted interaction rule
• extract( [ word([tag = NNP],_h18724),
word([tag = VB],_h18725),
ng(_h18726)
],
interact(_h18724,_h18725,_h18726),
true).
Tagged text
• Interact(HMBA,
    [word([tag = MD], could),
     word([tag = VB], inhibit)],
    [word([tag = DT], the),
     word([tag = NNP], MEC-1),
     word([tag = NN], cell),
     word([tag = NN], proliferation)]).
• Interact(HMBA, down-regulation,
    [word([tag = NNP], PCNA),
     word([tag = NN], expression)]).
Prolog code for learning extraction rules
:- import append/3 from basics.
learn(S) :- find_interact(S, I, P), nl, write(I), nl, write(P), write_file(P, I).
– S: tagged text
– I: interaction fact
– P: extraction pattern
find_interact([word([T,arg(1)],_) | R], interact(A,B,C), P) :-
    A = X, pattern([word([T],A) | PR], P),
    find_interact(R, interact(A,B,C), PR).
More rules for find_interact.
pattern(W, P) :- P = W.
write_file(P, I) :- E = extract(P, I, true), open('extract.P', append, F),
    write(F, E), write(F, '.'), nl(F), close(F).
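Because only one find_interact clause is shown, here is a rough, independent sketch of how a marked parse could be generalized into an extraction pattern. It is not the original code: generalize/2, pattern_of/3, slot/3 and join/3 are hypothetical names, and the sketch assumes the marking conventions visible in the parsed example (arg(N) as the second attribute of a word, or as the first argument of a two-argument ng).

```prolog
% generalize(+Marked, -Rule): build extract(Pattern, Fact, true) from a
% marked chunk list. Marked chunks become pattern elements whose content
% is a variable shared with the fact; unmarked chunks are dropped.
generalize(Marked, extract(Pattern, interact(A, B, C), true)) :-
    pattern_of(Marked, [1-A, 2-B, 3-C], Pattern).

% pattern_of(+Chunks, +Slots, -Pattern)
pattern_of([], _, []).
pattern_of([word([Tag, arg(N)], _) | Rest], Slots, [word([Tag], V) | P]) :- !,
    slot(N, Slots, V),
    pattern_of(Rest, Slots, P).
pattern_of([ng([arg(N)], _) | Rest], Slots, [ng(V) | P]) :- !,
    slot(N, Slots, V),
    pattern_of(Rest, Slots, P).
pattern_of([vg(Words) | Rest], Slots, P) :- !,
    pattern_of(Words, Slots, P1),      % look for marked words inside the group
    pattern_of(Rest, Slots, P2),
    join(P1, P2, P).
pattern_of([_ | Rest], Slots, P) :-    % unmarked chunk: skip it
    pattern_of(Rest, Slots, P).

% slot(+N, +Slots, -Var): Var is the variable paired with argument N.
slot(N, [N-V | _], V) :- !.
slot(N, [_ | Slots], V) :- slot(N, Slots, V).

% join/3: list concatenation, defined locally to avoid library imports.
join([], L, L).
join([H | T], L, [H | R]) :- join(T, L, R).

% The first chunks of the marked example sentence.
marked_example([word([tag = 'NNP', arg(1)], 'HMBA'),
                vg([word([tag = 'MD'], could),
                    word([tag = 'VB', arg(2)], inhibit)]),
                ng([arg(3)], [word([tag = 'DT'], the),
                              word([tag = 'NNP'], 'MEC-1'),
                              word([tag = 'NN'], cell),
                              word([tag = 'NN'], proliferation)]),
                word([tag = 'IN'], by)]).
```

Running `?- marked_example(M), generalize(M, Rule).` gives a rule of the form extract([word([tag = 'NNP'], A), word([tag = 'VB'], B), ng(C)], interact(A, B, C), true), which has the same shape as the extracted interaction rule shown earlier.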
A set of extraction patterns
extract([word([tag = 'NNP'],_h13664), word([tag = 'VB'],_h13665),
         word([tag = 'NNP'],_h13666)],
        interact(_h13664,_h13665,_h13666), true).
extract([word([tag = 'NNP'],_h62915), vg(_h62916), ng(_h62917)],
        interact(_h62915,_h62916,_h62917), true).
extract([word([tag = 'NNP'],_h112469), word([tag = 'NN'],_h112470),
         ng(_h112471)],
        interact(_h112469,_h112470,_h112471), true).
extract([word([tag = 'NNP'],_h161953), word([tag = 'NN'],_h161954),
         word([tag = 'NNP'],_h161955)],
        interact(_h161953,_h161954,_h161955), true).
extract([word([tag = 'VB'],_h17857), vg(_h17858), ng(_h17859)],
        interact(_h17857,_h17858,_h17859), true).
extract([word([tag = 'NNP'],_h42739), word([tag = 'NN'],_h42740), ng(_h42741)],
        interact(_h42739,_h42740,_h42741), true).
extract([word([tag = 'NNP'],_h44071), word([tag = 'NN'],_h44072), ng(_h44073)],
        interact(_h44071,_h44072,_h44073), true).
extract([word([tag = 'NNP'],_h16431), word([tag = 'NN'],_h16432), ng(_h16433)],
        interact(_h16431,_h16432,_h16433), true).
Code that extracts patterns
:- load_dyn('extract.P').
matcher(_, [], _).
matcher([SH|ST], [SH|PT], _) :- matcher(ST, PT, _).
matcher([SH|ST], [PH|PT], _) :- SH \== PH,
    matcher(ST, [PH|PT], _).
run(S) :- process(S).
process(S) :- extract(P, F, _), matcher(S, P, _),
    write_file(F),
    fail.
process(_).
write_file(I) :- open('interact.P', append, File), write(File, I),
    write(File, '.'), nl(File), close(File).
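To see the matcher at work without any file I/O, the sketch below inlines one of the patterns listed earlier and collects the resulting facts with findall/3 instead of appending them to interact.P. The helper interactions/2 and the fact sample/1 are hypothetical; matcher/3 is copied from the slide, and the input is the first four chunks of the example sentence without argument markings.

```prolog
% One pattern of the kind stored in extract.P, inlined here so that the
% example is self-contained (the slide loads the file with load_dyn/1).
extract([word([tag = 'NNP'], A), vg(B), ng(C)], interact(A, B, C), true).

% matcher/3 as given on the slide.
matcher(_, [], _).
matcher([SH|ST], [SH|PT], _) :- matcher(ST, PT, _).
matcher([SH|ST], [PH|PT], _) :- SH \== PH, matcher(ST, [PH|PT], _).

% interactions(+Sentence, -Facts): hypothetical helper that returns all
% facts produced by the stored patterns instead of writing them to a file.
interactions(S, Facts) :-
    findall(F, (extract(P, F, _), matcher(S, P, _)), Facts).

% Unmarked input: the first four chunks of the example sentence.
sample([word([tag = 'NNP'], 'HMBA'),
        vg([word([tag = 'MD'], could),
            word([tag = 'VB'], inhibit)]),
        ng([word([tag = 'DT'], the),
            word([tag = 'NNP'], 'MEC-1'),
            word([tag = 'NN'], cell),
            word([tag = 'NN'], proliferation)]),
        word([tag = 'IN'], by)]).
```

The query `?- sample(S), interactions(S, Facts).` returns a single interact/3 fact whose arguments are 'HMBA', the verb group "could inhibit" and the noun group "the MEC-1 cell proliferation", as on the "Tagged text" slide. Note that the third matcher clause skips a sentence chunk, so as written the pattern elements are matched as a subsequence and need not be adjacent in the sentence.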
Applications of interest
• Finding interactions between genes and
proteins.
• Given a set of genes, say obtained from
microarray experiments, use the extracted
information to get a rough idea of the
genes and proteins that interact with
them.
• Then build a pathway.
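As a minimal sketch of the lookup step, suppose the extracted facts have been reduced to atoms. The two facts below restate the interactions from the example sentence; partner/2, in_list/2 and neighbourhood/2 are hypothetical names, and treating an interaction as symmetric is a simplifying assumption.

```prolog
% Extracted facts, reduced to atoms (from the example sentence).
interact(hmba, inhibit, mec1_cell_proliferation).
interact(hmba, down_regulation, pcna_expression).

% partner(X, Y): X and Y take part in some extracted interaction,
% in either argument order.
partner(X, Y) :- interact(X, _, Y).
partner(X, Y) :- interact(Y, _, X).

% in_list/2: list membership, defined locally to avoid library imports.
in_list(X, [X | _]).
in_list(X, [_ | T]) :- in_list(X, T).

% neighbourhood(+Entities, -Pairs): all Entity-Partner pairs for the
% given entities; these pairs are candidate edges for a pathway.
neighbourhood(Entities, Pairs) :-
    findall(E-P, (in_list(E, Entities), partner(E, P)), Pairs).
```

For instance, `?- neighbourhood([pcna_expression, hmba], Pairs).` gives Pairs = [pcna_expression-hmba, hmba-mec1_cell_proliferation, hmba-pcna_expression].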