Biological information extraction from natural language text
Chitta Baral
Arizona State University

Goal
• Extract `simple' information from text.
• This is somewhat simpler than complete natural language understanding.
• Examples of `simple' information (structure is anticipated)
  – John was in Phoenix in March.
        at(John, Phoenix, March)
  – Protein x, in the presence of enzyme y, breaks down into components z and w.
        breaks_in_presence_of(x, y, [z, w])
• Not so `simple' information (meta-information, unanticipated or untargeted structure)
  – John only visits cities where he has a friend.

Main approach
• Use extraction rules that can extract the targeted information
  – Extract P(X,Y,Z) from a sentence if, in that sentence, X is a proper noun, Y is a verb that immediately follows the noun, and Z is a noun phrase that immediately follows Y.
• Coming up with extraction rules
  – Manually
  – Learning extraction rules
    • Develop your own learning program
    • Cast your problem appropriately so as to use existing learning programs (such as Progol, FOIL, etc.)
    • Take an existing information extraction system and make appropriate changes to it so that it is applicable to our case

Learning extraction rules
• Mark the text that is to be extracted.
• Parse the text (with markings) and do part-of-speech tagging.
• Extract a pattern.
• Use the pattern on other text, and add conditions or modify the pattern to avoid false positives.
• Repeat the above steps until acceptable performance is achieved.

An example
• HMBA could inhibit the MEC-1 cell proliferation by down-regulation of PCNA expression, it could also induce apoptosis effectively that might be through the way of upregulation of bax and bcl-2 gene expression.
• Interaction(HMBA, inhibit, MEC-1 cell proliferation)
• Interaction(HMBA, down-regulation, PCNA expression)

Parsing and POS tagging
    [ word([tag = 'NNP', arg(1)], 'HMBA'),
      vg([ word([tag = 'MD'], 'could'),
           word([tag = 'VB', arg(2)], 'inhibit') ]),
      ng([arg(3)],
         [ word([tag = 'DT'], 'the'),
           word([tag = 'NNP'], 'MEC-1'),
           word([tag = 'NN'], 'cell'),
           word([tag = 'NN'], 'proliferation') ]),
      word([tag = 'IN'], 'by'),
      word([tag = 'NN'], 'down-regulation'),
      word([tag = 'IN'], 'of'),
      ng([ word([tag = 'NNP'], 'PCNA'),
           word([tag = 'NN'], 'expression') ]),
      word([tag = ','], ','),
      word([tag = 'PRP'], 'it'),
      vg([ word([tag = 'MD'], 'could'),
           word([tag = 'RB'], 'also'),
           word([tag = 'VB'], 'induce') ]),
      word([tag = 'NN'], 'apoptosis'),
      word([tag = 'RB'], 'effectively'),
      word([tag = 'WDT'], 'that'),
      vg([ word([tag = 'MD'], 'might'),
           word([tag = 'VB'], 'be') ]),
      word([tag = 'IN'], 'through'),
      ng([ word([tag = 'DT'], 'the'),
           word([tag = 'NN'], 'way') ]),
      word([tag = 'IN'], 'of'),
      word([tag = 'NN'], 'up-regulation'),
      word([tag = 'IN'], 'of'),
      word([tag = 'NN'], 'bax'),
      word([tag = 'CC'], 'and'),
      ng([ word([tag = 'JJ'], 'bcl-2'),
           word([tag = 'NN'], 'gene'),
           word([tag = 'NN'], 'expression') ])
    ]

An alternate way to code
    sentence(s).
    first(s, p1).
    next(p1, p2).   next(p2, p3).   next(p3, p4).   next(p4, p5).
    next(p5, p6).   next(p6, p7).   next(p7, p8).   next(p8, p9).
    next(p9, p10).  next(p10, p11). next(p11, p12). next(p12, p13).
    next(p13, p14). next(p14, p15). next(p15, p16). next(p16, p17).
    next(p17, p18). next(p18, p19). next(p19, p20). next(p20, empty).
    type(p1, word).
    tag(p1, nnp).
    content(p1, hmba).
    marked(p1, arg1).
    type(p2, vg).
    …

POS tags
• NNP – proper noun
• MD – modal
• VB – verb, base form
• DT – determiner
• NN – common noun
• IN – preposition
• PRP – personal pronoun
• RB – adverb
• WDT – wh-determiner
• CC – coordinating conjunction
• JJ – adjective

Extracted interaction rule
    extract( [ word([tag = 'NNP'], _h18724),
               word([tag = 'VB'], _h18725),
               ng(_h18726) ],
             interact(_h18724, _h18725, _h18726),
             true ).
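
To make the role of an extract/3 rule concrete, here is a minimal sketch of applying one rule to a tagged sentence. It is an illustration under stated assumptions, not the code from these slides: the predicate names `tagged/1`, `suffix_of/2`, `prefix_of/2` and `extract_fact/2` are hypothetical, the sentence is a shortened copy of the example with the arg(N) markings removed, and the rule uses a vg/1 item for the verb because in the tagged example the verb sits inside a verb group. Only standard Prolog built-ins are used.

```prolog
% Hypothetical, shortened, unmarked copy of the tagged example sentence.
tagged([ word([tag = 'NNP'], 'HMBA'),
         vg([ word([tag = 'MD'], could),
              word([tag = 'VB'], inhibit) ]),
         ng([ word([tag = 'DT'], the),
              word([tag = 'NNP'], 'MEC-1'),
              word([tag = 'NN'], cell),
              word([tag = 'NN'], proliferation) ]),
         word([tag = 'IN'], by) ]).

% One rule in extract(Pattern, Fact, Condition) form (assumed vg/ng variant).
extract([ word([tag = 'NNP'], X), vg(Y), ng(Z) ],
        interact(X, Y, Z),
        true).

% suffix_of(List, Suffix): Suffix is List with zero or more leading items dropped.
suffix_of(L, L).
suffix_of([_|T], S) :- suffix_of(T, S).

% prefix_of(Pattern, Items): Pattern unifies with the leading items of Items.
prefix_of([], _).
prefix_of([P|Ps], [P|Is]) :- prefix_of(Ps, Is).

% extract_fact(Sentence, Fact): the pattern of some rule matches a contiguous
% run of items in Sentence, and the rule's condition holds.
extract_fact(Sentence, Fact) :-
    extract(Pattern, Fact, Cond),
    suffix_of(Sentence, Rest),
    prefix_of(Pattern, Rest),
    call(Cond).
```

The query `?- tagged(S), extract_fact(S, F).` has a single solution, binding F to an interact/3 term whose arguments are 'HMBA', the list of words in the verb group, and the list of words in the noun group. Requiring the pattern items to be adjacent is a design choice of this sketch; it mirrors the "immediately follows" wording of the main approach.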
Tagged text
• Interact(HMBA, [word([tag = MD], could), word([tag = VB], inhibit)], [word([tag = DT], the), word([tag = NNP], MEC-1), word([tag = NN], cell), word([tag = NN], proliferation)]).
• Interact(HMBA, down-regulation, [word([tag = NNP], PCNA), word([tag = NN], expression)]).

Prolog code for learning extraction rules
    :- import append/3 from basics.

    % learn(S): find an interaction fact I and an extraction pattern P in the
    % tagged text S, print both, and append the resulting rule to extract.P.
    learn(S) :-
        find_interact(S, I, P),
        nl, write(I),
        nl, write(P),
        write_file(P, I).

  – P: extraction pattern
  – I: interaction fact
  – S: tagged text

    % Case: the next item is the word marked as argument 1.
    find_interact([word([T, arg(1)], _) | SR], interact(A, B, C), P) :-
        A = X,
        pattern([word([T], A) | PR], P),
        find_interact(SR, interact(A, B, C), PR).

• More rules for find_interact.

    pattern(W, P) :- P = W.

    % write_file(P, I): append the rule extract(P, I, true) to the file extract.P.
    write_file(P, I) :-
        E = extract(P, I, true),
        open('extract.P', append, F),
        write(F, E), write(F, '.'), nl(F),
        close(F).

A set of extraction patterns
    extract( [ word([tag = 'NNP'], _h13664), word([tag = 'VB'], _h13665), word([tag = 'NNP'], _h13666) ],
             interact(_h13664, _h13665, _h13666), true ).
    extract( [ word([tag = 'NNP'], _h62915), vg(_h62916), ng(_h62917) ],
             interact(_h62915, _h62916, _h62917), true ).
    extract( [ word([tag = 'NNP'], _h112469), word([tag = 'NN'], _h112470), ng(_h112471) ],
             interact(_h112469, _h112470, _h112471), true ).
    extract( [ word([tag = 'NNP'], _h161953), word([tag = 'NN'], _h161954), word([tag = 'NNP'], _h161955) ],
             interact(_h161953, _h161954, _h161955), true ).
    extract( [ word([tag = 'VB'], _h17857), vg(_h17858), ng(_h17859) ],
             interact(_h17857, _h17858, _h17859), true ).
    extract( [ word([tag = 'NNP'], _h42739), word([tag = 'NN'], _h42740), ng(_h42741) ],
             interact(_h42739, _h42740, _h42741), true ).
    extract( [ word([tag = 'NNP'], _h44071), word([tag = 'NN'], _h44072), ng(_h44073) ],
             interact(_h44071, _h44072, _h44073), true ).
    extract( [ word([tag = 'NNP'], _h16431), word([tag = 'NN'], _h16432), ng(_h16433) ],
             interact(_h16431, _h16432, _h16433), true ).

Code that extracts patterns
    :- load_dyn('extract.P').

    matcher(_, [], _).
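    % The current sentence item unifies with the current pattern item: consume both.
    matcher([SH|ST], [SH|PT], _) :-
        matcher(ST, PT, _).
    % The two items are not identical: skip the sentence item and keep the pattern.
    matcher([SH|ST], [PH|PT], _) :-
        SH \== PH,
        matcher(ST, [PH|PT], _).

    run(S) :- process(S).

    % Try every stored rule against S and write each extracted fact;
    % the failure-driven loop ends in the second clause.
    process(S) :-
        extract(P, F, _),
        matcher(S, P, _),
        write_file(F),
        fail.
    process(_).

    % write_file(I): append the interaction fact I to the file interact.P.
    write_file(I) :-
        open('interact.P', append, File),
        write(File, I), write(File, '.'), nl(File),
        close(File).

Applications of interest
• Finding interactions between genes and proteins.
• Given a set of genes, say obtained from microarray experiments, use the extracted information to get a rough idea of the various genes and proteins that interact with these genes.
• Now build a pathway.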
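
The gene-set lookup described above can be sketched as a query over extracted facts. This is only an illustration under assumptions: the two interact/3 facts are simplified atom-level versions of the interactions from the example sentence, and the predicate names `in_list/2`, `partner/3` and `partners_of/2` are hypothetical helpers, not part of the code shown in these slides.

```prolog
% Simplified interaction facts taken from the example sentence
% (arguments flattened to atoms for illustration).
interact('HMBA', inhibit, 'MEC-1 cell proliferation').
interact('HMBA', 'down-regulation', 'PCNA expression').

% in_list(X, List): X is an element of List (defined locally to avoid
% depending on a particular list library).
in_list(X, [X|_]).
in_list(X, [_|T]) :- in_list(X, T).

% partner(Entity, Relation, Other): Entity occurs on either side of an
% extracted interaction with Other.
partner(E, R, O) :- interact(E, R, O).
partner(E, R, O) :- interact(O, R, E).

% partners_of(Genes, Triples): collect Entity-Relation-Other triples for
% every entity in Genes that takes part in some extracted interaction.
partners_of(Genes, Triples) :-
    findall(G-R-O, (in_list(G, Genes), partner(G, R, O)), Triples).
```

For example, `?- partners_of(['HMBA'], T).` binds T to a list with one triple per interaction fact that mentions HMBA. Such triples could serve as candidate edges when assembling a pathway, subject to manual checking of the extracted facts.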