Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Practical Protein Sequence Alignment With Algebraic Dynamic Programming Lyle Kopnicky PacSoft Research Group Tim Sheard, Adviser Bioinformatics GTTAGCGTGAATCTGTACTGAG • DNA, RNA and proteins are strings • Strings contain information • Some problems • Determine relatedness of strands of DNA • Figure out how RNA folds on itself • Identify proteins in a sample Tools for bioinformatics • Written in a general-purpose programming language such as C • Designed to solve a narrow range of problems • When problem doesn’t fit tool: • Tweak data to fit tool – awkward, inefficient, may not fully solve problem • Write new tools – time consuming, errorprone, require maintenance The disconnect #ifndef SS strncpy(pgm_name, "gsw", MAX_FN); #else strncpy(pgm_name, "ssw", MAX_FN); #endif standard_pam("BL50",ppst); ppst->nsq = naa; ppst->nsqx = naax; for (i=0; i<=ppst->nsqx; i++) { ppst->sq[i]=aa[i]; /* sq = aa */ ppst->hsq[i]=haa[i]; /* hsq = haa */ ppst->sqx[i]=aax[i]; /* sq = aax */ ppst->hsqx[i]=haax[i]; /* hsq = haax */ } ppst->sq[ppst->nsqx+1] = ppst->sqx[ppst->nsqx+1] = '\0'; memcpy(qascii,aascii,sizeof(qascii)); /* set up the c_nt[] mapping */ ppst->c_nt[0]=0; for (i=1; i<=nnt; i++) { ppst->c_nt[i]=gc_nt[i]; ppst->c_nt[i+nnt]=gc_nt[i]+nnt; } } The disconnect • General purpose languages are easily available, efficient, BUT • Biologists think: amino acids, matching, classification • General-purpose languages provide characters, procedures, objects • Domain experts may not be expert programmers • A lot of design, development and maintenance time Domains and tools General-purpose languages Physics Bioinformatics Architecture Domain Domain Tool Tool Finance Domain Tool Mathematics Domain Chemistry Libraries Domain-specific languages • Represent a problem in the way domain experts think of it • Examples: Excel, HTML, Matlab • General enough to capture a domain, specific enough to reduce design time • Small change in requirements = small change in program • Easy to answer “what-if” questions • Can be implemented efficiently using known techniques Collaboration with OHSU lab • Dr. Srinivasa Nagalla’s laboratory • Goals • Gain domain knowledge • Discover goals of biologists • Potential users of DSL • Began with protein identification problem Protein identification problem breakup into peptides chromatography SD gel ? ? ? ? Protein identification problem tandem mass spectrometry de novo sequencing database search AVAQFPRR AVAQFPRR AVAQFPRR AVAQFPRR AVAQFPRR AVAQF AVA A QFPRR PRR R ? [Po]QFPVGR …TRSSRAGLQFPVGRVHRLLR… A[AV]QFPVGR [Po]QFPRR [241.10]QFPVGR A[AV]QFPRR [241.10]QFPVGR Database search • Not an exact match • Mutations – substitutions, insertions, deletions • Modifications – amino acids altered by another molecule • De novo sequencer outputs a list of ambiguous queries, e.g. [229.07]HhNyG[PS][198.1]QHADD[ep]VD[Rz]R unidentified mass unordered pair Sequence alignment • Find the best match between query string and target (database) string • Each match is also called an alignment • Alignments are scored Target string WTABRRFCWGYPD KWGGSCASPNE F WT PDPYK Query string Alignment tools • Smith-Waterman dynamic programming algorithm most common • O((n+m)2), where n and m are length of strings • Run on every query-target pair, and data sets are large • Requires precise query string • FASTA, BLAST address speed problem • First look for runs of exact matches • Align on localized area Fitting queries into FASTA • Generate all possible exact queries • Could be dozens of possibilities [241.10]NEM[NP]YR NQ QN GG GAN GNA GQG GGQ AGGG NP PN • Each one must be aligned – takes time • Output must be converted back to original query string Algebraic Dynamic Programming • Robert Giegerich, 2000 • Generic approach to solving dynamic programming problems using parsing • Domain-specific embedded language in Haskell • Ambiguous queries can be represented directly as a grammar • Searching with an ambiguous query slower, but only one search instead of dozens • Still takes O((n+m)2) time Trimming the search space • Filter target strings and localize search • Five-character substring exact match [241.10]NEM[NP]YR NEMNP NEMPN EMNPY EMPNY MNPYR MPNYR exact • Matching techniques • Boyer-Moore on each substring-target pair • Pre-index database Boyer-Moore • Finds an exact occurrence of one string inside another • Doesn’t check every position – knows when to skip ahead • Can run in sublinear time SYNSNTLNNDIMLIKLKSAASLN x SAASL SYNSNTLNNDIMLIKLKSAASLN x SAASL Pre-indexing the database • Build a tree in which to look up substrings, find positions in database • Substring tree: A substring at each node • Suffix tree: Path along labeled edges describes substring substring tree suffix tree DTA, 3 ADT, 2 TADTA TAD, 2 3 2 1 Sample output Query string: [229.07]HhNyG[PS][198.1]QHADD[EP]VD[Rz]R:{21} Target string: >MK14_HUMAN (Q16539) Mitogen-activated protein kinase 14 MSQERPTFYRQELNKTIWEVPERYQNLSPVGSGAYGSVCAAFDTKTGLRVAVKKLSRPFQSIIH AKRTYRELRLLKHMKHENVIGLLDVFTPARSLEEFNDVYLVTHLMGADLNNIVKCQKLTDDHVQ FLIYQILRGLKYIHSADIIHRDLKPSNLAVNEDCELKILDFGLARHTDDEMTGYVATRWYRAPE IMLNWMHYNQTVDIWSVGCIMAELLTGRTLFPGTDHIDQLKLILRLVGTPGAELLKKISSESAR NYIQSLTQMPKMNFANVFIGANPLAVDLLEKMLVLDSDKRITAAQALAHAYFAQYHDPDDEPVA DPYDQSFESRDLLIDEWKSLTYDEVISFVPPPLDQEEMES:{360} Target range: 290-326 LDSDKRITAAQALAHA YFAQYHDPDDEPVADPYDQSF :|XvvvXX:|^|X^||//|^|XXX -HhNyGPS-Q HA DDEPV DRzR Score: 31 Time: 0.045 secs Time and space usage Small set Full set 5 4800 Query length 7–17 up to 21 Number of target strings 179 8500 up to 700 up to 7000 Number of queries Target string length Small set Full set time time space No heuristic 2m35s — — Boyer-Moore 3s — — pre-indexing 1s 5m 750MB lookup 1s 1h49m 1GB Substring tree Smith-Waterman (1981) • Local alignment problem • Dynamic programming algorithm • Pathways through table represent alignments • Entry represents best score of an alignment starting here, ending anywhere S.-W. alignment scoring start/end = 0 WTABRRFCWTYPDG WKGGSCASPNE GE F WT PDPYDAW QAPT gap: -w x length match/substitution: s(a1,a2) Smith-Waterman Phase 1 R F W T P D P Y D A W 0 1 0 0 0 F C W T Y P D W K 0 0 1 0 0 1 0 0 0 1 0 2 1 0 0 0 1 1 2 1 0 0 0 1 1 2 0 0 0 0 1 1 1 0 0 1 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 Insertion, deletion or substitution Smith-Waterman Phase 2 F W T P D P Y D A W R 2 2 1 0 0 0 1 0 0 0 F 3 1 2 1 0 0 0 1 0 0 C 3 2 1 2 1 0 0 0 1 0 a W 2 3 1 1 2 1 0 0 0 1 T 1 2 2 1 1 2 1 0 0 0 Y 0 1 2 1 1 1 2 0 0 0 P 0 0 1 2 0 1 1 1 0 0 Trace back maximal pathways D 1 0 0 0 1 0 0 1 1 0 a W 0 1 0 0 0 0 0 0 0 1 CWTYPD FWT PD K 0 0 0 0 0 0 0 0 0 0 Today’s problems one-to-many W NAC transpositions AN NA endpoint scoring FGAK +5 dual representations AGNCF... ... 85 116 39 85 100 Trying to fit new data into old tools… We need a way to model new problems quickly Recurrence relations • Basis of traditional DP • Hard to design and understand • Mixes together search space, scoring, and order of evaluation • Subscript errors are common in implementation Smith-Waterman recurrence relation H i , j max{ H i 1, j 1 s( S1 (i ), S 2 ( j )), H i , j 1 w, H i 1, j w,0} Algebraic Dynamic Programming • Giegerich, 2000 • Grammars describe search space • Set of functions (evaluation algebra) specifies scoring • Solution space is pared down by an objective function first string $ gnirts dnoces Grammar & eval. algebra localign = start <<< string ~~~ internal ~~~ string ... h internal = subst delete insert end ... h <<< aa ~~~ internal ~~~ aa ||| <<< aa ~~~ internal ||| <<< internal ~~~ aa ||| <<< string ~~~ ’$’ ~~~ string subst(a1,score,a2) insert(score,a) delete(a,score) end(str1,$,str2) h[score1,...,scorek] = = = = = score + s(a1,a2) score – w score – w 0 max[score1,...,scorek] Traditional DP vs. ADP Traditional DP solutions ADP Tricky to design & implement Grammar and algebra like recurrence relations you think, no subscripts Difficult to extend to alternate problem descriptions Careful about order of evaluation Just change grammar and evaluation algebra Order of evaluation automatic Time complexity depends on Time complexity depends recurrence relation on form of grammar Very fast in C Haskell can be slow Design of a DSL • Implement sample applications • Use a flexible, higher-order language • Abstract out common themes • Data structures • Operations • Decide how to handle errors • Run-time errors • Type system • Speed up implementation by generating C • From embedded language, like PAN • Using standalone compiler The next steps • • • • • • Compare our alignments with biologists’ Accumulate protein scores Try 2-3 character statistical matches Pre-index using suffix trees Reduce memory usage Look for runs of exact matches as in FASTA Our contributions • Built a working relationship with bio lab to understand domain • Demonstrated feasibility of aligning large data sets in Haskell • Constructed a framework for experimenting with representations for alignments, and heuristics for filtering target strings, localizing searches