Dates in the history of sequence information
(experimental and sequence analysis milestones)

1953    Amino acid sequence of insulin
1965    Base sequence of a tRNA molecule, 75 bases
1970    Global alignment (Needleman-Wunsch)
1970s   Birth of cloning
1976    Sequence of MS2 RNA, 3569 bases
1975, 1977, 1978    DNA sequencing (Sanger, F); PAM matrix (Dayhoff); bacteriophage fd genome (6408 bases)
1981    Local alignment (Smith-Waterman)
1982    EMBL database
1985    FASTA
1985    Multiple sequence alignment, first attempts
1988    Neural networks (protein secondary structure)
1989    Clustal
1989    Profilesearch
1991    BLAST
1994    Profile HMMs
1994    Context-free grammars (RNA secondary structure)
1995    Genomes of H. influenzae and M. genitalium (1.8 / 0.6 million bases)
1997    PSI-BLAST
2000    >95% of human genome (3000 million bases)

* Pairwise alignment methods
* Sequence assembly
* Storing sequences - sequence formats
* General comments on sequence alignment in the context of database searches

Determining the sequence of a longer piece of DNA
1) Detailed map, or
2) Shotgun sequencing

How many fragments should be picked to satisfy the expectation of sufficient overlap?
* 5-10 fold coverage is needed
* Full sequence string has length L
* Read length = 400
* Number of fragments picked should be between 5L/400 and 10L/400

Example:
* E. coli genome, 4.5 Mb => about 56,000-113,000 fragments
* Human genome, 3000 Mb => about 38-75 million fragments

Shotgun DNA sequencing
* The analysis of experimental raw data
  Base-calling software (like ‘Phred’):
  * Derives the base sequence
  * Assigns each base a quality score
  [Figure: example traces, good quality versus bad quality]
  You may choose to remove bad sequence on the basis of these scores.
* Cleaning up sequences
  Quality filtering
  Vector filtering
* Sequence assembly

Sequence assembly
* Overlap detection
* Fragment layout
* Deciding the consensus

Step 1: Suffix-prefix matching

String:      AGCTGGGCCCATTAACG
‘prefix’:    AGCT
‘substring’: GCC
‘suffix’:    TTAACG

Two sequences S1 and S2.
If there are no sequencing errors: compute the longest suffix of S1 that exactly matches a prefix of S2.
However, there are sequencing errors: find a suffix of S1 and a prefix of S2 whose similarity is maximum over all suffix-prefix pairs of S1 and S2.
Heuristic speedup: BLAST-like approaches (two strings with a substantial overlap should share at least one significantly long common substring).

‘The shortest superstring problem’
Given a set of k strings P = {S1, S2, ..., Sk}, a superstring of the set P is a single string that contains every string in P as a substring.
The shortest superstring is denoted S*(P).

Step 2: Substring layout
1) The string pair with the highest-scoring prefix-suffix match is selected and merged
2) The next highest-scoring pair is selected and merged
etc.

Sequence assembly in practice:
1) Phrap (Phil Green)
   ("phragment assembly program", or "phil's revised assembly program"; a homonym of "frappe" = French for "swat")
   Makes use of an efficient implementation of the SW (Smith-Waterman) algorithm (= ‘swat’) to find overlaps.
2) CAP3 (‘contig assembly program’) (Huang)
   Makes use of a BLAST-like approach to find overlaps.

CAP3
* Clipping of 5' and 3' low-quality regions
* Base quality values are used in the computation of overlaps
* Forward-reverse constraints
[Figure: panels a and b]

Repeats make the assembly process difficult.

Scaffold: a collection of ordered contigs with approximately known distances between them.

Sequence formats / ASCII-Binary formats

In ASCII text format (= human readable) each character is stored as a byte. For instance, the ASCII code of 'A' is 65 as a decimal number, which is 01000001 as a binary number.

However, sequence data is often stored in a binary format. For instance, the four bases may be stored as:
A = 00
T = 01
C = 10
G = 11
In this way there is room in one byte (= 8 bits) for 4 bases.

Typically databases are downloaded by the bioinformatician in ASCII format but then reformatted into a binary format for use with different sequence analysis tools. One example is databases for BLAST searches, which may be formatted with the NCBI utility 'formatdb'.

Sequence formats: Examples

1. EMBL

ID   LISOD      standard; DNA; PRO; 756 BP.
XX
AC   X64011; S78972;
XX
SV   X64011.1
XX
DT   28-APR-1992 (Rel. 31, Created)
DT   30-JUN-1993 (Rel. 36, Last updated, Version 6)
XX
DE   L.ivanovii sod gene for superoxide dismutase
....
....
SQ   Sequence 756 BP; 247 A; 136 C; 151 G; 222 T; 0 other;
     cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat        60
     gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa       120
     ttaccaaaat taccttatac ttatgatgct ttggagccga attttgataa agaaacaatg       180
....
....

2. Fasta

>LISOD L.ivanovii sod gene for superoxide dismutase
cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat
gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa
ttaccaaaat taccttatac ttatgatgct ttggagccga attttgataa agaaacaatg
....
....

3. GCG

lisod.seq  Length: 756  October 27, 2000 13:17  Type: N  Check: 5188  ..

       1  cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc
      51  gccttacaat gtaatttctt ttcacataaa taataaacaa tccgaggagg
     101  aatttttaat gacttacgaa ttaccaaaat taccttatac ttatgatgct
....
....

Multiple sequence format of GCG

 1.msf  MSF: 44  Type: P  October 24, 2002 12:41  Check: 9117 ..

 Name: ftsy_bucai   Len:    44  Check: 4221  Weight:  1.00
 Name: ftsy_ecoli   Len:    44  Check: 2326  Weight:  1.00
 Name: ftsy_aquae   Len:    44  Check: 2339  Weight:  1.00
 Name: ftsy_bacsu   Len:    44  Check: 6177  Weight:  1.00
 Name: sr54_aciam   Len:    44  Check: 6296  Weight:  1.00
 Name: sr54_aerpe   Len:    44  Check: 7291  Weight:  1.00
 Name: sr54_arcfu   Len:    44  Check:  122  Weight:  1.00
 Name: sr54_aquae   Len:    44  Check:  345  Weight:  1.00

//

            1                                             44
ftsy_bucai  KNS.EKLYFL LKRKMFNILK KVEIP...LE ISSHSPFVIL VVGV
ftsy_ecoli  RDA.EALYGL LKEEMGEILA KVDEP...LN VEGKAPFVIL MVGV
ftsy_aquae  KEG.EKIKEL LKKELKELLK NCQ...GELK IPEKVGAVLL FVGV
ftsy_bacsu  QDP.KEVQSV ISEKLVEIYN SGDEQISELN IQDGRLNVIL LVGV
sr54_aciam  PTYIERREWF IKIVYDELSN LFGGDKEPKV IPDKIPYVIM LVGV
sr54_aerpe  PPGVTRRDWM IKIVYEELVK LFGGDQEPQV DPPKTPWIVL LVGV
sr54_arcfu  LPALNAKEQI LKIVYEELLR GVGEGLEIPL KKAK....IM LVGL
sr54_aquae  PKNLSPAEFV IKTVYEELVD ILGGEK.... .ADLKKGTVL FVGL

FASTA multiple sequence format

>ftsy_bucai, 44 bases, 52FF1180 checksum.
KNS-EKLYFLLKRKMFNILKKVEIP---LEISSHSPFVILVVGV
>ftsy_ecoli, 44 bases, B14506B6 checksum.
RDA-EALYGLLKEEMGEILAKVDEP---LNVEGKAPFVILMVGV
>ftsy_aquae, 44 bases, 27571BEE checksum.
KEG-EKIKELLKKELKELLKNCQ---GELKIPEKVGAVLLFVGV
>ftsy_bacsu, 44 bases, FA023A4F checksum.
QDP-KEVQSVISEKLVEIYNSGDEQISELNIQDGRLNVILLVGV
>sr54_aciam, 44 bases, 2FC13632 checksum.
PTYIERREWFIKIVYDELSNLFGGDKEPKVIPDKIPYVIMLVGV
>sr54_aerpe, 44 bases, 37AFB895 checksum.
PPGVTRRDWMIKIVYEELVKLFGGDQEPQVDPPKTPWIVLLVGV
>sr54_arcfu, 44 bases, 8294461 checksum.
LPALNAKEQILKIVYEELLRGVGEGLEIPLKKAK----IMLVGL
>sr54_aquae, 44 bases, 794768A2 checksum.
PKNLSPAEFVIKTVYEELVDILGGEK-----ADLKKGTVLFVGL

CLUSTAL multiple sequence format

CLUSTAL W (1.81) multiple sequence alignment

ftsy_bucai      KNS-EKLYFLLKRKMFNILKKVEIP---LEISSHSPFVILVVGV
ftsy_ecoli      RDA-EALYGLLKEEMGEILAKVDEP---LNVEGKAPFVILMVGV
ftsy_aquae      KEG-EKIKELLKKELKELLKNCQ---GELKIPEKVGAVLLFVGV
ftsy_bacsu      QDP-KEVQSVISEKLVEIYNSGDEQISELNIQDGRLNVILLVGV
sr54_aciam      PTYIERREWFIKIVYDELSNLFGGDKEPKVIPDKIPYVIMLVGV
sr54_aerpe      PPGVTRRDWMIKIVYEELVKLFGGDQEPQVDPPKTPWIVLLVGV
sr54_arcfu      LPALNAKEQILKIVYEELLRGVGEGLEIPLKKAK----IMLVGL
sr54_aquae      PKNLSPAEFVIKTVYEELVDILGGEK-----ADLKKGTVLFVGL
                         .:.    ::                    ::.**:

Phylip multiple sequence format

 8 44
ftsy_bucai KNS-EKLYFL LKRKMFNILK KVEIP---LE ISSHSPFVIL VVGV
ftsy_ecoli RDA-EALYGL LKEEMGEILA KVDEP---LN VEGKAPFVIL MVGV
ftsy_aquae KEG-EKIKEL LKKELKELLK NCQ---GELK IPEKVGAVLL FVGV
ftsy_bacsu QDP-KEVQSV ISEKLVEIYN SGDEQISELN IQDGRLNVIL LVGV
sr54_aciam PTYIERREWF IKIVYDELSN LFGGDKEPKV IPDKIPYVIM LVGV
sr54_aerpe PPGVTRRDWM IKIVYEELVK LFGGDQEPQV DPPKTPWIVL LVGV
sr54_arcfu LPALNAKEQI LKIVYEELLR GVGEGLEIPL KKAK----IM LVGL
sr54_aquae PKNLSPAEFV IKTVYEELVD ILGGEK---- -ADLKKGTVL FVGL

Conversion of sequence formats

Readseq
(one useful version of readseq is part of the SAM package, http://www.cse.ucsc.edu/research/compbio/sam2src/)

Readseq can convert between the following formats:
 1. IG/Stanford
 2. GenBank/GB
 3. NBRF
 4. EMBL
 5. GCG
 6. DNAStrider
 7. Fitch
 8. Pearson/Fasta
 9. Zuker (in-only)
10. Olsen (in-only)
11. Phylip3.2
12. Phylip
13. Plain/Raw
14. PIR/CODATA
15. MSF
16. ASN.1
17. PAUP/NEXUS
18. Pretty (out-only)

GCG package utilities:

Convert        from       to
* Reformat     text       GCG
* FromEMBL     EMBL       GCG
* FromGenBank  Genbank    GCG
* FromFasta    Fasta      GCG
* ToFastA      GCG        Fasta

Clustal:
clustalw my_alignment -convert -output=gcg
clustalw my_alignment -convert -output=phylip

Downloading bioinformatics data and software for local use
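
Returning to the shotgun coverage rule earlier in this section (between 5L/400 and 10L/400 fragments for a sequence of length L), the arithmetic can be sketched as below. The function name `fragments_needed` and its parameters are made up for this illustration; the read length of 400 and the 5-10 fold coverage come from the slide. For L = 4.5 million bases the rule gives 56,250 to 112,500 fragments.

```python
import math


def fragments_needed(genome_length, read_length=400, coverage=(5, 10)):
    """Return (low, high) fragment counts for the given fold-coverage range.

    Hypothetical helper: applies the slide's rule N = c * L / read_length
    for the lower and upper coverage c, rounding up to whole fragments.
    """
    low_cov, high_cov = coverage
    low = math.ceil(low_cov * genome_length / read_length)
    high = math.ceil(high_cov * genome_length / read_length)
    return low, high


if __name__ == "__main__":
    print(fragments_needed(4_500_000))        # E. coli, 4.5 Mb
    print(fragments_needed(3_000_000_000))    # human, 3000 Mb
```

The rule only states an average: it does not guarantee that every position of the sequence is covered, which is one reason real projects aim for the upper end of the range.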
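
The two assembly steps described earlier in this section (suffix-prefix matching, then repeatedly merging the best-overlapping pair) can be sketched in a few lines. This is a minimal illustration that assumes error-free reads and exact matching only; it is not how Phrap or CAP3 work internally, and the names `overlap` and `greedy_assemble` are invented here. The three reads are cut by hand from the slide's example string AGCTGGGCCCATTAACG.

```python
from itertools import permutations


def overlap(s1, s2, min_len=1):
    """Length of the longest suffix of s1 that exactly equals a prefix of s2."""
    for k in range(min(len(s1), len(s2)), min_len - 1, -1):
        if s1.endswith(s2[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Greedy layout: merge the pair with the longest exact overlap, repeat."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are wholly contained in another read.
    contigs = [r for r in contigs
               if not any(r != other and r in other for other in contigs)]
    while len(contigs) > 1:
        best_len, best_a, best_b = 0, None, None
        for a, b in permutations(contigs, 2):
            k = overlap(a, b, min_len)
            if k > best_len:
                best_len, best_a, best_b = k, a, b
        if best_len == 0:
            break  # no remaining pair overlaps by at least min_len
        contigs.remove(best_a)
        contigs.remove(best_b)
        contigs.append(best_a + best_b[best_len:])
    return contigs


if __name__ == "__main__":
    reads = ["AGCTGGGCC", "GGGCCCATT", "CCATTAACG"]
    print(overlap(reads[0], reads[1]))
    print(greedy_assemble(reads))
```

The greedy merge is a heuristic for the shortest superstring problem and is not guaranteed to find S*(P). With real data the exact comparison in `overlap` would be replaced by a similarity score that tolerates sequencing errors, and repeats can still mislead the layout.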
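
The ASCII versus binary storage idea from earlier in this section (A = 00, T = 01, C = 10, G = 11, four bases per byte) can be illustrated as follows. This is a sketch under simplifying assumptions: only the letters A, C, G and T are accepted (ambiguity codes such as N would raise a KeyError), the sequence length has to be stored separately, and the layout is not the one used by formatdb or any other real tool. The names `pack` and `unpack` are hypothetical.

```python
# Two-bit codes as given on the slide.
CODE = {"A": 0b00, "T": 0b01, "C": 0b10, "G": 0b11}
BASE = {bits: base for base, bits in CODE.items()}


def pack(seq):
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4].upper()
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack(data, length):
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASE[(byte >> shift) & 0b11])
    return "".join(bases[:length])


if __name__ == "__main__":
    packed = pack("cgttatttaa")
    print(len(packed), "bytes instead of 10")
    print(unpack(packed, 10))
```

Ten bases fit in three bytes here, against ten bytes as ASCII text, which is the fourfold saving the slide describes.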
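
To make the format conversion discussed earlier in this section concrete, here is a small sketch that turns an aligned FASTA file into the Phylip layout shown on the slides (a count line, then a 10-character name followed by blocks of 10 residues). It is only an illustration and not a replacement for Readseq or the clustalw -convert option; `read_fasta` and `to_phylip` are names invented for this example, and strict Phylip readers may have requirements this sketch does not cover.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            words = line[1:].split()
            # Use the first word of the header, without a trailing comma.
            name = words[0].rstrip(",") if words else ""
            chunks = []
        else:
            chunks.append(line.replace(" ", ""))
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def to_phylip(records, block=10):
    """Format aligned (name, sequence) pairs in a Phylip-style layout."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    lines = [f" {len(records)} {lengths.pop()}"]
    for name, seq in records:
        blocks = " ".join(seq[i:i + block] for i in range(0, len(seq), block))
        lines.append(f"{name[:10]:<10} {blocks}")
    return "\n".join(lines)


if __name__ == "__main__":
    fasta = """>ftsy_bucai, 44 bases, 52FF1180 checksum.
KNS-EKLYFLLKRKMFNILKKVEIP---LEISSHSPFVILVVGV
>ftsy_ecoli, 44 bases, B14506B6 checksum.
RDA-EALYGLLKEEMGEILAKVDEP---LNVEGKAPFVILMVGV
"""
    print(to_phylip(read_fasta(fasta)))
```

Truncating names to 10 characters mirrors the fixed-width name field in the slide's Phylip example; names that collide after truncation would need renaming by hand.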