Dates in the history of sequence information
Experimental
1953         Amino acid sequence of insulin
1965         Base sequence of a tRNA molecule, 75 bases
1970s        Birth of cloning
1975, 1977   DNA sequencing (Sanger, F.)
1976         Sequence of MS2 RNA, 3569 bases
1978         Bacteriophage fd genome (6408 bases)
1995         Genomes of H. influenzae and M. genitalium (1.8 / 0.6 million bases)
2000         >95% of human genome (3000 million bases)

Sequence analysis
1970         Global alignment (Needleman-Wunsch)
1978         PAM matrix (Dayhoff)
1981         Local alignment (Smith-Waterman)
1982         EMBL database
1985         FASTA
1985         Multiple sequence alignment, first attempts
1988         Neural networks (protein secondary structure)
1989         Clustal
1989         Profilesearch
1991         BLAST
1994         Profile HMMs
1994         Context-free grammars (RNA secondary structure)
1997         PSI-BLAST
Pairwise alignment methods
Sequence assembly
Storing sequences - sequence formats
General comments on sequence alignment in the
context of database searches
Determining the sequence of a longer piece of DNA
1) Detailed map
or
2) Shotgun sequencing
How many fragments should be picked to satisfy the expectation
of sufficient overlap?
5-10 fold coverage needed
Full sequence string is length L
Read length = 400
Number of fragments picked should be between 5L/400 and 10L/400
Example:
E. coli genome, 4.5 Mb => about 56,000-113,000 fragments
Human genome, 3000 Mb => about 38-75 million fragments
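The rule of thumb above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function name `fragments_needed` is made up for illustration, and the genome sizes are the rounded values quoted on the slide.

```python
import math

def fragments_needed(genome_length, read_length=400, coverage=5):
    """Reads needed so that reads * read_length = coverage * genome_length."""
    return math.ceil(coverage * genome_length / read_length)

# Genome sizes as quoted above (rounded values, in bases).
genomes = [("E. coli", 4_500_000), ("Human", 3_000_000_000)]

for name, length in genomes:
    low = fragments_needed(length, coverage=5)    # 5-fold coverage
    high = fragments_needed(length, coverage=10)  # 10-fold coverage
    print(f"{name}: {low:,} to {high:,} fragments")
```

The read length of 400 is the slide's value; longer or shorter reads change the counts proportionally.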
Shotgun DNA sequencing
* The analysis of experimental raw data
Base-calling software (like ‘Phred’):
* Derive base sequence
* bases are assigned a quality score
(Figure: examples of good-quality and bad-quality sequence data)
You may choose to remove low-quality sequence on the
basis of these scores
* Cleaning up sequences
Quality filtering
Vector filtering
* Sequence assembly
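As an illustration of quality filtering, the sketch below clips low-quality bases from both ends of a read. It relies on the standard Phred definition, in which a quality Q corresponds to an error probability of 10^(-Q/10). The threshold of 20 and the function names are assumptions chosen for the example, not settings taken from Phred or any particular pipeline.

```python
def error_probability(q):
    """Phred quality Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-q / 10)

def trim_low_quality(seq, quals, min_q=20):
    """Clip bases from both ends until a base with quality >= min_q is reached."""
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]

# Hypothetical read with one quality value per base.
seq = "ACGTACGTAC"
quals = [5, 8, 25, 30, 40, 38, 12, 35, 9, 4]
trimmed_seq, trimmed_quals = trim_low_quality(seq, quals)
print(trimmed_seq)
```

Note that this only clips the ends; a low-quality base in the middle of the read (quality 12 here) is kept and would have to be handled by the assembler.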
Sequence assembly
* Overlap detection
* Fragment layout
* Deciding the consensus
Step 1: Suffix-prefix matching:
AGCTGGGCCCATTAACG
AGCT  GCC  TTAACG
AGCT = 'prefix', GCC = 'substring', TTAACG = 'suffix'
Two sequences S1 and S2.
If no sequencing errors: Compute longest suffix of S1 that exactly
matches a prefix of S2.
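For the error-free case, a minimal sketch is to try overlap lengths from longest to shortest and return the first that matches. The function name is illustrative, and this simple version takes quadratic time in the worst case; real assemblers use faster string-matching methods.

```python
def longest_exact_overlap(s1, s2, min_length=1):
    """Length of the longest suffix of s1 that equals a prefix of s2 (0 if none)."""
    for k in range(min(len(s1), len(s2)), min_length - 1, -1):
        if s1.endswith(s2[:k]):
            return k
    return 0

s1 = "AGCTGGGCCCATTAACG"
s2 = "TTAACGGGTTCA"       # hypothetical second read
k = longest_exact_overlap(s1, s2)
print(k, s1[len(s1) - k:])
```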
However, there are sequencing errors:
Find a suffix of S1 and a prefix of S2
whose similarity is maximum over all suffix-prefix pairs
of S1 and S2
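With sequencing errors, the best suffix-prefix pair can be found by dynamic programming, in the same spirit as the pairwise alignment methods. The sketch below lets the alignment start anywhere in S1 at no cost, forces it to start at the first base of S2 and to end at the last base of S1, and then takes the best-scoring prefix of S2. The scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for illustration.

```python
def best_suffix_prefix(s1, s2, match=1, mismatch=-1, gap=-2):
    """Return (score, j): best similarity between a suffix of s1 and the prefix s2[:j]."""
    n, m = len(s1), len(s2)
    # Row 0: nothing of s1 consumed, so a prefix of s2 can only be aligned to gaps.
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, n + 1):
        # Column 0 stays 0: the suffix of s1 may start at any position for free.
        cur = [0] * (m + 1)
        for j in range(1, m + 1):
            diag = prev[j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    # The last row corresponds to alignments ending at the end of s1.
    best_j = max(range(m + 1), key=lambda j: prev[j])
    return prev[best_j], best_j

# Hypothetical reads: the true overlap ACGTAC / ACGAAC contains one mismatch.
score, prefix_len = best_suffix_prefix("GGGGACGTAC", "ACGAACTTTT")
print(score, prefix_len)
```

A score of 0 with a prefix length of 0 means that no overlap scored better than not overlapping at all.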
Heuristic speedup: BLAST-like approaches (two strings with a
substantial overlap should share at least one fairly long
common substring)
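One way to picture this heuristic is a k-mer index: only read pairs that share at least one exact word of length k are passed on to the slower overlap computation. The sketch below is an assumption-laden illustration (the word length of 6, the function name and the example reads are made up, and reverse-complement matches are ignored); it is not how BLAST or any specific assembler is implemented.

```python
from itertools import combinations

def candidate_pairs(reads, k=6):
    """Return index pairs (a, b), a < b, of reads sharing at least one k-mer."""
    index = {}
    for read_id, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            index.setdefault(read[i:i + k], set()).add(read_id)
    pairs = set()
    for ids in index.values():
        # Every pair of reads containing this k-mer is a candidate overlap.
        pairs.update(combinations(sorted(ids), 2))
    return pairs

reads = ["AGCTGGGCCCATTAACG", "CCATTAACGGTTCAGT", "TTTTTTTTGGGGGGGG"]
print(candidate_pairs(reads))
```

Only the pairs returned here would need the full suffix-prefix comparison, which avoids comparing every read against every other read.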
‘The shortest superstring problem’
Given a set of k strings P = {S1, S2, ..., Sk},
a superstring of the set P is a single string that contains
every string in P as a substring.
The shortest superstring of P is denoted S*(P).
Step 2: Substring layout
1) The string pair with the highest-scoring prefix-suffix match is
selected and merged
2) The next highest-scoring pair is selected and merged
etc.
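The greedy merging described above can be sketched as follows. As a simplifying assumption, the score of a pair is the length of its exact suffix-prefix overlap, so sequencing errors are ignored, and fragments contained in other fragments are dropped first. The greedy result is a superstring but is not guaranteed to be the shortest one.

```python
def overlap(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_superstring(fragments):
    """Repeatedly merge the pair of strings with the largest overlap."""
    unique = sorted(set(fragments))
    # Drop fragments that are substrings of another fragment.
    frags = [f for f in unique if not any(f != g and f in g for g in unique)]
    while len(frags) > 1:
        best_k, best_i, best_j = -1, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags[0] if frags else ""

# Hypothetical fragments cut from the example string used in Step 1.
print(greedy_superstring(["AGCTGGGCC", "GGCCCATTA", "ATTAACG"]))
```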
Sequence assembly in practice:
1) Phrap (Phil Green)
("phragment assembly program",
or "phil's revised assembly program";
a homonym of "frappe" = French for
"swat")
Makes use of an efficient implementation of the
Smith-Waterman (SW) algorithm (= 'swat') to find overlaps.
2) CAP3 ('contig assembly program')
(Huang)
Makes use of a BLAST-like approach to
find overlaps
CAP3
Clipping of 5’ and 3’ low quality regions
Base quality values are used in computation
of overlaps
Forward - reverse constraints
Repeats make the assembly process difficult
Scaffold - a collection of ordered contigs with approximately
known distances between them
Sequence formats / ASCII-Binary formats:
In ASCII text format (= human readable) each
character is stored as a byte; for instance,
the ASCII code of 'A' is 65 as a decimal
number = 01000001 as a binary number.
However, sequence data is often stored
in a binary format.
For instance, in a two-bit encoding the four
bases may be stored as:
A = 00
T = 01
C = 10
G = 11
In this way there is room in one byte (= 8 bits) for 4 bases.
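A minimal sketch of such a two-bit packing, using the code table above, is shown below. The function names and the choice to left-align a final partial byte are assumptions made for this example; actual database formats add headers, handle ambiguity codes such as N, and differ in detail.

```python
CODE = {"A": 0b00, "T": 0b01, "C": 0b10, "G": 0b11}
BASE = {bits: base for base, bits in CODE.items()}

def pack(seq):
    """Pack a DNA string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # pad a final partial chunk with zero bits
        out.append(byte)
    return bytes(out)

def unpack(data, length):
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASE[(byte >> shift) & 0b11])
    return "".join(bases[:length])

packed = pack("ATCGA")
print(list(packed), unpack(packed, 5))
```

The sequence length has to be stored separately, because the padding bits of the last byte are indistinguishable from the base A.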
Typically, databases are downloaded by the bioinformatician in ASCII
format but then reformatted into a binary format for use with
different sequence analysis tools. One example is databases for
BLAST searches, which may be formatted with the NCBI utility 'formatdb'.
Sequence formats
Examples
1. EMBL
ID   LISOD      standard; DNA; PRO; 756 BP.
XX
AC   X64011; S78972;
XX
SV   X64011.1
XX
DT   28-APR-1992 (Rel. 31, Created)
DT   30-JUN-1993 (Rel. 36, Last updated, Version 6)
XX
DE   L.ivanovii sod gene for superoxide dismutase
....
....
SQ   Sequence 756 BP; 247 A; 136 C; 151 G; 222 T; 0 other;
     cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat        60
     gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa       120
     ttaccaaaat taccttatac ttatgatgct ttggagccga attttgataa agaaacaatg       180
....
....
2. Fasta
>LISOD L.ivanovii sod gene for superoxide dismutase
cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat
gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa
ttaccaaaat taccttatac ttatgatgct ttggagccga attttgataa agaaacaatg
....
....
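Because the Fasta layout is so simple (a '>' header line followed by sequence lines), it can be read with a few lines of code. The sketch below is a minimal reader written for this illustration; the function name is made up, and real parsers such as those in established toolkits handle more edge cases.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from Fasta-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Remove the spaces that separate blocks of ten bases.
            chunks.append(line.replace(" ", ""))
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

example = """>LISOD L.ivanovii sod gene for superoxide dismutase
cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat
"""
for header, sequence in parse_fasta(example):
    print(header, len(sequence))
```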
3. GCG
lisod.seq  Length: 756  October 27, 2000 13:17  Type: N  Check: 5188  ..

       1  cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc
      51  gccttacaat gtaatttctt ttcacataaa taataaacaa tccgaggagg
     101  aatttttaat gacttacgaa ttaccaaaat taccttatac ttatgatgct
....
....
Multiple sequence format of GCG
1.msf  MSF: 44  Type: P  October 24, 2002 12:41  Check: 9117 ..

 Name: ftsy_bucai  Len: 44  Check: 4221  Weight: 1.00
 Name: ftsy_ecoli  Len: 44  Check: 2326  Weight: 1.00
 Name: ftsy_aquae  Len: 44  Check: 2339  Weight: 1.00
 Name: ftsy_bacsu  Len: 44  Check: 6177  Weight: 1.00
 Name: sr54_aciam  Len: 44  Check: 6296  Weight: 1.00
 Name: sr54_aerpe  Len: 44  Check: 7291  Weight: 1.00
 Name: sr54_arcfu  Len: 44  Check: 122   Weight: 1.00
 Name: sr54_aquae  Len: 44  Check: 345   Weight: 1.00

//

            1                                            44
ftsy_bucai  KNS.EKLYFL LKRKMFNILK KVEIP...LE ISSHSPFVIL VVGV
ftsy_ecoli  RDA.EALYGL LKEEMGEILA KVDEP...LN VEGKAPFVIL MVGV
ftsy_aquae  KEG.EKIKEL LKKELKELLK NCQ...GELK IPEKVGAVLL FVGV
ftsy_bacsu  QDP.KEVQSV ISEKLVEIYN SGDEQISELN IQDGRLNVIL LVGV
sr54_aciam  PTYIERREWF IKIVYDELSN LFGGDKEPKV IPDKIPYVIM LVGV
sr54_aerpe  PPGVTRRDWM IKIVYEELVK LFGGDQEPQV DPPKTPWIVL LVGV
sr54_arcfu  LPALNAKEQI LKIVYEELLR GVGEGLEIPL KKAK....IM LVGL
sr54_aquae  PKNLSPAEFV IKTVYEELVD ILGGEK.... .ADLKKGTVL FVGL
FASTA multiple sequence format
>ftsy_bucai, 44 bases, 52FF1180 checksum.
KNS-EKLYFLLKRKMFNILKKVEIP---LEISSHSPFVILVVGV
>ftsy_ecoli, 44 bases, B14506B6 checksum.
RDA-EALYGLLKEEMGEILAKVDEP---LNVEGKAPFVILMVGV
>ftsy_aquae, 44 bases, 27571BEE checksum.
KEG-EKIKELLKKELKELLKNCQ---GELKIPEKVGAVLLFVGV
>ftsy_bacsu, 44 bases, FA023A4F checksum.
QDP-KEVQSVISEKLVEIYNSGDEQISELNIQDGRLNVILLVGV
>sr54_aciam, 44 bases, 2FC13632 checksum.
PTYIERREWFIKIVYDELSNLFGGDKEPKVIPDKIPYVIMLVGV
>sr54_aerpe, 44 bases, 37AFB895 checksum.
PPGVTRRDWMIKIVYEELVKLFGGDQEPQVDPPKTPWIVLLVGV
>sr54_arcfu, 44 bases, 8294461 checksum.
LPALNAKEQILKIVYEELLRGVGEGLEIPLKKAK----IMLVGL
>sr54_aquae, 44 bases, 794768A2 checksum.
PKNLSPAEFVIKTVYEELVDILGGEK-----ADLKKGTVLFVGL
CLUSTAL multiple sequence format
CLUSTAL W (1.81) multiple sequence alignment
ftsy_bucai      KNS-EKLYFLLKRKMFNILKKVEIP---LEISSHSPFVILVVGV
ftsy_ecoli      RDA-EALYGLLKEEMGEILAKVDEP---LNVEGKAPFVILMVGV
ftsy_aquae      KEG-EKIKELLKKELKELLKNCQ---GELKIPEKVGAVLLFVGV
ftsy_bacsu      QDP-KEVQSVISEKLVEIYNSGDEQISELNIQDGRLNVILLVGV
sr54_aciam      PTYIERREWFIKIVYDELSNLFGGDKEPKVIPDKIPYVIMLVGV
sr54_aerpe      PPGVTRRDWMIKIVYEELVKLFGGDQEPQVDPPKTPWIVLLVGV
sr54_arcfu      LPALNAKEQILKIVYEELLRGVGEGLEIPLKKAK----IMLVGL
sr54_aquae      PKNLSPAEFVIKTVYEELVDILGGEK-----ADLKKGTVLFVGL
                         .:.    ::                    ::.**:
Phylip multiple sequence format
 8 44
ftsy_bucai KNS-EKLYFL LKRKMFNILK KVEIP---LE ISSHSPFVIL VVGV
ftsy_ecoli RDA-EALYGL LKEEMGEILA KVDEP---LN VEGKAPFVIL MVGV
ftsy_aquae KEG-EKIKEL LKKELKELLK NCQ---GELK IPEKVGAVLL FVGV
ftsy_bacsu QDP-KEVQSV ISEKLVEIYN SGDEQISELN IQDGRLNVIL LVGV
sr54_aciam PTYIERREWF IKIVYDELSN LFGGDKEPKV IPDKIPYVIM LVGV
sr54_aerpe PPGVTRRDWM IKIVYEELVK LFGGDQEPQV DPPKTPWIVL LVGV
sr54_arcfu LPALNAKEQI LKIVYEELLR GVGEGLEIPL KKAK----IM LVGL
sr54_aquae PKNLSPAEFV IKTVYEELVD ILGGEK---- -ADLKKGTVL FVGL
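To show how the same alignment maps from one layout to another, the sketch below writes aligned sequences in a Phylip-like form: a header line with the number of sequences and the alignment length, then one line per sequence with a 10-character name and blocks of ten residues. It is a simplified illustration with a made-up function name; strict Phylip readers can be particular about name width and spacing, so a dedicated converter is the safer choice in practice.

```python
def to_phylip(records, block=10):
    """Format (name, aligned_sequence) pairs as Phylip-like text."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    length = lengths.pop()
    lines = [f"{len(records)} {length}"]
    for name, seq in records:
        chunks = [seq[i:i + block] for i in range(0, length, block)]
        # Names are truncated or padded to exactly 10 characters.
        lines.append(f"{name[:10]:<10} " + " ".join(chunks))
    return "\n".join(lines)

records = [
    ("ftsy_bucai", "KNS-EKLYFLLKRKMFNILKKVEIP---LEISSHSPFVILVVGV"),
    ("ftsy_ecoli", "RDA-EALYGLLKEEMGEILAKVDEP---LNVEGKAPFVILMVGV"),
]
print(to_phylip(records))
```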
Conversion of sequence formats - Readseq
(one useful version of readseq is part of the SAM package
http://www.cse.ucsc.edu/research/compbio/sam2src/)
Readseq can convert between the following formats:
1. IG/Stanford          10. Olsen (in-only)
2. GenBank/GB           11. Phylip3.2
3. NBRF                 12. Phylip
4. EMBL                 13. Plain/Raw
5. GCG                  14. PIR/CODATA
6. DNAStrider           15. MSF
7. Fitch                16. ASN.1
8. Pearson/Fasta        17. PAUP/NEXUS
9. Zuker (in-only)      18. Pretty (out-only)
GCG package utilities:
Convert         from       to
* Reformat      text       GCG
* FromEMBL      EMBL       GCG
* FromGenBank   Genbank    GCG
* FromFasta     Fasta      GCG
* ToFastA       GCG        Fasta
Clustal:
clustalw my_alignment -convert -output=gcg
clustalw my_alignment -convert -output=phylip
Downloading bioinformatics data and
software for local use