Sequence database similarity search
- Input: sequence query
- Output: list of similar sequences (“hits”) found in the database
- Sequence database similarity search implies pairwise alignments of the query to all entries in the database.
- A straightforward dynamic programming algorithm is too slow to be efficient in this case.
- A faster search can be realized by searching for “words”: stretches of similar oligomers in two sequences (the query and a subject sequence from the database).

FASTA (Pearson & Lipman, 1988)
[Figure: reproduced page from Pearson and Lipman, Proc. Natl. Acad. Sci. USA 85 (1988), p. 2445. FIG. 1. Identification of sequence similarities by FASTA. The four steps used by the FASTA program to calculate the initial and optimal similarity scores between two sequences are shown. (A) Identify regions of identity. (B) Scan the regions using a scoring [...]]

(A) A search for “words”, for instance, with at least 4 consecutive matches. Then any diagonal region in the matrix can be scored using some formula that takes into account the number of words and the distances between them. The best (say, 10) diagonal regions are selected at this step.
(B) The best 10 regions are rescored using a substitution matrix, allowing conservative substitutions and shorter runs of identities to contribute to the similarity score. The subregions with maximal scores are identified for each of the best diagonals (“initial regions”).
(C) The algorithm joins some of the compatible initial regions, computing an optimal combination.
(D) Within some band centered around the highest scoring initial region, the alignment is recalculated using a dynamic programming algorithm.

Text visible on the reproduced page (Pearson & Lipman, 1988):
[...] only the band around each initial region but also potential sequence alignments for some distance before and after the initial region. Starting at the end of the initial region, an optimization (6) proceeds in the reverse direction until all possible alignment scores have gone to zero. The location of the maximal local similarity score in the reverse direction is then used to start a second optimization that proceeds in the forward direction. An optimal path starting from the forward maximum is then displayed (5). The local homologies can be displayed as sequence alignments (see Fig. 2B) or on a two-dimensional matrix graphic plot (see Figs. 2A and 3).
Statistical Significance. The rapid sequence comparison algorithms we have developed also provide additional tools for evaluating the statistical significance of an alignment. There are approximately 5000 protein sequences, with 1.1 million amino acid residues, in the NBRF protein sequence library, and any computer program that searches the library by calculating a similarity score for each sequence in the library will find a highest scoring sequence, regardless of whether the alignment between the query and library sequence is biologically meaningful or not. Accompanying the previous version of FASTP was a program for the evaluation of statistical significance, RDF, which compares one sequence with randomly permuted versions of the potentially related sequence. We have written a new version of RDF (RDF2) that has several improvements. (i) RDF2 calculates three scores for each shuffled sequence: one from the best single initial region (as found by FASTP), a second from the joined initial regions (used by FASTA), and a third from the optimized diagonal.

BLAST: Basic Local Alignment Search Tool
BLAST is the most popular program for sequence database similarity search.
First publication: Altschul et al. (1990).
[Figure: matching words in two sequences, with score S ≥ T.]
Main strategy:
- Searching for “words” in a subject sequence from the database satisfying a criterion of a word size at least W and a score (S) at least T compared to a word in the query.
- If a word is found, the BLAST algorithm attempts to extend it and improve the score S.
- The algorithm is designed for local alignments: if further extension does not improve S, the alignment region between the query and the subject sequence (“sequence hit”) with the maximal S is returned to the user.
- The result of BLAST is a list of hits, ordered according to their significance (E-values).

BLAST: Basic Local Alignment Search Tool
Say, searching with a query: ...FDRIGDGETKLVTPVPT...
“w-mers”: words that score at least T when compared to some word (e.g. VTP) in the query.
With W=3, T=11 and the BLOSUM62 matrix, w-mer scores calculated for VTP:
VTP 16, ITP 15, LTP 13, MTP 13, ATP 12, TTP 12, CTP 11, FTP 11, YTP 11
VSP 12, VAP 11, VNP 11, VVP 11
Subject ...VDQHGAPPEQRITPRQQ... contains ITP (S=15) => the algorithm proceeds with the extension phase (e.g. alignment by dynamic programming):
Query ...FDRIGDGETKLVTPVPT...
Sbjct ...VDQHGAPPEQRITPRQQ...
Score improved?

Word extension search in the original BLAST algorithm (Altschul et al., 1990)
[Figure: score plotted against extension length for an HSP (high-scoring segment pair). X: significance decay; S: minimum score to return a hit in the output; T: word threshold.]

The statistics of pairwise alignments
Expected number (E-value) of ungapped HSPs with score at least S in the alignment of sequences with sufficiently large lengths m and n:
E = K m n exp(-λS),
where K and λ depend on the scoring system and monomer frequencies.
The normalized raw score S' = (λS - ln K) / ln 2 is a “bit score” characterizing HSP significance:
E = m n 2^(-S') (not dependent on the scoring system).
For gapped local alignments the statistics can be determined from large-scale comparisons of quasi-random sequences.
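As a minimal sketch of the ungapped E-value and bit-score relations given above, the Python functions below evaluate both forms and show that they give the same E-value. The parameter values (λ = 0.3176 and K = 0.134, values commonly quoted for ungapped BLOSUM62 scoring) and the lengths m and n are illustrative assumptions, not values taken from these slides.

```python
import math


def e_value(raw_score, m, n, K, lam):
    """E = K * m * n * exp(-lam * S): expected number of ungapped HSPs with score >= S."""
    return K * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, K, lam):
    """S' = (lam * S - ln K) / ln 2: raw score normalized to bits."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value_from_bits(bits, m, n):
    """E = m * n * 2^(-S'): the same E-value, written without K and lam."""
    return m * n * 2.0 ** (-bits)


# Assumed, illustrative parameters (not from the slides).
LAMBDA = 0.3176          # assumed lambda for the scoring system
K_PARAM = 0.134          # assumed K for the scoring system
QUERY_LEN = 250          # hypothetical query length m
DB_LEN = 50_000_000      # hypothetical total database length n

if __name__ == "__main__":
    for s in (40, 60, 80):
        bits = bit_score(s, K_PARAM, LAMBDA)
        e_raw = e_value(s, QUERY_LEN, DB_LEN, K_PARAM, LAMBDA)
        e_bits = e_value_from_bits(bits, QUERY_LEN, DB_LEN)
        print(f"S={s}  bits={bits:.1f}  E(raw)={e_raw:.3g}  E(bits)={e_bits:.3g}")
```

Because the bit score already absorbs K and λ, two hits with the same bit score have the same E-value for given m and n, whatever scoring system produced them.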
Global alignments: no general statistical theory. Significance can be determined by generating a large number of alignments of permuted sequences (with the same lengths and monomer frequencies as those of the sequences in question).

Gapped BLAST (Altschul et al., 1997)
Two-hit approach: initial search for two non-overlapping hits of score at least T, within a distance A of one another on a diagonal in sequence space.
[Figure: two word hits, each with S ≥ T, within distance A on one diagonal.]
Ungapped extension: if the ungapped extension scores better than some threshold Sg, gapped extension is triggered. Sg is chosen, e.g., so that not more than one gapped extension is invoked per 50 database sequences, corresponding to Sg = 22 bits.
[Figure: word of size W; ungapped extension with S' ≥ Sg.]

BLAST+ (Camacho et al., 2009)
[Figure: BMC Bioinformatics 2009, 10:421, http://www.biomedcentral.com/1471-2105/10/4, Figure 1. Schematic of a BLAST search. The first phase is "setup". The query is read, low-complexity or other filtering might be [...]
Flowchart labels:
Setup: Read query; Read options; Mask query; Build lookup table.
Scanning: Find word matches; Gap free extensions; Gapped extensions; Matches? (Y/N); Save hits; More sequence? (Y/N).
Trace-back: Calculate improved score and insertions/deletions.]

Exercises (Sequence databases, sequence alignment, sequence database similarity search)
1. The NCBI reference sequence of human beta-globin mRNA has the accession NM_000518. What is the accession number of the encoded protein? How many amino acids does it contain?
2.
A plant transcription factor TOE1 (TARGET OF EARLY ACTIVATION TAGGED 1) was recently suggested to regulate the development of specialized organs called nodules in legume plants, e.g. in soybean. Using the TOE1 sequence from the model organism Arabidopsis thaliana (NP_001189625), search for similar proteins in soybean (Glycine max) using the BLAST program. How many hits with Glycine max proteins does the BLAST search yield? What are the name and accession number of the protein returned as the best G. max hit? Is its similarity to the TOE1 query significant? Give the E-value of this alignment, the number of identical amino acids and the number of conservative substitutions (with positive scores).
3. The following DNA sequence fragment, containing some mutation, was isolated from a patient:
tttgctccccgcgcgctgtttttctcagtgactttcagcgggcggaaaag
In which gene is the mutation located? On which chromosome? How many nucleotides are changed?
4. A special option of BLAST for pairwise alignment of two sequences (bl2seq) is sometimes a quick way to determine the similarity between two closely related sequences.
(a) For instance, determine the identity percentage in the alignment of two genomes of Zika virus: that of an isolate from Suriname, October 2015 (accession KU312312), and the reference genome (NC_012532).
- Also, using bl2seq for proteins, determine the identities and positives in the alignment of the polyproteins encoded by these two genomes. Locate the amino acid positions of gaps in the alignment: which residues are inserted (or deleted) in the Suriname isolate as compared to the reference genome?
- Are there insertions or deletions in the Suriname isolate polyprotein as compared to the polyprotein of Zika virus from a French Polynesia outbreak in 2013 (KJ776791)?
(b) Is this BLAST option also optimal for the proteins of assignment 2: A. thaliana TOE1 (NP_001189625) and its homolog from G. max? What is the alternative algorithm/program in this case?
Determine the percentage of similarity between these two proteins.
Send your solutions to [email protected]
5. It has been reported that a transcript annotated as a long non-coding RNA in the mouse genome encodes a peptide of 34 amino acids with the following sequence: MAEKESTSPHLIVPILLLVGWIVGCIIVIYIVFF. It was also suggested that a transcript annotated as a long non-coding RNA in the human genome (accession NR_037902) might also contain a small open reading frame (ORF) encoding a similar peptide. Determine the nucleotide positions of this ORF in the human transcript, the sequence of the peptide and its length.
6. Using the Entrez Gene database, determine the differences between the alternative splicing isoforms of the human microtubule-associated protein tau (MAPT, GeneID 4137). How many exons does the tau gene contain according to the RefSeqGene data? How many exons do the alternative transcripts lack?
7. Calculate pairwise alignments of two homologous segments (PA segments, accessions CY046942 and EF626633) of influenza A and B viruses using different algorithms:
(a) Make the global optimal alignment with the needle algorithm (www.ebi.ac.uk/Tools/emboss/align/);
(b) Using the option of BLAST for two sequences (bl2seq), align these two nucleotide sequences with the blastn algorithm;
(c) Use bl2seq again, but with tblastx: alignment of translated nucleotide sequences.
What are the main differences between the alignment results? Try to explain the origins of these differences. What are the advantages and disadvantages of each of these approaches in this case?