Download BLAST- bioinformatics

Chapter 11 Assessing Pairwise Sequence Similarity: BLAST and FASTA (Lecture follows chapter pretty closely) This lecture is designed to introduce you to the theory and practice of performing the variety of sequence similiarity searches available via the NCBI BLAST service Searching primary databases using sequence similarity Basic Local Alignment Search Tool BLAS T BLAST is a computer algorithm that returns sequences in the database with the highest percentage of bases in common to the query sequence Sequence alignments • The first line of evidence that two sequences are, because of a shared evolutionary history, related. • If two sequences (DNA or protein) are related by descent, they will be related by sequence. A B C The “evolutionary distance” between A and B is smaller, and therefore, under an assumption of brownian motion (no selection) a given stretch of DNA or protein will share more nucleotides or amino acids in common. All search algorithms will produce results when queried! • The trick is to be able to (i) trust and evaluate the result, and (ii) to be able to quantify this evaluation. Reasons for performing BLAST searches • There can be many reasons, but common ones are: Human or Computational annotation: For non-model systems, for which little bench work has been done compared to a model system, sequence alignments with known, experimentally verified genes, can aid in the assignment of function Evolutionary: discovering similar sequences in different organisms allows one to ask whether and how sequence-level changes result in functional changes. Can be done for coding or non-coding (i.e. regulatory regions) . Multiple sequence alignments can help identify conserved regions of coding sequences, which might have functional significance, or to help understand evolutionary relationships among difficult to classify organisms. Multiple sequence alignments can also help with the development of primers in order to easily clone out a cDNA from an organism for which genome sequencing has not been done. Sequence similarity can only be ascertained by aligning two sequences ACGGCATCCGACGCTTAGCGGACTATCGATCTGA ACCCGGCCTACGGCTACTCGCTTAGCGGACTCGG Some basic concepts: Sequence Similarity (Data) Homology (Inference) Percent similarity of base pairs Similar because of common between any two sequences, descent from an ancestor that over a given length of contained that sequence. sequence Continuous quantity any real number (0-100%) Categorical quantity, either two sequences are homologous or not* *Because of a variety of genomic rearrangement phenomena, two sequences that code for a non-homologous protein per se, can contain sub-sequences that are indeed homologous. This is actually a source of false positive hits during Blast searches Two kinds of Homology: Paralogy vs. Orthology Orthologues are two sequences that are related because that sequence existed in an ancestor. Paralogues are two or more sequences that are related because a gene duplication event Paralogous sequences AGCCTATGGCAA ACGCTAGGGCTT ACGCTATGGCAA ACGCTAGGGGCAA Orthologous sequences Ancestral sequence ACGCTACGGGCAA Global versus local alignment Local Alignment Global Alignment Best alignment of two sequences, based on regions of sub-sequence (the local part) with the highest similarity Best alignment of two sequences along their entire length. Used with highly similar sequences. Better at finding weak similarities and functional domains within coding sequences Better for multiple sequence alignments Most common in database searches Not particularly useful for database searches. DNA versus protein alignment DNA Can be much longer protein Only as long as the longest coding sequence (<5K a.a.) Because of the wobble Are more sensitive effect, are less because amino acids sensitive and tend to tend to be conserved miss sequence along functional relationships among domains, so even distantly related weak total similarities species can still be detected Dotplot: for comparing two sequences • A dotplot is a simple techniqe that yields a graphical, but not statistical, representation of sequence similarity (see Fig 11.1) A C G T C G T A A C G A G G T A Dot plots • Good: Easy visualization of syntenic regions of long regions of genomic sequence Easy visualization of exon/intron boundaries Easy visualization and enumeration of tandemly repeated sequence elements • Bad: Can only compare two sequences at a time. No numerical evaluation of the degree of similarity – we examine this next Scoring matrices • When searching a complex database full of billions of nucleotides worth of sequences, we must not only identify related sequences, but develop the ability to score how “good” the match is, and then rank these “hits” in a list or a table. A scoring matrix is: an empirical weighting scheme used in all sequence comparisons Example here? DNA Scoring matrices A G C T Are all transitions and transversions equally likely? Different scoring matrices make different assumptions about this, but it should be clear that the wobble effect, numerical probability and molecular mech. matters here. Amino Acid Scoring Matrices • A bit more complicated, because: there are 20 possible substitutions at any particular site Some substitutions are more constrained by function than others. In other words, we need to distinguish between absolute conservation (dark blue) and functional conservation (light blue). Some amino acids are more rare among all proteins, or within proteins, so changes in these amino acids must be given higher weight. Depending upon evolutionary distance, some amino acid changes are more likely than others. The log odds ratio • A scoring matrix is a probability matrix, which is an attempt to understand the probability of all pairwise substitutions, given how often they are actually observed to change in known sequences. • Probabilities are calculated as being less than, or greater than observed by random chance, hence the negative numbers. PAM250 Matrix PAM amino acid scoring matrix • The Point Accepted Mutation considers only mutations at the single site level. Original matrices were done using sequences with more than 85% similarity, which means they are very closely related The term acceptance refers to functional conservation of protein function, even if sequence changes. Amino acids can be group according to their “chemistries.” Because some amino acids are very similar in their chemistries we should not score substitutions between them the same as between two amino acids that are very different in their chemistries Remember this? Assumptions of the PAM matrices • Substitutions at a given site are (i) independent of previous changes at that site and are independent of changes to adjacent sites. • One PAM unit corresponds to 1 a.a. change/ 100 a.a., or 1% divergence in sequence. • PAM160 = 160 (total) changes/100 a.a. This can be problematic because this represents an extrapolation of probabilities calculated for closely related sequence, and so and error will simply be multiplied BLOSUM matrices • Considered the fact that BLOcks of sequence corresponding to secondary structure (i.e. functional domains like catalytic sites, DNA binding regions etc.) are likely to display different SUbstitution probabilities. • And, BLOSUM matrices considered subsitution probabilities across several evolutionary distances, and so are more accurate than PAM for weaker sequence relationships. • E.g. BLOSUM62 matrix means that sequences with no more than 62% sequence similarity were used to calculate substitution probabilities. GAPS and penalties • Other types of mutations involve insertions and deletions of sequence, collectively called indels ACGATCGTCATCGATCGA ACGATCTCGATCGA These two sequences only align well across <half of their sequences. ACGATCGTCATCGATCGA ACGATC - - - - TCGATCGA If we introduce four gaps, it is obvious that these sequences are more related than we thought. But there must be a penalty for doing this (i.e. lowering the overall score) because you , Two kinds of Gap scoring methods Affine gap penalty G + Ln; where G is a penalty for introducing a gap, and L is the penalty for lengthening the range of the gapped region (G > Ln) Non-affine gap penalty Ln; No penalty for opening, and where L is a fixed penalty for every gap So, How does BLAST work? • Words, Neighbourhoods, and High Scoring Segment pairs, Oh My! • The Word is the minimum length of sequence that is used to start a search, usually three amino acids (RDQYPQW). • Neighbourhoods are similar words to the query word, (e.g. RDQ vs RBQ vs RDE). These are subject to the scoring matrix • A High scoring segment pair is the region of sequence for which the highest scoring Word can be extended the most as matching sequence Detection of high scoring segments. Lexa M et al. Bioinformatics 2011;27:2510-2517 © The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected] Let’s Blast!!

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download BLAST- bioinformatics