* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download PPT presentation
Deoxyribozyme wikipedia , lookup
Non-coding DNA wikipedia , lookup
Human genome wikipedia , lookup
Metagenomics wikipedia , lookup
Protein moonlighting wikipedia , lookup
Therapeutic gene modulation wikipedia , lookup
Nucleic acid analogue wikipedia , lookup
Helitron (biology) wikipedia , lookup
Expanded genetic code wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Genetic code wikipedia , lookup
Point mutation wikipedia , lookup
Smith–Waterman algorithm wikipedia , lookup
Alignments Year Apr-98 Oct-97 Apr-97 Oct-96 Apr-96 Oct-95 Apr-95 Oct-94 Apr-94 Oct-93 Apr-93 Sep-92 Dec-91 Mar-91 Jun-90 Sep-89 Dec-88 Jun-88 Sep-87 Feb-87 May-86 May-85 Sep-84 Dec-82 Bases GenBank Growth Chart 1600000000 1400000000 1200000000 1000000000 800000000 600000000 400000000 200000000 0 Evolutionary basis of Alignment • Enable the researcher to determine if two sequences display sufficient similarity to justify the inference of homology. • Similarity is an observable quantity that may be expressed as say %identity or some other measure. • Homology is a conclusion drawn from this data that the two genes share a common evolutionary history. Conclusion • Genes either are or are not homologous. • There are no degrees of homology as are there in similarity. • While it is presumed that the homologous sequences have diverged from a common ancestral sequence through iterative molecular changes we do not actually know what the ancestral sequence was. Conclusion • An alignment thus just reflects the probable evolutionary history of two genes/proteins. • Residues that have aligned and are not identical represent substitutions. • Regions in which the residues of one sequence correspond to nothing in the other would be interpreted as either an insertion/deletion. These regions are represented in an alignment as Gaps • Continued . Certain regions more conserved than others - Crucial residues (structure/function) • There may be certain regions conserved but not functionally related - historical reasons. • Specially, from closely related species- have not had sufficient time to diverge. • MOTD: Experimental tests are must for validation and computational analysis just provides basic insight. An interesting example • • • • Appear to share a high degree of similarity. Should have similar biological function. Hypothetical statement. Crystalline: lens matrix of vertebrate eye. E.coli metabolic enzyme - quinone oxido reductase • Function has changed during the course of evolution. • CAREFUL!! Global Alignment: • An alignment that spans the entire length of the protein like the one in the previous example. • Best for proteins that have not diverged substantially. • Have single globular domain. Local alignment: • Many proteins appear to be mosaics of modular domains. • Modular structure of two proteins involved in Blood clotting. F2 E F1 E F1 E K K Catalytic K Catalytic Continued • Besides the catalytic domain that provides that serine protease activity there are other domains. • Two types of Fibronectin repeats. • A domain with similarity to EGF. • And a “Kringle” domain. • Can be repeated and can appear in different order. • For such cases “Local Alignment”. Continued... • Another case where local alignments might be used is at the nucleotide level when one tries to compare the nucleotide sequence of a spliced RNA to its Genomic DNA. • Each Exon is in a distinct local alignment. Comparing two sequences • Gap: Finds the alignment of two complete sequences (Global Alignment), maximises the number of matches, minimises the number of Gaps. • BestFit: Aligns the best segment of similarity between two sequences (local alignment) Evaluation of Alignment Accuracy Evaluation of alignment accuracy • What is a good alignment? • The amino acid sequence codes for the protein three dimensional structure. • when an alignment of two or more sequences is made, the implication is that the equivalent residues are performing similar structural roles in the native folded protein. • The best judge of alignment accuracy is thus obtained by comparing alignments resulting from sequence comparison with those derived from protein three dimensional structures. • Care must be taken when performing the comparison since within protein families, some regions show greater similarity than others. Check by • Monte Carlo Simulation – To check the accuracy of alignment of say e.g A and B. – Randomise B and calculate the % identity. – Iterate 1000 times and see out of thousand how many times the % identity is more than the actual sequence and calculate the probability of getting the alignment score by chance. Scoring Schemes Identity scoring • This is the simplest scoring scheme. • Amino acid pairs are classified into two types: identical and non-identical • Non-identical pairs are scored 0 and identical pairs given a positive score (usually 1) • The scoring scheme is generally considered less effective than schemes that weight non-identical pairs Genetic code scoring • Genetic code scoring was introduced by Fitch. • Considers the minimum number of DNA/RNA base changes (0,1,2 or 3) that would be required to inter-convert the codons for the two amino acids. • The scheme has been used both in the construction of phylogenetic trees and in the determination of homology between protein sequences having similar three dimensional structures. Chemical similarity scoring • Give greater weight to the alignment of amino acids with similar physico-chemical properties. • Classified amino acids on the basis of polar or non-polar character, size, shape and charge. PAM • Scoring scheme based on observed substitutions. • Derived by analysing the substitution frequencies seen in alignments of sequences. • This is something of a chicken and egg problem, since in order to generate the alignments, one really needs a scoring scheme but in order to derive the scoring scheme one needs the alignments! BLAST • Compile a list of High Scoring words towards the query sequence. • All w-mers with a score of at least T. Flavors of BLAST • Blastp/blastn - Match protein and nucleic acids against resp. databases. • Blastx - Match Nucleic acid against protein database that is matching at amino acid level. • Tblastx - Nucleic acid against a nucleic acid database but matching is done at the protein level. • Tblastn - amino acid against a nucleic acid database but matching is at amino acid level. Flavors of BLAST • Blast2/Advanced Blast/Wu -Blast – Can perform gapped alignments. • BLAST2.0 – Introduced a window factor A • Two hits must located in a window size of A . • Ignored random hits. • Can perform gapped alignments. Flavors of BLAST • PSI Blast- Position specific iterative blast. – Perform iterative database searches. – The results from each search and incorporated into a “Position specific scoring matrix” which is used for further searching. • PHI Blast - Pattern Hit Initiated Blast – Input a protein sequence query sequence and a pattern contained in that sequence. – Search for other protein sequences that contain the pattern and have significant similarity to the query – The pattern becomes the most rigid part of the search. PSI BLAST • A profile can be understood as a table that lists the frequencies of finding each of the 20 amino acids at each position in a conserved protein domain. • Building a profile can be tedious. • PSI-BLAST: A profile is constructed and iteratively refined. • Take a query sequence. • Make a profile with the initial search result. • Use this profile in a second pass search of the database. • Additional sequences found are used to refine the profile. • An interesting case of HIT (Histidine Triad Protein DB Search) FASTA • Identify all exact matches of word size k or greater between the query and the database sequence. • This word size is what is called k-tup and is usually set as 2 for proteins and 1-6 for nucleic acids. • Higher word size: » Faster, less sensitive and more selective. FASTA • Penalty - gaps • Penalty - creation of Gap. (3) • Penalty -extension of Gap. (1). Also called bias. FASTA • Rescan the 10 regions with highest similarity score. • Calculate the scores using the scoring matrix. • Trim the ends of the region to include only those residues which contribute to the highest score. • This results in 10 partial alignments without gaps. FASTA • The score of the highest scoring initial region is stored as the init1 score. • Try and join regions to see if they lie around the same diagonal (longest possible nonoverlapping alignment with gaps). • Penalty for gaps. The score of the highest scoring region at the end of joining is called as the initn score. • Optimise the alignment and score is called as the opt score. Low Complexity Regions • Biased composition and can lead to confusing results during database searches. • Homo polymeric runs and short period repeats or to the more subtle cases where one or few residues are over represented. • SEG : partitions seqs into LCR and HCR. • More than 50% of the proteins in the db contains at least 1 LCR. • LCR do not fit the residue by residue sequence conservation. So you may see a lot of false positives. Repetitive Elements • False positives. • More in DNA than protein searches. Mostly, found in the untranslated regions of the message. • Represented in the results as warning sequences. • Do a preliminary search against Alu repeats Database Multiple Alignments • Multiple sequence alignment algorithms allow you to compare and align more than two related sequences. • Very useful when analysing a family of proteins. • E.g: ClustalW and Pileup Method • Create a tree by comparing the most similar sequences step by step. • The two most similar sequences are aligned and then the next two sequences are aligned that are most similar. • Re-adjust the gaps so that the alignment is maximum and the gaps are least. GenBank - Clean and Up-to-date • Examples: – Jurassic Park - Michael Crichton. Continued.... • John Mallata - Univ. of Washington. – Evolutionary biologist...comparative genomics...realised after several months of work that the Xenopus sequence he was relying on was incorrect...found the error by accidentally coming across the correct sequence in literature. • Cases where Hamster sequence is called Human by mistake. Continued.... • Genes are placed on wrong chromosomes. • 5 years ago people sequenced only those genes for which they knew the function. Now the reverse is true. • There are very few genes in the database that are characterised and function is usually determined by computer programs that can be tripped domains have different roles. Continued...... • Database education is important. • Peer Bork , EBI estimates , about 15% of the information in GenBank to be unverified and not up-to-date. • Don’t assume that all information in the databases is correct.