* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Database Searching and Pairwise Alignment
Expression vector wikipedia , lookup
Community fingerprinting wikipedia , lookup
Promoter (genetics) wikipedia , lookup
Multilocus sequence typing wikipedia , lookup
Magnesium transporter wikipedia , lookup
Deoxyribozyme wikipedia , lookup
Interactome wikipedia , lookup
Metalloprotein wikipedia , lookup
Amino acid synthesis wikipedia , lookup
Silencer (genetics) wikipedia , lookup
Gene expression wikipedia , lookup
Endogenous retrovirus wikipedia , lookup
Nucleic acid analogue wikipedia , lookup
Non-coding DNA wikipedia , lookup
Western blot wikipedia , lookup
Biochemistry wikipedia , lookup
Protein–protein interaction wikipedia , lookup
Biosynthesis wikipedia , lookup
Genetic code wikipedia , lookup
Proteolysis wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Two-hybrid screening wikipedia , lookup
Protein structure prediction wikipedia , lookup
An Introduction to Bioinformatics Database Searching - Pairwise Alignments AIMS To explain the principles underlying local and global alignment programs To explain what substitution matrices are and how they are used To introduce the commonly used pairwise alignment programs To explore the significance of alignment results OBJECTIVES Carry out FastA and Blast searches To select appropriate substitution matrices To evaluate the significance of alignment/search results INTRODUCTION • Sequence comparisons • Protein v Protein • DNA v DNA • Protein v DNA • DNA v Protein • Pair-wise comparison • Methodology Similarity v Homology………. “If two genes shared a common ancestor then they are homologous” They did or they didn’t, they are or they arn’t % Homology Definitions Similarity v Homology……. But :• Comparison of two sequences complex • Differences need to be quantified • infer homology from degree of similarity Information theory……….? Protein sequence = message bits = log2M 4.19 bits per residue bit: The amount of information required to distinguish between two equally likely choices Ref: Molecular Information theory - http://www-lmmb.ncifcrf.gov/~toms/ http://www.lecb.ncifcrf.gov/~toms/paper/nano2/latex/index.html Are two proteins related ? • Average protein size of 150 residues • Information content of 630 bits. • Probability that two random sequences specify the same message is 2-630 or about 10-190. • Convergent evolution giving rise to two similar sequences would be very rare • If two sequences exhibit significant similarity arose from a common ancestor and are homologous. Basic concept • The English alphabet contains 26 letters, that of DNA 4, and that of protein 20 • Measure similarity or dissimilarity Basic concept………. • Hamming Distance AGATCTAG ACGA AGGCATCATGCAGT • Measure No of differences between two sequences • The answer to the above is………….. 10 • The proportional or p-distance. Hamming distance divided by the total sequence length, so ranges from 0 to 1. In the above example the pdistance is 10/14 Basic concept………. The log-odds ratio. - measure of how unlikely two sequences should be so similar. - based on the observed frequencies of each of the characters (bases or amino acids) in the sequences, and the probability of observing each homologous pair in the two sequences. - positive score, measuring similarity, calculated by adding the scores from pre-calculated matrices (PAM and BLOSUM for protein, unitary for DNA). Two problems to consider: • GAPS • genes evolve AGATCTAG-ACGA-TGCAGT AGGCATCATGCAGT •deletions, insertions, recombination • give penalties for gap creations and extensions • Global or Local Alignments • Will sequences be similar over their whole length? • Use different algorithms Global and Local Alignments • A global approach will attempt to align two sequences along their entire length • A local alignment will look for local regions of similarity or subsequences. T H E R A T S A T O N T H E C A T T H E C A T S A T O N T H E M A T l l l l l l l l l l l l l l l l l Dotplots are the simplest form of alignment l l l l Identical sequences, or subsequences are l l l l l identified by diaganol lines l l l l l l l l l l l l l l l l l l l l DOTTUP website does this analysis Example of Rabbit v Emperor Penguin Haemoglobin Matrices - PAM and BLOSUM • Certain groups of amino acids have similar physicochemical properties e.g Lysine and Arginine – conservative substitution • Genetic code is degenerate - silent mutations • Dayhoff - Point Accepted Mutation (PAM) Matrices - PAM and BLOSUM PAM 1 PAM unit is the extent of evolutionary divergence in which 1% of amino acid residues are altered • Alignment of 15 very closely related proteins • Calculate a matrix of probability of a mutation altering one amino acid residue to any other amino acid on the basis of 1 PAM. • Extrapolate to PAM250 – more useful for proteins not well conserved PAM250 matrix Problems: derived from proteins of only slight divergence BLOSUM • Henikov and Henikov (1992) derived matrices based on sequences more divergent. • The BLOSUM (BLOcks SUbstition Matrix) matrices cover sequences with 80% or more similarity (BLOSUM 80), 62% or greater similarity (BLOSUM 62) etc • Based on local not global alignments Alignments - local Basic principle • Choose one sequence to be searched against the other • Query sequence (q) and target sequence (t) • Divide the query sequence into small subsequences, called words • For each word of q, look along t to find other words in t which are similar • Matching words "anchors" build up a better alignment between q and t • Assess how good this alignment is. FastA and BLAST FastA • Pearson and Lipman Method (late 80s) • Query sequence compared to each sequence in a database •matching words (up to 6 nucleotides, or two amino acids in a row) • Rescore best regions with matrices • Algorithm checks concatenation • Best sequences displayed FastA and BLAST BLAST • Basic Local Alignment Search Tool •Compares query to database – For each pair - finds maximal segment pair (using BLOSUM) – The algorithm calculates probability of random occurrence – Faster than FastA, less accurate, method of choice since introduction of GAP-BLAST Significance? Only Local Alignments - without gaps HSPs/MSPs - alignment occurring by chance (p value) is derived from the observed score (S) to the expected distribution of scores – larger databases - larger probability of a sequence match by chance – the closer the p-value to zero the more confidence can be given to the alignment Types of BLAST • Nucleotide BLAST Standard nucleotide-nucleotide BLAST [blastn] MEGABLAST Search for short nearly exact matches • Protein BLAST Standard protein-protein BLAST [blastp] PSI- and PHI-BLAST Search for short nearly exact matches • Translated BLAST Searches Nucleotide query - Protein db [blastx] Protein query - Translated db [tblastn] Nucleotide query - Translated db [tblastx] Example. I have a new mRNA sequence: TGGCGGCGGCGGCGGCGGTTGTCCCGGCTGTGCCGGTTGGTGTGGCCCGTCAGCCCGCGTACCACAGCGCCCGGGCCGCG TCGAGCCCAGTACAGCCAAGCCGCTGCGGCCGGGTCCGGCGCGGGCGGCGCGCGCAGACGGAGGGCGGCGGCCGCGGCCA GGGCGGCCCGTGGGACCGCGGGCCCCCGGCGCAGCGCTGCCCGGCTCCCGGCCCTGCCGGCCTCCTCCCTTGGCGCCGCG GCCATGGCGGCCAGCGCGAAGCGGAAGCAGGAGGAGAAGCACCTGAAGATGCTGCGGGACATGACCGGCCTCCCGCGCAA CCGAAAGTGCTTCGACTGCGACCAGCGCGGCCCCACCTACGTTAACATGACGGTCGGCTCCTTCGTGTGTACCTCCTGCT CCGGCAGCCTGCGAGGATTAAATCCACCACACAGGGTGAAATCTATCTCCATGACAACATTCACACAACAGGAAATTGAA TTCTTACAAAAACATGGAAATGAAGTCTGTAAACAGATTTGGCTAGGATTATTTGATGATAGATCTTCAGCAATTCCAGA CTTCAGGGATCCACAAAAAGTGAAAGAGTTTCTACAAGAAAAGTATGAAAAGAAAAGATGGTATGTCCCGCCAGAACAAG CCAAAGTCGTGGCATCAGTTCATGCATCTATTTCAGGGTCCTCTGCCAGTAGCACAAGCAGCACACCTGAGGTCAAACCA CTGAAATCTCTTTTAGGGGATTCTGCACCAACACTGCACTTAAATAAGGGCACACCTAGTCAGTCCCCAGTTGTAGGTCG TTCTCAAGGGCAGCAGCAGGAGAAGAAGCAATTTGACCTTTTAAGTGATCTCGGCTCAGACATCTTTGCTGCTCCAGCTC CTCAGTCAACAGCTACAGCCAATTTTGCTAACTTTGCACATTTCAACAGTCATGCAGCTCAGAATTCTGCAAATGCAGAT TTTGCAAACTTTGATGCATTTGGACAGTCTAGTGGTTCGAGTAATTTTGGAGGTTTCCCCACAGCAAGTCACTCTCCTTT TCAGCCCCAAACTACAGGTGGAAGTGCTGCATCAGTAAATGCTAATTTTGCTCATTTTGATAACTTCCCCAAATCCTCCA GTGCTGATTTTGGAACCTTCAATACTTCCCAGAGTCATCAAACAGCATCAGCTGTTAGTAAAGTTTCAACGAACAAAGCT GGTTTACAGACTGCAGACAAATATGCAGCACTTGCTAATTTAGACAATATCTTCAGTGCCGGGCAAGGTGGTGATCAGGG AAGTGGCTTTGGGACCACAGGTAAAGCTCCTGTTGGTTCTGTGGTTTCAGTTCCCAGTCAGTCAAGTGCATCTTCAGACA AGTATGCAGCTCTGGCAGAACTAGACAGCGTTTTCAGTTCTGCAGCCACCTCCAGTAATGCGTATACTTCCACAAGTAAT GCTAGCAGCAATGTTTTTGGAACAGTGCCAGTGGTTGCTTCTGCACAGACACAGCCTGCTTCATCAAGTGTGCCTGCTCC ATTTGGACGTACGCCTTCCACAAATCCATTTGTTGCTGCTGCTGGTCCTTCTGTGGCATCTTCTACAAACCCATTTCAGA CCAATGCCAGAGGAGCAACAGCGGCAACCTTTGGCACTGCATCCATGAGCATGCCCACGGGATTCGGCACTCCTGCTCCC TACAGTCTTCCCACCAGCTTTAGTGGCAGCTTTCAGCAGCCTGCCTTTCCAGCCCAAGCAGCTTTCCCTCAACAGACAGC TTTTTCTCAACAGCCCAATGGTGCAGGTTTTGCAGCATTTGGACAAACAAAGCCAGTAGTAACCCCTTTTGGTCAAGTTG CAGCTGCTGGAGTATCTAGTAATCCTTTTATGACTGGTGCACCAACAGGACAATTTCCAACAGGAAGCTCATCAACCAAT CCTTTCTTATAGCCTTATATAGACAATTTACTGGAACGAACTTTTATGTGGTCACATTACATCTCTCCACCTCTTGCACT GTTGTCTTGTTTCACTGATCTTAGCTTTAAACACAAGAGAAGTCTTTAAAAAGCCTGCATTGTGTATTAAACACCAGGTA ATATGTGCAAAACCGAGGGCTCCAGTAACACCTTCTAACCTGTGAATTGGCAGAAAAGGGTAGCGGTATCATGTATATTA AAATTGGCTAATATTAAGTTATTGCAGATACCACATTCATTATGCTGCAGTACTGTACATATTTTTCTTAGAAATTAGCT ATTTGTGCATATCAGTATTTGTAACTTTAACACATTGTTATGTGAGAAATGTTACTGGGGAAATAGATCAGCCACTTTTA AGGTGCTGTCATATATCTTGGAATGAATGACCTAAAATCATTTTAACCATTGCTACTGGAAAGTAACAGAGTCAAAATTG GAAGGTTTTATTCATTCTTGAATTTTTCCTTTCTAAAGAGCTCTTCTATTTATACATGCCTAAATTCTTTTAAAATGTAG AGGGATACCTGTCTGCATAATAAAGCTGATCATGTTTTGCTACAGTTTGCAGGTGAAAAAAAATAAATATTATAAAATAA AAAAAAAAAAAAAGAAAAAAAAAA I’ve pasted my sequence I’ve selected the database I hit BLAST! Record this number Press Format! Setting up a BLAST search Step 1. Plan the search Step 2. Enter the query sequence Step 3. Choose the appropriate search parameters Step 4. Submit the query Deciphering the BLAST output Step 1. Examine the alignment scores and statistics Step 2. Examine the alignments Step 3. Review search details to plan the next step Post-BLAST analysis Perform a PSI-BLAST analysis Create a multiple alignment Try motif searching with PHI-BLAST