Download Bioinformatics Dr. Víctor Treviño Pabellón Tec

BIOINFORMATICS
DR. VÍCTOR TREVIÑO
[email protected]
A7-421 EXT-4536+103
BT4007 Blast and Alignments

PAPER PRESENTATIONS IN MARCH
- Find a research article related to your project that has a strong bioinformatics component. For example:
  - Generation of a database
  - Development of a program or service
  - Discovery of genes, metabolic pathways, etc. by means of, or with the help of, bioinformatics methods
- Propose the paper to the professor and get it confirmed
- Study the paper
- Prepare the presentation
- Present it in class: 15 minutes (10 minutes of presentation + 5 of questions)
- Presentations are evaluated by the professor and the students, using a rubric that grades elements such as: Topic, Introduction, Methods, Results, Discussion, Critique, Voice, Clarity, Confidence, Knowledge, Answers, Time

PAPERS FOR NEXT SESSION

SEQUENCE SIMILARITY
- Sequences are similar because they are derived from a common ancestor
- Similarity will most often be the result of duplication events
- Similarity will then depend on divergence times
- General rule: 25% identity in a 100 aa sequence is good evidence of common ancestry
Source: Bioinformatics - Methods and Applications – Genomics, Proteomics and Drug Discovery – Rastogi – Mendiratta - PHI

SEQUENCE SIMILARITY
- Within a protein sequence, some regions will be more conserved than others. The more conserved a region is, the more important it is likely to be
  - for function
  - for 3D structure
  - for localization
  - for modification
  - for interaction
  - for regulation/control
  - for transcriptional regulation (in DNA)
REASONS TO PERFORM SEQUENCE SIMILARITY SEARCHES

SEQUENCE SIMILARITY - TERMS
- Homologous: similar due to common ancestry
- Analogous: similar due to convergent evolution
- Orthologous: homologous by speciation (in separate species), with conserved function
- Paralogous: homologous with different function (commonly within the same species)
Source: Bioinformatics - Methods and Applications – Genomics, Proteomics and Drug Discovery – Rastogi – Mendiratta - PHI

SEQUENCE SIMILARITY - TERMS
- Xenologous: homologous due to horizontal gene transfer
  - HGT (horizontal gene transfer): transfer of genetic material to an organism that is not its offspring
  - VGT (vertical gene transfer): transfer of genetic material from an ancestor (mitosis). VGT is not related to xenologous.
- Ohnologous: paralogs that originated by whole genome duplication
- Gametologous: homologous genes on nonrecombining opposite sex chromosomes
Source: Bioinformatics - Methods and Applications – Genomics, Proteomics and Drug Discovery – Rastogi – Mendiratta - PHI; Wikipedia

SEQUENCE SIMILARITY – EVOLUTIONARY RELATIONSHIP
Source: Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press

SEQUENCE SIMILARITY – ORIGINS OF GENES
- a1-S1 and a1-S2 are orthologous
- a2-S1 and a2-S2 are orthologous
- a1 and a2 are paralogous
- Analogous genes: same function, different origin
- Xenologous
Source: Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press

SEQUENCE SIMILARITY – TYPES OF MODIFICATION
Mutations occur during evolution by:
  Original sequence   …ACCAGTGTGCCGTACA…
  Insertion (a)       …ACCAGTaGTGCCGTACA…
  Deletion (GTG)      …ACCAGTCCGTACA…
  Substitution        …ACCAGTGCGCCGTACA…

SIMILARITY AND DISTANCE BETWEEN SEQUENCES
- SIMILARITY is the maximal SUM of WEIGHTS for the conserved residues
  - More useful for database searching
- DISTANCE is the minimal SUM of WEIGHTS for a set of mutations transforming one sequence into the other
  - More useful for phylogenetic tree reconstruction
- Both are opposite and interconvertible concepts
- WEIGHT accounts for the different roles of mutation events, AA residue similarity, etc., e.g.
synonymous mutations are weighted differently from nonsense mutations
Source: Bioinformatics - Methods and Applications – Genomics, Proteomics and Drug Discovery – Rastogi – Mendiratta - PHI

SEQUENCE ALIGNMENT
- Procedure for comparing two (pairwise alignment) or more (multiple sequence alignment) sequences by searching for similar patterns that appear in the same order in the sequences
- Overall similarity
- Identical residues (nt or aa) are placed in the same column
- Non-identical residues can be placed in the same column or indicated as gaps
Source: Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press; Wikipedia, http://www-personal.umich.edu/~lpt/fgf/fgfrcomp.htm

SEQUENCE ALIGNMENT
- GLOBAL: procedure applied to the entire sequence to include as many matches as possible up to the end of the sequence
- Methods
  - Brute force: impractical
  - Dot matrix: graphical, easy to understand
  - Dynamic programming: the most accurate
  - Heuristic methods: fast, not as accurate
    - Word k-tuple – Database searching – BLAST
Source: Bioinformatics - Methods and Applications – Genomics, Proteomics and Drug Discovery – Rastogi – Mendiratta - PHI; Wikipedia

GLOBAL AND LOCAL ALIGNMENTS
- Proteins are MODULAR
- Patterns formed by exchange of whole EXONS
- Example:
  - F12: Coagulation Factor XII
  - PLAT: Tissue-type plasminogen activator
  - Domain key: F1/2 - Fibronectins; E - Epidermal Growth Factors; K - "Kringle" domain
Source: A practical guide to the analysis of genes and proteins – Baxevanis – Ouellette – Wiley 2Ed.
GLOBAL ALIGNMENT METHODS DO NOT CONSIDER THESE ISSUES
- LOCAL ALIGNMENT

GLOBAL AND LOCAL ALIGNMENTS
Source: Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press

LOCAL ALIGNMENT
- Alignment stops at the end of regions of identity or strong similarity
- Much higher priority is given to finding these local regions than to extending the alignment
Source: A practical guide to the analysis of genes and proteins – Baxevanis – Ouellette – Wiley 2Ed.

DOT-MATRIX METHOD
- Primary method for comparing sequences
- Provides a global and local overview of similarity
- Useful for direct or inverted repeats
- Useful for self-complementary RNA regions
- Programs: DNA Strider, DOTTER, GCG-DOTPLOT, DOTLET
- http://myhits.isb-sib.ch/cgi-bin/dotlet
Source: Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press

DOT-MATRIX METHOD
- Align the aa sequence "DOROTHYHODGKIN" vs "DOROTHYCROWFOOTHODGKIN"
Source: Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press

DOT-MATRIX METHOD – EX 1
- WINDOW SIZE = 11
- STRINGENCY = 7 (how many positions within the window must be identical)
- Window over …ACCAGTGTGCCGTACA…
Source: Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press

DOT-MATRIX METHOD – EX 2
Source: A practical guide to the analysis of genes and proteins – Baxevanis – Ouellette – Wiley 2Ed.

DOT-MATRIX METHOD – EX 3 - REPEATS
Figure 3.6. Dot matrix analysis of the human LDL receptor against itself using DNA Strider, vers. 1.3, on a Macintosh
Source: Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press

DOT-MATRIX METHOD – PROGRAMS
(you could also use PubMed)
Source: Bioinformatics for Dummies – Claverie – Notredame – Wiley - 2nd Ed.
2007

DOT-MATRIX EXAMPLES
- http://hits.isbsib.ch/util/dotlet/doc/dotlet_examples.html
- http://myhits.isb-sib.ch/cgi-bin/dotlet

DYNAMIC PROGRAMMING METHOD
- Provides the very best (optimal) alignment in a reasonable amount of time
- Requires several parameters, though
- Global: Needleman-Wunsch
- Local: Smith-Waterman
- Provides a p-value: the probability of obtaining the alignment by chance from unrelated sequences
- There is a method for assessing statistical significance
- Results depend on the scoring system

DYN.PROG.METHOD - SCORING
- Results depend on the scoring system
  - Scoring matrices (depending on the residue pair)
  - Gap penalties
- DNA alignments require a similar scoring system

DYNAMIC PROGRAMMING METHOD
[Formula figure not preserved. Surviving labels: i, j; "Gap penalties"; "from the scoring matrix"; "x, y are the 'radius'".]
Source: Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press

DYNAMIC PROGRAMMING EXAMPLE
Parameters:
- X = 1, Y = 1
- Gap W(x=1) = 1, W(x=2) = 1, …
- Gap W(y=1) = 1, …
- s(a,b) = 2, if a = b
- s(a,b) = 0, if a <> b

Score matrix (letters in parentheses mark the predecessor that gives the maximum: d = diagonal, l = left, u = up):

       gap     A       C       G       G       A       T       A       T
gap    0       -1      -1      -1      -1      -1      -1      -1      -1
G      -1      0       -1      (d)+1   (d)+1   (l)0    (ld)-1  (d)-1   (d)-1
G      -1      -1      0       (d)+1   (d)+3   (l)+2   (l)+1   (l)0    (ld)-1
C      -1      -1      (d)+1   (ldu)0  (u)+2   (d)+3   (ld)+2  (ld)+1  (ld)0
T      -1      (d)-1   (u)0    (d)+1   (u)+1   (ud)+2  (d)+5   (l)+4   (ld)+3
A      -1      (d)+1   (l)0    (d)0    (d)+1   (d)+3   (u)+4   (d)+7   (l)+6

First cell (G vs A): Max(0, -1 - 1 = -2, -1 - 1 = -2) = 0

Resulting alignment:
ACGGATAT
--GGCTA
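The worked example (ACGGATAT against GGCTA, s(a,b) = 2 for a match and 0 otherwise, gap weight 1) can be reproduced with a short script. This is a minimal sketch, not the exact procedure from the slides: it assumes gaps are considered one position at a time (x = y = 1) and that the border cells are set to a flat -1, which is what the example matrix appears to do. The function name `fill_matrix` and its parameters are hypothetical.

```python
def fill_matrix(seq_cols, seq_rows, match=2, mismatch=0, gap=1, border=-1):
    """Fill a global-alignment score matrix using single-position gaps."""
    n_rows, n_cols = len(seq_rows) + 1, len(seq_cols) + 1
    score = [[0] * n_cols for _ in range(n_rows)]

    # Assumption taken from the example: border cells hold a flat -1.
    for j in range(1, n_cols):
        score[0][j] = border
    for i in range(1, n_rows):
        score[i][0] = border

    for i in range(1, n_rows):
        for j in range(1, n_cols):
            s = match if seq_rows[i - 1] == seq_cols[j - 1] else mismatch
            diag = score[i - 1][j - 1] + s   # align the two residues
            up = score[i - 1][j] - gap       # gap in the column sequence
            left = score[i][j - 1] - gap     # gap in the row sequence
            score[i][j] = max(diag, up, left)
    return score


matrix = fill_matrix("ACGGATAT", "GGCTA")
for label, row in zip("-GGCTA", matrix):
    print(label, " ".join(f"{v:+3d}" for v in row))

best = max(max(row) for row in matrix)
print("highest score:", best)
```

The last row should end in +7 and +6, matching the example. The highest score (7) sits in the second-to-last column, which is why the final T of ACGGATAT is left unpaired in the alignment.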
SCORING
- Results depend on the scoring system
  - Scoring matrices (depending on the residue pair)
  - Gap penalties
- The Dayhoff PAM (point accepted mutation) matrix is based on an evolutionary model for proteins
  - One PAM is a unit of evolutionary divergence in which 1% of the amino acids have been changed, estimated from very similar sequences
- BLOSUM matrices are designed to identify members of the same family
  - Derived from the BLOCKS database (blocks substitution matrix, for distant sequences)
Source: Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press

DYNAMIC PROGRAMMING - SCORING
- Remember the "SUM OF WEIGHTS" for similarity/distance
- PAM250 is PAM1 applied 250 times (the PAM1 matrix multiplied by itself 250 times)
- BLOSUM62: sequences that are at least 62% identical are merged into one
- BLOSUM90 for comparing more similar sequences
- BLOSUM30 for very different sequences

DYNAMIC PROGRAMMING METHOD
- Some programs provide alternative alignments, depending on the goal:
  - domains
  - structural
  - same family
  - biological function
  - common ancestor
- There are several variations of the original Needleman-Wunsch and Smith-Waterman methods that improve memory usage, CPU time, and other features

DYNAMIC PROGRAMMING METHOD - OUTPUT
Source: Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press

DYNAMIC PROGRAMMING – STATISTICAL SIGNIFICANCE
- To assign a p-value, we could "shuffle" both sequences 100,000 times
- The proportion of shuffles that score higher than the real alignment is the p-value
- Another, quicker method is to convert the alignment into a BINARY sequence (match or mismatch)
- e.g. the probability of obtaining HTHTHHHH in a coin toss experiment

DYNAMIC PROGRAMMING – STATISTICAL SIGNIFICANCE
- Two random sequences of lengths m and n, with p = probability of a match
- Expected length of the longest match = log_{1/p}(mn)
- DNA seq.
length = 100, p = 0.25 (equal nt frequencies): the expected longest match = 2 x log4(100) = 6.64
- More precise formula

DYNAMIC PROGRAMMING – STATISTICAL SIGNIFICANCE
- Simplifying (mean of the highest possible local alignment score)
- k = mismatches; m and n are the sequence lengths
- Effective length = n - E(m) (used in BLAST)

ALIGNMENT PROCEDURE OVERVIEW

WORD K-TUPLE METHOD - BLAST
- Search a database for sequences that share at least W identical residues
- For a sequence of length L, the number of "internal searches" is L-W+1
- All "potential" sequences are then "extended" using the dynamic programming method
- A statistical significance score is estimated, representing the number of similar sequences expected in the database by chance (E value, equivalent to a p-value for the entire database)

BLAST
- Pi: random residue probability
- Sij: from the score matrix
- Score: S = sum(PiPjSij)
- Transformation (for statistical comparisons, expressed in bits): S' = (λS - ln K) / ln 2
- Expected number of matches with a score of at least S': E = mn / 2^S'
- Lengths: query = m, database = n
- Example: m = 250, n = 50,000,000, to achieve E = 0.05
  - S' = 38 bits
  - S = [(38 * ln 2) + ln K] / λ
  - S = 76.6 (for ungapped version: λu = 0.3176 and Ku = 0.134
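The BLAST example can be checked numerically. The sketch below assumes the usual Karlin-Altschul relations, E = m * n * 2^(-S') and S' = (λS - ln K) / ln 2, which are consistent with the numbers in the example. The function names `bits_needed` and `raw_score` are hypothetical helpers, not part of any BLAST tool.

```python
import math


def bits_needed(m, n, e_value):
    """Bit score S' at which E = m * n * 2**(-S') equals e_value."""
    return math.log2(m * n / e_value)


def raw_score(bit_score, lam, k):
    """Invert S' = (lam * S - ln K) / ln 2 to recover the raw score S."""
    return (bit_score * math.log(2) + math.log(k)) / lam


m, n = 250, 50_000_000                      # query and database lengths
s_bits = math.ceil(bits_needed(m, n, 0.05)) # round up to whole bits
s_raw = raw_score(s_bits, lam=0.3176, k=0.134)  # ungapped parameters from the example

print(s_bits, round(s_raw, 1))  # prints: 38 76.6
```

The exact threshold is about 37.9 bits, which rounds up to the 38 bits quoted in the example.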
