Bioinformatics Dr. Víctor Treviño Pabellón Tec
BIOINFORMATICS
DR. VÍCTOR TREVIÑO
[email protected]
A7-421 EXT-4536+103
BT4007 Blast and Alignments

PAPER PRESENTATIONS IN MARCH
Find a research article related to your project that has a strong bioinformatics component. For example:
  Generation of a database
  Development of a program or service
  Discovery of genes, metabolic pathways, etc. by means of, or with the help of, bioinformatics methods
Propose the paper to the professor and confirm it
Study the paper
Prepare the presentation
Present it in class: 15 minutes (10 minutes of presentation + 5 of questions)
Presentations are evaluated by the professor and the students using a rubric that grades elements such as: Topic, Intro, Methods, Results, Discussion, Critique, Voice, Clarity, Confidence, Knowledge, Answers, Time

PAPERS FOR NEXT SESSION

SEQUENCE SIMILARITY
Sequences are similar because they are derived from a common ancestor
Similarity will most often be the result of duplication events
Similarity will then depend on divergence times
General rule: 25% identity in a 100 aa sequence is good evidence of common ancestry
Bioinformatics - Methods and Applications – Genomics, Proteomics and Drug Discovery – Rastogi – Mendiratta - PHI

SEQUENCE SIMILARITY
Within a protein sequence, some regions will be more conserved than others.
The more conserved a region, the more important it is likely to be:
  for function
  for 3D structure
  for localization
  for modification
  for interaction
  for regulation/control
  for transcriptional regulation (in DNA)

REASONS TO PERFORM SEQUENCE SIMILARITY SEARCHES

SEQUENCE SIMILARITY - TERMS
Homologous: similar due to common ancestry
Analogous: similar due to convergent evolution
Orthologous: homologous genes separated by speciation (in separate species), usually with conserved function
Paralogous: homologous genes separated by duplication (commonly within the same species), often with different function
Bioinformatics - Methods and Applications – Genomics, Proteomics and Drug Discovery – Rastogi – Mendiratta - PHI

SEQUENCE SIMILARITY - TERMS
Xenologous: homologous due to horizontal gene transfer
HGT (horizontal gene transfer): transfer of genetic material to an organism that is not its offspring
VGT (vertical gene transfer): transfer of genetic material from its ancestor (mitosis); VGT is not related to xenologous
Ohnologous: paralogous genes that originated by whole genome duplication
Gametologous: homologous genes on nonrecombining opposite sex chromosomes
Bioinformatics - Methods and Applications – Genomics, Proteomics and Drug Discovery – Rastogi – Mendiratta - PHI
Wikipedia

SEQUENCE SIMILARITY – EVOLUTIONARY RELATIONSHIP
Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press

SEQUENCE SIMILARITY – ORIGINS OF GENES
a1-S1 and a1-S2 are orthologous
a2-S1 and a2-S2 are orthologous
a1 and a2 are paralogous
Analogous genes: same function, different origin
Xenologous
Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press

SEQUENCE SIMILARITY – TYPES OF MODIFICATION
Mutations occur during evolution by:
  Original          …ACCAGTGTGCCGTACA…
  Insertions        …ACCAGTaGTGCCGTACA…
  Deletions (GTG)   …ACCAGTCCGTACA…
  Substitutions     …ACCAGTGCGCCGTACA…

SIMILARITY AND DISTANCE BETWEEN SEQUENCES
SIMILARITY is the maximal SUM of WEIGHTS for the conserved residues
  More useful for database searching
DISTANCE is the minimal SUM of WEIGHTS for a set of mutations transforming one sequence into the other
  More useful for phylogenetic tree reconstruction
Both are opposite and interconvertible concepts
WEIGHT accounts for the different roles of mutation events, amino acid residue similarity, etc.
  e.g. synonymous mutations are weighted differently from nonsense mutations
Bioinformatics - Methods and Applications – Genomics, Proteomics and Drug Discovery – Rastogi – Mendiratta - PHI

SEQUENCE ALIGNMENT
Procedure for comparing two (pair-wise alignment) or more (multiple sequence alignment) sequences by searching for similar patterns that are in the same order in the sequences
  Overall similarity
  Identical residues (nt or aa) are placed in the same column
  Non-identical residues can be placed in the same column or indicated as gaps
Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press
Wikipedia, http://www-personal.umich.edu/~lpt/fgf/fgfrcomp.htm

SEQUENCE ALIGNMENT
GLOBAL: procedure applied to the entire sequence to include as many matches as possible up to the end of the sequence
Methods
  Brute force: impractical
  Dot matrix: graphical, easy to understand
  Dynamic programming: the most accurate
  Heuristic methods: fast, not so accurate
    Word k-tuple – Database searching – BLAST
Bioinformatics - Methods and Applications – Genomics, Proteomics and Drug Discovery – Rastogi – Mendiratta - PHI
Wikipedia

GLOBAL AND LOCAL ALIGNMENTS
Proteins are MODULAR
Patterns formed by exchange of whole EXONS
Example:
  F12: Coagulation Factor XII
  PLAT: Tissue-type plasminogen activator
  F1/2 - Fibronectins
  E - Epidermal Growth Factors
  K - "Kringle" domain
A practical guide to the analysis of genes and proteins – Baxevanis – Ouellette – Wiley 2Ed.
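
The definition of DISTANCE as the minimal sum of weights for the mutations transforming one sequence into the other can be sketched with dynamic programming. The following is a minimal sketch, not a method taken from the slides: it assumes unit weights for substitutions and for each inserted or deleted residue (the slides do not fix the weights), and the names `edit_distance`, `w_sub` and `w_gap` are hypothetical. It is applied to the insertion, deletion and substitution examples from the TYPES OF MODIFICATION slide.

```python
def edit_distance(a, b, w_sub=1, w_gap=1):
    """Minimal sum of weights needed to transform sequence a into sequence b.

    Assumption: every substitution costs w_sub and every inserted or
    deleted residue costs w_gap (unit weights by default).
    """
    # prev[j] is the distance between the processed prefix of a and b[:j]
    prev = [j * w_gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * w_gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            substitution = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else w_sub)
            deletion = prev[j] + w_gap       # residue of a aligned to a gap
            insertion = curr[j - 1] + w_gap  # residue of b aligned to a gap
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


original = "ACCAGTGTGCCGTACA"
variants = {
    "insertion": "ACCAGTAGTGCCGTACA",
    "deletion": "ACCAGTCCGTACA",
    "substitution": "ACCAGTGCGCCGTACA",
}
for name, seq in variants.items():
    print(name, edit_distance(original, seq))
```

With unit weights the single inserted residue and the single substitution each give a distance of 1, and the three deleted residues give a distance of 3. Changing `w_sub` and `w_gap` shows how the weights, not only the sequences, determine the result.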
Global alignment methods do not consider the modular structure of proteins; local alignment does.

GLOBAL AND LOCAL ALIGNMENTS
[Figure]
Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press

LOCAL ALIGNMENT
Alignment stops at the end of regions of identity or strong similarity
Much higher priority is given to finding these local regions than to extending the alignment
A practical guide to the analysis of genes and proteins – Baxevanis – Ouellette – Wiley 2Ed.

DOT-MATRIX METHOD
Primary method for comparing sequences
Provides a global and local overview of similarity
Useful for direct or inverted repeats
Useful for self-complementary RNA regions
Programs: DNA Strider, DOTTER, GCG-DOTPLOT, DOTLET
http://myhits.isb-sib.ch/cgi-bin/dotlet
Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press

DOT-MATRIX METHOD
Align the aa sequence "DOROTHYHODGKIN" vs "DOROTHYCROWFOOTHODGKIN"
Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press

DOT-MATRIX METHOD – EX 1
Window size = 11, stringency = 7 (how many residues in the window must be identical)
window: …ACCAGTGTGCCGTACA…
Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press

DOT-MATRIX METHOD – EX 2
[Figure]
A practical guide to the analysis of genes and proteins – Baxevanis – Ouellette – Wiley 2Ed.

DOT-MATRIX METHOD – EX 3 - REPEATS
Figure 3.6. Dot matrix analysis of the human LDL receptor against itself using DNA Strider, vers. 1.3, on a Macintosh
Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press

DOT-MATRIX METHOD – PROGRAMS
(you could use PubMed also)
Bioinformatics for Dummies – Claverie – Notredame – Wiley - 2nd Ed. 2007

DOT-MATRIX EXAMPLES
http://hits.isbsib.ch/util/dotlet/doc/dotlet_examples.html
http://myhits.isb-sib.ch/cgi-bin/dotlet

DYNAMIC PROGRAMMING METHOD
Provides the very best or optimal alignment in a very reasonable amount of time
Several parameters though
Global: Needleman-Wunsch
Local: Smith-Waterman
Provides a p-value of obtaining the alignment by chance from unrelated sequences
  There is a method for statistical significance
Results depend on the scoring system

DYN.PROG.METHOD - SCORING
Results depend on the scoring system:
  Scoring matrices (pair-wise)
  Gap penalties
DNA alignments require a similar scoring system

DYNAMIC PROGRAMMING METHOD
[Figure: dynamic programming recurrence over the alignment matrix with axes i and j. Annotations: gap penalties; scores from the scoring matrix; x, y are the "radius"]
Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press

DYNAMIC PROGRAMMING EXAMPLE
s(a,b) = 2 if a = b
s(a,b) = 0 if a <> b
X = 1, Y = 1
Gap W(x=1) = 1, W(x=2) = 1, …
Gap W(y=1) = 1, …

       gap     A       C       G       G       A       T       A       T
gap    0       -1      -1      -1      -1      -1      -1      -1      -1
G      -1      0       -1      (d)+1   (d)+1   (l)0    (ld)-1  (d)-1   (d)-1
G      -1      -1      0       (d)+1   (d)+3   (l)+2   (l)+1   (l)0    (ld)-1
C      -1      -1      (d)+1   (ldu)0  (u)+2   (d)+3   (ld)+2  (ld)+1  (ld)0
T      -1      (d)-1   (u)0    (d)+1   (u)+1   (ud)+2  (d)+5   (l)+4   (ld)+3
A      -1      (d)+1   (l)0    (d)0    (d)+1   (d)+3   (u)+4   (d)+7   (l)+6

(d), (l), (u): the value came from the diagonal, left or upper cell
First row G, column A: max(0, -2, -2) = 0
Second row G, column A: max(-1, -1, -2) = -1

Resulting alignment:
  ACGGATAT
  --GGCTA

DYN.PROG.METHOD - SCORING
Dayhoff PAM (point accepted mutations) matrices are based on an evolutionary model for proteins
One PAM is a unit of evolutionary divergence in which 1% of the amino acids have been changed in very similar sequences
BLOSUM (blocks substitution matrix) matrices are designed to identify members of the same family
  Derived from the BLOCKS database (for distant sequences)
Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press

DYNAMIC PROGRAMMING - SCORING
Remember the "SUM OF WEIGHTS" for similarity/distance
PAM250 corresponds to 250 PAM units (the PAM1 matrix multiplied by itself 250 times)
BLOSUM62: sequences that are at least 62% identical are merged into one before substitutions are counted
BLOSUM90 for comparing more similar sequences
BLOSUM30 for very different sequences

DYNAMIC PROGRAMMING METHOD
Some programs provide alternative alignments, depending on the goal:
  domains
  structural
  same family
  biological function
  common ancestor
There are several variations on the original Needleman-Wunsch and Smith-Waterman methods that improve memory usage, CPU time, and other features

DYNAMIC PROGRAMMING METHOD - OUTPUT
[Figure]
Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press

DYNAMIC PROGRAMMING – STATISTICAL SIGNIFICANCE
To assign a p-value, we could "shuffle" both sequences 100,000 times. The proportion of times we obtain a score larger than the score of the real alignment represents the p-value
Another, quicker method is converting the alignment to BINARY sequences (match or no match)
  e.g. probability of obtaining HTHTHHHH in a coin toss experiment

DYNAMIC PROGRAMMING – STATISTICAL SIGNIFICANCE
Two random sequences of lengths m and n, and p = probability of a match
Expected length of the longest match = log_{1/p}(mn)
DNA sequences of length 100, p = 0.25 (equal nt frequencies): the longest match = 2 x log_4(100) ≈ 6.64
More precise formula: [Figure]

DYNAMIC PROGRAMMING – STATISTICAL SIGNIFICANCE
Simplifying (mean of the highest possible local alignment score): [Figure]
k = mismatches, m and n are the sequence lengths
Effective length = n – E(m) (used in BLAST)

ALIGNMENT PROCEDURE OVERVIEW
[Figure]

WORD K-TUPLE METHOD - BLAST
Search a database for sequences that share at least W identical residues
For a sequence of length L, the number of "internal searches" is L-W+1
All "potential" sequences are then "extended" using the Dynamic Programming Method
A statistical significance score is estimated, representing the number of expected similar sequences in the database (E value, equivalent to a p-value for the entire database)

BLAST
Pi – random residue probability
Sij – from the score matrix
Score: S = sum(Pi Pj Sij)
Transformation (for statistical comparisons, expressed in bits): S' = (λS – ln K) / ln 2
Expected number of matches with a score of at least S': E = m n 2^(–S')
Lengths: query = m, database = n
Example: m = 250, n = 50,000,000; to achieve E = 0.05, S' = 38 bits
S = [(38 * ln 2) + ln K] / λ
S = 76.6 (for ungapped version: λu = 0.3176 and Ku = 0.134
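
The numbers quoted on the statistical significance and BLAST slides can be checked with a few lines of code. This is only a sketch: it assumes the standard relations S' = (λS – ln K) / ln 2 and E = m n 2^(–S'), which agree with the worked example on the slide, and all function names are hypothetical.

```python
import math


def expected_longest_match(m, n, p):
    """Expected longest run of matches between random sequences: log base 1/p of m*n."""
    return math.log(m * n) / math.log(1.0 / p)


def number_of_words(length, w):
    """Number of overlapping words of size w in a query of the given length (L - W + 1)."""
    return length - w + 1


def bit_score(raw_score, lam, k):
    """Convert a raw score S into a bit score S' (assumed Karlin-Altschul form)."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def raw_score_from_bits(bits, lam, k):
    """Inverse conversion, as written on the slide: S = (S' ln 2 + ln K) / lambda."""
    return (bits * math.log(2) + math.log(k)) / lam


def expect_value(m, n, bits):
    """Expected number of chance matches with a bit score of at least `bits`."""
    return m * n * 2.0 ** (-bits)


# DNA example: two sequences of length 100, p = 0.25
print(round(expected_longest_match(100, 100, 0.25), 2))

# BLAST example: m = 250, n = 50,000,000, S' = 38 bits, ungapped parameters
lam_u, k_u = 0.3176, 0.134
print(round(raw_score_from_bits(38, lam_u, k_u), 1))
print(expect_value(250, 50_000_000, 38))
```

Under these assumptions the longest expected match comes out near 6.64, the raw score near 76.6, and the E value for 38 bits slightly below the 0.05 target, which is why 38 bits is the threshold quoted in the example.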