Bioinformatics Dr. Víctor Treviño Pabellón Tec

BIOINFORMATICS
DR. VÍCTOR TREVIÑO
[email protected]
A7-421
EXT-4536+103
BT4007
Blast and Alignments
PAPER PRESENTATIONS IN MARCH

• Find a research article related to your project that has a strong bioinformatics component. For example:
  • Generation of a database
  • Development of a program or service
  • Discovery of genes, metabolic pathways, etc. by means of, or with the help of, bioinformatics methods
• Propose the paper to the professor and confirm it
• Study the paper
• Prepare the presentation
• Present it in class: 15 minutes (10 minutes of presentation + 5 of questions)
• Presentations are evaluated by the professor and the students using a rubric that grades elements such as: Topic, Intro, Methods, Results, Discussion, Critique, Voice, Clarity, Confidence, Knowledge, Answers, Time
PAPERS FOR NEXT SESSION
SEQUENCE SIMILARITY
• Sequences are similar because they are derived from a common ancestor
• Similarity will most often be the result of duplication events
• Similarity will then depend on divergence times
• General rule: 25% identity over a 100 aa sequence is good evidence of common ancestry
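A minimal sketch of how percent identity could be computed from two already-aligned sequences. The function name and the toy peptides are hypothetical, and the denominator used here (columns without gaps) is only one of several conventions in use.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of gap-free alignment columns that hold identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == "-" or res_b == "-":
            continue  # gap columns are not counted (an assumption)
        compared += 1
        if res_a == res_b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Toy aligned peptides, invented for illustration only
print(percent_identity("MKT-AYIAK", "MKTQAYLAR"))
```

Under the rule of thumb above, a value of 25 or more over about 100 aligned residues would be taken as evidence of common ancestry.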

Bioinformatics - Methods and Applications – Genomics, Proteomics and Drug Discovery – Rastogi – Mendiratta - PHI
SEQUENCE SIMILARITY

• Within a protein sequence, some regions will be more conserved than others. The more conserved a region is, the more important it is likely to be:
  • for function
  • for 3D structure
  • for localization
  • for modification
  • for interaction
  • for regulation/control
  • for transcriptional regulation (in DNA)

REASONS TO PERFORM SEQUENCE SIMILARITY SEARCHES
SEQUENCE SIMILARITY - TERMS
• Homologous: similar due to common ancestry
• Analogous: similar due to convergent evolution
• Orthologous: homologous genes separated by speciation (in different species), usually with conserved function
• Paralogous: homologous genes separated by duplication (commonly within the same species), often with different function

Bioinformatics - Methods and Applications – Genomics, Proteomics and Drug Discovery – Rastogi – Mendiratta - PHI
SEQUENCE SIMILARITY - TERMS

• Xenologous: homologous due to horizontal gene transfer
  • HGT (horizontal gene transfer): transfer of genetic material to an organism that is not its offspring
  • VGT (vertical gene transfer): transfer of genetic material from its ancestor (mitosis) [VGT is not related to xenologous]
• Ohnologous: paralogs that originated by whole-genome duplication
• Gametologous: homologous genes on non-recombining opposite sex chromosomes

Bioinformatics - Methods and Applications – Genomics, Proteomics and Drug Discovery – Rastogi – Mendiratta - PHI
Wikipedia
SEQUENCE SIMILARITY – EVOLUTIONARY RELATIONSHIP
Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press
SEQUENCE SIMILARITY – ORIGINS OF GENES
[Figure labels]
• a1-S1 and a1-S2 are orthologous
• a2-S1 and a2-S2 are orthologous
• a1 and a2 are paralogous
• Analogous genes: same function, different origin
• Xenologous
Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press
SEQUENCE SIMILARITY – TYPES OF MODIFICATION

Original sequence: …ACCAGTGTGCCGTACA…

Mutations occur during evolution by
• Insertions: …ACCAGTaGTGCCGTACA…
• Deletions (GTG removed): …ACCAGTCCGTACA…
• Substitutions: …ACCAGTGCGCCGTACA…
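The three mutation types can be reproduced with plain string slicing. This is only an illustrative sketch: the helper names and the 0-based positions are assumptions chosen to match the sequences shown on the slide.

```python
original = "ACCAGTGTGCCGTACA"


def insert(seq: str, pos: int, bases: str) -> str:
    """Insert `bases` before 0-based position `pos`."""
    return seq[:pos] + bases + seq[pos:]


def delete(seq: str, start: int, length: int) -> str:
    """Remove `length` bases starting at 0-based position `start`."""
    return seq[:start] + seq[start + length:]


def substitute(seq: str, pos: int, base: str) -> str:
    """Replace the base at 0-based position `pos`."""
    return seq[:pos] + base + seq[pos + 1:]


inserted = insert(original, 6, "a")          # lowercase marks the new base
deleted = delete(original, 6, 3)             # removes GTG
substituted = substitute(original, 7, "C")   # T -> C

for name, seq in [("insertion", inserted), ("deletion", deleted),
                  ("substitution", substituted)]:
    print(f"{name:13s} {seq}")
```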
SIMILARITY AND DISTANCE BETWEEN SEQUENCES

• SIMILARITY is the maximal SUM of WEIGHTS for the conserved residues
  • More useful for database searching
• DISTANCE is the minimal SUM of WEIGHTS for a set of mutations transforming one sequence into the other
  • More useful for phylogenetic tree reconstruction
• Both are opposite and interconvertible concepts
• WEIGHT accounts for different roles of mutation events, AA residue similarity, etc.
  • e.g. synonymous mutations are weighted differently from nonsense mutations
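As a rough illustration of "distance as a minimal sum of weights", the sketch below computes a weighted edit distance. The uniform weights (`w_sub`, `w_indel`) are assumptions; a realistic scheme would weight each mutation type differently, as noted above.

```python
def edit_distance(source: str, target: str, w_sub: int = 1, w_indel: int = 1) -> int:
    """Minimal total weight of substitutions and indels turning source into target."""
    # previous[j] = distance between the processed prefix of source and target[:j]
    previous = [j * w_indel for j in range(len(target) + 1)]
    for i, s_char in enumerate(source, start=1):
        current = [i * w_indel]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (0 if s_char == t_char else w_sub)
            deletion = previous[j] + w_indel
            insertion = current[j - 1] + w_indel
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


original = "ACCAGTGTGCCGTACA"
print(edit_distance(original, "ACCAGTCCGTACA"))      # deletion of GTG
print(edit_distance(original, "ACCAGTGCGCCGTACA"))   # one substitution
```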
Bioinformatics - Methods and Applications – Genomics, Proteomics and Drug Discovery – Rastogi – Mendiratta - PHI
SEQUENCE ALIGNMENT

• Procedure for comparing two (pair-wise alignment) or more (multiple sequence alignment) sequences by searching for similar patterns that are in the same order in the sequences
• Overall similarity
• Identical residues (nt or aa) are placed in the same column
• Non-identical residues can be placed in the same column or indicated as gaps
Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press
Wikipedia, http://www-personal.umich.edu/~lpt/fgf/fgfrcomp.htm
SEQUENCE ALIGNMENT
• GLOBAL: procedure applied to the entire sequence to include as many matches as possible up to the end of the sequence
• Methods
  • Brute force – impractical
  • Dot matrix – graphical, easy to understand
  • Dynamic programming – the most accurate
  • Heuristic methods – fast, not so accurate
    • Word k-tuple – database searching – BLAST
Bioinformatics - Methods and Applications – Genomics, Proteomics and Drug Discovery – Rastogi – Mendiratta - PHI
Wikipedia
GLOBAL AND LOCAL ALIGNMENTS

• Proteins are MODULAR
  • Patterns formed by exchange of whole EXONS
• Example:
  • F12: Coagulation Factor XII
  • PLAT: Tissue-type plasminogen activator
[Figure legend] F1/2 - Fibronectins; E - Epidermal Growth Factors; K - "Kringle" domain
A practical guide to the analysis of genes and proteins – Baxevanis – Ouellette – Wiley 2Ed.
GLOBAL ALIGNMENT METHODS DO NOT CONSIDER THESE ISSUES
LOCAL ALIGNMENT
GLOBAL AND LOCAL ALIGNMENTS
Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press
LOCAL ALIGNMENT
• Alignment stops at the end of regions of identity or strong similarity
• Much higher priority is given to finding these local regions than to extending the alignment

A practical guide to the analysis of genes and proteins – Baxevanis – Ouellette – Wiley 2Ed.
DOT-MATRIX METHOD
• Primary method for comparing sequences
• Provides a global and local overview of similarity
• Useful for finding direct or inverted repeats
• Useful for finding self-complementary RNA regions
• Programs: DNA Strider, DOTTER, GCG-DOTPLOT, DOTLET
• http://myhits.isb-sib.ch/cgi-bin/dotlet
Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press
DOT-MATRIX METHOD

• Align the aa sequence "DOROTHYHODGKIN" vs "DOROTHYCROWFOOTHODGKIN"
Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press
DOT-MATRIX METHOD – EX 1
• WINDOW SIZE = 11
• STRINGENCY = 7 (how many residues in the window must be identical)
• window: …ACCAGTGTGCCGTACA…
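The window/stringency filter can be sketched as follows: a dot is recorded at (i, j) when the two windows starting at those positions share at least `stringency` identical residues. The function name and the returned list of coordinates are assumptions for illustration, not how any of the listed programs is implemented.

```python
def dot_matrix(seq_a: str, seq_b: str, window: int = 11, stringency: int = 7):
    """Return (i, j) start positions whose windows share >= stringency identities."""
    dots = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(
                1 for k in range(window) if seq_a[i + k] == seq_b[j + k]
            )
            if matches >= stringency:
                dots.append((i, j))
    return dots


# The two "sequences" used in the dot-matrix slides, with an exact 7-residue window
print(dot_matrix("DOROTHYHODGKIN", "DOROTHYCROWFOOTHODGKIN", window=7, stringency=7))
```

The two dots correspond to the shared blocks DOROTHY and HODGKIN; lowering the stringency relative to the window size admits imperfect matches at the cost of more background dots.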
Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press
DOT-MATRIX METHOD – EX 2
A practical guide to the analysis of genes and proteins – Baxevanis – Ouellette – Wiley 2Ed.
DOT-MATRIX METHOD – EX 3 -REPEATS
Figure 3.6. Dot matrix analysis of the human LDL receptor against itself using DNA Strider, vers. 1.3, on a Macintosh
Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press
DOT-MATRIX METHOD – PROGRAMS
(you could use PubMed also)
Bioinformatics for Dummies – Claviere – Notredame – Wiley - 2nd Ed. 2007
DOT-MATRIX EXAMPLES

http://hits.isbsib.ch/util/dotlet/doc/dotlet_examples.html

http://myhits.isb-sib.ch/cgi-bin/dotlet
DYNAMIC PROGRAMMING METHOD
• Provides the very best (optimal) alignment in a very reasonable amount of time
• Several parameters though
  • Global: Needleman-Wunsch
  • Local: Smith-Waterman
• There is a method for statistical significance
  • Provides a p-value for obtaining the alignment by chance from unrelated sequences
• Results depend on the scoring system
DYN.PROG.METHOD - SCORING

• Results depend on the scoring system – SCORING MATRICES
  • Depending on Pair-wise
  • Gap Penalties
• DNA alignments require a similar scoring system
DYNAMIC PROGRAMMING METHOD
[Figure: scoring matrix with row index i and column index j]
• Gap penalties from the scoring matrix
• x, y are the "radius"
Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press
DYNAMIC PROGRAMMING EXAMPLE
Scoring: s(a,b) = 2 if a = b; s(a,b) = 0 if a <> b
Gap penalties: W(x=1) = 1, W(x=2) = 1, …; W(y=1) = 1, …
Radius: X = 1, Y = 1

        gap     A       C       G       G       A       T       A       T
gap     0       -1      -1      -1      -1      -1      -1      -1      -1
G       -1      0       -1      (d)+1   (d)+1   (l)0    (ld)-1  (d)-1   (d)-1
G       -1      -1      0       (d)+1   (d)+3   (l)+2   (l)+1   (l)0    (ld)-1
C       -1      -1      (d)+1   (ldu)0  (u)+2   (d)+3   (ld)+2  (ld)+1  (ld)0
T       -1      (d)-1   (u)0    (d)+1   (u)+1   (ud)+2  (d)+5   (l)+4   (ld)+3
A       -1      (d)+1   (l)0    (d)0    (d)+1   (d)+3   (u)+4   (d)+7   (l)+6

(d) = diagonal, (l) = left, (u) = up: the neighbor(s) that give the maximum.
Worked cell (first row G, column A): Max(0 + 0 = 0, -1 - 1 = -2, -1 - 1 = -2) = 0

Resulting alignment:
ACGGATAT
--GGCTA-
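The example can be checked with a short global-alignment sketch. It assumes the conventions read off the slide: match = 2, mismatch = 0, a penalty of 1 per gap step with radius 1 (only the immediate neighbors are inspected), and a first row and column fixed at -1. A textbook Needleman-Wunsch would instead accumulate the penalty along the borders; the function name and traceback tie-breaking (diagonal, then up, then left) are also assumptions.

```python
def global_align(seq_cols: str, seq_rows: str, match=2, mismatch=0, gap=1):
    """Fill the score matrix and trace back one optimal global alignment."""
    n_rows, n_cols = len(seq_rows) + 1, len(seq_cols) + 1
    score = [[0] * n_cols for _ in range(n_rows)]
    # Border convention taken from the slide: a flat -gap, not cumulative
    for j in range(1, n_cols):
        score[0][j] = -gap
    for i in range(1, n_rows):
        score[i][0] = -gap

    def s(i, j):
        return match if seq_rows[i - 1] == seq_cols[j - 1] else mismatch

    for i in range(1, n_rows):
        for j in range(1, n_cols):
            score[i][j] = max(score[i - 1][j - 1] + s(i, j),   # diagonal
                              score[i - 1][j] - gap,           # up
                              score[i][j - 1] - gap)           # left

    top, bottom = [], []
    i, j = n_rows - 1, n_cols - 1
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s(i, j):
            top.append(seq_cols[j - 1]); bottom.append(seq_rows[i - 1])
            i -= 1; j -= 1
        elif i > 0 and (j == 0 or score[i][j] == score[i - 1][j] - gap):
            top.append("-"); bottom.append(seq_rows[i - 1])
            i -= 1
        else:
            top.append(seq_cols[j - 1]); bottom.append("-")
            j -= 1
    return score, "".join(reversed(top)), "".join(reversed(bottom))


matrix, aligned_top, aligned_bottom = global_align("ACGGATAT", "GGCTA")
print(matrix[-1][-1])
print(aligned_top)
print(aligned_bottom)
```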
DYN.PROG.METHOD - SCORING

• Results depend on the scoring system – SCORING MATRICES
  • Depending on Pair-wise
  • Gap Penalties
• The Dayhoff PAM (point accepted mutation) matrix is based on an evolutionary model for proteins
  • One PAM is a unit of evolutionary divergence in which 1% of the amino acids have been changed in very similar sequences
• BLOSUM matrices are designed to identify members of the same family
  • Derived from the BLOCKS database (for distant sequences; BLOcks SUbstitution Matrix)
Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press
DYNAMIC PROGRAMMING - SCORING

• Remember "SUM OF WEIGHTS" for similarity/distance
• PAM250 is the PAM1 matrix multiplied by itself 250 times
• BLOSUM62: sequences more than 62% identical are merged into one
• BLOSUM90 for comparing more similar sequences
• BLOSUM30 for very different sequences
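To see why higher PAM numbers mean repeated matrix multiplication rather than a scalar multiple, here is a toy sketch with a made-up two-letter alphabet. The 2 x 2 matrix is hypothetical (a real PAM1 mutation matrix is 20 x 20 and is later converted to log-odds scores); only the extrapolation step is illustrated.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_power(matrix, exponent):
    """Multiply `matrix` by itself `exponent` times, starting from the identity."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(exponent):
        result = mat_mul(result, matrix)
    return result


# Hypothetical "1 PAM" matrix: each residue has a 1% chance of changing
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_power(toy_pam1, 250)

for row in toy_pam250:
    print([round(value, 4) for value in row])
```

After 250 steps each row is still a probability distribution, and the chance of seeing the original residue has decayed toward the background frequency instead of dropping by "250%".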
DYNAMIC PROGRAMMING METHOD

• Some programs provide alternative alignments, depending on the goal
  • domains
  • structural
  • same family
  • biological function
  • common ancestor
• There are several variations on the original Needleman-Wunsch and Smith-Waterman methods that improve memory usage, CPU time, and other features
DYNAMIC PROGRAMMING METHOD - OUTPUT
Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press
DYNAMIC PROGRAMMING – STATISTICAL SIGNIFICANCE

• To assign a p-value, we could "shuffle" both sequences 100,000 times
  • The proportion of times we obtain a SCORE larger than the score of the real alignment represents the p-value
• Another, quicker method is converting the alignment to BINARY sequences (match or no match)
  • e.g. probability of obtaining HTHTHHHH in a coin toss experiment
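The shuffling idea can be sketched as below. Several choices are assumptions made for illustration: the score is a simple Smith-Waterman local score (match 2, mismatch -1, gap -2), ties count as "at least as good", one is added to numerator and denominator so the estimate is never exactly zero, and the default number of shuffles is far below the 100,000 mentioned above to keep the run short.

```python
import random


def local_score(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score (score only, no traceback)."""
    previous = [0] * (len(seq_b) + 1)
    best = 0
    for char_a in seq_a:
        current = [0]
        for j, char_b in enumerate(seq_b, start=1):
            diagonal = previous[j - 1] + (match if char_a == char_b else mismatch)
            current.append(max(0, diagonal, previous[j] + gap, current[j - 1] + gap))
            best = max(best, current[j])
        previous = current
    return best


def shuffle_p_value(seq_a, seq_b, n_shuffles=1000, seed=0):
    """Estimate how often shuffled sequences score as well as the real pair."""
    rng = random.Random(seed)
    observed = local_score(seq_a, seq_b)
    at_least = 0
    for _ in range(n_shuffles):
        shuffled_a = rng.sample(seq_a, len(seq_a))  # random permutation
        shuffled_b = rng.sample(seq_b, len(seq_b))
        if local_score(shuffled_a, shuffled_b) >= observed:
            at_least += 1
    return (at_least + 1) / (n_shuffles + 1)


# Invented test sequence compared with itself
seq = "ACGTTGCAAGCTTGCAACGTAGCTAGCTAGGATCCA"
print(shuffle_p_value(seq, seq, n_shuffles=200))
```

Shuffling preserves composition and length, so the null distribution reflects what unrelated sequences with the same residue frequencies would score.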
DYNAMIC PROGRAMMING – STATISTICAL SIGNIFICANCE
• Two random sequences of length m and n and p = prob. of match
  • Expected length of the longest match = $\log_{1/p}(mn)$
  • DNA seq. length = 100, p = 0.25 (equal nt frequencies)
    • the longest match = $2 \times \log_4(100) \approx 6.64$
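The arithmetic of the example can be checked directly; the function name is an assumption.

```python
import math


def expected_longest_match(m: int, n: int, p: float) -> float:
    """Expected longest run of matches between two random sequences: log base 1/p of m*n."""
    return math.log(m * n, 1 / p)


# Two random DNA sequences of length 100 with equal nucleotide frequencies
print(round(expected_longest_match(100, 100, 0.25), 2))
```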
More precise formula
DYNAMIC PROGRAMMING – STATISTICAL SIGNIFICANCE

• Simplifying (mean of the highest possible local alignment score)
• k = mismatches, m and n are sequence lengths
• Effective length = n – E(m) (used in BLAST)
ALIGNMENT PROCEDURE OVERVIEW
WORD K-TUPLE METHOD - BLAST

• Search a database for sequences that share at least W identical residues
• For a sequence of length L, the number of "internal searches" is L-W+1
• All "potential" sequences are then "extended" using the Dynamic Programming Method
• A statistical significance score is estimated representing the number of expected similar sequences in the database (E value, analogous to a p-value for the entire database)
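A minimal sketch of the word (k-tuple) step, assuming exact word matches only. Function names are hypothetical, and the real BLAST programs add refinements (for proteins, neighborhood words scored against a threshold) that are not shown here.

```python
def query_words(query: str, word_size: int):
    """All overlapping words of length W; there are L - W + 1 of them."""
    return [query[i:i + word_size] for i in range(len(query) - word_size + 1)]


def seed_hits(query: str, subject: str, word_size: int):
    """(query_pos, subject_pos) pairs where a query word occurs exactly in the subject."""
    index = {}
    for pos in range(len(subject) - word_size + 1):
        index.setdefault(subject[pos:pos + word_size], []).append(pos)
    hits = []
    for q_pos, word in enumerate(query_words(query, word_size)):
        for s_pos in index.get(word, []):
            hits.append((q_pos, s_pos))
    return hits


print(len(query_words("ACCAGTGTGCCGTACA", 11)))   # L = 16, W = 11
print(seed_hits("ACGTAC", "TTACGTTT", 4))
```

Each hit would then serve as the starting point for the extension step described above.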
BLAST

• Pi – random residue probability
• Sij – from the score matrix
• Score → S = sum(Pi Pj Sij)
• Transformation (for statistical comparisons, expressed in bits):
  $S' = \dfrac{\lambda S - \ln K}{\ln 2}$
• Expected number of matches with a score of at least S' (lengths: query = m, database = n):
  $E = m \, n \, 2^{-S'}$
• Example: m = 250, n = 50,000,000, to achieve E = 0.05
  • S' = 38 bits
  • $S = \dfrac{38 \ln 2 + \ln K}{\lambda} \Rightarrow S = 76.6$
  (for the ungapped version: λu = 0.3176 and Ku = 0.134)
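The numbers in the example can be reproduced with a few lines. This sketch assumes the standard relations E = m·n·2^(-S') and S' = (λS - ln K)/ln 2, with the ungapped parameters quoted above; the function names are hypothetical.

```python
import math


def bit_score_for_e_value(m: int, n: int, e_value: float) -> float:
    """Bit score S' needed to reach a given E value: S' = log2(m * n / E)."""
    return math.log2(m * n / e_value)


def raw_score(bit_score: float, lam: float, k: float) -> float:
    """Invert S' = (lam * S - ln K) / ln 2 to recover the raw score S."""
    return (bit_score * math.log(2) + math.log(k)) / lam


def e_value(m: int, n: int, bit_score: float) -> float:
    """Expected number of chance matches with a bit score of at least S'."""
    return m * n * 2 ** (-bit_score)


needed_bits = bit_score_for_e_value(250, 50_000_000, 0.05)
print(round(needed_bits, 2))                         # rounds up to 38 bits
print(round(raw_score(38, lam=0.3176, k=0.134), 1))  # raw score for 38 bits
```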