Introduction to Bioinformatics
2002
Universidad Nacional San Cristóbal de Huamanga, Ayacucho
Mirko Zimic
Topics of interest in bioinformatics
• Sequence analysis
• Phylogeny and molecular evolution
• Molecular modeling
• Protein folding
• Genomics and proteomics
• Statistical genetics
• Microarrays
• Scientific programming
Let's take an example…
Cysteine protease of Fasciola hepatica: in search of an immunogenic peptide
Alignment: mammalian cysteine proteases vs. the cysteine protease of Fasciola hepatica.
Identical AA
Divergent AA
VPKSVDWREKGYVTPVKNQGQCGSCWAFSATGALEGQMFRKTGR ISLSEQNLVDCSRPQGN
AVPDKIDWRESGYVTEVKDQGNCGSCWAFSTTGTMEGQYM KNERTSISFSEQQLVDCSRPWGN
_____RED_________
QGCNGGLMDNAFQYIKENGGLDSEESYPYEATDTSCNY KPEYSVANDTGFVDIPQREKA LMK
NGCGGGLMENAYQYLKQF GLETESSYPYTAVGGQCRYNKQLG VAKVTGYYTV QSGSEVEL KN
_VIOLET____ _YELLOW_______
AVATVGPISVAIDAGHSFQFYKSGIYYEPDCSSKDLDHGVLVVGYGFEG TDSNNNKYW IVKNSW
LIGSEGPSAVAVDVESDFMMYRSGIYQSQTCSPLRVNHAVLAVGYGTQGGTD
YW IVKNSW
_____
_GREEN_____
GPEWGM-GYVKMAKDRNNH CGIATAASYPTV
GLSWGERGYIRMV RNRGNMCGIASLASLPMVARFP
Discontinuous epitope, formed by distant portions of the sequence.
Denaturation
The epitope is lost upon denaturation.
Continuous epitope, formed by a single portion of the sequence.
Denaturation
The epitope is preserved as such.
Three-dimensional homology modeling. 56% sequence identity with chymopapain (1YAL)
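A sequence identity figure like the one above comes from counting matching columns in a pairwise alignment. The sketch below is one common way to compute it, assuming the two sequences are already aligned and that gapped columns are excluded from the denominator (other conventions exist). The function name `percent_identity` and the toy fragments are illustrative assumptions, not the sequences used for the model.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of ungapped alignment columns in which both residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns with a gap in either sequence
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Toy aligned fragments (assumed for illustration): 6 of 8 ungapped columns match.
print(percent_identity("CGSCWAF-S", "CGSCYAFTT"))
```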
Surface analysis: atoms displayed by van der Waals radius
Identical AA
Divergent AA
Selection of sequences that are (1) divergent, (2) solvent-accessible, and (3) continuous.
TMEGQYMKNERTSISFS
YYTVQSGSEVELK
NLIGSE
QSQTCSPLRVN
RYNKQLGVAKV
Evaluation of the conformational stability of the peptides by energy minimization.
H2O
TMEGQYMKNERTSISFS
“backbone”
YYTVQSGSEVELKNLIGSE
Let's take another example…
Sensitivity of the HIV-1 aspartyl protease to the most common inhibitors
Cartoon representation of the HIV-1 protease enzyme
HIV PROTEASE MONOMER
HIV-1 protease showing the secondary structure elements, flaps, and active site
HIV-1 protease indicating the consensus residues for inhibitor-enzyme binding
INDINAVIR
RITONAVIR
Binding of indinavir to the HIV-1 protease
Mutant HIV-1 protease modeled in complex with ritonavir
COMPARISON BETWEEN A RITONAVIR-SENSITIVE AND A RITONAVIR-RESISTANT ENZYME
One more example…
Phylogenetic ordering and GC content in trypanosomatids
Reported %GC variation for each codon position in Trypanosomatids (Alonso et al., 1992)
[Figure: %GC at the 1st, 2nd, and 3rd codon positions plotted against %GC of total DNA for Crithidia, Leishmania, T. cruzi, and T. brucei]
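As a rough illustration of the quantity plotted in this kind of figure, the sketch below computes %GC at each codon position. It assumes the input is an in-frame coding sequence starting at the first base of a codon; the function name `gc_by_codon_position` and the toy sequence are assumptions made for this example, not part of the cited study.

```python
def gc_by_codon_position(cds: str) -> dict:
    """Return %GC at codon positions 1, 2, 3 and overall for an in-frame CDS."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    n_codons = usable // 3
    if n_codons == 0:
        raise ValueError("sequence is shorter than one codon")
    counts = [0, 0, 0]
    for i in range(usable):
        if cds[i] in "GC":
            counts[i % 3] += 1  # i % 3 is the 0-based position within the codon
    result = {f"P{p + 1}": 100.0 * counts[p] / n_codons for p in range(3)}
    result["total"] = 100.0 * sum(counts) / usable
    return result


# Toy sequence (assumed): ATG GCC CTG TAA
print(gc_by_codon_position("ATGGCCCTGTAA"))
```

In real use the counts would be pooled over many genes per organism before comparing species.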
Codon usage in Trypanosomatids: leucine
[Figure: bar chart of the usage of leucine codons CTG, CTC, CTT, TTG, CTA, and TTA in T. brucei, T. cruzi, Leishmania, and Crithidia]
Codon usage in Trypanosomatids: serine
[Figure: bar chart of the usage of serine codons TCG, AGC, TCC, TCT, TCA, and AGT in T. brucei, T. cruzi, Leishmania, and Crithidia]
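The codon usage charts can be approximated by tallying synonymous codons in coding sequences. The following is a minimal sketch under the assumption of an in-frame DNA coding sequence; `codon_usage` and the toy sequence are hypothetical names and data chosen for illustration. The codon lists are the six leucine and six serine codons of the standard genetic code.

```python
from collections import Counter

LEUCINE_CODONS = ("CTG", "CTC", "CTT", "TTG", "CTA", "TTA")
SERINE_CODONS = ("TCG", "AGC", "TCC", "TCT", "TCA", "AGT")


def codon_usage(cds: str, codons: tuple) -> dict:
    """Percentage use of each codon among the given synonymous codons."""
    cds = cds.upper()
    # Step through the sequence codon by codon; a trailing partial codon is ignored.
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    total = sum(counts[c] for c in codons)
    if total == 0:
        return {c: 0.0 for c in codons}
    return {c: 100.0 * counts[c] / total for c in codons}


# Toy sequence (assumed): CTG CTG CTC TTA AGC
usage = codon_usage("CTGCTGCTCTTAAGC", LEUCINE_CODONS)
print(usage)
```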
Phylogeny of Trypanosomatid lineage
(Maslov & Simpson)
“Hole” formation by DNA
replication
GC content variation in time
Restriction: AA family conservation
and AA conservation
%GC variation in Trypanosomatid lineage (Nuclear coding DNA)
[Figure: %GC (scale 0.00 to 1.00) at codon positions P1, P2, P3, and P3* for Crithidia, Leishmania, T. cruzi, and T. brucei]
GC variation in trypanosomatidae lineage
Nuclear DNA
I. Human Genome Project
The genome sequence is almost complete!
– approximately 3.5 billion base pairs.
All the Genes
• Any human gene can now be found in the
genome by similarity searching with over
90% certainty.
• However, the sequence still has many
gaps
– one is unlikely to find a complete and
uninterrupted genomic segment for any gene
– still can’t identify pseudogenes with certainty
• This will improve as more sequence data
accumulates
Raw Genome Data:
The next step is obviously to locate all of
the genes and describe their functions.
This will probably take another 15-20
years!
…A few years ago…
Celera claimed that there would be only 30,000 genes
– so why are there 60,000 human genes on
Affymetrix GeneChips?
– Why does GenBank have 49,000 gene
coding sequences and UniGene have 89,000
clusters of unique ESTs?
• Clearly we are in desperate need of a
theoretical framework to go with all of this
data
Implications for Biomedicine
• Physicians will use genetic information
to diagnose and treat disease.
– Virtually all medical conditions (other than
trauma) have a genetic component.
• Faster drug development research
– Individualized drugs
– Gene therapy
• All Biologists will use gene sequence
information in their daily work
II. Bioinformatics Challenges
The huge dataset
• Lots of new sequences being added
- automated sequencers
- Human Genome Project
- EST sequencing
• GenBank has over 10 billion bases and is doubling every year!!
(problem of exponential growth...)
• How can computers keep up?
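To make the exponential growth concrete, here is a small back-of-the-envelope sketch. It simply extrapolates the slide's own figures (10 billion bases, doubling every year); the function name `projected_bases` is an assumption, and the numbers are an illustration of the arithmetic rather than a forecast.

```python
def projected_bases(start_bases: float, years: float, doubling_time_years: float = 1.0) -> float:
    """Size after `years` if the database doubles every `doubling_time_years`."""
    return start_bases * 2 ** (years / doubling_time_years)


# Starting from 10 billion bases and doubling yearly (figures taken from the slide).
for year in range(6):
    print(year, f"{projected_bases(10e9, year):.2e}")
```

Under these assumptions the database grows 32-fold in five years, which is why search tools rely on heuristics.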
New Types of Biological Data
• Microarrays - gene expression
• Multi-level maps: genetic, physical,
sequence, annotation
• Networks of Protein-protein interactions
• Cross-species relationships
– Homologous genes
– Chromosome organization
Similarity Searching the Databanks
• What is similar to my sequence?
• Searching gets harder as the databases get bigger - and quality degrades
• Tools: BLAST and FASTA = time saving heuristics (approximate)
• Statistics + informed judgement of the biologist
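The time saving in heuristic search tools comes largely from indexing short words instead of aligning the query against everything. The sketch below illustrates only that word-seeding idea, with exact matches and no extension or statistics; it is not the actual BLAST or FASTA algorithm. The function names and the toy database are assumptions for illustration.

```python
from collections import defaultdict


def build_word_index(database: dict, k: int) -> dict:
    """Map each k-letter word to the (sequence name, offset) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i))
    return index


def seed_hits(query: str, index: dict, k: int) -> dict:
    """Count exact k-word matches between the query and each database sequence."""
    hits = defaultdict(int)
    for i in range(len(query) - k + 1):
        for name, _offset in index.get(query[i:i + k], []):
            hits[name] += 1
    return dict(hits)


# Toy database (assumed): two related protein fragments and one unrelated one.
toy_db = {
    "mammal_fragment": "NQGQCGSCWAFSATGALEG",
    "fasciola_fragment": "DQGNCGSCWAFSTTGTMEG",
    "unrelated": "MKTAYIAKQRQISFVKSHFSRQ",
}
index = build_word_index(toy_db, k=4)
print(seed_hits("CGSCWAFS", index, k=4))
```

Sequences with many seed hits would then be passed to a slower, more careful alignment step.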
Alignment
• Alignment is the basis for finding similarity
• Pairwise alignment = dynamic programming
• Multiple alignment: protein families and functional domains
• Multiple alignment is "impossible" for lots of sequences
• Another heuristic - progressive pairwise alignment
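For readers who want to see what "dynamic programming" means for pairwise alignment, here is a minimal sketch of global alignment in the style of Needleman-Wunsch. The scoring values (match +1, mismatch -1, linear gap -2) are arbitrary assumptions; real tools use substitution matrices and affine gap penalties.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Global pairwise alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = needleman_wunsch("GATTACA", "GATACA")
print(best)
print(top)
print(bottom)
```

The table has (n+1) x (m+1) cells, so time and memory grow with the product of the sequence lengths, which is why the same exact approach becomes impractical for many sequences at once.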
Sample Multiple Alignment
Structure-Function Relationships
• Can we predict the function of protein molecules from their sequence?
sequence > structure > function
• Conserved functional domains = motifs
• Prediction of some simple 3-D structures (α-helix, β-sheet, membrane spanning, etc.)
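Short conserved motifs are often searched for with simple patterns. The sketch below uses a regular expression for that purpose. The pattern is a hypothetical, illustrative one (an assumption, not quoted from any motif database): Q, any three residues, G or E, any residue, C, then Y or W. The test fragments are toy data.

```python
import re

# Hypothetical illustrative pattern (assumption): Q-x(3)-[GE]-x-C-[YW]
ACTIVE_SITE_PATTERN = re.compile(r"Q.{3}[GE].C[YW]")


def find_motif(protein: str, pattern: re.Pattern = ACTIVE_SITE_PATTERN) -> list:
    """Return (start, matched_text) for every non-overlapping match of the pattern."""
    return [(m.start(), m.group()) for m in pattern.finditer(protein.upper())]


# Toy protein fragments (assumed for illustration).
print(find_motif("DQGNCGSCWAFSTTG"))
print(find_motif("MKTAYIAK"))
```

A match only suggests a candidate site; whether the motif is functional still needs structural or experimental support.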
Protein domains
DNA Sequencing
• Automated sequencers > 40 kb per day
• 500 bp reads must be assembled into complete genes
- errors, especially insertions and deletions
- error rate is highest at the ends, where we want to overlap the reads
- vector sequences must be removed from ends
• Faster sequencing relies on better software
- overlapping deletions vs. shotgun approaches: TIGR
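Assembling reads starts with finding where the end of one read overlaps the start of another. The sketch below shows that step with exact matching only, which is a deliberate simplification: as noted above, real reads contain errors near their ends, so practical assemblers tolerate mismatches and gaps. The function names and toy reads are assumptions.

```python
def suffix_prefix_overlap(left: str, right: str, min_length: int = 3) -> int:
    """Length of the longest suffix of `left` equal to a prefix of `right`.

    Returns 0 if no overlap of at least `min_length` bases exists.
    """
    if min_length < 1:
        raise ValueError("min_length must be at least 1")
    max_len = min(len(left), len(right))
    for length in range(max_len, min_length - 1, -1):
        if left[-length:] == right[:length]:
            return length
    return 0


def merge_reads(left: str, right: str, min_length: int = 3) -> str:
    """Join two reads on their exact suffix-prefix overlap."""
    overlap = suffix_prefix_overlap(left, right, min_length)
    if overlap == 0:
        raise ValueError("reads do not overlap enough to merge")
    return left + right[overlap:]


# Toy reads (assumed): they share the 4-base overlap CTTA.
print(suffix_prefix_overlap("ATGGCCTTA", "CTTAGGA"))
print(merge_reads("ATGGCCTTA", "CTTAGGA"))
```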
Finding Genes in Genome Sequence is Not Easy
• About 2% of human DNA encodes
functional genes.
• Genes are interspersed among long
stretches of non-coding DNA.
• Repeats, pseudo-genes, and introns
confound matters
Pattern Finding Tools
• It is possible to use DNA sequence patterns to predict genes:
– promoters
– translational start and stop codons (ORFs)
– intron splice sites
– codon bias
• Can also use similarity to known genes/ESTs
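Of the patterns listed, open reading frames are the simplest to detect. The sketch below scans the three forward reading frames for ATG-to-stop stretches, assuming the standard genetic code's stop codons. It ignores the reverse strand and introns, so it is only a starting point for intron-containing genomes. The function name `find_orfs` and the toy sequence are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3) -> list:
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    `start` is the index of the A in ATG and `end` is exclusive (just past the stop codon).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon since the last stop in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


# Toy sequence (assumed): one ORF, ATG AAA TTT TAG, in frame 2.
sequence = "CCATGAAATTTTAGGG"
for start, end, frame in find_orfs(sequence):
    print(frame, sequence[start:end])
```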
Phylogenetics
• Evolution = mutation of DNA (and protein) sequences
• Can we define evolutionary relationships between organisms by comparing DNA sequences?
- is there one molecular clock?
- phenetic vs. cladistic approaches
- lots of methods and software, what is the "correct" analysis?
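Distance-based methods begin by turning sequence differences into evolutionary distances. The sketch below computes the proportion of differing sites (p-distance) and applies the Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3). It assumes gap-free, already aligned DNA sequences; Jukes-Cantor is just one of many substitution models and is used here only as an example. Function names and toy sequences are assumptions.

```python
import math


def p_distance(a: str, b: str) -> float:
    """Proportion of aligned sites at which the two sequences differ."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned, non-empty, and of equal length")
    diffs = sum(1 for x, y in zip(a, b) if x != y)
    return diffs / len(a)


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor corrected distance for an observed proportion of differences p."""
    if not 0.0 <= p < 0.75:
        raise ValueError("Jukes-Cantor distance is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Toy aligned DNA sequences (assumed): 1 difference in 8 sites.
p = p_distance("ACGTACGT", "ACGTACGA")
print(p, jukes_cantor(p))
```

The corrected distance is always at least as large as p, since it accounts for multiple substitutions at the same site.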
II. The Role of the Biologist in the Information Age
The Internet provides abundant biological information
• It can be overwhelming…
- e-mail
- Web
• Need for new skills = locating the necessary information efficiently
Computing in the lab - everyday tasks (vs. computational biology)
• ordering supplies
• reference books
• lab notes
• literature searching
Training "computer" scientists
• Know the right tool for the job
• Get the job done with tools available
• Network connection is the lifeline of the scientist
• Jobs change, computers change, projects change, scientists need to be adaptable
The job of the biologist is changing
• As more biological information becomes
available …
– The biologist will spend more time using
computers
– The biologist will spend more time on data
analysis (and less doing lab biochemistry)
– Biology will become a more quantitative science
(think how the periodic table and atomic theory
affected chemistry)
III. Molecular Biology
Software Tools
GCG (Wisconsin Package)
• The most popular and most comprehensive set of tools for the molecular biologist.
- Runs on mainframe computers (UNIX)
- Web, X-Windows (SeqLab) interfaces
- Inexpensive for large numbers of users
- Requires local databases (on the mainframe computer)
- Allows for custom databases and programming
The Web
• Many of the best tools are free over the Web
- BLAST
- ENTREZ/PUBMED
- Protein motifs databases
• Bioinformatics “service providers”
- DoubleTwist™, Celera, BioNavigator™
• Hodgepodge collection of other tools
- PCR primer design
- Pairwise and Multiple Alignment
Personal Computer Programs
• Macintosh and Windows applications
- Commercial: Vector NTI™, MacVector™, OMIGA™, Sequencher™
- Freeware: Phylip, Fasta, Clustal, etc.
• Better graphics, easier to use
• Can't access very large databases or perform demanding calculations
• Integration with web databases and computing services
Putting it all together
• The current state of the art requires the biologist to jump around from Web to mainframe to personal computer
• The trend is for integration
– Web + personal computer will replace text interface to mainframe?
– Will the Web become the ultimate interface for all computing?
IV. Genomics
Genomics Technologies
• Automated DNA sequencing
• Automated annotation of sequences
• DNA microarrays
– gene expression (measure RNA levels)
– single nucleotide polymorphisms (SNPs)
• Protein chips (SELDI, etc.)
• Protein-protein interactions
cDNA spotted microarrays
Affymetrix Gene Chips
Impact on Bioinformatics
• Genomics produces high-throughput, high-quality data, and bioinformatics provides
the analysis and interpretation of these
massive data sets.
• It is impossible to separate genomics
laboratory technologies from the
computational tools required for data
analysis.
Pharmacogenomics
• The use of DNA sequence information to
measure and predict the reaction of
individuals to drugs.
• Personalized drugs
• Faster clinical trials
– Selected trial populations
• Fewer drug side effects
– toxicogenomics