Bioinformatics and
Computational Molecular Biology
Geoff Barton
http://www.compbio.dundee.ac.uk
Practical Tutorial
• Dr David Martin: practical tutorial on the
use of PyMOL molecular graphics software.
• In this lecture I will show lots of protein
structures – use www.ebi.ac.uk/msd to find
them, and/or the SCOP domains database (find
with Google).
Similarities in Proteins
• Lecture 1
– Overview of data in molecular biology
– Protein modelling
– Similarities of Protein Sequence, Structure,
Function
Introduction to Sequence
Comparison
• Lecture 2:
– Why compare sequences?
– Methods for sequence comparison/alignment.
– Multiple alignment
– Database searching - FASTA/BLAST
– Iterative searching - PSI-BLAST
Practical/WWW references
• Organised by Drs Martin
– Good preparation would be to look at:
http://www.ebi.ac.uk/Tools and
http://www.ncbi.nlm.nih.gov
– Look at BLAST and FASTA on these sites as
well as database access facilities.
Traditional biological research
• Public Data: Journals. Conferences.
• Private Data: Past Experiments. Lab note books. Group discussions.
• Analysis: Reading. Talking. Thinking.
• Hypothesis!
• Experiment: Design. Execution.
• Publish!
Bioinformatics/Computational Biology and biological research
• Public Data: Journals. Conferences. DNA sequences, Protein Sequences, Genetic maps, Transcripts, 3D structures, proteomics results, SNP data, etc.
• Private Data: Past Experiments. Lab note books. Group discussions. DNA sequences, Protein Sequences, Genetic maps, Transcripts, 3D structures, proteomics results, SNP data, etc.
• Analysis: Reading. Talking. Thinking. Computational Analysis. Software Development.
• Hypothesis! Computer aided.
• Experiment: Design. Execution. Computational experiments. Simulation.
• Publish! Database submission. Database management.
EMBL Nucleotide Sequence Database Growth (to 2nd Oct 2006)
Taken from: www.ebi.ac.uk
Protein Sequences
Approx 3,500,000 known for all
species (Oct. 2006.)
25,000 for Human
(not counting splice variants and
post-translational modifications)
Protein 3D Structures
Approx 39,000 known
(much duplication)
Biological data in context
Overview of Biological Hierarchy... (Molecular Levels)
• Ecosystem: many different organisms
• Population: group of the same type of organism
• Family: group with known common lineage
• Whole organism: animal, plant, etc.
• Tissue/organ: brain, heart, lungs, blood, ...
• Cell: nerve, muscle, etc.
• Organelle: nucleus, mitochondria, etc.
• Nucleus
• Chromosome
• Gene
• DNA
• RNA
• Protein Sequence
• Protein 3D structure
• Molecular function
Technology and data in biology
Expression Data (Transcriptomics)
• Which of the genes are switched on in which cells/tissues and when?
• What are the effects of drugs and disease on expression patterns?
• DNA ‘CHIP’ TECHNOLOGY
Technology and data in biology
Protein Expression Data (Proteomics)
• Which proteins are being produced in which cells/tissues when? Which modified forms are present?
• What are the effects of drugs and disease on these patterns?
• 2D Gels + Mass Spectrometry.
Technology and data in biology
Protein 3D Structure - the bridge to chemistry (Structural Genomics)
• What is the atomic level structure of the protein?
• What other molecules does it interact with?
• What small molecules - potential drugs - does it interact with?
• What are the effects of point mutations on the structure?
• X-ray crystallography, NMR spectroscopy, single particle cryo-electron microscopy.
Overview of Biological Hierarchy... (Macroscopic Levels)
Ecosystem, Population, Family, Whole organism, Tissue/organ, Cell, Organelle, Nucleus, Chromosome, Gene, DNA, RNA, Protein Sequence, Protein 3D structure, Molecular function
Biology is now a data intensive
science
To do good science, you need to
know how to use (and not abuse)
computational tools.
Protein Structure Prediction
• ‘Homology’ modelling
– Relies on the fact that similarity of sequence
implies similarity of 3D structure.
Lysozyme (1lz1) and α-lactalbumin (1alc)
Imagine we don’t know the 3D structure of α-lactalbumin, but we do
know its amino acid sequence and that of lysozyme.
Lysozyme (1lz1) and α-lactalbumin (1alc): 37.7% Identity, Z=17.6
Protein structure prediction
(Homology Modelling)
• Align sequence of protein of unknown
structure to sequence of protein of known
structure.
• In ‘conserved core’ of protein, substitute the
amino acid types into the known structure.
• Deal with ‘loops’ between the core elements
of structure.
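The three steps above can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not a modelling program: the two aligned sequences are made-up toy strings, and the function name `map_target_to_template` and its return format are assumptions introduced here. Aligned columns where both sequences have a residue are candidates for substitution into the known structure; columns with a gap mark insertions or deletions, i.e. the ‘loops’ that need separate treatment.

```python
GAP = "-"

def map_target_to_template(target_aln, template_aln, first_resnum=1):
    """Walk an alignment of target (unknown structure) and template (known structure).

    Returns three lists:
      substitutions: (template residue number, template aa, target aa)
      insertions:    (template residue number it follows, target aa)
      deletions:     template residue numbers with no target residue
    """
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have equal length")
    substitutions, insertions, deletions = [], [], []
    resnum = first_resnum - 1  # number of the last template residue seen
    for t_aa, k_aa in zip(target_aln, template_aln):
        if k_aa != GAP:
            resnum += 1
        if t_aa != GAP and k_aa != GAP:
            substitutions.append((resnum, k_aa, t_aa))
        elif t_aa != GAP:
            insertions.append((resnum, t_aa))  # extra target residue: loop to build
        elif k_aa != GAP:
            deletions.append(resnum)           # template residue to remove

    return substitutions, insertions, deletions

# Toy alignment (hypothetical sequences, for illustration only).
target   = "KV-FGRCE"
template = "KQFT-KCE"
subs, ins, dels = map_target_to_template(target, template)
print(subs)  # aligned positions where side chains can be swapped
print(ins)   # target residues with no template equivalent
print(dels)  # template residues absent from the target
```

In a real model the substitutions would only be trusted inside the conserved core, and the insertions and deletions would be handed to a loop-building step.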
Lysozyme (1lz1) and α-lactalbumin (1alc): 37.7% Identity, Z=17.6
Protein structure prediction
(Homology modelling)
• Problems:
– Need protein of known structure that is similar
in sequence.
– Building loops where there are deletions.
– Verifying model.
• Key is getting a good alignment in the first
place
– Bad alignment => bad model.
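The lysozyme/α-lactalbumin example quotes two numbers for alignment quality: a percentage identity and a Z-score. The sketch below shows one way such numbers can be obtained. The slides do not say how Z=17.6 was computed, so the shuffle test, the identity-based scoring scheme, the linear gap penalty and all function names here are assumptions for illustration.

```python
import random
import statistics

def percent_identity(aln_a, aln_b, gap="-"):
    """Identical pairs as a percentage of columns where both sequences have a residue."""
    pairs = [(a, b) for a, b in zip(aln_a, aln_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)

def global_score(seq_a, seq_b, match=1, mismatch=0, gap=-1):
    """Needleman-Wunsch global alignment score (score only, linear gap penalty)."""
    prev = [j * gap for j in range(len(seq_b) + 1)]
    for i, a in enumerate(seq_a, start=1):
        cur = [i * gap]
        for j, b in enumerate(seq_b, start=1):
            diag = prev[j - 1] + (match if a == b else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def shuffle_z_score(seq_a, seq_b, n_shuffles=100, seed=0):
    """(real score - mean shuffled score) / SD of shuffled scores.

    Returns None if every shuffled score is the same (SD of zero).
    """
    rng = random.Random(seed)
    real = global_score(seq_a, seq_b)
    letters = list(seq_b)
    shuffled = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # same composition, random order
        shuffled.append(global_score(seq_a, "".join(letters)))
    sd = statistics.pstdev(shuffled)
    if sd == 0:
        return None
    return (real - statistics.mean(shuffled)) / sd

# Toy data (hypothetical sequences, for illustration only).
pid = percent_identity("KVFG-RCE", "KQFTK-CE")
z = shuffle_z_score("MKVLAAGIVGLLLAQWERTY", "MKVLAAGIVGLLLAQWERTY")
print(round(pid, 1), z)
```

Percentage identity depends on the denominator chosen (here, columns without gaps), so values from different programs are not always directly comparable.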
Good alignment on its own can:
• Identify key residues (absolutely conserved)
• Identify likely protein core (conserved
hydrophobic residues)
• Help predict protein secondary structure
(not this lecture).
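As a rough illustration of the first two points, the sketch below scans the columns of a multiple alignment and reports absolutely conserved positions and positions where every residue is hydrophobic. The alignment is a made-up toy example, and the choice of hydrophobic residues in `HYDROPHOBIC` is an assumption (several definitions are in use).

```python
HYDROPHOBIC = set("AVLIMFWC")  # assumed set; other definitions exist

def classify_columns(alignment, gap="-"):
    """Return (identical, hydrophobic) lists of 1-based column numbers.

    identical:   every sequence has the same residue
    hydrophobic: residues differ but all are in HYDROPHOBIC
    Columns containing a gap are skipped.
    """
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    identical, hydrophobic = [], []
    for pos, column in enumerate(zip(*alignment), start=1):
        residues = set(column)
        if gap in residues:
            continue
        if len(residues) == 1:
            identical.append(pos)
        elif residues <= HYDROPHOBIC:
            hydrophobic.append(pos)
    return identical, hydrophobic

# Toy alignment (hypothetical sequences, for illustration only).
toy_alignment = [
    "GLVCAW",
    "GIVCSW",
    "GMICTW",
]
identical_cols, hydrophobic_cols = classify_columns(toy_alignment)
print(identical_cols)    # candidate key residues
print(hydrophobic_cols)  # candidate core positions
```

Note that an identical hydrophobic column (for example an invariant tryptophan) is reported in the first list only.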
Sequence alignment is a
fundamental technique in
molecular biology.
• May predict proteins of common function
even when no 3D structure is known.
• May be used to predict 3D structure and so
help understanding of mutants.
• Some examples of where this is right and
wrong...
Prediction of structure and function by similarity
to known
sequences and structures
Assumption is that similar sequence implies similar structure
and function.
But what do we mean by “similar”?
Does similarity of sequence really imply similarity of function?
Protein Sequence/Structure/Function Network
• Sequence: Similar or Different
• 3D Structure: Similar or Different
• Function: Similar or Different
Similar Sequence, Similar Structure, Similar Function.
e.g. Trypsin-like Serine Proteinases
Same fold, same catalytic mechanism.
But DIFFERENT specificity.
e.g. Immunoglobulin variable domains.
Same fold, similar binding function.
But DIFFERENT specificity.
True of all examples. Similarities only give clues to function,
differences in specificity can be regarded as differences of function.
Immunoglobulin
Variable Domains
e.g. see: 1a2y
Tryptophan at core of Ig variable domain
Protein Sequence/Structure/Function Network
Lysozyme (1lz1) and α-lactalbumin (1alc): 37.7% Identity, Z=17.6
ε-crystallin / L-Lactate Dehydrogenase
Protein Sequence/Structure/Function Network
Trypsin (3ptn) and Subtilisin (2sec)
Catalytic triad:
– Trypsin (3ptn): His-57, Asp-102, Ser-195
– Subtilisin (2sec): Asp-32, His-64, Ser-221
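The residue numbers listed for trypsin and subtilisin already show that the same three residue types occur in a different order along the two chains. A minimal sketch that makes the comparison explicit (the dictionary layout and the function name `sequence_order` are illustrative choices, the numbers are those given above):

```python
# Catalytic residues and their positions, as listed for each structure.
TRIADS = {
    "Trypsin (3ptn)": {"His": 57, "Asp": 102, "Ser": 195},
    "Subtilisin (2sec)": {"Asp": 32, "His": 64, "Ser": 221},
}

def sequence_order(triad):
    """Residue types sorted by their position in the sequence."""
    return [name for name, number in sorted(triad.items(), key=lambda kv: kv[1])]

orders = {protein: sequence_order(triad) for protein, triad in TRIADS.items()}
for protein, order in orders.items():
    print(protein, "-".join(order))
```

The differing order (His-Asp-Ser versus Asp-His-Ser) is consistent with the two enzymes having different sequences and folds while sharing a catalytic mechanism.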
Protein Sequence/Structure/Function Network
PDB: 1b47 (Nature 398, 84-90, 1999)
11% sequence ID, rmsd 1.47Å over 70 residues
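The rmsd quoted here is the root-mean-square distance between equivalent atoms of two structures. The sketch below computes it for coordinates that are assumed to be already superposed and listed in equivalent order; finding the optimal superposition (for example with the Kabsch algorithm) is a separate step not shown. The coordinates are made-up toy values, not taken from 1b47.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z).

    Assumes the two structures are already superposed and that
    coords_a[i] is the structural equivalent of coords_b[i].
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Toy coordinates in Angstroms (hypothetical, for illustration only).
structure_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
structure_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (2.0, 0.0, 1.0)]
value = rmsd(structure_a, structure_b)
print(value)
```

An rmsd is only meaningful together with the number of residues it was computed over, which is why the slide gives both.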
Protein Sequence/Structure/Function Network
PDB: 1bia
PDB: 2ptk
Russell, R. B. and Barton, G. J. (1993), "An SH2-SH3 Domain hybrid", Nature, 364, 765.
PDB:2aai
PDB:1bas
Matthews, S., et al. (1994),
"The p17 Matrix Protein from HIV-1 is Structurally Similar to Interferon-gamma", Nature, 370, 666-668.
Protein Sequence/Structure/Function Network
Does this ever happen?
HIV Reverse Transcriptase (RT)
HIV Reverse Transcriptase (RT) - domain linkers
Protein Sequence and Structural Similarity
• Type: Homologous (SCOP family)
– Similarity: Similar Structure, Similar Sequence, Similar Function
– Find By: Pair-wise Sequence Comparison (BLAST/FASTA/Smith-Waterman)
• Type: ‘Remote Homologue’ (SCOP superfamily)
– Similarity: Similar Structure, Weakly Similar Sequence, Similar Function
– Find By: Profile, Iterative Search (e.g. PSI-BLAST). Threading/fold recognition?
• Type: Analogue (SCOP fold)
– Similarity: Similar Structure, No sequence similarity, Often no functional similarity
– Find By: Solve BOTH structures by X-ray/NMR methods. Mapping?
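The three rows of this table can also be written down as a small lookup structure. This is simply a restatement of the table for proteins that share a similar 3D structure. The class name, field names and the category strings used as keys are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SimilarityType:
    name: str
    scop_level: str
    sequence: str   # level of sequence similarity
    function: str   # level of functional similarity
    find_by: tuple  # methods listed in the table

TABLE = (
    SimilarityType("Homologous", "family", "similar", "similar",
                   ("pair-wise sequence comparison (BLAST/FASTA/Smith-Waterman)",)),
    SimilarityType("Remote homologue", "superfamily", "weakly similar", "similar",
                   ("profile", "iterative search (e.g. PSI-BLAST)",
                    "threading/fold recognition?")),
    SimilarityType("Analogue", "fold", "none", "often none",
                   ("solve both structures by X-ray/NMR methods", "mapping?")),
)

def classify(sequence_similarity):
    """Return the table row for two proteins of similar structure."""
    for row in TABLE:
        if row.sequence == sequence_similarity:
            return row
    raise ValueError(f"unknown sequence similarity level: {sequence_similarity!r}")

row = classify("weakly similar")
print(row.name, "(SCOP " + row.scop_level + "):", ", ".join(row.find_by))
```

The question marks are kept from the table: they flag methods the slide presents as open questions rather than established routes.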
Protein Sequence and
Structural Similarity
Barton, G. J. et al. (1992),
"Human Platelet Derived Endothelial Cell Growth Factor is Homologous to
E. coli Thymidine Phosphorylase", Prot. Sci., 1, 688-690.
Protein Sequence and
Structural Similarity
Barton, G. J., Cohen, P. T. C. and Barford, D. (1994),
"Conservation Analysis and Structure Prediction of the Protein Serine/Threonine Phosphatases: Sequence
Similarity with Diadenosine Tetra-phosphatase from E. coli Suggests Homology to the Protein Phosphatases",
Eur. J. Biochem., 220, 225-237.
Protein Sequence and
Structural Similarity
Russell, R. B. and Barton, G. J. (1993), "An SH2-SH3 Domain hybrid", Nature, 364, 765.
Reading material for this lecture:
This lecture itself. PDFs for “Barton” papers:
www.compbio.dundee.ac.uk/ftp/pdf/
Database statistics: http://www.ebi.ac.uk/embl/
Structure of the amino-terminal domain of Cbl complexed to its binding site on ZAP-70 kinase
Wuyi Meng, Sansana Sawasdikosol, Steven J. Burakoff, Michael J. Eck
Nature 398, 84 - 90 (04 March 1999)
(available on-line at www.nature.com - search for ZAP-70 kinase - republished in December on-line)
Protein recognition: An SH2 domain in disguise
John Kuriyan, James E. Darnell
Nature 398, 22 - 25 (04 March 1999) (news and views article for above paper)
Russell, R. B. and Barton, G. J. (1993), "An SH2-SH3 Domain hybrid", Nature, 364, 765.
Matthews, S., et al. (1994), "The p17 Matrix Protein from HIV-1 is Structurally Similar to
Interferon-gamma", Nature, 370, 666-668.
Barton, G. J., Cohen, P. T. C. and Barford, D. (1994),
"Conservation Analysis and Structure Prediction of the Protein Serine/Threonine Phosphatases: Sequence
Similarity with Diadenosine Tetra-phosphatase from E. coli Suggests Homology to the Protein Phosphatases",
Eur. J. Biochem., 220, 225-237.
The end of Lecture 1
Lecture 2 will be on sequence
comparison methods.