Bioinformatics
CSC 391/691; PHY 392; BICM 715
Bioinformatics Course, Spring 2004
Importance of bioinformatics
• A more global perspective in experimental
design
• The ability to capitalize on the emerging
technology of database-mining--the
process by which testable hypotheses are
generated regarding the function or
structure of a gene or protein of interest by
identifying similar sequences in better
characterized organisms.
Amino acids: chemical composition or
digital symbols for proteins
http://wbiomed.curtin.edu.au/teach/biochem/tutorials/AAs/AA.html
Link found on the Research Collaboratory for Structural
Biology web site: www.rcsb.org/pdb/education.html
See also Table 2.2 (Mount)
Nucleotides: chemical composition or
digital symbols for nucleic acids
http://ndbserver.rutgers.edu/NDB/archives/NAintro/
http://www.web-books.com/MoBio/Free/Ch3A.htm
Link found on the Research Collaboratory for Structural
Biology web site: www.rcsb.org/pdb/education.html
See also Table 2.1 (Mount)
The Genetic Code: how DNA
nucleotides encode protein amino acids
http://www.accessexcellence.org/AB/GG/genetic.html
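As a small illustration of how nucleotide triplets map to amino acids, the sketch below builds the standard genetic code from the conventional TCAG codon ordering and translates a short sequence. The function name `translate` and the example sequence are made up for illustration; only reading frame 1 is considered, and translation stops at the first stop codon.

```python
# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate a coding sequence (frame 1) up to the first stop codon."""
    dna = dna.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical example: ATG GCC AAG TGG TAA
print(translate("ATGGCCAAGTGGTAA"))
```

The table has 64 entries for 20 amino acids plus stop, which is why several codons encode the same amino acid.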
Biologists think it’s a lot of data, but
maybe it’s really not
He made fun of biologists for complaining that the human genome, which
takes up about 3 gigabytes, is "a lot of data". He offered the comparison
of the DVD movie "Evita", which is about 12 gigabytes, with the genome
of Madonna (3 gigabytes). "The movie contains four times more
information than Madonna's genome. And Madonna shares 99% of her
DNA with a chimp... And 90% with Craig Venter’s dog."
More proof that the genome is not a lot of data: About 90-something
percent of genetic information is common to all humans. "The unique
part of you will fit on a floppy disk."
Nathan Myhrvold, former Chief Technology Officer for Microsoft
Keynote Speech at NIH Digital Biology Meeting 2003
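The arithmetic behind the quote can be checked roughly. The sketch below is a back-of-the-envelope estimate under stated assumptions: about 3 billion bases, one byte per base for the 3-gigabyte figure, 2 bits per base for a compact encoding, 0.1% of bases differing between people, and a 1.44 MB floppy disk. None of these inputs come from the speech itself, and the estimate ignores the cost of recording where the differences are.

```python
# Back-of-the-envelope genome storage estimate (all inputs are assumptions).
GENOME_BASES = 3_000_000_000   # approximate human genome length
UNIQUE_FRACTION = 0.001        # assumed fraction of bases that differ between people
FLOPPY_BYTES = 1_440_000       # nominal 1.44 MB floppy disk


def storage_bytes(n_bases, bits_per_base):
    """Bytes needed to store n_bases at a fixed number of bits per base."""
    return n_bases * bits_per_base / 8


one_byte_per_base = storage_bytes(GENOME_BASES, 8)  # A, C, G, T stored as text
two_bits_per_base = storage_bytes(GENOME_BASES, 2)  # four symbols fit in 2 bits
unique_part = storage_bytes(GENOME_BASES * UNIQUE_FRACTION, 2)

print(f"1 byte per base: {one_byte_per_base / 1e9:.2f} GB")
print(f"2 bits per base: {two_bits_per_base / 1e9:.2f} GB")
print(f"unique part:     {unique_part / 1e6:.2f} MB")
print("fits on a floppy:", unique_part < FLOPPY_BYTES)
```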
Review of Lab 1
• What did you learn about the sites you
visited: SGD, SwissProt, EntrezRefSeq,
EntrezNeighbor, EntrezProtein, PIR-US?
• Can you define the term protein function?
• Does the term gene function have any
meaning?
• Questions?
Biologists think it’s a lot of data,
and maybe it really is
• The genome is not a static, one-time
picture
• Genome changes over time—mutations
and other changes
• Genes expressed to make proteins
– Set of genes that are expressed changes with
cell type
– Set of genes that are expressed changes over
time and state
Definition of a Biological Database
A biological database is a large, organized
body of persistent data, usually associated
with computerized software designed to
update, query, and retrieve components of
the data stored within the system.
Sources of sequence data
1. GenBank at the National Center for Biotechnology Information, National Library of
Medicine, Bethesda, MD (nucleotides and proteins)
http://www.ncbi.nlm.nih.gov/Entrez
2. European Molecular Biology Laboratory (EMBL) Outstation at Hinxton, England
http://www.ebi.ac.uk/embl/index.html
3. DNA DataBank of Japan (DDBJ) at Mishima, Japan http://www.ddbj.nig.ac.jp/
4. Protein Information Resource (PIR) database at the National Biomedical Research
Foundation in Washington, DC (see Barker et al. 1998) http://www-nbrf.georgetown.edu/pirwww/
5. The SwissProt protein sequence database at ISREC, Swiss Institute for
Experimental Cancer Research in Epalinges/Lausanne http://www.expasy.ch/cgi-bin/sprot-search-de
6. The Sequence Retrieval System (SRS) at the European Bioinformatics Institute
allows both simple and complex concurrent searches of one or more sequence
databases. The SRS system may also be used on a local machine to assist in the
preparation of local sequence databases. http://srs6.ebi.ac.uk
Table 2.5. Mount
Sources of protein structure data
• RCSB Protein Data Bank (PDB):
www.rcsb.org
• BioMagResBank:
http://www.bmrb.wisc.edu/
• MMDB:
http://www.ncbi.nlm.nih.gov/Structure/MMDB/mmdb.shtml
Review of Lab 2
• What did you learn about the RCSB web page?
• What are your thoughts about the PDB file format?
• Was RasMol easy or hard to use? Is there anything you
tried to do, but couldn’t figure out how?
• What is the difference between the two glutaredoxin
structures (1aaz and 1die)?
• MMDB: database of protein structures, ASN.1 format
(http://www.ncbi.nlm.nih.gov/Structure/MMDB/mmdb.shtml)
• Other questions?
Levels of protein structure
• Primary structure
• Secondary structure
• (Super secondary structure)
• Tertiary structure
• Quaternary structure
Databases of protein structure
classification
• SCOP: Murzin A. G., Brenner S. E., Hubbard T., Chothia
C. (1995). J. Mol. Biol. 247, 536-540.
• CATH: Orengo, C.A., Michie, A.D., Jones, S., Jones,
D.T., Swindells, M.B., and Thornton, J.M. (1997) Structure, Vol 5,
No 8, p. 1093-1108.
http://www.biochem.ucl.ac.uk/bsm/cath/
• Dali: L. Holm and C. Sander (1996) Science 273:595-602.
http://www.bioinfo.biocenter.helsinki.fi:8080/dali/index.html
• VAST: S. H. Bryant and C. Hogue.
http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml
RNA Structure
• Primary structure: sequence of GACU nucleotides
• Secondary structure: stem-loop structures
• Tertiary structure
• http://www.rnabase.org/
DNA structure
• Primary structure: sequence of GACT nucleotides
• Secondary structure: double helix
• Higher levels of structure… nucleosome… chromatin… chromosome
An example of pairwise alignment
(A) ./wwwtmp/lalign/.17728.1.seq Glutaredoxin, T4, 1AAZ.pdb - 87 aa
(B) ./wwwtmp/lalign/.17728.2.seq Unknown protein - 93 aa
using matrix file: BL50, gap penalties: -14/-4
27.0% identity in 89 aa overlap; score: 101 E(10,000): 0.0014
Glutar KVYGYDSNIHKCVYCDNAKRLLTVKKQPFEFINIMPEKGV---FDD--EKIAELLTKLGR
Unknow EIYGIPEDVAKCSGCISAIRLCFEKGYDYEIIPVLKKANNQLGFDYILEKFDECKARANM

Glutar DTQIGLTMPQVFAPDGSHIGGFDQLREYF
Unknow QTR-PTSFPRIFV-DGQYIGSLKQFKDLY
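The 27.0% identity reported in the LALIGN header can be recomputed from the two aligned rows. The sketch below counts identical residue pairs over all aligned columns, gaps included in the column count; the function name `percent_identity` is chosen for illustration, and the rows are copied from the output above with gaps written as "-".

```python
# Aligned rows from the LALIGN output; '-' marks a gap.
GLUTAR = (
    "KVYGYDSNIHKCVYCDNAKRLLTVKKQPFEFINIMPEKGV---FDD--EKIAELLTKLGR"
    "DTQIGLTMPQVFAPDGSHIGGFDQLREYF"
)
UNKNOW = (
    "EIYGIPEDVAKCSGCISAIRLCFEKGYDYEIIPVLKKANNQLGFDYILEKFDECKARANM"
    "QTR-PTSFPRIFV-DGQYIGSLKQFKDLY"
)


def percent_identity(row_a, row_b):
    """Return (identical pairs, aligned columns, percent identity)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    identical = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
    columns = len(row_a)
    return identical, columns, 100.0 * identical / columns


identical, columns, percent = percent_identity(GLUTAR, UNKNOW)
print(f"{identical} identical residues in {columns} columns: {percent:.1f}% identity")
```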
Pairwise Sequence Alignment
• The alignment of two sequences (either
protein or nucleic acid) based on some
algorithm
• What is the “right answer”?
– Align (pairwise) the following words:
instruction, insurrection, incision
• There is NO unique, precise, and
universally applicable method of pairwise
alignment
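One way to see that the "right answer" depends on the method is to align the words with an explicit scoring scheme. The sketch below is a minimal Needleman-Wunsch global alignment with a linear gap penalty; the scores (match +1, mismatch -1, gap -1) and the function name `global_align` are assumptions for illustration, and changing them can change the alignment returned.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a prefix aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        else:
            diag = None
        if diag is not None and score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


total, row_a, row_b = global_align("instruction", "insurrection")
print(total)
print(row_a)
print(row_b)
```

Ties in the traceback are broken in a fixed order here (diagonal first), so other equally scoring alignments may exist.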
Global vs Local Alignment
Figure 3.1, Mount
Pairwise Sequence Alignment Websites
• Bayes block aligner: http://www.wadsworth.org/res&res/bioinfo (Zhu et al. 1998)
• BCM Search Launcher, pairwise sequence alignment: http://searchlauncher.bcm.tmc.edu/seq-search/alignment.html
• SIM, local similarity program for finding alternative alignments: http://www.expasy.ch/tools/sim.html (Huang et al. 1990; Huang and Miller 1991; Pearson and Miller 1992)
• Global alignment programs (GAP, NAP): http://genome.cs.mtu.edu/align/align.html (Huang 1994)
• FASTA program suite: http://fasta.bioch.virginia.edu/fasta/fasta_list.html (Pearson and Miller 1992; Pearson 1996)
• BLAST 2 sequence alignment (BLASTN, BLASTP): http://www.ncbi.nlm.nih.gov/gorf/bl2.html (Altschul et al. 1990)
• LALIGN: http://www.ch.embnet.org/software/LALIGN_form.html (Huang and Miller 1991, Adv. Appl. Math. 12:337-357)
• Likelihood-weighted sequence alignment (lwa): http://stateslab.bioinformatics.med.umich.edu/service/lwa.html
Table 3.1, Mount
What is multiple sequence
alignment?
• Multiple sequence alignment is the
alignment of more than two nucleotide or
protein sequences
• Compare pairwise sequence alignment with
multiple sequence alignment
Issues with multiple sequence
alignment
• Try creating a multiple sequence
alignment of the three words:
– Insurrection
– Incision
– Instruction
Issues with multiple sequence
alignment
• What’s the right answer?
– Candidate 1:
in----cision
insurrection
instr-uction
– Candidate 2:
in----cision
insurrection
ins-truction
– Candidate 3:
inci----sion
insurrection
ins-truction
– Candidate 4:
in-----cision
insurrec-tion
instr-uc-tion
• Computational complexity
• What is a reasonable method for obtaining
a cumulative score?
• Placement and scoring of gaps
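One common answer to the cumulative-score question is the sum-of-pairs score: score every pair of rows in every column and add the results. The sketch below applies it to two candidate alignments of the three words, with gaps written as "-". The scores (match +1, mismatch -1, residue against gap -1, gap against gap 0) are assumptions for illustration, not a scheme taken from the course.

```python
from itertools import combinations


def pair_score(x, y, match=1, mismatch=-1, gap=-1):
    """Score one pair of symbols from the same alignment column."""
    if x == "-" and y == "-":
        return 0                      # two gaps carry no information
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def sum_of_pairs(rows, **scores):
    """Sum-of-pairs score of a multiple alignment given as equal-length rows."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all rows must have the same length")
    total = 0
    for column in zip(*rows):
        for x, y in combinations(column, 2):
            total += pair_score(x, y, **scores)
    return total


candidate_1 = ["in----cision", "insurrection", "instr-uction"]
candidate_2 = ["in----cision", "insurrection", "ins-truction"]

print("candidate 1:", sum_of_pairs(candidate_1))
print("candidate 2:", sum_of_pairs(candidate_2))
```

The number of row pairs grows quadratically with the number of sequences, and finding the best-scoring alignment by exhaustive dynamic programming grows exponentially, which is the computational complexity issue listed above.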
Pairwise sequence alignment: LALIGN
of OVCA2 and DYR_SCHPO (global)
./wwwt MAAQRPLRVLCLAGFRQSERGFREKTGALRKALRGRAELVCLSGPHPVPDPPGPEGARSD
dihydr MS--KPLKVLCLHGWIQSGPVFSKKMGSVQKYLSKYAELHFPTGPVVADEEADPNDEEEK

./wwwt FGSCPPEEQPRGWWFSEQEADVFSALEEPAVCRGLEESLGMVAQALNRLGPFDGLLGFSQ
dihydr KRLAALGGEQNGGKFGWFEVEDFKN-----TYGSWDESLECINQYMQEKGPFDGLIGFSQ

./wwwt GAALAALVCALGQAGDPRFPL---P--RFILLVSSFCPRGIGFKESILQRPLSLPSLHVF
dihydr GAGIGAMLAQMLQPGQPPNPYVQHPPFKFVVFVGGFRAEKPEF-DHFYNPKLTTPSLHIA

./wwwt GDTDKVIPSQESVQLASQFPGAITLTHSGGHFIPA-------------AAP---------
dihydr GTSDTLVPLARSKQLVERCENAHVLLHPGQHIVPQQAVYKTGIRDFMFSAPTKEPTKHPR
19.2% sequence identity; score -413
Multiple sequence alignment
What is multiple sequence
alignment used for?
• Consensus sequences: which residues can be
used to identify other members of the family?
• Gene and protein families: which residues are
functionally important; functional families
• Relationships and phylogenies: contains
evolutionary “history” of sequences
• Data underlying some protein structure
prediction algorithms
• Genome sequencing: sequence random,
overlapping fragments; automation of assembly
(in this case, there is a RIGHT answer)
Consensus sequences and
important functional residues
Baxter, et al, Mol Cell Prot 2003
Relationships and phylogenies
• Serine-threonine
protein phosphatases
• Same biochemical
function
• Clustering clearly
shows PP1, PP2A
and PP2B families
• What is different
about these families?
Fetrow, Siew, Skolnick, FASEB J, 1999
Possible redox site in PP1 family
Only a clustering, not a true phylogenetic tree
Methods to solve computational
complexity
• Progressive global alignment
• Iterative methods
• Alignments based on locally conserved
patterns
• Statistical methods and probabilistic
models
Multiple Sequence Alignment: Global
• CLUSTALW or CLUSTALX (latter has graphical interface): FTP to ftp.ebi.ac.uk/pub/software (Thompson et al. 1994a, 1997; Higgins et al. 1996)
• MSA: http://www.psc.edu/ and http://www.ibc.wustl.edu/ibc/msa.html; FTP to fastlink.nih.gov/pub/msa (Lipman et al. 1989; Gupta et al. 1995)
• PRALINE: http://mathbio.nimr.mrc.ac.uk/~jhering/praline/ (Heringa 1999)
Table 4.1, Mount
Multiple Sequence Alignment: Iterative
• DIALIGN segment alignment: http://www.gsf.de/biodv/dialign.html (Morgenstern et al. 1996)
• MultAlin: http://protein.toulouse.inra.fr/multalin.html (Corpet 1988)
• Parallel PRRN progressive global alignment: http://prrn.ims.u-tokyo.ac.jp/ (Gotoh 1996)
• SAGA genetic algorithm: http://igs-server.cnrs-mrs.fr/~cnotred/Projects_home_page/saga_home_page.html (Notredame and Higgins 1996)
Table 4.1, Mount
Multiple Sequence Alignment: Local
• Aligned Segment Statistical Evaluation Tool (Asset): FTP to ncbi.nlm.nih.gov/pub/neuwald/asset (Neuwald and Green 1994)
• BLOCKS Web site: http://blocks.fhcrc.org/blocks/ (Henikoff and Henikoff 1991, 1992)
• eMOTIF Web server: http://dna.Stanford.EDU/emotif/ (Nevill-Manning et al. 1998)
• GIBBS, the Gibbs sampler statistical method: FTP to ncbi.nlm.nih.gov/pub/neuwald/gibbs9_95/ (Lawrence et al. 1993; Liu et al. 1995; Neuwald et al. 1995)
• HMMER hidden Markov model software: http://hmmer.wustl.edu/ (Eddy 1998)
• MACAW, a workbench for multiple alignment construction and analysis: FTP to ncbi.nlm.nih.gov/pub/macaw/ (Schuler et al. 1991)
• MEME Web site, expectation maximization method: http://meme.sdsc.edu/meme/website/ (Bailey and Elkan 1995; Grundy et al. 1996, 1997; Bailey and Gribskov 1998)
• Profile analysis at UCSD: http://www.sdsc.edu/projects/profile/ (Gribskov and Veretnik 1996)
• SAM hidden Markov model Web site: http://www.cse.ucsc.edu/research/compbio/sam.html (Krogh et al. 1994; Hughey and Krogh 1996)
Table 4.1, Mount
Methods to solve computational
complexity
• Progressive global alignment
– Start with most related sequences
– Problem: errors in the initial alignments are
propagated
• Iterative methods
– Iterative alignment of subgroup of sequences to find
“best”; then align subgroups
• Alignments based on locally conserved patterns
– Block analysis
• Statistical methods and probabilistic models
– Expectation maximization; Gibbs sampler; Hidden
Markov Models
Profile Methods
• Perform a global multiple sequence
alignment on a group of sequences
• Extract more highly conserved regions
• Profile = scoring matrix for these highly
conserved regions
• Used to search unknown sequences for
membership in the family
Figures 4.11 (p. 162) and 4.12 (p. 166-167)
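The steps above can be sketched on a toy example: take an ungapped, conserved region of an alignment, turn the column counts into a log-odds scoring matrix, and slide that matrix along an unknown sequence. Everything here is an illustrative assumption: the four-residue block, the query sequence, the uniform background, the pseudocount of 1, and the names `build_profile`, `score`, and `best_window`. Real profile methods use weighted sequences and substitution-matrix-based pseudocounts.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(block, alphabet=ALPHABET, pseudocount=1.0):
    """One dict per column: log2 odds of each residue versus a uniform background."""
    background = 1.0 / len(alphabet)
    profile = []
    for column in zip(*block):
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        profile.append({
            residue: math.log2(((counts[residue] + pseudocount) / total) / background)
            for residue in alphabet
        })
    return profile


def score(profile, segment):
    """Score a segment of the same width as the profile."""
    return sum(column[residue] for column, residue in zip(profile, segment))


def best_window(profile, sequence):
    """Return (best score, start index) over all windows of the sequence."""
    width = len(profile)
    return max(
        (score(profile, sequence[start:start + width]), start)
        for start in range(len(sequence) - width + 1)
    )


# Hypothetical conserved block (a CxxC-like motif) and query sequence.
block = ["CPYC", "CGHC", "CPHC", "CAYC"]
profile = build_profile(block)
best_score, start = best_window(profile, "MKVLCPHCAA")
print(f"best window starts at index {start} with score {best_score:.2f}")
```

Because the profile is built only from the sequences in the block, a biased or unrepresentative block produces a biased profile.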
Limitations of such profiles
• Limited by sequences in original msa:
– Sequence bias (too many of one type of
sequence)
– Sequences in msa not representative of entire
family
Blocks
• Blocks are conserved regions of msa (like
profiles) but no gaps allowed
• Servers for producing Blocks:
– Blocks server
– eMotif server
• Block libraries for database searching
– Blocks (Henikoff and Henikoff)
– Prosite (Bairoch)
– Prints (Attwood)
Blocks that might be extracted from an msa
Baxter, et al, Mol Cell Prot 2003
Database searching
• Identify a new sequence by experimental
methods: what is it?
• Search databases to find similar
sequences
• If “enough similarity”, can say that function
of new sequence is same as known
sequence: function annotation transfer
• What is “enough similarity”?
• What is “function”?
Chapter 7, Mount
Relationships between family
members
• Sequence relationships
between family members
• Not all members of family
have significant
sequence similarity to all
others
• Can be represented by
nodes and edges of a
graph
(Figure: graph whose nodes are the family members A, B, C, D, E, F and Z)
Beware of issues with function
annotation transfer
• Multiple domains
• High sequence
identity, but
functional residues
not conserved
• Sequence repeats
(low complexity
regions)
(Figure labels: Known serine hydrolase; New sequence; Function A; Function B; New; residues S, S, H, D, D, L)
Methods for database searching
• Sequence similarity with query sequence:
FASTA, BLAST (Fig 7.5, p. 305)
• Profile search: ProfileSearch
• Position-specific scoring matrix: MAST
• Iterative alignment (combination of
sequence searching and profile search):
PSI-BLAST
• Patterns: Prosite, Blocks, Prints,
CDD/Impala
Table 7.1, Mount
The problem with speed
• Dynamic programming
– Guaranteed to find optimal answer
– Too slow (number of searches performed and number
of sequences in databases that are searched):
Smith-Waterman dynamic programming algorithm
50X slower than BLAST or FASTA
– Faster hardware has made this approach feasible
• Heuristic methods
– FASTA: searches for short, common patterns shared by the
query and database sequences
– BLAST: similar, but searches for rarer, more
significant patterns
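To make the cost of dynamic programming concrete, here is a minimal Smith-Waterman sketch. It fills a table with one cell per pair of positions in the two sequences, which is the work a database search would repeat for every database entry. The scores (match +2, mismatch -1, gap -2), the linear gap penalty, and the function name `smith_waterman` are assumptions for illustration; real searches use a substitution matrix such as BLOSUM62 and affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment of a and b: returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, so a local alignment can start anywhere.
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Hypothetical sequences that share the segment GATTACA.
print(smith_waterman("AAAGATTACAGGG", "TTTGATTACACCC"))
```

Heuristics such as FASTA and BLAST avoid filling the whole table by first locating short matching words and extending only from those.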
Searches on DNA vs Protein
Sequences
• 20-letter alphabet vs 4-letter alphabet
• Fivefold larger variety of sequence characters in
proteins: easier to detect patterns
• Searches with DNA sequences produce fewer
significant matches
• What if you don’t know reading frame?
• Sometimes must do nucleic acid searches
(searching for similarities in non-coding regions)
Sensitivity vs selectivity
• Sensitivity: method’s ability to find most
members of the protein family
• Selectivity: method’s ability to distinguish
true members from non-members
• Want a method to have high sensitivity
(get all true positives) and high selectivity
(not get false positives)
• Can be a difficult test with biological data
sets: not all true positives are known
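The two measures can be written down for a single database search. In this sketch, sensitivity is taken as TP / (TP + FN) and selectivity as TP / (TP + FP); the second formula is an assumption, since some authors use the term for TN / (TN + FP) instead. The family members and hits are hypothetical identifiers.

```python
def sensitivity(tp, fn):
    """Fraction of true family members that the method found."""
    return tp / (tp + fn)


def selectivity(tp, fp):
    """Fraction of reported hits that are true family members (assumed definition)."""
    return tp / (tp + fp)


# Hypothetical search: five known family members, four reported hits.
family = {"P1", "P2", "P3", "P4", "P5"}
hits = {"P1", "P2", "P3", "X9"}

tp = len(family & hits)   # true positives: members that were found
fp = len(hits - family)   # false positives: hits that are not members
fn = len(family - hits)   # false negatives: members that were missed

print(f"sensitivity = {sensitivity(tp, fn):.2f}")
print(f"selectivity = {selectivity(tp, fp):.2f}")
```

If the list of known family members is incomplete, a hit counted here as a false positive may be an undiscovered true member, which is the difficulty noted above.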
Scoring matrices commonly used
• PAM250: point accepted mutation; Dayhoff, M.,
Schwartz, R. M., and Orcutt, B. C., Atlas of
Protein Sequence and Structure (1978) 5(3):345
• BLOSUM62: blocks amino acid substitution
matrices; Henikoff and Henikoff, Amino acid
substitution matrices from protein blocks. (1992)
Proc. Natl. Acad. Sci. USA 89:10915-10919.
PAM250
– Calculated for families of related proteins (>85%
identity)
– 1 PAM is the amount of evolutionary change that
yields, on average, one substitution in 100 amino acid
residues
– A positive score signifies a common replacement
whereas a negative score signifies an unlikely
replacement
– PAM250 matrix assumes/is optimized for sequences
separated by 250 PAM, i.e. 250 substitutions in 100
amino acids (longer evolutionary time)
BLOSUM62
• BLOSUM matrices are based on local
alignments (“blocks” or conserved amino acid
patterns)
• BLOSUM 62 is a matrix calculated from
comparisons of sequences that are no more than
62% identical (more similar sequences are clustered and counted as one)
• All BLOSUM matrices are based on observed
alignments; they are not extrapolated from
comparisons of closely related proteins
• BLOSUM 62 is the default matrix in BLAST 2.0
Comparison of PAM250 and
BLOSUM62
Less divergent <--------------------> More divergent
BLOSUM80        BLOSUM62        BLOSUM45
PAM1            PAM120          PAM250
The relationship between BLOSUM and PAM substitution matrices. BLOSUM
matrices with higher numbers and PAM matrices with low numbers are both designed for
comparisons of closely related sequences. BLOSUM matrices with low numbers and
PAM matrices with high numbers are designed for comparisons of distantly related
proteins. If distant relatives of the query sequence are specifically being sought, the
matrix can be tailored to that type of search.
Scoring matrices commonly used
• PAM250
– Represents a period of time during which only about
20% of amino acids will remain unchanged
– Shown to be appropriate for searching for sequences
of 17-27% identity
• BLOSUM62
– Matrix calculated from comparisons of sequences
that are no more than 62% identical
– Though it is tailored for comparisons of moderately
distant proteins, it performs well in detecting closer
relationships
• BLOSUM50
– Shown to be better for FASTA searches
Methods for database sequence
searching
• Sequence similarity with query sequence:
FASTA, BLAST
• Profile search: ProfileSearch
• Position-specific scoring matrix: MAST
• Iterative alignment (combination of
sequence searching and profile search):
PSI-BLAST
• Patterns: Prosite, PFAM, CDD/Impala
Review of protein structure
• Primary structure: sequence of amino
acids
• Secondary structure: local segments of
protein structure
• Tertiary structure: three-dimensional
structure of a single protein chain
• Quaternary structure: packing of 2 or
more protein chains
Classification of protein tertiary
structure
• All alpha proteins
• All beta proteins
• Alpha+beta proteins
• Alpha/beta proteins
• Irregular proteins
Classify these proteins: T-cell protein CD8 (1cd8), myoglobin, triose
phosphate isomerase, G-specific endonuclease (1rnb)
Representations of protein
structures
• All atom
• CPK models
• Cartoons (ribbons, etc)
• Topology diagrams
Protein structure databases
• RCSB (PDB): http://www.rcsb.org/pdb
– General repository for all protein coordinate files
• MMDB: http://www.ncbi.nlm.nih.gov/Structure
– NCBI structure database; structures from pdb
– Links to sequence and genome databases
• BioMagResBank: http://www.bmrb.wisc.edu/
– General repository for NMR structure data
Alignment of protein structure
• Superposition of protein 3D structures
• Used in searching for structural similarity
and grouping proteins into “fold families”
• Structural similarity is common and does
not necessarily indicate an evolutionary
relationship (different from sequence
similarity)
Structure Alignment: A difficult
problem
• Alignment of atom positions in 3D space
• Pieces of proteins may align
– What is significant and what is not? (Is alignment of two helices significant?)
• Alignment of topology or secondary structure packing can give different answers
Easy example (Eidhammer and Jonassen)
More difficult examples: http://www.sbg.bio.ic.ac.uk/people/rob/sf/sf.html
Structure alignment used to classify (group)
protein structures
• SCOP (Structural Classification Of Proteins; http://scop.mrc-lmb.cam.ac.uk/scop/)
– Class (all alpha, all beta, alpha+beta, alpha/beta), family, superfamily, fold
– Reflects structural and evolutionary relationships
– Mostly done by “hand” (expert analysis)
• CATH (classification by class, architecture, topology and homology;
http://www.biochem.ucl.ac.uk/bsm/cath)
– Class (all alpha, all beta, alpha+/beta), architecture, fold, superfamily, family
– Uses SSAP structure alignment program
• FSSP (fold classification based on structure-structure protein alignment;
http://www.bioinfo.biocenter.helsinki.fi:8080/dali/index.html)
– Based on pairwise alignment of all non-redundant proteins in PDB
– Divides proteins into structures and domains: represents unique configuration of
secondary structure elements
– Uses Dali structure alignment program
• MMDB (molecular modeling database;
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure)
– Proteins classified into structurally related groups by VAST, based on arrangements of
secondary structures
– Groupings of all PDB structures
• SARF (spatial arrangement of backbone fragments; http://123d.ncifcrf.gov/)
Web sites for structure alignment
• VAST: http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml
– NCBI structure comparison
– Comparison of orientations of secondary structures (vector representation of
secondary structures)
– Approach from graph theory
• Dali: http://www.ebi.ac.uk/dali/
– FSSP structure comparison
– Protein represented as distance matrix between alpha carbons
– Monte Carlo simulation to do random search for sub-distance-matrices
• SSAP: http://www.biochem.ucl.ac.uk/cgi-bin/cath/GetSsapRasmol.pl
– CATH structure comparison
– Set structure environment for each residue, then align residue by residue using
double dynamic programming
– Structure environment can use beta carbon vectors or phi/psi backbone dihedral
angles
• Others: Lots, such as Structal (Gerstein and Levitt); Minarea (Falicov and
Cohen); Lock (Singh and Brutlag)
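The distance-matrix representation mentioned for Dali is easy to sketch: it records the distance between every pair of alpha carbons, so it does not change when the whole structure is moved or rotated. The coordinates below are hypothetical, and this shows only the representation, not Dali's search for matching sub-matrices.

```python
import math


def distance_matrix(coords):
    """All-against-all distances between points given as (x, y, z) tuples."""
    return [[math.dist(p, q) for q in coords] for p in coords]


# Hypothetical alpha-carbon coordinates (in angstroms) for four residues.
ca_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 1.0)]

# The same structure rotated 90 degrees about the z axis and then translated.
moved = [(-y + 10.0, x - 5.0, z + 2.0) for x, y, z in ca_coords]

original = distance_matrix(ca_coords)
after_move = distance_matrix(moved)

same = all(
    abs(original[i][j] - after_move[i][j]) < 1e-9
    for i in range(len(ca_coords))
    for j in range(len(ca_coords))
)
print("distance matrix unchanged by rotation and translation:", same)
```

Comparing two proteins then becomes a search for similar sub-matrices, with no need to superimpose the structures first.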
Protein Structure Prediction
• Goal is to understand the relationship
between the primary amino acid sequence
and the structure of the protein
• Relationship between sequence and
structure is not simple and is not
understood
• “Protein folding problem” remains
unsolved
Protein Structure Prediction
• Secondary structure prediction: unsolved?
• Tertiary structure prediction: unsolved
problem (CASP competition)
• Quaternary structure prediction: unsolved
problem
– “Docking” of two subunits
Secondary structure prediction
• Prediction of three classes of secondary
structure: helix, strand, “coil”
– Solved problem? 70-80% “correct
predictions”
– Methods (web sites) can give very different
answers
• Prediction of non-regular secondary
structure (loops and turns) not as
successful
Secondary structure prediction
• Method development
– Frequencies of the residue types found in each
secondary structure
– Frequencies calculated from database of known
structures (training set)
• Method evaluation
– Test method on proteins whose structures are known
(testing set)
• Training and testing sets must not be the same
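The development and evaluation steps above can be sketched with a single-residue statistic. The code computes a propensity for each residue type in each state from a training set, predicts each residue of a separate testing set as its highest-propensity state, and reports the percentage correct. The sequences, the state strings, and the names `train`, `predict`, and `percent_correct` are all hypothetical; this is a simplification in the spirit of first-generation methods, not a reimplementation of any published one.

```python
from collections import Counter


def train(examples):
    """examples: list of (sequence, states) pairs, states drawn from H, E, C."""
    pair_counts, residue_counts, state_counts = Counter(), Counter(), Counter()
    for sequence, states in examples:
        for residue, state in zip(sequence, states):
            pair_counts[residue, state] += 1
            residue_counts[residue] += 1
            state_counts[state] += 1
    total = sum(state_counts.values())
    # Propensity: how much more often a residue is in a state than expected.
    return {
        (residue, state): (n / residue_counts[residue]) / (state_counts[state] / total)
        for (residue, state), n in pair_counts.items()
    }


def predict(propensity, sequence, default="C"):
    """Assign each residue its highest-propensity state (no window, no rules)."""
    prediction = []
    for residue in sequence:
        scores = {state: propensity.get((residue, state), 0.0) for state in "HEC"}
        best = max(scores, key=scores.get)
        prediction.append(best if scores[best] > 0 else default)
    return "".join(prediction)


def percent_correct(predicted, actual):
    return 100.0 * sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Toy data: training and testing sets do not share any sequence.
training_set = [("AEELLKK", "HHHHHHH"), ("VIVTYG", "EEEEEC"), ("GPNGS", "CCCCC")]
testing_set = [("AELVIG", "HHHEEC")]

propensity = train(training_set)
for sequence, actual in testing_set:
    predicted = predict(propensity, sequence)
    print(sequence, predicted, f"{percent_correct(predicted, actual):.0f}% correct")
```

The toy data are far too small and too clean to say anything about real accuracy; the point is only that the statistics come from one set of proteins and the score from another.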
Secondary structure prediction
methods and references
Method types: single residue statistics; explicit rules; nearest neighbors; neural networks; hidden Markov models
1st generation: Chou/Fasman (’74); GOR I; Lim (’74)
2nd generation: GOR III (’87); Predator (’96); Levin (’86); Nishikawa and Ooi (’86); Yi and Lander (’93); Qian and Sejnowski (’88); Holley and Karplus (’89); Yi and Lander (’93); Asai/Handa (’93)
3rd generation: GOR IV; DSC (Prof) (’96); NNssp (’95); NNssp (’95); PHD (’93); Jnet (’99); PsiPred (’99); PASSML (’98)
See Table 9.7, Mount, for list of servers
GOR IV secondary structure
prediction
• Three state prediction: helix, strand, loop
• Statistics of pair frequencies observed
within a window of 17 amino acid residues
• Based on information theory—sound
statistical basis and no ad hoc rules
• Mean accuracy of 64.4% for a three state
prediction (Q3)
Garnier, Gibrat, Robson; http://abs.cit.nih.gov/index.html
PHD secondary structure prediction
• Three state prediction: helix, strand, loop
• Predicts secondary structure from multiple sequence
alignments
• Three consecutive neural networks (feed forward)
– Raw 3-state prediction for each position, based on alignment
composition in 13 residue window
– Filter 3-state probabilities based on probabilities of flanking
positions in 17-residue window
– Jury network using several raw/filter combinations trained
separately
• Expected average accuracy > 72% for three state
prediction (Q3)
Rost and Sander; http://www.predictprotein.org
Method evaluation: how good is “good”?
• Testing of prediction methods involves
– Applying the method to a set of proteins whose secondary structures
are known experimentally and comparing prediction results to known
results
– Calculating measures of how good the performance is
• Q1 (h, s, or c)
– (number of residues correctly predicted in one state/number of residues
in that state) * 100
• Q3 (h, s, and c):
– (number of residues correctly predicted in each of 3 states/number of all
residues) * 100
• Matthews correlation coefficient (Cs)
– (TpTn - FpFn) / sqrt[(Tp+Fp)(Tn+Fn)(Tp+Fn)(Tn+Fp)]
Num:  ....,....1....,....2....,....3....,....4....,....5....,....6
Res:  MSTKQHSAKDELLYLNKAVVFGSGAFGTALAMVLSKKCREVCVWHMNEEEVRLVNEKREN|
Actu: HHHHHHHHHHHH EE HHHHHHHHHHHHHHH EE HHHHHHHHHHHHHH|
Pred: HHHHHH EEEEE HHHHHHHHHHHH EEEEEE HHHHHHHH |
Pred: HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH|
Method evaluation: how good is “good”?
• Matthews correlation coefficient (Cs)
– (TpTn - FpFn) / sqrt[(Tp+Fp)(Tn+Fn)(Tp+Fn)(Tn+Fp)]
– Where Tp, true positive predictions (method predicts helix, and
residue is in a helix); Tn, true negative prediction (method
predicts “not helix”, and residue is not in a helix); Fp, false
positive prediction (method predicts helix, but residue is not in a
helix); Fn, false negative prediction (method predicts “not helix”,
but residue is in a helix)
Num:  ....,....1....,....2....,....3....,....4....,....5....,....6
Res:  MSTKQHSAKDELLYLNKAVVFGSGAFGTALAMVLSKKCREVCVWHMNEEEVRLVNEKREN|
Actu: HHHHHHHHHHHH EE HHHHHHHHHHHHHHH EE HHHHHHHHHHHHHH|
Pred: HHHHHH EEEEE HHHHHHHHHHHH EEEEEE HHHHHHHH |
Q1 (helix)=(4+12+8)/(12+15+14)*100=58.5%
Q3=(4+12+8+2+2)/60*100=47%
Tp=4+12+8=24; Tn=9+8=17
Fp=2; Fn=8+1+2+5+1=17
Ch=[(24*17)-(2*17)]/sqrt[(24+2)(17+17)(24+17)(17+2)]
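The helix numbers above can be plugged into the two formulas directly. The sketch below uses the counts from the worked example (Tp=24, Tn=17, Fp=2, Fn=17, and 41 actual helix residues); the function names `q_state` and `matthews` are chosen for illustration.

```python
import math


def q_state(correct_in_state, total_in_state):
    """Per-state accuracy: percent of residues in a state that were predicted correctly."""
    return 100.0 * correct_in_state / total_in_state


def matthews(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 if the denominator is zero."""
    denominator = math.sqrt((tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


tp, tn, fp, fn = 24, 17, 2, 17   # helix counts from the worked example

print(f"Q1 (helix) = {q_state(tp, tp + fn):.1f}%")
print(f"Ch = {matthews(tp, tn, fp, fn):.2f}")
```

Unlike a per-state accuracy, the correlation coefficient penalizes over-prediction: a method that calls every residue helix has Tn = Fn = 0, so the denominator is zero and the measure carries no credit for the 100% helix accuracy.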
Tertiary Structure Prediction
• Homology modeling: identifiable
sequence similarity
• Fold recognition (“threading;” Table 9.8 for
server list)
• “Ab initio” methods
Homology modeling
• Sequence alignment
• Side chain modeling
• Modeling insertions and deletions
• Optimizing the model
• Model evaluation
• Repeat?
Fold Recognition (“threading”)
• Template identification/sequence
alignment/alignment optimization
• Side chain modeling
• Modeling insertions and deletions
• Optimizing the model
• Model evaluation
• Repeat?
Ab initio methods: folding “from scratch”
• Start with unfolded protein or random conformation
• Use atomic-level forces, solve energetic equations
• Identify most stable conformation (lowest free energy)
• Computational demands high: for a protein of 100 amino acids
– Assume constant bond lengths and angles
– Allow 2 of the 3 backbone torsion angles per amino acid (φ and ψ) to rotate
– Do not allow side chain torsion angles to move
– Assuming 10 allowed conformations per residue, must explore 10^100 conformations
– Calculation of 10^100 energies (one for each conformation) is not feasible
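To put the size of this search space in perspective, the short sketch below works through the arithmetic. The evaluation rate (10^12 energies per second) and the age of the universe (about 4.3 x 10^17 s) are assumptions introduced here for illustration; they do not come from the slide.

```python
# Assumptions (not from the slide): one energy evaluation takes 1e-12 s,
# and the age of the universe is taken as roughly 4.3e17 s.
residues = 100
conformations_per_residue = 10
evals_per_second = 10**12
universe_age_s = 4.3e17

n_conformations = conformations_per_residue ** residues   # exact integer, 10**100
seconds_needed = n_conformations / evals_per_second        # about 1e88 s
universe_ages = seconds_needed / universe_age_s

print(f"conformations: 10^{len(str(n_conformations)) - 1}")
print(f"time needed: {seconds_needed:.1e} s (~{universe_ages:.1e} universe ages)")
```

Even with a far faster evaluation rate, exhaustive enumeration stays out of reach, which motivates simplifying or sampling the search space.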
Ab initio methods: simplifications
• Lattice models to simplify the
conformational search space
• Monte Carlo statistical sampling of
conformational space
• Stepwise processes:
– Predict regular secondary structures
– Pack secondary structures to form tertiary
structures
• Others…
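The Monte Carlo idea can be illustrated with a minimal Metropolis sketch. Everything here is a toy assumption: each residue takes one of 10 discrete states, and the "energy" simply counts neighbouring residues in different states. It is not a physical force field; it only shows how random moves are accepted or rejected so that low-energy conformations are found without enumerating all of them.

```python
import math
import random

def toy_energy(conf):
    """Toy energy (an assumption, not a force field): each pair of
    neighbouring residues in different states costs one unit."""
    return sum(1 for a, b in zip(conf, conf[1:]) if a != b)

def metropolis(n_residues=20, n_states=10, steps=20000, kT=0.5, seed=0):
    rng = random.Random(seed)
    conf = [rng.randrange(n_states) for _ in range(n_residues)]
    energy = toy_energy(conf)
    best_conf, best_energy = conf[:], energy
    for _ in range(steps):
        i = rng.randrange(n_residues)
        old = conf[i]
        conf[i] = rng.randrange(n_states)       # propose a random change
        new_energy = toy_energy(conf)
        delta = new_energy - energy
        # Metropolis rule: always accept downhill, sometimes accept uphill.
        if delta <= 0 or rng.random() < math.exp(-delta / kT):
            energy = new_energy
            if energy < best_energy:
                best_conf, best_energy = conf[:], energy
        else:
            conf[i] = old                       # reject: restore the old state
    return best_conf, best_energy

best_conf, best_energy = metropolis()
print(best_energy, best_conf)
```

With 20 residues and 10 states there are 10^20 conformations, yet the sketch evaluates only 20,000 of them.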
Review of Definitions
• Cell: fundamental working unit of biology
• DNA: encodes all information to create cells and allow
them to function
– Linear arrangement of bases (AGTC)
• Genome: organism’s complete set of DNA
• Chromosome: physically distinct molecules of DNA
– Genomes can be composed of 1, 2 or more chromosomes
• Gene: basic physical and functional unit of heredity
– Linear arrangement of bases along the chromosome
– Contains the instructions for making a protein
– (Remember genetic code?)
Genomes and proteomes
• Genome: Sum of all genes and intergenic DNA
sequences in a cell
– The smallest known genome for a free-living organism
(a bacterium) contains about 600,000 DNA base pairs
– Human and mouse genomes have about 3 billion base pairs
– “relatively” unchanging from cell to cell
• Proteome: The entire set of proteins encoded in
the genome of an organism and produced by
that organism
– Constellation of proteins in cells is highly dynamic
The Human Genome
• 24 distinct chromosomes (22 autosomes plus X and Y)
• Chromosomes range in size from 50 million to 250 million base pairs
• Total size of the human genome is over 3 billion base pairs (3.1647
billion)
– 99.9% of all bases are the same in all people
• Genes comprise only 2% of the total genome
– Human genome is estimated to contain 30,000 to 40,000 genes
– Average gene size is about 3000 bases
– Largest identified so far is 2.4 million bases (dystrophin)
– Functions for less than 50% of genes and gene products are known
• Remainder of genome is non-coding regions
– Chromosomal structural integrity
– Repetitive sequences
– Regulation of protein production
– Other functions that we don’t know about
Human Genome Sequencing
Project Goals
• Determine the sequences of the 3 billion chemical base
pairs that make up human DNA
• Identify all the approximately 30,000 genes in human
DNA
• Store this information in databases
• Improve tools for data analysis
• Transfer related technologies to the private sector
• Address the ethical, legal, and social issues that may
arise from the project
Human Genome Project (DOE):
http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml
NIH:
http://www.ncbi.nlm.nih.gov/genome/guide/human/
Other sequencing projects
• Over 200 genomes sequenced
• Range of archaeal, bacterial, and eukaryotic genomes
– Organisms that have been well-studied in the
laboratory
– Organisms that are pathogenic to humans
– Organisms of special scientific or technical
interest
NCBI list of sequenced genomes (NIH):
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome
Prokaryotes and eukaryotes
• Prokaryotes (bacteria and archaea)
– No true nucleus
– DNA generally circular (one chromosome)
• Eukaryotes
– True nucleus contains (most) DNA
– DNA linear and arranged in chromosomes
Phylogenetic analysis of small subunit ribosomal
RNAs, C. Woese, 1987
Anatomy of a prokaryotic genome
• DNA compact and circular
• ORFs (open reading frames) with start and stop codons
• No introns
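Because prokaryotic genes lack introns, a first pass at locating ORFs can simply scan each reading frame for a start codon followed in-frame by a stop codon. The sketch below is a simplified illustration under stated assumptions: it scans only the forward strand, uses ATG as the only start codon, and the function name `find_orfs` and the example sequence are made up for this purpose.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end, sequence) for ORFs on the forward strand.
    start/end are 0-based, end is exclusive and includes the stop codon."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                        # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None                     # look for the next ORF in this frame
    return orfs

example = "CCATGAAATTTGGGTAACCATGCCCTGA"       # made-up sequence
for orf in sorted(find_orfs(example)):
    print(orf)
```

A real gene finder would also scan the reverse complement and apply a much larger minimum length to reduce chance matches.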
Anatomy of a eukaryotic genome
• Linear DNA; chromosomes
• Centromeres
• Telomeres
• Tandem repeats
• Transposable elements
• Introns
• Pseudogenes
Example of chromosome maps:
http://www.ncbi.nlm.nih.gov/genome/guide/human/
DNA sequencing
• Separate strands of DNA
• Anneal primer to one strand
• Replicate using normal dNTPs plus fluorescently labeled, chain-terminating ddNTPs
• Separate fragments by size
• Image gel for fluorescent labels
See also the electropherogram in Fig. 2.2 (Mount)
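The logic of reading a sequence from size-sorted, labeled fragments can be sketched in a few lines. This is an idealised illustration, not a model of the chemistry: it assumes exactly one terminated fragment per position, ignores the primer, and the function name `sanger_read` and the template are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def sanger_read(template_3to5):
    """Idealised chain-termination read: one terminated fragment per position."""
    # The polymerase builds the complementary strand, 5'->3', along the template.
    new_strand = "".join(COMPLEMENT[b] for b in template_3to5)
    # Each fragment stops at a labeled ddNTP; record (length, label of last base).
    fragments = [(n, new_strand[n - 1]) for n in range(1, len(new_strand) + 1)]
    # Electrophoresis orders the fragments by size.
    fragments.sort()
    # Reading the labels from shortest to longest gives the new strand's sequence.
    return "".join(label for _, label in fragments)

template = "TACGGATC"          # hypothetical template, written 3'->5'
print(sanger_read(template))   # sequence of the synthesised strand, 5'->3'
```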
Methods of genome sequencing
• Mapping method
– Fragment chromosome
– Identify markers and order them
– Arrange fragments, then sequence
• Shotgun method
– Fragment chromosome
– Sequence fragments, then arrange
• cDNA sequencing (ESTs)
– Isolate mRNA (expressed in cell)
– Reverse transcribe mRNA to create cDNA
– Sequence cDNA
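The "sequence fragments, then arrange" step of the shotgun method amounts to finding overlaps between reads and merging them. The sketch below shows a greedy overlap merge on made-up reads. It is a toy under stated assumptions (error-free reads, no repeats, a single strand); the function names are illustrative, and real assemblers are considerably more sophisticated.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))            # drop exact duplicates, keep order
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break                                  # no overlaps left: several contigs
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

reads = ["ATGGCGTAC", "GTACCTGAA", "TGAAGGCTT"]   # made-up overlapping fragments
print(greedy_assemble(reads))
```

Reads that share no sufficiently long overlap are left as separate contigs, which is where the ordered markers of the mapping method help.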
Maps
• Gene map
• Chromosome map
• Sequence map
• Maps important for obtaining sequence information (mapping method)
– Restriction map
– Contig (contiguous clone) map
NCBI map viewer:
http://www.ncbi.nlm.nih.gov/mapview/
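A restriction map records where a restriction enzyme cuts a DNA molecule. The sketch below locates cut sites and fragment lengths for a linear sequence. The defaults model EcoRI, which recognises GAATTC and cuts after the first G; the function name `restriction_fragments` and the example sequence are made up for illustration.

```python
def restriction_fragments(dna, site="GAATTC", cut_offset=1):
    """Cut positions and fragment lengths for a linear DNA molecule.
    Defaults model EcoRI, which cuts G^AATTC (after the first base of the site)."""
    dna = dna.upper()
    cuts = []
    pos = dna.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = dna.find(site, pos + 1)
    edges = [0] + cuts + [len(dna)]
    lengths = [b - a for a, b in zip(edges, edges[1:])]
    return cuts, lengths

dna = "AAAGAATTCCCCCCGAATTCTT"   # made-up sequence with two EcoRI sites
cuts, lengths = restriction_fragments(dna)
print(cuts, lengths)
```

The fragment lengths are what a gel would show after digestion; the cut positions are the map itself.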
Prediction of genes
• Method
• Difference between prokaryotes and
eukaryotes
• Tests for validation of predictions
Genome Analysis
• General approach (p. 492)
• Comparative genomics
– Self-comparison reveals gene families and
duplication
– Between-genome comparison reveals
orthologs, gene families and domains
– Gene ordering on chromosomes
• Phylogenetic analysis
• Genetic diversity