Genome Annotation
What we are going to discuss
• Finding RNA-only genes
• Gene prediction
– Prokaryotes vs. eukaryotes
– Introns and exons
– Transcription signals
– ESTs
• Functional annotation
• Biochemical pathways and subsystems
• Metabolic reconstruction of whole organisms
Genome Overview
• What’s in a genome?
– Protein coding genes.
• In long open reading frames
• ORFs interrupted by introns in eukaryotes
• Take up most of the genome in prokaryotes, but only a small portion of the eukaryotic
genome
– RNA-only genes
• Transfer RNA, ribosomal RNA, snoRNAs (guide ribosomal and transfer RNA
maturation), intron splicing, guiding mRNAs to the membrane for translation, gene
regulation—this is a growing list
– Gene control sequences
• Promoters
• Regulatory elements
– Transposable elements, both active and defective
• DNA transposons and retrotransposons
• Many types and sizes
– Repeated sequences.
• Centromeres and telomeres
• Many with unknown (or no) function
– Unique sequences that have no obvious function
• As a general rule, each part of a genomic sequence has only one function:
protein-coding gene, RNA gene, control signal, transposable element,
repeat sequence, or perhaps no function at all. That is, most sequence
elements overlap only slightly, if at all.
RNA Genes
• The most universal genes, such as tRNA and rRNA, are very
conserved and thus easy to detect. Finding them first removes some
areas of the genome from further consideration.
– One easy approach to finding common RNA genes is just looking for
sequence homology with related species: a BLAST search will find most
of them quite easily
• Functional RNAs are characterized by secondary structure caused
by base pairing within the molecule.
– Determining the folding pattern is a matter of testing many possibilities
to find the one with the minimum free energy, which is the most stable
structure.
• The free energy calculations are in turn based on experiments where short
synthetic RNA molecules are melted
– Related to this is the concept that paired regions (stems) will be
conserved across species lines even if the individual bases aren’t
conserved. That is, if there is an A-U pairing on one species, the same
position might be occupied by a G-C in another species.
• This is an example of compensatory evolution (covariation): a deleterious
mutation at one site is cancelled by a compensating mutation at another site.
RNA Structures
• RNA differs from DNA in having fairly common G-U base pairs. Also, many
functional RNAs have unusual modified bases such as pseudouridine and inosine.
• The pseudoknot, pairing between a loop and a sequence outside its stem, is
especially difficult to detect: it is computationally intense to find, and it
breaks the normal rule that RNA base pairing follows a nested pattern.
– But pseudoknots seem to be fairly rare.
• Essentially, RNA folding programs start with all possible short sequences,
then build to larger ones, adding the contribution of each structural element.
– There is an element of dynamic programming here as well.
– And, “stochastic context-free grammars”, something I really don’t want to
approach right now!
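The "build from short sequences to longer ones" idea can be sketched with the classic Nussinov recursion. This is a deliberate simplification: it maximizes the number of nested base pairs instead of minimizing free energy, so it is not what real folding programs compute. The function names and the `min_loop` parameter are made up for this illustration.

```python
def can_pair(a, b):
    """Watson-Crick pairs plus the G-U wobble pair."""
    return (a, b) in {("A", "U"), ("U", "A"), ("G", "C"),
                      ("C", "G"), ("G", "U"), ("U", "G")}


def max_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in an RNA sequence.

    min_loop is the smallest number of unpaired bases allowed in a hairpin.
    Pseudoknots are not considered, because only nested pairs are counted.
    """
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = maximum pairs in seq[i..j] (inclusive); short spans stay 0
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):        # short subsequences first
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]             # case 1: base j is unpaired
            for k in range(i, j - min_loop):   # case 2: base j pairs with k
                if can_pair(seq[k], seq[j]):
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]
```

For example, `max_pairs("GGGAAAUCCC")` finds the three G-C pairs of a simple hairpin. An energy-based program replaces the "+ 1" with measured stacking and loop energies.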
Finding tRNAs
• tRNAs have a highly conserved structure, with 3 main stem-and-loop
structures that form a cloverleaf structure, and several conserved bases.
Finding such sequences is a matter of looking in the DNA for the proper
features located the proper distance apart.
• Looking for such sequences is well-suited to a decision tree, a series of
steps that the sequence must pass.
• In addition, a score is kept, rating how well the sequence passed each step.
This allows a more stringent analysis later on, to eliminate false positives.
tRNAscan Decision Tree
• tRNAscan is estimated to have an error rate of 1 in 3 million bases.
• This is very suitable for prokaryotes, whose genomes are approximately
this size.
Prokaryotic Genes
• Gene finding in prokaryotes is relatively simple compared to eukaryotes:
– no introns, so all genes are in open reading frames starting at a
start codon and ending at a stop codon
– most of the DNA is involved in coding for proteins.
• Thus, you can achieve 100% sensitivity, if you don’t mind false
positives, by simply listing all possible ORFs above a certain size.
– There is a problem in that it is not clear how many short ORFs (say
less than 100 bp) are real genes.
• If you compare predicted genes with actual genes, you can
classify each base according to whether it is:
– true positive: predicted correctly to be in a gene
– true negative: predicted correctly to not be in a gene
– false positive: predicted to be in a gene but actually not
– false negative: predicted to not be in a gene but actually is within a
gene.
• The sensitivity (Sn) of a prediction is the fraction of bases in
real genes that are predicted to be within genes.
• The specificity (Sp) is the fraction of bases predicted to be in a
gene that actually are. (Outside gene finding, this quantity is usually
called precision.)
• Both of these parameters need to be optimized.
Sn = TP / (TP + FN)
Sp = TP / (TP + FP)
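A minimal sketch of the per-base bookkeeping behind Sn and Sp. The input format is an assumption made for the example: two equal-length strings where "G" marks a base inside a gene and "-" marks a base outside one.

```python
def sensitivity_specificity(real, predicted):
    """Per-base Sn and Sp from two equal-length annotation strings.

    'G' = base is in a gene, '-' = base is not in a gene (assumed encoding).
    """
    tp = fp = fn = tn = 0
    for r, p in zip(real, predicted):
        if r == "G" and p == "G":
            tp += 1          # predicted correctly to be in a gene
        elif r == "-" and p == "-":
            tn += 1          # predicted correctly to not be in a gene
        elif r == "-" and p == "G":
            fp += 1          # predicted in a gene but actually not
        else:
            fn += 1          # predicted not in a gene but actually is
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp


real      = "GGGGGG----"
predicted = "GGG---GG--"
sn, sp = sensitivity_specificity(real, predicted)   # Sn = 3/6, Sp = 3/5
```

Note that true negatives do not appear in either formula, which is why calling every ORF a gene can push Sn to 1.0 while Sp drops.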
General Considerations
• Bacteria use ATG as their main start codon, but GTG and TTG are also fairly
common, and a few others are occasionally used.
– Remember that start codons are also used internally: the actual start codon may not be
the first one in the ORF.
• The stop codons are the same as in eukaryotes: TGA, TAA, TAG
– stop codons are (almost) absolute: except for a few cases of programmed frameshifts
and the use of TGA for selenocysteine, the stop codon at the end of an ORF is the end
of protein translation.
• Genes can overlap by a small amount. Not much, but a few codons of overlap
is common enough so that you can’t just eliminate overlaps as impossible.
• Cross-species homology works well for many genes. It is very unlikely that
non-coding sequence will be conserved.
– But, a significant minority of genes (say 20%) are unique to a given species.
• Translation start signals (ribosome binding sites; Shine-Dalgarno sequences)
are often found just upstream from the start codon
– however, some aren’t recognizable
– genes in operons don’t always have a separate ribosome binding site for
each gene
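The "list all possible ORFs above a certain size" approach can be sketched as follows. This is a toy under stated assumptions: it scans only the three forward-strand frames, uses just ATG, GTG and TTG as starts, and always takes the first start codon after the previous stop, which (as noted above) may not be the real start. The name `find_orfs` and the `min_len` default are made up for the example.

```python
STARTS = {"ATG", "GTG", "TTG"}   # main bacterial start codons
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_len=30):
    """Return (start, end, frame) for forward-strand ORFs of at least min_len bases.

    start is the 0-based position of the first start codon in the ORF;
    end is exclusive and includes the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon in STARTS:
                start = pos                   # leftmost start = longest ORF
            elif start is not None and codon in STOPS:
                end = pos + 3
                if end - start >= min_len:
                    orfs.append((start, end, frame))
                start = None                  # look for the next ORF
    return orfs
```

A real gene finder would also scan the reverse complement and then score each candidate, since this listing by itself produces many false positives among short ORFs.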
Compositional Methods
• The frequency of various codons is different in coding regions
as compared to non-coding regions.
– This extends to G-C content, dinucleotide frequencies, and other
measures of composition. Dicodons (groups of 6 bases) are often
used
– Well documented experimentally.
• The composition varies between different proteins of course,
and it is affected within a species by the amounts of the various
tRNAs present
– horizontally transferred genes can also confuse things: they tend to
have compositions that reflect their original species.
– A second group with unusual compositions are highly expressed
genes.
GeneMark
• GeneMark uses fifth order Markov chains to examine dicodons. That is, every
base is evaluated in terms of its probability given the previous 5 bases.
– P(a|x1x2x3x4x5), the probability that the sixth base in the sequence is a given that the
bases preceding it are x1x2x3x4x5, so the final sequence is x1x2x3x4x5a.
– The necessary parameters are obtained by looking at pentamers (5-mers) within
known genes and counting the number of times each base appears in the sixth
position. This is the training set. Possible use of pseudocounts here.
• GeneMark pays attention to reading frame also. Each reading frame gets its
own set of statistics. Thus there is a separate P1(a|x1x2x3x4x5), P2(a|x1x2x3x4x5),
and P3(a|x1x2x3x4x5), where 1, 2, and 3 are the reading frames.
– Based on the position of the stop codon, each base in an ORF has a unique codon
position.
– Non-coding regions are assumed to have the same statistics for all frames.
– The final probability is given as the probability that it is coding for a specified reading
frame.
• A 96 base sliding window is moved across the genome, scoring all possible
reading frames. Start and stop codons are not accurately predicted, especially
with overlapping genes, so they need to be identified separately.
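A minimal sketch of how frame-specific fifth-order statistics could be collected from a training set. It is not GeneMark's actual code: the function names, the pseudocount of 1, and the assumption that every training gene starts at codon position 1 are all choices made for this illustration.

```python
from collections import defaultdict


def train(genes, order=5):
    """Count, per codon position, which base follows each 5-mer context."""
    counts = defaultdict(lambda: defaultdict(float))
    for gene in genes:
        for i in range(order, len(gene)):
            frame = i % 3                      # codon position of base i
            context = gene[i - order:i]        # x1x2x3x4x5
            counts[(frame, context)][gene[i]] += 1.0
    return counts


def prob(counts, frame, context, base, pseudocount=1.0):
    """P_frame(base | context), smoothed so unseen contexts are not zero."""
    seen = counts.get((frame, context), {})
    total = sum(seen.values()) + 4 * pseudocount
    return (seen.get(base, 0.0) + pseudocount) / total


counts = train(["ATGAAAATGAAA"])
p = prob(counts, 2, "ATGAA", "A")   # context seen twice, both followed by A
```

With a pseudocount of 1 on each of the 4 bases, a context seen twice and followed by A both times gives (2 + 1) / (2 + 4) = 0.5, and a context never seen gives 0.25 for every base.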
GLIMMER
• GLIMMER also uses Markov chains, but they vary from zeroth order (i.e. GC content) to eighth
order (what is the probability of a base given the previous 8 bases?).
– The point of this is to help get around the need for huge sets of training data while avoiding pseudocounts.
– Called “interpolated Markov models”
• GLIMMER selects training data from a genome sequence by picking non-overlapping long ORFs,
which are almost all genes.
– Note that high GC-content genomes need “long” defined differently than low GC genomes, since random stop
codons are rarer.
• GLIMMER builds its Markov models from the lowest order up. At each step, there must be at least
400 observations to accept the model as valid.
– If there are too few observations, the model is compared to the next-lower-order model, using a
chi-square test.
• If the new model isn’t significantly different from the lower order model, it is discarded.
• If the new model is significantly different, it is weighted based on the number of observations and the
significance level.
– For example, if there are fewer than 400 observations of x1x2x3x4x5, the P(a|x1x2x3x4x5) for each base a is
tested against P(a|x2x3x4x5) probabilities.
• After all parameters are obtained, only the highest order model is used for any given subsequence.
• Each ORF longer than a minimum is scored (as opposed to using a sliding window that ignores
ORFs)
• New versions don’t require that the “given” bases be adjacent to the base they are scored with.
ORPHEUS
• ORPHEUS uses Markov models of codon frequency, based on a set of
high-confidence (i.e. highly conserved) genes. However, ORPHEUS also looks for
ribosome binding sites.
– The score for a given codon abc is based on the frequency of this codon compared to the
frequencies of the individual bases. This score is then summed for all codons in the
training set and used as a parameter for the Markov model.
• Each ORF in the genome is scored for the correct reading frame (as set by the
stop codon) and for the other 2 forward, incorrect reading frames. If the correct
frame score exceeds the incorrect frame scores by a certain amount, this ORF is
accepted as protein-coding.
• After a good ORF is found, it is extended 5’ to find possible start codons (but only
allowing 6 bases of overlap with another gene).
• Ribosome binding sites are then defined, based on genes that have only 1 possible
start codon. Twenty bases upstream from the start sites are aligned.
– RBS are not an exact distance upstream from the start codon.
– The RBS scoring matrix derived from this is used to locate RBS for other genes.
• The search is done progressively, starting with the longest ORFs and working
towards the smaller ones. This avoids a lot of overlap problems.
GeneMark.hmm
• More Markov chains, but here, the probabilities are based on overall
length of the ORF.
– A true Markov model only considers the previous state, so this is a
semi-Markov model.
• It has been found that the length of coding regions can be modelled
with a gamma distribution, and the length of non-coding regions can be
modelled with an exponential distribution. (just empirical observations,
not based on theory).
• GeneMark.hmm changes the probability that a base is in a coding
region depending on the length of the coding region defined to that
point.
• It also looks for ribosome binding sites.
Eukaryotic Gene Prediction
• Some fundamental differences between prokaryotes and eukaryotes:
• There is lots of non-coding DNA in eukaryotes.
– First step: find repeated sequences and RNA genes
– Note that eukaryotes have 3 main RNA polymerases. RNA polymerase 2 (pol2)
transcribes all protein-coding genes, while pol1 and pol3 transcribe various
RNA-only genes.
• Most eukaryotic genes are split into exons and introns.
• Only 1 gene per transcript in eukaryotes.
• No ribosome binding sites: translation starts at the first ATG in the mRNA
– thus, in eukaryotic genomes, searching for the transcription start site
(TSS) makes sense.
• Many fewer eukaryotic genomes have been sequenced
Exons and Introns
• Size distribution of exons varies according to position in the gene. It is
also quite different between plants and animals.
• Exons are generally shorter than prokaryotic ORFs, as short as 10 bp.
– Note that the leading exon and the trailing exon always contain some
non-coding bases, and sometimes they are entirely non-coding.
– Exon-intron boundaries can occur within a codon as well as between codons.
• Introns can be incredibly long, with some human introns over 400,000 bp.
Minimum size is about 50 bp.
• Many genes have alternate splicing patterns: a sequence that is an exon in
one tissue might be an intron in another tissue.
More Exon-Intron
• Each gene has a transcription start site, but promoters and other
features are not well conserved, as compared to coding sequences.
• Splicing signals are not absolute (especially given alternative splicing),
and they also vary widely.
– In general, introns start with GT and end with AG, and have a splice
acceptor region just upstream from the end.
• There are also the relatively rare (< 1%) U12 introns, which are removed
by different spliceosomes than the usual U2 introns. The U12 introns
start with AT and end in AC.
Human on left, Arabidopsis on right
Predicting Exons and Introns
• Exon sequences can often be identified by sequence
conservation, at least roughly.
• Dicodon statistics, as were used for prokaryotes, are also useful
– eukaryotic genomes tend to contain many isochores, regions of
different GC content, and composition statistics can vary between
isochores.
• The initial and terminal exons contain untranslated regions, and
thus special methods are needed to detect them.
• Predicting splice junctions is a matter of collecting information
about the sequences surrounding each possible GT/AG pair,
then running this information through some combination of
decision tree, Markov models, discriminant analysis, or neural
networks, in an attempt to massage the data into giving a reliable
score.
– In general, sites are more likely to be correct if predicted by multiple
methods
– Experimental data from ESTs can be very helpful here.
ESTs
• Experimental information about intron/exon boundaries is mostly obtained by
analyzing expressed sequence tags (ESTs).
• EST production starts out by extracting mRNA from a specific tissue, then
reverse-transcribing it to make double-stranded cDNA, then cloning the cDNA
into a plasmid vector.
– After the clone is produced, it is sequenced from one or both ends, just a single time.
– A 5’ EST is sequenced from the 5’ end, which usually contains at least some protein-coding portion.
– A 3’ EST, sequenced from the 3’ end, is often 3’ untranslated region, which is less
conserved across species lines.
• This leads to an imperfect sequence, but BLAST can generally locate its position
in the genome exactly.
– Also, lots of redundancy in an EST library, especially with highly expressed sequences.
• ESTs provide evidence that a given sequence has been expressed.
• They also show which sequences are exons, since introns have already been
spliced out of the mRNA.
• Large numbers deposited in dbEST, part of NCBI.
– The UniGene set organizes ESTs from individual genes to remove a lot of redundancy.
Finding the Transcription Start Site
• The basic idea is to first create a model of transcription start sites based on
experimentally-determined starts, then devise ways to score sequences relative to
this model.
• Work by Bucher in 1990 produced scoring matrices for the GC box, CCAAT box,
TATA box, and RNA initiation/cap site (Inr), listed here 5’ to 3’, from furthest
upstream down to the TSS itself.
– Only vertebrates have GC and CCAAT boxes
– not all genes have recognizable TATA boxes
– the cap signal is quite short, and thus noisy.
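A minimal sketch of how a scoring matrix is built from aligned known sites and then slid along a sequence. The four "TATA-like" training sites are invented for the example and are not Bucher's data; the uniform 0.25 background and the pseudocount of 1 are also assumptions.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log-odds weight matrix: one dict of base -> score per aligned column."""
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        scores = {}
        for base in "ACGT":
            freq = (column.count(base) + pseudocount) / (len(sites) + 4 * pseudocount)
            scores[base] = math.log2(freq / background)
        pwm.append(scores)
    return pwm


def best_hit(pwm, dna):
    """Return (start, score) of the highest-scoring window in dna, or None."""
    width = len(pwm)
    best = None
    for start in range(len(dna) - width + 1):
        score = sum(pwm[i][dna[start + i]] for i in range(width))
        if best is None or score > best[1]:
            best = (start, score)
    return best


toy_sites = ["TATAAA", "TATAAA", "TATATA", "TATAAG"]   # hypothetical training set
pwm = build_pwm(toy_sites)
hit = best_hit(pwm, "GCGCTATAAAGCGC")                   # best window starts at 4
```

Short matrices like the cap signal give high scores by chance fairly often, which is the "noisy" problem mentioned above.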
Eukaryotic ab initio Gene Prediction
• Based on hidden Markov model (HMM)
• As you move along the DNA sequence, a given nucleotide can be in an exon or an
intron or in an intergenic region.
• The oversimplified model on this slide doesn’t have the “non-gene” state
• Use a training set of known genes (from the same or closely related species) to
determine transition and emission probabilities.
Very simple HMM: each base is either in an intron or an exon, and gets
emitted with different frequencies depending on which state it is in.
GeneMark scoring of the likelihood each nucleotide is in an intron, based on HMM.
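A minimal sketch of decoding the very simple two-state model with the Viterbi algorithm. All probabilities below are invented for illustration (a GC-rich "exon" state E and an AT-rich "intron" state I); a real gene finder estimates them from a training set and uses many more states.

```python
import math


def viterbi(seq, states, start_p, trans_p, emit_p):
    """Most probable state path for a non-empty sequence (log-space Viterbi)."""
    # column[s] = best log-probability of any path that ends in state s
    column = {s: math.log(start_p[s]) + math.log(emit_p[s][seq[0]]) for s in states}
    backpointers = []
    for base in seq[1:]:
        new_column, pointer = {}, {}
        for s in states:
            prev = max(states, key=lambda p: column[p] + math.log(trans_p[p][s]))
            new_column[s] = (column[prev] + math.log(trans_p[prev][s])
                             + math.log(emit_p[s][base]))
            pointer[s] = prev
        column = new_column
        backpointers.append(pointer)
    # trace back from the best final state
    state = max(states, key=lambda s: column[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    return "".join(reversed(path))


STATES = ["E", "I"]                                   # exon, intron
START = {"E": 0.5, "I": 0.5}                          # hypothetical values
TRANS = {"E": {"E": 0.9, "I": 0.1}, "I": {"E": 0.1, "I": 0.9}}
EMIT = {"E": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}

path = viterbi("GCGCGCGCATATATAT", STATES, START, TRANS, EMIT)
```

With these toy numbers the GC-rich half is labelled E and the AT-rich half I, with a single switch, because each state change costs more than one mismatched emission gains.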
HMM Model with Intron Phases
• A more realistic model: you can move from the non-gene state (N) to either a
singleton exon (Es: a gene with just one exon) or to an initial exon (Ei).
• From Es you move back to N.
• From Ei you can move to an intron, which can be in any of 3 different phases.
– Intron and exon phases designate whether the exon/intron boundary splits a
codon: in phase 0 the boundary is between codons; phase 1 splits the codon
between the first and second bases, and phase 2 splits the codon between the
second and third bases.
– Also, exon/intron boundaries don’t split stop codons, which necessitates the
I1T etc. intron states.
• Then back and forth between introns and exons, until you reach a terminal
exon (Et), then back to the intergenic state (N).
• SNAP: Korf (2004) BMC Bioinformatics 5:59.
A more realistic model from SNAP
Codon Bias within Exons
• Depending on the GC content of the organism as well as other, less well
defined characteristics, the frequency with which different synonymous codons
are used can vary widely. This makes it necessary to train the HMM gene
finder with a set of genes from the same or a closely related species.
At: Arabidopsis thaliana; Ce: Caenorhabditis elegans; Dm: Drosophila
melanogaster; Os: Oryza sativa
Exon/Intron Boundaries and Start Codons
• Gene finders use HMMs that look for signals in the DNA by applying a “weight
matrix” to each nucleotide based on the nucleotides around it. Thus, the HMM
is considering more than just the immediately preceding nucleotide.
Sequence logos around (b) the intron splice donor site (usually GT) and (c) the
ATG translation start codon, in four well-studied eukaryotes.
Some Results with SNAP
• Here, sensitivity (SN) and specificity (SP) are listed for:
1. Whether a given nucleotide is contained in an exon
2. Whether a given predicted exon has exactly the same boundaries as
a real exon
3. Whether a given gene has exactly the same intron/exon structure
and boundaries as the actual gene.
Discriminant Analysis
• Scoring sequences for the presence of eukaryotic promoters uses several
techniques, including hidden Markov models, neural networks (which we will
discuss later), and other scoring schemes.
• Discriminant analysis is a statistical technique for combining scores from
several different parameters and drawing a line that discriminates between
“good” and “bad”.
• Each factor is considered an independent dimension on a multi-dimensional plot.
– As opposed to just adding up scores for each factor
– or using individual scores as part of yes/no decisions
• Each sequence from a training set is plotted, knowing in advance which
sequences are genuine promoters and which are not.
• Using a least-squares fitting method, draw the line (a hyperplane really) that
best separates the two groups.
– This is linear discriminant analysis
• Quadratic discriminant analysis draws a parabola instead. Sometimes this
works better.
Several factors used to score promoter sequences. This is part of a neural
network model, but the factors are common to many programs.
Discriminant Analysis
• Illustrated here for 2 factors, but of course there can be many more.
• The quadratic discriminant works much better in this case.
• The position of each sequence in a scan of a region can be scored
according to where it falls on the plot.
• Support Vector Machines (SVM) are a fancier way of doing this: they can
generate a much more complex curve than a hyperplane to separate the
groups.
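A minimal sketch of a linear discriminant for 2 factors. It uses Fisher's linear discriminant, which is one standard way to get the separating line; whether it matches the exact least-squares fit used by any particular promoter program is an assumption. The training points are invented, and the code needs the two factors not to be perfectly correlated (otherwise `det` is zero).

```python
def mean(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def scatter(points, m):
    """Within-class scatter matrix entries (sxx, sxy, syy) around mean m."""
    sxx = sum((p[0] - m[0]) ** 2 for p in points)
    sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
    syy = sum((p[1] - m[1]) ** 2 for p in points)
    return sxx, sxy, syy


def fit_lda(good, bad):
    """Return (w, threshold): a point is called 'good' if w . point > threshold."""
    mg, mb = mean(good), mean(bad)
    gxx, gxy, gyy = scatter(good, mg)
    bxx, bxy, byy = scatter(bad, mb)
    sxx, sxy, syy = gxx + bxx, gxy + bxy, gyy + byy    # pooled scatter S
    det = sxx * syy - sxy * sxy
    dx, dy = mg[0] - mb[0], mg[1] - mb[1]
    # w = S^-1 (mean_good - mean_bad), using the 2x2 inverse formula
    w = ((syy * dx - sxy * dy) / det, (sxx * dy - sxy * dx) / det)
    mid = ((mg[0] + mb[0]) / 2, (mg[1] + mb[1]) / 2)
    threshold = w[0] * mid[0] + w[1] * mid[1]          # line passes through the midpoint
    return w, threshold


def classify(point, w, threshold):
    return w[0] * point[0] + w[1] * point[1] > threshold


promoters = [(3, 3), (4, 3), (3, 4), (4, 4)]       # hypothetical factor scores
non_promoters = [(0, 0), (1, 0), (0, 1), (1, 1)]
w, threshold = fit_lda(promoters, non_promoters)
```

A quadratic discriminant would give each class its own scatter matrix instead of pooling them, which is what bends the boundary.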
Annotation
• Once genes have been identified, we need to assign them names
and functions.
– In well-studied genomes, such as Drosophila, there are many already-named
genes, some of which are quite whimsical. They often reflect the
mutant phenotype, e.g. white eyes. A mutant whose wings are held at an
unusual angle: Frodo (“lowered of the wings”).
– But in general, gene names from genome projects tend to be descriptions
of function. For example, the gene for glucose 6-phosphate
dehydrogenase is just called that in bacteria, but it is “Zwischenferment”
(a German word) in Drosophila.
• Who is going to do the annotations? There are a lot of genes, and no
one is an expert in all of them.
– One approach: use amateurs who are trained to follow certain guidelines
and have easy access to as much useful information as possible.
Problem: inconsistent results
– Another approach: have experts in specific genes annotate all examples
of that gene. Problem: getting experts for all genes and keeping them
interested.
– Yet another: do as much automated annotation as possible, with trained
personnel examining only the hard cases. Problem: identifying the hard
cases.
More Annotation
• Need for experimental evidence. All gene identification is based on
experimental work: biochemistry, genetics, etc.
– Most annotation is thus based on logic like “Gene X in my organism is similar to
gene Y in another organism that has been experimentally determined to have
such-and-such a function.”
– How similar is “similar”? Are there other functions that might use similar
proteins?
• Gene function predictions vary in their reliability: how well does the current
gene match previously discovered genes?
• We need gene names that are computer-recognizable. This means using a
controlled vocabulary: only certain words and punctuation are used, and
standard genes are named the same way in all organisms.
– Gene Ontology descriptions are useful, but they are not detailed
enough, and they tend to focus on human genes at the expense of
bacterial genes.
– Enzyme Commission (E.C.) numbers are very useful for enzymes
because they describe a function precisely.
– Otherwise, you either follow the conventions of the group you are
working with or try to mimic the best BLAST hits
Confidence in Name Assignment
• The basic hierarchy:
– Confident assignment. We are almost certain we know what this gene
is, based on its similarity to other genes.
• If all of the top hits are high quality (say, better than 35% identical amino acids and
within 20% of the same length), and they all have similar names, a gene name can be
confidently given.
– Some uncertainty, often with regard to exact enzyme or transporter
specificity. Names are often called “putative” here.
– We know it belongs to a gene family, or it contains a known domain, but
function is unclear
– Conserved hypothetical genes. Found in other species, but with no
known function.
– Hypothetical genes. The gene caller predicts a gene, but there is no
match to any gene in another species.
• But in fact these ideas are only loosely applied across many
different annotation systems, and it is common to find highly similar
genes given slightly different names.
– Also, sometimes “hypothetical” is used too freely. It is always correct to
call a gene hypothetical, but it doesn’t convey any useful information.
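The "confident assignment" rule of thumb can be written down as a small check. This is only a sketch of the rule as stated on the slide: the hit format (name, percent identity, subject length), the choice to measure the 20% against the query length, and reducing "similar names" to "identical apart from case" are all assumptions for the example.

```python
def confident_name(query_length, hits, min_identity=35.0, max_length_diff=0.20):
    """Return an agreed name if every top hit is high quality, else None.

    hits: list of (name, percent_identity, subject_length) tuples,
    a hypothetical summary of the top BLAST hits.
    """
    if not hits:
        return None
    names = set()
    for name, identity, length in hits:
        if identity <= min_identity:
            return None                        # hit too weak
        if abs(length - query_length) > max_length_diff * query_length:
            return None                        # lengths differ too much
        names.add(name.strip().lower())
    return names.pop() if len(names) == 1 else None


top_hits = [("Enolase", 62.0, 430), ("enolase", 48.5, 425)]
name = confident_name(432, top_hits)
```

Anything that fails the check would fall into one of the lower tiers (putative, family only, conserved hypothetical, hypothetical), which need other evidence to tell apart.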
Gene Ontology
• One of my “rules of biology” I tell the introductory students is that quite often there is
more than one word used to describe the same phenomenon, and the same word is
often used to describe completely different phenomena
– The citric acid cycle is also the tricarboxylic acid cycle and the Krebs cycle
– “nucleus” of a cell and an atom
• The Gene Ontology (GO) consortium (http://www.geneontology.org/) is an attempt
to describe gene products with a structured controlled vocabulary, a set of invariant
terms that have a known relationship to each other.
• Each GO term is given a number of the form GO:nnnnnnn (7 digits), as well as a term name. For
example, GO:0005102 is “receptor binding”.
• There are 3 root terms: biological process, cellular component, and molecular function.
A gene product will probably be described by GO terms from each of these
“ontologies”. (Ontology is a branch of philosophy concerned with the nature of being,
and the basic categories of being and their relationships.)
– For instance, cytochrome c is described with the molecular function term “oxidoreductase
activity”, the biological process terms “oxidative phosphorylation” and “induction of cell
death”, and the cellular component terms “mitochondrial matrix” and “mitochondrial inner
membrane”
• The terms are arranged in a hierarchy that is a “directed acyclic graph” and not a
tree. This means simply that each term can have more than one parent term, but the
direction of parent to child (i.e. less specific to more specific) is always maintained.
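A minimal sketch of what "directed acyclic graph, not a tree" means in practice. The term names below are placeholders, not real GO terms, and the dictionary layout is just one convenient way to store child-to-parent links.

```python
def ancestors(term, parents):
    """All less-specific terms reachable from term by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:      # a term reached by two routes is kept once
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical mini-ontology: "term D" has two parents, so this is not a tree.
PARENTS = {
    "term D": ["term B", "term C"],
    "term B": ["root"],
    "term C": ["root"],
}

all_parents = ancestors("term D", PARENTS)   # {"term B", "term C", "root"}
```

Annotating a gene product with a specific term implicitly annotates it with every ancestor, which is why picking an over-specific term propagates an error upward along more than one branch.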
More GO
• Cellular component describes what larger structure the gene product is part of. For
example, “ribosome” or “endoplasmic reticulum” or “cytoplasm”.
• Molecular function describes activities, such as catalytic or binding activities, that
occur at the molecular level. They are always described as activities: the enzyme
adenylate cyclase is given the term “adenylate cyclase activity”
• Biological process describes the higher level activity that the molecular function
contributes to. For example, “signal transduction” or “mannose transport”
• GO doesn’t go above the level of the cell, and it doesn’t deal with cell types.
– It also doesn’t describe disease states or abnormal functions (cancer, for example).
– It also doesn’t describe individual protein domains or gene structure.
• Terms range in a hierarchy from very specific to very general. During annotation, the
trick is to find terms that are as specific as possible without over-interpreting the data.
This can be tricky with unfamiliar gene functions.
• My opinion is that GO is a great tool, but hard to do well. And, it doesn’t quite get
down to the level of exactly what the gene does. We really do want to name a gene
“cytochrome c”, and not just use GO terms as descriptions.
Enzyme Nomenclature
• Enzyme functions: which reactants are converted to which products
– Across many species, the enzymes that perform a specific function are usually
evolutionarily related. However, this isn’t necessarily true. There are cases of two
entirely different enzymes evolving similar functions.
– Often, two or more gene products in a genome will have the same E.C. number.
• Enzyme functions are given unique numbers by the Enzyme Commission.
– E.C. numbers are four integers separated by dots. The left-most number is the
least specific
– For example, the tripeptide aminopeptidases have the code "EC 3.4.11.4", whose
components indicate the following groups of enzymes:
• EC 3 enzymes are hydrolases (enzymes that use water to break up some other molecule)
• EC 3.4 are hydrolases that act on peptide bonds
• EC 3.4.11 are those hydrolases that cleave off the amino-terminal amino acid from a
polypeptide
• EC 3.4.11.4 are those that cleave off the amino-terminal end from a tripeptide
• Top level E.C. numbers:
– E.C. 1: oxidoreductases (often dehydrogenases): electron transfer
– E.C. 2: transferases: transfer of functional groups (e.g. phosphate) between
molecules.
– E.C. 3: hydrolases: splitting a molecule by adding water to a bond.
– E.C. 4: lyases: non-hydrolytic addition or removal of groups from a molecule
– E.C. 5: isomerases: rearrangements of atoms within a molecule
– E.C. 6: ligases: joining two molecules using energy from ATP
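Because E.C. numbers run from least specific on the left to most specific on the right, comparing two enzymes is a matter of comparing prefixes. A minimal sketch follows; the helper names are made up, and the parser only handles strings written like "EC 3.4.11.4" or "3.4.11.4".

```python
EC_CLASSES = {
    1: "oxidoreductases",
    2: "transferases",
    3: "hydrolases",
    4: "lyases",
    5: "isomerases",
    6: "ligases",
}


def ec_levels(ec):
    """'EC 3.4.11.4' -> ['3', '3.4', '3.4.11', '3.4.11.4']."""
    parts = ec.replace("EC", "").strip().split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


def ec_class(ec):
    """Name of the top-level class for an E.C. number."""
    return EC_CLASSES[int(ec_levels(ec)[0])]


def shared_levels(a, b):
    """How many leading levels two E.C. numbers have in common (0 to 4)."""
    count = 0
    for x, y in zip(ec_levels(a), ec_levels(b)):
        if x != y:
            break
        count += 1
    return count


levels = ec_levels("EC 3.4.11.4")
kind = ec_class("EC 3.4.11.4")            # "hydrolases"
```

Two gene products that share all four levels catalyze the same reaction by this classification, even if (as noted above) they are not evolutionarily related.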
Information Used in Annotation
• BLAST searches
• HMM models of specific genes or gene families (Pfam, TIGRfam,
FIGfam).
• Sequence motifs and domains. If the gene is not a good match to
previously known genes, these provide useful clues.
• Cellular location predictions, especially for transmembrane proteins.
• Genomic neighbors, especially in bacteria, where related functions
are often found together in operons and divergons (genes
transcribed in opposite directions that use a common control region).
• Biochemical pathway/subsystem information. If an organism has
most of the genes needed to perform a function, any missing
functions are probably present too.
– Also, experimental data about an organism’s capacities can be used to
decide whether the relevant functions are present in the genome.
Transmembrane Predictions
• Integral membrane proteins contain amino acid sequences that go through the
membrane one or several times.
– There are also peripheral membrane proteins that stick to the hydrophilic
head groups by ionic and polar interactions
– There are also some that have covalently bound hydrophobic groups, such as
myristate, a 14-carbon saturated fatty acid that is attached to the N-terminal
amino group.
• There are 2 main protein structures that cross membranes.
– Most are alpha helices, and in proteins that span
multiple times, these alpha helices are packed together
in a coiled-coil. Length = 15-30 amino acids.
– Less commonly, there are proteins with membrane
spanning “beta barrels”, composed of beta sheets
wrapped into a cylinder. An example: porins, which let small hydrophilic
molecules diffuse across the bacterial outer membrane.
Hydrophobicity and Amphipathy
• Membrane interiors are hydrophobic, so the simplest way of finding
membrane-spanning regions is to look for relatively hydrophobic regions.
• There are several measures of amino acid hydrophobicity available,
based on partitioning in water vs. solvent or on crystallography of
membrane proteins. No one scale dominates prediction models.
• However, beta barrels and coiled-coils of alpha helices have interior
regions that don’t need to be hydrophobic because they don’t interact
with the hydrophobic fatty acid chains of the membrane.
– Thus, many membrane-spanning regions are amphipathic: they
have a hydrophobic side and a hydrophilic side.
– The helical wheel is a simple way of visualizing this. It is a view
looking down the helix. If most of the hydrophobic residues fall on
one side, the sequence is likely to be membrane-spanning.
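A minimal sketch of the "look for relatively hydrophobic regions" approach using a sliding window. The Kyte-Doolittle scale is used here only because it is widely known; it is one of the several scales mentioned above, not a recommendation. The 19-residue window and 1.6 cutoff are commonly quoted settings for that scale and should be treated as adjustable assumptions.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic)
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=19):
    """Mean hydropathy of every window of the given length."""
    return [sum(KD[aa] for aa in protein[i:i + window]) / window
            for i in range(len(protein) - window + 1)]


def candidate_windows(protein, window=19, cutoff=1.6):
    """Start positions of windows hydrophobic enough to span a membrane."""
    profile = hydropathy_profile(protein, window)
    return [i for i, score in enumerate(profile) if score >= cutoff]


# Toy protein: a stretch of 20 leucines between charged flanks
toy = "D" * 10 + "L" * 20 + "K" * 10
starts = candidate_windows(toy)
```

A plain average like this misses amphipathic segments, since their hydrophilic face pulls the window mean down; that is the limitation the helical wheel is meant to expose.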
HMM Prediction of Transmembrane Regions
• Hidden Markov models seem to do a good job predicting transmembrane regions.
• The states are: loops inside the cell, loops outside the cell, and
transmembrane regions.
– In addition, the cap amino acids (at the membrane/aqueous interface) can be
a state, and it is possible to model globular domains either inside or
outside the cell.
• The HMM is circular, allowing for multiple passes through the membrane.
– Many of the states allow transition back to themselves: there is more than
one amino acid in the membrane interior, for example.
• The model is parameterized using known membrane proteins (from X-ray
crystallography).
• The model pictured here is TMHMM.
Biochemical Pathways and Co-localization
• Operon structure is often maintained over fairly large taxonomic regions.
– Sometimes gene order is altered, and sometimes one or more enzymes are
missing.
– But in general, this phenomenon allows recognition or verification that
widely diverged enzymes do in fact have the same function.
• This is an operon that contains part of the glycolytic pathway.
– 1: phosphoglycerate mutase
– 2: triosephosphate isomerase
– 3: enolase
– 4: phosphoglycerate kinase
– 5: glyceraldehyde 3-phosphate dehydrogenase
– 6: central glycolytic gene regulator
Alternate pathways
• There are often alternate ways of going through a pathway.
– Often dependent on taxonomic group (but beware of horizontal gene transfer).
– Reversible pathways often have irreversible steps that need alternate
enzymes to get around. And some species will only have the pathway
functioning in one direction.
• This pathway is glycolysis and gluconeogenesis in Bacillus megaterium. The
colored boxes indicate enzymes that are present.
– Both glycolysis and gluconeogenesis are present
– Several alternative enzymes are not found here.
BIOLOG
• BIOLOG is a company that performs batteries of tests on
bacteria. The idea is to develop a complete metabolic
profile for the organism.
– They are grown in microtiter plates with standard growth media
supplemented or substituted with various possible nutrients or
growth inhibitors. For example, carbon sources, nitrogen
sources, phosphate sources, various osmotic strengths and pHs
– Growth is checked over several days
– Strain comparison or individual data
• The yellow triangles in each well position are growth
curves.
– Red = strain A grew better than strain B; green is the opposite
– Outlined boxes were significant by the company’s standards
BIOLOG Results