Abstracts
Briefings in Bioinformatics aims to provide
working biologists with an awareness and
understanding of the computational
approaches available for research and
discovery. The Abstracts section of the
journal consists of summaries of
bioinformatics manuscripts published in
the previous quarter. Inclusion of an
article in this section indicates that the
editors consider it to be among the most
interesting and/or useful contributions to
the field for the quarter covered. The
contents of these reports are briefly
distilled for the readers with an emphasis
placed on their potential utility.
Publications in the areas of genome
evolution and biological networks from
the fourth quarter of 2003 (October–
December) are reviewed here.
GENOME EVOLUTION
An evolutionary analysis of
orphan genes in Drosophila
Tomislav Domazet-Loso and Diethard
Tautz
Genome Research (2003) Vol. 13,
pp. 2213–2219
Once the sequence of a genome has been
characterised, the functions that
correspond to the genes that it encodes
must be assigned. In the vast majority of
cases, this is done by information transfer:
the process of computationally
extrapolating experimental information
from one system to another based on
sequence similarity (and thus evolutionary
relatedness) between encoded proteins.
Unfortunately, for any genome a
substantial fraction of genes exists for
which there is no other sequence with
detectable similarity – these are the so-called orphan genes. It was originally
thought the number of such genes should
be continuously reduced as sequence
databases increase in size, but this has not
happened. Thus, the persistence of
orphan genes far into the age of genomics
is an evolutionary enigma. Domazet-Loso
and Tautz take the genome sequence of
Drosophila melanogaster as a model system,
and they scrutinise its orphans to try to
shed light on the role of these often
neglected genes. The Drosophila genome
consists of 26–29 per cent orphan genes,
and the authors demonstrate that this
fraction does not seem to be changing
even as sequence databases continue to
grow. Drosophila orphan gene sequences
were compared with sequences expressed
in the closely related species D. yakuba in
order to assess their evolutionary
characteristics. Not surprisingly, it was
found that orphan genes evolve twice as
fast on average as non-orphan genes.
However, the range of evolutionary rates
was about the same for these two classes
of Drosophila genes. Thus there are some
orphan genes that do have very low
substitution rates, and it is proposed that
these anomalous orphans may be
particularly prone to encode lineage-specific adaptive traits. A model for how
orphan genes may relate to adaptation is
proposed, and it is suggested that this class
of genes may be most important for
evolutionary divergence over relatively
short time-scales.
Gene loss, protein sequence
divergence, gene dispensability,
expression level, and
interactivity are correlated in
eukaryotic evolution
Dmitri M. Krylov, Yuri I. Wolf, Igor
B. Rogozin and Eugene V. Koonin
Genome Research (2003) Vol. 13,
pp. 2229–2235
Large-scale sequence comparisons
between complete genomes have
© HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 5. NO 1. 75–81. MARCH 2004
contributed much to the understanding of
the evolutionary process at the most
fundamental level. Among the greatest
surprises that have resulted from the
application of genomic technology to the
study of evolution is the extent to which
genomes have been shaped by lineage-specific gene loss. Massive gene loss can
occur rapidly and this process of loss alone
appears to account for the majority of the
differences in gene repertoires among
eukaryotic genomes. Using a database of
evolutionarily related protein sequences
(the clusters of eukaryotic orthologous
groups – KOGs – database), Eugene
Koonin and colleagues have devised a
parsimony-based algorithm that
characterises and quantifies gene loss
among seven complete eukaryotic
genomes. The extent of gene loss for any
group of related proteins, quantified as a
numerical propensity for gene loss (PGL)
value, was compared with a number of
different parameters that characterise the
genes/proteins in the orthologous groups.
These are the level of protein sequence
divergence, the fitness effect of gene
knock-outs, the number of protein–
protein interactions and gene expression
levels. Not surprisingly, PGL
values were significantly correlated with
all of these factors. Genes that are less
likely to be lost have, on average, lower
levels of sequence divergence, greater
effects on fitness, more protein–protein
interactions and higher levels of gene
expression. One particularly interesting
result is the finding that PGL values are
more strongly correlated with these biological
characteristics of proteins than are the
levels of sequence divergence.
Apparently, the biological importance of a
gene is better predicted by its propensity
to be lost than by its rate of evolution.
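The idea of counting gene losses by parsimony can be sketched in a few lines. The example below is a minimal illustration only, assuming a Dollo-style rule (a gene is gained once, at the last common ancestor of the species that carry it, and can afterwards only be lost). The tree, the species labels and the function names are hypothetical, and the output is a raw loss count, not the PGL value defined by the authors.

```python
def leaves(tree):
    """Return the set of species names under a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    left, right = tree
    return leaves(left) | leaves(right)


def count_losses(tree, present):
    """Count loss events below a node that is assumed to carry the gene."""
    if not (leaves(tree) & present):
        # The whole subtree lacks the gene: one loss on the branch leading to it.
        return 1
    if isinstance(tree, str):
        return 0
    left, right = tree
    return count_losses(left, present) + count_losses(right, present)


def dollo_losses(tree, present):
    """Minimum number of losses under a single-gain (Dollo-style) assumption."""
    if not present:
        raise ValueError("the gene must be present in at least one species")
    node = tree
    # Walk down to the last common ancestor of all species carrying the gene.
    while not isinstance(node, str):
        left, right = node
        in_left = bool(leaves(left) & present)
        in_right = bool(leaves(right) & present)
        if in_left and in_right:
            break
        node = left if in_left else right
    return count_losses(node, present)


# Hypothetical five-species tree; the labels are placeholders, not real taxa.
toy_tree = (("A", "B"), ("C", ("D", "E")))
print(dollo_losses(toy_tree, {"A", "C", "D"}))
```

Applying such a count to every orthologous group gives one number per group, which could then be compared with expression level, interaction number and so on, in the spirit of the analysis summarised above.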
The signature of selection
mediated by expression on
human genes
Araxi O. Urrutia and Laurence D.
Hurst
Genome Research (2003) Vol. 13,
pp. 2260–2264
One of the great opportunities that the
study of genomics affords is the realisation
of how the effects of natural selection are
manifest at the level of the genome.
Comparisons of protein coding sequences
reveal the action of natural selection and
consideration of these data with respect to
other biological parameters suggests
factors that influence the action of natural
selection. For instance, different aspects
of the efficiency and level of
protein synthesis are known to affect the
propensity of selection to constrain the
evolution of gene sequences. Specifically,
more highly expressed genes tend to be
smaller and have more codon and amino
acid biases than less highly expressed genes.
However, these observations have been
made primarily for organisms with large
population sizes, such as unicellular
organisms and some invertebrates, and
this is thought to be related to the fact
that natural selection is more effective in
larger populations. In this report, the
authors demonstrate that similar effects of
expression-mediated natural
selection can be seen for human genes.
Their results are based on an extensive
computational study that combined
sequence analysis of human genes with
the analysis of a number of publicly
available large-scale gene expression data
sets. Highly expressed human genes are
shown to have a lower overall intron
content and higher codon bias and to
encode proteins that are smaller and have
higher amino acid biases than do less
highly expressed genes. These observations can
be considered to be unexpected because
humans, as well as related primates and
presumably their common ancestors, have
relatively small population sizes and so
selection on their genomes is expected to
be relatively weak. Thus, the selective
pressure to maximise the efficiency of
protein synthesis appears to be substantial
even for species with small population sizes.
The origins of genome
complexity
Michael Lynch and John S. Conery
Science (2003) Vol. 302, pp. 1401–
1404
The similarity of life at the molecular
level allows for meaningful genome
comparisons between vastly different
organisms that belong to deeply diverging
evolutionary lineages. One question that
can be addressed using such comparisons
relates to the genomic basis of the
tremendous differences in complexity
between unicellular and multicellular
organisms. A seemingly obvious result
that bears on this question is the
observation that the genomes of
multicellular eukaryotic organisms are far
more complex than those of the distantly
related unicellular (both prokaryotic and
eukaryotic) forms. However, closer
inspection of these differences reveals a
conundrum. The genomic differences
between simple unicellular and more
complex multicellular organisms are
accounted for far less by differences in
gene number than by differences in the
quantities and varieties of non-gene
coding sequences such as introns and
transposable elements. Thus, the
connection between genotypic and
phenotypic complexity is non-trivial.
Lynch and Conery explore this mystery
and propose a specific model by which
the non-adaptive accumulation of genetic
material allowed for the secondary
evolution of the complex traits that
characterise multicellular life forms. This
inference relies on the fact that the
efficacy of natural selection grows with
increasing population sizes. The authors
use a clever application of sequence
analysis to demonstrate that prokaryotes,
followed by unicellular eukaryotes, do
indeed have much greater population
sizes than multicellular eukaryotes. From
this it is surmised that the genomes of
multicellular organisms will accumulate
more genetic material, by duplication and
transposition, because of the reduced
power of natural selection (ie due to
genetic drift). The authors go on to show
that multicellular organisms do retain
gene duplicates for longer, contain more
and longer introns and accumulate far
more transposons than do unicellular
organisms. Once these genomic features
were established in the permissive
genomic environment that is associated
with small population sizes, they probably
served as building blocks for the evolution
of genomic and consequently phenotypic
complexity.
Apparent dependence of
protein evolutionary rate on
number of interactions is linked
to biases in protein–protein
interactions data sets
Jesse D. Bloom and Christoph Adami
BMC Evolutionary Biology
(2003) Vol. 3, p. 21
Staggeringly successful attempts to
characterise the sequences of complete
genomes have been followed closely by
even more ambitious efforts at
characterising the functional properties of
encoded proteins on a genomic scale.
These two rich sources of genomic
information are being increasingly
employed together to try to clarify the
relationship between biological function
and the action of natural selection on genes.
Independent efforts in this arena by several
different investigative groups have led to
some matters of contention. A recent
example of this concerns the relationship
between the number of protein–protein
interactions and the rate of sequence
evolution. While it has been demonstrated
that proteins involved in a large number of
protein–protein interactions evolve more
slowly than those involved in fewer such
interactions, the magnitude and pattern of
this effect have been debated. The most
recent contribution to this discussion, by
Bloom and Adami, lends some critical new
insight and raises yet more questions. What
these authors have done is control for the
abundance (ie level of expression) of
proteins in comparisons between the
number of protein–protein interactions
and evolutionary rate. This control was
proposed in light of the facts that (1) some
high-throughput methods for the
characterisation of protein–protein
interactions are known to be biased
towards abundant proteins, and (2) highly
expressed genes are known to evolve
slowly. So it is not entirely surprising that
when the authors controlled for protein
abundance, the relationship between the
number of protein–protein interactions
and evolutionary rate disappeared (and was
even reversed in one case). However, this
mitigating factor had not been previously
considered, and the authors take it to
indicate that the relationship between
evolutionary rate and protein interaction
number is purely artefactual. It may be the
case, though, that abundant proteins really
are involved in more protein–protein
interactions, and so the correlation
between rate and interactions is real,
although not necessarily indicative of
causation.
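Controlling for a third variable, as described above, can be illustrated with a first-order partial correlation. This is a sketch under stated assumptions: partial correlation is one standard way to remove the effect of a confounder and is not necessarily the exact procedure used by the authors. The numbers are invented toy values, constructed so that interaction number and evolutionary rate are related only through abundance.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def partial_correlation(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    r_xy = pearson(x, y)
    r_xz = pearson(x, z)
    r_yz = pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Invented toy data (not from the paper): one value per protein.
abundance = [1, 2, 3, 4, 5, 6, 7, 8]
interactions = [2, 1, 2, 5, 6, 5, 6, 9]   # rises with abundance
rate = [10, 9, 6, 5, 4, 3, 4, 3]          # falls with abundance

raw = pearson(interactions, rate)
controlled = partial_correlation(interactions, rate, abundance)
print(f"raw correlation: {raw:.2f}")
print(f"controlling for abundance: {controlled:.2f}")
```

In this toy case the raw correlation between interaction number and rate is strongly negative, yet it vanishes once abundance is held fixed. As the summary notes, a vanishing partial correlation does not by itself decide whether the raw relationship is an artefact or a real but indirect association.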
BIOLOGICAL NETWORKS
Evolutionary conservation of
motif constituents in the yeast
protein interaction network
Stephan Wuchty, Zoltán N.
Oltvai and Albert-László Barabási
Nature Genetics (2003) Vol. 35, pp.
176–179
Cellular functions are carried out by
interacting proteins that can be considered
to be related by a network. Over the last
several years, such biological networks
have been studied in substantial detail,
with the emphasis being placed on the
networks’ topological properties.
Consideration of the function and
evolution of the network components – ie
proteins and protein complexes – has been
largely absent from this field of inquiry.
Wuchty et al. add an important dimension
to the study of biological networks by
considering the evolutionary trajectories
of proteins that are organised into cohesive
interaction patterns (motifs). A database of
Saccharomyces cerevisiae protein–protein
interactions was used to identify motifs,
topologically distinct interaction patterns,
made up of two to five proteins. The yeast
proteins that make up the motifs were then
considered with respect to their level of
evolutionary conservation; specifically, for
any motif, the fraction of proteins that
have an orthologue present in each of five
other eukaryotes studied was determined.
Proteins that belong to specific topological
motifs are more conserved across species
than those that are not found in such
motifs. Furthermore, the proteins found in
motifs that have fewer and less connected
proteins are less conserved than proteins
found in larger, more connected motifs.
Different kinds of motifs were found to be
preferentially associated with specific
cellular functions, and the rate of evolution
of proteins in specific motifs is related to
their functional role. Taken together,
these results suggest that protein
interaction motifs represent coherent
modules of proteins that are conserved
together over evolutionary time by virtue
of the shared function that they perform.
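The kind of measurement described above can be sketched for the simplest case, a fully connected three-protein motif (a triangle). The code below is an illustration only: the edge list, the protein names and the set of proteins with an orthologue are hypothetical, and the study itself considered motifs of two to five proteins across five other eukaryotes.

```python
from itertools import combinations


def triangles(edges):
    """Find all fully connected triples in an undirected interaction list."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    found = set()
    for a, b, c in combinations(sorted(neighbours), 3):
        if b in neighbours[a] and c in neighbours[a] and c in neighbours[b]:
            found.add((a, b, c))
    return found


def conserved_fraction(motifs, has_orthologue):
    """Fraction of motifs in which every member protein has an orthologue."""
    if not motifs:
        return 0.0
    fully_conserved = sum(
        1 for motif in motifs if all(p in has_orthologue for p in motif)
    )
    return fully_conserved / len(motifs)


# Hypothetical toy network and orthologue assignments.
toy_edges = [("P1", "P2"), ("P2", "P3"), ("P1", "P3"),
             ("P3", "P4"), ("P4", "P5"), ("P3", "P5")]
toy_orthologues = {"P1", "P2", "P3", "P4"}

motifs = triangles(toy_edges)
print(sorted(motifs))
print(conserved_fraction(motifs, toy_orthologues))
```

A real analysis would also need a baseline, for example the fraction expected if orthologues were assigned to proteins at random, before concluding that motif members are preferentially conserved.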
Protein complexes and
functional modules in
molecular networks
Victor Spirin and Leonid A. Mirny
Proceedings of the National Academy of
Sciences USA (2003) Vol. 100, pp.
12123–12128
The application of high-throughput
experimental techniques to the study of
functional genomics has led to the
enumeration of a number of different
types of cellular networks. Perhaps the
most widely studied networks of this kind
are made up of proteins (nodes) that are
connected (linked) to other proteins by
virtue of the physical interactions between
them. Numerous studies of such protein–
protein interaction networks have focused
on their overall architecture by analysing
properties such as the degree distributions
(ie number of links per node) and
clustering coefficients. Spirin and Mirny
also study protein–protein interaction
networks, but instead of focusing on the
large-scale properties of the network, they
examine relatively small clusters of 5–25
proteins that have many more
connections between one another than
with the rest of the network. They reason
that these clusters (motifs) represent the
most biologically relevant assemblages of
proteins, for example those involved in
processes such as signal transduction,
transcription and translation. Several
algorithms were developed for the
identification of protein clusters and
applied to a network of yeast protein–
protein interactions. More than 50
protein clusters were identified in this
way, and the identities of the proteins in
the clusters were considered with respect
to their annotation and relevant
experimental data. Two types of cellular
modules were discovered using this
approach: protein complexes and dynamic
functional units. The members of a
protein complex interact with one
another at the same time and place and
form a single molecular machine;
examples of such protein complexes
include transcription factors and
spliceosome components. On the other
hand, dynamic functional units are made
up of proteins that participate in a specific
cellular process but do so by interacting
with one another at different times and in
different cellular locations. For example,
proteins involved in signalling pathways
and cell cycle progression make up
dynamic functional modules. The authors
demonstrate that the protein clusters
identified in their study are not only
biologically germane but also highly
statistically significant and robust to noise
(spurious interactions) in the data set.
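The notion of a cluster whose members have many more connections to one another than to the rest of the network can be made concrete with two simple quantities. The sketch below assumes a common definition of density (observed internal links divided by the number of possible links) and is not presented as the authors' own algorithms; the edge list and protein names are hypothetical.

```python
def cluster_density(edges, cluster):
    """Internal links divided by the number of possible links in the cluster.

    `edges` is assumed to list each undirected interaction exactly once.
    """
    members = set(cluster)
    n = len(members)
    if n < 2:
        raise ValueError("a cluster needs at least two proteins")
    internal = sum(1 for a, b in edges if a in members and b in members)
    return internal / (n * (n - 1) / 2)


def boundary_links(edges, cluster):
    """Number of links joining a cluster member to a non-member."""
    members = set(cluster)
    return sum(1 for a, b in edges if (a in members) != (b in members))


# Hypothetical toy network.
toy_network = [("P1", "P2"), ("P1", "P3"), ("P2", "P3"),
               ("P2", "P4"), ("P3", "P4"), ("P4", "P5")]
candidate = {"P1", "P2", "P3", "P4"}

print(cluster_density(toy_network, candidate))
print(boundary_links(toy_network, candidate))
```

A candidate with high density and few boundary links fits the description given above. Judging statistical significance would additionally require comparison with randomised networks, which this sketch does not attempt.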
Bioinformatics analysis of
experimentally determined
protein complexes in the yeast
Saccharomyces cerevisiae
Zoltán Dezso, Zoltán N. Oltvai and
Albert-László Barabási
Genome Research (2003) Vol. 13,
pp. 2450–2454
Virtually all cellular functions are
performed by proteins that do not work
alone, but rather act together as
components of multi-protein complexes.
The identity of the proteins that function
together in such complexes can be
determined in large-scale mass
spectrometry experiments, and this
approach has been employed extensively
for the yeast Saccharomyces cerevisiae. The
resulting parts lists of multi-protein
complexes are of course quite useful but
tell only part of the story. The authors of
this report combine these protein
interaction data with other genome-scale
data sets reporting protein function,
expression pattern, essentiality (ie deletion
phenotype) and cellular localisation to try
to better understand the structure and
function of protein complexes. By
comparing these different parameters,
they find that the function and essentiality
of any given protein complex can be
characterised by a small core of protein
subunits. This core of proteins tends to
share similar expression patterns, belongs
to the same functional class and possesses
similar cellular localisations and deletion
phenotypes. In addition to these core
members of protein complexes there is
another group of proteins with far less
self-consistent values for each of these
parameters. It is postulated that these
more peripheral proteins may correspond
to subunits that are only transiently
attached to a complex and that some may
even represent spurious members of a
complex that have been misidentified.
The identity of the proteins in the
characteristic core of any protein complex
can be used to predict the function of the
entire complex, and this functional
annotation can be extrapolated, with
varying degrees of confidence, to other
members of the complex. This method
offers a powerful and justifiable approach
for the prediction of function for many
proteins with as yet unknown
biochemical and cellular function.
Functional modules by relating
protein interaction networks
and gene expression
Sabine Tornow and H. W. Mewes
Nucleic Acids Research (2003) Vol.
31, pp. 6283–6289
The business of the cell is carried out by
proteins that work closely together, and
the coordinated action of these proteins
can be usefully conceptualised as a
network. The nodes in these networks are
genes/proteins and the connections
between nodes can represent qualitatively
distinct interactions such as regulatory
interactions or physical associations.
Because these different types of
interactions are biologically related, the
networks that capture them are expected
to show some degree of consistency.
Tornow and Mewes emphasise that such
consistency can be taken both as a
measure of support for distinct inferences
on the coordinated action of proteins and
as an indication of the functional
relationships between proteins. In this
report, they analyse the concordance
between groups of Saccharomyces cerevisiae
proteins connected in distinct networks –
co-expression versus physical interaction
– in order to make strongly supported
inferences about their function. Towards
this end, they propose a novel statistical
technique for the analysis of the
relationships between proteins based on
the connections between them in
different networks. Specifically, what they
have done is determine the correlation in
expression for a group of genes that were
identified to cluster together physically by
protein interaction data and assess the
probability that this expression correlation
is due to chance. Their approach is
demonstrated to be superior to a simpler
method based on average correlations
between genes in networks. The method,
as articulated here, can also be applied to
any number of combinations of different
sources of functional information obtained
with high-throughput techniques. The
superposition of expression and physical
interaction networks leads to the
exposition of well-supported functional
modules such as complexes involved in
transcription and translation. This
approach should, in principle, be able to
lead to the functional annotation of
previously uncharacterised proteins.
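The question of whether the expression correlation within a physically interacting group could arise by chance can be illustrated with a permutation test. Note the assumption: the sketch below uses the mean pairwise correlation, which corresponds to the simpler baseline mentioned above and not to the authors' own statistic. The expression profiles and gene names are invented toy values.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_pairwise_correlation(genes, expression):
    """Average Pearson correlation over all gene pairs in a group."""
    pairs = list(combinations(genes, 2))
    return sum(pearson(expression[a], expression[b]) for a, b in pairs) / len(pairs)


def permutation_p_value(group, expression, n_permutations=1000, seed=0):
    """Share of random same-size gene groups at least as co-expressed as `group`."""
    rng = random.Random(seed)
    observed = mean_pairwise_correlation(group, expression)
    all_genes = sorted(expression)
    hits = 0
    for _ in range(n_permutations):
        sample = rng.sample(all_genes, len(group))
        if mean_pairwise_correlation(sample, expression) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Invented toy expression profiles (five conditions per gene).
toy_expression = {
    "g1": [1, 2, 3, 4, 5],
    "g2": [2, 4, 6, 8, 10],
    "g3": [1, 3, 2, 5, 4],
    "g4": [5, 4, 3, 2, 1],
    "g5": [3, 1, 4, 1, 5],
    "g6": [2, 2, 1, 3, 1],
}
interacting_group = ["g1", "g2"]  # hypothetical physically interacting pair

p_value = permutation_p_value(interacting_group, toy_expression)
print(f"estimated p-value: {p_value:.3f}")
```

With real data the random groups would be drawn from the whole genome, and the resulting p-values would need correction for the number of groups tested.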
Reconciling gene expression
data with known genome-scale
regulatory network structures
Markus J. Herrgård, Markus W.
Covert and Bernhard Ø. Palsson
Genome Research (2003) Vol. 13,
pp. 2423–2434
The activity and expression of the
proteins involved in cellular function are
controlled by hierarchical cascades of
regulatory interactions. Series of binary
regulatory interactions, where the
products of one gene activate or repress
the expression of another, can be
resolved into regulatory networks to
explore the mechanics of this process.
Traditionally this has been done through
the painstaking reconstruction of
networks by combining information on
individual regulatory interactions culled
from experimental information that is
represented in the literature and
databases. Now, the application of
computational analysis to genome-scale
expression data provides a novel systems-
based approach that holds the possibility
of reconstructing entire regulatory
networks in one fell swoop. For the first
time, Herrgård et al. compare these two
disparate approaches to assess how they
may differ, where they may support one
another and in what sense they can be
considered to be complementary. They
have analysed two thoroughly studied
systems, Escherichia coli and Saccharomyces
cerevisiae. Both of these organisms have
regulatory networks that have been well
resolved using the traditional interaction-by-interaction methodology as well as
copious amounts of gene expression data
gleaned from numerous large-scale
experiments. They computed the
consistency between four decomposed
elements of regulatory networks given by
each method. The consistency between
methods was found to be influenced by
both the network structure and the
function of the genes in the network.
Interestingly, gene expression data seem
to be much better at confirming
relationships between genes that are
targets of the same regulators than at
revealing interactions between regulator
genes and their targets. In addition, those
regulatory network elements that include
activators are more consistent than those
that are connected to repressors. Taken
together, their results suggest some
specific ways that large-scale gene
expression data can be used to enhance
and expand existing knowledge of gene
regulatory networks.
I. King Jordan
National Center for Biotechnology Information,
National Institutes of Health,
8600 Rockville Pike,
Bethesda, Maryland 20894, USA