
Abstracts

Briefings in Bioinformatics aims to provide working biologists with an awareness and understanding of the computational approaches available for research and discovery. The Abstracts section of the journal consists of summaries of bioinformatics manuscripts published in the previous quarter. Inclusion of an article in this section indicates that the editors consider it to be among the most interesting and/or useful contributions to the field for the quarter covered. The contents of these reports are briefly distilled for the readers with an emphasis placed on their potential utility. Publications in the areas of genome evolution and biological networks from the fourth quarter of 2003 (October–December) are reviewed here.

GENOME EVOLUTION

An evolutionary analysis of orphan genes in Drosophila
Tomislav Domazet-Loso and Diethard Tautz
Genome Research (2003) Vol. 13, pp. 2213–2219

Once the sequence of a genome has been characterised, functions must be assigned to the genes that it encodes. In the vast majority of cases, this is done by information transfer: the process of computationally extrapolating experimental information from one system to another based on sequence similarity (and thus evolutionary relatedness) between encoded proteins. Unfortunately, every genome contains a substantial fraction of genes for which no other sequence shows detectable similarity – these are the so-called orphan genes. It was originally thought that the number of such genes would fall steadily as sequence databases grew, but this has not happened. Thus, the persistence of orphan genes far into the age of genomics is an evolutionary enigma. Domazet-Loso and Tautz take the genome sequence of Drosophila melanogaster as a model system, and they scrutinise its orphans to try to shed light on the role of these often neglected genes.
Between 26 and 29 per cent of Drosophila genes are orphans, and the authors demonstrate that this fraction does not seem to be changing even as sequence databases continue to grow. Drosophila orphan gene sequences were compared with sequences expressed in the closely related species D. yakuba in order to assess their evolutionary characteristics. Not surprisingly, it was found that orphan genes evolve twice as fast on average as non-orphan genes. However, the range of evolutionary rates was about the same for these two classes of Drosophila genes. Thus there are some orphan genes that do have very low substitution rates, and it is proposed that these anomalous orphans may be particularly prone to encode lineage-specific adaptive traits. A model for how orphan genes may relate to adaptation is proposed, and it is suggested that this class of genes may be most important for evolutionary divergence over relatively short time-scales.

© HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 5. NO 1. 75–81. MARCH 2004

Gene loss, protein sequence divergence, gene dispensability, expression level, and interactivity are correlated in eukaryotic evolution
Dmitri M. Krylov, Yuri I. Wolf, Igor B. Rogozin and Eugene V. Koonin
Genome Research (2003) Vol. 13, pp. 2229–2235

Large-scale sequence comparisons between complete genomes have contributed much to the understanding of the evolutionary process at the most fundamental level. Among the greatest surprises that have resulted from the application of genomic technology to the study of evolution is the extent to which genomes have been shaped by lineage-specific gene loss. Massive gene loss can occur rapidly and this process of loss alone appears to account for the majority of the differences in gene repertoires among eukaryotic genomes.
Using a database of evolutionarily related protein sequences (the clusters of eukaryotic orthologous groups, or KOGs, database), Eugene Koonin and colleagues have devised a parsimony-based algorithm that characterises and quantifies gene loss among seven complete eukaryotic genomes. The extent of gene loss for any group of related proteins, quantified as a numerical propensity for gene loss (PGL) value, was compared with four parameters that characterise the genes/proteins in the orthologous groups: the level of protein sequence divergence, the fitness effect of gene knock-outs, the number of protein–protein interactions and the level of gene expression. Not surprisingly, PGL values were significantly correlated with all of these factors. Genes that are less likely to be lost have, on average, lower levels of sequence divergence, greater effects on fitness, more protein–protein interactions and higher levels of gene expression. One particularly interesting result is the finding that PGL levels are more strongly correlated with those biological characteristics of proteins than are the levels of sequence divergence. Apparently, the biological importance of a gene is better predicted by its propensity to be lost than by its rate of evolution.

The signature of selection mediated by expression on human genes
Araxi O. Urrutia and Laurence D. Hurst
Genome Research (2003) Vol. 13, pp. 2260–2264

One of the great opportunities that the study of genomics affords is the realisation of how the effects of natural selection are manifest at the level of the genome. Comparisons of protein coding sequences reveal the action of natural selection, and consideration of these data with respect to other biological parameters suggests factors that influence the action of natural selection.
For instance, different aspects relating to the efficiency and level of protein synthesis are known to affect the propensity of selection to constrain the evolution of gene sequences. Specifically, more highly expressed genes tend to be smaller and have more codon and amino acid biases than lower expressed genes. However, these observations have been made primarily for organisms with large population sizes, such as unicellular organisms and some invertebrates, and this is thought to be related to the fact that natural selection is more effective in larger populations. In this report, the authors demonstrate that similar effects of expression-mediated natural selection can be seen for human genes. Their results are based on an extensive computational study that combined sequence analysis of human genes with the analysis of a number of publicly available large-scale gene expression data sets. Highly expressed human genes are shown to have a lower overall intron content and higher codon bias and to encode proteins that are smaller and have higher amino acid biases than do less expressed genes. These observations can be considered to be unexpected because humans, as well as related primates and presumably their common ancestors, have relatively small population sizes and so selection on their genomes is expected to be relatively weak. Thus, the selective pressure to maximise the efficiency of protein synthesis appears to be substantial even for species with low population sizes.

The origins of genome complexity
Michael Lynch and John S. Conery
Science (2003) Vol. 302, pp. 1401–1404

The similarity of life at the molecular level allows for meaningful genome comparisons between vastly different organisms that belong to deeply diverging evolutionary lineages.
One question that can be addressed using such comparisons relates to the genomic basis of the tremendous differences in complexity between unicellular and multicellular organisms. A seemingly obvious result that bears on this question is the observation that the genomes of multicellular eukaryotic organisms are far more complex than those of the distantly related unicellular (both prokaryotic and eukaryotic) forms. However, closer inspection of these differences reveals a conundrum. The genomic differences between simple unicellular and more complex multicellular organisms are accounted for far less by differences in gene number than by differences in the quantities and varieties of non-coding sequences such as introns and transposable elements. Thus, the connection between genotypic and phenotypic complexity is non-trivial. Lynch and Conery explore this mystery and propose a specific model by which the non-adaptive accumulation of genetic material allowed for the secondary evolution of the complex traits that characterise multicellular life forms. This inference relies on the fact that the efficacy of natural selection grows with increasing population size. The authors use a clever application of sequence analysis to demonstrate that prokaryotes, followed by unicellular eukaryotes, do indeed have much greater population sizes than multicellular eukaryotes. From this it is surmised that the genomes of multicellular organisms will accumulate more genetic material, by duplication and transposition, because of the reduced power of natural selection relative to genetic drift. The authors go on to show that multicellular organisms do retain gene duplicates for longer, contain more and longer introns and accumulate far more transposons than do unicellular organisms.
Once these genomic features were established in the permissive genomic environment that is associated with small population sizes, they probably served as building blocks for the evolution of genomic and consequently phenotypic complexity.

Apparent dependence of protein evolutionary rate on number of interactions is linked to biases in protein–protein interactions data sets
Jesse D. Bloom and Christoph Adami
BMC Evolutionary Biology (2003) Vol. 3, p. 21

Staggeringly successful attempts to characterise the sequences of complete genomes have been followed closely by even more ambitious efforts at characterising the functional properties of encoded proteins on a genomic scale. These two rich sources of genomic information are being increasingly employed together to try to clarify the relationship between biological function and the action of natural selection on genes. Independent efforts in this arena by several different investigative groups have led to some matters of contention. A recent example of this concerns the relationship between the number of protein–protein interactions and the rate of sequence evolution. While it has been demonstrated that proteins involved in a large number of protein–protein interactions evolve more slowly than those involved in fewer such interactions, the magnitude and pattern of this effect have been debated. This most recent contribution to this discussion by Bloom and Adami lends some critical new insight and raises yet more questions. What these authors have done is control for the abundance (ie level of expression) of proteins in comparisons between the number of protein–protein interactions and evolutionary rate.
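One common way to control for a third variable in a comparison of this kind is a first-order partial correlation. The sketch below is only an illustration of that general idea: the summary does not state which statistic Bloom and Adami used, and the function names and the toy numbers are invented for the example, not taken from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


# Hypothetical toy data (NOT measurements): abundance drives both the
# number of detected interactions and the evolutionary rate.
abundance = [1, 2, 3, 4, 5, 6, 7, 8]
noise_a = [1, -1, -1, 1, 1, -1, -1, 1]   # unrelated to abundance
noise_b = [1, -1, 1, -1, -1, 1, -1, 1]   # unrelated to abundance and noise_a
interactions = [2 * z + e for z, e in zip(abundance, noise_a)]
rate = [10 - z + e for z, e in zip(abundance, noise_b)]

raw = pearson(interactions, rate)
controlled = partial_corr(interactions, rate, abundance)
print(f"raw correlation:        {raw:.3f}")
print(f"controlling abundance:  {controlled:.3f}")
```

In this constructed example the raw correlation between interaction count and rate is strongly negative, yet it vanishes once abundance is held fixed, which is the pattern the control is designed to detect.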
This control was proposed in light of the facts that (1) some high-throughput methods for the characterisation of protein–protein interactions are known to be biased towards abundant proteins, and (2) highly expressed genes are known to evolve slowly. So it is not entirely surprising that when the authors controlled for protein abundance, the relationship between the number of protein–protein interactions and evolutionary rate disappeared (and was even reversed in one case). However, this confounding factor had not been previously considered, and the authors take it to indicate that the relationship between evolutionary rate and protein interaction number is purely artefactual. It may be the case, though, that abundant proteins really are involved in more protein–protein interactions, and so the correlation between rate and interactions is real, although not necessarily indicative of causation.

BIOLOGICAL NETWORKS

Evolutionary conservation of motif constituents in the yeast protein interaction network
Stephan Wuchty, Zoltán N. Oltvai and Albert-László Barabási
Nature Genetics (2003) Vol. 35, pp. 176–179

Cellular functions are carried out by interacting proteins that can be considered to be related by a network. Over the last several years, such biological networks have been studied in substantial detail, with the emphasis being placed on the networks’ topological properties. Consideration of the function and evolution of the network components – ie proteins and protein complexes – has been largely absent from this field of inquiry. Wuchty et al. add an important dimension to the study of biological networks by considering the evolutionary trajectories of proteins that are organised into cohesive interaction patterns (motifs). A database of Saccharomyces cerevisiae protein–protein interactions was used to identify motifs, topologically distinct interaction patterns, made up of two to five proteins.
The yeast proteins that make up the motifs were then considered with respect to their level of evolutionary conservation; specifically, for any motif, the fraction of proteins that have an orthologue present in each of five other eukaryotes studied was determined. Proteins that belong to specific topological motifs are more conserved across species than those that are not found in such motifs. Furthermore, the proteins found in motifs that have fewer and less connected proteins are less conserved than proteins found in larger, more connected motifs. Different kinds of motifs were found to be preferentially associated with specific cellular functions, and the rate of evolution of proteins in specific motifs is related to their functional role. Taken together, these results suggest that protein interaction motifs represent coherent modules of proteins that are conserved together over evolutionary time by virtue of the shared function that they perform.

Protein complexes and functional modules in molecular networks
Victor Spirin and Leonid A. Mirny
Proceedings of the National Academy of Sciences USA (2003) Vol. 100, pp. 12123–12128

The application of high-throughput experimental techniques to the study of functional genomics has led to the enumeration of a number of different types of cellular networks. Perhaps the most widely studied networks of this kind are made up of proteins (nodes) that are connected (linked) to other proteins by virtue of the physical interactions between them. Numerous studies of such protein–protein interaction networks have focused on their overall architecture by analysing properties such as the degree distributions (ie number of links per node) and clustering coefficients.
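The two architectural properties just named can be computed directly from an adjacency list. The following is a minimal sketch on a hypothetical five-protein network; the node names, edges and function names are invented for illustration and do not come from any of the data sets discussed here.

```python
from collections import Counter
from itertools import combinations


def build_graph(edges):
    """Turn a list of undirected edges into a node -> set-of-neighbours map."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def degree_distribution(graph):
    """Count how many nodes have each degree (number of links per node)."""
    return Counter(len(neighbours) for neighbours in graph.values())


def clustering_coefficient(graph, node):
    """Fraction of pairs of a node's neighbours that are themselves linked."""
    neighbours = graph[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # convention: undefined cases are reported as zero
    linked = sum(1 for u, v in combinations(neighbours, 2) if v in graph[u])
    return 2.0 * linked / (k * (k - 1))


# Hypothetical toy network: a triangle A-B-C with a tail C-D-E.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]
graph = build_graph(edges)

print(dict(degree_distribution(graph)))
for node in sorted(graph):
    print(node, round(clustering_coefficient(graph, node), 3))
```

Nodes inside the triangle have high clustering coefficients, while the nodes on the tail have none, which is the kind of local density that cluster-finding approaches exploit.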
Spirin and Mirny also study protein–protein interaction networks, but instead of focusing on the large-scale properties of the network, they examine relatively small clusters (5–25 proteins) whose members have many more connections between one another than with the rest of the network. They reason that these clusters (motifs) represent the most biologically relevant assemblages of proteins, such as those that are involved in signal transduction, transcription and translation. Several algorithms were developed for the identification of protein clusters and applied to a network of yeast protein–protein interactions. More than 50 protein clusters were identified in this way, and the identity of the proteins in the clusters was considered with respect to their annotation and relevant experimental data. Two types of cellular modules were discovered using this approach: protein complexes and dynamic functional units. The members of a protein complex interact with one another at the same time and place and form a single molecular machine; examples of such protein complexes include transcription factors and spliceosome components. On the other hand, dynamic functional units are made up of proteins that participate in a specific cellular process but do so by interacting with one another at different times and in different cellular locations. For example, proteins involved in signalling pathways and cell cycle progression make up dynamic functional modules. The authors demonstrate that the protein clusters identified in their study, in addition to being biologically germane, are highly statistically significant and robust to noise (spurious interactions) in the data set.

Bioinformatics analysis of experimentally determined protein complexes in the yeast Saccharomyces cerevisiae
Zoltán Dezso, Zoltán N. Oltvai and Albert-László Barabási
Genome Research (2003) Vol. 13, pp. 2450–2454

Virtually all cellular functions are performed by proteins that do not work alone, but rather act together as components of multi-protein complexes. The identity of the proteins that function together in such complexes can be determined in large-scale mass spectrometry experiments, and this approach has been employed extensively for the yeast Saccharomyces cerevisiae. The resulting parts lists of multi-protein complexes are of course quite useful but tell only part of the story. The authors of this report combine these protein interaction data with other genome-scale data sets reporting protein function, expression pattern, essentiality (ie deletion phenotype) and cellular localisation to try to better understand the structure and function of protein complexes. By comparing these different parameters, they find that the function and essentiality of any given protein complex can be characterised by a small core of protein subunits. The proteins in this core tend to share similar expression patterns, belong to the same functional class and possess similar cellular localisations and deletion phenotypes. In addition to these core members of protein complexes there is another group of proteins with far less self-consistent values for each of these parameters. It is postulated that these more peripheral proteins may correspond to subunits that are only transiently attached to a complex and that some may even represent spurious members of a complex that have been misidentified. The identity of the proteins in the characteristic core of any protein complex can be used to predict the function of the entire complex, and this functional annotation can be extrapolated, with varying degrees of confidence, to other members of the complex.
This method offers a powerful and justifiable approach for the prediction of function for many proteins with as yet unknown biochemical and cellular function.

Functional modules by relating protein interaction networks and gene expression
Sabine Tornow and H. W. Mewes
Nucleic Acids Research (2003) Vol. 31, pp. 6283–6289

The business of the cell is carried out by proteins that work closely together, and the coordinated action of these proteins can be usefully conceptualised as a network. The nodes in these networks are genes/proteins and the connections between nodes can represent qualitatively distinct interactions such as regulatory interactions or physical associations. Because these different types of interactions are biologically related, the networks that capture them are expected to show some degree of consistency. Tornow and Mewes emphasise that such consistency can be taken both as a measure of support for distinct inferences on the coordinated action of proteins and as an indication of the functional relationships between proteins. In this report, they analyse the concordance between groups of Saccharomyces cerevisiae proteins connected in distinct networks – co-expression versus physical interaction – in order to make strongly supported inferences about their function. Towards this end, they propose a novel statistical technique for the analysis of the relationships between proteins based on the connections between them in different networks. Specifically, they determine the correlation in expression for a group of genes that were identified to cluster together physically by protein interaction data and assess the probability that this expression correlation is due to chance. Their approach is demonstrated to be superior to a simpler method based on average correlations between genes in networks.
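The general question of whether the co-expression of a physically interacting group could arise by chance can be illustrated with an empirical resampling test. The sketch below is an assumption-laden illustration, not the statistic proposed by Tornow and Mewes (which this summary does not specify): it scores a group by its mean pairwise Pearson correlation, which is closer to the simpler average-correlation baseline, and compares that score with randomly drawn gene sets of the same size. All names and the synthetic expression profiles are hypothetical.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mean_pairwise_corr(profiles, genes):
    """Average expression correlation over all pairs of genes in a group."""
    pairs = list(combinations(genes, 2))
    return sum(pearson(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)


def empirical_p_value(profiles, cluster, n_draws=1000, seed=0):
    """Share of random same-size gene sets scoring at least as high."""
    rng = random.Random(seed)
    observed = mean_pairwise_corr(profiles, cluster)
    all_genes = sorted(profiles)
    hits = 0
    for _ in range(n_draws):
        sample = rng.sample(all_genes, len(cluster))
        if mean_pairwise_corr(profiles, sample) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_draws + 1)


# Synthetic expression profiles (NOT real data): three "cluster" genes
# follow a shared trend across ten conditions; seventeen others are noise.
rng = random.Random(1)
profiles = {}
for i in range(3):
    profiles[f"cluster_gene_{i}"] = [t + rng.gauss(0, 0.1) for t in range(10)]
for i in range(17):
    profiles[f"background_gene_{i}"] = [rng.gauss(0, 1) for _ in range(10)]

cluster = [f"cluster_gene_{i}" for i in range(3)]
print("mean pairwise correlation:", round(mean_pairwise_corr(profiles, cluster), 3))
print("empirical p-value:", empirical_p_value(profiles, cluster))
```

A small p-value here only says that random gene sets rarely match the group's co-expression; a production analysis would also need to account for the size of the gene universe and for multiple testing across many clusters.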
The method, as articulated here, can also be applied to any number of combinations of different sources of functional information obtained with high-throughput techniques. The superposition of expression and physical interaction networks leads to the exposition of well-supported functional modules such as complexes involved in transcription and translation. This approach should, in principle, be able to lead to the functional annotation of previously uncharacterised proteins.

Reconciling gene expression data with known genome-scale regulatory network structures
Markus J. Herrgård, Markus W. Covert and Bernhard Ø. Palsson
Genome Research (2003) Vol. 13, pp. 2423–2434

The activity and expression of the proteins involved in cellular function are controlled by hierarchical cascades of regulatory interactions. Series of binary regulatory interactions, where the products of one gene activate or repress the expression of another, can be resolved into regulatory networks to explore the mechanics of this process. Traditionally this has been done through the painstaking reconstruction of networks by combining information on individual regulatory interactions culled from experimental information that is represented in the literature and databases. Now, the application of computational analysis to genome-scale expression data provides a novel systems-based approach that holds the possibility of reconstructing entire regulatory networks in one fell swoop. For the first time, Herrgård et al. compare these two disparate approaches to assess how they may differ, where they may support one another and in what sense they can be considered to be complementary. They have analysed two thoroughly studied systems, Escherichia coli and Saccharomyces cerevisiae.
Both of these organisms have regulatory networks that have been well resolved using the traditional interaction-by-interaction methodology, and both have copious amounts of gene expression data gleaned from numerous large-scale experiments. They computed the consistency between four decomposed elements of regulatory networks given by each method. The consistency between methods was found to be influenced by both the network structure and the function of the genes in the network. Interestingly, gene expression data seem to be much better at confirming relationships between genes that are targets of the same regulators than at revealing interactions between regulator genes and their targets. In addition, those regulatory network elements that include activators are more consistent than those that are connected to repressors. Taken together, their results suggest some specific ways that large-scale gene expression data can be used to enhance and expand existing knowledge of gene regulatory networks.

I. King Jordan
National Centre for Biotechnology Information, National Institutes of Health, 8600 Rockville Pike, Bethesda, Maryland 20894, USA
