* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Computational analysis of human disease
Survey
Document related concepts
Designer baby wikipedia , lookup
Public health genomics wikipedia , lookup
Microevolution wikipedia , lookup
Genome evolution wikipedia , lookup
Gene expression profiling wikipedia , lookup
Genome (book) wikipedia , lookup
Gene nomenclature wikipedia , lookup
Therapeutic gene modulation wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Neuronal ceroid lipofuscinosis wikipedia , lookup
Protein moonlighting wikipedia , lookup
Frameshift mutation wikipedia , lookup
Epigenetics of neurodegenerative diseases wikipedia , lookup
Transcript
247 Computational analysis of human disease-associated genes and their protein products Kodangattil R Sreekumar, L Aravind and Eugene V Koonin* The complete genome sequences for human, Drosophila melanogaster and Arabidopsis thaliana have been reported recently. With the availability of complete sequences for many bacteria and archaea, and five eukaryotes, comparative genomics and sequence analysis are enabling us to identify counterparts of many human disease genes in model organisms, which in turn should accelerate the pace of research and drug development to combat human diseases. Continuous improvement of specialized protein databases, together with sensitive computational tools, have enhanced the power and reliability of computational prediction of protein function. Addresses National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA ∗e-mail: [email protected] Current Opinion in Genetics & Development 2001, 11:247–257 0959-437X/01/$ — see front matter © 2001 Elsevier Science Ltd. All rights reserved. Abbreviations BIR baculovirus inhibition of apoptosis protein repeats CARD caspase recruitment domain CBD carbohydrate-binding domain FRDA Friedreich’s ataxia IAP inhibitor of apoptosis MALT mucosa-associated lymphoid tissue MKKS McKusick–Kaufman syndrome PTP protein tyrosine phosphatase TFPP type-4 prepilin peptidase VEGFR-3 vascular endothelial growth factor receptor-3 phylogenetically divergent sources. This branch of biology has greatly benefited from the finished and continuing genome-sequencing projects, as well as sequencing efforts undertaken on a smaller scale. Genome sequencing of three eukaryotic organisms — the fruitfly Drosophila melanogaster, the thale cress Arabidopsis thaliana and (in a draft form) Homo sapiens — has been completed in the past two years [1••–4••]. Table 1 lists some of the genes implicated in human disorders during the past year, and some of the previously identified human disease genes for which computational analysis in the same time frame has produced important new findings. Procedures for undertaking rigorous sequence analysis, certain pitfalls in such analysis, methods to circumvent these problems, and the predictive power of computational analysis have been described previously [5–7]. Selected tools and databases that are currently available and are widely used in sequence and structure analysis are listed in Table 2. Here we focus on recent insights obtained by computational analysis of disease genes, and in particular how defects in one or more protein domains affect the cellular function of the encoded protein. The work that we discuss serves only to illustrate how computational analysis of disease genes can provide important clues about molecular basis of pathogenesis (we apologize to those researchers whose important contributions to this burgeoning field are not be cited owing to space limitations). Conserved globular domains and their defects Introduction Inherited diseases are caused by mutation(s) in one or more genes. Diseases that are due to defect(s) in a single gene are called monogenic diseases; polygenic diseases are caused by defect(s) in more than one gene, with all those genes individually contributing to the development of the disease. In some diseases, such as Lafora’s (see below), the defect in proteins implicated in the disease can be directly related to the specific pathological state, whereas in others the link between the disease manifestation and the apparent defect observed at the level of the nucleotide or deduced amino-acid sequence may not be obvious. Irrespective of whether there is an obvious relationship between a defective gene and the pathological state, once a disease-associated gene (or a ‘disease gene’, in a common, if not precise, parlance) is identified, computational analysis of the gene is a critical step that can provide clues to the molecular basis of pathogenesis and invaluable insights for further experimental analysis. Most proteins consist of one or more evolutionarily conserved domains, each with a distinct structure and function. An alteration of the amino-acid sequence of a domain is likely to hamper its proper functioning, especially if the alteration is in a highly conserved region and/or involves a non-conservative replacement of an amino-acid residue. Several tools are available for the automatic detection of protein domains to aid in function prediction (Table 2); however, the existing domain databases are still far from comprehensive, and the detection methods have limited sensitivity. It is therefore critical to analyze protein sequence in detail, on a case-by-case basis, by using several tools and methods and by assessing the relevance of the results obtained by computational approaches in the context of the available experimental data. Below, some recent discoveries of defects in proteins that result in human disorders are discussed with a view to relate a defective domain/motif to specific biological consequences. Bcl10 and mucosa-associated lymphoid tissue lymphoma Computational sequence analysis draws its strength from the conservation of critical features of a gene/protein in A frameshift mutation in Bcl10, resulting in truncation distal to the caspase recruitment domain (CARD), is 248 Genetics of disease Table 1 Protein products of some genes mutated in human diseases. Disease Protein Domain(s)* Soluble, globular proteins MALT Paracaspase DEATH+ 2(IG) + lymphoma paracaspase Other sequence features/motifs – Demonstrated or predicted protein function Effect of mutation(s) Reference Predicted protein oligomerization, Translocation produces a interaction with other proteins chimericprotein that might be and protease activity potentially an inhibitor of apoptosis. involved in apoptosis [13] [8,9] MALT lymphoma Bcl10 CARD Serine/ Threoninerich region The Ser/Thr-rich domain may be involved in regulatory phosphorylation. The CARD domain mediates specific proteinprotein interactions in the apoptotic system Frameshift and truncation beyond the CARD domain might renderthe protein unstable or disruptSer/Thr phosphorylation Fukuyama type CMD Fukutin Predicted phosphorylsugar/choline transferase Signal peptide Similarity to bacterial proteins suggests a role in modifying cell-surface glycoproteins or glycolipids Transposon insertion disrupts [27,29] the predicted enzymatic domain Friedreich’s ataxia Frataxin Frataxin Mitochondrial import peptide Nuclear-encoded mitochondrial protein with a key role in the regulation of energy conversion Several point mutations affect [30,33] conserved amino-acid residues Parkinson’s disease Parkin Ubiquitin+ PARKIN_finger + RING – E3 ubiquitin-protein ligase Mutations in ubiquitin domain [55,56•] affect binding to target proteins; mutations in RING prevent recruitment of E2 Giant axonal Gigaxonin neuropathy POZ+IVR+ kelch repeats – POZ-kelch domain combination suggests a role in cytoskeleton structure–assembly Mutations that disrupt POZ or [57] Kelch domains might disrupt protein–protein interactions causing in cytoskeleton defects Retinitis pigmentosa MERTK Ig-like+ Fn3-like+ Tyr-kinase – Fn3 and Ig-like modules indicate ligand-binding ability; Tyr-Kinase domain implies protein phosphorylation; MERTK may be a receptor for a specific extracellular ligand Mutations affecting Tyr-kinase [58] domaindisrupt the phosphorylation cascade activated by it. Laterality defects CFC1 EGF+ CFC Macular corneal dystrophy CHST6 Sulfo-transferase Signal peptide Transfers sulfate groups from PAPS to various substrates R50C mutation in the conserved [60] region encompassing the sulfate donor binding site may prevent PAPSbinding Dominant OPA1 optic atrophy Dynamin-like GTPase Mitochondrial import signal Possible role in maintenance and inheritance of mitochondria R290Q, G300E, deletion of [45,46] invariant I432 and a nonsense mutation in the dynamin GTPase domain X-linked congenital SNB nyctalopin LRR-NT+ 15(LRR)+ LRR-CT Signal peptide, GP anchor Predicted LRR–containing glyco-protein of extracelluar matrix mediating protein–protein interactions Missense mutations cause truncation, insertion and deletions affecting LRRs and/or loss of GPI–anchor [61,62] X-linked mental retardation ARHGEF6 CH+SH3+ RhoGEF+ PH – Signal transduction via the RhoGTPase cycle (RhoGEF domain); predicted to interact with actin filaments proline-rich peptides in proteins and inositol phosphate. Intronic mutation resulting in exon 2 skipping, affects CH domain [63] Lafora’s disease EPM2A CBD-4 +tyrosine Signal peptide phosphatase (PTP) Signal peptide Extracellular signal molecule and membraneassociating region Mutations in the conserved residuesin EGF domain could disrupt inter-actions with its receptor Binds polysaccharides via CBD-4; W32G mutation in CBD, phosphatase domain may signal T194I in the PTP domain the catabolism of Lafora bodies. [59] [23] Computational analysis of disease-associated genes and their products Sreekumar et al. 249 Table 1 (continued) Protein products of some genes mutated in human diseases. Disease Protein Domain(s)* Other sequence features/motifs Demonstrated or predicted protein function Usher syndrome type 1c Harmonin 2(PDZ)+bZIP+ PDZ Proline/Serine/ Threonine-rich region; leucine zipper Predicted to dimerize via leucine Mutations resulting in loss of zipper and mediate proteinmost orall of the globular protein interactions via PDZ domains. domain. The P/S/T-rich region might serve as binding site for SH3 and WW domain-containing proteins [64] Familial segmental glomerulosclerosis ACTN4 2(CH)+ 4(SPEC) + 2(EF) – Predicted actin-binding and cross-linking protein (CH domains) and Ca-binding (EF hands) protein. Probable component of the cyto-skeletal network (SPEC domain) K228E, T232I, and S235P mutations in the 2nd CH domain cause increased affinity for actin [65] May-Hegglin MYH9 anomaly and Fechtner and Sebastian syndromes Myosin + IQ – Nonmuscle myosin heavy chain 9; predicted to bind calmodulin (IQ motif) N93K predicted to destabilize [66] the 2nd helix of myosin catalytic domain, R702C predicted to alter helical stability and ATPase activity. Mulibrey nanism MUL RING+ B-box + BBC + MATH – Predicted nuclear protein, may participate in ubiquitinmediated protein degradataion as a E3 ubiquitin ligase (RING) and possibly in apoptosis (MATH) Deletions (splice site and lcoding) ead to truncation, with the loss of MATH domain in two cases McKusickKaufman syndrome MKKS Group II chaperonin – Facilitates correct protein folding in conjunction with ATP hydrolysis One mutation predicted to [34••] affect intermolecular interaction based on structural modeling combined pitutary hormone deficiency LHX3 2(LIM) + HOX – Predicted transcriptional regulator via DNA-binding homeodomain protein–protein interaction via LIM domain A conserved Y replaced by C [67] in LIM domain; frameshift leading to loss of homeodomain Netherton syndrome SPINK5 15 (KAZAL) Signal peptide Potential extracellular adhesion molecule Insertions and deletions leading [68] to frameshift and protein truncation familial cylindromatosis CYLD 3(CAP-GLY)+ Proline-rich B-BOX+ UCH-2 region Nonsense and splice site mutations leading to truncation [69] of the C-terminal 2/3 of the protein Type 2 diabetes MAPK8IP1 SH3 + PTB Predicted to coordinate attachment of cellular organelles to microtubules via CAP-GLY domain and down-regulate protein degradation via the ubiquitin hydrolase via UCH-2 domain Regulation of JNK signaling pathway by mediating protein–protein interactions via the SH3 and PTB domains DNA-binding protein, transcription factor P148H enhances DNA-binding [15,16] affinity of homeobox; R172H and deletion of RK159-160 cause low DNA-binding affinity of homeobox Multiple DNA-binding domains predict a transcription factor 3 nonsense and 3 frameshift mutations cause loss of 2–4 Zn-fingers [71] ER membrane-associated translation initiation factor eIF-2α kinase Mutations in kinase domain or truncations N-terminal to the kinase domain. β-propeller domainin the N-terminus was previouslyundetected [72] CranioMSX2 synostosis and enlarged parietal foramina Homeobox Tricho-rhino- TRPS1 phalangeal syndrome type 1 8 (C2H2) + GATA + 2(C2H2) Membrane proteins Wolcott– EIF2AK3 Rallison syndrome β-propeller + TMS TM + S/T protein kinase N-terminal lowcomplexity region, probably non-globular – Signal peptide Effect of mutation(s) Reference [41•] S59N mutation outside the two [70] globular domains, not fully penetrant 250 Genetics of disease Table 1 (continued) Protein products of some genes mutated in human diseases. Disease Protein Domain(s)* Other sequence features/motifs Demonstrated or predicted protein function Effect of mutation(s) Reference Endotoxin TLR4 hypo-responsiveness 11(LRR) + LRRCT + TMS + TIR Signal peptide Intracellular signaling (TIR domain), cell adhesion and LRRs interaction with protein ligands (extracellular LRRs) Mutations A290G and T399I within the LRRs [73] Diastrophic dysplasia DTD 8(TMS) + STAS – Sulfate transporter Impaired sulfate transport across [39,74, cell membrane cause low sulfate 75] content of cartilage Pendred syndrome Pendrin 10(TMS) + STAS – Potential iodide-chloride transporter Mutation in predicted phosphate-binding loop. Also deletions, missense and splice site mutations [38,39•] congenital chloride diarrhoea DRA 10(TMS) + STAS – Chloride–NaHCO3 exchanger Missense and deletion mutations [39•,76] X-linked mental retardation TM4SF2 4(TMS) Member of the tetraspanin family that potentially interact with β-1 integrins. Translocation removing 4th TMS [77] Membrane-associated, ATPdependent transporter Mutations in the NTP-binding motifs (nearly) abolish ATP binding [78] Signal peptide Stargardt ABCA4 disease, agerelated macular degeneration 5(TMS) + ABC+ 7(TMS) + ABC Presenile dementia TYROBP TMS Signal peptide and ITAM motif Predicted transmembrane protein Large deletion, and single base [79] involved in signal transduction via pair deletion cause truncation the ITAM motif in TM segment and loss of ITAM motif Niemann– Pick C1 disease NPC1 13(TMS) + sterol-sensing domain Signal peptide Lysosomal protein involved in Glu20stop: truncation, perhaps [80,81] sequestering sterols in lysosomes null; another truncation results from frameshift in exon2 Spondylocostal dysostosis DLL3 DSL + 11(EGF) + TMS Signal peptide Predicted Ca2+-binding membrane protein involved in signal transduction Insertion leading to truncation in DSL domain; deletion in 4th EGF repeat and G385D mutation in 5thEGF repeat Serine/ Threonine-rich region Predicted protein tyrosine kinase involved in signal transduction and ligand-binding Y755stop, W749stop, and a deletion leading to frameshift [83] beyond G750, perhaps affect interaction with an SH3 domain ATP-powered Ca2+ pump that transports Ca2+ into Golgi bodies Large deletion, missense and splice site mutations cause truncation Class III receptor Tyr kinase; binds ligands and is involved in signal transduction G857R mutation in kinase motif; [36••,85] R1041P, R forms H-bonds with substrate OH- and aspartic acid that is involved in phosphotransfer Apparently a membrane aspartyl protease, with two aspartate required for activity G384A mutation, next to the [14••,86] critical D385, causes higher levels of residues amyloidogenic Aβ42 production BrachyROR2 IGc2 + Frizzled dactyly type B+ Kringle + TMS + TyrKinase Hailey– Hailey disease ATP2C1 8(TMS) + P-type ATPase Primary lymphoedema VEGFR-3 2(IG) + IGc2 + IG + IGc2 + 3(IG) + IGc2 + IG + IGc2 + TMS+TyrKinase Alzheimer’s disease Presenilin 1 9(TMS) – Signal peptide – [82] [84] *Domain name abbreviations are as in the SMART database. The number of domains of a particular type is indicated (for example, 2IG denotes two consecutive immunoglobulin domains). Consecutive domains are connected with ‘+’. Abbreviations: CMD, congenital muscular dystrophy; PAPS, phosphoadenosine phosphosulfate; TM, transmembrane; SNB, stationary night blindness. associated with mucosa-associated lymphoid tissue (MALT) lymphomas that have the translocation t(1;14)(p22;q32) [8,9]. The CARD domain was first identified in a comparison of the sequences of several apoptotic proteins [10]; this CARD domain comprises six antiparallel α helices and mediates homotypic interactions between proteins involved in apoptosis [11]. Besides caspase recruitment, which is a critical step in the chain of events leading to apoptosis, protein–protein interactions mediated by CARD contribute to activation of the transcription Computational analysis of disease-associated genes and their products Sreekumar et al. factor NF-κB. The mutated Bcl10 protein seen in MALT lymphomas has lost its pro-apoptotic functions but retains the ability to activate NF-κB [12]. In a different MALT lymphoma translocation, t(11;18(q21;q21), a fusion of the inhibitor of apoptosis (IAP)-2 gene to the MLT1/MALT1 locus generates a chimeric protein comprising BIR repeats of IAP-2 and the caspase-like predicted protease domain of the protein designated paracaspase [13••]. The paracaspase family of caspase homologs has been identified recently, along with the metacaspase family, in a detailed computational analysis of the caspase-like protease superfamily [13••]. Paracaspases contain a predicted caspase-like proteolytic domain, a Death domain capable of mediating specific protein–protein interactions and, in certain cases, including the human representative immunoglobulin domains. In the MALT lymphoma translocation t(11;18)(q21;q21), the prodomain of human paracaspase is replaced with BIR repeats that might function as inhibitors of apoptosis. The BIR–paracaspase fusion has been found to activate NF-κB in a manner dependent on the predicted catalytic cysteine of the paracaspase, thus supporting the computational prediction [13••]. Notably, the prodomain of human paracaspase interacts with Bcl10, which suggests that the two translocations associated with MALT lymphomas affect the same apoptotic pathway. Alzheimer disease — presenilins Familial Alzheimer disease is associated with mutations in genes that encode the paralogous integral membrane proteins presenilin 1 and presenilin 2. Presenilins are required to cleave several other integral membrane proteins, including the β-amyloid precursor proteins Notch and Ire1 that are central to the pathogenesis of Alzheimer disease. It has been shown that two aspartate residues of presenilin 1 — one located in the intracellular loop between helices 6 and 7 and the other one located in helix 7 — are required for these proteolytic events, leading to the hypothesis that presenilins themselves belong to a distinct class of membrane proteases called γ-secretases [14••]. Sequence database searches using all available methods, including different types of profile analysis, have failed to detect similarity between presenilins and any known proteases. However, a pattern search carried out by Steiner et al. [14••] revealed that the signature [G/A]xGDh (where x is any residue, h is a bulky hydrophobic residue, and alternative residues are shown in brackets) is shared by presenilins and bacterial type-4 prepilin peptidases (TFPP), and includes one of the functionally critical aspartates of both protein families. Presenilins and TFPPs have similar membrane topology, with eight transmembrane helices present in each family, but show no appreciable sequence similarity beyond the above signature. Therefore, although Steiner et al.’s [14••] 251 observation reinforces the hypothesis that presenilins are the catalytically active γ-secretases, rather than cofactors of a still unidentified protease, it remains unclear whether they are homologs of TFPPs or whether the similarity in the (predicted) active sites is due to convergence. Given the relatively relaxed functional constraints that are typical of membrane proteins, with selection preserving mainly the membrane topology rather then sequence, common origin cannot be ruled out despite the absence of sequence conservation. MSX1 and MSX2 Craniosynostosis and enlarged parietal foramina are caused by mutations in the homeodomain — a highly conserved DNA-binding domain containing three helical regions — of the MSX2 protein [15,16]. A mutation (P148H) that leads to craniosynostosis enhances the DNA-binding affinity of this protein, whereas another mutation (R172H) that results in parietal foramina has the opposite effect of lowering the protein’s affinity for DNA. Similarly, a mutation in the MSX1 gene (R31P) that affects the second helix of the homeodomain [17], and a nonsense mutation that leads to a truncation at position 104 and a peptide lacking the entire homeodomain [18] are implicated in tooth agenesis. Other examples of diseases caused by mutations in homeodomains include microphthalmia [19], amegakaryocytic thrombocytopenia and radio-ulnar synostosis [20]. The paired (PAX) domain, another DNA-binding module involved in transcription regulation, is defective in certain individuals diagnosed with oligodontia [21]. An enhanced or diminished DNA-binding capacity of a transcription factor is likely to result in altered level of protein(s) encoded by genes under its control, which in turn might lead to pathogenesis. Laforin The presence of polyglucosan inclusion bodies in the brain is characteristic of progressive myoclonus epilepsy or Lafora’s disease. The EPM2A product, laforin, which is defective in patients suffering from this disease, was initially designated as a protein tyrosine phosphatase (PTP) because of the presence of a consensus sequence characteristic of the catalytic site of PTPs [22]. Minassian et al. [23••] have carried out a detailed computational analysis resulting in significant insights about this protein and its probable link to the accumulation of polyglucosan inclusions. Analysis of protein domains in laforin using the Pfam database [24] revealed the presence of the carbohydratebinding domain (CBD)-4 at the amino (N) terminus of the protein, in addition to the dual-specificity phosphatase domain, which belongs to a distinct subfamily of PTPs, at the carboxyl (C) terminus. Furthermore, two sequence motifs characteristic of the glucohydrolase family as documented in PROSITE [25] were detected. On the basis of these findings and the mutations detected in the EPM2A gene, Minassian et al. [23••] proposed that normal laforin prevents the accumulation of polyglucosan 252 Genetics of disease Table 2 Selected tools and databases useful in sequence analysis, with an emphasis on those dedicated to disease genes*. Tools/database Comments URL Reference Tools and databases for functional annotation SignalP Predicts location of signal peptide and cleavage site in proteins. http://www.cbs.dtu.dk/services/SignalP/ [87] TopPred Predicts location and orientation of TM segments. http://www.sbc.su.se/~erikw/toppred2/ [88] CDD Collection of protein domains with links to structures if available. CDD uses a library of PSSMs that is searched by Reverse Position-Specific BLAST to match a protein query. http://www.ncbi.nlm.nih.gov./Structure/cdd/cdd. shtml [89] COG Phylogenetic classification of proteins from complete genomes that groups orthologous proteins; useful for functional annotation of newly sequenced genomes. COGNITOR can be used to place a protein into one of the COGs. http://www.ncbi.nlm.nih.gov./COG/ [52•] Pfam Collection of protein domains and families consisting of http://www.sanger.ac.uk/Pfam/ multiple alignments and hidden Markov models generated in a semi-automatic manner. [24] SMART Searchable collection of protein domains for detecting and analysing domain architecture. http://smart.embl-heidelberg.de/ [90] BLOCKS Database of highly conserved regions of proteins derived automatically and represented in the form of ungapped multiple alignments. Also tools to detect and verify protein sequence conservation. http://www.blocks.fhcrc.org/ [91] PRINTS Database of protein fingerprints (group of motifs distributed in the sequence characteristic of a family). http://bmbsgi11.leeds.ac.uk/bmb5dp/prints.html [92] PROSITE Database of biologically significant sites, patterns and profiles in proteins that helps in function prediction. http://www.expasy.ch/prosite/ [25] http://www.ncbi.nlm.nih.gov/BLAST/ [28] Tools for sequence similarity search BLAST Programs for rapid sequence similarity searches for protein and DNA sequences. Also provides searches for nucleotide sequences translated in all six frames. HMMER Programs for constructing multiple protein sequence alignments and performing database searches using the HMM approach. http://hmmer.wustl.edu [93] PSI-BLAST An implementation of BLAST capable of detecting distantly related proteins by creating a PSSM from the significant matches detected in one round and using that PSSM as the query in the next iteration. http://www.ncbi.nlm.nih.gov./blast/psiblast.cgi [28] http://dodo.cpmc.columbia.edu/pp/ predictprotein.html [94] Tools for protein structure prediction and comparison PHD Neural network dependent secondary structure prediction. PSI-PRED Prediction of secondary structure using PSI-BLAST derived scoring matrices. http://insulin.brunel.ac.uk/psipred [95] Hybrid fold recognition Combined algorithm for fold recognition using both sequence-based evolutionary information and secondary structure predictions. http://www.cs.bgu.il/~bioinbgu [96] FSSP/DALI Automatic structural classification of protein domains with the http://www2.ebi.ac.uk/dali/fssp/ DALI search facility for structure comparisons. [97] SWISS-PDB Viewer and Swiss Model Powerful structure visualization and homology modeling system for personal computers. http://www.expasy.ch/spdbv/ [98] VAST Structure similarity search tool based on the definition of the threshold of statistically significant structural similarity. http://www.ncbi.nlm.nih.gov/Structure/ [99] Databases for disease genes Genes and Information on selected human diseases with links to genes, Disease chromosomes, scientific literature and OMIM. OMIM Catalog of human genetic disorders and associated phenotypic and genetic information. http://www.ncbi.nlm.nih.gov./disease/ http://www.ncbi.nlm.nih.gov./entrez/ query.fcgi?db=OMIM [100] Computational analysis of disease-associated genes and their products Sreekumar et al. 253 Table 2 legend *The number of tools and databases for computational analysis of genes and proteins that are available online has been rapidly growing during the past few years. Inevitably, this is an arbitrary selection of those we have used extensively and found to be useful. HMM, hidden marker model; PSI-BLAST, position-specific iterating BLAST; PSSM, position specific scoring matrix. inclusion bodies in neurons by binding polyglucosan through the CBD and cleaving the β-linkages through the putative glucohydrolase domain. A more detailed examination of the sequence and domain architecture of laforin shows, however, that the putative glucohydrolase domain largely overlaps with the confidently predicted dualspecificity phosphatase domain, which suggests that the presence of the glucohydrolase motifs in this protein is spurious (KR Sreekumar, L Aravind, EV Koonin, unpublished data). The dual-specificity phosphatase activity of laforin has been confirmed experimentally [26]. The CBD domain of laforin probably targets the protein to polyglucosan inclusion bodies, whereas the phosphatase domain might activate a protein dephosphorylation pathway that triggers its destruction through a still unknown mechanism. This example illustrates both the utility of computational analysis of protein domains for function prediction and the need for caution in interpreting computational results. Fukutin Fukuyama type congenital dystrophy is caused by retroposon insertions in the gene encoding a protein called fukutin [27]. Extensive database searches using the PSI-BLAST program [28] detected moderate, but statistically significant similarity between fukutin and a family of bacterial and eukaryotic enzymes that catalyze phosphoryl-ligand transfer. Additional multiple alignment analysis showed that fukutin shares with these proteins the predicted catalytic residues [29]. The combination of these observations with the prediction of a signal peptide in fukutin, and the experimental data on its localization in the endoplasmic reticulum and secretory granules [27] has led to the prediction that fukutin modifies cell-surface molecules, most probably through the attachment of phosphoryl-sugar moieties [29]. Frataxin Friedreich’s ataxia (FRDA) is an autosomal recessive disease characterized by progressive ataxia, hypertrophic cardiomyopathy and diabetes mellitus. Gibson et al. [30] have carried out computational analysis of the FRDA gene product, frataxin, and shown that it has homologues in bacteria of the γ subdivision of the Proteobacteria, but not in any other prokaryotes, and also that it possesses a non-globular N-terminal domain predicted to function as a mitochondrial targeting peptide [30]. They predicted a distinct fold, consisting of a β sheet flanked by two long α helices, on the basis of a multiple alignment of the frataxin homologs, and further proposed that frataxin is anuclear encoded, and, accordingly, that FRDA is a ‘mitochondrial disease’. Both the mitochondrial localization and the αβ sandwich structure predictions, which were based on computational analysis of frataxin, have been confirmed experimentally [31,32], and more recent studies have shown that frataxin is a key regulator of mitochondrial energy conversion [33•]. McKusick–Kaufman syndrome Mutations in a gene on chromosome 20 have been shown to cause McKusick–Kaufman syndrome (MKKS) and Bardet–Biedl syndrome (see also review by Sheffield et al., this issue, pp 317–321) — developmental anomalies that involve hydrometrocolpos, postaxial polydactyly and congenital heart disease [34••,35]. Sequence database searches show that the predicted protein encoded by the MKKS gene belongs to a family of group II chaperonins that promote protein folding in an ATP-dependent manner. The three-dimensional structural model of the MKKS protein predicts that the Y37C mutation lies in the highly conserved loop region between helix 1 and strand 2 which is involved in interaction among subunits [34••]. Another mutation, H84Y, is predicted to lie in a region responsible for ATP hydrolysis. This appears to be the first description of a disease caused by a defect in a molecular chaperone — a finding that complements the data on the role of defects in transcription regulators in several diseases, as discussed above. Vascular endothelial growth factor receptor-3 Another example of a human disease for which sequence analysis and three-dimensional structure modeling has provided insights into pathogenesis is primary lymphoedema caused by mutations in vascular endothelial growth factor receptor-3 (VEGFR-3) [36••]. The VEGFR-3 protein consists of six classical immunoglobulin domains, four immunoglobulin C2 domains, a signal peptide, a single transmembrane segment and an intracellular tyrosine kinase catalytic domain. Four mutations that co-segregate with lymphoedema phenotype map within the tyrosine kinase domain. Finegold and co-workers [36••] have built a three-dimensional model of VEGFR-3, based on the crystal structure of VEGFR-2, from which functional implications of mutations observed in VEGFR-3 can be inferred. For example, the model shows that the G857R mutation (the last glycine of the conserved GXGXXG kinase motif) lies in a critical turn between β strands 1 and 2 and is expected to affect ATP-binding and catalysis severely. DTD, Pendrin and congenital chloride diarrhoea Diastrophic dysplasia/achondrogenesis type IB (DTD), Pendred’s syndrome and congenital chloride diarrhoea are 254 Genetics of disease caused by malfunctions of anion transporters of the sulfate transporter family [37,38]. Unexpectedly, it has been shown that the C-terminal cytoplasmic portion of these proteins shares a conserved domain named STAS (after sulfate transporter anti-sigma) with bacterial anti-sigmafactor antagonists, and this connection provides clues for the regulation of anion transporters [39•]. (Anti-sigma factors prevent formation of active RNA polymerase holoenzyme by trapping the sigma factor. Anti-sigma antagonists bind the anti-sigma factors and prevent the formation of sigma-antisigma complex.) The STAS domain of the antisigma-factor antagonist binds GTP and has a weak GTPase activity [40], leading to the prediction that the cytoplasmic domain of the anion transporters might regulate transport through NTP binding [39•]. Mulibrey nanism Muscle–liver–brain–eye (Mulibrey) nanism is an autosomal recessive disorder that affects several tissues of mesodermal origin, which suggests that a highly pleiotropic gene is involved in this disease. The complex multidomain architecture of the recently cloned MUL gene product is compatible with this notion [41•]. The MUL protein comprises a RING-finger domain, a B-box domain, a coiled-coil domain and a MATH domain. Avela et al. [41•] suggest that MUL is likely to be involved in various protein–protein interactions that might be important in development regulation; however, the combination of domains present in this protein calls for more specific functional predictions. There is growing evidence that the primary function of the RING domain is as the E3 component of the ubiquitin ligase cascade that facilitates the transfer of ubiquitin from the E2 protein to target proteins [42]. MATH is a versatile adaptor domain that mediates specific protein–protein interactions in the programmed cell death system, chromatin remodeling and other functional contexts [43]. Thus, MUL can be predicted to function as an E3 ubiquitin ligase, with additional regulatory interactions, possibly related to apoptosis, mediated by the MATH domain. Mutations outside functional protein domains Mutations that lie outside globular domains and transmembrane segments of proteins, or outside the protein-coding part of a gene altogether often affect gene expression. Such mutations may be located in untranslated regions, promoters, introns and sequences encoding signal peptides. The effect of these mutations may manifest itself as a defective globular or transmembrane domain, however, owing to exon skipping, aberrant splicing and abnormal cellular localization. Untranslated and non-coding regions Sequence analysis can provide hints as to the molecular basis of defects caused by mutations in untranslated regions. For example, in individuals with severe protein C deficiency, who suffer from massive disseminated intravascular coagulation or neonatal purpura fulminans, promoter mutations in the protein C (PROC) gene have been identified. These mutations alter the consensus sequence for the transcription factor HNF-3-binding, resulting in reduced PROC promoter activity, and accordingly a reduced level of protein C [44]. A splice-site mutation in intron 9 of the OPA1 gene, which has been detected in individuals with autosomal dominant optic atrophy, leads to either exon-10 skipping or a frameshift with a premature stop codon [45]. The OPA1 product is a mitochondrial protein comprising an N-terminal mitochondrial localization signal and a dynamin GTPase domain. The defective protein produced by the mutant gene lacks a part of the GTPase domain [45,46]. Signal peptides Duplications in exon 1 of the TNFRSF11A (RANK) gene that segregates with familial expansile osteolysis have been identified recently [47]. These mutations are short in-frame insertions in the hydrophobic core region of the RANK signal peptide, and affect proper cleavage of the signal peptide. The RANK protein consists of four tumor-necrosis factor repeats and a transmembrane segment, besides the signal peptide that directs the protein to the secretory pathway. Failure to cleave the signal peptide might result in higher intracellular accumulation of defective RANK in compartments of the secretion pathway that could lead to a higher incidence of receptor self-association and increased RANK signal transduction [47]. An NF-κB responsive reporter assay of transfected cells has indeed suggested that mutations occurring in familial expansile osteolysis result in increased constitutive RANK signaling [47]. Conclusions and future directions The examples of disease genes discussed here and in previous publications [6,48–50] illustrate the importance of sequence and structure analysis in predicting the molecular basis of pathogenesis, in guiding further work and in understanding the biology of the respective functional systems. Proteins for which predictions based on sequence analysis have been experimentally verified include frataxin and the MALT-associated paracaspase. Using the complete genome sequences, efforts are being made to assemble databases of orthologous genes from diverse organisms, and these databases are already proving to be valuable in functional annotation of genes [51,52•]. The presence of orthologs for many of the human disease genes in model organisms such as mouse, fruitfly, nematode and yeast enable experimental verification of the predictions. In particular, apparent counterparts for 178 of 287 analyzed human disease genes have been identified in fruitfly by large-scale comparative genomics [53••,54••]. At present, some of the human disease genes do not seem to have an ortholog in other organisms, but the anticipated availability in the next several years of the complete genome sequences of primates and other mammals should drastically reduce the number of ‘orphan’ disease genes. Computational analysis of disease-associated genes and their products Sreekumar et al. The predictive power of sequence and structure analysis, combined with the wealth of genome sequence data, should not only enhance our understanding of the biology that underlies the effect of mutations in disease genes, but also enable the identification of many new drug targets and accelerate the process of drug design and the development of therapeutic strategies. References and recommended reading Papers of particular interest, published within the annual period of review, have been highlighted as: • of special interest •• of outstanding interest 1. •• Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, Amanatides PG, Scherer SE, Li PW, Hoskins RA, Galle RF et al.: The genome sequence of Drosophila melanogaster. Science 2000, 287:2185-2195. The second (nearly) complete genome of a multicellular eukaryote to be sequenced, after the nematode Caenorhabditis elegans. The completion of the fly genome is particularly important: first, because the enormous wealth of genetic data available for this organism can be systematically correlated with gene and protein sequences; and second, because an in-depth comparison of two animal genomes has become possible for the first time. 2. •• The Arabidopsis Genome Initiative: Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 2000, 408:796-815. Sequencing of the first complete plant genome is fundamentally important as it sets the stage for a comparative study of genomes representing the three major lineages of the eukaryotic crown group — animals, fungi and plants. 3. •• Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W et al.: Initial sequencing and analysis of the human genome. Nature 2001, 409:860-921. This paper reports the results of the public effort on human genome sequencing. A preliminary analysis of the predicted human proteome performed with a variety of computational techniques, with particular emphasis on the ‘Interpro’ collection of protein domains, is presented. The major conclusions are that human proteins consistently have more complex domain architectures compared to their homologs from other eukaryotes and domain rearrangements have a prominent role in eukaryotic evolution. Over 1300 apparent orthologs common to human, fly, worm and yeast have been identified. 4. •• Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA et al.: The sequence of the human genome. Science 2001, 291:1304-1351. The report of the human genome sequence from the private company Celera Genomics. In the preliminary analysis of the predicted human proteome, molecular functions for ~60% of the proteins have been assigned by automatic means utilizing the protein family databases such as Pfam and SMART. The authors have also identified a partial set of human–fly and human–worm orthologs. A preliminary survey of the differences between the human genome and other sequenced eukaryotic genomes is also presented. 5. Altschul SF, Boguski MS, Gish W, Wootton JC: Issues in searching molecular sequence databases. Nat Genet 1994, 6:119-129. 6. Bork P, Koonin EV: Predicting functions from protein sequences — where are the bottlenecks? Nat Genet 1998, 18:313-318. 7. Baxevanis AD, Francis Ouellette BF: Bioinformatics: a practical guide to the analysis of genes and proteins. In Methods of Biochemical Analysis, vol 39. New York: John Wiley; 1998:1-370. 8. 9. Willis TG, Jadayel DM, Du MQ, Peng H, Perry AR, Abdul-Rauf M, Price H, Karran L, Majekodunmi O, Wlodarska I et al.: Bcl10 is involved in t(1;14)(p22;q32) of MALT B cell lymphoma and mutated in multiple tumor types. Cell 1999, 96:35-45. Du MQ, Peng H, Liu H, Hamoudi RA, Diss TC, Willis TG, Ye H, Dogan A, Wotherspoon AC, Dyer MJ, Isaacson PG: BCL10 gene mutation in lymphoma. Blood 2000, 95:3885-3890. 10. Hofmann K, Bucher P, Tschopp J: The CARD domain: a new apoptotic signalling motif. Trends Biochem Sci 1997, 22:155-156. 11. Hofmann K: The modular nature of apoptotic signaling proteins. Cell Mol Life Sci 1999, 55:1113-1128. 12. Zhang Q, Siebert R, Yan M, Hinzmann B, Cui X, Xue L, Rakestraw KM, Naeve CW, Beckmann G, Weisenburger DD, Sanger WG, Nowotny H et al.: Inactivating mutations and overexpression of BCL10, a 255 caspase recruitment domain-containing gene, in MALT lymphoma with t(1;14)(p22;q32). Nat Genet 1999, 22:636-638. 13. Uren AG, O’Rourke K, Aravind L, Pisabarro MT, Seshagiri S, Koonin EV, •• Dixit VM: Identification of paracaspases and metacaspases. Two ancient families of caspase-like proteins, one of which plays a key role in MALT lymphoma. Mol Cell 2000, 6:961-967. This example of combining computational analysis and experimental verification reveals the molecular basis of a class of MALT lymphoma translocation. Computational analysis using sensitive tools for sequence-profile analysis was critical for the identification of two new families of caspase-related proteases, which produce a new perspective on the evolution of this important class of enzymes. 14. Steiner H, Kostka M, Romig H, Basset G, Pesold B, Hardy J, Capell A, •• Meyn L, Grim ML, Baumeister R et al.: Glycine 384 is required for presenilin-1 function and is conserved in bacterial polytopic aspartyl proteases. Nat Cell Biol 2000, 2:848-851. Reports the site-directed mutagenesis of presenilin 1 and detection of the similarity between its putative catalytic site and that of bacterial type-4 prepilin peptidases. A functional and possibly evolutionary connection, revealed by computer analysis of protein sequences and structures, is plausible despite the absence of significant sequence similarity. 15. Ma L, Golden S, Wu L, Maxson R: The molecular basis of →His mutation in the Boston-type craniosynostosis: the Pro148→ N-terminal arm of the MSX2 homeodomain stabilizes DNA binding without altering nucleotide sequence preferences. Hum Mol Genet 1996, 5:1915-1920. 16. Wilkie AO, Tang Z, Elanko N, Walsh S, Twigg SR, Hurst JA, Wall SA, Chrzanowska KH, Maxson RE Jr: Functional haploinsufficiency of the human homeobox gene MSX2 causes defects in skull ossification. Nat Genet 2000, 24:387-390. 17. Vastardis H, Karimbux N, Guthua SW, Seidman JG, Seidman CE: A human MSX1 homeodomain missense mutation causes selective tooth agenesis. Nat Genet 1996, 13:417-421. 18. van den Boogaard MJ, Dorland M, Beemer FA, and van Amstel HK: MSX1 mutation is associated with orofacial clefting and tooth agenesis in humans. Nat Genet 2000, 24:342-343. 19. Ferda Percin E, Ploder LA, Yu JJ, Arici K, Horsford DJ, Rutherford A, Bapat B, Cox DW, Duncan AM, Kalnins VI et al.: Human microphthalmia associated with mutations in the retinal homeobox gene CHX10. Nat Genet 2000, 25:397-401. 20. Thompson AA, Nguyen LT: Amegakaryocytic thrombocytopenia and radio-ulnar synostosis are associated with HOXA11 mutation. Nat Genet 2000, 26:397-398. 21. Stockton DW, Das P, Goldenberg M, D’Souza RN, Patel PI: Mutation of PAX9 is associated with oligodontia. Nat Genet 2000, 24.18-19 22. Minassian BA, Lee JR, Herbrick JA, Huizenga J, Soder S, Mungall AJ, Dunham I, Gardner R, Fong CY, Carpenter S et al.: Mutations in a gene encoding a novel protein tyrosine phosphatase cause progressive myoclonus epilepsy. Nat Genet 1998, 20:171-174. 23. Minassian BA, Ianzano L, Meloche M, Andermann E, Rouleau GA, •• Delgado-Escueta AV, Scherer SW: Mutation spectrum and predicted function of laforin in Lafora’s progressive myoclonus epilepsy. Neurology 2000, 55:341-346. This paper underscores both the potential of computational analysis of protein domains and the need to be cautious in interpreting such results. Computational analysis of laforin reveals the presence of a carbohydrate-binding domain and a dual specificity phosphatase domain; however, the apparent glucohydrolase motifs detected in this protein are likely to be spurious. 24. Bateman A, Birney E, Durbin R, Eddy SR, Howe KL, Sonnhammer EL: The Pfam protein families database. Nucleic Acids Res 2000, 28:263-266. 25. Hofmann K, Bucher P, Falquet L, Bairoch A: The PROSITE database, its status in 1999. Nucleic Acids Res 1999, 27:215-219. 26. Ganesh S, Agarwala KL, Ueda K, Akagi T, Shoda K, Usui T, Hashikawa T, Osada H, Delgado-Escueta AV, Yamakawa K: Laforin, defective in the progressive myoclonus epilepsy of lafora type, is a dual-specificity phosphatase associated with polyribosomes. Hum Mol Genet 2000, 9:2251-2261. 27. Kobayashi K, Nakahori Y, Miyake M, Matsumura K, Kondo-Iida E, Nomura Y, Segawa M, Yoshioka M, Saito K, Osawa M, Hamano K et al.: An ancient retrotransposal insertion causes Fukuyama-type congenital muscular dystrophy. Nature 1998, 394:388-392. 28. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of 256 Genetics of disease protein database search programs. Nucleic Acids Res 1997, 25:3389-3402. 29. Aravind L, Koonin EV: The fukutin protein family — predicted enzymes modifying cell-surface molecules. Curr Biol 1999, 9:R836-R837. 30. Gibson TJ, Koonin EV, Musco G, Pastore A, Bork P: Friedreich’s ataxia protein: phylogenetic evidence for mitochondrial dysfunction. Trends Neurosci 1996, 19:465-468. 31. Koutnikova H, Campuzano V, Foury F, Dolle P, Cazzalini O, Koenig M: Studies of human, mouse and yeast homologues indicate a mitochondrial function for frataxin. Nat Genet 1997, 16:345-351. 32. Dhe-Paganon S, Shigeta R, Chi YI, Ristow M, Shoelson SE: Crystal structure of human frataxin. J Biol Chem 2000, 275:30753-30756. 33. Ristow M, Pfister MF, Yee AJ, Schubert M, Michael L, Zhang CY, Ueki K, • Michael MD 2nd, Lowell BB, Kahn CR: Frataxin activates mitochondrial energy conversion and oxidative phosphorylation. Proc Natl Acad Sci USA 2000, 97:12239-12243. Experimental validation of the role of frataxin in mitochondrial energy conversion that corroborates with the computational prediction. 34. Stone DL, Slavotinek A, Bouffard GG, Banerjee-Basu S, Baxevanis AD, •• Barr M, Biesecker LG: Mutation of a gene encoding a putative chaperonin causes McKusick–Kaufman syndrome. Nat Genet 2000, 25:79-82. This paper illustrates how DNA sequence data obtained form positional cloning experiments can be subjected to computational analysis and further structure modeling to provide insights into biology of the system. On the basis of sequence analysis, MKKS is predicted to be a chaperonin. The structure of MKKS is modeled to explain the loss of function associated with mutations in this gene. 35. Slavotinek AM, Stone EM, Mykytyn K, Heckenlively JR, Green JS, Heon E, Musarella MA, Parfrey PS, Sheffield VC, Biesecker LG: Mutations in MKKS cause bardet-biedl syndrome. Nat Genet 2000, 26:15-16. 36. Karkkainen MJ, Ferrell RE, Lawrence EC, Kimak MA, Levinson KL, •• McTigue MA, Alitalo K, Finegold DN: Missense mutations interfere with VEGFR-3 signalling in primary lymphoedema. Nat Genet 2000, 25:153-159. Homology-based structure modeling of VEGFR-3 predicts that a mutation implicated in primary lymphoedema eliminates ATP binding by the serine/threonine protein kinase domain of this protein. 37. Scott DA, Wang R, Kreman TM, Sheffield VC, Karnishki LP: The Pendred syndrome gene encodes a chloride-iodide transport protein. Nat Genet 1999, 21:440-443. 38. Everett LA, Green ED: A family of mammalian anion transporters and their involvement in human genetic diseases. Hum Mol Genet 1999, 8:1883-1891. 39. Aravind L, Koonin EV: The STAS domain — a link between anion • transporters and antisigma- factor antagonists. Curr Biol 2000, 10:R53-R55. This paper describes an NTP-binding STAS domain common to anion transporters and bacterial antisigma-factor antagonists, providing clues to the regulation of anion transporters. This finding illustrates the potential of sequence analysis to reveal completely unexpected connections between domains contained in proteins that are not considered to be evolutionarily or functionally related. 40. Najafi SM, Harris DA, Yudkin MD: The SpoIIAA protein of Bacillus subtilis has GTP-binding properties. J Bacteriol 1996, 178:6632-6634. 41. Avela K, Lipsanen-Nyman M, Idanheimo N, Seemanova E, Rosengren S, • Makela TP, Perheentupa J, Chapelle A, Lehesjoki AE: Gene encoding a new RING-B-box-coiled-coil protein is mutated in mulibrey nanism. Nat Genet 2000, 25:298-301. Identification of a disease gene coding for a protein with a complex domain architecture. It appears that even more specific predictions than described in this paper are possible on the basis of the combination of domains present in the MUL protein. 42. Jackson PK, Eldridge AG, Freed E, Furstenthal L, Hsu JY, Kaiser BK, Reimann JD: The lore of the RINGs: substrate recognition and catalysis by ubiquitin ligases. Trends Cell Biol 2000, 10:429-439. 43. Aravind L, Dixit VM, Koonin EV: The domains of death: evolution of the apoptosis machinery. Trends Biochem Sci 1999, 24:47-53. 44. Millar DS, Johansen B, Berntorp E, Minford A, Bolton-Maggs P, Wensley R, Kakkar V, Schulman S, Torres A, Bosch N, Cooper DN: Molecular genetic analysis of severe protein C deficiency. Hum Genet 2000, 106:646-653. 45. Delettre C, Lenaers G, Griffoin JM, Gigarel N, Lorenzo C, Belenguer P, Pelloquin L, Grosgeorge J, Turc-Carel C, Perret E et al.: Nuclear gene OPA1, encoding a mitochondrial dynamin-related protein, is mutated in dominant optic atrophy. Nat Genet 2000, 26:207-210. 46. Alexander C, Votruba M, Pesch UE, Thiselton DL, Mayer S, Moore A, Rodriguez M, Kellner U, Leo-Kottler B, Auburger G et al.: OPA1, encoding a dynamin-related GTPase, is mutated in autosomal dominant optic atrophy linked to chromosome 3q28. Nat Genet 2000, 26:211-215. 47. Hughes AE, Ralston SH, Marken J, Bell C, MacPherson H, Wallace RG, van Hul W, Whyte MP, Nakatsuka K, Hovy L, Anderson DM: Mutations in TNFRSF11A, affecting the signal peptide of RANK, cause familial expansile osteolysis. Nat Genet 2000, 24:45-48. 48. Koonin EV, Altschul SF, Bork P: BRCA1 protein products... Functional motifs. Nat Genet 1996, 13:266-268. 49. Madej T, Boguski MS, Bryant SH: Threading analysis suggests that the obese gene product may be a helical cytokine. FEBS Lett 1995, 373:13-18. 50. Mushegian AR, Bassett DE Jr, Boguski MS, Bork P, Koonin EV: Positionally cloned human disease genes: patterns of evolutionary conservation and functional motifs. Proc Natl Acad Sci USA 1997, 94:5831-5836. 51. Tatusov RL, Koonin EV, Lipman DJ: A genomic perspective on protein families. Science 1997, 278:631-637. 52. Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram UT, • Rao BS, Kiryutin B, Galperin MY, Fedorova ND, Koonin EV: The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res 2001, 29:22-28. An evolutionary classification of proteins from sequenced genomes based on the notion of orthologous relationships. This new version includes proteins from C. elegans and Drosophila that have homologs in Archaea and/or Bacteria. 53. Fortini ME, Skupski MP, Boguski MS, Hariharan IK: A survey of •• human disease gene counterparts in the Drosophila genome. J Cell Biol 2000, 150:F23-F30. A careful analysis of human disease genes with the goal of identifying counterparts of disease genes in fruitfly — an approach that is applicable to other model eukaryotic organisms as well. 54. Rubin GM, Yandell MD, Wortman JR, Gabor Miklos GL, Nelson CR, •• Hariharan IK, Fortini ME, Li PW, Apweiler R, Fleischmann W et al.: Comparative genomics of the eukaryotes. Science 2000, 287:2204-2215. A comparative analysis of the predicted proteomes of three eukaryotes — yeast, worm and fly — that reveals unity and diversity of protein families and domain architecture of proteins in these organisms. Being the first attempt of such systematic comparative analysis, many of the conclusions need further refinement, but some of the approaches used in this study have general application in comparative genomics. 55. Morett E, Bork P: A novel transactivation domain in parkin. Trends Biochem Sci 1999, 24:229-231. 56. Shimura H, Hattori N, Kubo S, Mizuno Y, Asakawa S, Minoshima S, • Shimizu N, Iwai K, Chiba T, Tanaka K, Suzuki T: Familial Parkinson disease gene product, parkin, is a ubiquitin-protein ligase. Nat Genet 2000, 25:302-305. Experimental confirmation of the predicted function of a protein containing the RING domain. 57. Bomont P, Cavalier L, Blondeau F, Hamida CB, Belal S, Tazir M, Demir E, Topaloglu H, Korinthenberg R, Tuysuz B et al.: The gene encoding gigaxonin, a new member of the cytoskeletal BTB/kelch repeat family, is mutated in giant axonal neuropathy. Nat Genet 2000, 26:370-374. 58. Gal A, Li Y, Thompson DA, Weir J, Orth U, Jacobson SG, Apfelstedt-Sylla E, Vollrath D: Mutations in MERTK, the human orthologue of the RCS rat retinal dystrophy gene, cause retinitis pigmentosa. Nat Genet 2000, 26:270-271. 59. Bamford RN, Roessler E, Burdine RD, Saplakoglu U, dela Cruz J, Splitt M, Towbin J, Bowers P, Marino B, Schier AF et al.: Loss-offunction mutations in the EGF-CFC gene CFC1 are associated with human left-right laterality defects. Nat Genet 2000, 26:365-369. 60. Akama TO, Nishida K, Nakayama J, Watanabe H, Ozaki K, Nakamura T, Dota A, Kawasaki S, Inoue Y, Maeda N et al.: Macular corneal dystrophy type I and type II are caused by distinct mutations in a new sulphotransferase gene. Nat Genet 2000, 26:237-241. 61. Bech-Hansen NT, Naylor MJ, Maybaum TA, Sparkes RL, Koop B, Birch DG, Bergen AA, Prinsen CF, Polomeno RC, Gal A et al.: Computational analysis of disease-associated genes and their products Sreekumar et al. Mutations in NYX, encoding the leucine-rich proteoglycan nyctalopin, cause X-linked complete congenital stationary night blindness. Nat Genet 2000, 26:319-323. 62. Pusch CM, Zeitz C, Brandau O, Pesch K, Achatz H, Feil S, Scharfe C, Maurer J, Jacobi FK, Pinckers A et al.: The complete form of X-linked congenital stationary night blindness is caused by mutations in a gene encoding a leucine-rich repeat protein. Nat Genet 2000, 26:324-327. 63. Kutsche K, Yntema H, Brandt A, Jantke I, Gerd Nothwang H, Orth U, Boavida MG, David D, Chelly J, Fryns JP et al.: Mutations in ARHGEF6, encoding a guanine nucleotide exchange factor for rho GTPases, in patients with X-linked mental retardation. Nat Genet 2000, 26:247-250. 64. Verpy E, Leibovici M, Zwaenepoel I, Liu XZ, Gal A, Salem N, Mansour A, Blanchard S, Kobayashi I, Keats BJ et al.: A defect in harmonin, a PDZ domain-containing protein expressed in the inner ear sensory hair cells, underlies usher syndrome type 1C. Nat Genet 2000, 26:51-55. 65. Kaplan JM, Kim SH, North KN, Rennke H, Correia LA, Tong HQ, Mathis BJ, Rodriguez-Perez JC, Allen PG, Beggs AH, Pollak MR: Mutations in ACTN4, encoding alpha-actinin-4, cause familial focal segmental glomerulosclerosis. Nat Genet 2000, 24:251-256. 66. Mutations in MYH9 result in the may-hegglin anomaly, and fechtner and sebastian syndromes. Nat Genet 2000, 26:103-105. 67. Netchine I, Sobrier ML, Krude H, Schnabel D, Maghnie M, Marcos E, Duriez B, Cacheux V, Moers A, Goossens M, Gruters A, Amselem S: Mutations in LHX3 result in a new syndrome revealed by combined pituitary hormone deficiency. Nat Genet 2000, 25:182-186. 68. Chavanas S, Bodemer C, Rochat A, Hamel-Teillac D, Ali M, Irvine AD, Bonafe JL, Wilkinson J, Taieb A, Barrandon Y, Harper JI, de Prost Y, Hovnanian A: Mutations in SPINK5, encoding a serine protease inhibitor, cause Netherton syndrome. Nat Genet 2000, 25:141-142. 69. Bignell GR, Warren W, Seal S, Takahashi M, Rapley E, Barfoot R, Green H, Brown C, Biggs PJ, Lakhani SR, Jones C, Hansen J et al.: Identification of the familial cylindromatosis tumour-suppressor gene. Nat Genet 2000, 25:160-165. 70. Waeber G, Delplanque J, Bonny C, Mooser V, Steinmann M, Widmann C, Maillard A, Miklossy J, Dina C, Hani EH et al.: The gene MAPK8IP1, encoding islet-brain-1, is a candidate for type 2 diabetes. Nat Genet 2000, 24:291-295. 71. Momeni P, Glockner G, Schmidt O, von Holtum D, Albrecht B, Gillessen-Kaesbach G, Hennekam R, Meinecke P, Zabel B, Rosenthal A, Horsthemke B, Ludecke HJ: Mutations in a new gene, encoding a zinc-finger protein, cause tricho- rhino-phalangeal syndrome type I. Nat Genet 2000, 24:71-74. 72. Delepine M, Nicolino M, Barrett T, Golamaully M, Lathrop GM, Julier C: EIF2AK3, encoding translation initiation factor 2-alpha kinase 3, is mutated in patients with Wolcott-Rallison syndrome. Nat Genet 2000, 25:406-409. 73. Arbour NC, Lorenz E, Schutte BC, Zabner J, Kline JN, Jones M, Frees K, Watt JL, Schwartz DA: TLR4 mutations are associated with endotoxin hyporesponsiveness in humans. Nat Genet 2000, 25:187-191. 74. Hastbacka J, de la Chapelle A, Mahtani MM, Clines G, Reeve-Daly MP, Daly M, Hamilton BA, Kusumi K, Trivedi B, Weaver A et al.: The diastrophic dysplasia gene encodes a novel sulfate transporter: positional cloning by fine-structure linkage disequilibrium mapping. Cell 1994, 78:1073-1087. 75. Superti-Furga A, Rossi A, Steinmann B, Gitzelmann R: A chondrodysplasia family produced by mutations in the diastrophic dysplasia sulfate transporter gene: genotype/phenotype correlations. Am J Med Genet 1996, 63:144-147. 76. Hoglund P, Haila S, Socha J, Tomaszewski L, Saarialho-Kere U, Karjalainen-Lindsberg ML, Airola K, Holmberg C, de la Chapelle A, Kere J: Mutations of the Down-regulated in adenoma (DRA) gene cause congenital chloride diarrhoea. Nat Genet 1996, 14:316-319. 77. Zemni R, Bienvenu T, Vinet MC, Sefiani A, Carrie A, Billuart P, McDonell N, Couvert P, Francis F, Chafey P et al.: A new gene involved in X-linked mental retardation identified by analysis of an X;2 balanced translocation. Nat Genet 2000, 24:167-170. 78. Sun H, Smallwood PM, Nathans J: Biochemical defects in ABCR protein variants associated with human retinopathies. Nat Genet 2000, 26:242-246. 257 79. Paloneva J, Kestila M, Wu J, Salminen A, Bohling T, Ruotsalainen V, Hakola P, Bakker AB, Phillips JH, Pekkarinen P et al.: Loss-offunction mutations in TYROBP (DAP12) result in a presenile dementia with bone cysts. Nat Genet 2000, 25:357-361. 80. Naureckiene S, Sleat DE, Lackland H, Fensom A, Vanier MT, Wattiaux R, Jadot M, Lobel P: Identification of HE1 as the Second Gene of Niemann-Pick C Disease. Science 2000, 290:2298-2301. 81. Davies JP, Levy B, Ioannou YA: Evidence for a Niemann-pick C (NPC) gene family: identification and characterization of NPC1L1. Genomics 2000, 65:137-145. 82. Bulman MP, Kusumi K, Frayling TM, McKeown C, Garrett C, Lander ES, Krumlauf R, Hattersley AT, Ellard S, Turnpenny PD: Mutations in the human delta homologue, DLL3, cause axial skeletal defects in spondylocostal dysostosis. Nat Genet 2000, 24:438-441. 83. Oldridge M, Fortuna AM, Maringa M, Propping P, Mansour S, Pollitt C, DeChiara TM, Kimble R B, Valenzuela DM, Yancopoulos GD, Wilkie AO: Dominant mutations in ROR2, encoding an orphan receptor tyrosine kinase, cause brachydactyly type B. Nat Genet 2000, 24:275-278. 84. Hu Z, Bonifas JM, Beech J, Bench G, Shigihara T, Ogawa H, Ikeda S, Mauro T, Epstein EH Jr: Mutations in ATP2C1, encoding a calcium pump, cause Hailey–Hailey disease. Nat Genet 2000, 24:61-65. 85. Irrthum A, Karkkainen MJ, Devriendt K, Alitalo K, Vikkula M: Congenital hereditary lymphedema caused by a mutation that inactivates VEGFR3 tyrosine kinase. Am J Hum Genet 2000, 67:295-301. 86. Wolfe MS, Xia W, Ostaszewski BL, Diehl TS, Kimberly WT, Selkoe DJ: Two transmembrane aspartates in presenilin-1 required for presenilin endoproteolysis and γ-secretase activity. Nature 1999, 398:513-517. 87. Nielsen H, Engelbrecht J, Brunak S, von Heijne G: A neural network method for identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Int J Neural Syst 1997, 8:581-599. 88. Claros MG, von Heijne G: TopPred II: an improved software for membrane protein structure predictions. Comput Appl Biosci 1994, 10:685-686. 89. Wheeler DL, Church DM, Lash AE, Leipe DD, Madden TL, Pontius JU, Schuler GD, Schriml LM, Tatusova TA, Wagner L, Rapp BA: Database resources of the national center for biotechnology information. Nucleic Acids Res 2001, 29:11-16. 90. Schultz J, Copley RR, Doerks T, Ponting CP, Bork P: SMART: a web-based tool for the study of genetically mobile domains. Nucleic Acids Res 2000, 28:231-234. 91. Henikoff JG, Greene EA, Pietrokovski S, Henikoff S: Increased coverage of protein families with the blocks database servers. Nucleic Acids Res 2000, 28:228-230. 92. Attwood TK, Croning MD, Flower DR, Lewis AP, Mabey JE, Scordis P, Selley JN, Wright W: PRINTS-S: the database formerly known as PRINTS. Nucleic Acids Res 2000, 28:225-227. 93. Eddy SR: Profile hidden Markov models. Bioinformatics 1998, 14:755-763. 94. Rost B, Sander C, Schneider R: PHD — an automatic mail server for protein secondary structure prediction. Comput Appl Biosci 1994, 10:53-60. 95. McGuffin LJ, Bryson K, Jones DT: The PSIPRED protein structure prediction server. Bioinformatics 2000, 16:404-405. 96. Fischer D: Hybrid fold recognition: combining sequence derived properties with evolutionary information. Pac Symp Biocomput 2000, 119-130. 97. Holm L, Sander C: Touring protein fold space with Dali/FSSP. Nucleic Acids Res 1998, 26:316-319. 98. Guex N, Peitsch MC: SWISS-MODEL and the Swiss-PdbViewer: an environment for comparative protein modeling. Electrophoresis 1997, 18:2714-2723. 99. Madej T, Gibrat JF, Bryant SH: Threading a database of protein cores. Proteins Struct Funct Genet 1995, 23:356-369. 100. Antonarakis SE, McKusick VA: OMIM passes the 1,000-diseasegene mark. Nat Genet 2000, 25:11.