Download Computational analysis of human disease

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Designer baby wikipedia , lookup

Genomics wikipedia , lookup

Public health genomics wikipedia , lookup

Microevolution wikipedia , lookup

Genome evolution wikipedia , lookup

Gene expression profiling wikipedia , lookup

Genome (book) wikipedia , lookup

Gene nomenclature wikipedia , lookup

Epistasis wikipedia , lookup

Mutation wikipedia , lookup

Therapeutic gene modulation wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Neuronal ceroid lipofuscinosis wikipedia , lookup

Protein moonlighting wikipedia , lookup

Frameshift mutation wikipedia , lookup

Epigenetics of neurodegenerative diseases wikipedia , lookup

NEDD9 wikipedia , lookup

Point mutation wikipedia , lookup

Transcript
247
Computational analysis of human disease-associated genes and
their protein products
Kodangattil R Sreekumar, L Aravind and Eugene V Koonin*
The complete genome sequences for human, Drosophila
melanogaster and Arabidopsis thaliana have been reported
recently. With the availability of complete sequences for many
bacteria and archaea, and five eukaryotes, comparative
genomics and sequence analysis are enabling us to identify
counterparts of many human disease genes in model
organisms, which in turn should accelerate the pace of research
and drug development to combat human diseases. Continuous
improvement of specialized protein databases, together with
sensitive computational tools, have enhanced the power and
reliability of computational prediction of protein function.
Addresses
National Center for Biotechnology Information, National Library of
Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
∗e-mail: [email protected]
Current Opinion in Genetics & Development 2001, 11:247–257
0959-437X/01/$ — see front matter
© 2001 Elsevier Science Ltd. All rights reserved.
Abbreviations
BIR
baculovirus inhibition of apoptosis protein repeats
CARD
caspase recruitment domain
CBD
carbohydrate-binding domain
FRDA
Friedreich’s ataxia
IAP
inhibitor of apoptosis
MALT
mucosa-associated lymphoid tissue
MKKS
McKusick–Kaufman syndrome
PTP
protein tyrosine phosphatase
TFPP
type-4 prepilin peptidase
VEGFR-3 vascular endothelial growth factor receptor-3
phylogenetically divergent sources. This branch of biology
has greatly benefited from the finished and continuing
genome-sequencing projects, as well as sequencing efforts
undertaken on a smaller scale. Genome sequencing of
three eukaryotic organisms — the fruitfly Drosophila
melanogaster, the thale cress Arabidopsis thaliana and (in a
draft form) Homo sapiens — has been completed in the past
two years [1••–4••]. Table 1 lists some of the genes implicated in human disorders during the past year, and some of
the previously identified human disease genes for which
computational analysis in the same time frame has produced
important new findings.
Procedures for undertaking rigorous sequence analysis,
certain pitfalls in such analysis, methods to circumvent
these problems, and the predictive power of computational
analysis have been described previously [5–7]. Selected
tools and databases that are currently available and are
widely used in sequence and structure analysis are listed in
Table 2. Here we focus on recent insights obtained by
computational analysis of disease genes, and in particular
how defects in one or more protein domains affect the cellular function of the encoded protein. The work that we
discuss serves only to illustrate how computational analysis
of disease genes can provide important clues about molecular basis of pathogenesis (we apologize to those
researchers whose important contributions to this burgeoning
field are not be cited owing to space limitations).
Conserved globular domains and their defects
Introduction
Inherited diseases are caused by mutation(s) in one or
more genes. Diseases that are due to defect(s) in a single
gene are called monogenic diseases; polygenic diseases are
caused by defect(s) in more than one gene, with all those
genes individually contributing to the development of the
disease. In some diseases, such as Lafora’s (see below), the
defect in proteins implicated in the disease can be directly
related to the specific pathological state, whereas in others
the link between the disease manifestation and the apparent defect observed at the level of the nucleotide or
deduced amino-acid sequence may not be obvious.
Irrespective of whether there is an obvious relationship
between a defective gene and the pathological state, once
a disease-associated gene (or a ‘disease gene’, in a common, if not precise, parlance) is identified, computational
analysis of the gene is a critical step that can provide clues
to the molecular basis of pathogenesis and invaluable
insights for further experimental analysis.
Most proteins consist of one or more evolutionarily conserved domains, each with a distinct structure and
function. An alteration of the amino-acid sequence of a
domain is likely to hamper its proper functioning, especially if the alteration is in a highly conserved region and/or
involves a non-conservative replacement of an amino-acid
residue. Several tools are available for the automatic detection of protein domains to aid in function prediction
(Table 2); however, the existing domain databases are still
far from comprehensive, and the detection methods have
limited sensitivity. It is therefore critical to analyze protein
sequence in detail, on a case-by-case basis, by using several
tools and methods and by assessing the relevance of the
results obtained by computational approaches in the context of the available experimental data. Below, some recent
discoveries of defects in proteins that result in human disorders are discussed with a view to relate a defective
domain/motif to specific biological consequences.
Bcl10 and mucosa-associated lymphoid tissue lymphoma
Computational sequence analysis draws its strength from
the conservation of critical features of a gene/protein in
A frameshift mutation in Bcl10, resulting in truncation distal to the caspase recruitment domain (CARD), is
248
Genetics of disease
Table 1
Protein products of some genes mutated in human diseases.
Disease
Protein
Domain(s)*
Soluble, globular proteins
MALT
Paracaspase DEATH+ 2(IG) +
lymphoma
paracaspase
Other sequence
features/motifs
–
Demonstrated or predicted
protein function
Effect of mutation(s)
Reference
Predicted protein oligomerization, Translocation produces a
interaction with other proteins
chimericprotein that might be
and protease activity potentially
an inhibitor of apoptosis.
involved in apoptosis
[13]
[8,9]
MALT
lymphoma
Bcl10
CARD
Serine/
Threoninerich region
The Ser/Thr-rich domain may
be involved in regulatory
phosphorylation. The CARD
domain mediates specific proteinprotein interactions in the
apoptotic system
Frameshift and truncation
beyond the CARD domain
might renderthe protein
unstable or disruptSer/Thr
phosphorylation
Fukuyama
type CMD
Fukutin
Predicted
phosphorylsugar/choline
transferase
Signal peptide
Similarity to bacterial proteins
suggests a role in modifying
cell-surface glycoproteins or
glycolipids
Transposon insertion disrupts
[27,29]
the predicted enzymatic domain
Friedreich’s
ataxia
Frataxin
Frataxin
Mitochondrial
import peptide
Nuclear-encoded mitochondrial
protein with a key role in the
regulation of energy conversion
Several point mutations affect [30,33]
conserved amino-acid residues
Parkinson’s
disease
Parkin
Ubiquitin+
PARKIN_finger +
RING
–
E3 ubiquitin-protein ligase
Mutations in ubiquitin domain
[55,56•]
affect binding to target proteins;
mutations in RING prevent
recruitment of E2
Giant axonal Gigaxonin
neuropathy
POZ+IVR+ kelch
repeats
–
POZ-kelch domain
combination suggests a role in
cytoskeleton structure–assembly
Mutations that disrupt POZ or [57]
Kelch domains might disrupt
protein–protein interactions
causing in cytoskeleton defects
Retinitis
pigmentosa
MERTK
Ig-like+ Fn3-like+
Tyr-kinase
–
Fn3 and Ig-like modules indicate
ligand-binding ability; Tyr-Kinase
domain implies protein
phosphorylation; MERTK may be
a receptor for a specific extracellular ligand
Mutations affecting Tyr-kinase [58]
domaindisrupt the phosphorylation
cascade activated by it.
Laterality
defects
CFC1
EGF+ CFC
Macular
corneal
dystrophy
CHST6
Sulfo-transferase Signal peptide
Transfers sulfate groups from
PAPS to various substrates
R50C mutation in the conserved [60]
region encompassing the sulfate
donor binding site may prevent
PAPSbinding
Dominant
OPA1
optic atrophy
Dynamin-like
GTPase
Mitochondrial
import signal
Possible role in maintenance
and inheritance of mitochondria
R290Q, G300E, deletion of
[45,46]
invariant I432 and a nonsense
mutation in the dynamin GTPase
domain
X-linked
congenital
SNB
nyctalopin
LRR-NT+
15(LRR)+
LRR-CT
Signal peptide,
GP anchor
Predicted LRR–containing
glyco-protein of extracelluar
matrix mediating protein–protein
interactions
Missense mutations cause
truncation, insertion and
deletions affecting LRRs
and/or loss of GPI–anchor
[61,62]
X-linked
mental
retardation
ARHGEF6
CH+SH3+
RhoGEF+ PH
–
Signal transduction via the
RhoGTPase cycle (RhoGEF
domain); predicted to interact
with actin filaments proline-rich
peptides in proteins and inositol
phosphate.
Intronic mutation resulting in
exon 2 skipping, affects CH
domain
[63]
Lafora’s
disease
EPM2A
CBD-4 +tyrosine Signal peptide
phosphatase
(PTP)
Signal peptide
Extracellular signal molecule
and membraneassociating region
Mutations in the conserved
residuesin EGF domain could
disrupt inter-actions with its
receptor
Binds polysaccharides via CBD-4; W32G mutation in CBD,
phosphatase domain may signal T194I in the PTP domain
the catabolism of Lafora bodies.
[59]
[23]
Computational analysis of disease-associated genes and their products Sreekumar et al.
249
Table 1 (continued)
Protein products of some genes mutated in human diseases.
Disease
Protein
Domain(s)*
Other sequence
features/motifs
Demonstrated or predicted
protein function
Usher
syndrome
type 1c
Harmonin
2(PDZ)+bZIP+
PDZ
Proline/Serine/
Threonine-rich
region; leucine
zipper
Predicted to dimerize via leucine Mutations resulting in loss of
zipper and mediate proteinmost orall of the globular
protein interactions via PDZ
domains.
domain. The P/S/T-rich region
might serve as binding site for
SH3 and WW domain-containing
proteins
[64]
Familial
segmental
glomerulosclerosis
ACTN4
2(CH)+ 4(SPEC)
+ 2(EF)
–
Predicted actin-binding and
cross-linking protein (CH
domains) and Ca-binding
(EF hands) protein. Probable
component of the cyto-skeletal
network (SPEC domain)
K228E, T232I, and S235P
mutations in the 2nd
CH domain cause increased
affinity for actin
[65]
May-Hegglin MYH9
anomaly and
Fechtner and
Sebastian
syndromes
Myosin + IQ
–
Nonmuscle myosin heavy chain
9; predicted to bind calmodulin
(IQ motif)
N93K predicted to destabilize [66]
the 2nd helix of myosin catalytic
domain, R702C predicted to
alter helical stability and ATPase
activity.
Mulibrey
nanism
MUL
RING+ B-box +
BBC + MATH
–
Predicted nuclear protein,
may participate in ubiquitinmediated protein degradataion
as a E3 ubiquitin ligase
(RING) and possibly in
apoptosis (MATH)
Deletions (splice site and
lcoding) ead to truncation,
with the loss of MATH
domain in two cases
McKusickKaufman
syndrome
MKKS
Group II
chaperonin
–
Facilitates correct protein
folding in conjunction with ATP
hydrolysis
One mutation predicted to
[34••]
affect intermolecular interaction
based on structural modeling
combined
pitutary
hormone
deficiency
LHX3
2(LIM) + HOX
–
Predicted transcriptional
regulator via DNA-binding
homeodomain protein–protein
interaction via LIM domain
A conserved Y replaced by C [67]
in LIM domain; frameshift
leading to loss of homeodomain
Netherton
syndrome
SPINK5
15 (KAZAL)
Signal peptide
Potential extracellular adhesion
molecule
Insertions and deletions leading [68]
to frameshift and protein
truncation
familial
cylindromatosis
CYLD
3(CAP-GLY)+
Proline-rich
B-BOX+ UCH-2 region
Nonsense and splice site
mutations leading to truncation [69]
of the C-terminal 2/3 of the
protein
Type 2
diabetes
MAPK8IP1
SH3 + PTB
Predicted to coordinate attachment of cellular organelles to
microtubules via CAP-GLY
domain and down-regulate
protein degradation via the
ubiquitin hydrolase via UCH-2
domain
Regulation of JNK signaling
pathway by mediating
protein–protein interactions via
the SH3 and PTB domains
DNA-binding protein,
transcription factor
P148H enhances DNA-binding [15,16]
affinity of homeobox; R172H
and deletion of RK159-160
cause low DNA-binding affinity
of homeobox
Multiple DNA-binding
domains predict a
transcription factor
3 nonsense and 3 frameshift
mutations cause loss of 2–4
Zn-fingers
[71]
ER membrane-associated
translation initiation
factor eIF-2α kinase
Mutations in kinase domain or
truncations N-terminal to the
kinase domain. β-propeller
domainin the N-terminus was
previouslyundetected
[72]
CranioMSX2
synostosis
and enlarged
parietal
foramina
Homeobox
Tricho-rhino- TRPS1
phalangeal
syndrome
type 1
8 (C2H2) + GATA
+ 2(C2H2)
Membrane proteins
Wolcott–
EIF2AK3
Rallison
syndrome
β-propeller +
TMS TM + S/T
protein kinase
N-terminal lowcomplexity
region, probably
non-globular
–
Signal peptide
Effect of mutation(s)
Reference
[41•]
S59N mutation outside the two [70]
globular domains, not fully
penetrant
250
Genetics of disease
Table 1 (continued)
Protein products of some genes mutated in human diseases.
Disease
Protein
Domain(s)*
Other sequence
features/motifs
Demonstrated or predicted
protein function
Effect of mutation(s)
Reference
Endotoxin
TLR4
hypo-responsiveness
11(LRR) +
LRRCT + TMS
+ TIR
Signal peptide
Intracellular signaling (TIR
domain), cell adhesion and
LRRs interaction with protein
ligands (extracellular LRRs)
Mutations A290G and T399I
within the LRRs
[73]
Diastrophic
dysplasia
DTD
8(TMS) + STAS
–
Sulfate transporter
Impaired sulfate transport across [39,74,
cell membrane cause low sulfate 75]
content of cartilage
Pendred
syndrome
Pendrin
10(TMS) + STAS
–
Potential iodide-chloride transporter
Mutation in predicted
phosphate-binding loop.
Also deletions, missense
and splice site mutations
[38,39•]
congenital
chloride
diarrhoea
DRA
10(TMS) + STAS
–
Chloride–NaHCO3 exchanger
Missense and deletion
mutations
[39•,76]
X-linked
mental
retardation
TM4SF2
4(TMS)
Member of the tetraspanin family
that potentially interact with
β-1 integrins.
Translocation removing 4th
TMS
[77]
Membrane-associated, ATPdependent transporter
Mutations in the NTP-binding
motifs (nearly) abolish
ATP binding
[78]
Signal peptide
Stargardt
ABCA4
disease, agerelated
macular
degeneration
5(TMS) + ABC+
7(TMS) + ABC
Presenile
dementia
TYROBP
TMS
Signal peptide
and ITAM motif
Predicted transmembrane protein Large deletion, and single base [79]
involved in signal transduction via pair deletion cause truncation
the ITAM motif
in TM segment and loss of
ITAM motif
Niemann–
Pick C1
disease
NPC1
13(TMS) +
sterol-sensing
domain
Signal peptide
Lysosomal protein involved in
Glu20stop: truncation, perhaps [80,81]
sequestering sterols in lysosomes null; another truncation results
from frameshift in exon2
Spondylocostal
dysostosis
DLL3
DSL + 11(EGF)
+ TMS
Signal peptide
Predicted Ca2+-binding
membrane protein involved in
signal transduction
Insertion leading to truncation
in DSL domain; deletion in
4th EGF repeat and G385D
mutation in 5thEGF repeat
Serine/
Threonine-rich
region
Predicted protein tyrosine kinase
involved in signal transduction
and ligand-binding
Y755stop, W749stop, and a
deletion leading to frameshift
[83]
beyond G750, perhaps affect
interaction with an SH3 domain
ATP-powered Ca2+ pump
that transports Ca2+ into Golgi
bodies
Large deletion, missense and
splice site mutations cause
truncation
Class III receptor Tyr kinase;
binds ligands and is involved
in signal transduction
G857R mutation in kinase motif; [36••,85]
R1041P, R forms H-bonds with
substrate OH- and aspartic acid
that is involved in phosphotransfer
Apparently a membrane aspartyl
protease, with two aspartate
required for activity
G384A mutation, next to the
[14••,86]
critical D385, causes higher
levels of residues amyloidogenic
Aβ42 production
BrachyROR2
IGc2 + Frizzled
dactyly type B+ Kringle + TMS
+ TyrKinase
Hailey–
Hailey
disease
ATP2C1
8(TMS) + P-type
ATPase
Primary
lymphoedema
VEGFR-3
2(IG) + IGc2 +
IG + IGc2 +
3(IG) + IGc2 +
IG + IGc2 +
TMS+TyrKinase
Alzheimer’s
disease
Presenilin 1
9(TMS)
–
Signal peptide
–
[82]
[84]
*Domain name abbreviations are as in the SMART database. The
number of domains of a particular type is indicated (for example, 2IG
denotes two consecutive immunoglobulin domains). Consecutive
domains are connected with ‘+’. Abbreviations: CMD, congenital
muscular dystrophy; PAPS, phosphoadenosine phosphosulfate; TM,
transmembrane; SNB, stationary night blindness.
associated with mucosa-associated lymphoid tissue
(MALT) lymphomas that have the translocation
t(1;14)(p22;q32) [8,9]. The CARD domain was first identified in a comparison of the sequences of several apoptotic
proteins [10]; this CARD domain comprises six antiparallel
α helices and mediates homotypic interactions between
proteins involved in apoptosis [11]. Besides caspase
recruitment, which is a critical step in the chain of events
leading to apoptosis, protein–protein interactions mediated
by CARD contribute to activation of the transcription
Computational analysis of disease-associated genes and their products Sreekumar et al.
factor NF-κB. The mutated Bcl10 protein seen in MALT
lymphomas has lost its pro-apoptotic functions but retains
the ability to activate NF-κB [12].
In a different MALT lymphoma translocation,
t(11;18(q21;q21), a fusion of the inhibitor of apoptosis
(IAP)-2 gene to the MLT1/MALT1 locus generates a
chimeric protein comprising BIR repeats of IAP-2 and the
caspase-like predicted protease domain of the protein designated paracaspase [13••]. The paracaspase family of
caspase homologs has been identified recently, along with
the metacaspase family, in a detailed computational analysis
of the caspase-like protease superfamily [13••].
Paracaspases contain a predicted caspase-like proteolytic
domain, a Death domain capable of mediating specific protein–protein interactions and, in certain cases, including
the human representative immunoglobulin domains. In
the MALT lymphoma translocation t(11;18)(q21;q21), the
prodomain of human paracaspase is replaced with BIR
repeats that might function as inhibitors of apoptosis. The
BIR–paracaspase fusion has been found to activate NF-κB
in a manner dependent on the predicted catalytic cysteine
of the paracaspase, thus supporting the computational prediction [13••]. Notably, the prodomain of human
paracaspase interacts with Bcl10, which suggests that the
two translocations associated with MALT lymphomas
affect the same apoptotic pathway.
Alzheimer disease — presenilins
Familial Alzheimer disease is associated with mutations in
genes that encode the paralogous integral membrane proteins presenilin 1 and presenilin 2. Presenilins are required
to cleave several other integral membrane proteins, including the β-amyloid precursor proteins Notch and Ire1 that
are central to the pathogenesis of Alzheimer disease. It has
been shown that two aspartate residues of presenilin 1 —
one located in the intracellular loop between helices 6
and 7 and the other one located in helix 7 — are required
for these proteolytic events, leading to the hypothesis that
presenilins themselves belong to a distinct class of membrane
proteases called γ-secretases [14••].
Sequence database searches using all available methods,
including different types of profile analysis, have failed to
detect similarity between presenilins and any known proteases. However, a pattern search carried out by Steiner
et al. [14••] revealed that the signature [G/A]xGDh (where
x is any residue, h is a bulky hydrophobic residue, and
alternative residues are shown in brackets) is shared by presenilins and bacterial type-4 prepilin peptidases (TFPP),
and includes one of the functionally critical aspartates of
both protein families.
Presenilins and TFPPs have similar membrane topology,
with eight transmembrane helices present in each family,
but show no appreciable sequence similarity beyond the
above signature. Therefore, although Steiner et al.’s [14••]
251
observation reinforces the hypothesis that presenilins are
the catalytically active γ-secretases, rather than cofactors
of a still unidentified protease, it remains unclear whether
they are homologs of TFPPs or whether the similarity in
the (predicted) active sites is due to convergence. Given
the relatively relaxed functional constraints that are typical of membrane proteins, with selection preserving
mainly the membrane topology rather then sequence,
common origin cannot be ruled out despite the absence of
sequence conservation.
MSX1 and MSX2
Craniosynostosis and enlarged parietal foramina are caused
by mutations in the homeodomain — a highly conserved
DNA-binding domain containing three helical regions —
of the MSX2 protein [15,16]. A mutation (P148H) that
leads to craniosynostosis enhances the DNA-binding affinity of this protein, whereas another mutation (R172H) that
results in parietal foramina has the opposite effect of lowering the protein’s affinity for DNA. Similarly, a mutation
in the MSX1 gene (R31P) that affects the second helix of
the homeodomain [17], and a nonsense mutation that leads
to a truncation at position 104 and a peptide lacking the
entire homeodomain [18] are implicated in tooth agenesis.
Other examples of diseases caused by mutations in homeodomains include microphthalmia [19], amegakaryocytic
thrombocytopenia and radio-ulnar synostosis [20]. The
paired (PAX) domain, another DNA-binding module
involved in transcription regulation, is defective in certain
individuals diagnosed with oligodontia [21]. An enhanced or
diminished DNA-binding capacity of a transcription factor is
likely to result in altered level of protein(s) encoded by genes
under its control, which in turn might lead to pathogenesis.
Laforin
The presence of polyglucosan inclusion bodies in the brain
is characteristic of progressive myoclonus epilepsy or
Lafora’s disease. The EPM2A product, laforin, which is
defective in patients suffering from this disease, was initially designated as a protein tyrosine phosphatase (PTP)
because of the presence of a consensus sequence characteristic of the catalytic site of PTPs [22]. Minassian et al.
[23••] have carried out a detailed computational analysis
resulting in significant insights about this protein and its
probable link to the accumulation of polyglucosan inclusions. Analysis of protein domains in laforin using the Pfam
database [24] revealed the presence of the carbohydratebinding domain (CBD)-4 at the amino (N) terminus of the
protein, in addition to the dual-specificity phosphatase
domain, which belongs to a distinct subfamily of PTPs, at
the carboxyl (C) terminus. Furthermore, two sequence motifs
characteristic of the glucohydrolase family as documented
in PROSITE [25] were detected.
On the basis of these findings and the mutations detected
in the EPM2A gene, Minassian et al. [23••] proposed that
normal laforin prevents the accumulation of polyglucosan
252
Genetics of disease
Table 2
Selected tools and databases useful in sequence analysis, with an emphasis on those dedicated to disease genes*.
Tools/database
Comments
URL
Reference
Tools and databases for functional annotation
SignalP
Predicts location of signal peptide and cleavage site
in proteins.
http://www.cbs.dtu.dk/services/SignalP/
[87]
TopPred
Predicts location and orientation of TM segments.
http://www.sbc.su.se/~erikw/toppred2/
[88]
CDD
Collection of protein domains with links to structures if
available. CDD uses a library of PSSMs that is searched by
Reverse Position-Specific BLAST to match a protein query.
http://www.ncbi.nlm.nih.gov./Structure/cdd/cdd.
shtml
[89]
COG
Phylogenetic classification of proteins from complete
genomes that groups orthologous proteins; useful for
functional annotation of newly sequenced genomes.
COGNITOR can be used to place a protein
into one of the COGs.
http://www.ncbi.nlm.nih.gov./COG/
[52•]
Pfam
Collection of protein domains and families consisting of
http://www.sanger.ac.uk/Pfam/
multiple alignments and hidden Markov models generated in a
semi-automatic manner.
[24]
SMART
Searchable collection of protein domains for detecting and
analysing domain architecture.
http://smart.embl-heidelberg.de/
[90]
BLOCKS
Database of highly conserved regions of proteins derived
automatically and represented in the form of ungapped
multiple alignments. Also tools to detect and verify protein
sequence conservation.
http://www.blocks.fhcrc.org/
[91]
PRINTS
Database of protein fingerprints (group of motifs distributed
in the sequence characteristic of a family).
http://bmbsgi11.leeds.ac.uk/bmb5dp/prints.html
[92]
PROSITE
Database of biologically significant sites, patterns and
profiles in proteins that helps in function prediction.
http://www.expasy.ch/prosite/
[25]
http://www.ncbi.nlm.nih.gov/BLAST/
[28]
Tools for sequence similarity search
BLAST
Programs for rapid sequence similarity searches for
protein and DNA sequences. Also provides searches for
nucleotide sequences translated in all six frames.
HMMER
Programs for constructing multiple protein
sequence alignments and performing database searches
using the HMM approach.
http://hmmer.wustl.edu
[93]
PSI-BLAST
An implementation of BLAST capable of detecting distantly
related proteins by creating a PSSM from the significant
matches detected in one round and using that PSSM
as the query in the next iteration.
http://www.ncbi.nlm.nih.gov./blast/psiblast.cgi
[28]
http://dodo.cpmc.columbia.edu/pp/
predictprotein.html
[94]
Tools for protein structure prediction and comparison
PHD
Neural network dependent secondary structure prediction.
PSI-PRED
Prediction of secondary structure using PSI-BLAST
derived scoring matrices.
http://insulin.brunel.ac.uk/psipred
[95]
Hybrid fold
recognition
Combined algorithm for fold recognition using both
sequence-based evolutionary information and secondary
structure predictions.
http://www.cs.bgu.il/~bioinbgu
[96]
FSSP/DALI
Automatic structural classification of protein domains with the http://www2.ebi.ac.uk/dali/fssp/
DALI search facility for structure comparisons.
[97]
SWISS-PDB
Viewer and Swiss
Model
Powerful structure visualization and homology modeling
system for personal computers.
http://www.expasy.ch/spdbv/
[98]
VAST
Structure similarity search tool based on the
definition of the threshold of statistically significant
structural similarity.
http://www.ncbi.nlm.nih.gov/Structure/
[99]
Databases for disease genes
Genes and
Information on selected human diseases with links to genes,
Disease
chromosomes, scientific literature and OMIM.
OMIM
Catalog of human genetic disorders and associated
phenotypic and genetic information.
http://www.ncbi.nlm.nih.gov./disease/
http://www.ncbi.nlm.nih.gov./entrez/
query.fcgi?db=OMIM
[100]
Computational analysis of disease-associated genes and their products Sreekumar et al.
253
Table 2 legend
*The number of tools and databases for computational analysis of genes and proteins that are available online has been rapidly growing during the
past few years. Inevitably, this is an arbitrary selection of those we have used extensively and found to be useful. HMM, hidden marker model;
PSI-BLAST, position-specific iterating BLAST; PSSM, position specific scoring matrix.
inclusion bodies in neurons by binding polyglucosan
through the CBD and cleaving the β-linkages through the
putative glucohydrolase domain. A more detailed examination of the sequence and domain architecture of laforin
shows, however, that the putative glucohydrolase domain
largely overlaps with the confidently predicted dualspecificity phosphatase domain, which suggests that
the presence of the glucohydrolase motifs in this protein
is spurious (KR Sreekumar, L Aravind, EV Koonin,
unpublished data).
The dual-specificity phosphatase activity of laforin has
been confirmed experimentally [26]. The CBD domain of
laforin probably targets the protein to polyglucosan inclusion bodies, whereas the phosphatase domain might
activate a protein dephosphorylation pathway that triggers
its destruction through a still unknown mechanism. This
example illustrates both the utility of computational analysis of protein domains for function prediction and the need
for caution in interpreting computational results.
Fukutin
Fukuyama type congenital dystrophy is caused by retroposon
insertions in the gene encoding a protein called fukutin [27].
Extensive database searches using the PSI-BLAST program
[28] detected moderate, but statistically significant similarity
between fukutin and a family of bacterial and eukaryotic
enzymes that catalyze phosphoryl-ligand transfer. Additional
multiple alignment analysis showed that fukutin shares with
these proteins the predicted catalytic residues [29].
The combination of these observations with the prediction
of a signal peptide in fukutin, and the experimental data
on its localization in the endoplasmic reticulum and secretory granules [27] has led to the prediction that fukutin
modifies cell-surface molecules, most probably through
the attachment of phosphoryl-sugar moieties [29].
Frataxin
Friedreich’s ataxia (FRDA) is an autosomal recessive disease
characterized by progressive ataxia, hypertrophic cardiomyopathy and diabetes mellitus. Gibson et al. [30] have carried
out computational analysis of the FRDA gene product,
frataxin, and shown that it has homologues in bacteria of the
γ subdivision of the Proteobacteria, but not in any other
prokaryotes, and also that it possesses a non-globular N-terminal domain predicted to function as a mitochondrial
targeting peptide [30]. They predicted a distinct fold, consisting of a β sheet flanked by two long α helices, on the basis
of a multiple alignment of the frataxin homologs, and further
proposed that frataxin is anuclear encoded, and, accordingly,
that FRDA is a ‘mitochondrial disease’.
Both the mitochondrial localization and the αβ sandwich
structure predictions, which were based on computational
analysis of frataxin, have been confirmed experimentally
[31,32], and more recent studies have shown that frataxin
is a key regulator of mitochondrial energy conversion [33•].
McKusick–Kaufman syndrome
Mutations in a gene on chromosome 20 have been shown
to cause McKusick–Kaufman syndrome (MKKS) and
Bardet–Biedl syndrome (see also review by Sheffield et al.,
this issue, pp 317–321) — developmental anomalies that
involve hydrometrocolpos, postaxial polydactyly and congenital heart disease [34••,35]. Sequence database searches
show that the predicted protein encoded by the MKKS
gene belongs to a family of group II chaperonins that
promote protein folding in an ATP-dependent manner.
The three-dimensional structural model of the MKKS protein predicts that the Y37C mutation lies in the highly
conserved loop region between helix 1 and strand 2 which is
involved in interaction among subunits [34••]. Another mutation, H84Y, is predicted to lie in a region responsible for ATP
hydrolysis. This appears to be the first description of a disease
caused by a defect in a molecular chaperone — a finding that
complements the data on the role of defects in transcription
regulators in several diseases, as discussed above.
Vascular endothelial growth factor receptor-3
Another example of a human disease for which sequence
analysis and three-dimensional structure modeling has provided insights into pathogenesis is primary lymphoedema
caused by mutations in vascular endothelial growth factor
receptor-3 (VEGFR-3) [36••]. The VEGFR-3 protein consists of six classical immunoglobulin domains, four
immunoglobulin C2 domains, a signal peptide, a single
transmembrane segment and an intracellular tyrosine
kinase catalytic domain.
Four mutations that co-segregate with lymphoedema phenotype map within the tyrosine kinase domain. Finegold
and co-workers [36••] have built a three-dimensional
model of VEGFR-3, based on the crystal structure of
VEGFR-2, from which functional implications of mutations observed in VEGFR-3 can be inferred. For example,
the model shows that the G857R mutation (the last glycine
of the conserved GXGXXG kinase motif) lies in a critical
turn between β strands 1 and 2 and is expected to affect
ATP-binding and catalysis severely.
DTD, Pendrin and congenital chloride diarrhoea
Diastrophic dysplasia/achondrogenesis type IB (DTD),
Pendred’s syndrome and congenital chloride diarrhoea are
254
Genetics of disease
caused by malfunctions of anion transporters of the sulfate
transporter family [37,38]. Unexpectedly, it has been
shown that the C-terminal cytoplasmic portion of these
proteins shares a conserved domain named STAS (after
sulfate transporter anti-sigma) with bacterial anti-sigmafactor antagonists, and this connection provides clues for
the regulation of anion transporters [39•]. (Anti-sigma factors prevent formation of active RNA polymerase
holoenzyme by trapping the sigma factor. Anti-sigma
antagonists bind the anti-sigma factors and prevent the formation of sigma-antisigma complex.) The STAS domain of
the antisigma-factor antagonist binds GTP and has a weak
GTPase activity [40], leading to the prediction that the
cytoplasmic domain of the anion transporters might regulate
transport through NTP binding [39•].
Mulibrey nanism
Muscle–liver–brain–eye (Mulibrey) nanism is an autosomal
recessive disorder that affects several tissues of mesodermal
origin, which suggests that a highly pleiotropic gene is
involved in this disease. The complex multidomain architecture of the recently cloned MUL gene product is
compatible with this notion [41•]. The MUL protein comprises a RING-finger domain, a B-box domain, a coiled-coil
domain and a MATH domain. Avela et al. [41•] suggest that
MUL is likely to be involved in various protein–protein
interactions that might be important in development regulation; however, the combination of domains present in this
protein calls for more specific functional predictions.
There is growing evidence that the primary function of the
RING domain is as the E3 component of the ubiquitin ligase cascade that facilitates the transfer of ubiquitin from
the E2 protein to target proteins [42]. MATH is a versatile
adaptor domain that mediates specific protein–protein
interactions in the programmed cell death system, chromatin remodeling and other functional contexts [43].
Thus, MUL can be predicted to function as an E3 ubiquitin ligase, with additional regulatory interactions, possibly
related to apoptosis, mediated by the MATH domain.
Mutations outside functional protein domains
Mutations that lie outside globular domains and transmembrane segments of proteins, or outside the protein-coding
part of a gene altogether often affect gene expression. Such
mutations may be located in untranslated regions, promoters,
introns and sequences encoding signal peptides. The effect
of these mutations may manifest itself as a defective globular or transmembrane domain, however, owing to exon
skipping, aberrant splicing and abnormal cellular localization.
Untranslated and non-coding regions
Sequence analysis can provide hints as to the molecular basis
of defects caused by mutations in untranslated regions. For
example, in individuals with severe protein C deficiency,
who suffer from massive disseminated intravascular coagulation or neonatal purpura fulminans, promoter mutations in
the protein C (PROC) gene have been identified. These
mutations alter the consensus sequence for the transcription
factor HNF-3-binding, resulting in reduced PROC promoter
activity, and accordingly a reduced level of protein C [44].
A splice-site mutation in intron 9 of the OPA1 gene, which
has been detected in individuals with autosomal dominant
optic atrophy, leads to either exon-10 skipping or a
frameshift with a premature stop codon [45]. The OPA1
product is a mitochondrial protein comprising an N-terminal mitochondrial localization signal and a dynamin
GTPase domain. The defective protein produced by the
mutant gene lacks a part of the GTPase domain [45,46].
Signal peptides
Duplications in exon 1 of the TNFRSF11A (RANK) gene
that segregates with familial expansile osteolysis have
been identified recently [47]. These mutations are short
in-frame insertions in the hydrophobic core region of the
RANK signal peptide, and affect proper cleavage of the
signal peptide.
The RANK protein consists of four tumor-necrosis factor
repeats and a transmembrane segment, besides the signal
peptide that directs the protein to the secretory pathway.
Failure to cleave the signal peptide might result in higher
intracellular accumulation of defective RANK in compartments of the secretion pathway that could lead to a higher
incidence of receptor self-association and increased RANK
signal transduction [47]. An NF-κB responsive reporter
assay of transfected cells has indeed suggested that mutations occurring in familial expansile osteolysis result in
increased constitutive RANK signaling [47].
Conclusions and future directions
The examples of disease genes discussed here and in previous publications [6,48–50] illustrate the importance of
sequence and structure analysis in predicting the molecular basis of pathogenesis, in guiding further work and in
understanding the biology of the respective functional systems. Proteins for which predictions based on sequence
analysis have been experimentally verified include frataxin
and the MALT-associated paracaspase. Using the complete
genome sequences, efforts are being made to assemble
databases of orthologous genes from diverse organisms,
and these databases are already proving to be valuable in
functional annotation of genes [51,52•].
The presence of orthologs for many of the human disease
genes in model organisms such as mouse, fruitfly, nematode and yeast enable experimental verification of the
predictions. In particular, apparent counterparts for 178 of
287 analyzed human disease genes have been identified in
fruitfly by large-scale comparative genomics [53••,54••]. At
present, some of the human disease genes do not seem to
have an ortholog in other organisms, but the anticipated
availability in the next several years of the complete
genome sequences of primates and other mammals should
drastically reduce the number of ‘orphan’ disease genes.
Computational analysis of disease-associated genes and their products Sreekumar et al.
The predictive power of sequence and structure analysis,
combined with the wealth of genome sequence data,
should not only enhance our understanding of the biology
that underlies the effect of mutations in disease genes, but
also enable the identification of many new drug targets and
accelerate the process of drug design and the development
of therapeutic strategies.
References and recommended reading
Papers of particular interest, published within the annual period of review,
have been highlighted as:
• of special interest
•• of outstanding interest
1.
••
Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD,
Amanatides PG, Scherer SE, Li PW, Hoskins RA, Galle RF et al.:
The genome sequence of Drosophila melanogaster. Science
2000, 287:2185-2195.
The second (nearly) complete genome of a multicellular eukaryote to be
sequenced, after the nematode Caenorhabditis elegans. The completion of
the fly genome is particularly important: first, because the enormous wealth of
genetic data available for this organism can be systematically correlated with
gene and protein sequences; and second, because an in-depth comparison
of two animal genomes has become possible for the first time.
2.
••
The Arabidopsis Genome Initiative: Analysis of the genome
sequence of the flowering plant Arabidopsis thaliana. Nature
2000, 408:796-815.
Sequencing of the first complete plant genome is fundamentally important as
it sets the stage for a comparative study of genomes representing the three
major lineages of the eukaryotic crown group — animals, fungi and plants.
3.
••
Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J,
Devon K, Dewar K, Doyle M, FitzHugh W et al.: Initial sequencing
and analysis of the human genome. Nature 2001, 409:860-921.
This paper reports the results of the public effort on human genome sequencing. A preliminary analysis of the predicted human proteome performed with
a variety of computational techniques, with particular emphasis on the
‘Interpro’ collection of protein domains, is presented. The major conclusions
are that human proteins consistently have more complex domain architectures
compared to their homologs from other eukaryotes and domain rearrangements have a prominent role in eukaryotic evolution. Over 1300 apparent
orthologs common to human, fly, worm and yeast have been identified.
4.
••
Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG,
Smith HO, Yandell M, Evans CA, Holt RA et al.: The sequence of the
human genome. Science 2001, 291:1304-1351.
The report of the human genome sequence from the private company Celera
Genomics. In the preliminary analysis of the predicted human proteome, molecular functions for ~60% of the proteins have been assigned by automatic
means utilizing the protein family databases such as Pfam and SMART. The
authors have also identified a partial set of human–fly and human–worm
orthologs. A preliminary survey of the differences between the human
genome and other sequenced eukaryotic genomes is also presented.
5.
Altschul SF, Boguski MS, Gish W, Wootton JC: Issues in searching
molecular sequence databases. Nat Genet 1994, 6:119-129.
6.
Bork P, Koonin EV: Predicting functions from protein sequences —
where are the bottlenecks? Nat Genet 1998, 18:313-318.
7.
Baxevanis AD, Francis Ouellette BF: Bioinformatics: a practical
guide to the analysis of genes and proteins. In Methods of
Biochemical Analysis, vol 39. New York: John Wiley; 1998:1-370.
8.
9.
Willis TG, Jadayel DM, Du MQ, Peng H, Perry AR, Abdul-Rauf M,
Price H, Karran L, Majekodunmi O, Wlodarska I et al.: Bcl10 is
involved in t(1;14)(p22;q32) of MALT B cell lymphoma and
mutated in multiple tumor types. Cell 1999, 96:35-45.
Du MQ, Peng H, Liu H, Hamoudi RA, Diss TC, Willis TG, Ye H,
Dogan A, Wotherspoon AC, Dyer MJ, Isaacson PG: BCL10 gene
mutation in lymphoma. Blood 2000, 95:3885-3890.
10. Hofmann K, Bucher P, Tschopp J: The CARD domain: a new
apoptotic signalling motif. Trends Biochem Sci 1997, 22:155-156.
11. Hofmann K: The modular nature of apoptotic signaling proteins.
Cell Mol Life Sci 1999, 55:1113-1128.
12. Zhang Q, Siebert R, Yan M, Hinzmann B, Cui X, Xue L, Rakestraw KM,
Naeve CW, Beckmann G, Weisenburger DD, Sanger WG, Nowotny H
et al.: Inactivating mutations and overexpression of BCL10, a
255
caspase recruitment domain-containing gene, in MALT lymphoma
with t(1;14)(p22;q32). Nat Genet 1999, 22:636-638.
13. Uren AG, O’Rourke K, Aravind L, Pisabarro MT, Seshagiri S, Koonin EV,
•• Dixit VM: Identification of paracaspases and metacaspases. Two
ancient families of caspase-like proteins, one of which plays a
key role in MALT lymphoma. Mol Cell 2000, 6:961-967.
This example of combining computational analysis and experimental verification
reveals the molecular basis of a class of MALT lymphoma translocation.
Computational analysis using sensitive tools for sequence-profile analysis was critical for the identification of two new families of caspase-related proteases, which
produce a new perspective on the evolution of this important class of enzymes.
14. Steiner H, Kostka M, Romig H, Basset G, Pesold B, Hardy J, Capell A,
•• Meyn L, Grim ML, Baumeister R et al.: Glycine 384 is required for
presenilin-1 function and is conserved in bacterial polytopic
aspartyl proteases. Nat Cell Biol 2000, 2:848-851.
Reports the site-directed mutagenesis of presenilin 1 and detection of the
similarity between its putative catalytic site and that of bacterial type-4
prepilin peptidases. A functional and possibly evolutionary connection,
revealed by computer analysis of protein sequences and structures, is plausible
despite the absence of significant sequence similarity.
15. Ma L, Golden S, Wu L, Maxson R: The molecular basis of
→His mutation in the
Boston-type craniosynostosis: the Pro148→
N-terminal arm of the MSX2 homeodomain stabilizes DNA
binding without altering nucleotide sequence preferences. Hum
Mol Genet 1996, 5:1915-1920.
16. Wilkie AO, Tang Z, Elanko N, Walsh S, Twigg SR, Hurst JA, Wall SA,
Chrzanowska KH, Maxson RE Jr: Functional haploinsufficiency of
the human homeobox gene MSX2 causes defects in skull
ossification. Nat Genet 2000, 24:387-390.
17.
Vastardis H, Karimbux N, Guthua SW, Seidman JG, Seidman CE:
A human MSX1 homeodomain missense mutation causes
selective tooth agenesis. Nat Genet 1996, 13:417-421.
18. van den Boogaard MJ, Dorland M, Beemer FA, and van Amstel HK:
MSX1 mutation is associated with orofacial clefting and tooth
agenesis in humans. Nat Genet 2000, 24:342-343.
19. Ferda Percin E, Ploder LA, Yu JJ, Arici K, Horsford DJ, Rutherford A,
Bapat B, Cox DW, Duncan AM, Kalnins VI et al.: Human
microphthalmia associated with mutations in the retinal
homeobox gene CHX10. Nat Genet 2000, 25:397-401.
20. Thompson AA, Nguyen LT: Amegakaryocytic thrombocytopenia and
radio-ulnar synostosis are associated with HOXA11 mutation. Nat
Genet 2000, 26:397-398.
21. Stockton DW, Das P, Goldenberg M, D’Souza RN, Patel PI: Mutation
of PAX9 is associated with oligodontia. Nat Genet 2000, 24.18-19
22. Minassian BA, Lee JR, Herbrick JA, Huizenga J, Soder S, Mungall AJ,
Dunham I, Gardner R, Fong CY, Carpenter S et al.: Mutations in a
gene encoding a novel protein tyrosine phosphatase cause
progressive myoclonus epilepsy. Nat Genet 1998, 20:171-174.
23. Minassian BA, Ianzano L, Meloche M, Andermann E, Rouleau GA,
•• Delgado-Escueta AV, Scherer SW: Mutation spectrum and
predicted function of laforin in Lafora’s progressive myoclonus
epilepsy. Neurology 2000, 55:341-346.
This paper underscores both the potential of computational analysis of protein
domains and the need to be cautious in interpreting such results.
Computational analysis of laforin reveals the presence of a carbohydrate-binding domain and a dual specificity phosphatase domain; however, the apparent
glucohydrolase motifs detected in this protein are likely to be spurious.
24. Bateman A, Birney E, Durbin R, Eddy SR, Howe KL, Sonnhammer EL:
The Pfam protein families database. Nucleic Acids Res 2000,
28:263-266.
25. Hofmann K, Bucher P, Falquet L, Bairoch A: The PROSITE database,
its status in 1999. Nucleic Acids Res 1999, 27:215-219.
26. Ganesh S, Agarwala KL, Ueda K, Akagi T, Shoda K, Usui T,
Hashikawa T, Osada H, Delgado-Escueta AV, Yamakawa K: Laforin,
defective in the progressive myoclonus epilepsy of lafora type, is
a dual-specificity phosphatase associated with polyribosomes.
Hum Mol Genet 2000, 9:2251-2261.
27.
Kobayashi K, Nakahori Y, Miyake M, Matsumura K, Kondo-Iida E,
Nomura Y, Segawa M, Yoshioka M, Saito K, Osawa M, Hamano K
et al.: An ancient retrotransposal insertion causes Fukuyama-type
congenital muscular dystrophy. Nature 1998, 394:388-392.
28. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W,
Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of
256
Genetics of disease
protein database search programs. Nucleic Acids Res 1997,
25:3389-3402.
29. Aravind L, Koonin EV: The fukutin protein family — predicted enzymes
modifying cell-surface molecules. Curr Biol 1999, 9:R836-R837.
30. Gibson TJ, Koonin EV, Musco G, Pastore A, Bork P: Friedreich’s
ataxia protein: phylogenetic evidence for mitochondrial
dysfunction. Trends Neurosci 1996, 19:465-468.
31. Koutnikova H, Campuzano V, Foury F, Dolle P, Cazzalini O, Koenig M:
Studies of human, mouse and yeast homologues indicate a
mitochondrial function for frataxin. Nat Genet 1997, 16:345-351.
32. Dhe-Paganon S, Shigeta R, Chi YI, Ristow M, Shoelson SE: Crystal
structure of human frataxin. J Biol Chem 2000, 275:30753-30756.
33. Ristow M, Pfister MF, Yee AJ, Schubert M, Michael L, Zhang CY, Ueki K,
•
Michael MD 2nd, Lowell BB, Kahn CR: Frataxin activates
mitochondrial energy conversion and oxidative phosphorylation.
Proc Natl Acad Sci USA 2000, 97:12239-12243.
Experimental validation of the role of frataxin in mitochondrial energy conversion
that corroborates with the computational prediction.
34. Stone DL, Slavotinek A, Bouffard GG, Banerjee-Basu S, Baxevanis AD,
•• Barr M, Biesecker LG: Mutation of a gene encoding a putative
chaperonin causes McKusick–Kaufman syndrome. Nat Genet
2000, 25:79-82.
This paper illustrates how DNA sequence data obtained form positional
cloning experiments can be subjected to computational analysis and further
structure modeling to provide insights into biology of the system. On the
basis of sequence analysis, MKKS is predicted to be a chaperonin. The
structure of MKKS is modeled to explain the loss of function associated with
mutations in this gene.
35. Slavotinek AM, Stone EM, Mykytyn K, Heckenlively JR, Green JS, Heon E,
Musarella MA, Parfrey PS, Sheffield VC, Biesecker LG: Mutations in
MKKS cause bardet-biedl syndrome. Nat Genet 2000, 26:15-16.
36. Karkkainen MJ, Ferrell RE, Lawrence EC, Kimak MA, Levinson KL,
•• McTigue MA, Alitalo K, Finegold DN: Missense mutations interfere
with VEGFR-3 signalling in primary lymphoedema. Nat Genet
2000, 25:153-159.
Homology-based structure modeling of VEGFR-3 predicts that a mutation
implicated in primary lymphoedema eliminates ATP binding by the
serine/threonine protein kinase domain of this protein.
37.
Scott DA, Wang R, Kreman TM, Sheffield VC, Karnishki LP: The
Pendred syndrome gene encodes a chloride-iodide transport
protein. Nat Genet 1999, 21:440-443.
38. Everett LA, Green ED: A family of mammalian anion transporters
and their involvement in human genetic diseases. Hum Mol Genet
1999, 8:1883-1891.
39. Aravind L, Koonin EV: The STAS domain — a link between anion
•
transporters and antisigma- factor antagonists. Curr Biol 2000,
10:R53-R55.
This paper describes an NTP-binding STAS domain common to anion transporters and bacterial antisigma-factor antagonists, providing clues to the
regulation of anion transporters. This finding illustrates the potential of
sequence analysis to reveal completely unexpected connections between
domains contained in proteins that are not considered to be evolutionarily or
functionally related.
40. Najafi SM, Harris DA, Yudkin MD: The SpoIIAA protein of Bacillus
subtilis has GTP-binding properties. J Bacteriol 1996,
178:6632-6634.
41. Avela K, Lipsanen-Nyman M, Idanheimo N, Seemanova E, Rosengren S,
•
Makela TP, Perheentupa J, Chapelle A, Lehesjoki AE: Gene encoding
a new RING-B-box-coiled-coil protein is mutated in mulibrey
nanism. Nat Genet 2000, 25:298-301.
Identification of a disease gene coding for a protein with a complex domain
architecture. It appears that even more specific predictions than described
in this paper are possible on the basis of the combination of domains present in the MUL protein.
42. Jackson PK, Eldridge AG, Freed E, Furstenthal L, Hsu JY, Kaiser BK,
Reimann JD: The lore of the RINGs: substrate recognition and
catalysis by ubiquitin ligases. Trends Cell Biol 2000, 10:429-439.
43. Aravind L, Dixit VM, Koonin EV: The domains of death: evolution of
the apoptosis machinery. Trends Biochem Sci 1999, 24:47-53.
44. Millar DS, Johansen B, Berntorp E, Minford A, Bolton-Maggs P,
Wensley R, Kakkar V, Schulman S, Torres A, Bosch N, Cooper DN:
Molecular genetic analysis of severe protein C deficiency. Hum
Genet 2000, 106:646-653.
45. Delettre C, Lenaers G, Griffoin JM, Gigarel N, Lorenzo C, Belenguer P,
Pelloquin L, Grosgeorge J, Turc-Carel C, Perret E et al.: Nuclear gene
OPA1, encoding a mitochondrial dynamin-related protein, is
mutated in dominant optic atrophy. Nat Genet 2000, 26:207-210.
46. Alexander C, Votruba M, Pesch UE, Thiselton DL, Mayer S, Moore A,
Rodriguez M, Kellner U, Leo-Kottler B, Auburger G et al.: OPA1, encoding
a dynamin-related GTPase, is mutated in autosomal dominant optic
atrophy linked to chromosome 3q28. Nat Genet 2000, 26:211-215.
47.
Hughes AE, Ralston SH, Marken J, Bell C, MacPherson H, Wallace RG,
van Hul W, Whyte MP, Nakatsuka K, Hovy L, Anderson DM: Mutations
in TNFRSF11A, affecting the signal peptide of RANK, cause familial
expansile osteolysis. Nat Genet 2000, 24:45-48.
48. Koonin EV, Altschul SF, Bork P: BRCA1 protein products...
Functional motifs. Nat Genet 1996, 13:266-268.
49. Madej T, Boguski MS, Bryant SH: Threading analysis suggests that
the obese gene product may be a helical cytokine. FEBS Lett
1995, 373:13-18.
50. Mushegian AR, Bassett DE Jr, Boguski MS, Bork P, Koonin EV:
Positionally cloned human disease genes: patterns of
evolutionary conservation and functional motifs. Proc Natl Acad
Sci USA 1997, 94:5831-5836.
51. Tatusov RL, Koonin EV, Lipman DJ: A genomic perspective on
protein families. Science 1997, 278:631-637.
52. Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram UT,
•
Rao BS, Kiryutin B, Galperin MY, Fedorova ND, Koonin EV: The COG
database: new developments in phylogenetic classification of
proteins from complete genomes. Nucleic Acids Res 2001, 29:22-28.
An evolutionary classification of proteins from sequenced genomes based on
the notion of orthologous relationships. This new version includes proteins from
C. elegans and Drosophila that have homologs in Archaea and/or Bacteria.
53. Fortini ME, Skupski MP, Boguski MS, Hariharan IK: A survey of
•• human disease gene counterparts in the Drosophila genome.
J Cell Biol 2000, 150:F23-F30.
A careful analysis of human disease genes with the goal of identifying counterparts of disease genes in fruitfly — an approach that is applicable to other
model eukaryotic organisms as well.
54. Rubin GM, Yandell MD, Wortman JR, Gabor Miklos GL, Nelson CR,
•• Hariharan IK, Fortini ME, Li PW, Apweiler R, Fleischmann W et al.:
Comparative genomics of the eukaryotes. Science 2000,
287:2204-2215.
A comparative analysis of the predicted proteomes of three eukaryotes —
yeast, worm and fly — that reveals unity and diversity of protein families and
domain architecture of proteins in these organisms. Being the first attempt
of such systematic comparative analysis, many of the conclusions need further refinement, but some of the approaches used in this study have general
application in comparative genomics.
55. Morett E, Bork P: A novel transactivation domain in parkin. Trends
Biochem Sci 1999, 24:229-231.
56. Shimura H, Hattori N, Kubo S, Mizuno Y, Asakawa S, Minoshima S,
•
Shimizu N, Iwai K, Chiba T, Tanaka K, Suzuki T: Familial Parkinson
disease gene product, parkin, is a ubiquitin-protein ligase. Nat
Genet 2000, 25:302-305.
Experimental confirmation of the predicted function of a protein containing
the RING domain.
57.
Bomont P, Cavalier L, Blondeau F, Hamida CB, Belal S, Tazir M,
Demir E, Topaloglu H, Korinthenberg R, Tuysuz B et al.: The gene
encoding gigaxonin, a new member of the cytoskeletal
BTB/kelch repeat family, is mutated in giant axonal neuropathy.
Nat Genet 2000, 26:370-374.
58. Gal A, Li Y, Thompson DA, Weir J, Orth U, Jacobson SG,
Apfelstedt-Sylla E, Vollrath D: Mutations in MERTK, the human
orthologue of the RCS rat retinal dystrophy gene, cause retinitis
pigmentosa. Nat Genet 2000, 26:270-271.
59. Bamford RN, Roessler E, Burdine RD, Saplakoglu U, dela Cruz J,
Splitt M, Towbin J, Bowers P, Marino B, Schier AF et al.: Loss-offunction mutations in the EGF-CFC gene CFC1 are associated with
human left-right laterality defects. Nat Genet 2000, 26:365-369.
60. Akama TO, Nishida K, Nakayama J, Watanabe H, Ozaki K, Nakamura T,
Dota A, Kawasaki S, Inoue Y, Maeda N et al.: Macular corneal
dystrophy type I and type II are caused by distinct mutations in a
new sulphotransferase gene. Nat Genet 2000, 26:237-241.
61. Bech-Hansen NT, Naylor MJ, Maybaum TA, Sparkes RL, Koop B,
Birch DG, Bergen AA, Prinsen CF, Polomeno RC, Gal A et al.:
Computational analysis of disease-associated genes and their products Sreekumar et al.
Mutations in NYX, encoding the leucine-rich proteoglycan
nyctalopin, cause X-linked complete congenital stationary night
blindness. Nat Genet 2000, 26:319-323.
62. Pusch CM, Zeitz C, Brandau O, Pesch K, Achatz H, Feil S, Scharfe C,
Maurer J, Jacobi FK, Pinckers A et al.: The complete form of
X-linked congenital stationary night blindness is caused by
mutations in a gene encoding a leucine-rich repeat protein. Nat
Genet 2000, 26:324-327.
63. Kutsche K, Yntema H, Brandt A, Jantke I, Gerd Nothwang H, Orth U,
Boavida MG, David D, Chelly J, Fryns JP et al.: Mutations in
ARHGEF6, encoding a guanine nucleotide exchange factor for
rho GTPases, in patients with X-linked mental retardation. Nat
Genet 2000, 26:247-250.
64. Verpy E, Leibovici M, Zwaenepoel I, Liu XZ, Gal A, Salem N,
Mansour A, Blanchard S, Kobayashi I, Keats BJ et al.: A defect in
harmonin, a PDZ domain-containing protein expressed in the
inner ear sensory hair cells, underlies usher syndrome type 1C.
Nat Genet 2000, 26:51-55.
65. Kaplan JM, Kim SH, North KN, Rennke H, Correia LA, Tong HQ,
Mathis BJ, Rodriguez-Perez JC, Allen PG, Beggs AH, Pollak MR:
Mutations in ACTN4, encoding alpha-actinin-4, cause familial
focal segmental glomerulosclerosis. Nat Genet 2000, 24:251-256.
66. Mutations in MYH9 result in the may-hegglin anomaly, and
fechtner and sebastian syndromes. Nat Genet 2000, 26:103-105.
67.
Netchine I, Sobrier ML, Krude H, Schnabel D, Maghnie M, Marcos E,
Duriez B, Cacheux V, Moers A, Goossens M, Gruters A, Amselem S:
Mutations in LHX3 result in a new syndrome revealed by combined
pituitary hormone deficiency. Nat Genet 2000, 25:182-186.
68. Chavanas S, Bodemer C, Rochat A, Hamel-Teillac D, Ali M, Irvine AD,
Bonafe JL, Wilkinson J, Taieb A, Barrandon Y, Harper JI, de Prost Y,
Hovnanian A: Mutations in SPINK5, encoding a serine protease
inhibitor, cause Netherton syndrome. Nat Genet 2000, 25:141-142.
69. Bignell GR, Warren W, Seal S, Takahashi M, Rapley E, Barfoot R,
Green H, Brown C, Biggs PJ, Lakhani SR, Jones C, Hansen J et al.:
Identification of the familial cylindromatosis tumour-suppressor
gene. Nat Genet 2000, 25:160-165.
70. Waeber G, Delplanque J, Bonny C, Mooser V, Steinmann M,
Widmann C, Maillard A, Miklossy J, Dina C, Hani EH et al.: The gene
MAPK8IP1, encoding islet-brain-1, is a candidate for type 2
diabetes. Nat Genet 2000, 24:291-295.
71. Momeni P, Glockner G, Schmidt O, von Holtum D, Albrecht B,
Gillessen-Kaesbach G, Hennekam R, Meinecke P, Zabel B,
Rosenthal A, Horsthemke B, Ludecke HJ: Mutations in a new gene,
encoding a zinc-finger protein, cause tricho- rhino-phalangeal
syndrome type I. Nat Genet 2000, 24:71-74.
72. Delepine M, Nicolino M, Barrett T, Golamaully M, Lathrop GM,
Julier C: EIF2AK3, encoding translation initiation factor 2-alpha
kinase 3, is mutated in patients with Wolcott-Rallison syndrome.
Nat Genet 2000, 25:406-409.
73. Arbour NC, Lorenz E, Schutte BC, Zabner J, Kline JN, Jones M,
Frees K, Watt JL, Schwartz DA: TLR4 mutations are associated with
endotoxin hyporesponsiveness in humans. Nat Genet 2000,
25:187-191.
74. Hastbacka J, de la Chapelle A, Mahtani MM, Clines G,
Reeve-Daly MP, Daly M, Hamilton BA, Kusumi K, Trivedi B, Weaver A
et al.: The diastrophic dysplasia gene encodes a novel sulfate
transporter: positional cloning by fine-structure linkage
disequilibrium mapping. Cell 1994, 78:1073-1087.
75. Superti-Furga A, Rossi A, Steinmann B, Gitzelmann R: A
chondrodysplasia family produced by mutations in the diastrophic
dysplasia sulfate transporter gene: genotype/phenotype
correlations. Am J Med Genet 1996, 63:144-147.
76. Hoglund P, Haila S, Socha J, Tomaszewski L, Saarialho-Kere U,
Karjalainen-Lindsberg ML, Airola K, Holmberg C, de la Chapelle A,
Kere J: Mutations of the Down-regulated in adenoma (DRA) gene
cause congenital chloride diarrhoea. Nat Genet 1996, 14:316-319.
77.
Zemni R, Bienvenu T, Vinet MC, Sefiani A, Carrie A, Billuart P,
McDonell N, Couvert P, Francis F, Chafey P et al.: A new gene
involved in X-linked mental retardation identified by analysis of an
X;2 balanced translocation. Nat Genet 2000, 24:167-170.
78. Sun H, Smallwood PM, Nathans J: Biochemical defects in ABCR
protein variants associated with human retinopathies. Nat Genet
2000, 26:242-246.
257
79. Paloneva J, Kestila M, Wu J, Salminen A, Bohling T, Ruotsalainen V,
Hakola P, Bakker AB, Phillips JH, Pekkarinen P et al.: Loss-offunction mutations in TYROBP (DAP12) result in a presenile
dementia with bone cysts. Nat Genet 2000, 25:357-361.
80. Naureckiene S, Sleat DE, Lackland H, Fensom A, Vanier MT, Wattiaux R,
Jadot M, Lobel P: Identification of HE1 as the Second Gene of
Niemann-Pick C Disease. Science 2000, 290:2298-2301.
81. Davies JP, Levy B, Ioannou YA: Evidence for a Niemann-pick C
(NPC) gene family: identification and characterization of NPC1L1.
Genomics 2000, 65:137-145.
82. Bulman MP, Kusumi K, Frayling TM, McKeown C, Garrett C, Lander ES,
Krumlauf R, Hattersley AT, Ellard S, Turnpenny PD: Mutations in the
human delta homologue, DLL3, cause axial skeletal defects in
spondylocostal dysostosis. Nat Genet 2000, 24:438-441.
83. Oldridge M, Fortuna AM, Maringa M, Propping P, Mansour S, Pollitt C,
DeChiara TM, Kimble R B, Valenzuela DM, Yancopoulos GD, Wilkie
AO: Dominant mutations in ROR2, encoding an orphan receptor
tyrosine kinase, cause brachydactyly type B. Nat Genet 2000,
24:275-278.
84. Hu Z, Bonifas JM, Beech J, Bench G, Shigihara T, Ogawa H, Ikeda S,
Mauro T, Epstein EH Jr: Mutations in ATP2C1, encoding a calcium
pump, cause Hailey–Hailey disease. Nat Genet 2000, 24:61-65.
85. Irrthum A, Karkkainen MJ, Devriendt K, Alitalo K, Vikkula M:
Congenital hereditary lymphedema caused by a mutation that
inactivates VEGFR3 tyrosine kinase. Am J Hum Genet 2000,
67:295-301.
86. Wolfe MS, Xia W, Ostaszewski BL, Diehl TS, Kimberly WT, Selkoe DJ:
Two transmembrane aspartates in presenilin-1 required for
presenilin endoproteolysis and γ-secretase activity. Nature 1999,
398:513-517.
87.
Nielsen H, Engelbrecht J, Brunak S, von Heijne G: A neural network
method for identification of prokaryotic and eukaryotic signal
peptides and prediction of their cleavage sites. Int J Neural Syst
1997, 8:581-599.
88. Claros MG, von Heijne G: TopPred II: an improved software for
membrane protein structure predictions. Comput Appl Biosci
1994, 10:685-686.
89. Wheeler DL, Church DM, Lash AE, Leipe DD, Madden TL, Pontius JU,
Schuler GD, Schriml LM, Tatusova TA, Wagner L, Rapp BA:
Database resources of the national center for biotechnology
information. Nucleic Acids Res 2001, 29:11-16.
90. Schultz J, Copley RR, Doerks T, Ponting CP, Bork P: SMART: a
web-based tool for the study of genetically mobile domains.
Nucleic Acids Res 2000, 28:231-234.
91. Henikoff JG, Greene EA, Pietrokovski S, Henikoff S: Increased
coverage of protein families with the blocks database servers.
Nucleic Acids Res 2000, 28:228-230.
92. Attwood TK, Croning MD, Flower DR, Lewis AP, Mabey JE, Scordis P,
Selley JN, Wright W: PRINTS-S: the database formerly known as
PRINTS. Nucleic Acids Res 2000, 28:225-227.
93. Eddy SR: Profile hidden Markov models. Bioinformatics 1998,
14:755-763.
94. Rost B, Sander C, Schneider R: PHD — an automatic mail server
for protein secondary structure prediction. Comput Appl Biosci
1994, 10:53-60.
95. McGuffin LJ, Bryson K, Jones DT: The PSIPRED protein structure
prediction server. Bioinformatics 2000, 16:404-405.
96. Fischer D: Hybrid fold recognition: combining sequence derived
properties with evolutionary information. Pac Symp Biocomput
2000, 119-130.
97.
Holm L, Sander C: Touring protein fold space with Dali/FSSP.
Nucleic Acids Res 1998, 26:316-319.
98. Guex N, Peitsch MC: SWISS-MODEL and the Swiss-PdbViewer: an
environment for comparative protein modeling. Electrophoresis
1997, 18:2714-2723.
99. Madej T, Gibrat JF, Bryant SH: Threading a database of protein
cores. Proteins Struct Funct Genet 1995, 23:356-369.
100. Antonarakis SE, McKusick VA: OMIM passes the 1,000-diseasegene mark. Nat Genet 2000, 25:11.