Nomenclature issues relevant to HPP
Version 1.0 / October 10, 2012 / Amos Bairoch
This note collects a number of points concerning the nomenclature of human
biological "objects".
Genes
There is an organization which assigns symbols for human genes: the HUGO Gene
Nomenclature Committee (HGNC) (see www.genenames.org). HGNC follows a number
of guidelines and rules (see www.genenames.org/guidelines.html) and it tries to ensure
consistency in gene nomenclature with the mouse and rat communities.
What should be known about human gene nomenclature in the context of HPP is the
following:
- Human gene symbols generally consist of upper-case Latin letters or of a
  combination of upper-case letters and Arabic numerals, with no embedded
  punctuation.
- There are exceptions to the above rule: for example, the so-called Corf genes,
  where the "orf" part is lower case (C1orf21, C2orf88, etc.), mitochondrially
  encoded genes, which have an embedded "-" (MT-CO3, MT-ND2, etc.), and some
  other miscellaneous but rare exceptions.
- Gene symbols are not stable. HGNC tries to make them as stable as possible, but it
  often happens that a "provisional" gene symbol, such as a Corf symbol or a
  symbol based solely on the presence of a domain, changes as new "functional"
  data become available.
- Unlike other species such as yeast (the YnR/YnL naming system) or Drosophila
  (the CG numbers), there are NO stable symbols for human genes in any
  database or resource. The only stable entity is the HGNC gene accession number.
  Example: INS for insulin is HGNC:6081.
- Unfortunately, HGNC has not yet assigned gene symbols, and therefore accession
  numbers, to all human protein-coding genes. As stated on its web site, HGNC has
  "assigned unique gene symbols and names to over 33,000 human loci, of which
  around 19,000 are protein coding."
- UniProt, and thus also neXtProt, has tried to assign a temporary name, based on
  what authors have proposed, to each of the more than 1,000 genes with no
  official gene symbol, but this still leaves about 460 genes with no associated
  gene symbol.
- Ensembl makes use of the HGNC nomenclature for gene names and assigns to each
  gene (whether it is protein-coding or not) a stable identifier in the form "ENSG"
  followed by an 11-digit number (example: ENSG00000146648).
- When there is no HGNC symbol, Ensembl assigns an arbitrary gene symbol, often
  based on the sequencing contig of the original human genome sequencing project.
  Example: the UniProtKB/Swiss-Prot protein with the accession B9A014 is not yet
  part of HGNC and is called “AP000322.54” by Ensembl.
- Whenever possible, when the ortholog of a human gene exists in other vertebrate
  species, the same gene symbol is used. The casing will however be different:
  mouse and rat symbols start with an upper-case letter followed by lower-case
  letters, and zebrafish symbols are all lower case. So “FGF1” in human is called
  “Fgf1” in mouse and rat and “fgf1” in zebrafish.
- As with other rules, there are exceptions to this conservation of gene symbols.
  There are two sets of exceptions:
  o Zinc finger genes: in human they are prefixed by “ZNF” and in mouse by “Zfp”
    and, to make things worse, the numbering is not conserved. For example, the
    ortholog of human ZNF22 is mouse Zfp422.
  o The Corf gene nomenclature is not used in mouse. This is logical, as there is
    no conservation of gene-to-chromosome mapping between the two species,
    only some regions of synteny. Instead, for Corf genes, the corresponding
    symbol assigned by MGI is based on a clone “name”. Example: the
    mouse ortholog of human C9orf117 is known as 1700019L03Rik.
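The symbol rules above can be sketched in a few lines of Python. This is only an illustration: the three patterns and the function names are assumptions derived from the examples in this note, not HGNC's own validation rules, and the rarer exceptions are not covered.

```python
import re

# Assumed patterns, built only from the examples given in this note.
STANDARD = re.compile(r"[A-Z][A-Z0-9]*")    # upper-case letters and digits, e.g. FGF1
CORF = re.compile(r"C[0-9]+orf[0-9]+")      # e.g. C1orf21, C9orf117
MITO = re.compile(r"MT-[A-Z0-9]+")          # e.g. MT-CO3, MT-ND2


def classify_symbol(symbol: str) -> str:
    """Return a rough category for a human gene symbol."""
    if CORF.fullmatch(symbol):
        return "corf"
    if MITO.fullmatch(symbol):
        return "mitochondrial"
    if STANDARD.fullmatch(symbol):
        return "standard"
    return "other"


def apply_species_casing(human_symbol: str, species: str) -> str:
    """Apply the casing convention only; the real ortholog symbol may differ."""
    if species in ("mouse", "rat"):
        # First letter upper case, remaining letters lower case.
        return human_symbol[:1].upper() + human_symbol[1:].lower()
    if species == "zebrafish":
        return human_symbol.lower()
    raise ValueError(f"no casing convention known for species {species!r}")


print(classify_symbol("C1orf21"))                  # corf
print(apply_species_casing("FGF1", "mouse"))       # Fgf1
print(apply_species_casing("FGF1", "zebrafish"))   # fgf1
```

Casing alone does not give the correct ortholog symbol for the two sets of exceptions: zinc finger genes (ZNF22 versus Zfp422) and Corf genes need a lookup in an orthology resource.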
In the context of HPP, we can therefore provide gene symbols for all but about 460
genes. For these “missing” symbols it is necessary to use the UniProtKB/Swiss-Prot
accession numbers.
Transcripts
For transcripts there is no stable or unified nomenclature system. There are only some
stable identifiers in various databases. Individual literature reports have given names for
some transcripts based on sequence length (example: short or long isoform),
localization of the final product (example: mitochondrial or cytoplasmic isoform), the
molecular weight of the protein product (example: p50, p70) or just numbers or Greek
letters.
For identifiers the current situation is described below. As an example we have used
EGFR (the gene coding for the EGF receptor).
UniProtKB/Swiss-Prot, and thus also neXtProt, assign stable identifiers to the
alternative isoforms (produced by alternative splicing, initiation or promoter usage).
- These identifiers are in the form AC-n (the accession number of the entry followed by
  a dash and a number).
- UniProt/neXtProt also assign names, generally using numbers, and also report those
  used in the literature, generally as synonyms. While the UniProt names are not
  supposed to be stable, in practice they rarely change.
- For EGFR, there are currently 4 "alternative" isoforms:
  P00533-1   Name: 1; Synonyms: p170
  P00533-2   Name: 2; Synonyms: p60, Truncated, TEGFR
  P00533-3   Name: 3; Synonyms: p110
  P00533-4   Name: 4
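As an illustration of the AC-n form, the following Python sketch splits an isoform identifier into its accession and isoform number. The accession pattern is the regular expression published in the UniProt help pages; the function and type names are hypothetical.

```python
import re
from typing import NamedTuple

# UniProtKB accession pattern, as documented in the UniProt help pages.
_ACCESSION = (
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)
_ISOFORM_ID = re.compile(rf"(?P<accession>{_ACCESSION})-(?P<number>[1-9][0-9]*)")


class IsoformId(NamedTuple):
    accession: str
    number: int


def parse_isoform_id(text: str) -> IsoformId:
    """Split an identifier of the form AC-n into accession and isoform number."""
    match = _ISOFORM_ID.fullmatch(text)
    if match is None:
        raise ValueError(f"not an isoform identifier of the form AC-n: {text!r}")
    return IsoformId(match["accession"], int(match["number"]))


print(parse_isoform_id("P00533-2"))   # IsoformId(accession='P00533', number=2)
```

A bare accession such as P00533 is rejected on purpose, because it does not say which isoform is meant.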
Ensembl assigns stable identifiers to transcripts in the form "ENST" followed by an
11-digit number (example: ENST00000275493). Transcripts are also assigned names of
the type Gene_symbol-nnn (example: EGFR-001). Each transcript is assigned to one of
the following categories (called "Biotype"): "Protein coding", "Processed transcript" or
"Retained intron".
For EGFR there are currently 11 "transcripts", 8 of which are predicted to be
protein-coding.
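The two Ensembl identifier forms mentioned in this note (ENSG for genes, ENST for transcripts) can be told apart with a short check. This is a minimal sketch for human identifiers only; the function name is hypothetical, and identifiers carrying a version suffix or belonging to other species are deliberately not handled.

```python
import re

# "ENSG" or "ENST" followed by exactly 11 digits, as described in this note.
_ENSEMBL_ID = re.compile(r"ENS(?P<kind>[GT])(?P<digits>[0-9]{11})")
_KINDS = {"G": "gene", "T": "transcript"}


def ensembl_id_kind(identifier: str) -> str:
    """Return 'gene' or 'transcript' for a human Ensembl stable identifier."""
    match = _ENSEMBL_ID.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not a human ENSG/ENST identifier: {identifier!r}")
    return _KINDS[match["kind"]]


print(ensembl_id_kind("ENSG00000146648"))   # gene
print(ensembl_id_kind("ENST00000275493"))   # transcript
```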
The Consensus CDS (CCDS) project (see
http://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi) is a collaborative effort by NCBI,
EBI, the Sanger Institute and UCSC to identify a core set of human and mouse protein
coding regions that are consistently annotated and of high quality. CCDS assigns an
accession number for each transcript. So each human gene can be associated with a
number of different protein-coding transcripts. For EGFR here is the current data:
It should be noted that CCDS accession numbers are composed of two numerical parts
separated by a period. The first is an arbitrarily assigned number, which is stable;
the second is a version number, which is incremented whenever the nucleotide
sequence is changed (which does not necessarily imply that the protein sequence is
changed).
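The two-part structure of a CCDS accession number can be illustrated as follows. This sketch assumes the usual written form with a "CCDS" prefix (for example "CCDS1234.2"); the identifiers used here are made up for the example and the function names are hypothetical.

```python
import re
from typing import NamedTuple

_CCDS = re.compile(r"CCDS(?P<number>[0-9]+)\.(?P<version>[0-9]+)")


class CcdsId(NamedTuple):
    number: int    # stable, arbitrarily assigned
    version: int   # incremented when the nucleotide sequence changes


def parse_ccds(text: str) -> CcdsId:
    """Split a CCDS accession into its stable number and its version."""
    match = _CCDS.fullmatch(text)
    if match is None:
        raise ValueError(f"not a CCDS accession: {text!r}")
    return CcdsId(int(match["number"]), int(match["version"]))


def same_ccds_record(a: str, b: str) -> bool:
    """True if two accessions differ at most in their version number."""
    return parse_ccds(a).number == parse_ccds(b).number


# Made-up identifiers, for illustration only.
print(parse_ccds("CCDS1234.2"))                       # CcdsId(number=1234, version=2)
print(same_ccds_record("CCDS1234.1", "CCDS1234.2"))   # True
```

Comparing only the stable number is what allows records to be matched across releases; the version is still needed when the exact nucleotide sequence matters.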
In the context of HPP, as the identification will be done using the UniProtKB/Swiss-Prot
“complete proteome” set (a document will be drafted about what it really means), one
needs to indicate not a “name” but the stable identifier of the alternative form (example:
P00533-1). To improve readability, it may be desirable to add the gene symbol when it
exists, and refer to this isoform as “EGFR P00533-1”.
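A minimal sketch of this labelling convention, with a hypothetical function name: the stable isoform identifier is always present, and the gene symbol is prepended only when one exists.

```python
from typing import Optional


def isoform_label(isoform_id: str, gene_symbol: Optional[str] = None) -> str:
    """Build a readable label: the gene symbol (if any) plus the stable identifier."""
    if gene_symbol:
        return f"{gene_symbol} {isoform_id}"
    return isoform_id


print(isoform_label("P00533-1", "EGFR"))   # EGFR P00533-1
print(isoform_label("P00533-1"))           # P00533-1
```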
Proteins
Protein name nomenclature has a long history, with deep roots in the history of
science. There have been many attempts to make the biological community abide by
standardized naming schemes, some of which, like the effort of the IUPAC-IUBMB
Enzyme Committee, date back to the 1960s. Unfortunately, sociological aspects of
the life science research process make this endeavor almost intractable. As Michael
Ashburner at the University of Cambridge once famously quipped, “Biologists would
rather share their toothbrush than share a gene name”. This statement can seem
contradictory, as we just saw that there is a standardized gene nomenclature (not only
for human genes but for almost 20 different model organisms). Unfortunately, the
recommended nomenclature is often not followed (at least this is something that HPP
can enforce in the context of the project), and the community that needs to make use of
human gene and protein names is much bigger and more heterogeneous than more
“focused” communities such as those of yeast or Drosophila researchers. What will
please an enzymologist will be anathema to a developmental biologist and totally
obscure to a medical researcher or a geneticist.
Some groups have been successful in imposing a unified nomenclature for some
specific protein families or groups. Good examples are the efforts to produce unified
names for cytokines (the interleukin nomenclature), for integrins and for the cell
differentiation molecules (the so-called CD antigens).
But these efforts only concern a very small percentage of all human proteins, and they
can also be confusing when the proteins being named fall into two or more categories.
As an example, the protein that should be called integrin beta-1 is also called CD29,
depending on whether you are an integrin researcher or a CD aficionado.
The World Health Organization (WHO) tries to insist on the use of INNs (International
Nonproprietary Names) for proteins used as drugs, but these are only available for
fewer than 100 human proteins and are rarely used outside the pharmacological
community (example: the INN of interleukin-2 is aldesleukin, a name found in 97
PubMed abstracts versus 52,700 for the usual name).
All these issues led UniProtKB/Swiss-Prot, more than 10 years ago, to embark on two
complementary efforts:
1) Drafting and distributing guidelines on how to name a protein
2) Attributing, for each protein in the database, a recommended name (RN)
following as far as possible the rules listed in the naming guideline document.
The UniProtKB protein naming guidelines, which are also used at NCBI and in other
resources, are available at www.uniprot.org/docs/nameprot.
This document emphasizes the use of a recommended protein name (RN) which is as
“neutral” as possible. One reason for this is that it should be possible to propagate a
protein name to all orthologous proteins from various organisms. This is why, ideally,
the recommended protein name should not contain a specific characteristic of the
protein. In particular, it should not reflect the function or role of the protein, its
subcellular location, its tissue specificity, its molecular weight or its species of origin,
all of which can change across species and as we learn more about a protein and its
main function and role.
Therefore the naming guidelines focus more on what should be avoided rather than on
precisely how a protein should be named. Examples of some of the naming guidelines
are shown here:
- An RN should not contain information about the molecular weight of the protein
  (e.g. "unicornase subunit A" is preferred to "unicornase 52 kDa subunit");
- An RN should not be based on the name of a disease (e.g. "Bloom syndrome
  protein" is not suitable);
- An RN should not be based on tissue specificity (e.g. "testis-specific protein ..." is
  not suitable);
- An RN must not include the species name (e.g. "Yeast Ku70 protein" is not
  suitable);
- An RN should not be based on the gene induction (e.g. "androgen-induced
  protein 1" is not suitable).
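The five example rules above can be approximated by a crude screening function. This is only a sketch: the regular expressions, the short species list and the function name are assumptions made for illustration and are not part of the UniProt guidelines. Such a check can flag candidate names for manual review, but it cannot replace it.

```python
import re
from typing import List

# Hypothetical heuristics, one per example rule listed above.
_CHECKS = [
    ("molecular weight", re.compile(r"\b[0-9]+(?:\.[0-9]+)?\s*kDa\b", re.IGNORECASE)),
    ("disease name", re.compile(r"\b(?:syndrome|disease)\b", re.IGNORECASE)),
    ("tissue specificity", re.compile(r"\b\w+-specific\b", re.IGNORECASE)),
    ("species name", re.compile(r"\b(?:yeast|human|mouse|rat)\b", re.IGNORECASE)),
    ("gene induction", re.compile(r"\b\w+-induced\b", re.IGNORECASE)),
]


def flag_recommended_name(name: str) -> List[str]:
    """Return the list of guideline issues that the heuristics detect in a name."""
    return [label for label, pattern in _CHECKS if pattern.search(name)]


print(flag_recommended_name("unicornase 52 kDa subunit"))    # ['molecular weight']
print(flag_recommended_name("unicornase subunit A"))         # []
print(flag_recommended_name("androgen-induced protein 1"))   # ['gene induction']
```

Weight-based names written without a unit, such as "p53", pass these heuristics unnoticed.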
The guideline document describes many rules, which the UniProt consortium follows
whenever possible when manually annotating Swiss-Prot entries. Of course, some
established names need to be kept even though they do not abide by the
naming guidelines. A famous example is “Cellular tumor antigen p53”, where the
p53 is based on the molecular weight of the observed human protein, a characteristic
that is not conserved in every species and is not very informative.
In the context of HPP, stakeholders should make use of the recommended names of the
human proteins that are provided by UniProt and also used in neXtProt. However,
recommended names are, like gene symbols, not stable, so HPP
needs to make it mandatory for data providers to also provide accession numbers.
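One way such a requirement could be checked automatically is sketched below. The row layout (a dictionary with an "accession" key) and the function name are hypothetical, since this note does not define a submission format; the accession pattern is the one published in the UniProt help pages, extended here with an optional isoform suffix.

```python
import re
from typing import Dict, List

# UniProtKB accession, optionally followed by "-n" for an isoform.
_ACCESSION = re.compile(
    r"(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})"
    r"(?:-[1-9][0-9]*)?"
)


def rows_missing_accession(rows: List[Dict[str, str]]) -> List[int]:
    """Return the indices of rows without a well-formed accession number."""
    return [
        index
        for index, row in enumerate(rows)
        if not _ACCESSION.fullmatch(row.get("accession", ""))
    ]


# Hypothetical submission rows, for illustration only.
rows = [
    {"accession": "P00533-1"},
    {"recommended_name": "Cellular tumor antigen p53"},   # name only: rejected
    {"accession": "B9A014"},
]
print(rows_missing_accession(rows))   # [1]
```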