Download What should be known about human gene nomenclature in - C-HPP

Nomenclature issues relevant to HPP
Version 1.0 / October 10, 2012 / Amos Bairoch

Here are a number of information items relating to the nomenclature of human biological "objects".

Genes

There is an organization that assigns symbols to human genes: the HUGO Gene Nomenclature Committee (HGNC) (see www.genenames.org). HGNC follows a number of guidelines and rules (see www.genenames.org/guidelines.html) and tries to ensure consistency in gene nomenclature with the mouse and rat communities. What should be known about human gene nomenclature in the context of HPP is the following:

- Human gene symbols generally consist of upper-case Latin letters, or of a combination of upper-case letters and Arabic numerals, with no embedded punctuation.
- There are exceptions to the above rule: for example, the so-called Corf genes, where the "orf" part is lower case (C1orf21, C2orf88, etc.), mitochondrially encoded genes, which have an embedded "-" (MT-CO3, MT-ND2, etc.), and some other miscellaneous but rare exceptions.
- Gene symbols are not stable. HGNC tries to make them as stable as possible, but it often happens that a "provisional" gene symbol, such as a Corf symbol or a symbol based solely on the presence of a domain, changes as new "functional" data become available.
- In contrast to other species such as yeast (the YnR/YnL naming system) or Drosophila (the CG numbers), there are NO stable symbols for human genes in any database or resource.
- The only stable entity is the HGNC gene accession number. Example: INS for insulin is HGNC:6081.
- Unfortunately, HGNC has not yet assigned gene symbols, and therefore accession numbers, to all human protein-coding genes. As stated on their web site, HGNC has "assigned unique gene symbols and names to over 33,000 human loci, of which around 19,000 are protein coding." UniProt, and thus also neXtProt, has tried, for the more than 1,000 genes with no official gene symbol, to assign a temporary name based on what authors have proposed, but this still leaves about 460 genes with no associated gene symbol.
- Ensembl makes use of the HGNC nomenclature for gene names and assigns to each gene (whether it is protein-coding or not) a stable identifier in the form "ENSG" followed by an 11-digit number (example: ENSG00000146648).
- When there is no HGNC symbol, Ensembl assigns an arbitrary gene symbol, often based on the sequencing contig of the original human genome sequencing project. Example: the UniProtKB/Swiss-Prot protein with the accession B9A014 is not yet part of HGNC and is called “AP000322.54” by Ensembl.
- Whenever possible, when the ortholog of a human gene exists in other vertebrate species, the same gene symbol is used. The casing will however be different: mouse gene symbols start with an upper-case letter followed by lower-case letters, while zebrafish symbols are all lower case. So “FGF1” in human is called “Fgf1” in mouse and rat and “fgf1” in zebrafish.
- As with other rules, there are exceptions to this conservation of gene symbols. There are two sets of exceptions:
  o Zinc finger genes: in human they are prefixed by “ZNF” and in mouse by “Zfp” and, to make things worse, the numbering is not conserved. For example, the ortholog of human ZNF22 is mouse Zfp422.
  o The Corf gene nomenclature is not used in mouse. This is logical, as there is no conservation of gene-to-chromosome mapping between the two species, only some regions of synteny. Instead, for Corf genes, the corresponding symbol assigned by MGI is based on a clone “name”. Example: the mouse ortholog of human C9orf117 is known as 1700019L03Rik.

Therefore, in the context of HPP, we can provide gene symbols for all but 460 genes. For these “missing” symbols it is necessary to use the UniProtKB/Swiss-Prot accession numbers.
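
The symbol and identifier conventions listed above can be sketched in a few lines of Python. This is only an illustration: the regular expressions are assumptions inferred from the examples quoted in this section (INS, HGNC:6081, ENSG00000146648), not official HGNC or Ensembl grammars, and the function names are hypothetical.

```python
import re

# Assumed patterns, inferred only from the examples quoted in the text.
TYPICAL_SYMBOL = re.compile(r"[A-Z][A-Z0-9]*")   # upper-case letters and digits, no punctuation
HGNC_ID = re.compile(r"HGNC:[0-9]+")             # e.g. HGNC:6081
ENSEMBL_GENE_ID = re.compile(r"ENSG[0-9]{11}")   # e.g. ENSG00000146648


def is_typical_symbol(symbol: str) -> bool:
    """True if the symbol follows the general rule; Corf and MT- symbols do not."""
    return TYPICAL_SYMBOL.fullmatch(symbol) is not None


def naive_ortholog_symbol(human_symbol: str, species: str) -> str:
    """Apply only the casing convention; this is a guess, not an ortholog lookup."""
    if species in ("mouse", "rat"):
        return human_symbol.capitalize()   # FGF1 -> Fgf1
    if species == "zebrafish":
        return human_symbol.lower()        # FGF1 -> fgf1
    raise ValueError(f"no casing rule known for species: {species}")


if __name__ == "__main__":
    for symbol in ("INS", "FGF1", "C1orf21", "MT-CO3"):
        print(symbol, is_typical_symbol(symbol))
    # The casing rule alone gives "Znf22", whereas the real mouse ortholog is Zfp422.
    print(naive_ortholog_symbol("ZNF22", "mouse"))
```

The last call shows why the casing rule cannot replace a real ortholog table: the zinc finger and Corf exceptions described above are not recoverable from the human symbol alone.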
Transcripts

For transcripts there is no stable or unified nomenclature system. There are only some stable identifiers in various databases. Individual literature reports have named some transcripts according to sequence length (example: short or long isoform), the localization of the final product (example: mitochondrial or cytoplasmic isoform), the molecular weight of the protein product (example: p50, p70), or simply with numbers or Greek letters.

For identifiers, the current situation is described below. As an example we have used EGFR (the gene coding for the EGF receptor).

UniProtKB/Swiss-Prot, and thus also neXtProt, assigns stable identifiers to the alternative isoforms (produced by alternative splicing, alternative initiation or alternative promoter usage).

- These identifiers are in the form AC-n (the accession number of the entry followed by a dash and a number).
- UniProt/neXtProt also assigns names, generally using numbers, and also reports those used in the literature, generally as synonyms. While the UniProt names are not supposed to be stable, in practice they rarely change.
- For EGFR, there are currently 4 "alternative" isoforms:

  P00533-1   Name: 1; Synonyms: p170
  P00533-2   Name: 2; Synonyms: p60, Truncated, TEGFR
  P00533-3   Name: 3; Synonyms: p110
  P00533-4   Name: 4

Ensembl assigns stable identifiers to transcripts in the form "ENST" followed by an 11-digit number (example: ENST00000275493). Transcripts are also assigned names of the type Gene_symbol-nnn (example: EGFR-001). Each transcript is assigned to one of the following categories (called "Biotype"): "Protein coding", "Processed transcript" or "Retained intron". For EGFR there are currently 11 "transcripts", 8 of which are predicted to be protein-coding.

The Consensus CDS (CCDS) project (see http://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi) is a collaborative effort by NCBI, EBI, the Sanger Institute and UCSC to identify a core set of human and mouse protein-coding regions that are consistently annotated and of high quality.
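
As a rough illustration of the UniProtKB and Ensembl identifier forms described above, the following Python sketch splits an isoform identifier of the form AC-n and looks up the EGFR isoforms by literature synonym. The patterns are deliberately loose assumptions based on the examples in this section (for instance, the accession part is only checked to be upper-case alphanumeric), and all function and variable names are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Loose, assumed patterns based only on the examples quoted in the text.
ISOFORM_ID = re.compile(r"(?P<accession>[A-Z0-9]+)-(?P<number>[1-9][0-9]*)")
ENSEMBL_TRANSCRIPT_ID = re.compile(r"ENST[0-9]{11}")   # e.g. ENST00000275493


class Isoform(NamedTuple):
    accession: str   # entry accession, e.g. "P00533"
    number: int      # isoform number after the dash


def parse_isoform_id(identifier: str) -> Isoform:
    """Split an AC-n identifier into its accession and isoform number."""
    match = ISOFORM_ID.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not of the form AC-n: {identifier!r}")
    return Isoform(match["accession"], int(match["number"]))


# The four EGFR isoforms as listed in the text: identifier -> (name, synonyms).
EGFR_ISOFORMS = {
    "P00533-1": ("1", ["p170"]),
    "P00533-2": ("2", ["p60", "Truncated", "TEGFR"]),
    "P00533-3": ("3", ["p110"]),
    "P00533-4": ("4", []),
}


def isoform_for_synonym(synonym: str) -> Optional[str]:
    """Return the stable identifier whose synonym list contains the given name."""
    for identifier, (_name, synonyms) in EGFR_ISOFORMS.items():
        if synonym in synonyms:
            return identifier
    return None


if __name__ == "__main__":
    print(parse_isoform_id("P00533-2"))
    print(isoform_for_synonym("p110"))
```

Mapping a literature name such as "p110" back to the stable identifier is the direction that matters in practice, since only the identifier is meant to stay fixed.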
CCDS assigns an accession number to each transcript, so each human gene can be associated with a number of different protein-coding transcripts. For EGFR here is the current data:

It should be noted that CCDS accession numbers are composed of two numerical parts, separated by a period. The first is an arbitrarily assigned number that is stable; the second is a version number that is incremented whenever the nucleotide sequence is changed (which does not necessarily imply that the protein sequence has changed).

In the context of HPP, as the identification will be done using the UniProtKB/Swiss-Prot “complete proteome” set (a document will be drafted about what this really means), one needs to indicate not a “name” but the stable identifier of the alternative form (example: P00533-1). To improve readability, it may be desirable to add the gene symbol when it exists, and to refer to this isoform as “EGFR P00533-1”.

Proteins

Protein name nomenclature has a long history, with deep roots in the history of science. There have been many attempts to make the biological community abide by standardized naming schemes, some of which, like the effort of the IUPAC-IUBMB Enzyme Committee, date back to the 1960s. Unfortunately, sociological aspects of the life science research process make this endeavor almost intractable. As Michael Ashburner of the University of Cambridge once famously quipped, “Biologists would rather share their toothbrush than share a gene name”.
This statement can seem contradictory, as we have just seen that there is a standardized gene nomenclature (not only for human genes but for almost 20 different model organisms). Unfortunately, the recommended nomenclature is often not followed (at least this is something that HPP can enforce in the context of the project), and the community that needs to make use of human gene and protein names is much bigger and more heterogeneous than more “focused” communities such as those of yeast or Drosophila researchers. What will please an enzymologist will be anathema to a developmental biologist and totally obscure to a medical researcher or a geneticist.

Some groups have been successful in imposing a unified nomenclature for specific protein families or groups. Good examples are the efforts to produce unified names for cytokines (the interleukin nomenclature), integrins and cell differentiation molecules (the so-called CD antigens). But these efforts concern only a very small percentage of all human proteins, and they can also be confusing when the proteins being named fall into two or more categories. As an example, the protein that should be called integrin beta-1 is also called CD29, depending on whether you are an integrin researcher or a CD aficionado.

The World Health Organization (WHO) tries to insist on the use of INNs (International Nonproprietary Names) for proteins used as drugs, but these are available for fewer than 100 human proteins and are rarely used outside the pharmacological community (example: the INN of interleukin-2 is aldesleukin, a name found in 97 PubMed abstracts versus 52,700 for the usual name).
All these issues led UniProtKB/Swiss-Prot, more than 10 years ago, to embark on two complementary efforts:

1) Drafting and distributing guidelines on how to name a protein;
2) Attributing, for each protein in the database, a recommended name (RN) following as far as possible the rules listed in the naming guideline document.

The UniProtKB protein naming guidelines, which are also used at NCBI and in other resources, are available at www.uniprot.org/docs/nameprot

This document emphasizes the use of a recommended protein name (RN) that is as “neutral” as possible. One reason for this is that it should be possible to propagate a protein name to all orthologous proteins from various organisms. This is why, ideally, the recommended protein name should not contain a specific characteristic of the protein: in particular, it should not reflect the function or role of the protein, its subcellular location, its tissue specificity, its molecular weight or its species of origin, all of which can change across species and as we learn more about a protein and its main function and role. Therefore the naming guidelines focus more on what should be avoided than on precisely how a protein should be named. Examples of some of the naming guidelines are shown here:

- An RN should not contain information about the molecular weight of the protein (e.g. "unicornase subunit A" is preferred to "unicornase 52 kDa subunit");
- An RN should not be based on the name of a disease (e.g. "Bloom syndrome protein" is not suitable);
- An RN should not be based on tissue specificity (e.g. "testis-specific protein ..." is not suitable);
- An RN must not include the species name (e.g. "Yeast Ku70 protein" is not suitable);
- An RN should not be based on gene induction (e.g. "androgen-induced protein 1" is not suitable).

The guideline document describes many rules, which the UniProt consortium follows whenever possible when manually annotating Swiss-Prot entries.
Of course, some established names need to be kept even though they do not abide by the nomenclature guidelines. A famous example is “Cellular tumor antigen p53”, where the p53 is based on the molecular weight of the observed human protein, a characteristic that is not conserved in every species and is not very informative.

In the context of HPP, stakeholders should make use of the recommended names of the human proteins that are provided by UniProt and that are also used in neXtProt. But one needs to know that recommended names are, like gene symbols, not stable; HPP therefore needs to make it mandatory for data providers to also provide accession numbers.
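
One way to picture this requirement is a submission record in which the accession number is the only mandatory field, while the recommended name and gene symbol are optional decoration. The sketch below is purely hypothetical: the record layout, field names and functions are assumptions made for illustration and do not describe any actual HPP submission format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProteinReport:
    """Hypothetical record submitted by a data provider (illustrative fields only)."""
    accession: str                          # stable identifier, e.g. "P00533-1"
    recommended_name: Optional[str] = None  # may change over time
    gene_symbol: Optional[str] = None       # may change over time, or not exist at all


def validate(report: ProteinReport) -> List[str]:
    """Return a list of problems; names alone are never sufficient."""
    problems = []
    if not report.accession.strip():
        problems.append("missing accession number")
    return problems


def display_label(report: ProteinReport) -> str:
    """Prefix the accession with the gene symbol when one exists, for readability."""
    if report.gene_symbol:
        return f"{report.gene_symbol} {report.accession}"
    return report.accession


if __name__ == "__main__":
    egfr = ProteinReport(accession="P00533-1", gene_symbol="EGFR")
    print(display_label(egfr))   # EGFR P00533-1
    name_only = ProteinReport(accession="", recommended_name="Epidermal growth factor receptor")
    print(validate(name_only))   # ['missing accession number']
```

A protein without a gene symbol, such as the B9A014 example from the Genes section, would simply be labelled by its accession.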
