Appendix 1 - HUGO Gene Nomenclature Committee

Appendix 1: symbol assignment flowchart

DATA INPUT
Gene annotated by CCDS
Gene symbol submission by researcher
Gene identified by HGNC from publication
Gene identified by HGNC from database
Gene symbol suggested by other nomenclature group

LOCUS TYPE DETERMINATION
Identify locus type according to external annotation, e.g. RefSeq, HAVANA, Pseudogene.org, SwissProt
If protein coding, follow table 1a
For pseudogenes, follow table 1b
For ncRNA genes, follow table 1c
For other locus types, e.g. ERVs, immunoglobulin genes, follow specific guidelines

Table 1a: flowchart for naming protein-coding genes

SEQUENCE ANALYSIS
Compare with HGNC database via in-house BLAST and external BLAST
Identify genomic location via BLAT and/or in-house “map by coordinates” tool
Identify orthologs via HCOP
Analyse protein sequence for domains, motifs, TM regions using Pfam, TMHMM etc.

SYMBOL DESIGNATION
Determine if there is a known function via literature and database searches, and correspondence with researchers. If yes, assign a unique symbol and name based on function, e.g. ACAT1 (acetyl-CoA acetyltransferase 1).
If gene is a member of an established family, name with next available symbol in the family series (in coordination with specialist advisor). If the family has no established nomenclature, consider creating a new naming scheme in consultation with the research community. If the family has no known function, name as a FAM#.
If gene has no known function but is a paralog of a known gene, assign an appropriate symbol based on the nomenclature of the known gene, e.g. ADAL (adenosine deaminase like).
If gene is an ortholog of a gene with known function in another species, assign an appropriate symbol with “homolog” included in the gene name, e.g. CDC6 (cell division cycle 6 homolog).
If gene product contains known protein domains/motifs/TM regions, name based on these features, e.g. ABHD1 (abhydrolase domain containing 1).
Try to find other information from publications, databases or directly from researchers, e.g. cellular location, tissue specificity, chromosomal location, and name on this basis.
If the gene cannot be named via any of the above steps, assign a C$orf# (chromosome $ open reading frame) symbol.

GENE SYMBOL DISSEMINATION
Contact researchers about release of symbol, and to confirm symbol will be used in subsequent publications
Release symbol in public database for dissemination to NCBI Gene, Ensembl, UniProt, GeneCards, Vega, UCSC, locus specific databases etc.
Coordinate symbol update with other nomenclature committees, especially mouse

Table 1b: flowchart for naming pseudogenes

SEQUENCE ANALYSIS
Compare with HGNC database via in-house BLAST to check that the pseudogene is not already named
Identify genomic location via BLAT and/or in-house “map by coordinates” tool
Identify parent human gene, relevant human gene family, or functional ortholog in other species via BLAST and comparison of annotations in external databases

SYMBOL DESIGNATION
Where possible, name the pseudogene after its parent protein-coding gene; use the symbol format “parent gene symbol P#”, e.g. CCNJP1; use the gene name format “parent gene name pseudogene #”, e.g. cyclin J pseudogene 1.
If the gene has no specific identifiable parent gene (unprocessed pseudogenes can be present in clusters with protein-coding genes of the same family), name the pseudogene within the gene family series but denote pseudogene status using a “P” at the end of the symbol. Use the gene symbol format “family stem symbol #P”, e.g. ZNF890P; use the gene name format “family stem name #, pseudogene”, e.g. zinc finger protein 890, pseudogene.
If the gene has a functional ortholog in a different species, name after this ortholog; use the gene symbol format “ortholog symbol P”, e.g. GULOP; use the name format “ortholog name, pseudogene”, e.g. gulonolactone (L-) oxidase, pseudogene.
Exceptions to the above rules include: symbols that do not follow our rules but that are entrenched in the literature, and pseudogenes that are part of established nomenclature systems that follow a different naming convention, e.g. T cell receptor pseudogenes. If unable to include a P in the symbol, then add the word (pseudogene) at the end of the gene name, e.g. symbol: TRAJ51, name: T cell receptor alpha joining 51 (pseudogene).

GENE SYMBOL DISSEMINATION
Contact researchers about release of symbol, and to confirm symbol will be used in subsequent publications
Release symbol in public database for dissemination to NCBI Gene, Ensembl, UniProt, GeneCards, Vega, UCSC, locus specific databases etc.
Coordinate symbol update with other nomenclature committees if unitary pseudogene

Table 1c: flowchart for naming non-coding RNA genes

SEQUENCE ANALYSIS
Compare with HGNC database via in-house BLAST and external BLAST to look for homologous ncRNAs
Identify genomic location via BLAT and/or in-house “map by coordinates” tool
Perform secondary structure analysis

SYMBOL DESIGNATION
If the gene is a member of an established small ncRNA class (as established by homology), name with next available symbol in the family series (in coordination with specialist advisor), e.g. MIR100. If the class has no established nomenclature, then create a new naming scheme in consultation with the research community.
If the transcript product of a small ncRNA is predicted not to have the secondary structure required to function as a member of that class, then it is named as a pseudogene and given the next available symbol in the family series, appended with a “P” for “pseudogene”, e.g. RNU7-2P.
If the gene encodes a long non-coding RNA (lncRNA) (>200 bp), then first determine if there is a known function via literature and database searches, and correspondence with researchers. If yes, assign a unique symbol and name based on function, e.g. XIST.
If the lncRNA has no known function, then it should be named based on its genomic location with reference to the closest protein-coding gene. Antisense lncRNA gene symbols have the ‘-AS’ suffix appended to the protein-coding symbol (e.g. BOK-AS1). Likewise, intronic lncRNA gene symbols have the ‘-IT’ suffix (e.g. SPRY4-IT1) and overlapping lncRNA gene symbols have the ‘-OT’ suffix (e.g. HMBOX1-OT1). Intergenic lncRNA genes are named with the next consecutive LINC# number, e.g. LINC00028.

GENE SYMBOL DISSEMINATION
Contact researchers about release of symbol, and to confirm symbol will be used in subsequent publications
Release symbol in public database for dissemination to NCBI Gene, Ensembl, UniProt, GeneCards, Vega, UCSC, locus specific databases etc.
Coordinate symbol update with other nomenclature committees and/or specialist ncRNA resources
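
The symbol formats described in tables 1b and 1c are essentially string patterns, so they can be illustrated with a few small helper functions. The Python sketch below is only an illustration: the function names, parameter names and the LNCRNA_SUFFIXES mapping are hypothetical and are not part of any HGNC tool. It also assumes that the ortholog symbol is supplied already in upper case (e.g. GULO) and that the chromosome is passed as text so that X and Y are allowed. It covers only symbol construction, not the checks for uniqueness or the consultation steps in the flowcharts.

```python
# Illustrative sketch only: these helpers are hypothetical and are not HGNC software.

# Suffixes for lncRNAs named relative to the closest protein-coding gene (table 1c).
LNCRNA_SUFFIXES = {"antisense": "AS", "intronic": "IT", "overlapping": "OT"}


def pseudogene_from_parent(parent_symbol: str, number: int) -> str:
    """Format 'parent gene symbol P#', e.g. CCNJ, 1 -> CCNJP1."""
    return f"{parent_symbol}P{number}"


def pseudogene_in_family(family_stem: str, number: int) -> str:
    """Format 'family stem symbol #P', e.g. ZNF, 890 -> ZNF890P."""
    return f"{family_stem}{number}P"


def pseudogene_from_ortholog(ortholog_symbol: str) -> str:
    """Format 'ortholog symbol P', e.g. GULO -> GULOP (upper-case input assumed)."""
    return f"{ortholog_symbol}P"


def positional_lncrna(coding_symbol: str, relation: str, number: int) -> str:
    """Append -AS#, -IT# or -OT# to the closest protein-coding gene symbol."""
    if relation not in LNCRNA_SUFFIXES:
        raise ValueError(f"unknown relation: {relation!r}")
    return f"{coding_symbol}-{LNCRNA_SUFFIXES[relation]}{number}"


def placeholder_orf(chromosome: str, number: int) -> str:
    """Format 'C$orf#', where $ is the chromosome and # a serial number."""
    return f"C{chromosome}orf{number}"
```

For instance, pseudogene_from_parent("CCNJ", 1) gives CCNJP1 and positional_lncrna("BOK", "antisense", 1) gives BOK-AS1, matching the examples in the tables.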
