Download What is a GO term? - NDSU Computer Science

VERTIGO (Vertical Gene Ontology)
Biologists waste time searching for all available information about each small area of research.
The search is hampered further by variations in the terminology in common usage at any given time,
which inhibit effective searching by computers as well as people.
E.g., In a search for new targets for antibiotics, you want all gene products involved in bacterial
protein synthesis, that have significantly different sequence or structure from those in humans.
If one DB says these molecules are involved in 'translation' and another uses 'protein synthesis',
it is difficult for you - and even harder for a computer - to find functionally equivalent terms.
GO is an effort to address the need for consistent descriptions of gene products in different DBs.
The project began in 1998 as a collaboration between three model organism databases:
FlyBase (Drosophila),
Saccharomyces Genome Database (SGD),
Mouse Genome Database (MGD).
Since then, the GO Consortium has grown to include several of the world's major repositories for
plant, animal and microbial genomes. See the GO web page for a full list of member orgs.
GO has 3 structured, controlled vocabularies (ontologies) describing gene products
(the RNA or protein resulting from expression of a gene) in terms of their species-independent, associated
biological processes (BP),
cellular components (CC),
molecular functions (MF).
There are three separate aspects to this effort. The GO consortium
1. writes and maintains the ontologies themselves;
2. makes associations between the ontologies and genes / gene products in the collaborating DBs;
3. develops tools that facilitate the creation, maintenance and use of ontologies.
The use of GO terms by several collaborating databases facilitates uniform queries across them.
The controlled vocabularies are structured so that you can query them at different levels: e.g.,
1. use GO to find all gene products in the mouse genome that are involved in signal transduction,
2. zoom in on all the receptor tyrosine kinases.
This structure also allows annotators to assign properties to gene products at different levels,
depending on how much is known about a gene product.
GO is not a database of gene sequences or a catalog of gene products.
GO describes how gene products behave in a cellular context.
GO is not a way to unify biological databases (i.e. GO is not a 'federated solution').
Sharing vocabulary is a step towards unification, but is not sufficient. Reasons include:
Knowledge changes and updates lag behind.
Curators evaluate data differently. We can agree to use the word 'kinase', but we must also support
this by stating how and why we use 'kinase', and apply it consistently. Only in this way
can we hope to compare gene products and determine whether they are related.
GO does not attempt to describe every aspect of biology. For example, domain structure, 3D
structure, evolution and expression are not described by GO.
GO is not a dictated standard, mandating nomenclature across databases.
Groups participate because of self-interest, and cooperate to arrive at a consensus.
The 3 organizing GO principles: molecular function, biological process, cellular component.
A gene product has one or more molecular functions and is used in one or more biological
processes; it might be associated with one or more cellular components.
E.g., the gene product cytochrome c can be described by the molecular function term
oxidoreductase activity, the biological process terms oxidative phosphorylation and
induction of cell death, and the cellular component terms mitochondrial matrix and
mitochondrial inner membrane.
The three organizing principles of the GO (Molecular Function):
Molecular function describes e.g., catalytic or binding activities, at the molecular level.
GO molecular function terms represent activities rather than the entities (molecules / complexes)
that perform actions, and do not specify where or when, or in what context, the action takes place.
Molecular functions correspond to activities that can be performed by individual gene products,
but some activities are performed by assembled complexes of gene products.
Examples of broad functional terms are catalytic activity, transporter activity, or binding;
Examples of narrower functional terms are adenylate cyclase activity or Toll receptor binding.
It is easy to confuse a gene product with its molecular function, and for that reason many GO
molecular functions are appended with the word "activity".
The documentation on gene products explains this confusion in more depth.
Organizing GO principles (Biological Process; Cellular Component)
A Biological Process is a series of events accomplished by one or more ordered assemblies of
molecular functions.
Examples of broad biological process terms: cellular physiological process or signal transduction.
Examples of more specific terms are pyrimidine metabolism or alpha-glucoside transport.
It can be difficult to distinguish between a biological process and a molecular function, but the
general rule is that a process must have more than one distinct step.
A biological process is not equivalent to a pathway. We are specifically not capturing or trying to
represent any of the dynamics or dependencies that would be required to describe a pathway.
A cellular component is just that, a component of a cell but with the proviso that it is part of
some larger object, which may be an anatomical structure (e.g. rough endoplasmic reticulum or
nucleus) or a gene product group (e.g. ribosome, proteasome or a protein dimer).
What does the Ontology look like?
GO terms are organized in structures called directed acyclic graphs (DAGs), which differ from
hierarchies in that a child (more specialized term) can have many parents (less specialized terms).
For example, the biological process term hexose biosynthesis has two parents, hexose
metabolism and monosaccharide biosynthesis. This is because biosynthesis is a subtype of
metabolism, and a hexose is a type of monosaccharide.
When any gene involved in hexose biosynthesis is annotated to this term, it is automatically
annotated to both hexose metabolism and monosaccharide biosynthesis, because every GO term
must obey the true path rule: if the child term describes the gene product, then all its parent terms
must also apply to that gene product.
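The true path rule can be illustrated with a small sketch. The dictionary below is a hypothetical toy ontology, not GO data: the two parents of hexose biosynthesis come from the example above, while the shared grandparent 'monosaccharide metabolism' and the function names are assumptions made only for illustration.

```python
# Toy ontology: each child term maps to the set of its parent terms.
# 'monosaccharide metabolism' is an assumed grandparent, added for illustration.
PARENTS = {
    "hexose biosynthesis": {"hexose metabolism", "monosaccharide biosynthesis"},
    "hexose metabolism": {"monosaccharide metabolism"},
    "monosaccharide biosynthesis": {"monosaccharide metabolism"},
    "monosaccharide metabolism": set(),
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def implied_annotations(term, parents=PARENTS):
    """Terms that apply to a gene product annotated directly to `term`."""
    return {term} | ancestors(term, parents)


print(sorted(implied_annotations("hexose biosynthesis")))
```

Because a child can have several parents, the upward walk has to track visited terms; otherwise a shared ancestor would be processed once per path.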
It is easy to confuse a gene product and its molecular function, because very often these are
described in exactly the same words. For example, 'alcohol dehydrogenase' can describe what
you can put in an Eppendorf tube (the gene product) or it can describe the function of this stuff.
There is, however, a formal difference: a single gene product might have several molecular
functions, and many gene products can share a single molecular function.
For example, there are many gene products that have the function 'alcohol dehydrogenase'.
Some, but by no means all, of these are encoded by genes with the name alcohol dehydrogenase.
A particular gene product might have both the functions 'alcohol dehydrogenase' and
'acetaldehyde dismutase', and perhaps other functions as well.
It's important to grasp that, whenever we use terms such as alcohol dehydrogenase activity in
GO, we mean the function, not the entity; for this reason, most GO molecular function terms are
appended with the word 'activity'.
Many gene products associate into entities that function as complexes, or 'gene product groups',
which often include small molecules. They range in complexity from the relatively simple (for
example, hemoglobin contains the gene products alpha-globin and beta-globin, and the small
molecule heme) to complex assemblies of numerous different gene products, e.g., the ribosome.
At present, small molecules are not represented in GO. In the future, we might be able to create
cross products by linking GO to existing databases of small molecules such as Klotho or LIGAND.
How do the terms in GO become associated with their appropriate gene products?
Collaborating databases annotate their gene products (or genes) with GO terms, providing
references and indicating what kind of evidence is available to support the annotations.
More information can be found in the GO Annotation Guide.
If you browse any of the contributing databases, you'll find that each gene or gene product has a
list of associated GO terms. Each database also publishes a table of these associations, and these
are freely available from the GO ftp site.
You can also browse the ontologies using a range of web-based browsers. A full list of these, and
other tools for analyzing gene function using GO, is available on the GO Tools page.
In addition, the GO consortium has prepared GO slims, 'slimmed down' versions of the
ontologies that allow you to annotate genomes or sets of gene products to gain a high-level view
of gene functions.
Using GO slims you can, for example, work out what proportion of a genome is involved in
signal transduction, biosynthesis or reproduction. See the GO Slim Guide for more information.
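As a rough sketch of how a slim gives a high-level view, the code below maps each gene's detailed annotations up to a small set of slim terms and reports the fraction of genes under each. The ontology, the slim set, the gene names and the term 'receptor tyrosine kinase signaling' are all hypothetical; real GO slims are distributed by the consortium and are not reproduced here.

```python
from collections import Counter

# Hypothetical toy ontology (child -> parents) and slim term set.
PARENTS = {
    "receptor tyrosine kinase signaling": {"signal transduction"},
    "hexose biosynthesis": {"biosynthesis"},
    "signal transduction": set(),
    "biosynthesis": set(),
    "reproduction": set(),
}
SLIM = {"signal transduction", "biosynthesis", "reproduction"}


def ancestors(term):
    """Return all terms above `term` in the toy ontology."""
    seen, stack = set(), [term]
    while stack:
        for parent in PARENTS.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def slim_terms(term):
    """Slim terms that `term` is, or descends from."""
    return ({term} | ancestors(term)) & SLIM


def slim_proportions(annotations):
    """Fraction of genes that fall under each slim term."""
    if not annotations:
        return {}
    counts = Counter()
    for terms in annotations.values():
        hit = set()
        for term in terms:
            hit |= slim_terms(term)
        counts.update(hit)  # each gene counts once per slim term
    return {slim: counts[slim] / len(annotations) for slim in sorted(SLIM)}


genes = {  # hypothetical annotations
    "gene1": {"receptor tyrosine kinase signaling"},
    "gene2": {"hexose biosynthesis"},
    "gene3": {"receptor tyrosine kinase signaling", "hexose biosynthesis"},
    "gene4": {"reproduction"},
}
print(slim_proportions(genes))
```

A gene can fall under more than one slim term, so the proportions need not sum to 1.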
All data from the GO project is freely available. You can download the ontology data in a number
of different formats, including XML and mySQL, from the GO Downloads page.
For more information on the syntax of these formats, see the GO File Format Guide.
If you need lists of the genes or gene products that have been associated with a particular GO
term, the Current Annotations table tracks the number of annotations and provides links to the
gene association files for each of the collaborating databases.
GO allows us to annotate genes and their products with a limited set of attributes.
For example, GO does not allow us to describe genes in terms of which cells or tissues they're
expressed in, which developmental stages they're expressed at, or their involvement in disease.
It is not necessary for GO to do these things because other ontologies are being developed for
these purposes. The GO consortium supports the development of other ontologies and makes its
tools for editing and curating ontologies freely available.
A list of freely available ontologies that are relevant to genomics and proteomics and are
structured similarly to GO can be found at the Open Biomedical Ontologies website. A larger list,
which includes the ontologies listed at OBO and also other controlled vocabularies that do not
fulfil the OBO criteria, is available at the Ontology Working Group page of the Microarray Gene
Expression Data Society (MGED).
Cross-products: The existence of several ontologies will also allow us to create 'cross-products'
that maximize the utility of each ontology while avoiding redundancy. For example, by
combining the developmental terms in the GO process ontology with a second ontology that
describes Drosophila anatomical structures, we could create an ontology of fly development.
We could repeat this process for other organisms without having to clutter up GO with large
numbers of species-specific terms. Similarly, we could create an ontology of biosynthetic
pathways by combining biosynthesis terms in the GO process ontology with a chemical ontology.
Mappings to other classification systems
GO is not the only attempt to build structured controlled vocabularies for genome annotation. Nor
is it the only such series of catalogs in current use. We have attempted to make translation tables
between these catalogs and GO. We caution that these mappings are neither complete nor exact;
they are to be used as a guide. One reason for this is absence of definitions from many of the
other catalogs and of a complete set of definitions in GO itself. More information on the syntax
of these mappings can be found in the GO File Format Guide.
Contributing to GO
The GO project is constantly evolving, and we welcome feedback from all users. If you need a
new term or definition, or would like to suggest that we reorganize a section of one of the
ontologies, please do so through our online request-tracking system, which is hosted by
SourceForge.net. Errors or omissions in annotations should be reported to the GO annotation mailing list.
You can also send questions or suggestions to GOHELP. More information on mailing lists is
available from the mailing lists page.
What is a GO term?
The purpose of GO is to define particular attributes of gene products.
A term is simply the text string used to describe an entry in GO, e.g. cell, fibroblast growth factor
receptor binding or signal transduction. A node refers to a term and all its children.
GO does not contain the following:
Gene products: e.g. cytochrome c is not in GO; attributes of it, e.g., oxidoreductase activity, are.
Processes, functions or components that are unique to mutants or diseases: e.g. oncogenesis is not
a valid GO term because causing cancer is not the normal function of any gene.
Attributes of sequence such as intron/exon parameters: these are not attributes of gene products
and will be described in a separate sequence ontology (see OBO web site for more information).
Protein domains or structural features.
Protein-protein interactions.
General conventions when adding terms
The following stylistic points should be applied to all aspects of the ontologies.
Spelling conventions: Where there are differences in the accepted spelling between English and
US usage, use the US form, e.g. polymerizing, signaling, rather than polymerising, signalling.
There is a dictionary of 'words' used in GO terms in the file GODict.DAT.
Abbreviations: Avoid abbreviations unless they're self-explanatory. Use full element names, not
symbols. Use hydrogen for H+. Use copper and zinc rather than Cu and Zn. Use copper(I),
copper(II), etc., rather than cuprous, cupric, etc. For biomolecules, spell out the term in full
wherever practical: use fibroblast growth factor, not FGF.
Greek symbols: Spell out Greek symbols in full: e.g. alpha, beta, gamma.
Case: GO terms are all lower case except where demanded by context, e.g. DNA, not dna.
Singular/plural: Use singular, except where a term is only used in plural (e.g. caveolae).
Be descriptive: Be reasonably descriptive, even at the risk of verbal redundancy. Remember,
DBs that refer to GO terms might list only the finest-level terms associated with a particular gene
product. If the parent is aromatic amino acid family biosynthesis, then the child should be
aromatic amino acid family biosynthesis, anthranilate pathway, not just anthranilate pathway.
Anatomical qualifiers: Do not use anatomical qualifiers in the cellular process and molecular
function ontologies. For example, GO has the molecular function term DNA-directed DNA
polymerase activity but neither nuclear DNA polymerase nor mitochondrial DNA polymerase.
These terms with anatomical qualifiers are not necessary because annotators can use the cellular
component ontology to attribute location to gene products, independently of process or function.
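Some of these conventions are mechanical enough to check automatically. The following is a minimal sketch of such a check, covering only the Greek-symbol and US-spelling rules with the examples given above; the function name and the two lookup tables are assumptions, and this is not a tool provided by the GO consortium.

```python
import re

# Small illustrative lookup tables, limited to the examples in the text.
GREEK = {"α": "alpha", "β": "beta", "γ": "gamma"}
US_SPELLING = {"polymerising": "polymerizing", "signalling": "signaling"}


def style_issues(name):
    """Return a list of style problems found in a proposed term name."""
    issues = []
    for symbol, word in GREEK.items():
        if symbol in name:
            issues.append(f"spell out {symbol} as {word}")
    for british, american in US_SPELLING.items():
        if re.search(rf"\b{british}\b", name):
            issues.append(f"use {american}, not {british}")
    return issues


print(style_issues("transforming growth factor β receptor signalling pathway"))
print(style_issues("fibroblast growth factor binding"))
```

Rules such as "lower case except where demanded by context" need human judgment and are deliberately left out of the sketch.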
Synonyms
When several words or phrases could be used as the term name, one form is chosen as the term name,
while the other possible names are added as synonyms.
Despite the name, GO synonyms are not always 'synonymous' in the strictest sense of the word, as they do
not always mean exactly the same as the term they are attached to.
Instead, a GO synonym may be broader or narrower than the term string; it may be a related phrase; it may be
alternative wording, spelling or use a different system of nomenclature; or it may be a true synonym. This
flexibility allows GO synonyms to serve as valuable search aids, as well as being useful for applications such
as text mining and semantic matching.
Having a single, broad relationship between a GO term and its synonyms is adequate for most search
purposes, but for other applications such as semantic matching, the inclusion of a more formal relationship
set is valuable. Thus, GO records a relationship type for each synonym, stored in the OBO format flat file.
Synonym types: The synonym relationship types are:
term is an exact synonym (ornithine cycle is an exact synonym of urea cycle)
terms are related (cytochrome bc1 complex is related to ubiquinol-cytochrome-c reductase activity)
synonym is broader than the term name (cell division is a broad synonym of cytokinesis)
synonym is narrower or more precise (pyrimidine-dimer repair by photolyase is a narrow synonym of photoreactive repair)
synonym is related to, but not exact, broader or narrower (virulence has synonym type of other related to term pathogenesis)
These types form a loose hierarchy:
related
[i] exact synonym
[i] broad synonym
[i] narrow synonym
[i] other related synonym
The default relationship is related to, as all synonyms are in some way related to the term name, but more
specific relationships are assigned where possible. The synonym type other related is used where the
relationship between a term and its synonym is NOT exact, narrower or broader.
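To show why recording a type per synonym is useful, here is a small sketch of a search that can either accept any synonym or only exact ones. The enum, the table layout and the function name are assumptions for illustration; the term and synonym pairs are the examples quoted above.

```python
from enum import Enum


class SynonymType(Enum):
    RELATED = "related"
    EXACT = "exact"
    BROAD = "broad"
    NARROW = "narrow"
    OTHER_RELATED = "other related"


# (term name, synonym text, synonym type), taken from the examples in the text.
SYNONYMS = [
    ("urea cycle", "ornithine cycle", SynonymType.EXACT),
    ("cytokinesis", "cell division", SynonymType.BROAD),
    ("photoreactive repair", "pyrimidine-dimer repair by photolyase", SynonymType.NARROW),
    ("pathogenesis", "virulence", SynonymType.OTHER_RELATED),
]


def find_terms(query, allowed=None):
    """Return term names matching `query` by name or by synonym.

    `allowed` optionally restricts which synonym types may match.
    """
    wanted = query.lower()
    hits = []
    for term, synonym, kind in SYNONYMS:
        if term.lower() == wanted:
            hits.append(term)
        elif synonym.lower() == wanted and (allowed is None or kind in allowed):
            hits.append(term)
    return hits


print(find_terms("cell division"))                               # loose search
print(find_terms("cell division", allowed={SynonymType.EXACT}))  # strict matching
```

A loose search suits browsing, while semantic matching would typically restrict itself to exact synonyms.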
In some cases, broader and narrower synonyms are created in the place of new parent or child terms because
some synonym strings may not be valid GO terms but may still be useful for search purposes. This may be
because the synonym is the name of a gene product e.g. ubiquitin-protein ligase activity has the narrower
synonym E3, as E3 is a specific gene product with ubiquitin-protein ligase activity.
Adding synonyms
When you add a synonym using DAG-Edit, choose a type from the pull-down selector (see the DAG-Edit
user guide for more information). DAG-Edit will incorporate the synonym type into the OBO format flat file
when you save. The default synonym type is the broadest, 'synonym' (equivalent to 'related' above).
The number of synonyms for a term is not limited, and the same text string can be used for more than one GO term.
Add synonyms if you edit a term name but the old name is still a valid synonym; for example, if you change
respiration to cellular respiration, keep respiration as a synonym. This helps other users find familiar terms.
Add synonyms if the term has (or contains) a commonly used abbreviation. For example, FGF binding could
be used as a synonym for fibroblast growth factor binding.
Do not add a synonym if the only difference is case (e.g. start vs. START). Synonyms, like term names, are
all lower case except where demanded by context (e.g. DNA, not dna).
The synonyms found in GO and their relationships to the term string with which they are associated are
available as a text file. Details on file format can be found in the accompanying ReadMe file.
Rules For Synonyms
Acronyms are exactly synonymous with the full name (if the acronym is not used in any other sense elsewhere)
'Jargon' type phrases are exactly synonymous with the full name (if the phrase is not used in any other sense elsewhere)
proton is exactly synonymous with hydrogen in most senses EXCEPT where hydrogen means H2 (i.e. gas)
Include implicit information when making a decision; take into account which ontology the term is in - e.g. an
entry term that ends in 'factor' is not synonymous with a molecular function.
ligand is NOT exactly synonymous with binding (ligand is an entity, binding an action)
XXX receptor ligand is NOT exactly synonymous with XXX (XXX is only one of the potential ligands so
XXX receptor ligand is broader than XXX)
XXX complex is NOT exactly synonymous with XXX (XXX is ambiguous - could describe activity of XXX)
porter and transporter are NOT exactly synonymous (transporter is broader)
symporter/antiporter and transporter are NOT exactly synonymous (transporter is broader)
Cross-referencing other databases
General database cross references (general dbxrefs) should be used whenever a GO term has an
identical meaning to an object in another database. Some ex. of common general dbxrefs in GO:
Ontology    DB                               Sample dbxref
Function    Enzyme Commission                EC:3.5.1.6
Function    Transport Protein Database       TC:2.A.29.10.1
Function    Biocatalysis/Biodegradation DB   UM-BBD_enzymeID:e0310
Function    Biocatalysis/Biodegradation DB   UM-BBD_pathwayID:dcb
Function    MetaCyc Metabolic Pathway DB     MetaCyc:XXXX-RXN
Process     MetaCyc Metabolic Pathway DB     MetaCyc:2ASDEG-PWY
Component   None
The GO.xrf_abbs file is maintained by the BioMOBY project, so to make
changes to the file, you need to use their web form.
Understanding relationships in GO
The GO ontologies are structured as a directed acyclic graph (DAG), which means that a
child (more specialized) term can have multiple parents (less specialized terms).
This makes GO a powerful system to describe biology, but creates some pitfalls for curators.
Keeping the following guidelines in mind should help you to avoid these problems.
A child term can have one of two different relationships to its parent(s): is_a or part_of.
The same term can have different relationships to different parents; for example, the child
'GO term 3' may be an is_a child of parent 'GO term 1' and a part_of child of parent 'GO term 2'.
In GO, an is_a relationship means that the term is a subclass of its parent. For example,
mitotic cell cycle is_a cell cycle. A subclass should not be confused with an 'instance', which is a specific example.
For example, clogs are a subclass or is_a of shoes, while the shoes I have on my feet now are
an instance of shoes. GO, like most ontologies, does not use instances. The is_a relationship
is transitive, which means that if 'GO term A' is a subclass of 'GO term B', and 'GO term
B' is a subclass of 'GO term C', 'GO term A' is also a subclass of 'GO term C'.
For example:
Terminal N-glycosylation is a subclass of terminal glycosylation.
Terminal glycosylation is a subclass of protein glycosylation.
Therefore, terminal N-glycosylation is a subclass of protein glycosylation.
part_of in GO is more complex. There are 4 basic levels of restriction for a part_of relationship:
1st type has no restrictions - no inferences can be made from the relationship between parent and child
other than that parent may have child as a part, and the child may or may not be a part of the parent.
2nd type, 'necessarily is_part', means that wherever the child exists, it is as part of the parent.
To give a biological example, replication fork is part_of chromosome, so whenever replication fork
occurs, it is as part_of chromosome, but chromosome does not necessarily have part replication fork.
3rd type, 'necessarily has_part', is the exact inverse of type two; wherever the parent exists, it has the
child as a part, but the child is not necessarily part of the parent. For example, nucleus always has_part
chromosome, but chromosome isn't necessarily part_of nucleus.
4th type is a combination of both two and three, 'has_part' and 'is_part'. An example of this is nuclear
membrane is part_of nucleus. So nucleus always has_part nuclear membrane, and nuclear membrane
is always part_of nucleus.
The part_of relationship used in GO is usually type two, 'necessarily is_part'. Note that part_of types 1
and 3 are not used in GO, as they would violate the true path rule. Like is_a, part_of is transitive, so
that if 'GO term A' is part_of 'GO term B', and 'GO term B' is part_of 'GO term C', 'GO term A' is
part_of 'GO term C':
E.g., Laminin-1 is part_of basal lamina.
Basal lamina is part_of basement membrane.
Therefore, laminin-1 is part_of basement membrane.
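Transitivity of is_a and part_of can be sketched as a closure over the direct links of one relationship type. The two dictionaries hold only the examples given above; the function name and data layout are assumptions, not a GO file format.

```python
# Direct links only, one dictionary per relationship type (child -> parents).
IS_A = {
    "terminal N-glycosylation": {"terminal glycosylation"},
    "terminal glycosylation": {"protein glycosylation"},
}
PART_OF = {
    "laminin-1": {"basal lamina"},
    "basal lamina": {"basement membrane"},
}


def closure(term, edges):
    """Return all terms related to `term` by repeatedly following `edges`."""
    found = set()
    stack = [term]
    while stack:
        for parent in edges.get(stack.pop(), ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


print(sorted(closure("terminal N-glycosylation", IS_A)))
print(sorted(closure("laminin-1", PART_OF)))
```

The sketch keeps the two relationship types separate; how inferences combine across a mix of is_a and part_of links is not covered here.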
The ontology editing tool DAG-Edit, from version 1.411 on, allows you to specify the necessity of
relationships. The part_of relationship used in GO, necessarily is_part, would correspond to part_of,
[inverse] necessarily true. For more information, see the DAG-Edit user guide.
For info on how these relationships are represented in the GO flat files, see the GO File Format Guide.
For technical info on the relationships used in GO and OBO, see the OBO relationships ontology.
The true path rule states that "the pathway from a child term all the way up to its top-level parent(s)
must always be true". One of the implications of this is that the type of part_of relationship used in
GO, outlined more fully in the part_of relationship section above, is restricted to those types where a
child term must always be part_of its parent.
Often, annotating a new gene product reveals relationships in an ontology that break the true path
rule, or species specificity becomes a problem. In such cases, the ontology must be restructured by
adding more nodes and connecting terms such that any path upwards is true. When a term is added to
the ontology, the curator needs to add all of the parents and children of the new term.
This becomes clear with an example: consider how chitin metabolism is represented in the process
ontology. Chitin metabolism is a part of cuticle synthesis in fly and is also part of cell wall organization
in yeast. This was once represented in the process ontology as:
cuticle synthesis
[i]chitin metabolism
cell wall biosynthesis
[i]chitin metabolism
---[i]chitin biosynthesis
---[i]chitin catabolism
The problem with this organization becomes apparent when one tries to annotate a
specific gene product from one species. A fly chitin synthase could be annotated to chitin biosynthesis,
and appear in a query for genes annotated to cell wall biosynthesis (and its children), which makes no
sense because flies don't have cell walls.
This is the revised ontology structure which ensures that the true path rule is not broken:
chitin metabolism
[i]chitin biosynthesis
[i]chitin catabolism
[i]cuticle chitin metabolism
---[i]cuticle chitin biosynthesis
---[i]cuticle chitin catabolism
[i]cell wall chitin metabolism
---[i]cell wall chitin biosynthesis
---[i]cell wall chitin catabolism
The parent chitin metabolism now has the child terms cuticle chitin metabolism and cell
wall chitin metabolism, with the appropriate catabolism and synthesis terms beneath them. With this
structure, all the daughter terms can be followed up to chitin metabolism, but cuticle chitin metabolism
terms do not trace back to cell wall terms, so all the paths are true. In addition, gene products such as
chitin synthase can be annotated to nodes of appropriate granularity in both yeast and flies, and
queries will yield the expected results.
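The effect of the restructuring on queries can be reproduced with a toy model. In the sketch below, the links from cuticle chitin metabolism to cuticle synthesis and from cell wall chitin metabolism to cell wall biosynthesis are assumptions added so that the query has something to find; the gene names are hypothetical as well.

```python
# Old structure: chitin metabolism sits under both species-specific parents.
OLD = {
    "chitin biosynthesis": {"chitin metabolism"},
    "chitin catabolism": {"chitin metabolism"},
    "chitin metabolism": {"cuticle synthesis", "cell wall biosynthesis"},
}
# Revised structure; the links to 'cuticle synthesis' and
# 'cell wall biosynthesis' are assumed for illustration.
NEW = {
    "chitin biosynthesis": {"chitin metabolism"},
    "chitin catabolism": {"chitin metabolism"},
    "cuticle chitin metabolism": {"chitin metabolism", "cuticle synthesis"},
    "cuticle chitin biosynthesis": {"cuticle chitin metabolism"},
    "cuticle chitin catabolism": {"cuticle chitin metabolism"},
    "cell wall chitin metabolism": {"chitin metabolism", "cell wall biosynthesis"},
    "cell wall chitin biosynthesis": {"cell wall chitin metabolism"},
    "cell wall chitin catabolism": {"cell wall chitin metabolism"},
}


def ancestors(term, parents):
    """Return all terms above `term` in the given structure."""
    seen, stack = set(), [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def query(root, annotations, parents):
    """Genes annotated to `root` or to any term beneath it."""
    return sorted(
        gene
        for gene, term in annotations.items()
        if term == root or root in ancestors(term, parents)
    )


old_annotations = {"fly chitin synthase": "chitin biosynthesis"}
new_annotations = {
    "fly chitin synthase": "cuticle chitin biosynthesis",
    "yeast chitin synthase": "cell wall chitin biosynthesis",
}
print(query("cell wall biosynthesis", old_annotations, OLD))  # fly gene appears
print(query("cell wall biosynthesis", new_annotations, NEW))  # only the yeast gene
```

Both genes are still found by a query on chitin metabolism in the revised structure, which is the behavior the true path rule is meant to guarantee.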
Dependent ontology terms
Some GO terms imply presence of others. Examples from the process ontology include the following:
If either X biosynthesis or X catabolism exists, then parent X metabolism must also exist.
If regulation of X exists, then the process X must also exist. Potentially any process in the ontology can be
regulated. Note: X may refer to a phenotype (for example, cell size in regulation of cell size); in these cases, X
should not be added to the ontology.
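These dependencies lend themselves to a simple consistency check. The sketch below is one possible way to list terms implied by the two rules above but absent from a set of term names; the function name and the `phenotypes` parameter (names a curator has decided are phenotypes rather than processes) are assumptions.

```python
import re


def missing_dependencies(terms, phenotypes=frozenset()):
    """Return terms implied by the naming rules but absent from `terms`."""
    terms = set(terms)
    missing = set()
    for term in terms:
        regulated = re.fullmatch(r"regulation of (.+)", term)
        if regulated:
            target = regulated.group(1)
            # Phenotypes such as 'cell size' are deliberately not required.
            if target not in terms and target not in phenotypes:
                missing.add(target)
            continue
        split = re.fullmatch(r"(.+) (biosynthesis|catabolism)", term)
        if split:
            parent = f"{split.group(1)} metabolism"
            if parent not in terms:
                missing.add(parent)
    return missing


process_terms = {
    "chitin biosynthesis",
    "chitin catabolism",
    "hexose metabolism",
    "hexose biosynthesis",
    "regulation of cell size",
    "regulation of transcription",
}
print(sorted(missing_dependencies(process_terms, phenotypes={"cell size"})))
```

The check works on names only, so its output is a list of candidates for a curator to review rather than a list of definite errors.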
GO nodes should aggressively avoid using species-specific definitions. Nevertheless, many functions,
processes and components are not common to all life forms. Our current convention is to include any term
that can apply to more than one taxonomic class of organism.
Within the ontologies, there are cases where a word or phrase has different meanings when applied to
different organisms. For example, embryonic development in insects is very different from embryonic
development in mammals. Such terms are distinguished from one another by their definitions and by the
sensu designation (sensu means 'in the sense of'), as in the term embryonic development (sensu Insecta).
Nodes should be divided into sensu sub-trees where the children are or are likely to be different.
Using sensu designation in a term does not exclude that term from being used to annotate species outside that
designation. E.g., a 'sensu Drosophila' term might reasonably be used to annotate a mosquito gene product.
A GO node should never be more species-specific than any of its children. Child nodes can be at the same
level of species specificity as the parent node(s), or more specific. When adding more species-specific nodes,
curators should make sure that non-species-specific parents exist (or add them if necessary).
E.g., take the process of sporulation. This occurs in both bacteria and fungi, but bacterial sporulation is quite
a different process to fungal sporulation, so we therefore add two children to sporulation, sporulation (sensu
Bacteria) and sporulation (sensu Fungi). If we now want to add a term to represent the assembly of the spore
wall in fungi, we cannot just add spore wall assembly as a direct child of sporulation (sensu Fungi) as such a
term could conceivably refer to the assembly of spore walls in bacteria. We have to name the child term spore
wall assembly (sensu Fungi) to ensure that it is as species-specific as the parent term.
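The sporulation example suggests a check that can be automated in a simplified form: any child of a sensu term should itself carry a sensu designation. The sketch below tests only for the presence of a designation, not whether the child's taxon is at least as specific as the parent's; the function names and the child-to-parents layout are assumptions.

```python
import re

SENSU = re.compile(r"\(sensu ([^)]+)\)\s*$")


def sensu_of(term):
    """Return the taxon named in a trailing '(sensu ...)', or None."""
    match = SENSU.search(term)
    return match.group(1) if match else None


def sensu_violations(parents):
    """(child, parent) pairs where the parent has a sensu designation
    but the child has none."""
    return sorted(
        (child, parent)
        for child, parent_set in parents.items()
        for parent in parent_set
        if sensu_of(parent) and not sensu_of(child)
    )


ontology = {
    "sporulation (sensu Bacteria)": {"sporulation"},
    "sporulation (sensu Fungi)": {"sporulation"},
    "spore wall assembly": {"sporulation (sensu Fungi)"},
}
print(sensu_violations(ontology))
```

Renaming the child to spore wall assembly (sensu Fungi), as the text recommends, makes the list of violations empty.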
References and Evidence
Every annotation must be attributed to a source, which may be a literature reference,
another database or a computational analysis. The annotation must indicate what kind of
evidence is found in the cited source to support the association between the gene product
and the GO term. A simple controlled vocabulary is used to record evidence:
IMP   inferred from mutant phenotype
IGI   inferred from genetic interaction <database:gene_symbol[allele_symbol]>
IPI   inferred from physical interaction [with <database:protein_name>]
ISS   inferred from sequence similarity [with <database:sequence_id>]
IDA   inferred from direct assay
IEP   inferred from expression pattern
IEA   inferred from electronic annotation [with <database:id>]
TAS   traceable author statement
NAS   non-traceable author statement
ND    no biological data available
RCA   inferred from reviewed computational analysis
IC    inferred by curator [from <GO:id>]
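Because the evidence vocabulary is small and fixed, a program that reads annotations can validate it with a lookup table. The sketch below lists the twelve codes given above; the function name is an assumption.

```python
# Evidence codes and their meanings, as listed in the text.
EVIDENCE_CODES = {
    "IMP": "inferred from mutant phenotype",
    "IGI": "inferred from genetic interaction",
    "IPI": "inferred from physical interaction",
    "ISS": "inferred from sequence similarity",
    "IDA": "inferred from direct assay",
    "IEP": "inferred from expression pattern",
    "IEA": "inferred from electronic annotation",
    "TAS": "traceable author statement",
    "NAS": "non-traceable author statement",
    "ND": "no biological data available",
    "RCA": "inferred from reviewed computational analysis",
    "IC": "inferred by curator",
}


def describe_evidence(code):
    """Return the meaning of an evidence code, or raise ValueError."""
    try:
        return EVIDENCE_CODES[code]
    except KeyError:
        raise ValueError(f"unknown evidence code: {code!r}") from None


print(describe_evidence("IDA"))
```

Rejecting unknown codes early keeps typos in a hand-edited file from silently passing into downstream analyses.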
Annotation File Format
Collaborating databases export to GO a tab-delimited file, known informally as a "gene association file", of
links between database objects and GO terms. Despite the jargon, the database object may represent a gene or
a gene product (transcript or protein). Columns in the file are described below; a table showing the columns
in order, with examples, is available.
The entry in the DB_Object_ID field (see below) of the association file is the identifier for the database
object, which may or may not correspond exactly to what is described in a paper. For example, a paper
describing a protein may support annotations to the gene encoding the protein (gene ID in DB_Object_ID
field) or annotations to a protein object (protein ID in DB_Object_ID field).
The entry in the DB_Object_Symbol field should be a symbol that means something to a biologist, wherever
possible (gene symbol, for example). It is not an ID or an accession number (the second column,
DB_Object_ID, provides the unique identifier), although IDs can be used in DB_Object_Symbol if there is
no more biologically meaningful symbol available (e.g., when an unnamed gene is annotated).
The object type (gene, transcript, protein, protein_structure, or complex) listed in the DB_Object_Type field
MUST match the database entry identified by DB_Object_ID. Note that DB_Object_Type refers to the
database entry (i.e. does it represent a gene, protein, etc.); this column does not reflect anything about the GO
term or the evidence on which the annotation is based. For example, if your database entry represents a gene,
then 'gene' goes in the DB_Object_Type column, even if the annotation is to a component term relevant to the
localization of a protein product of the gene. The text entered in the DB_Object_Name and
DB_Object_Symbol can refer to the same database entry (recommended), or to a "broader" entity. For
example, several alternative transcripts from one gene may be annotated separately, each with a unique
transcript DB_Object_ID, but list the same gene symbol in the DB_Object_Symbol column.
The flat file format comprises 15 tab-delimited fields. Blue denotes required fields:
DB: the database contributing the gene_association file. The value must be present in the file of
database abbreviations. This field is mandatory, cardinality 1.
DB_Object_ID: unique identifier in DB for the item being annotated. This field is mandatory, cardinality 1.
DB_Object_Symbol: a (unique and valid) symbol to which DB_Object_ID is matched. An ORF name can be used for
an otherwise unnamed gene or protein. If gene products are annotated, the gene product symbol can be used if
available, or many gene product annotation entries can share a gene symbol. This field is mandatory,
cardinality 1.
Qualifier: flags that modify the interpretation of an annotation; one (or more) of NOT, contributes_to,
colocalizes_with. This field is not mandatory; cardinality 0, 1, >1; for cardinality >1 use a pipe to
separate entries (e.g. NOT|contributes_to).
GOid: GO identifier for the term attributed to the DB_Object_ID. This field is mandatory, cardinality 1.
DB:Reference: one or more unique identifiers for a single source cited as an authority for the attribution
of the GOid to the DB_Object_ID. This may be a literature reference or a database record. The syntax
is DB:accession_number. Note that only one reference can be cited on a single line in the
gene_association file. If a reference has identifiers in more than one database, multiple identifiers for
that reference can be included on a single line. For example, if the reference is a published paper that
has a PubMed ID, we strongly recommend that the PubMed ID be included, as well as an identifier
within a model organism database. Note that if the model organism database has an identifier for the
reference, that identifier should always be included, even if a PubMed ID is also used. This field is
mandatory, cardinality 1, >1; for cardinality >1 use a pipe to separate entries (e.g.
SGD:8789|PMID:2676709).
Evidence: one of IMP, IGI, IPI, ISS, IDA, IEP, IEA, TAS, NAS, ND, IC, RCA. This field is mandatory, cardinality 1.
With (or) From: one of DB:gene_symbol; DB:gene_symbol[allele_symbol]; DB:gene_id;
DB:protein_name; DB:sequence_id; GO:GO_id. This field is not mandatory (except in the case of the IC
evidence code); cardinality 0, 1, >1; for cardinality >1 use a pipe to separate entries (e.g.
CGSC:pabA|CGSC:pabB). Note: This field is used to hold an additional identifier for annotations
using certain evidence codes (IC, IEA, IGI, IPI, ISS). For example, it can identify another gene product
to which the annotated gene product is similar (ISS) or with which it interacts (IPI). More information on the
meaning of 'with/from' column entries is available in the evidence documentation entries for the
relevant codes. Cardinality = 0 is not recommended, but is permitted because cases can be found in the
literature where no database identifier can be found (e.g. physical interaction or sequence similarity to
a protein, but no ID provided). Annotations where evidence is IGI, IPI, or ISS and 'with' cardinality =
0 should link to an explanation of why there is no entry in 'with'. Cardinality may be >1 for any of the
evidence codes that use 'with'; for IPI and IGI, cardinality >1 has a special meaning (see evidence
documentation for more information). A second example of pipe-separated entries is
FB:FBgn1111111|FB:FBgn2222222. Note that a gene ID may be used in the 'with' column for an IPI
annotation, or for an ISS annotation based on amino acid sequence or protein structure similarity, if
the database does not have identifiers for individual gene products. A gene ID may also be used if the
cited reference provides enough information to determine which gene ID should be used, but not
enough to establish which protein ID is correct. 'GO:GO_id' is used only when the evidence code is
'IC', and refers to the GO term(s) used as the basis of a curator inference. In these cases the entry in
the 'DB:Reference' column will be that used to assign the GO term(s) from which the inference is
made. The ID is usually an identifier for an individual
entry in a database (such as a sequence ID, gene ID, GO ID, etc.). Identifiers from the Center for
Biological Sequence Analysis (CBS), however, represent tools used to find homology or sequence
similarity; these identifiers can be used in the 'with' column for ISS annotations. The 'with' column
may not be used with the evidence codes IDA, TAS, NAS, or ND.
Aspect: one of P (biological process), F (molecular function) or C (cellular component). This field is
mandatory; cardinality 1.
DB_Object_Name: name of gene or gene product. This field is not mandatory, cardinality 0, 1 [white space allowed].
Synonym: Gene_symbol [or other text]. We strongly recommend that gene synonyms are included in the gene
association file, as this aids the searching of GO. This field is not mandatory, cardinality 0, 1, >1 [white
space allowed]; for cardinality >1 use a pipe to separate entries (e.g. YFL039C|ABY1|END7|actin gene).
DB_Object_Type: what kind of thing is being annotated; one of gene, transcript, protein,
protein_structure, complex. This field is mandatory, cardinality 1.
Taxon: taxonomic identifier(s). For cardinality 1, the ID of the species encoding the gene product.
Cardinality 2 is to be used only in conjunction with terms that have the term 'interaction between
organisms' as an ancestor. The first taxon ID should be that of the organism encoding the gene or gene
product, and the taxon ID after the pipe should be that of the other organism in the interaction.
This field is mandatory, cardinality 1, 2; for cardinality 2 use a pipe to separate entries (e.g. taxon:1|taxon:1000).
Date: date on which the annotation was made; format is YYYYMMDD. This field is mandatory, cardinality 1.
Assigned_by: the database which made the annotation; one of the values in the table of database
abbreviations. Used for tracking the source of an individual
annotation. The default value is the value entered in column 1 (DB). The value will differ from column 1 for any
annotation that is made by one database and incorporated into another. This field is mandatory, cardinality 1.
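The 15 fields and their cardinality rules can be sketched as a small parser. This is only an illustration under stated assumptions: the function name parse_association_line and the dictionary layout are hypothetical, and the checks cover only the rules spelled out in the field descriptions (mandatory fields, pipe-separated entries, the three Aspect values, and the restrictions on the 'with' column).

```python
# Column names in file order, as listed in the field descriptions.
FIELD_NAMES = [
    "DB", "DB_Object_ID", "DB_Object_Symbol", "Qualifier", "GOid",
    "DB:Reference", "Evidence", "With", "Aspect", "DB_Object_Name",
    "Synonym", "DB_Object_Type", "Taxon", "Date", "Assigned_by",
]
# Fields whose cardinality may exceed 1; entries are separated by a pipe.
MULTI_VALUED = {"Qualifier", "DB:Reference", "With", "Synonym", "Taxon"}
# Fields that are not mandatory (cardinality 0 is permitted).
OPTIONAL = {"Qualifier", "With", "DB_Object_Name", "Synonym"}
# Evidence codes with which the 'with' column may not be used.
NO_WITH_CODES = {"IDA", "TAS", "NAS", "ND"}


def parse_association_line(line):
    """Split one tab-delimited line into a dict keyed by field name."""
    values = line.rstrip("\r\n").split("\t")
    if len(values) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} fields, got {len(values)}")
    record = {}
    for name, value in zip(FIELD_NAMES, values):
        if not value and name not in OPTIONAL:
            raise ValueError(f"mandatory field {name!r} is empty")
        if name in MULTI_VALUED:
            record[name] = value.split("|") if value else []
        else:
            record[name] = value
    if record["Aspect"] not in ("P", "F", "C"):
        raise ValueError(f"Aspect must be P, F or C, got {record['Aspect']!r}")
    if len(record["Taxon"]) > 2:
        raise ValueError("Taxon has cardinality 1 or 2")
    if record["Evidence"] in NO_WITH_CODES and record["With"]:
        raise ValueError(f"'with' may not be used with {record['Evidence']}")
    if record["Evidence"] == "IC" and not record["With"]:
        raise ValueError("'with' is mandatory for evidence code IC")
    return record


# A made-up line for illustration only; it is not a real annotation.
example = "\t".join([
    "SGD", "S0000001", "ABC1", "", "GO:0000000", "SGD:8789|PMID:2676709",
    "IDA", "", "P", "", "YFL039C|ABY1", "gene", "taxon:1", "20020903", "SGD",
])
record = parse_association_line(example)
print(record["DB:Reference"])
```

Multi-valued fields are always returned as lists, so an empty optional field becomes an empty list and downstream code does not need to special-case cardinality 0.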
Note that several fields contain database cross-references (dbxrefs) in the format dbname:dbaccession. The fields are:
GOid (where dbname is always GO), DB:Reference, With, Taxon (where dbname is always taxon). For GO ids, do not
repeat the 'GO:' prefix (i.e. always use GO:0000000, not GO:GO:0000000).
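The dbxref rule can be checked mechanically. The sketch below is an assumption-laden illustration: the function name check_dbxref is hypothetical, and the pattern assumes that a database name contains no colon, pipe or white space and that an accession contains no white space, which the text above does not state.

```python
import re

# dbname:dbaccession. Assumed here: dbname has no colon, pipe or white space,
# and the accession has no white space.
DBXREF = re.compile(r"^([^\s:|]+):(\S+)$")


def check_dbxref(value, required_dbname=None):
    """Return (dbname, accession) for a well-formed dbxref, else raise ValueError."""
    match = DBXREF.match(value)
    if match is None:
        raise ValueError(f"not in dbname:dbaccession format: {value!r}")
    dbname, accession = match.groups()
    # Catches a repeated prefix such as GO:GO:0000000.
    if accession.startswith(dbname + ":"):
        raise ValueError(f"prefix {dbname!r} is repeated in {value!r}")
    if required_dbname is not None and dbname != required_dbname:
        raise ValueError(f"expected dbname {required_dbname!r}, got {dbname!r}")
    return dbname, accession


print(check_dbxref("GO:0000000", required_dbname="GO"))
print(check_dbxref("taxon:1000", required_dbname="taxon"))
```

For the GOid and Taxon columns the required_dbname argument enforces the fixed prefix; for DB:Reference and With it can be left unset, since any listed database abbreviation may appear there.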
Computational Annotation Methods
This section includes descriptions of automated annotation methods used by participating
databases (descriptions have been provided by each group listed).
EBI GOA Electronic Annotation The large-scale assignment of GO terms to UniProt
Knowledgebase entries involves electronic techniques. This strategy exploits existing
properties within database entries, including keywords, Enzyme Commission (EC)
numbers and cross-references to InterPro (a database of protein motifs), which are manually
mapped to GO. SWISS-PROT keyword and InterPro to GO mappings are maintained in-house
and shared on the GO home page for local database updates. Electronically
combining these mappings with a table of matching UniProt Knowledgebase entries
generates a table of associations. For each GOA association, an evidence code that
summarizes how the association was made is provided. Associations that are made electronically
are labeled as 'inferred from electronic annotation' (IEA). Evelyn Camon, 2002-09-03
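The combining step can be pictured as a join between mapping tables and a table of entries. The sketch below is not the GOA pipeline: the function name, the data layout, and every identifier in the sample data are placeholders invented for illustration and do not represent real mappings.

```python
def electronic_annotations(entries, mappings):
    """Join entry properties with property-to-GO mappings.

    entries:  {accession: {property_type: [value, ...]}}
    mappings: {property_type: {value: [go_id, ...]}}
    Returns a list of (accession, go_id, evidence_code, source) tuples.
    """
    associations = []
    for accession, properties in entries.items():
        for property_type, values in properties.items():
            table = mappings.get(property_type, {})
            for value in values:
                for go_id in table.get(value, []):
                    # Every association produced this way carries the IEA code.
                    associations.append(
                        (accession, go_id, "IEA", f"{property_type}:{value}")
                    )
    return associations


# Placeholder data only: these accessions, keywords and GO ids are made up.
mappings = {
    "keyword": {"example keyword": ["GO:0000001"]},
    "InterPro": {"IPR000000": ["GO:0000002", "GO:0000003"]},
}
entries = {
    "P00000": {"keyword": ["example keyword"], "InterPro": ["IPR000000"]},
    "P11111": {"keyword": ["unmapped keyword"]},
}
for row in electronic_annotations(entries, mappings):
    print(row)
```

Keeping the source property alongside each association records which mapping produced it, which is useful when a manually maintained mapping is later revised.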
MGI Electronic Annotation Methods Every object in the MGI databases (markers,
seqids, references, etc.) has an MGI: accession ID. See details in the GO
Computational Annotation Methods documentation.
TIGR ISS Annotation (Arabidopsis, T. brucei) For TIGR Arabidopsis or T. brucei
annotations using 'Inferred from Sequence Similarity' (ISS) evidence, the reference is
usually 'TIGR_Ath1:annotation' for Arabidopsis (author: TIGR Arabidopsis annotation
team) and 'TIGR_Tba1:annotation' for T. brucei (author: TIGR Trypanosoma brucei
annotation team), which are defined as follows:
name: TIGR annotation based upon multiple sources of similarity evidence
description: TIGR_Ath1:annotation or TIGR_Tba1:annotation denotes a curator's
interpretation of a combination of evidence. Our internal software tools present us with a
great deal of evidence based on domains, sequence similarities, signal sequences, paralogous
proteins, etc. The curator interprets the body of evidence to make a decision about a GO
assignment when an external reference is not available. The curator places one or more
accessions that informed the decision in the "with" field.
What this says is that we have used many sequence similarity hits, etc., to make our
decision. However, we choose only 1-3 pieces of information as "with" information, as it is
not practical to enter and submit many entries for each annotation. We also have internal
calculations of paralogy and new domains we are identifying which have not yet been
published, but which help inform our decisions.