Biological databases play a central role in
bioinformatics.
• They offer scientists the opportunity to access sequence
and structure data for tens of thousands of sequences from a
broad range of organisms.
• We will see GO, CONSURF, PFAM
The Gene Ontology (GO) Project:
Structured Vocabularies for Molecular Biology and Their
Application to Genome and Expression Analysis
The focus of the Gene Ontology project is three-fold.
• First, the project goal is to compile the Gene Ontologies: structured vocabularies
describing domains of molecular biology. The three domains under development
were chosen as ones that are shared by all organisms: Molecular Function,
Biological Process, and Cellular Component.
• Second, the project supports the use of these structured vocabularies in the
annotation of gene products. Gene products are associated with the most precise
GO term supported by the experimental evidence. Structured vocabularies are
hierarchical, allowing both attributions and queries to be made at different levels of
specificity.
• Third, the gene product-to-GO annotation sets are provided by participating
groups to the public through open access to the GO database and Web resource.
Thus, the community can access standardized annotations of gene products
across multiple species and resources.
We will describe the current ontologies and what is beyond the scope
of the Gene Ontology project. We then address how GO
vocabularies are constructed and related to genes and gene
products, and conclude with a discussion of how researchers can
access, browse, and utilize the GO project in the course of their own
research.
What are Ontologies and Why do
we Need Them?
Ontologies, in one sense used today in the fields of computer science and
bioinformatics, are “specifications of a relational vocabulary”.
Three areas are considered orthogonal to each other, i.e., they are
treated as independent domains. The ontologies are developed to
include all terms falling into these domains without consideration of
whether the biological attribute is restricted to certain taxonomic
groups. Therefore, biological processes that occur only in plants
(e.g., photosynthesis) or mammals (e.g., lactation) are included.
How does GO work?
What information might we want to
capture about a gene product?
What does the gene product do?
Why does it perform these activities?
Where does it act?
GO: Three ontologies
What does it do?
Molecular Function
What processes is it
involved in?
Biological Process
Where does it act?
Cellular Component
gene product
The 3 Gene Ontologies
Molecular Function = elemental activity/task
– the tasks performed by individual gene products; examples are
carbohydrate binding and ATPase activity
Biological Process = biological goal or
objective
– broad biological goals, such as mitosis or purine metabolism,
that are accomplished by ordered assemblies of molecular
functions
Cellular Component = location or complex
– subcellular structures, locations, and macromolecular
complexes; examples include nucleus, telomere, and RNA
polymerase II holoenzyme
Molecular Function
Molecular Function refers to the elemental activity or task
performed, or potentially performed, by individual gene products.
Enzymatic activities such as “nuclease,” as well as structural
activities such as “structural constituent of chromatin” are included in
Molecular Function.
An example of a broad functional term is “transporter” (enabling the
directed movement of substances, such as macromolecules, small
molecules, and ions, into, out of, or within a cell).
An example of a more detailed functional term is “protein-glutamine
gamma-glutamyltransferase,” which cross-links adjacent polypeptide
chains by the formation of the N6-(L-isoglutamyl)-L-lysine
isopeptide; the gamma-carboxamide groups of peptide-bound
glutamine residues act as acyl donors, and the 6-amino-groups of
peptidyl- and peptide-bound lysine residues act as acceptors, to give
intra- and inter-molecular N6-(5-glutamyl)lysine cross-links.
Biological Process
Biological Process refers to the broad biological objective or goal in
which a gene product participates.
Biological Process includes the areas of development, cell
communication, physiological processes, and behavior.
An example of a broad process term is “mitosis” (the division of the
eukaryotic cell nucleus to produce two daughter nuclei that, usually,
contain the identical chromosome complement to their mother).
An example of a more detailed process term is “calcium-dependent
cell-matrix adhesion” (the binding of a cell to the extracellular matrix
via adhesion molecules that require the presence of calcium for the
interaction).
Cellular Component
Cellular Component refers to the location of action for a gene
product. This location may be a structural component of a cell, such
as the nucleus. It can also refer to a location as part of a molecular
complex, such as the ribosome.
How are GO Vocabularies Constructed?
GO vocabularies are updated and modified on a
regular basis.
A small number of GO curators are
empowered to make additions to and deletions
from GO.
A monthly snapshot of XML format files of GO
vocabularies is saved and posted on the GO
Web site.
Example: Gene Product = hammer
Function (what)            Process (why)
Drive nail (into wood)     Carpentry
Drive stake (into soil)    Gardening
Smash roach                Pest Control
Clown’s juggling object    Entertainment
What’s in a name?
The same name can be used to describe
different concepts
What’s in a name?
Molecular Function
A single reaction or activity, not a gene
product
A gene product may have several functions
Sets of functions make up a biological
process
What’s in a name?
Glucose synthesis
Glucose biosynthesis
Glucose formation
Glucose anabolism
Gluconeogenesis
All refer to the process of making glucose
from simpler components
What’s in a name?
The same name can be used to describe
different concepts
A concept can be described using
different names
→ Comparison is difficult – in particular
across species or across databases
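The naming problem can be sketched in code. This is a minimal sketch, assuming a hand-built synonym table; the dictionary `SYNONYMS` and the function `normalize` are illustrative names, not part of any GO tool. The identifier is the one this document gives for gluconeogenesis.

```python
# Hypothetical synonym table: every name maps to one GO identifier.
# The ID is the one given for gluconeogenesis in this document.
SYNONYMS = {
    "glucose synthesis": "GO:0006094",
    "glucose biosynthesis": "GO:0006094",
    "glucose formation": "GO:0006094",
    "glucose anabolism": "GO:0006094",
    "gluconeogenesis": "GO:0006094",
}


def normalize(name):
    """Return the GO ID for a free-text name, or None if it is unknown."""
    return SYNONYMS.get(name.strip().lower())


# Two databases using different labels for the same concept.
labels_from_two_databases = ["Glucose biosynthesis", "gluconeogenesis"]
ids = {normalize(label) for label in labels_from_two_databases}
print(ids)  # both labels collapse to a single identifier
```

Comparing identifiers rather than names is what makes comparison across species or databases tractable.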
Ontology Structure
Ontologies can be represented as graphs,
where the nodes are connected by edges
Nodes = concepts in the ontology
Edges = relationships between the concepts
Ontology Structure
The Gene Ontology is structured as a
hierarchical directed acyclic graph (DAG)
Terms can have more than one parent
and zero, one or more children
Terms are linked by two relationships
– is-a
– part-of
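A minimal sketch of this structure in Python, assuming a plain dictionary representation; the `PARENTS` layout and the helpers `is_acyclic` and `children` are illustrative choices, not GO's file format. The term names appear elsewhere in these slides, but which relationship links which pair is an assumption made for illustration.

```python
# child term -> list of (relationship, parent term); edges are assumed
PARENTS = {
    "protein complex": [],
    "organelle": [],
    "mitochondrion": [("is-a", "organelle")],
    "fatty acid beta-oxidation multienzyme complex": [
        ("is-a", "protein complex"),
        ("part-of", "mitochondrion"),
    ],
}


def is_acyclic(parents):
    """Depth-first search; returns False if following parents ever loops."""
    finished = set()

    def visit(term, path):
        if term in path:
            return False  # came back to a term already on this path
        if term in finished:
            return True
        ok = all(visit(p, path | {term}) for _, p in parents.get(term, []))
        if ok:
            finished.add(term)
        return ok

    return all(visit(term, frozenset()) for term in parents)


def children(parents, term):
    """Terms that list `term` as a direct parent."""
    return sorted(
        child for child, links in parents.items()
        if any(parent == term for _, parent in links)
    )


print(is_acyclic(PARENTS))
print(children(PARENTS, "organelle"))
```

A tree would allow only one entry in each parent list; the DAG allows several, which is why the last term can carry both an is-a and a part-of link.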
Simple hierarchies (Trees): single parent
Directed Acyclic Graphs: one or more parents
Directed Acyclic Graphs (DAG)
[Figure: example DAG with the nodes protein complex, organelle, mitochondrion, [other protein complexes], [other organelles], and fatty acid beta-oxidation multienzyme complex, connected by is-a and part-of edges]
Parent-Child Relationships
A child is a subset of a parent’s elements
The cell component term Nucleus has 5 children:
Nucleoplasm, Nuclear envelope, Nucleolus, Chromosome, Perinuclear space
True Path Rule
The path from a child term all the way up to
its top-level parent(s) must always be true
cell
    [part-of] cytoplasm
    [part-of] chromosome
        [is-a] nuclear chromosome
        [is-a] cytoplasmic chromosome
        [is-a] mitochondrial chromosome
    [part-of] nucleus
        [part-of] nuclear chromosome
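One consequence of the true path rule is that a gene product annotated to a term is implicitly annotated to every ancestor of that term. A minimal sketch, assuming the cell, nucleus, and chromosome terms are linked as noted in the comments; the edge list and the `ancestors` helper are illustrative, not taken from the GO files.

```python
# child -> direct parents, following both is-a and part-of links (assumed edges)
PARENTS = {
    "cell": [],
    "nucleus": ["cell"],                  # part-of
    "chromosome": ["cell"],               # part-of
    "nuclear chromosome": ["chromosome",  # is-a
                           "nucleus"],    # part-of
}


def ancestors(term, parents=PARENTS):
    """All terms reachable by walking parent links upward from `term`."""
    found = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents[current])
    return found


# Annotating to "nuclear chromosome" must also be true for each ancestor.
print(sorted(ancestors("nuclear chromosome")))
```

If chromosome were instead placed under nucleus, a mitochondrial chromosome would inherit nucleus as an ancestor, which would not be true; arrangements of that kind are what the rule is meant to exclude.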
What’s in a GO term?
term: gluconeogenesis
id: GO:0006094
definition: The formation of glucose from
noncarbohydrate precursors, such as
pyruvate, amino acids and glycerol.
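A term record like this one can be held in a small data structure. A sketch, assuming the identifier pattern is `GO:` followed by seven digits, as in the IDs shown in this document; the class name `GOTerm` is hypothetical.

```python
import re
from dataclasses import dataclass

# Assumed identifier shape: "GO:" plus seven digits
GO_ID = re.compile(r"GO:\d{7}")


@dataclass(frozen=True)
class GOTerm:
    id: str
    name: str
    definition: str

    def __post_init__(self):
        # Reject identifiers that do not match the assumed pattern
        if not GO_ID.fullmatch(self.id):
            raise ValueError(f"not a GO identifier: {self.id!r}")


term = GOTerm(
    id="GO:0006094",
    name="gluconeogenesis",
    definition=(
        "The formation of glucose from noncarbohydrate precursors, "
        "such as pyruvate, amino acids and glycerol."
    ),
)
print(term.id, term.name)
```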
No GO Areas
GO covers ‘normal’ functions and
processes
– No pathological processes
– No experimental conditions
NO evolutionary relationships
NO gene products
NOT a system of nomenclature
Annotation of gene products
with GO terms
Mitochondrial P450
Cellular component:
mitochondrial inner membrane
GO:0005743
Biological process:
Electron transport
GO:0006118
substrate + O2 = CO2 + H2O product
Molecular function:
monooxygenase activity
GO:0004497
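This annotation amounts to three (ontology, term, ID) associations for one gene product. A minimal sketch, assuming a simple list-of-tuples layout; the names `ANNOTATIONS` and `by_ontology` are illustrative, and evidence codes are omitted.

```python
from collections import defaultdict

# (gene product, ontology, term name, GO ID) -- values from the slide
ANNOTATIONS = [
    ("Mitochondrial P450", "Cellular Component",
     "mitochondrial inner membrane", "GO:0005743"),
    ("Mitochondrial P450", "Biological Process",
     "electron transport", "GO:0006118"),
    ("Mitochondrial P450", "Molecular Function",
     "monooxygenase activity", "GO:0004497"),
]


def by_ontology(annotations, gene_product):
    """Group one gene product's GO IDs by the ontology they belong to."""
    grouped = defaultdict(list)
    for product, ontology, _name, go_id in annotations:
        if product == gene_product:
            grouped[ontology].append(go_id)
    return dict(grouped)


print(by_ontology(ANNOTATIONS, "Mitochondrial P450"))
```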
Why modify the GO?
GO reflects current knowledge of biology
Adding new organisms can make existing
term arrangements incorrect
Not everything was perfect from the outset
What can scientists do with GO?
• Access gene product functional information
• Find how much of a proteome is involved in a process/
function/ component in the cell
• Map GO terms and incorporate manual annotations into
own databases
• Provide a link between biological knowledge and …
• gene expression profiles
• proteomics data
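For instance, finding how much of a proteome is involved in a process reduces to counting annotated gene products. A sketch with invented toy data; the gene names and the helper `fraction_annotated` are hypothetical, and a real analysis would also count annotations made to child terms.

```python
# Toy annotation table: gene product -> set of GO term names (invented data)
PROTEOME = {
    "geneA": {"immune response", "peptidase activity"},
    "geneB": {"immune response"},
    "geneC": {"lipid metabolism"},
    "geneD": set(),  # no annotation yet
}


def fraction_annotated(proteome, term):
    """Share of gene products in `proteome` annotated directly to `term`."""
    if not proteome:
        return 0.0
    hits = sum(1 for terms in proteome.values() if term in terms)
    return hits / len(proteome)


print(fraction_annotated(PROTEOME, "immune response"))  # 2 of 4 gene products
```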
Microarray analysis
Whole genome analysis
(J. D. Munkvold et al., 2004)
…analysis of high-throughput data according to GO
MicroArray data analysis
[Figure: microarray data for attacked and control samples over time, with genes grouped under the labels Defense response, Immune response, Response to stimulus, Toll regulated genes, JAK-STAT regulated genes, Puparial adhesion, Molting cycle, hemocyanin, Amino acid catabolism, Lipid metabolism, Peptidase activity, Protein catabolism. Credit: Bregje Wertheim at the Centre for Evolutionary Genomics]
Functional categories in eukaryotic proteomes. The categories were derived
from functional classification systems, including the Gene Ontology project.
(Figure 37 in Lander, Linton, et al. 2001)
Distribution of the molecular functions of the 26,383 human proteins. Each slice lists the numbers and
percentages (in parentheses) of gene functions assigned to a given category of molecular function. The outer
circle shows the assignment to molecular function categories in the Gene Ontology (GO) (Gene Ontology: tool
for the unification of biology. The Gene Ontology Consortium (2000) Nature Genet. 25: 25-29), and the inner
circle shows the assignment to Celera's Panther molecular function categories. (Figure 15 in Venter, Adams, et
al. 2001)
CONSURF
Conservation scores of residues in
proteins
Two versions available
Pre-compiled results are available
PFAM
Hundreds of thousands of protein sequences are now known and the
deluge of data shows no signs of slowing. The sequence analysis of
proteins may seem like a perpetual (continuous) task. However, the majority
of protein sequences appear to fall into a few thousand protein families
(Chothia, 1992).
Very often these families are representative of proteins at the domain level,
where domains are discrete structural units that are frequently found in
different protein contexts.
Pfam is a database of such protein domain families (Sonnhammer et al.,
1997; Bateman et al., 2002), with each family represented by multiple
sequence alignments and profile hidden Markov models (HMMs). In
addition, each family has associated annotation, literature references, and
links to other databases.
The entries in Pfam are available via the Web and in flatfile format.
FUNCTIONAL DOMAIN
[Diagram labels: download; SCOP database, CONSURF database, PFAM database, PDB, GO database; structure, domain, function]
PIR
Entrez
a client-server system for retrieval of
information related to molecular biology
can be used
– via web page
– via "embedded" client in other software
provided by National Center for
Biotechnology Information, part of the
National Library of Medicine (NIH)
Entrez Databases
PubMed: The biomedical literature
– The PubMed database contains MEDLINE abstracts as well as
links to full text articles on sites maintained by journal
publishers
Nucleotide sequence database
(GenBank)
Protein sequence database
Structure: three-dimensional
macromolecular structures
Genome: complete genome assemblies
PopSet: population study data sets
Entrez Databases
OMIM: Online Mendelian Inheritance in
Man
Taxonomy: organisms in GenBank
Books: online books
ProbeSet: Gene Expression Omnibus
(GEO)
3D Domains: domains from Entrez
Structure
Entrez essentials
Semi-automated entry of information into
databases
Critical to usefulness are the links between
databases
Entrez literature searching
can find papers on a given subject
can find papers on a specific gene
can find papers related to a given paper
can switch between literature and
sequence databases
PubMed has links to publishers’ websites
to view full text of articles
PubMed Central has free full text copies
Entrez sequence searching
can find sequences for a given gene or
protein
can download copy of sequence
Example Entrez Session
Goal: Find literature and sequences for
cystic fibrosis genes
– Use OMIM with Keyword searching.
– Switch to Protein database to see sequence.
– Change to GenPept format to save sequence.
– Switch to Nucleotide database to see
sequence.
– Use neighbor feature to find related articles.
– Use MeSH terms to find similar articles.
– Search the Nucleotide database by gene
name.
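The first steps of such a session can also be scripted. A sketch, assuming NCBI's E-utilities web interface (the `esearch.fcgi` and `efetch.fcgi` endpoints, which these slides do not mention) as a programmatic counterpart of the Entrez web pages. It only builds request URLs and does not contact the server; the helper names and the record ID are placeholders.

```python
from urllib.parse import urlencode

# Base address of NCBI's E-utilities (an assumption; not named in the slides)
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """URL that searches one Entrez database for a query term."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS}/esearch.fcgi?{query}"


def efetch_url(db, record_id, rettype, retmode="text"):
    """URL that retrieves one record in the requested format."""
    query = urlencode(
        {"db": db, "id": record_id, "rettype": rettype, "retmode": retmode}
    )
    return f"{EUTILS}/efetch.fcgi?{query}"


# Keyword search in OMIM, then a protein record in GenPept ("gp") format.
print(esearch_url("omim", "cystic fibrosis"))
print(efetch_url("protein", "PLACEHOLDER_ID", rettype="gp"))
```

Building URLs separately from fetching them keeps the sketch testable offline; an actual client would send these requests and parse the responses.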
Block Diagram for Entrez
Literature Searching
[Diagram boxes: Results of Previous Search; Additional Search Criterion; Displayed Item Selection; Entrez Search Engine; Results of Search (List); Item Display; Desired Output Format]