Protein function
Where to find it.
How to predict it.
How to classify it.
Stuart Rison
Department of Biochemistry, UCL
[email protected]
Outline

Collecting functional information:



Small scale (single gene)
Large scale (sets of genes)
Function annotation schemes
 Problems with functional assignments
 [Comparing current schemes]
Collecting information for single genes





from 1° databases
from 2° databases
from Genome Databases (Model organisms)
by homology
not by homology
Annotation in databases: 1° and 2° databases





Some information can be found in 'primary' databases (sequence
and structure databases)
Usually limited although sometimes can be quite informative (e.g.
SwissProt)
Core data: sequence, citation information and taxonomic data
Annotation: protein function; post-translational modifications;
domains and sites; associated diseases; sequence
conflicts/variants
Most primary databases link to a number of value-added (2°)
databases (e.g. motif databases or disease databases) which
are often rich in information
Annotation in 1° databases: SwissProt
ID   HEM3_HUMAN     STANDARD;      PRT;   361 AA.
AC   P08397; P08396; Q16012;
…
DE   PORPHOBILINOGEN DEAMINASE (EC 4.3.1.8) (HYDROXYMETHYLBILANE SYNTHASE)
DE   (HMBS) (PRE-UROPORPHYRINOGEN SYNTHASE) (PBG-D).
GN   HMBS OR PBGD.
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
…(literature references)…
CC   FUNCTION: TETRAPOLYMERIZATION OF THE MONOPYRROLE PBG INTO THE
CC   HYDROXYMETHYLBILANE PREUROPORPHYRINOGEN IN SEVERAL DISCRETE STEPS.
CC   CATALYTIC ACTIVITY: 4 PORPHOBILINOGEN + H(2)O = HYDROXYMETHYLBILANE + 4 NH(3).
CC   COFACTOR: BINDS A DIPYRROMETHANE COFACTOR TO WHICH PORPHOBILINOGEN SUBUNITS…
CC   PATHWAY: THIRD STEP IN PORPHYRIN BIOSYNTHESIS BY THE SHEMIN PATHWAY. INVOLVED…
CC   ALTERNATIVE PRODUCTS: THERE ARE TWO ISOZYMES OF THIS ENZYME IN MAMMALS; THEY
CC   ARE PRODUCED BY THE SAME GENE FROM ALTERNATIVE SPLICING…
CC   DISEASE: DEFECTS IN HMBS ARE THE CAUSE OF ACUTE INTERMITTENT PORPHYRIA (AIP); AN
CC   AUTOSOMAL DOMINANT DISEASE CHARACTERIZED BY ACUTE ATTACKS OF NEUROLOGICAL
CC   DYSFUNCTION…
CC   SIMILARITY: BELONGS TO THE HMBS FAMILY.
… (links to related databases - secondary databases) …
KW   Porphyrin biosynthesis; Heme biosynthesis; Lyase;
KW   Alternative splicing; Disease mutation.
…
(Sequence variations/Sequence)
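A minimal sketch of how the annotation in an entry like this could be pulled out programmatically, assuming the classic flat-file layout (a two-letter line code, three spaces, then content). The function names `parse_entry` and `keywords` are illustrative, not part of any SwissProt tool, and the sample is an abbreviated copy of the entry above.

```python
from collections import defaultdict

# Abbreviated sample in flat-file layout: two-letter line code,
# three spaces, then the content of the line.
SAMPLE_ENTRY = """\
ID   HEM3_HUMAN     STANDARD;      PRT;   361 AA.
AC   P08397; P08396; Q16012;
DE   PORPHOBILINOGEN DEAMINASE (EC 4.3.1.8) (HYDROXYMETHYLBILANE SYNTHASE)
DE   (HMBS) (PRE-UROPORPHYRINOGEN SYNTHASE) (PBG-D).
GN   HMBS OR PBGD.
OS   Homo sapiens (Human).
KW   Porphyrin biosynthesis; Heme biosynthesis; Lyase;
KW   Alternative splicing; Disease mutation.
"""


def parse_entry(text):
    """Group the content of each line by its two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        # Skip anything that does not follow the 'XX   content' layout.
        if len(line) < 5 or line[2:5].strip():
            continue
        fields[line[:2]].append(line[5:].strip())
    # Lines that share a code are continuation lines: join them.
    return {code: " ".join(parts) for code, parts in fields.items()}


def keywords(fields):
    """Split the KW field into individual keywords."""
    raw = fields.get("KW", "").rstrip(".")
    return [kw.strip() for kw in raw.split(";") if kw.strip()]


entry = parse_entry(SAMPLE_ENTRY)
print(entry["ID"].split()[0])  # entry name
print(entry["GN"])             # gene names
print(keywords(entry))         # list of keywords
```

Keeping the line code as the dictionary key means the same function handles CC (comment) lines too, although a real parser would also split the comment topics (FUNCTION, PATHWAY, DISEASE and so on).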
Annotation in Motif databases: INTERPRO
http://interpro.ebi.ac.uk/servlet/IEntry?ac=IPR000860
Genome databases


Some deal with single organisms (e.g. SubtiList for B.
subtilis; Sanger Centre M. tuberculosis)
Some deal with multiple genomes (e.g. TIGR microbial
genomes database)


The level of annotation can be extensive
Many are much more than sequence repositories
extending the sequence with tons of information (e.g.
mutants; strains; complementation plasmids etc.)

If you are working with a model organism, chances of
obtaining reliable functional annotations are improved
Genome database: YPD
http://www.proteome.com/databases/YPD/reports/HEM3.html
Function assignment by homology I

If you just have a sequence
 The most common bioinformatics procedure
 Search your protein of interest against primary
databases; chances are that if you find a homologue with
high identity, it performs a similar function
 Many, many tools (BLAST, FASTA, S-W Search)

Beware of annotation by homology


relationship between seq. similarity and function not
straightforward
danger of propagation of incorrect functional information
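The idea of transferring a function from a high-identity homologue, and of recording where the annotation came from so that errors are easier to trace, can be sketched as follows. This is only an illustration: the 40% cut-off is an arbitrary placeholder rather than a recommended threshold, the alignment is assumed to have been produced already by a tool such as BLAST or FASTA, and the function names are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def transfer_annotation(query_aln, hit_aln, hit_function, experimental,
                        min_identity=40.0):
    """Copy the hit's function to the query if identity is high enough.

    min_identity is an arbitrary illustrative cut-off (assumption).
    The evidence label records whether the source annotation was itself
    experimental or predicted, to limit silent error propagation.
    """
    identity = percent_identity(query_aln, hit_aln)
    if identity < min_identity:
        return None
    source = "experimental" if experimental else "predicted"
    return {
        "function": hit_function,
        "identity": round(identity, 1),
        "evidence": f"by homology to {source} annotation",
    }


result = transfer_annotation("MKT-AIL", "MKSQAIL",
                             "porphobilinogen deaminase", experimental=True)
print(result)
```

Even with such a label, a high identity only suggests a similar function; the caveats on this slide still apply.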
Function assignment by homology II


Consider databases which distinguish experimental
function assignments from homology based ones
(e.g. YPD/WormPD, EcoCyc)
Or use databases which employ more rigorous
automated annotation tools (e.g. HAMAP @
SwissProt)
“Among the peculiarities recognized by the programs are: size discrepancy, absence
or mutation of regions involved in activity or binding (to metals, nucleotides, etc),
presence of paralogs, contradiction with the biological context (i.e. if a protein
belongs to a pathway supposed to be absent in a particular organism), etc. Such
"problematic" proteins will not be automatically annotated.”
Genome database: YPD
http://www.proteome.com/databases/YPD/reports/HEM3.html
Functional assignment “without homology”


Novel functional assignment methods now exist
which don’t make use of ‘direct’ homology searches
They exploit other relationships between proteins
which are used as indicators of shared function


Phylogenetic profiles
“Rosetta stone genes”
Phylogenetic profiles
Pellegrini M et al.,
“Assigning protein
functions by comparative
genome analysis: protein
phylogenetic profiles.”
PNAS (1999) 96(8):4285-8
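A rough sketch of the phylogenetic profile idea: each protein gets a presence/absence vector across a set of genomes, and proteins with identical or very similar vectors are candidates for a shared function. The protein names, genome names and presence data below are hypothetical, and the published method involves more than this simple grouping.

```python
from collections import defaultdict

# Hypothetical data: for each protein, the genomes in which a
# homologue was detected.
PRESENCE = {
    "P1": {"genomeA", "genomeC"},
    "P2": {"genomeA", "genomeC"},
    "P3": {"genomeA", "genomeB", "genomeC"},
    "P4": {"genomeB"},
}
GENOMES = ["genomeA", "genomeB", "genomeC"]


def profile(protein):
    """Presence (1) / absence (0) of the protein in each genome, in order."""
    return tuple(int(genome in PRESENCE[protein]) for genome in GENOMES)


def group_by_profile(proteins):
    """Cluster proteins that share exactly the same profile."""
    groups = defaultdict(list)
    for protein in proteins:
        groups[profile(protein)].append(protein)
    return dict(groups)


def hamming(profile_a, profile_b):
    """Number of genomes in which two profiles disagree."""
    return sum(a != b for a, b in zip(profile_a, profile_b))


groups = group_by_profile(sorted(PRESENCE))
for prof, members in groups.items():
    print(prof, members)
print(hamming(profile("P1"), profile("P3")))
```

Here P1 and P2 share a profile and would be proposed as functionally linked; the Hamming distance gives a simple way to relax the requirement from identical to similar profiles.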
Rosetta Stone method
More methods…
Marcotte EM, et al., Nature (1999) 402:83-86
Enright AJ, et al., Nature (1999) 402:86-90
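The Rosetta Stone idea can be sketched in a few lines: if two separate proteins in one genome each match a different, non-overlapping region of a single fused protein in another genome, the two are proposed to be functionally linked. The hits below are hypothetical, and the code assumes similarity-search results are already available as (query, subject, start on subject, end on subject) tuples; the published methods apply further filtering that is not shown here.

```python
from itertools import combinations

# Hypothetical similarity-search hits:
# (query protein, subject protein, start on subject, end on subject)
HITS = [
    ("geneA", "fusedAB", 1, 210),
    ("geneB", "fusedAB", 230, 480),
    ("geneC", "fusedAB", 150, 300),
    ("geneA", "other", 5, 200),
]


def overlaps(hit_1, hit_2):
    """True if the two hits cover overlapping regions of the subject."""
    return hit_1[2] <= hit_2[3] and hit_2[2] <= hit_1[3]


def rosetta_stone_pairs(hits):
    """Pairs of distinct queries matching non-overlapping parts of one subject."""
    by_subject = {}
    for hit in hits:
        by_subject.setdefault(hit[1], []).append(hit)
    links = set()
    for subject, subject_hits in by_subject.items():
        for hit_1, hit_2 in combinations(subject_hits, 2):
            if hit_1[0] != hit_2[0] and not overlaps(hit_1, hit_2):
                first, second = sorted((hit_1[0], hit_2[0]))
                links.add((first, second, subject))
    return sorted(links)


pairs = rosetta_stone_pairs(HITS)
print(pairs)
```

In this toy data only geneA and geneB are linked: geneC overlaps both of them on the fused protein, so it is more likely a homologue of one region than an independent fusion partner.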
Functional assignment “without homology”

Some access over the WWW




but experimental
and only for certain organisms (Yeast, E. coli, M.
tuberculosis)
many proprietary methods
Considered one of the most promising solutions for
preliminary annotation of “unknown function” proteins
in genome sequencing projects
Collecting information for many genes

Usually for “large-scale biology” (e.g. micro-array
experiments)

Genome Databases
Functional classification schemes

Genome Databases

Genome sequencing projects are now the
primary driving force for extensive functional
annotation

We have the genes (ORFs), we want the
functions
FUNCTIONAL GENOMICS
(… more ’omes)
Functional classification schemes I

Dealing with large sets of genes → functional
classification schemes

Tentative schemes as early as 1983; use driven by
genome sequencing projects
 First extensive scheme published in 1993 by Monica
Riley [regularly updated (GenProtEC; EcoCyc)]


The majority of current schemes are heavily
influenced by the ‘Riley scheme’
‘2nd generation’ schemes are now being developed
Functional classification schemes II




Most schemes can be thought of as trees
Progression along the tree (root to leaves) represents
increasingly specific functions
ORFs are generally associated with leaf nodes (but of
course, they are also associated with intermediary
nodes)
Examples of use:


create gene sets linked by functionality (e.g. to detect
functional motifs)
validate a functional connection between genes (e.g. gene
expression studies)
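The tree view described above can be sketched as a small data structure in which ORFs may be attached to leaves or to intermediary nodes, and the gene set for any category is everything at or below that node. The ORF identifiers are hypothetical and the category names are only illustrative; no real scheme is reproduced here.

```python
class Node:
    """One category in a tree-like functional classification scheme."""

    def __init__(self, name, orfs=()):
        self.name = name
        self.orfs = set(orfs)  # ORFs assigned directly to this node
        self.children = []

    def add(self, child):
        self.children.append(child)
        return child

    def gene_set(self):
        """All ORFs assigned to this node or to any node below it."""
        result = set(self.orfs)
        for child in self.children:
            result |= child.gene_set()
        return result

    def find(self, name):
        """Depth-first search for a category by name."""
        if self.name == name:
            return self
        for child in self.children:
            found = child.find(name)
            if found is not None:
                return found
        return None


# Hypothetical ORF identifiers; more specific functions sit nearer the leaves.
root = Node("Metabolism of small molecules")
energy = root.add(Node("Energy metabolism", ["orf_e0"]))  # intermediary node
energy.add(Node("Glycolysis", ["orf_g1", "orf_g2"]))
energy.add(Node("Fermentation", ["orf_f1"]))
amino = root.add(Node("Amino acids"))
amino.add(Node("Alanine", ["orf_a1"]))

print(sorted(root.find("Energy metabolism").gene_set()))
```

A gene set built this way is what the two uses listed above need: a group of ORFs linked by function that can be scanned for shared motifs or compared against expression clusters.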
An example scheme…
GenProtEC
Metabolism of small molecules (900 ORFs)
  Amino Acids (112 ORFs)
    Alanine (2 ORFs)
    etc.
  Central Intermediary Metabolism
    Amino sugars (8 ORFs)
    etc.
  Energy Metabolism
    Aerobic respiration (32 ORFs)
    Fermentation (22 ORFs)
    Glycolysis (18 ORFs)
    etc.
  etc.
Issues
 Functions: Apples
and Oranges
 Multi-dimensionality
 Multi-functionality
Issues: Apples and Oranges



Function is an umbrella catch-all term
Schemes do not distinguish between aspects of
functions
Most commonly they mix gene product type (T),
activity (A) and cellular role (R)
Cell division (R) : DNA replication (A)
Osmotic adaptation (R) : Ion channel (T,A)
Issues - Multi-dimensionality I

Human trypsin functions:





Biochemical: peptide bond hydrolysis
Molecular: proteolytic enzyme
Cellular: protein degradation
Physiological: digestion
Could conceive a number of other dimensions


Cellular location
Regulation
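One way to picture multi-dimensionality is to store each dimension of function in its own field instead of a single "function" label. The sketch below uses the trypsin example from this slide; the second protein and all of its values are hypothetical placeholders added only so that the grouping has something to compare.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FunctionAnnotation:
    protein: str
    biochemical: str
    molecular: str
    cellular: str
    physiological: str
    location: Optional[str] = None    # further dimensions are optional
    regulation: Optional[str] = None


ANNOTATIONS = [
    # Values taken from the slide.
    FunctionAnnotation(
        protein="human trypsin",
        biochemical="peptide bond hydrolysis",
        molecular="proteolytic enzyme",
        cellular="protein degradation",
        physiological="digestion",
    ),
    # Hypothetical second entry, for illustration only.
    FunctionAnnotation(
        protein="hypothetical protease B",
        biochemical="peptide bond hydrolysis",
        molecular="proteolytic enzyme",
        cellular="protein degradation",
        physiological="hypothetical other process",
    ),
]


def group_by(annotations, dimension):
    """Group protein names by their value in one dimension of function."""
    groups = defaultdict(list)
    for annotation in annotations:
        groups[getattr(annotation, dimension)].append(annotation.protein)
    return dict(groups)


print(group_by(ANNOTATIONS, "cellular"))
print(group_by(ANNOTATIONS, "physiological"))
```

The two proteins fall in the same group at the cellular level but in different groups at the physiological level, which is exactly the distinction a single mixed "function" label would hide.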
Issues - Multi-dimensionality II

Why differentiate function and process?

Figure of cell cycle-dependent Yeast gene expression
clusters (Pat Brown lab - Stanford)
Issues - Multi-functionality



Inherent: e.g. lac repressor; carbohydrate metabolism and osmoprotection
Multi-subunit: e.g. succinic dehydrogenase; whole enzyme in TCA; subunit 1 - electron transport chain; subunit 2 - cell structure
Circumstantial: e.g. acetate kinase; acetate-only environment - acetate metabolism; acetate absent - fermentation enzyme
Gene Ontology - a collaboration



Drosophila (fruit fly) - FlyBase
Saccharomyces Genome Database (SGD)
Mus (mouse) - Mouse Genome Database (MGD)
Gene Ontology - the next generation

Multi-dimensional:




functional primitive: “a capability that a physical gene product
(or gene product group) carries as a potential” (e.g.
transporter or adenylate cyclase)
process: “a biological objective accomplished via one or more
ordered assemblies of functions” (e.g. cell growth and
maintenance or purine metabolism)
cellular component
Extensive: depth 11; nearly 4000 terms
 More complex organisation: away from tree structure
 Theoretically applicable to all species (designed for
multicellular eukaryotes)
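The move "away from tree structure" can be illustrated with a small directed acyclic graph in which a term may have more than one parent. The term and gene names below are hypothetical placeholders, not real Gene Ontology identifiers, and the sketch ignores relationship types.

```python
# Each term lists its parents; "term_c" has two, which a tree cannot express.
PARENTS = {
    "term_root": [],
    "term_a": ["term_root"],
    "term_b": ["term_root"],
    "term_c": ["term_a", "term_b"],
}

# Hypothetical gene product annotations (gene -> most specific term).
ANNOTATIONS = {"geneX": "term_c", "geneY": "term_a", "geneZ": "term_b"}


def ancestors(term, parents=PARENTS):
    """All terms reachable by following parent links upwards from term."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def annotated_to(term, annotations=ANNOTATIONS, parents=PARENTS):
    """Genes annotated to the term itself or to any of its descendants."""
    return sorted(
        gene
        for gene, gene_term in annotations.items()
        if gene_term == term or term in ancestors(gene_term, parents)
    )


print(sorted(ancestors("term_c")))
print(annotated_to("term_a"))
print(annotated_to("term_b"))
```

Because term_c sits under both term_a and term_b, geneX is retrieved under either parent, which is one way such a structure can accommodate multi-functional gene products.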
Gene Ontology - Process
Gene Ontology - current status
http://www.geneontology.org/
Where to look for functional information: single protein

With 1 or a few genes:




Primary databases (e.g. SwissProt)
Model organism databases (e.g. GenProtEC; SGD;
WormPD)
Metabolic/Pathway databases (e.g. KEGG)
Value-added databases (e.g. Motif databases; Disease
databases)

By homology

Not by homology
Where to look for functional information: protein sets

Need some sort of functional classification scheme:


Tree like schemes (e.g. TIGR, GenProtEC)
Gene Ontology (FlyBase, MGD, SGD)

For comparative genomics, need schemes applied to
multiple organisms (e.g. PEDANT, TIGR)

Currently, greatest genome coverage is by PEDANT
(but non-manually curated)
Conclusions





Functional information is available but it is rarely
centralised
Function is a very broad definition; hard to know if the
information you need will be available at the level you
need it
New schemes (e.g. GO) are emerging which try to
cope with functional annotation better
And new automated functional annotation tools are
emerging (‘intelligent systems’; non-homology based)
You still need to validate predictions experimentally
A survey of (some) current schemes







1) EcoCyc/GenProtEC: E. coli scheme (Riley scheme, MBL)
2) SubtiList: Bacillus subtilis scheme (Institut Pasteur)
3) MIPS/PEDANT: yeast scheme (applied to other organisms in
PEDANT) (Munich Information Center for Protein Sequences)
4) TIGR: microbial genomes scheme (The Institute for Genomic
Research)
5) KEGG: multi-organism scheme (metabolic and regulatory
pathways) (Kyoto Encyclopedia of Genes and Genomes)
6) WIT: multi-organism scheme (metabolic reconstruction) (What
Is There; ANL)
7) Gene Ontology: a 2nd generation functional classification
scheme (EBI; FlyBase; MGD; SGD)
FuncWheel for the Combination Scheme
Conclusions - Scheme comparison I

Similar in the coverage of function (although
very varying ‘granularity’)
 ...yet different enough that direct comparison
is complex
 Essentially deal with unicellular microbial
organisms (MIPS is tackling this)
 Certain ‘niche’ schemes (e.g. WIT/KEGG)
 ...or user community tailored schemes (e.g.
SubtiList)
WWW sites I

Primary databases (Sequence):
SwissProt: http://www.expasy.ch/sprot
PIR: http://www-nbrf.Georgetown.edu/
NCBI databases: http://www.ncbi.nlm.nih.gov/Database/index.html

Primary databases (Structure):
Protein Data Bank: http://www.rcsb.org/
Macromolecular Structure Database: http://msd.ebi.ac.uk/

Value added:
INTERPRO: http://interpro.ebi.ac.uk/
WWW sites II

Single genome databases:
SubtiList: http://genolist.pasteur.fr/SubtiList/
Saccharomyces Genome Database: http://genome-www.stanford.edu/Saccharomyces/
Mouse Genome Database (MGD): http://www.informatics.jax.org/
FlyBase: http://flybase.bio.indiana.edu/
GenProtEC: http://genprotec.mdbl.edu/
EcoCyc: http://ecocyc.pangeasystems.com/
Yeast Protein Database (YPD) and WormPD: http://www.proteome.com/
WWW sites III

Multiple genome databases:
The Institute for Genomic Research: http://www.tigr.org/microbialdb
MIPS/PEDANT: http://pedant.mips.biochem.mpg.de/

Pathway databases:
KEGG: http://www.genome.ad.jp/kegg/
WIT: http://igweb.integratedgenomics.com/IGwit/

HAMAP: http://www.expasy.ch/sprot/hamap/

Non-homology based function prediction:
Mycobacterium tuberculosis: http://www.doe-mbi.ucla.edu/people/sergio/TB/tb.html
Yeast: http://www.doe-mbi.ucla.edu/people/marcotte/yeast.html
A relevant paper

http://www.biochem.ucl.ac.uk/~rison/Publications/index.html