Advancing Science with DNA Sequence
IMG terms and pathways
Natalia Ivanova
Iain Anderson
Thanos Lykidis
Nikos Kyrpides
Krishna Palaniappan
Amy Chen
Frank Korzeniewski
Yuri Grechkin
Ernest Szeto
Victor Markowitz
MGM Workshop
February 1, 2012

New: SEED subsystems, Transport DB, Phenotypes
Why so many?
What’s the difference?
Which one should I use?

Where it all comes from
• Experimental data: gene A in genome X
  - catalyzes a reaction
  - interacts with other protein(s)
  - gene knock-out causes a certain phenotype
  - …
This information is recorded in a structured way:
  - ontologies (e.g. Gene Ontology)
  - pathway collections (metabolic and protein-protein interaction)
  - other (reasoning rules, like TIGR Genome Properties)

Modeling the data properly – why nobody does that
[Diagram: data model linking gene, transcript, protein, enzyme, reaction, compounds, pathway, phenotype, and evidence]
• Genes are connected to phenotypes via a multi-step process, with many parameters
• We have very vague ideas about the steps/parameters for the majority of genes/phenotypes
• If we design a relational database for gene/phenotype connections, most tables will be empty

What it looks like in real life – KEGG vs MetaCyc
KEGG: http://www.genome.jp/kegg/
MetaCyc: http://metacyc.org/

Ammonia oxidation pathway in KEGG

The same pathway/reaction in MetaCyc

Even the MetaCyc record is still incomplete
• Which subunit has which cofactor?
• Type of Cu2+ cluster, type of Fe2+ cluster?
• One of the subunits is a cytochrome c, yet the enzyme is cytosolic?
• Does it require any help with maturation of metal clusters?
• Pseudomonas sp. PB16 was shown to have only 1 enzyme from the pathway, hydroxylamine reductase. Does it have the entire pathway?

Even bigger mess: bioinformatics inference
• Experimental data: gene A in genome X
  - catalyzes a reaction
  - interacts with other protein(s)
  - gene knock-out causes a certain phenotype
  - …
What about gene B in genome Y, which is similar to gene A?

“True or false?” game
• If gene B was manually annotated, the annotation must be correct
• If gene B was manually annotated, and it has a bi-directional best BLAST hit to gene A with an e-value of 1.0e-5, the annotation must be correct
• If gene B was manually annotated, it has >50% identity to gene A, and it is found in the same conserved chromosomal neighborhood as gene A, the annotation must be correct
• …

Poorly done inference - MetaCyc
• Software called PathoLogic
• Parses annotated files and tries to find matches between EC numbers, full product names, or partial product names and reactions in the MetaCyc database
• Automatically infers pathway presence based on matches to MetaCyc reactions
• Tries to find candidate genes for “missing” enzymes by running BLAST with the genes assigned to this reaction in other organisms
• Generates a lot of false positives: it inferred the presence of the ammonia oxidation pathway in Staphylococcus based on the presence of 1 gene annotated as ammonia monooxygenase in the GenBank file
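
The false-positive problem above can be illustrated with a minimal sketch. This is not PathoLogic's actual algorithm; it only contrasts a rule that calls a pathway present on any single reaction match with a rule that requires a minimum fraction of the pathway's reactions. The pathway names, reaction IDs, and the `min_fraction` parameter are all hypothetical.

```python
from typing import Dict, Set

# Hypothetical toy pathways; the reaction IDs stand in for EC numbers
# or product names and are not taken from any real database.
PATHWAYS: Dict[str, Set[str]] = {
    "ammonia oxidation (toy)": {"R_amo", "R_hao"},
    "pathway B (toy)": {"R_b1", "R_b2", "R_b3"},
}


def naive_inference(genome_reactions: Set[str]) -> Set[str]:
    """Call a pathway present if at least one of its reactions matches."""
    return {name for name, rxns in PATHWAYS.items() if rxns & genome_reactions}


def coverage_inference(genome_reactions: Set[str], min_fraction: float = 1.0) -> Set[str]:
    """Call a pathway present only if enough of its reactions are matched."""
    return {
        name
        for name, rxns in PATHWAYS.items()
        if len(rxns & genome_reactions) / len(rxns) >= min_fraction
    }


# A genome with one (possibly misannotated) match to the toy ammonia
# oxidation pathway and a complete toy pathway B.
genome = {"R_amo", "R_b1", "R_b2", "R_b3"}
naive_calls = naive_inference(genome)
strict_calls = coverage_inference(genome, min_fraction=1.0)
```

With this toy genome the naive rule reports both pathways, while the coverage rule reports only pathway B. A real system would also have to decide how much to trust each annotation, which this sketch ignores.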

Better inference: KEGG
• Annotation is inferred based on orthology, defined as bi-directional best BLAST hits, manually refined based on “Ortholog tables” and chromosomal clusters
• Poorly documented, but seems to generate far fewer false positives than PathoLogic
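
As a rough illustration of the bi-directional best hit idea, here is a minimal sketch. It assumes the BLAST results have already been reduced to `(query, subject, bit score)` tuples; the gene names and scores are made up, and e-value thresholds, ties, and the manual refinement step mentioned above are left out. It is not KEGG's procedure, which is poorly documented.

```python
from typing import Dict, Iterable, Set, Tuple

Hit = Tuple[str, str, float]  # (query gene, subject gene, bit score)


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Keep the highest-scoring subject for each query gene."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(x_vs_y: Iterable[Hit], y_vs_x: Iterable[Hit]) -> Set[Tuple[str, str]]:
    """Return pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(x_vs_y)
    reverse = best_hits(y_vs_x)
    return {(a, b) for a, b in forward.items() if reverse.get(b) == a}


# Hypothetical hits between genome X (A1, A2) and genome Y (B1, B2).
x_vs_y = [("A1", "B1", 250.0), ("A1", "B2", 90.0), ("A2", "B1", 120.0)]
y_vs_x = [("B1", "A1", 248.0), ("B2", "A1", 88.0)]
pairs = bidirectional_best_hits(x_vs_y, y_vs_x)
```

Here A2's best hit is B1, but B1's best hit is A1, so only the pair (A1, B1) survives.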

Even the best structured inference is far from perfect
• Problem: neither BLAST nor Smith-Waterman knows which amino acids are more important for protein function than others
• Using a consensus sequence (either as a PSSM or an HMM) with family-specific bit score cutoffs would be much better
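
To make the PSSM point concrete, here is a minimal sketch under strong simplifying assumptions: an ungapped toy alignment, a uniform background distribution, a fixed pseudocount, and a hand-picked cutoff. The sequences and the `cutoff` value are hypothetical, and real tools use far more careful statistics.

```python
import math
from collections import Counter
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment: List[str], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Per-column log-odds scores (in bits) against a uniform background."""
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm: List[Dict[str, float]], sequence: str) -> float:
    """Sum the column scores; assumes the sequence has the alignment's length."""
    return sum(column[aa] for column, aa in zip(pssm, sequence))


# Toy family: positions 1, 2 and 4 are conserved, position 3 is variable.
family = ["HCAG", "HCSG", "HCTG", "HCVG"]
pssm = build_pssm(family)
cutoff = 5.0  # hypothetical family-specific cutoff

# Both candidates differ from "HCAG" at exactly one position.
variable_site_change = score(pssm, "HCLG")   # change at the variable position
conserved_site_change = score(pssm, "ACAG")  # change at a conserved position
```

Both candidates have the same percent identity to a family member, but the profile scores the change at the conserved position lower, and only the first candidate clears the cutoff.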

Pathway collections: KEGG, MetaCyc and others
Which particular set of interactions is a pathway? (i.e., how do we define pathway boundaries within the network?)

Ideal solution: pathway NR
• All pathway collections share a common skeleton of reactions, which consist of reactants (compounds)
• All reactions share the common base of proteins annotated as catalysts
• Can we merge the information from different collections, using the best features of all of them?
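
One way to picture such a merge is sketched below. It assumes compound names have already been mapped to shared identifiers across collections, which is the hard part in practice and is not addressed here. The collection names, compounds, and catalysts are hypothetical and are not real KEGG or MetaCyc records.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

ReactionKey = Tuple[FrozenSet[str], FrozenSet[str]]  # (substrates, products)


def reaction_key(substrates: List[str], products: List[str]) -> ReactionKey:
    """Identify a reaction by its compounds, independent of their order."""
    return (frozenset(substrates), frozenset(products))


def merge_collections(collections: Dict[str, List[dict]]) -> Dict[ReactionKey, dict]:
    """Group reactions from all collections by their compound skeleton."""
    merged: Dict[ReactionKey, dict] = defaultdict(
        lambda: {"sources": set(), "catalysts": set()}
    )
    for source, reactions in collections.items():
        for rxn in reactions:
            entry = merged[reaction_key(rxn["substrates"], rxn["products"])]
            entry["sources"].add(source)
            entry["catalysts"].update(rxn["catalysts"])
    return dict(merged)


# Hypothetical input: two collections that partly overlap.
collections = {
    "collection_1": [
        {"substrates": ["A"], "products": ["B"], "catalysts": ["enzyme 1"]},
    ],
    "collection_2": [
        {"substrates": ["A"], "products": ["B"], "catalysts": ["enzyme 1, subunit alpha"]},
        {"substrates": ["B"], "products": ["C"], "catalysts": []},
    ],
}
merged = merge_collections(collections)
shared = merged[reaction_key(["A"], ["B"])]
```

The shared reaction A → B ends up as one entry that records both sources and the union of catalyst annotations; the reaction known to only one collection is kept as is.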

IMG terms: 3 types
IMG terms of 3 types:
1. gene product
2. multi-subunit protein complex
3. modified protein
[Diagram: reaction network over compounds A, B, C and D, with reactions R1, R2 (spontaneous), R3 (spontaneous) and R4 (chaperone). Protein labels: "Enzyme (EC x.x.x.x)"; "Enzyme (EC x.x.x.x), monomeric precursor"; "Enzyme (EC x.x.x.x), monomeric, needs cofactor C"; "Enzyme (EC x.x.x.x), heterotrimeric, subunit A precursor"; "Enzyme (EC x.x.x.x), heterotrimeric, subunit A"; "Enzyme (EC x.x.x.x), heterotrimeric, subunit B"; "Enzyme (EC x.x.x.x), heterotrimeric, subunit C"; "Enzyme (EC x.x.x.x), heterotrimeric, needs cofactor D". Callouts: "Not an IMG term!"; IMG term of the type “Gene product” (twice); IMG term of the type “Modified protein”; IMG term of the type “Protein complex”.]
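
The three term types can be sketched as a small data structure. This is only one possible reading of the diagram, not IMG's actual schema: the class names, the `components` field, and the assumption that a modified protein derives from a gene product and that a complex is built from such terms are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class TermType(Enum):
    GENE_PRODUCT = "gene product"
    PROTEIN_COMPLEX = "multi-subunit protein complex"
    MODIFIED_PROTEIN = "modified protein"


@dataclass
class Term:
    name: str
    term_type: TermType
    # Terms this one is built from (assumed relationship, see lead-in).
    components: List["Term"] = field(default_factory=list)

    def gene_products(self) -> List["Term"]:
        """Collect the gene-product terms this term ultimately depends on."""
        if self.term_type is TermType.GENE_PRODUCT:
            return [self]
        found: List["Term"] = []
        for component in self.components:
            found.extend(component.gene_products())
        return found


# Hypothetical terms modeled on the heterotrimeric enzyme in the diagram.
a_precursor = Term("subunit A precursor", TermType.GENE_PRODUCT)
subunit_a = Term("subunit A", TermType.MODIFIED_PROTEIN, [a_precursor])
subunit_b = Term("subunit B", TermType.GENE_PRODUCT)
subunit_c = Term("subunit C", TermType.GENE_PRODUCT)
complex_term = Term(
    "heterotrimeric enzyme", TermType.PROTEIN_COMPLEX,
    [subunit_a, subunit_b, subunit_c],
)
names = [term.name for term in complex_term.gene_products()]
```

Walking the complex back to gene products gives the three genes an annotator would need to find in a genome before claiming the enzyme is present.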

Protein-protein interaction pathways: same model
You’ve been warned!