Subsystem Approach to
Genome Annotation
National Microbial Pathogen Data Resource
Claudia Reich
NCSA, University of Illinois, Urbana
Complete Microbial Genomes
• 464 complete microbial genomes in NCBI as of 3-1-07
• 691 microbial genomes in progress as of 3-1-07
Making Sense of Genome Data
• Locate Genes: identify ORFs automatically
NCBI’s ORF Finder
• Assign Function: by sequence similarity to
experimentally characterized proteins
 BLAST family of sequence comparison tools
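As a rough illustration of what "identify ORFs automatically" involves, the sketch below scans the three forward reading frames of a DNA string for a start codon followed by an in-frame stop codon. It is a toy under stated assumptions, not a reproduction of NCBI's ORF Finder: the function name `find_orfs`, the `min_codons` threshold, and the restriction to ATG starts on the forward strand are all illustrative choices.

```python
# Toy ORF scan. Assumptions: forward strand only, ATG as the only start codon,
# standard stop codons. Real gene callers also handle the reverse strand,
# alternative starts, and much longer minimum lengths.
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) tuples for ORFs on the forward strand.

    start is 0-based; end is exclusive and includes the stop codon.
    min_codons counts every codon in the ORF, including start and stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i  # open an ORF at the first start codon seen
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close the ORF and keep scanning this frame
    return orfs


toy = "CCATGAAATTTTAGGG"
print(find_orfs(toy))  # [(2, 14, 2)]
```

Finding an ORF this way says nothing about function; that is the job of the similarity search in the next step.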
Problems with Assignments by Similarity
• When ORF is a member of a protein family
• Paralogous genes
• ORFs encoding similar proteins acting on
different substrates
• Assignments can be transitive, and many
times removed from experimental data
Other Factors Can Aid in
Function Assignments
Molecular phylogeny
Paralogous and orthologous families
Conserved gene neighborhood
Metabolic context
Bidirectional best hit matches across
multiple genomes
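A bidirectional best hit can be sketched as follows: gene a in genome A and gene b in genome B form a pair when b is the top-scoring hit for a and a is the top-scoring hit for b. The score dictionaries, gene names, and function names below are hypothetical; in practice the scores would come from a sequence comparison tool such as BLAST.

```python
def best_hits(scores):
    """Map each query gene to its single highest-scoring subject gene.

    scores: dict of {(query, subject): similarity_score}.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Made-up scores: a1 and a2 are paralogs that both resemble b2.
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 150, ("a2", "b2"): 90}
b_vs_a = {("b1", "a1"): 198, ("b2", "a1"): 160, ("b2", "a2"): 95}

print(bidirectional_best_hits(a_vs_b, b_vs_a))  # [('a1', 'b1')]
```

In the toy data, a2's best hit is b2, but b2's best hit is a1, so the a2/b2 pair is rejected. This is how the criterion helps separate orthologs from paralogs.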
Incorporating Information Other
Than Similarity
• KEGG: manually curated pathway and
metabolic maps
• GO: vocabularies that describe ORFs as
associated with
 biological processes
 cellular components
 molecular function
• MetaCyc: experimentally elucidated metabolic pathways
What is Needed:
• A system that:
 integrates all the above concepts
 organizes genomic data in structured idioms
 allows high-throughput annotation of newly
sequenced genomes
 resolves discrepancies in different annotation
 informs experimental research
Enter the SEED*
• Database and annotation environment
• Underlies, and accessible through, NMPDR
• Expert annotation via subsystems building
• Provides the most accurate genome
annotations available
*Argonne National Lab, University of Chicago, UIUC, FIG
What is a Subsystem?
• Any organizing biological principle:
 metabolic pathway
• amino acid biosynthesis, nitrogen fixation, glycolysis
 complex structure
• ribosome, flagellum
 set of defining features
• virulome, pathogenicity islands
 functional concept
• bacterial sigma factors, DNA binding proteins
Subsystems are:
• Sets of functional roles, which are functions,
or abstractions of functions (such as an EC
number), that together implement a specific
biological process or concept
• Created manually by expert curators
• Experts annotate single subsystems over the
complete collection of genomes, thus
contributing and sharing their expertise with
the scientific community
How Subsystems are Built
• Create a subsystem for the biological concept,
and define the functional roles
• In one (or a few) key organisms that include
the subsystem, find the genes and assign
meaningful functional names
• Project the annotations to orthologous genes
• Expand to more genomes, creating a
Populated Subsystem
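The projection step above can be pictured with a minimal sketch: functional roles assigned to genes in a key organism are copied to their orthologs in another genome. The gene identifiers, role names, and the function `project_roles` are hypothetical, and the ortholog pairs are assumed to have been computed elsewhere (for example by bidirectional best hits). In the SEED the result is reviewed by an expert curator, which this sketch does not model.

```python
def project_roles(key_annotations, ortholog_pairs):
    """Copy functional roles from key-organism genes to their orthologs.

    key_annotations: dict of {key_gene: functional_role}.
    ortholog_pairs: iterable of (key_gene, target_gene) tuples.
    Genes without an annotated ortholog are left unassigned.
    """
    projected = {}
    for key_gene, target_gene in ortholog_pairs:
        role = key_annotations.get(key_gene)
        if role is not None:
            projected[target_gene] = role
    return projected


# Hypothetical data: k3 has an ortholog but no curated role, so t9 gets none.
key_annotations = {"k1": "role_A", "k2": "role_B"}
ortholog_pairs = [("k1", "t7"), ("k3", "t9")]

print(project_roles(key_annotations, ortholog_pairs))  # {'t7': 'role_A'}
```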
Populated Subsystems
• Are Spreadsheets where:
 Columns: functional roles
 Rows: specific genomes
 Cells: genes in the organism that implement the
functional role
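One way to picture a populated subsystem is as a nested mapping from genome to functional role to a list of genes. The sketch below uses made-up genome names, role names, and gene identifiers; it is not the SEED's actual storage format.

```python
# Columns of the spreadsheet: the functional roles of the subsystem.
roles = ["role_A", "role_B", "role_C"]

# Rows are genomes; each cell lists the genes that implement the role.
# All identifiers here are hypothetical placeholders.
spreadsheet = {
    "genome_1": {"role_A": ["g1_0012"], "role_B": ["g1_0340"], "role_C": ["g1_0341"]},
    "genome_2": {"role_A": ["g2_0107", "g2_0998"], "role_B": ["g2_0555"], "role_C": []},
}


def format_spreadsheet(roles, spreadsheet):
    """Render the subsystem as tab-separated text, one genome per row."""
    lines = ["\t".join(["genome"] + roles)]
    for genome in sorted(spreadsheet):
        cells = [", ".join(spreadsheet[genome].get(role, [])) or "-" for role in roles]
        lines.append("\t".join([genome] + cells))
    return "\n".join(lines)


print(format_spreadsheet(roles, spreadsheet))
```

A cell may hold one gene, several genes, or none; empty cells are printed as "-".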
How to Access Subsystems
• From Home page (left navigation bar):
Subsystem Summaries: select organism
• From Organism pages
• From Subsystem Search
• From protein pages: to specific subsystems
Subsystem Pages in NMPDR
Table of Functional Roles
Subsystem diagram (if appropriate)
Populated subsystem spreadsheet
Customizable spreadsheet viewing options
Functional variants and subsets of roles
Curator’s notes
Benefits of Subsystems
More accurate annotations
Annotation of protein families
Analysis of sets of functionally related proteins
Automatic projections to novel genomes are less
error-prone
Subsystems Reveal Interesting
• Pathway variants:
 Are they clustered by phylogeny?
• Delta subunit of RNA polymerase only in Bacillales
 Are they clustered by functional niche?
 Horizontal gene transfer?
• Fused genes:
  and ’ subunit of RNA polymerase fused in
• Fissioned genes:
 β′ subunit of RNA polymerase is fissioned in
Subsystems Reveal Interesting
• Duplicate assignments
 More than one gene for one functional role?
• Alpha subunit of RNA polymerase in Magnetococcus
and Francisella
 Same sequenced region in more than one contig
in partially assembled genomes?
 Frameshifts or other sequencing errors?
 Annotation errors?
Subsystems Reveal Interesting
• Missing genes:
 Is the function essential?
 Is the function conserved?
 Does the missing gene cluster with homologs in
other organisms?
 Is the function performed by a newly recruited
 Has a gene been acquired by horizontal gene
transfer and now performs that function?
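The duplicate-assignment and missing-gene questions above both start from the same observation: a spreadsheet cell that holds more than one gene, or none. A minimal sketch of that scan, assuming the subsystem is stored as a genome-to-role-to-genes mapping with made-up identifiers, might look like this. It only flags the cells; deciding whether a flag reflects biology, an assembly artifact, or an annotation error remains the curator's job.

```python
def audit_spreadsheet(roles, spreadsheet):
    """Flag empty cells (missing genes) and cells with more than one gene.

    Returns (missing, duplicated), where missing is a list of (genome, role)
    and duplicated is a list of (genome, role, gene_count).
    """
    missing, duplicated = [], []
    for genome in sorted(spreadsheet):
        for role in roles:
            genes = spreadsheet[genome].get(role, [])
            if not genes:
                missing.append((genome, role))
            elif len(genes) > 1:
                duplicated.append((genome, role, len(genes)))
    return missing, duplicated


# Hypothetical subsystem with one empty cell and one duplicated role.
roles = ["role_A", "role_B", "role_C"]
spreadsheet = {
    "genome_1": {"role_A": ["g1_0012"], "role_B": ["g1_0340"], "role_C": ["g1_0341"]},
    "genome_2": {"role_A": ["g2_0107", "g2_0998"], "role_B": ["g2_0555"], "role_C": []},
}

missing, duplicated = audit_spreadsheet(roles, spreadsheet)
print(missing)     # [('genome_2', 'role_C')]
print(duplicated)  # [('genome_2', 'role_A', 2)]
```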
Synthesis of Selenocysteinyl-tRNA
• Two known pathway variants
 One step in Bacteria
• SelA is annotated
 Two steps in Archaea and Eucarya
• PSTK was missing until very recently
Explore Selenocysteine Usage
• Start by searching for gene name, selA, in an organism known
to use Sec, E. coli K12
• Start from subsystem tree; expand category of "Protein
metabolism," expand subcategory of "Selenoproteins"
• Open "Selenocysteine metabolism" subsystem from protein
page or SS tree
Genomes arranged phylogenetically
Roles defined on mouse-over
What genes are missing in which organisms?
Are there Sec metabolism genes present in any organisms that do not
have proteins that need Sec?
 Are there organisms known to need Sec for certain proteins, but that do
not have a complete Sec biosynthesis pathway?
 Why is there a hypothetical protein included in this subsystem?