Identify gene markers for different taxonomic
groups in Archaea and Bacteria Genomes
Dongying Wu1,2, Jonathan A. Eisen1,2
1. DOE Joint Genome Institute, Walnut Creek, California 94598, USA
2. University of California, Davis, Davis, California 95616, USA
Introduction
The sequencing and phylogenetic analysis of rRNA molecules demonstrated that all
organisms could be placed on a single tree of life. Because highly conserved, homologous
16S rRNA genes are present in all organismal lineages, they are the only universal marker
that has been adopted by biologists. Unfortunately, phylogenetic trees based on rRNA
sequences do not always accurately reflect the evolutionary history of the organisms
represented, owing to lateral gene transfer, different rates of evolution in different
lineages, and convergent evolution of rRNAs, which can make rRNA sequences from distantly
related species more similar to each other over time. There are also difficulties in
generating accurate alignments of rRNA genes. For metagenomic studies of bacteria and
archaea, more phylogenetic markers are needed in addition to the small subunit rRNA gene.
Phylogenetic analyses using protein markers have been limited in scope. To date, the
AMPHORA package developed by our lab includes only 31 protein markers for bacteria. To
address this issue, we have started to identify phylogenetic markers at different taxonomic
levels systematically. Progress in genome sequencing, especially the phylogenetic
diversity-driven Genomic Encyclopedia of Bacteria and Archaea (GEBA) project, provides a
great dataset for our top-down approach to marker identification.
B. Grouping Families at Different Taxonomic Levels into Super-families
Many gene families for different phylogenetic groups might be sub-families of families at
higher taxonomic levels. To capture the relationships among the gene families and identify
super-families that span multiple phylogenetic groups, we used a tree-based method to
sort and cluster the gene families. HMM profiles were built for each of the 5133 gene
families that can be marker candidates at local taxonomic levels, and one consensus sequence
was emitted from each HMM profile. The consensus sequences were subsequently
grouped into single-linkage clusters after an all-vs-all BLASTP search among the
consensus sequences. A neighbor-joining tree was built for each cluster with FastTree. Using
the same tree sampling program we used to identify phylogenetic marker candidates at local
levels, we picked the clades that had only one consensus sequence from any one taxonomic
group and whose sequences were distinct from the other consensus sequences in the tree
(Figure 2).
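The single-linkage clustering step can be illustrated with a small union-find sketch. This is not the tool used for the poster; the sequence identifiers and hit pairs are hypothetical, and the hits are assumed to have already passed whatever BLASTP cutoff was applied.

```python
from collections import defaultdict


def single_linkage_clusters(ids, hits):
    """Group ids so that any chain of pairwise hits ends up in one cluster."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = defaultdict(set)
    for i in ids:
        clusters[find(i)].add(i)
    # Sort members and clusters only to make the output deterministic.
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


# Hypothetical consensus-sequence ids and BLASTP hit pairs.
ids = ["arch_001", "actino_017", "firm_042", "cyano_003", "delta_009"]
hits = [("arch_001", "actino_017"), ("actino_017", "firm_042")]
clusters = single_linkage_clusters(ids, hits)
```

Here `arch_001` and `firm_042` share no direct hit but still fall into one cluster through `actino_017`, which is the defining property of single linkage.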
Figure 4. The monophyletic value at different taxonomic levels based on a PhyML tree
of the ribosomal protein S4 family.
E. Identifying Super-families that can be Markers for at least 5 Taxonomic Levels
For each super-family, a phylogenetic tree was built. At all the taxonomic levels we
were interested in, we calculated the universality (0-100), the evenness (0-100) and the
monophyletic value (0-100) of the super-family members in terms of genome distributions.
We only kept the super-families that can be marker candidates for at least 5 taxonomic
levels (for each taxonomic group of interest, the product of universality, evenness
and monophyletic value should be larger than 729000, i.e. 90 x 90 x 90).
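The selection rule above can be sketched as a simple filter. The threshold of 729000 and the minimum of 5 levels come from the text; the function names and the example scores are hypothetical and only illustrate the arithmetic.

```python
THRESHOLD = 729_000  # from the text; equals 90 * 90 * 90
MIN_LEVELS = 5       # a super-family must qualify at this many taxonomic levels


def qualifying_levels(scores):
    """Return the levels whose score product exceeds the threshold.

    scores maps a taxonomic level to a tuple
    (universality, evenness, monophyletic_value), each on a 0-100 scale.
    """
    return [level for level, (u, e, m) in scores.items() if u * e * m > THRESHOLD]


def is_marker_superfamily(scores):
    """True when the super-family qualifies at MIN_LEVELS or more levels."""
    return len(qualifying_levels(scores)) >= MIN_LEVELS


# Hypothetical scores for one super-family (not taken from the poster).
scores = {
    "Archaea": (100, 100, 100),
    "Actinobacteria": (95, 95, 95),
    "Firmicutes": (90, 90, 90),  # product is exactly 729000, so it does not pass
    "Cyanobacteria": (100, 98, 92),
    "Chloroflexi": (100, 100, 73),
    "Thermotogae": (92, 91, 90),
}
passing = qualifying_levels(scores)
keep = is_marker_superfamily(scores)
```

Because the rule uses a product, a weak score on one measure can be offset by strong scores on the other two, as in the `Chloroflexi` entry.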
Methods and Results
A. Gene Family Building for Different Taxonomic Groups
Only taxonomic groups with more than five completed genomes were selected for marker
identification (see Table 1 for the list). Amino acid sequences were downloaded from the JGI
Integrated Microbial Genomes (IMG) website.
Peptides from 9 gene families with extraordinarily high copy numbers (>1000) in
the bacteria-level family building were filtered out first. All-vs-all BLASTP searches were
performed for the entire proteomes in the group (with the e-value cutoff set at 1e-10),
followed by MCL clustering (with the inflation value set at 2). Neighbor-joining trees were
built for all the MCL families. We developed a tool to parse all the trees and identify clades
with single-copy genes across the genomes in the group. HMM profiles were built for the
selected clades, and an HMM search against the entire proteomes in the group was applied to
evaluate how distinct the gene families were.
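As a rough illustration of the step between BLASTP and MCL, the sketch below filters tabular BLAST hits at the 1e-10 e-value cutoff and writes label-pair edges of the kind MCL reads in its `--abc` mode. It assumes the standard 12-column tabular BLAST output; using the bit score as the edge weight and dropping self-hits are assumptions of this sketch, since the poster does not say how edges were weighted.

```python
def blast_to_abc(blast_lines, max_evalue=1e-10):
    """Yield 'query<TAB>subject<TAB>bitscore' edges for hits within the cutoff."""
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a standard 12-column tabular record
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), fields[11]
        # Assumption: self-hits are dropped and the bit score is the edge weight.
        if query != subject and evalue <= max_evalue:
            yield f"{query}\t{subject}\t{bitscore}"


# Hypothetical tabular BLASTP records (12 tab-separated columns each).
sample = [
    "g1\tg2\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-50\t350",
    "g1\tg3\t30.0\t100\t70\t2\t1\t100\t5\t104\t1e-5\t45",
    "g2\tg2\t100.0\t200\t0\t0\t1\t200\t1\t200\t1e-120\t420",
]
edges = list(blast_to_abc(sample))
# The edges could then be saved to a file and clustered with, for example:
#   mcl edges.abc --abc -I 2 -o families.txt
```

Only the first record survives: the second fails the e-value cutoff and the third is a self-hit.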
We consider a gene family a marker candidate if all the genomes in the group have
only one copy of the gene family members (high universality and high evenness, see Figure
1), and the family members can be distinctly picked up by the HMM profile built for the
family. In total, 5133 families met our requirements (Table 1).
C. Building Phylogenetic Marker Super-families
HMM profiles were built for all the potential marker super-families identified from the
consensus sequence trees. Genes belonging to the super-families that had not yet been
included were retrieved by hmmsearch.
D. Evaluating the Marker Super-families at Different Taxonomic Levels
A family might be a good marker for certain taxonomic groups but unsuitable for other
groups. To make a good marker, a family needs to be universally distributed across the
genomes at a given taxonomic level and present as a single copy in each genome. Furthermore,
the genes for a given taxonomic group need to be monophyletic in a phylogenetic tree. We
developed a monophyletic measurement because we have to tolerate a certain degree of
deviation from strict monophyly. The monophyletic measurement is demonstrated in
Figure 3. The monophyletic value calculation for a gene family at different taxonomic
levels is exemplified by the ribosomal protein S4 family (see Figure 4).
Figure 2. A neighbor-joining tree of all the consensus sequences in the carbamoyl-phosphate
synthase cluster. Our tree parsing and sampling program automatically identified 10 marker
super-families that span multiple taxonomic groups (colored).
$$\text{Universality} = 100 \times \frac{\text{Number of Genomes Covered by the Family}}{\text{Total Number of Genomes}}$$
$$\text{Evenness} = 100 \times e^{-\frac{4}{N_g} \sum_i |N_i - N_m|}$$
$N_i$: the number of gene family members from genome $i$;
$N_m$: the median of $N_i$ over all the genomes with the family;
$N_g$: the number of genomes with the family.
Figure 1. Gene family universality and evenness calculations.
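The two measures in Figure 1 can be sketched in a few lines. Universality follows the ratio given in the figure. The evenness exponent is only partly legible in this transcript, so the form used here, 100 x exp(-(4 / Ng) x sum of |Ni - Nm|), with Nm taken as the median copy number, is an assumption of this sketch rather than a confirmed formula. Genome names and copy numbers are hypothetical.

```python
import math
from statistics import median


def universality(copies_per_genome, total_genomes):
    """100 x (genomes containing the family) / (all genomes in the group)."""
    covered = sum(1 for n in copies_per_genome.values() if n > 0)
    return 100.0 * covered / total_genomes


def evenness(copies_per_genome):
    """Assumed form: 100 * exp(-(4 / Ng) * sum(|Ni - Nm|)), Nm = median of Ni."""
    counts = [n for n in copies_per_genome.values() if n > 0]
    if not counts:
        return 0.0  # family absent from every genome
    n_m = median(counts)  # Nm: median copy number among genomes with the family
    n_g = len(counts)     # Ng: number of genomes with the family
    return 100.0 * math.exp(-4.0 * sum(abs(n - n_m) for n in counts) / n_g)


# Hypothetical family: present in 4 of 5 genomes, duplicated in one of them.
copies = {"genome_A": 1, "genome_B": 1, "genome_C": 1, "genome_D": 2}
u = universality(copies, total_genomes=5)
e = evenness(copies)
```

Under this assumed form, a family with exactly one copy in every genome that carries it scores an evenness of 100, and each extra or missing copy lowers the score exponentially.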
Phylogenetic group       Genome Number   Gene Number   Marker Candidates
Archaea                             62        145415                 106
Actinobacteria                      63        267783                 136
Alphaproteobacteria                 94        347287                 121
Betaproteobacteria                  56        266362                 311
Gammaproteobacteria                126        483632                 118
Deltaproteobacteria                 25        102115                 206
Epsilonproteobacteria               18         33416                 455
Bacteroidetes                       25         71531                 286
Chlamydiae                          13         13823                 560
Chloroflexi                         10         33577                 323
Cyanobacteria                       36        124080                 590
Firmicutes                         106        312309                  87
Spirochaetes                        18         38832                 176
Thermi                               5         14160                 974
Thermotogae                          9         17037                 684
Table 1. Phylogenetic Marker Candidates for Different Taxonomic Groups.
Clade number factor $= 1 - \left[(c-1)/n\right]^2$
Monophyletic factor: compares the clade distribution to the ideal monophyletic scenario,
using the Euclidean distance between the two.

Clade Size (sorted by size)   Monophyletic Assumption   Square of Distance
$s_1$                         $n$                       $(n - s_1)^2$
$s_2$                         $0$                       $s_2^2$
…                             $0$                       …
$s_c$                         $0$                       $s_c^2$

Euclidean distance $= \sqrt{(n - s_1)^2 + s_2^2 + \dots + s_c^2}$
Monophyletic Value $= 100 \times$ Clade number factor $\times$ Monophyletic factor
Figure 3. Monophyletic value calculation. In a phylogenetic tree, a list of taxa (total
number is n) can be divided into a number of monophyletic clades (clade number is c).
s is the number of taxa in a given monophyletic clade.
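The two quantities in Figure 3 that can be read from this transcript, the clade number factor and the Euclidean distance to the ideal monophyletic scenario, can be sketched as follows. How the distance is converted into the monophyletic factor is not given in the text, so this sketch stops at the distance. The function names and clade sizes are hypothetical.

```python
import math


def clade_number_factor(clade_sizes):
    """1 - [(c - 1) / n]^2 for c clades holding n taxa in total."""
    n = sum(clade_sizes)
    c = len(clade_sizes)
    return 1.0 - ((c - 1) / n) ** 2


def distance_to_monophyly(clade_sizes):
    """Euclidean distance between the sorted clade sizes and (n, 0, ..., 0)."""
    sizes = sorted(clade_sizes, reverse=True)  # largest clade first
    n = sum(sizes)
    return math.sqrt((n - sizes[0]) ** 2 + sum(s ** 2 for s in sizes[1:]))


# Hypothetical group of 10 taxa split into clades of 8, 1 and 1 taxa.
sizes = [8, 1, 1]
factor = clade_number_factor(sizes)
distance = distance_to_monophyly(sizes)
```

A perfectly monophyletic group, such as a single clade of 10 taxa, gives a clade number factor of 1 and a distance of 0.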
Figure 5. The product of universality, evenness and monophyletic value at
different taxonomic levels for 209 super-families.
Conclusions
We’ve established a protocol for the automatic identification of phylogenetic markers
for any given phylogenetic group. The protocol uses BLAST and the MCL clustering
algorithm to generate gene families for a selection of genomes. Phylogenetic trees are
built for the gene families, and clades from the trees are automatically sampled and
evaluated for universality and evenness in terms of their distributions across the genomes
of interest. HMM profiles are built for universally distributed single-copy genes that
form monophyletic clades on phylogenetic trees. HMM searches against the genomes
of interest are applied to help us decide which families are potential
phylogenetic markers.
We’ve built gene families and captured potential markers at the following taxonomic
levels: Bacteria, Archaea, Actinobacteria, Bacteroidetes, Chlamydiae, Chloroflexi,
Cyanobacteria, Firmicutes, Spirochaetes, Thermi, Thermotogae, and five classes of
Proteobacteria. HMM profiles were built for 5133 families that can be potential
markers at the taxonomic levels we’ve examined. We’ve also carried out clustering
and tree-building analyses for all the taxon-specific marker families. As a result,
we’ve identified 209 super-families that can each be a potential phylogenetic marker
for at least 5 of the taxonomic groups we’ve studied. We are currently using the novel
phylogenetic markers we’ve discovered to analyze metagenomic data, thus evaluating
their impact on community diversity and richness studies.