Graph partitioning in genomic data analysis
Roland Barriot, Petra Langendijk-Genevaux, Yves Quentin, Gwennaele Fichant
« Génomique des systèmes intégrés » group
CLUSTERING AND BLOCK METHODS
Toulouse, July 2015
Laboratoire de Microbiologie et Génétique Moléculaires – UMR5100
Outline
• Biological concepts and definitions
– Genomes, genes, gene functions
– Quest for Orthologs
• Strategy for inferring orthologs
– Graph representation
– Community detection and bi-clustering
• Application to ABC transporters
• Current challenges
Clustering and Block Methods - 2015 - Roland Barriot
Central dogma of molecular biology
• DNA → mRNA → protein
– DNA (chromosome, genes): hereditary information; alphabet = {A, T, C, G}, 4 letters (nucleotides)
– mRNA: alphabet = {A, U, C, G}
– protein: molecular machines; alphabet = 20 letters (amino acids)
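The change of alphabet at each step can be illustrated with a minimal sketch. It assumes the standard genetic code and a coding-strand DNA input; the function names and the example sequence are hypothetical and only serve the illustration.

```python
# Standard genetic code, codons enumerated with bases in the order U, C, A, G.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def transcribe(dna):
    """DNA coding strand -> mRNA: same sequence, with T replaced by U."""
    return dna.upper().replace("T", "U")

def translate(mrna):
    """mRNA -> protein: read codons of 3 letters until a stop codon ('*')."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

# Hypothetical toy gene: start codon, one alanine codon, stop codon.
mrna = transcribe("ATGGCCTAA")   # "AUGGCCUAA"
protein = translate(mrna)        # "MA"
```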
Sequence alignment
• Sequence comparisons and alignments
– Query = protein sequence from Escherichia coli
– Subject = protein sequence from Salmonella typhimurium
[Figure: chromosome (DNA) of a common ancestor; speciation; evolution = mutations; today's organisms; sequence similarity]
→ similarity score reflects sequence similarity (conservation)
• derived from a common ancestor? (orthologous sequences)
• similar sequence = similar function?
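As a rough illustration of a similarity score, the sketch below computes percent identity over two sequences that are already aligned. This is a simplification: alignment tools usually score with substitution matrices and gap penalties. The function name and the toy sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of aligned columns (gaps excluded) that hold the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only the columns where neither sequence has a gap ('-').
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    return 100.0 * sum(x == y for x, y in columns) / len(columns)

# Hypothetical aligned fragments: 5 identical residues out of 6 compared columns.
score = percent_identity("MKT-LLV", "MRTALLV")
```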
Genomes
The genome (DNA) is the hereditary information that defines the potential of an organism. It is
the sequence of nucleotides of the chromosomes of an organism.
Gene function is predicted by sequence similarity.
[Figure: full chromosome of Mycoplasma genitalium, 1.2 Mbp (small)]
Duplication events
• A 1:1 orthology relationship is needed to predict gene function
Strategy for orthology 1:1 inference
• Infer putative links
– sequence similarity
→ graph
• Prune graph to remove false positives
– graph partitioning
→ communities
• Identify subclasses by genomic neighborhood
conservation
– bi-clustering
→ orthologs 1:1
Putative orthology 1:1 inference
[Figure: gene tree with speciation and duplication events (evolution = mutations) relating genes A, B, B2, C, C2 and C22; sequence comparisons are made between today's genes]
• systematic all-against-all sequence comparison
• B:C is kept: sim(B,C) > max(sim(B,B2), sim(C,C2), sim(C,C22))
• B2:C22 is rejected: sim(B2,C22) < sim(C2,C22)
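The criterion above can be sketched as follows: a link between two genes from different genomes is kept only if their similarity exceeds every within-genome (paralog) similarity of either gene. This is a minimal sketch under that reading of the slide; the function name, the genome labels and all similarity values are hypothetical.

```python
from itertools import combinations

def putative_orthologs(genome_of, sim):
    """Keep cross-genome pairs more similar to each other than to any paralog.

    genome_of: {gene: genome}; sim: {(gene, gene): score}, unordered pairs.
    """
    def s(a, b):
        # Similarities are symmetric; missing pairs count as 0.
        return sim.get((a, b), sim.get((b, a), 0.0))

    links = []
    for a, b in combinations(sorted(genome_of), 2):
        if genome_of[a] == genome_of[b]:
            continue  # same genome: paralogs, not orthologs
        paralog_scores = [
            s(g, x)
            for g in (a, b)
            for x in genome_of
            if x != g and genome_of[x] == genome_of[g]
        ]
        if s(a, b) > max(paralog_scores, default=0.0):
            links.append((a, b))
    return links

# Hypothetical scores for the genes named on the slide.
genome_of = {"B": "gB", "B2": "gB", "C": "gC", "C2": "gC", "C22": "gC"}
sim = {
    ("B", "C"): 0.9, ("B", "B2"): 0.5, ("C", "C2"): 0.5, ("C", "C22"): 0.45,
    ("C2", "C22"): 0.8, ("B2", "C22"): 0.6, ("B2", "C2"): 0.6,
}
links = putative_orthologs(genome_of, sim)  # [("B", "C")]
```

With these values B2:C22 is rejected because sim(B2,C22) = 0.6 is lower than sim(C2,C22) = 0.8.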
Gene losses
[Figure: gene tree relating A, B, B2, C, C2 and C22 through speciation and duplication events, with gene losses; a BBH (bidirectional best hit) is marked between remaining genes; sequence comparisons]
• rounds of duplications and losses lead to false positive links
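A toy illustration of how losses create false positive links: the sketch below computes bidirectional best hits (BBH) between two genomes, before and after hypothetical losses. It assumes, for illustration only, that B:C and B2:C2 are the true orthologous pairs; function names and scores are invented.

```python
def best_hit(gene, candidates, sim):
    """Most similar gene among candidates, or None when no score is known."""
    scored = [(sim.get(frozenset((gene, c)), 0.0), c) for c in candidates]
    score, hit = max(scored, default=(0.0, None))
    return hit if score > 0 else None

def bidirectional_best_hits(genome_x, genome_y, sim):
    """Pairs (x, y) where y is the best hit of x and x is the best hit of y."""
    pairs = []
    for x in genome_x:
        y = best_hit(x, genome_y, sim)
        if y is not None and best_hit(y, genome_x, sim) == x:
            pairs.append((x, y))
    return pairs

# Hypothetical similarity scores, stored on unordered gene pairs.
sim = {
    frozenset(("B", "C")): 0.9, frozenset(("B2", "C2")): 0.6,
    frozenset(("B", "C2")): 0.4, frozenset(("B2", "C")): 0.4,
}

# No loss: both assumed-true pairs are recovered.
before = bidirectional_best_hits(["B", "B2"], ["C", "C2"], sim)
# B is lost in one genome and C2 in the other: B2 and C become each
# other's best hit although they are paralogs (a false positive link).
after = bidirectional_best_hits(["B2"], ["C"], sim)
```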
False positive pruning
Graph representation: nodes = genes, edges = putative orthology links
[Figure: graph over genes B, B2, C, C2 and C22, split into communities]
Graph partitioning or community detection
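A deliberately crude stand-in for graph partitioning is sketched below: weak edges are dropped and the connected components of what remains are reported as groups. It is not the community detection method used in this work; the threshold, the function name and the edge weights are hypothetical.

```python
def components_after_pruning(edges, min_weight):
    """Drop edges lighter than min_weight, then return connected components.

    edges: {(gene_u, gene_v): weight} for putative orthology links.
    """
    adjacency = {}
    for (u, v), weight in edges.items():
        adjacency.setdefault(u, set())
        adjacency.setdefault(v, set())
        if weight >= min_weight:
            adjacency[u].add(v)
            adjacency[v].add(u)

    seen, components = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Depth-first traversal from an unvisited gene.
        stack, component = [start], set()
        while stack:
            gene = stack.pop()
            if gene in component:
                continue
            component.add(gene)
            stack.extend(adjacency[gene] - component)
        seen |= component
        components.append(sorted(component))
    return components

# Hypothetical weighted links; ("B", "C22") plays the false positive.
edges = {("B", "C"): 0.9, ("B2", "C2"): 0.7, ("B2", "C22"): 0.65, ("B", "C22"): 0.3}
groups = components_after_pruning(edges, min_weight=0.5)
# [["B", "C"], ["B2", "C2", "C22"]]
```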
Orthologous systems: co-evolution
[Figure: genes on the chromosome (DNA) encode the proteins of an ABC transporter. The common ancestor (extinct) has 1 single ABC transporter (pentose transport). Speciation (orthology) and duplication (paralogy) events, together with evolution, lead to the systems of today's organisms:]
• Escherichia coli: A (pentose transport)
• Streptococcus pneumoniae: B (pentose transport), B2 (xylose transport)
• Staphylococcus aureus: C (arabinose transport), C2 (pentose transport), C22 (xylose transport)
Gene order conservation illustration
Orthologous systems
Crossing:
• isortholog communities
• genomic neighborhood
[Figure: systems A, B1, C1, D1, B2, C2, D2’ and D2’’]
Clustering of communities shared by the same genomes
[Figure: matrix with rows labelled B1, C1, B2, D1, C2 (genomic neighbors) and columns labelled g1 to g7 (communities)]
Clustering of communities shared by the same genomes
[Figure: same matrix with rows B1, C1, B2, D1, C2 (genomic neighbors) and columns reordered as g7, g5, g4, g3, g1, g2, g6 (communities); Bicluster 1 and Bicluster 2 outline two core subfamilies]
Application to ABC systems
ABC transporters
• large family of paralogous genes
• up to >100 in a genome
• up to 15% of the genes in a prokaryotic genome
SBP: Solute Binding Protein
MSD: Membrane Spanning Domain
NBD: Nucleotide Binding Domain
[Figure: domain organization of an exporter and an importer (MSD, NBD, SBP); ATP → ADP + Pi]
• in complete genomes of prokaryotes: ABCdb
– ~2,000 genomes
– ~350k genes encoding ABC systems
Pipeline for ABC systems reconstruction and annotation
Pipeline steps: complete genome → ABC proteins identification & classification → assembly into functional systems → sub-family classification → functional prediction and evolutionary studies
Criteria used:
• Sequence similarity search
• Profiles
• Chromosome localization
• Sub-family compatibility
• Evolutionary rules
Example from Acidovorax citrulli AAC00-1: same subfamily (carbohydrate transport), same location
[Figure: neighboring MSD, NBD and SBP genes assembled into ABC systems]
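The "same subfamily, same location" idea can be sketched as a grouping of neighboring genes. This is only a minimal illustration, not the actual assembly rules of the pipeline: the gene tuples, the `max_gap` parameter and the function name are hypothetical.

```python
def assemble_systems(genes, max_gap=1):
    """Group genes that are neighbors on the chromosome and share a subfamily.

    genes: list of (position, domain, subfamily), position being the rank
    of the gene along the chromosome. Returns a list of systems (gene lists).
    """
    systems = []
    for gene in sorted(genes):
        position, _, subfamily = gene
        last = systems[-1][-1] if systems else None
        if last and last[2] == subfamily and position - last[0] <= max_gap:
            systems[-1].append(gene)   # extends the current system
        else:
            systems.append([gene])     # starts a new system
    return systems

# Hypothetical genes: four neighbors of subfamily 1 and a distant NBD.
genes = [(11, "MSD", 1), (10, "SBP", 1), (12, "MSD", 1), (13, "NBD", 1), (40, "NBD", 6)]
systems = assemble_systems(genes)
```

On this toy input the first four genes form one system and the distant NBD stays alone.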
ABC systems in Lactococcus lactis
A_1: galactoside import
A_2: oligopeptide import
A_3: macrolide resistance
A_4: amino acid import
A_5: disaccharide import
A_6: multidrug resistance
A_7: antibiotic export
A_8: siderophore import
A_9: peptide / antibiotic export
A_11: phosphate import
A_12 and A_14: ?
A_18: phosphonate import
ABC systems in Bacillus subtilis
206 proteins forming 80 systems
ABC Subfamily 1: carbohydrate importers
• ~100 curated genomes
– 221 MSDs, 158 SBPs, 137 NBDs
[Figures: MSD orthologs 1:1 graph, SBP orthologs 1:1 graph, NBD orthologs 1:1 graph]
Communities detected
• walktrap method
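Walktrap groups vertices that short random walks tend to reach in similar ways. The sketch below computes only the random-walk distance on which the method is based, not the full agglomerative algorithm; it assumes an unweighted graph without isolated vertices, and the toy graph (two triangles joined by one edge) is hypothetical.

```python
from math import sqrt

def walk_distribution(adjacency, start, steps):
    """Probability of being at each vertex after `steps` random-walk steps."""
    probabilities = {v: 0.0 for v in adjacency}
    probabilities[start] = 1.0
    for _ in range(steps):
        following = {v: 0.0 for v in adjacency}
        for vertex, mass in probabilities.items():
            for neighbor in adjacency[vertex]:
                # The walker picks one neighbor uniformly at random.
                following[neighbor] += mass / len(adjacency[vertex])
        probabilities = following
    return probabilities

def walk_distance(adjacency, i, j, steps=4):
    """Degree-normalised distance between the walk distributions of i and j."""
    p_i = walk_distribution(adjacency, i, steps)
    p_j = walk_distribution(adjacency, j, steps)
    return sqrt(sum((p_i[k] - p_j[k]) ** 2 / len(adjacency[k]) for k in adjacency))

# Two triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
within = walk_distance(adjacency, 0, 1, steps=3)    # same triangle
between = walk_distance(adjacency, 0, 5, steps=3)   # different triangles
```

Vertices of the same triangle end up closer than vertices of different triangles, which is what lets an agglomerative procedure merge them first.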
Bi-clustering results
• Iterative signature algorithm [Bergmann et al., 2003]
Subfamily of ABC carbohydrate importers
Autoinducer-2 in E. coli and S. typhimurium (Xavier and Bassler, 2005)
Galactose importer in E. coli (Harayama et al., 1983)
B. cereus, B. anthracis, M. loti, P. multocida, S. flexneri
B. halodurans, F. nucleatum, H. influenzae, M. loti, P. multocida, S. typhimurium, S. flexneri and V. vulnificus
110 reconstructed systems out of 125
Detect and isolate paralogous systems within the subfamily
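The alternating refinement behind signature-style bi-clustering can be sketched on a binary matrix. This is a simplified illustration in the spirit of the iterative signature algorithm, not the normalisation and thresholding of Bergmann et al.; the thresholds, the function name and the toy matrix are hypothetical.

```python
def refine_bicluster(matrix, seed_rows, row_thr=0.7, col_thr=0.7, max_iter=20):
    """Alternate column and row selection from a seed until the sets are stable.

    matrix: list of rows of 0/1 values. Returns (row indices, column indices).
    """
    rows, cols = set(seed_rows), set()
    n_rows, n_cols = len(matrix), len(matrix[0])
    for _ in range(max_iter):
        if not rows:
            break
        # Columns present in at least col_thr of the selected rows.
        new_cols = {j for j in range(n_cols)
                    if sum(matrix[i][j] for i in rows) / len(rows) >= col_thr}
        if not new_cols:
            return set(), set()
        # Rows present in at least row_thr of the selected columns.
        new_rows = {i for i in range(n_rows)
                    if sum(matrix[i][j] for j in new_cols) / len(new_cols) >= row_thr}
        if new_rows == rows and new_cols == cols:
            break
        rows, cols = new_rows, new_cols
    return rows, cols

# Hypothetical presence/absence matrix with two blocks.
matrix = [
    [1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 1, 1, 1, 1],
]
first = refine_bicluster(matrix, seed_rows=[0])    # ({0, 1}, {0, 1, 2})
second = refine_bicluster(matrix, seed_rows=[3])   # ({2, 3, 4}, {3, 4, 5})
```

Starting from different seeds yields different biclusters, which is how several blocks of one matrix can be recovered.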
Current challenges
• Quest for orthologs:
– Big data: pace of new genome releases
– new vertices and edges
• Include more biological criteria
– a community should not contain the same genome more than once
• Co-clustering
– cluster each domain in parallel
– problem: graphs do not have the same vertices
Quest for Orthologs
Current dataset
• ~2,000 genomes
– ~350k genes encoding ABC systems
• Subfamily 1
– MSDs: 2,836 vertices, 700k edges
– SBPs: 2,407 vertices, 350k edges
– NBDs: 1,670 vertices, 400k edges
• Graphs change with each new genome release!
• >1 new genome / day
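To give a feel for these graph sizes, the sketch below derives the density and mean degree of each Subfamily 1 graph from the figures above. The edge counts are the rounded values of the slide, so the results are approximate; the function names are hypothetical.

```python
def density(n_vertices, n_edges):
    """Fraction of all possible undirected edges that are present."""
    return 2 * n_edges / (n_vertices * (n_vertices - 1))

def mean_degree(n_vertices, n_edges):
    """Average number of links per gene."""
    return 2 * n_edges / n_vertices

# (vertices, edges) per domain, edge counts rounded as on the slide.
subfamily_1 = {"MSD": (2836, 700_000), "SBP": (2407, 350_000), "NBD": (1670, 400_000)}
for domain, (n, m) in subfamily_1.items():
    print(domain, round(density(n, m), 2), round(mean_degree(n, m)))
```

Densities of roughly 0.1 to 0.3 mean each gene is linked to hundreds of others, far more than a 1:1 orthology relationship would allow, hence the need for pruning.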
Include more biological criteria
A community should not contain the same genome more than once
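This criterion is easy to check on a candidate community. The sketch below only reports violations; how to repair them (for instance by splitting the community) is left open. The function name, the genome labels and the example communities are hypothetical.

```python
from collections import Counter

def genome_conflicts(community, genome_of):
    """Genomes represented by more than one gene in a community.

    community: iterable of genes; genome_of: {gene: genome}.
    Returns {genome: number of genes} for the genomes that violate the rule.
    """
    counts = Counter(genome_of[gene] for gene in community)
    return {genome: n for genome, n in counts.items() if n > 1}

genome_of = {"B": "gB", "B2": "gB", "C": "gC", "C2": "gC", "C22": "gC"}
valid = genome_conflicts(["B", "C"], genome_of)             # {}
invalid = genome_conflicts(["B2", "C2", "C22"], genome_of)  # {"gC": 2}
```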
Co-clustering
Partition the graphs in parallel, using the topologies of the partner graphs
Conclusions
• Various graph topologies
• Need efficient methods
– graph partitioning
– bi-clustering
• Need accurate methods
– essential for gene function prediction
Acknowledgements
Gwennaele Fichant
Yves Quentin
Petra Langendijk-Genevaux
Mathias Weyder
Available genomes and knowledge
• Model organism Escherichia coli: ~30% of genes with unknown function
[Figure: distribution of genome size (in genes) in ~2500 bacteria; axes: genome size, count]