Research Projects
Tao Jiang’s Lab
Algorithms and Computational Biology Laboratory
Department of Computer Science and Engineering
University of California, Riverside
March, 2013
Project Overview
• Predicting Operons by a Comparative Genomics Approach (DOE GtL)
• Evolutionary Dynamics of Myb Gene DNA-binding Domains (NSF ITR)
• Prediction of HNF4 Binding Sites and Target Genes in Human and Mouse Genomes (NIH/NSF)
• Efficient Selection of Unique and Popular Oligos for Large EST Databases (USDA/NSF)
• Oligonucleotide Fingerprinting of Ribosomal RNA Genes and Microorganism Classification (NSF BDI/NIH)
• Efficient Haplotyping Algorithms for Pedigree Data and Gene Association Mapping (NSF CCF and NIH)
• High Throughput Ortholog Assignment via Genome Rearrangement (NSF IIS)
• Genome-Wide Inference of mRNA Isoforms and Estimation of Their Expression Levels from RNA-Seq Short Reads
• Metagenomic Data Analysis
Predicting operons by a
comparative genomics approach
Xin Chen
Collaboration: Ying Xu (ORNL)
Fund: DOE GtL
This project aims to predict candidate operons in the genome of
Synechococcus sp. WH8102 using a comparative genomics approach.
These candidate operons may provide helpful information for the
construction of protein-protein interaction networks and
functional pathways.
Operon structures
Operons represent a basic organizational unit of genes in the
complex hierarchical structure of biological processes in a cell.
They are mainly used to facilitate efficient implementation of
transcriptional regulation, especially in bacteria.
Biological characteristics of genes in an operon include:
• sharing certain regulatory elements
• arranged in tandem on the same strand
• separated by short distances
• well conserved across phylogenetically related species
• their functions are usually related
Existing methods for operon prediction
• Overbeek et al. (1999): gene pairs of close bidirectional best hits
• Salgado et al. (2000): close gene distances and gene functional classes
• Ermolaeva et al. (2001): the likelihood of conserved genes being an operon
• Craven et al. (2002): a probabilistic learning approach on whole genome
• Sabatti et al. (2002): a Bayesian classification scheme on gene microarray
• Zheng et al. (2002): based on information from metabolic pathways
Our approach based on comparative genomics
Pipeline:
1. Genome sequences with annotated genes
2. Pairwise comparison: run the blastp program with E-value = 1e-20
3. Gene matches (homolog information)
4. Cluster conserved, nearby genes
5. Candidate operons
6. Scoring
7. List of ranked operons output
Constraints:
1. neighbor genes separated by 100 bases or less
2. genes in an operon located in the same strand
3. gene sets conserved across two or more genomes
4. full matching required for a candidate operon
5. promoter and terminator to be considered
6. pathway information to be considered
A score is given by:
1. product of E-values of gene matches involved in an operon
2. intergenic distances in an operon to be considered
3. predictive reliability of promoter or terminator to be considered
Comparative analysis is based on the idea that functional segments
tend to evolve at a lower rate than nonfunctional segments, making well
conserved regions likely to be of great interest (Overbeek et al., 1999).
Implementation details
• Data preparation: three genome data sets downloaded from the
ORNL website (http://compbio.ornl.gov/channel/index.html).
• Pairwise comparison: blastp with E-value < 1e-20, a
bipartite gene matching graph. Same COG ID.
[Figure: gene matching graph linking genes a1-a8 of Genome a, b1-b7 of Genome b, and c1-c7 of Genome c]
• Gene clustering:
– neighbor genes separated by 100 bases or less
– genes in an operon located in the same strand
– gene sets conserved across two or more
genomes
– full matching required for a candidate operon
• Scoring: product of E-values of all gene matches involved,
operons with lower scores output earlier
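The clustering and scoring steps can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's implementation: the `Gene` record, the toy coordinates, and the `matches` table of blastp E-values are hypothetical, conservation across a second genome is reduced to "every gene has a match", and only the distance, strand, and full-matching constraints are modeled.

```python
from dataclasses import dataclass
from math import prod


@dataclass
class Gene:
    name: str
    start: int
    end: int
    strand: str


def cluster_genes(genes, max_gap=100):
    """Group consecutive same-strand genes separated by at most max_gap bases."""
    clusters, current = [], []
    for g in sorted(genes, key=lambda g: g.start):
        # Intergenic distance is negative when neighboring genes overlap.
        if current and g.strand == current[-1].strand \
                and g.start - current[-1].end - 1 <= max_gap:
            current.append(g)
        else:
            if len(current) > 1:
                clusters.append(current)
            current = [g]
    if len(current) > 1:
        clusters.append(current)
    return clusters


def score(cluster, matches):
    """Product of E-values; None if any gene lacks a match (full matching)."""
    if any(g.name not in matches for g in cluster):
        return None
    return prod(matches[g.name] for g in cluster)


# Hypothetical toy data: five genes and E-values of their best matches.
genes = [
    Gene("a1", 1, 300, "+"), Gene("a2", 350, 700, "+"),
    Gene("a3", 697, 1000, "+"), Gene("a4", 1500, 1800, "+"),
    Gene("a5", 1850, 2100, "-"),
]
matches = {"a1": 1e-30, "a2": 1e-25, "a3": 1e-40}

candidates = [(score(c, matches), [g.name for g in c]) for c in cluster_genes(genes)]
# Lower scores are output earlier; clusters without full matching are dropped.
ranked = sorted((s, names) for s, names in candidates if s is not None)
print(ranked)
```

For long operons the product of E-values can underflow to zero in floating point; summing log E-values gives the same ranking and avoids that.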
The gene matching graph for
three cyanobacterial genomes
Three genomes with their gene numbers:
• Synechococcus sp. WH8102 (2520)
• Prochlorococcus marinus sp. MED4 (1700)
• Prochlorococcus marinus sp. MIT9313 (2267)
The numbers of gene matching pairs:
• 1593 between syn_wh and pmar_med
• 2242 between syn_wh and pmar_mit
• 1579 between pmar_med and pmar_mit
Predicted operons in
Synechococcus sp. WH8102
A total of 242 operons were output for Synechococcus sp. WH8102:
• 126 operons shared with both of the other two genomes
• 26 operons shared with pmar_med only
• 90 operons shared with pmar_mit only
( See operons at http://www.cs.ucr.edu/~xinchen/operons.htm )
Several observations on the putative operons
• The average size of putative operons is 2.88, very close to 3.
• The two most frequent intergenic distances are –4 and –1, i.e., overlaps of 4 bases and 1 base.
• All operons in Synechococcus sp. WH8102 are on the positive strand.
• Matching genes have the same COG IDs across the three genomes.
Z. Su, P. Dam, X. Chen, V. Olman, T. Jiang, B. Palenik, and Y. Xu. GIW’2003.
X. Chen, Z. Su, Y. Xu, and T. Jiang. GIW’2004 (the best paper award).
X. Chen, Z. Su, P. Dam, B. Palenik, Y. Xu, and T. Jiang. Nucleic Acids Research, 2004.
Ongoing work
• Look for a way of predicting promoters and
terminators upstream and downstream of candidate
operons.
• Find a method to validate/score putative operons by
promoter/terminator results.
• Incorporate additional information like intergenic
distances and predicted promoters into the scoring
system.
• Pathway information to be considered.
Evolutionary Dynamics of Myb
Gene DNA-binding Domains
Li Jia
Collaboration: Michael Clegg (Botany)
Fund: NSF ITR
Motivation
Natural selection on changes of “regulatory genes”
that regulate the timing or rate of development,
must be required for evolution.
(Britten and Davidson, 1969 and 1971)
Natural selection on transcription factors should provide one of
predominant mechanisms for the generation of novel phenotypes.
The Crucial Role of TFs
Organism           Total number    Genes coding for transcriptional regulators
                   of genes        Total number    Percentage of total gene number
A. thaliana        ~25,000         ~1,500          ~5%
O. sativa          ~50,000         ~200            ~4%
C. elegans         ~18,000         ~700            ~5%
D. melanogaster    ~15,000         ~800            ~6%
H. sapiens         ~35,000         ~3,000          ~9%
M. musculus        ~30,000         ~1,800          ~6%
[Figure: signaling molecules → TFs → target genes (WHEN? WHERE? HOW?)]
R2R3-MYB
Structure:
• DNA-binding domain: R2 and R3 repeats, each with Helix1, Helix2, and Helix3
• Flexible domain
• Activation domain
Functions:
1) Secondary metabolism
2) Cell shape
3) Disease resistance
4) Stress response
[Figure: MYB acting on target genes involved in differentiation, proliferation, and metabolism]
OBJECTIVE
To unveil the molecular dynamics that
underlie the evolution of TFs (Myb)
[Figure: Helix1, Helix2, and Helix3 of the R2 and R3 repeats]
Infer Positive Selection Sites
(based on dN/dS analysis, i.e., synonymous vs nonsynonymous mutation rates,
in the duplication history of the R2R3-Myb gene family)
[Figure: A. thaliana; positive selection counts (y-axis, 0 to 20) by amino acid position (x-axis, 1 to 96), with Helix1, Helix2, and Helix3 of R2 and R3 marked]
Jia, Clegg and Jiang (2003) Plant Mol. Biol.
Positive Selection Sites
A. thaliana (dicot)
Category              Sites    Counts    Percentage    Counts/site
Full R2R3 region      1        531       100%          5.4
R2 domain  Helix1     14       173       33%           12.4*
           Helix2     7        83        16%           11.6*
           Helix3     10       8         1.5%          0.8
R3 domain  Helix1     14       119       22%           8.5**
           Helix2     7        33        6%            4.7*
           Helix3     10       1         0.2%          0.1
O. sativa (monocot)
Category              Sites    Count                Percentage           Count/site
                               indica   japonica    indica   japonica    indica   japonica
Full R2R3 region      103      52       380         100%     100%        0.5      3.7
R2 domain  Helix 1    15       12       61          23%      16%         0.8**    4.1**
           Helix 2    7        14       73          27%      19%         2.0**    10.4**
           Helix 3    10       0        0           0%       0%          0.0      0.0
R3 domain  Helix 1    14       16       197         31%      52%         1.1*     14.1**
           Helix 2    6        2        9           4%       2%          0.3      1.5**
           Helix 3    10       0        0           0%       0%          0.0      0.0
Jia, Clegg and Jiang (2003) Plant Mol. Biol.
Jia, Clegg and Jiang (2004) Plant Physiol..
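As a quick arithmetic check, the count/site values reported for O. sativa can be recomputed from the counts and the numbers of sites. The snippet below is only a sketch of that check: the variable names are ours, the values are copied from the slide, and the significance stars are not modeled.

```python
# Order: full R2R3 region, R2 helices 1-3, R3 helices 1-3.
sites = [103, 15, 7, 10, 14, 6, 10]
indica = [52, 12, 14, 0, 16, 2, 0]
japonica = [380, 61, 73, 0, 197, 9, 0]


def per_site(counts, sites):
    """Positive selection counts divided by number of sites, to one decimal."""
    return [round(c / s, 1) for c, s in zip(counts, sites)]


indica_rate = per_site(indica, sites)
japonica_rate = per_site(japonica, sites)
print(indica_rate)
print(japonica_rate)
```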
Co-evolved α-Helices
                    japonica    indica    Arabidopsis
r (R2, R3)          0.69**      0.68**    0.69**
r (R2-1, R3-1)      N/A         0.15      0.11
r (R2-2, R3-2)      0.40**      0.62**    0.65**
r (R2-3, R3-3)      0.38        0.29      0.2
Jia, Clegg and Jiang (2004) Plant Physiol..
SUMMARY
1)
Positive selection sites
positive selection pressure works through the
first and second helices of the R2R3 repeats
rather than the third helices due to their
structural characteristics
2)
Co-evolution patterns
the functional importance of the pairing-correlations
between the related secondary structures in preserving
the conformation of the specific protein folding-pocket (the second helices)
APPLICATIONS:
determine protein-DNA interaction regions of transcription factors based on their
primary codon sequences
genetically modify MYB structure to improve economically important traits
Prediction of HNF4 Binding Sites and Target
Genes in Human and Mouse Genomes
Chuhu Yang
Collaboration: F.M. Sladek (Cell Biology, Neuroscience)
Fund: NIH/NSF
HNF4—an important TF
• An important TF that regulates the expression of many
genes, especially some liver-specific genes; it also plays
an important role in the process of development.
• It has been demonstrated to regulate the expression of
over 60 genes.
• Researchers expect to find more HNF4 target genes.
• HNF4 is related to many human diseases, such as
diabetes, hemophilia, and hepatitis.
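[Figure: HNF4 at the center of a network of target genes (coagulation factors, anti-thrombin, apolipoproteins, EPO, MCAD, OTC, PEPCK, L-PK, HNF1, CYP genes, ACO, HBV, BPG) and associated conditions and processes (atherosclerosis, diabetes, hemophilia, thrombosis, hypoxia, MCAD deficiency, OTC deficiency, drug metabolism, cancer)]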
HNF4 is highly conserved in many different organisms
[Figure: domain alignment of HNF4 from human, rat/mouse, Xenopus, and Drosophila, showing the DNA-binding (Zn++), ligand-binding (Ligand?), and transactivation regions; % = amino acid identity]
Our previous work
• Collected 71 HNF4 binding sequences from literature.
• Developed software based on an optimized (or permuted) Markov
model and trained it with the 71 known sequences.
• Searched –500 to +100 regions (relative to transcription start sites)
of all the human genes in UCSC database.
• Predicted 840 potential HNF4 binding sites in the human genome.
• Verified in vitro 77 new HNF4 binding sequences, resulting in a total
of 137 HNF4 binding sequences.
• This work has been summarized in a paper, which was published in
Bioinformatics (Vol. 18 Suppl. 2 2002).
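To make the train-then-scan idea concrete, here is a minimal sketch that uses a plain first-order Markov chain with a uniform background. It is not the optimized (or permuted) Markov model mentioned on the slide, and the training sequences, the region, and all function names are hypothetical toy choices rather than real HNF4 data.

```python
import math

ALPHABET = "ACGT"


def train_markov(sites, pseudocount=1.0):
    """Estimate initial and first-order transition log-probabilities."""
    init = {b: pseudocount for b in ALPHABET}
    trans = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for s in sites:
        init[s[0]] += 1
        for a, b in zip(s, s[1:]):
            trans[a][b] += 1
    total = sum(init.values())
    log_init = {b: math.log(init[b] / total) for b in ALPHABET}
    log_trans = {
        a: {b: math.log(trans[a][b] / sum(trans[a].values())) for b in ALPHABET}
        for a in ALPHABET
    }
    return log_init, log_trans


def log_odds(window, model):
    """Log-likelihood under the model minus a uniform background."""
    log_init, log_trans = model
    ll = log_init[window[0]]
    ll += sum(log_trans[a][b] for a, b in zip(window, window[1:]))
    return ll - len(window) * math.log(0.25)


def scan(region, model, width, threshold):
    """Return (offset, window, score) for windows scoring >= threshold."""
    hits = []
    for i in range(len(region) - width + 1):
        window = region[i:i + width]
        s = log_odds(window, model)
        if s >= threshold:
            hits.append((i, window, s))
    return hits


# Hypothetical toy training set and promoter-like region.
toy_sites = ["AGGTCAAAGGTCA", "AGGGCAAAGTTCA", "GGGTCAAAGGTCA", "AGGTCAAAGTCCA"]
region = "TTTTCCCC" + "AGGTCAAAGGTCA" + "CGCGCGTT"

model = train_markov(toy_sites)
all_windows = scan(region, model, width=13, threshold=float("-inf"))
best = max(all_windows, key=lambda h: h[2])
print(best[0], best[1])
```

In practice the threshold would be calibrated on held-out known sites and background sequence; here every window is scored and only the best one is reported.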
Current work
• Search the promoter regions of all the human genes with 137 HNF4
binding sequences for potential HNF4 target genes in human.
• Search the promoter regions of all the mouse genes with 137 HNF4
binding sequences for potential HNF4 target genes in mouse.
• Compare HNF4 target genes in both human and mouse genomes.
• Do in vivo experiments to verify potential HNF4 target genes.
Future work
• Optimize current software so that it can predict
HNF4 binding sites more accurately.
• Study the functions of all HNF4 target genes,
cluster them into different functional groups and
study the relationship between different groups.
• Set up regulatory networks of all HNF4 target
genes in human and mouse genomes.
• Sequence weighting: A new approach to
constructing PSSM (or PWM) for motif finding from
ChIP and gene expression data.
Efficient Selection of Unique and
Popular Oligos for Large EST
Databases
Jie Zheng
Collaboration: Stefano Lonardi and Timothy J. Close (Botany)
Funding: USDA / NSF
Problems of Oligo Selection
(for the Barley EST data in HarvEST)
• Unique Oligo Problem
– Selection of oligos each of which appears (exactly) in one
EST sequence but does not appear (exactly or
approximately) in any other EST
• Popular Oligo Problem
– Selection of oligos that appear (exactly or approximately) in
many ESTs
Applications
• Unique oligos
– PCR primer designs
– Microarray probe designs
• Popular oligos
– Useful in screening genomic libraries
(such as BAC libraries) for gene-rich regions
Methods
• Basic idea
– Separate dissimilar strings as early as possible to
reduce the search space
• Algorithm for unique oligos
– Group similar oligos by hashing 11-mer seeds, and
disqualify oligos similar to oligos in other ESTs
• Algorithm for popular oligos
– Cluster similar oligos by hashing 20-mer cores and
comparing regions outside cores
– Identify centers in clusters
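The unique-oligo filter can be illustrated with a small sketch. It is an approximation of the idea under stated assumptions, not the lab's algorithm: "approximately" is taken to mean Hamming mismatches only (no indels), the slide's 11-mer seeds are replaced by a pigeonhole split of each oligo into `max_mismatch + 1` segments, and the toy ESTs and all names are hypothetical.

```python
from collections import defaultdict


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def segments(oligo, parts):
    """Split an oligo into `parts` disjoint segments, tagged by segment index."""
    n = len(oligo)
    bounds = [n * i // parts for i in range(parts + 1)]
    return [(i, oligo[bounds[i]:bounds[i + 1]]) for i in range(parts)]


def unique_oligos(ests, length, max_mismatch):
    # Map each distinct oligo to the set of ESTs that contain it exactly.
    owners = defaultdict(set)
    for idx, est in enumerate(ests):
        for i in range(len(est) - length + 1):
            owners[est[i:i + length]].add(idx)
    # Hash oligos by segment: two oligos with at most d mismatches must
    # agree exactly on at least one of d + 1 aligned segments.
    buckets = defaultdict(set)
    for oligo in owners:
        for key in segments(oligo, max_mismatch + 1):
            buckets[key].add(oligo)
    result = defaultdict(list)
    for oligo, ests_with in owners.items():
        if len(ests_with) != 1:
            continue  # appears exactly in more than one EST
        (home,) = ests_with
        candidates = set()
        for key in segments(oligo, max_mismatch + 1):
            candidates |= buckets[key]
        # Disqualify if a similar oligo occurs in any other EST.
        clash = any(
            owners[o] != {home} and hamming(o, oligo) <= max_mismatch
            for o in candidates
        )
        if not clash:
            result[home].append(oligo)
    return dict(result)


# Hypothetical toy ESTs: the first two differ by a single base.
toy_ests = ["ACGTACGTAA", "ACGTACCTAA", "TTTTGGGGCC"]
unique = unique_oligos(toy_ests, length=8, max_mismatch=1)
print(unique)
```

Only oligos that share a bucket are compared, which is the "separate dissimilar strings as early as possible" idea: most pairs are never examined.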
Performance
• Input Data:
– 46,145 barley EST sequences, totaling about 28 million
base pairs, from the HarvEST database
• Time and Space:
– A couple of hours on a 1.2GHz CPU, 1GB RAM
machine
• Accuracy in simulation
– Relative error is below 2%
Oligonucleotide Fingerprinting of
Ribosomal RNA Genes (OFRG)
and Microorganism Classification
Andres Figueroa and Zheng Liu
Collaboration: J. Borneman (Plant Pathology) and M. Chrobak (CSE)
Fund: NSF BDI/NIH R01
Basic Idea
• rRNA genes (rDNA) can be used as an ID of species,
especially microorganisms.
• Use microarray technology to identify the rDNAs of
the microbes in a community. Oligonucleotide probes
are designed to hybridize with the (unknown) rDNA
clones in a sample.
• Analyze the hybridization result to obtain fingerprints.
Project Flowchart
1. Sample: soil, mouse gut, plant tissue, etc.
2. Extract rDNA → sample rDNA
3. PCR → mixture of rDNA
4. Clone (ligate and transform) → clone library
5. PCR → individual rDNA
6. Print → array
7. Hybridization with probes → signal intensities
8. Normalization → normalized signal intensities
9. Fingerprint assignment → fingerprints
10. Cluster → taxonomic tree
Project Structure
• Data sources: experimental data, genomic DBs
• Databases: rDNA sequence DB, OFRG management DB
• Tools: probe set design, binarize fingerprints, clustering, label unknown clone
• Web-based integrated platform
Future Work
• Complete rDNA sequence database (done)
• Create the OFRG management database (done)
• Intensity normalization/binarization using control
information (partially done)
• Extend to [0,t], for t = 2,3,4,…
• Combine tools into an integrated platform
• A higher throughput system based on
microbeads and polony sequencing
technologies (NIH)
Polony (polymerase colony)
Polony hybridizing with different
probes
Efficient Haplotyping
Algorithms for Pedigree Data
Lan Liu, Bob Wang and Jing Xiao
Collaboration: Jing Li (CWRU) and Tim Chen (USC)
Fund: NSF CCR/NIH R01
An Example Pedigree: The British Royal Family
Elizabeth II of the United Kingdom and Prince Philip, Duke of Edinburgh
• Prince Charles, Prince of Wales (spouses: Diana, Princess of Wales; Camilla, Duchess of Cornwall)
– Prince William of Wales
– Prince Henry of Wales
• Princess Anne, Princess Royal (spouses: Captain Mark Phillips; Commander Timothy Laurence)
– Peter Phillips
– Zara Phillips
• Prince Andrew, Duke of York (spouse: Sarah Margaret Ferguson)
– Princess Beatrice of York
– Princess Eugenie of York
• Prince Edward, Earl of Wessex (spouse: Sophie Rhys-Jones)
– Lady Louise Windsor
MRHC Problem
Find a minimum recombinant haplotype configuration from a given
pedigree with genotype data.
Assumptions:
• Mendelian law (no mutations)
• Recombination events are rare
[Figure: an example pedigree shown twice. Input (unphased data): each member carries genotypes such as (1 2) (1 2), (1 1), (2 2). Output (phased data): the same members carry haplotype pairs such as 1|2 1|2, 1|1, 2|2.]
Motivations
• A haplotype is more biologically meaningful than a genotype, since each haplotype of
a child is inherited from one parent. Haplotype data are more informative and
more valuable in determining the association between diseases and genes and in
the study of human histories.
• The human genome project gave us the consensus genotype sequence of
humans, but to understand the genetic effects on many complex
diseases such as cancers, diabetes, and osteoporosis, the genetic variations are more
important, and these can be represented by haplotypes.
• Current techniques collect genotype data. Computational methods for deriving
haplotypes from genotypes are in high demand.
• The ongoing international HapMap project.
• It is generally believed that with parent/pedigree information we can get more
accurate haplotype and frequency estimations than from data without such
information.
• Family-based association studies have been widely used. We would expect more
family-based gene mapping methods that assume accurate haplotype
information.
• Model-based statistical methods are not only computationally intensive but may also
rely on assumptions that do not hold in real datasets.
Results
• MRHC is NP-hard
• Heuristic: block-extension algorithm
• Exact algorithms: member-based and locus-based dynamic
programming
• ILP algorithm for MRHC with missing alleles
• Software: PedPhase
• Special cases:
– Efficient algorithms for ZRHC based on systems of
linear equations and low stretch spanning trees
– locus-based dynamic programming for loopless pedigree
• A datamining approach to gene association mapping
• Several results on genome-wide TagSNP selection via
linkage disequilibrium
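ILP Formulation
Objective function:
$$\min \sum_{i \in \text{Non-Founders}} \sum_{j=1}^{m-1} \left( r^{j}_{i,1} + r^{j}_{i,2} \right)$$
Subject to
Genotype constraints: (0 means missing allele)
$$\{0, 0\} \Rightarrow \left\{ \sum_{k=1}^{t_j} f^{j}_{i,k} = 1,\ \sum_{k=1}^{t_j} m^{j}_{i,k} = 1 \right\}$$
$$\{m^{j}_{r}, 0\} \Rightarrow \left\{ f^{j}_{i,r} + m^{j}_{i,r} \ge 1 \right\}$$
$$\{m^{j}_{r}, m^{j}_{r}\} \Rightarrow \left\{ f^{j}_{i,r} = m^{j}_{i,r} = 1 \right\}$$
$$\{m^{j}_{r}, m^{j}_{s}\} \Rightarrow \left\{ f^{j}_{i,r} + f^{j}_{i,s} = m^{j}_{i,r} + m^{j}_{i,s} = f^{j}_{i,r} + m^{j}_{i,r} = f^{j}_{i,s} + m^{j}_{i,s} = 1 \right\}$$
Mendelian law of inheritance constraints:
$$f^{j}_{i,k} - f^{j}_{f,k} - g^{j}_{i,1} \le 0$$
$$f^{j}_{i,k} - m^{j}_{f,k} + g^{j}_{i,1} \le 1$$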
Constraints for the r variables:
Test Results on Real Data
The ZRHC Problem
Problem definition
Given a pedigree and the genotype information for
each member, find a recombination-free haplotype
configuration for each member that obeys the
Mendelian law of inheritance.
Some Constraints
[Figure: example pedigree with members 1 to 6, showing their genotypes and the resulting constraints on phase]
The Constraints as Linear Equations
Note: The variables represent phase and the
equations are over F(2) (in fact, addition mod 2).
The Final Linear System
• O(mn) equations on O(mn) variables.
• Standard Gaussian elimination gives rise to an $O(m^3 n^3)$
time algorithm.
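For reference, standard Gaussian elimination over F(2) can be sketched as below. This is the generic cubic-time baseline, not the faster ZRHC algorithm, and the three-equation toy system (with hypothetical phase variables x0, x1, x2) is made up for illustration.

```python
def solve_gf2(rows, rhs):
    """Solve A x = b over GF(2); return one solution or None if inconsistent."""
    n_vars = len(rows[0]) if rows else 0
    aug = [row[:] + [b] for row, b in zip(rows, rhs)]
    pivots = []
    r = 0
    for c in range(n_vars):
        pivot = next((i for i in range(r, len(aug)) if aug[i][c]), None)
        if pivot is None:
            continue  # free variable
        aug[r], aug[pivot] = aug[pivot], aug[r]
        for i in range(len(aug)):
            if i != r and aug[i][c]:
                # Row addition mod 2 is a bitwise XOR.
                aug[i] = [x ^ y for x, y in zip(aug[i], aug[r])]
        pivots.append(c)
        r += 1
    # A row reading 0 = 1 means the system has no solution.
    if any(row[-1] and not any(row[:-1]) for row in aug):
        return None
    x = [0] * n_vars  # free variables are set to 0
    for i, c in enumerate(pivots):
        x[c] = aug[i][-1]
    return x


# Toy system: x0 + x1 = 1, x1 + x2 = 0, x0 + x2 = 1 (all mod 2).
A = [[1, 1, 0], [0, 1, 1], [1, 0, 1]]
b = [1, 0, 1]
solution = solve_gf2(A, b)
print(solution)

# Changing the last right-hand side makes the system inconsistent.
no_solution = solve_gf2(A, [1, 0, 0])
print(no_solution)
```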
A Faster Algorithm for ZRHC
• We have recently devised a faster algorithm
for ZRHC with running time $O(mn^2 + n^3 \log^2 n \log\log n)$
[Figure: the original O(mn) × O(mn) matrix is transformed and its redundancy reduced, giving an O(n log^2 n log log n) × O(n) matrix]
Some Open Problems
• Faster (and reliable) method than ILP for large pedigrees
• The k-RHC problem for small k
• Probabilistic models for k-RHC (Xiao Jing)
• Incorporation of population models into pedigrees
– Combine with the parsimony model
– Combine with the perfect phylogeny model
– Population of trios?
• Dealing with mutations, errors, and missing data
• Association mapping on/using pedigree data?
A High-Throughput Combinatorial Approach to
Genome-Wide Ortholog Assignment
Zheng Fu, Wilson Shi, Vincent Peng
Collaboration: Liqing Zhang (Virginia Tech)
Fund: NSF IIS
Joint work with X. Chen, J. Zheng, V. Vacic, P. Nan, Y. Zhong, and S. Lonardi
Orthology
• Homolog
– Gene family
• Duplication
– Paralog
• Speciation
– Ortholog
[Figure: gene tree for mouse, chicken, and frog]
(from http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html)
Orthology – the more complicated picture
[Figure: gene tree for genomes G1, G2, and G3. Ancestral gene A passes through Speciation 1, Gene duplication 1 (giving B and C), Speciation 2, and Gene duplication 2, leaving A1 in G1, B1 and C1 in G2, and B2, C2, and C3 in G3.]
• Outparalogs evolved via a duplication prior to a given speciation event.
• Inparalogs evolved via a duplication posterior to a given speciation event.
• A true exemplar is the direct descendant of the ancestral gene of
a given set of inparalogs. A main ortholog pair is defined as the
two true exemplar genes of two co-orthologous gene sets.
Significance
• Orthologous genes in different species are
evolutionary and functional counterparts.
• Many methods use orthologs in a critical way:
– Function inference
– Protein structure prediction
– Motif finding
– Phylogenetic analysis
– Pathway reconstruction
– and more ...
• Identification of orthologs, especially
exemplar genes, is a fundamental and
challenging problem.
Ortholog Assignment Methods
• BBH: Best Bidirectional Hit (by BLASTn / BLASTp)
• COG: Cluster of Orthologous Groups
(Tatusov et al., Science, 278: 631-637, 1997; Nucleic Acids Res., 28:33-36, 2000)
• TOGA: TIGR Orthologous Gene Alignments
(Lee et al., Genome Res, 12: 493-502, 2002)
• INPARANOID: Identify Orthologs & Inparalogs
(Remm et al., J Mol Biol. 314:1041-1052, 2001)
• OrthoMCL: a Markov Cluster algorithm
(Li et al., Genome Res, 13: 2178-2189, 2003)
• Reconciled Tree: Gene tree vs. species tree
(Yuan et al., Bioinformatics, 14:285-289, 2001)
• OrthoParaMap: Synteny regions
(Cannon et al., BMC Bioinformatics 4(1):35, 2003)
• Shared Genomic Synteny: Synteny anchors and Synteny blocks
(Zheng et al., Bioinformatics 21:703-710, 2004)
• SOAR: System of Ortholog Assignment by Reversal
(Chen et al., IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2005)
Molecular Evolution
• Local mutation
– Base substitution
– Base insertion
– Base deletion
• Global rearrangement and duplication
– Inversion/Reversal
– Translocation
– Transposition
– Fusion/Fission
– Duplication/Loss
A complete ortholog assignment system
should make use of information from both
levels of molecular evolution.
Example
[Figure: the ancestral genome (a1 b c a2 d e f g) undergoes speciation. One lineage experiences a reversal and a duplication, giving a genome with gene order (b a1 c a2 d e a4 f g); the other experiences a fission and a duplication, giving a genome with gene order (a1 b c) (a2 d e f a3 g).]
Given the evolutionary scenario in terms of gene order, main ortholog
pairs and inparalogs could be identified in a straightforward way.
The Parsimony Approach
• Identify homologs using BLASTp.
• Reconstruct the evolutionary scenario on the basis
of the parsimony principle: postulate the minimum
possible number of rearrangement events and
duplication events in the evolution of two closely
related genomes since their splitting so as to assign
orthologs.
• Ortholog assignment problem could be formulated
as a problem of finding a most parsimonious
transformation from one genome into the other,
without explicitly inferring their ancestral genome.
RD (Reversal-Duplication) Distance
• RD distance: $RD(\Pi, \Gamma) = R(\Pi, \Gamma) + D(\Pi, \Gamma)$
– $R(\Pi, \Gamma)$ denotes the number of rearrangement events
in a most parsimonious transformation
– $D(\Pi, \Gamma)$ denotes the number of gene duplications in a
most parsimonious transformation
• Example (gene order only):
$\Pi$ = (b a1 c a2 d e a4 f g)
$\Gamma$ = (a1 b c) (a2 d e f a3 g)
$RD(\Pi, \Gamma) = 4$
The Key Algorithmic Problem: SRDD
• Two related (unichromosomal) genomes
– No inparalogs, i.e. no post-speciation duplications
– No gene losses
– Equal gene content
• Signed Reversal Distance with Duplicates
– Given two related genomes
– Only reversals have occurred
– How to find a shortest sequence of reversals
• Almost untouched in the literature
– Duplicated genes are present
– Generalizes the problem of sorting by reversal
Sorting By Reversals Problem
• Goal: Given a permutation, find a shortest
series of reversals that transforms it into the
identity permutation (1 2 … n )
• Input: Permutation p
• Output: A series of reversals r1, … rt
transforming p into the identity permutation
such that t is minimum
Sorting by Reversals: Example
• t = d(p): reversal distance of p
• Example:
p = 3 4 2 1 5 6 7 10 9 8
    4 3 2 1 5 6 7 10 9 8
    4 3 2 1 5 6 7 8 9 10
    1 2 3 4 5 6 7 8 9 10
So d(p) = 3
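The example permutation p = 3 4 2 1 5 6 7 10 9 8 is small enough to check by brute force. The sketch below is only an illustration for short, unsigned permutations without duplicates: a depth-limited breadth-first search over all reversals, plus the usual breakpoint lower bound (each reversal removes at most two breakpoints). It is not the SRDD algorithm, and the function names are ours.

```python
def reversals(p):
    """Yield every permutation reachable from p by reversing one segment."""
    n = len(p)
    for i in range(n):
        for j in range(i + 2, n + 1):
            yield p[:i] + p[i:j][::-1] + p[j:]


def breakpoints(p):
    """Count adjacent pairs in 0, p, n+1 that are not consecutive numbers."""
    ext = (0,) + tuple(p) + (len(p) + 1,)
    return sum(abs(a - b) != 1 for a, b in zip(ext, ext[1:]))


def reversal_distance(p, max_depth=4):
    """Exact distance by level-wise BFS; None if it exceeds max_depth."""
    start, target = tuple(p), tuple(sorted(p))
    if start == target:
        return 0
    seen, frontier = {start}, [start]
    for depth in range(1, max_depth + 1):
        next_frontier = []
        for q in frontier:
            for r in reversals(q):
                if r == target:
                    return depth
                if r not in seen:
                    seen.add(r)
                    next_frontier.append(r)
        frontier = next_frontier
    return None


p = (3, 4, 2, 1, 5, 6, 7, 10, 9, 8)
n_breakpoints = breakpoints(p)
distance = reversal_distance(p)
print(n_breakpoints, distance)
```

With 5 breakpoints, at least 3 reversals are needed, and the search confirms that 3 suffice. The search space grows factorially, so this approach is only usable on toy inputs.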
The MCSP Problem
• Minimum Common Substring Partition
G: 3 1 2 -1 4
H: -4 1 2 3 1
• This may help eliminate many duplicates, but is
different from syntenic blocks.
• Given two related genomes G and H, we have
$$(L(G, H) - 1) / 2 \le d(G, H) \le L(G, H) - 1$$
An Outline of MSOAR
Input: Dataset A and Dataset B
Homology search:
1. Apply all-vs.-all comparison by BLASTp
2. Only select the BLAST hits with similarity score above the cutoff
3. Keep the top five bi-directional best hits
Assign orthologs by minimizing RD distance:
1. Apply suboptimal rules
2. Apply minimum common substring partition
3. Maximum cycle decomposition
4. Detect inparalogs by identifying “noise” gene pairs
Output: list of orthologous gene pairs
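One way to read the "top five bi-directional best hits" filter in the homology search is sketched below: a pair (a, b) is kept when b is among the k best hits of a and a is among the k best hits of b. That reading, the hit tables, and the function name are assumptions made for illustration; MSOAR's actual rule may differ in details such as tie handling.

```python
def top_k_bidirectional(hits_ab, hits_ba, k=5):
    """Keep pairs that rank in each other's top-k hit lists.

    hits_ab maps each gene in dataset A to a list of (gene_in_B, score);
    hits_ba is the reverse direction. Higher scores are better.
    """
    def top(hits):
        return {
            gene: {h for h, _ in sorted(lst, key=lambda t: -t[1])[:k]}
            for gene, lst in hits.items()
        }

    top_ab, top_ba = top(hits_ab), top(hits_ba)
    return sorted(
        (a, b)
        for a, partners in top_ab.items()
        for b in partners
        if a in top_ba.get(b, set())
    )


# Hypothetical similarity scores between two human and two mouse genes.
hits_ab = {"h1": [("m1", 200), ("m2", 150)], "h2": [("m2", 180)]}
hits_ba = {"m1": [("h1", 200)], "m2": [("h1", 150), ("h2", 180)]}

strict_pairs = top_k_bidirectional(hits_ab, hits_ba, k=1)
relaxed_pairs = top_k_bidirectional(hits_ab, hits_ba, k=2)
print(strict_pairs)
print(relaxed_pairs)
```

Keeping several candidates per gene, rather than only the single best hit, leaves the later rearrangement-based steps room to choose among paralogs.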
Real Data
• Homo sapiens:
– Build 36.1 human genome assembly (UCSC hg18, March
2006)
– 20161 protein sequences in total
• Mus musculus:
– Build 36 mouse genome assembly (UCSC mm8, February
2006)
– 19199 protein sequences in total
MSOAR vs Inparanoid
• Validation: Official gene symbols extracted from the UniProt
release 6.0 (September 2005)
• For 20161 human protein sequences and 19199 mouse protein
sequences, MSOAR assigned 14362 orthologs between Human
and Mouse, among which 11050 are true positives, 1748 are
unknown pairs and 1508 are false positives, resulting in a
sensitivity of 92.26% and a specificity of 87.99%.
• The comparison between MSOAR and Inparanoid
(Remm et al., J. Mol. Biol., 2001)
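The reported specificity of 87.99% is reproduced by dividing true positives by true positives plus false positives, with the unknown pairs left out. That definition is inferred from the numbers and is not stated on the slide; the sensitivity cannot be rechecked in the same way because its denominator is not given.

```python
true_positives = 11050
false_positives = 1508

# Fraction of assigned pairs with known status that are correct.
specificity = true_positives / (true_positives + false_positives)
print(f"{specificity:.2%}")
```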
MSOAR vs INPARANOID
[Figure: gene order on human chromosome 20 (STK35, TGM3, TGM6, SNRPB, ZNF343, TMC2, NOL5A, IDH3B) aligned with mouse chromosome 2 (Stk35, Tgm3, Tgm6, Snrpb, Tmc2, Nol5a, Idh3b)]
The ortholog pair SNRPB (human) and Snrpb (mouse) is not a pair of bi-directional
best hits, so it could be missed by sequence-similarity-based ortholog
assignment methods such as Inparanoid.
Validation by HCOP
• The HGNC Comparison of Orthology Predictions
(HCOP) is a tool that integrates and displays the
human-mouse orthology assertions made by Ensembl,
Homologene, Inparanoid, PhIGS, MGD and HGNC.
(http://www.gene.ucl.ac.uk/cgi-bin/nomenclature/hcop.pl)
Distribution of the number of supports from HCOP
[Figure: histogram; x-axis: the number of supports (0 to 6); y-axis: the number of orthologs assigned by MSOAR (0 to 6000)]
Future Work
– More efficient algorithms for MCSP and MCD. The best
approximation algorithm for MCSP has ratio O(k) (Kolman and
Walen’06). Can the ratio be improved to O(1)?
– Refine the evolutionary model for MSOAR (transposition,
tandem duplication, gene loss, etc.) How would the DCJ model
fit in?
– Ortholog assignment for multiple genome comparison. The
median problem.
– More explicit treatment of one-to-many and many-to-many
orthology relationship.
– Take advantage of other sources of genomic information such
as unique sequence tags, syntenic blocks, etc.
Genome-Wide Inference of mRNA
Isoforms and Estimation of Their
Expression Levels from RNA-Seq Reads
Wei Li (UCR/Harvard) and Jianxing Feng (Tsinghua/Tongji)
This method was also
reported in SLIDE later
(Li et al., PNAS, 2011).
Separating Metagenomic Short
Reads into Genomes via Clustering
Olga Tanaseichuk and James Borneman (UCR)
Metagenomics
• Genomics
– Study of an organism's
genome
– Relies upon cultivation
and isolation
– > 99% of bacteria
cannot be cultivated
• Metagenomics
– Study of all organisms in an environmental sample by
simultaneous sequencing of their genomes
– Makes it possible to study organisms that cannot be isolated or
are difficult to grow in a lab
Metagenomic Projects
The Acid Mine Drainage Project
• Simple community: 5 dominant species (3 bacteria and 2 archaea)
• Motivation: to understand mechanisms by which the
microbes tolerate the extremely acid environments
The Sargasso Sea Project
• A large-scale sequencing in an environmental setting
• Identified >1 million putative genes (10 times more
than in all databases at that time)
• ~1800 species
The Human Microbiome Project
• Microbial community living in a host
• 100 trillion microbes
• 100 times more microbial than human genes
• Is there a core human microbiome?
• How do changes in the microbiome correlate with human health?
[Images: the Tinto River in Spain (credit: Carol Stoker); a coral reef off the coast of Malden Island in Kiribati]
DNA Sequencing
• Next Generation Sequencing (NGS)
– High-throughput
– Cost- and time-effective
– No cloning (reduced clonal biases)
– Shorter read length compared to Sanger reads (~1000 bps)
  • Roche/454 (~450 bps)
  • Illumina/Solexa (35-100 bps)
  • ABI SOLiD (35–50 bps)
– Due to rapid progress, sequencing lengths will increase
Goals of Metagenomics
• Phylogenetic diversity
• Metabolic pathways
• Genes that predominate in a given environment
• Genes for desirable enzymes
• ...
Ultimate goal: complete genomic sequences
Problem Formulation
• Given metagenomic reads, separate reads from
different species (or groups of related species)
Difficulties
Genomics:
• Repeats in genomic sequences
• Sequencing errors
Metagenomics:
• Unknown number of species and abundance levels
• Common repeats in different genomes due to homologous sequences
Existing Approaches
• Similarity-Based
– Similarity search against databases of known
genomes or genes/proteins
• Composition-Based
– Binning based on conserved compositional
features of genomes
• Abundance-Based
– Separate genomes by abundance levels
Algorithm: Overview
• Purpose: separating short paired-end reads from
different genomes in a metagenomic dataset
• Two-phase heuristic algorithm
– short reads
– similar abundance levels
– arbitrary abundance levels (in combination with
AbundanceBin [Wu and Ye, RECOMB, 2010])
Algorithm: Definitions and Observations
Unique l-mers (occur only once)
Repeated l-mers (occur > once)
Observation 1: Most of the l-mers in a
bacterial genome are unique
(for l ~ 20, in most complete genomes)
[Figure: the ratio of unique l-mers to distinct l-mers]
Algorithm: Definitions and Observations
Unique l-mers
Repeated l-mers
Observation 2: Most l-mers in a
metagenome are unique
for l ~ 20 and genomes separated by
sufficient phylogenetic distances
Algorithm: Definitions and Observations
Repeated l-mers
Individual
repeats
Common
repeats
Observation 3: Most of the repeats in a
metagenome are individual
for l ~ 20 and genomes separated by
sufficient phylogenetic distances
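The quantity behind these observations, the ratio of unique l-mers to distinct l-mers, is straightforward to compute. The sketch below only illustrates the definition: it counts l-mers on one strand (reverse complements are ignored, which is a simplification), and it uses random sequences as stand-ins for genomes, so it says nothing about real repeat content.

```python
import random
from collections import Counter


def lmer_counts(sequences, l):
    """Count every l-mer occurring in the given sequences (one strand only)."""
    counts = Counter()
    for s in sequences:
        for i in range(len(s) - l + 1):
            counts[s[i:i + l]] += 1
    return counts


def unique_ratio(sequences, l):
    """Fraction of distinct l-mers that occur exactly once."""
    counts = lmer_counts(sequences, l)
    if not counts:
        return 0.0
    unique = sum(1 for c in counts.values() if c == 1)
    return unique / len(counts)


# Two random 5 kb sequences stand in for a tiny two-genome metagenome.
random.seed(0)
genome_a = "".join(random.choice("ACGT") for _ in range(5000))
genome_b = "".join(random.choice("ACGT") for _ in range(5000))
metagenome = [genome_a, genome_b]

ratio_short = unique_ratio(metagenome, 4)
ratio_long = unique_ratio(metagenome, 20)
print(ratio_short, ratio_long)
```

Short l-mers repeat by chance alone, while at l around 20 almost every l-mer is unique, which is why repeated 20-mers carry information about genuine repeats.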
Flowchart
Arbitrary Abundance Levels
• A significant abundance ratio is defined by
the expected misclassification rate (>3%)
Experimental Results: Overview
• Lack of NGS metagenomic benchmarks
• Lack of algorithms in the literature to separate short NGS reads from
different genomes
• Datasets
– Tests on a variety of synthetic datasets with different numbers of
genomes, phylogenetic distances and abundance ratios
– Performance on a real metagenomic dataset from gut bacteriocytes
of a glassy-winged sharpshooter
• Comparison
– We modify the Velvet assembler [Zerbino and Birney, Genome
Research, 2008] to work as a genome separator (clusters in Phase I
are replaced by sets of l-mers from the Velvet contigs)
– With CompostBin on longer reads
Experimental Results
• 182 synthetic datasets of 4 categories
– 79 experiments for the same genus
– 66 – same family
– 29 – same order
– 8 – same class
• Read length: 80 bps
• Coverage depth: ~15-30
• Equal abundance levels
• 2-10 genomes in each dataset
• Simulation: MetaSim [Richter et al., PLoS ONE, 2008]
• Phylogeny: NCBI taxonomy
Experimental Results
Experimental Results: Genomes with
Different Abundance Levels
Experimental Results: Comparison
with CompostBin
• Simulated paired-end Sanger reads from [Chatterji et al.,
RECOMB, 2008]
– Handling longer reads (1000 bps)
• Cut long reads into short reads of 80 bps
• Linkage information is recovered in Phase II
– Handling lower coverage depth (~3-6)
• Choose higher threshold K to separate repeats and
unique l-mers in preprocessing
• Simulated paired-end Illumina reads
– 80 bps, high coverage depth (~15-30)
Experimental Results: Comparison with CompostBin
Test     Abundance ratio    Phylogenetic distance
Test1    1:1                Species
Test2    1:1                Genus
Test3    1:1                Genus
Test4    1:1                Family
Test5    1:1                Family
Test6    1:1                Order
Test7    1:1:8              Family, Order
Test8    1:1:8              Order, Phylum
Test9    1:1:1:1:2:14       Species, Order, Family, Phylum, Kingdom
Experimental Results: Real Dataset
• Gut bacteriocytes of glassy-winged sharpshooter,
Homalodisca coagulata
– Consists of reads from:
• Baumannia cicadellinicola
• Sulcia muelleri
• Miscellaneous unclassified reads
• Sanger reads
• Performance is measured on the ability to separate reads
from B.cicadellinicola and S.muelleri
• Performance
– TOSS: Sensitivity: ~92% and error rate ~1.6%
– CompostBin: Error rate: ~9%
Implementation of TOSS
• Implemented in C
• Running time and memory depend on
– Number and length of reads
– Total length of the genomes
• For 80 bps reads -- 0.5 GB of RAM per 1 Mbps
– 2-4 genomes, total length 2-6 Mbps – 1-3 h, 2-4 GB of RAM
– 15 genomes, total length 40 Mbps – 14 h, 20 GB of RAM
Questions? Comments?
Contact: Tao Jiang
Department of Computer Science and Engineering
University of California – Riverside
www.cs.ucr.edu/~jiang