Download Recostructing the Evolutionary History of Complex Human Gene

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Ridge (biology) wikipedia , lookup

Gene wikipedia , lookup

Gene expression profiling wikipedia , lookup

History of genetic engineering wikipedia , lookup

Copy-number variation wikipedia , lookup

RNA-Seq wikipedia , lookup

Gene therapy of the human retina wikipedia , lookup

Metagenomics wikipedia , lookup

Human genetic variation wikipedia , lookup

Public health genomics wikipedia , lookup

Gene therapy wikipedia , lookup

Genome (book) wikipedia , lookup

Gene nomenclature wikipedia , lookup

Genomics wikipedia , lookup

Human genome wikipedia , lookup

Gene desert wikipedia , lookup

Smith–Waterman algorithm wikipedia , lookup

Pathogenomics wikipedia , lookup

Human Genome Project wikipedia , lookup

Therapeutic gene modulation wikipedia , lookup

Site-specific recombinase technology wikipedia , lookup

Microevolution wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Genome editing wikipedia , lookup

Gene expression programming wikipedia , lookup

Segmental Duplication on the Human Y Chromosome wikipedia , lookup

Designer baby wikipedia , lookup

Genome evolution wikipedia , lookup

Helitron (biology) wikipedia , lookup

Transcript
Reconstructing the Evolutionary
History of Complex Human Gene
Clusters
Y. Zhang, G. Song, T. Vinar,
E. D. Green, A. Siepel, W. Miller
RECOMB 2008
Speaker: Fabio Vandin
Outline






Motivation
Problem definition
Simple algorithm for simple model
SIS algorithm for complex model
Results on human genome
Evaluation of SIS algorithm
Gene cluster


Group of related genes
Probably formed by
duplications:


followed by functional
diversification
with deletions, cause
human genetic diseases
“History” of a gene cluster?
?
80%
85%
89%
93%
98%
Percentage of similarity in self-alignment in the human genome
Dot-plots of self-alignments
Human UGT2 cluster
Data Preparation

Atomic segments:




Self-alignment (forward and reversecomplement): pairs A ≈ B
Transitive closure property:
A ≈ B, B ≈ C
A≈C
Maximize each alignment
Collapsible segments:
Computational Problem


Input: sequence of atomic
(non-collapsible) segments
Output: (most probable) sequence of
events (or the number of events) such
that if we unwind these events in the
input sequence, we obtain a sequence
containing only a single atomic segment
Simple Model

Assumptions:
Only duplications (possibly with reversal
and tandem duplications)
2. No duplication inside the originating
region
3. The most important:
1.
Simple Model (2)


Given assumptions 1 e 2, each event is
of “identical” (reversed) regions
of consecutive atomic segments (
copied to
)
Problem statement
Simple Model (3)

Candidate definition

Is this definition reasonable?
Simple Model (4)

Algorithm

Analysys
Simple Model: limitations

Assumptions:



“large scale deletions are likely to occur”
“atomic boundary reuses violating
assumption are not uncommon”
Same number of events, but multiple
way of reconstructing the history

NB: not solved with SIS, but you can
assign probability..
Stochastic Model

Event


:
Duplication (possibly with reversal)
Deletion (with restrictions)

History:

Target distribution of histories:

is the number of reused atomic
boundaries and
Algorithm for Stochastic Model


Sample histories from the target distribution
and compute the mean value (of the
function we are interested in):

sample

given

e.g.,
from
, estimate
gives the number of events
Problem: how to sample from the target
distribution?
Sequential Importance Sampling (SIS)

Goal: compute

Target distribution:

Trial distribution:

Sample’s weight:

Output:
SIS for gene cluster history

Duplication
SIS for gene cluster history (2)

Deletion


only without atomic boundary reuse
only if “the atomic segment pair flanking a
deletion site appears elsewhere” in the seq.
Application to Human Gene Clusters

human genome assembly hg18 self
alignment: 457 duplicated regions






alignments: at least 500 bp, >= 70% identity
segments separated by no more than 500 Kbp
only long and non-trivial regions considered
165 biomedically interesting clusters (~111
Mbp)
5 divergence thresholds: GA, OWM, NWM, LG,
DOG
825 combinations of gene cluster and
divergence threshold
Application to Human Gene Clusters (2)
Application to Human Gene Clusters (3)
Application to Human Gene Clusters (4)



“… help prioritize the selection of notably
interesting gene clusters for more
detailed comparative genomics studies”
“… to compare cluster dynamics in
certain lineages to observed phenotypic
differences among primates”
“… such sequence data should reveal
differences among primate species of
possible relevance for selecting species
for further biomedical studies”
Evaluation of the SIS Algorithm

Estimate parameters from the human genome
for the events (e.g., duplications):


39% of duplication “reversed”, deletions=2% of
duplications
Starting from 500 Kbp sequence, generate 10
genome clusters for N= 10,20,…,100 events
Evaluation of the SIS Algorithm (2)
Discussion

Main Limitations:




details of data preparation
choice of the parameters/distributions
evaluation of the SIS algorithm
Future Directions:




include other types of events
understand the stochastic model
how to evaluate a model?
how to evaluate an algorithm?