Download poster

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Epigenetics of neurodegenerative diseases wikipedia , lookup

Gene expression profiling wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Genome (book) wikipedia , lookup

Long non-coding RNA wikipedia , lookup

Human genome wikipedia , lookup

Epigenetics of human development wikipedia , lookup

Gene wikipedia , lookup

Genome evolution wikipedia , lookup

Short interspersed nuclear elements (SINEs) wikipedia , lookup

RNA-Seq wikipedia , lookup

Designer baby wikipedia , lookup

Non-coding DNA wikipedia , lookup

Microevolution wikipedia , lookup

History of genetic engineering wikipedia , lookup

Gene desert wikipedia , lookup

Site-specific recombinase technology wikipedia , lookup

Helitron (biology) wikipedia , lookup

Transcript
Predicting Regulatory Elements from Genome-Wide 3-way Alignments of Human, Mouse, and Rat
James Taylor, David King, Diana Kolbe, Laura Elnitski, Pallavi Eswara, Jia Li, Webb Miller, Ross Hardison, and Francesca Chiaromonte
Center for Comparative Genomics and Bioinformatics, The Pennsylvania State University
Regulatory Potential (RP) scores are a method to identify cis-regulatory elements using statistical
models based on frequencies of short patterns in multiple species alignments.
The RP score for an alignment is calculated using the log ratio of transition probabilities from two fixed
order Markov models. The recognition model is trained on a set of experimentally verified human
regulatory elements which align in all the species being considered. The background model is trained
on a randomly selected set of ancient repeats (approximately the same size as the available regulatory
elements).
The following plots show 2-way and 3-way RP scores on several regions in the human genome. Some known cis-regulatory elements have been annotated. Elements in
the training set have also been marked. RefSeq Genes, Repeat Masker repeats, and human/mouse/rat conservation annotations are also included for reference.
Three Way RP Score
Two Way RP Score
0.6
Sequence Classes
0.4
Cumulative Fraction
0.8
1.0
DISCRIMINATION
Regulatory Training Set
Independent Enhancers Set
0.2
Exons
Noncoding Nonrepetitive
Ancient Repeats
Spliced ESTs
Human/Mouse/Rat Conservation
17680000
17685000
0.0020
17690000
17695000
17700000
17705000
0.1
17715000
0.0000
MYOD1
MYOD1 encodes a helix-loop-helix transcription factor and is critical to skeletal muscle lineage determination and damaged tissue repair. It has an enhancer ~20kb
upstream which is contained in our training set, but also has another enhancer ~5kb upstream which has a high RP score.
WNT2
116460000
116470000
116480000
116490000
1165000000
0.0020
116510000
Enhancer
116530000
116520000
Silencer
0.0000
The cumulative distributions for 3-way RP scores on
samples of different genomic regions demonstrate
the score’s discrimination capability. The regulatory
training set scores highly, but a disjoint set of
independently verified enhancers also scores very
well, showing that the score recognizes regulatory
elements well. Exons also score highly but are still
distinguishable from regulatory elements. Samples
of noncoding nonrepetitive DNA and ancestral
repeats score the lowest.
0.0
0.0
17710000
Enhancer
Enhancer (In training set)
WNT2
WNT2 is expressed in the thalamus and is in a family of genes in oncogenesis and development of several organs and systems, and some evidence suggests a link to
autism. There are two verified regulatory elements near the third exon, both with significant RP scores.
BETA GLOBIN GENE CLUSTER
5210000
5220000
5230000
5240000
5245000
5260000
0.0020
-0.1
Repeat Masker
MYOGENIC FACTOR 3
MODEL SELECTION
In extending the RP score computation from pairwise to multiple species alignments the key challenge
was selecting sufficiently simple models for our limited training data. Adding additional species
exponentially increases the complexity of input while reducing available training data due to the greater
frequency of unsequenced DNA. We addressed this by starting with a state space of all possible
alignment columns and collapsing them. First states are clustered based on their frequency in the
training data using hierarchical clustering. When the number of states is small enough we switch to a
second agglomeration algorithm based on discrimination. For each possible pair of symbols that can be
collapsed we train models with that collapse and score the training data with it. To avoid considering
order selection at this stage, we train the models with the average transition probabilities over several
orders. Whichever collapse maximizes the separation of the regulatory and neutral training data (merit
) is selected as the best alphabet at that stage, and the agglomeration continues. Based on the decrease
in merit across agglomeration stages we select a set of candidate alphabets. Finally we use leave-oneout cross-validation to select the best of these and the order of the markov models.
Genes
5270000
HS2
HS3
HS3.1
0.2
Regulatory Potential Score
0.0000
PERFORMANCE IMPROVEMENT WITH RAT
We have found that despite the short evolutionary distance
between rat and mouse, the RP score from human/mouse/
rat alignments performs better than the RP score from
human/mouse alignments. Not only is the range of the
score higher (as seen in the plots to the right) but leaveone-out cross validation using 2-way and 3-way alignments
shows that the 3-way score is able to correctly classify a
greater fraction of the training set elements.
HBB
HBD
HBE1
HBG2
HBG1
CROSS VALIDATION COMPARISON
2-way score
3-way score
Regulatory Correct
139
184
Regulatory Incorrect
134
89
Neutral Correct
222
201
Neutral Incorrect
38
59
68%
72%
Success Rate
CALIBRATION
Performance studies were conducted on alignments encompassing the beta globin complex, a 70kb
region containing experimentally well-studied regulatory regions. Ten sites were chosen as controls,
comprising promoters, upstream sequences, and distal regulatory elements known collectively as the
locus control region that are important for the regulation of beta globins. Optimum thresholds for RP
scores were determined by a minimum number of false predictions and a maximum number of true
predictions across the scoring range. At the optimum threshold, a minimum error rate was obtained
that is related to the average failure rate of a prediction. For the 3-way RP score this is ~26%. This
compares favorably to predictions based on conservation between human and mouse, which show a
minimum error rate of ~53%.
This gene cluster contains the genes that encode beta globin. Upstream of the genes (in the locus control region) are several hypersensitive sites which the RP score
recognizes.
HUNTINGTON
3106000
0.0030
3107000
3108000
3109000
3110000
3111000
Promoter (in training set)
0.0000
HD
HD is a widely expressed gene important to normal development. A mutation in this gene is the cause of Huntington’s Disease. This plot shows the promoter region
of HD. Other high scoring sites are also shown which may be worth investigating further.