Genome-Wide Mapping of in Vivo Protein-DNA Interactions
Johnson et al. (Science 2007)
Presented by Leo J. Lee
Mar. 19, 2008
CSC 2417
1
Outline
• Background on ChIP based methods to
study protein-DNA interactions
• Salient features of ChIPSeq
• Overview of the experimental protocol
• Data analysis pipeline used in the paper
• Important biological findings/contributions
• General discussions
Protein-DNA interaction
• DNA is the information carrier of almost all living
organisms.
• Protein is the major building block of life.
• Interactions between DNA and proteins play vital
roles in the development and normal function of
living organisms, and in disease when something
goes wrong.
• An important mechanism of protein-DNA
interaction is via direct binding, i.e., a protein
binds to a particular fragment of the DNA.
Chromatin Immunoprecipitation (ChIP)
• ChIP is a method to
investigate protein-DNA
interaction in vivo.
• The output of ChIP is
enriched fragments of
DNA that were bound by a
particular protein.
• The identity of the DNA
fragments needs to be
determined by a
second method.
ChIP-chip (or ChIP-on-chip)
• ChIP-chip uses microarray technology to
determine the identity of DNA fragments
produced by ChIP.
• Typically a control sample (genomic DNA
without going through ChIP) is used to
properly define relative enrichment of
specific sequences in the ChIP DNA.
• It was the dominant high-throughput
technique before the arrival of ChIPSeq.
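The relative enrichment mentioned above is often reported as a log ratio of ChIP signal to control signal per probe. The following is a minimal sketch of that idea only; the function name, the pseudocount, and the choice of log2 are assumptions for illustration and are not taken from the paper.

```python
from math import log2


def log2_enrichment(chip_intensity, control_intensity, pseudocount=1.0):
    """Log2 ratio of ChIP signal to control signal for one probe.

    The pseudocount (an assumption) avoids division by zero on empty probes.
    """
    return log2((chip_intensity + pseudocount) / (control_intensity + pseudocount))
```

A value of 0 means no enrichment relative to the control; positive values mean the probe is over-represented in the ChIP DNA.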
ChIPSeq Workflow
ChIP
Size Selection
(200-700bp for Exp 1; 150-300bp for Exp 2)
Solexa Sequencing
Mapping onto Genome
ChIPSeq vs. ChIP-chip
• The experimental design of ChIPSeq is
considerably simpler.
• ChIPSeq typically can achieve higher
genomic coverage than ChIP-chip (also
depends on read length vs. probe length).
• The data from ChIPSeq is arguably
cleaner and easier to process.
• Costs are comparable (?).
Nice things about NRSF (REST)
• Considerable knowledge on NRSF has been
accumulated from previous studies, which
provides a set of true positives and negatives.
• Yet there is still room to make new discoveries,
as illustrated in the paper.
• The DNA motif bound by NRSF (called NRSE)
is long and well-specified.
• There is a high-quality antibody that recognizes
NRSF efficiently.
Sequence Mapping & Filtering
• Only sequence reads mapped to a unique position
on the human genome are kept (about 50%).
• Two mismatches were allowed to accommodate
polymorphism (and sequencing error).
• The resulting sequence read distributions are
processed by a peak locator algorithm to find the
local concentration of sequence hits and its peak.
• A minimum five-fold enrichment over the control
sample is required.
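The filtering rules above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' pipeline: the tuple layout of an alignment, the rule that a read counts as unique when exactly one of its alignments has at most two mismatches, and the library-size normalisation with a pseudocount are all hypothetical choices.

```python
from collections import Counter


def keep_unique_reads(alignments, max_mismatches=2):
    """Keep reads that align to exactly one position within the mismatch limit.

    alignments: iterable of (read_id, chrom, pos, mismatches) tuples,
    with one tuple per candidate alignment (assumed layout).
    """
    alignments = list(alignments)
    hits = Counter(rid for rid, _, _, mm in alignments if mm <= max_mismatches)
    return [(rid, chrom, pos) for rid, chrom, pos, mm in alignments
            if mm <= max_mismatches and hits[rid] == 1]


def fold_enrichment(chip_count, control_count, chip_total, control_total,
                    pseudocount=1.0):
    """Ratio of library-size-normalised read counts in one region (assumed form)."""
    chip_rate = (chip_count + pseudocount) / chip_total
    control_rate = (control_count + pseudocount) / control_total
    return chip_rate / control_rate


def passes_enrichment(chip_count, control_count, chip_total, control_total,
                      min_fold=5.0):
    """Apply the five-fold cutoff from the slide to one region."""
    return fold_enrichment(chip_count, control_count,
                           chip_total, control_total) >= min_fold
```

For example, a read with two acceptable alignments is dropped, and a region with 19 ChIP reads against 1 control read in equally sized libraries passes the cutoff.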
ChIPSeq Peak Locator Algorithm
• Merge enriched regions within 500bp of one
another.
• Apply a triangular 5-point smoothing and identify
the peak as the coordinate with the greatest
number of overlapping reads.
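The two steps above can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes regions are (start, end) pairs on a single chromosome, that coverage is a list of overlapping-read counts per position, and that "triangular 5-point smoothing" means weights 1, 2, 3, 2, 1 with edges renormalised over the available points.

```python
def merge_regions(regions, max_gap=500):
    """Merge (start, end) intervals whose gap is at most max_gap bp."""
    merged = []
    for start, end in sorted(regions):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]


def smooth_triangular(coverage):
    """5-point triangular smoothing with weights 1, 2, 3, 2, 1 (assumed)."""
    weights = (1, 2, 3, 2, 1)
    smoothed = []
    for i in range(len(coverage)):
        num = den = 0
        for offset, w in zip(range(-2, 3), weights):
            j = i + offset
            if 0 <= j < len(coverage):  # skip points that fall off the ends
                num += w * coverage[j]
                den += w
        smoothed.append(num / den)
    return smoothed


def call_peak(coverage, region_start=0):
    """Return the coordinate with the highest smoothed coverage."""
    smoothed = smooth_triangular(coverage)
    best = max(range(len(smoothed)), key=smoothed.__getitem__)
    return region_start + best
```

Smoothing matters when a single noisy position outranks a broader pile of reads: for coverage [1, 5, 1, 4, 4, 4, 1] the raw maximum is at index 1, while the smoothed peak is at index 3.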
Selecting a read count threshold
• A ROC curve was
obtained by analyzing
true positives and
negatives.
• A sequence read
threshold of 13 was
selected to reach
98% specificity and
87% sensitivity.
Precision of ChIPSeq
• Evaluated against the
center of high-scoring
canonical NRSE
motifs.
• 94% of these strong
motifs fall within 50bp
of the called
experimental peak.
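A figure like the 94% above is a fraction of motif centers that lie within a window of the nearest called peak. The sketch below shows one way to compute such a fraction, assuming all coordinates are on one chromosome; the function name and input layout are hypothetical.

```python
from bisect import bisect_left


def fraction_within(motif_centers, peak_positions, window=50):
    """Fraction of motif centers with a called peak within `window` bp."""
    peaks = sorted(peak_positions)
    hits = 0
    for center in motif_centers:
        i = bisect_left(peaks, center)
        # Only the peaks just left and right of the center can be nearest.
        neighbours = peaks[max(i - 1, 0):i + 1]
        if neighbours and min(abs(center - p) for p in neighbours) <= window:
            hits += 1
    return hits / len(motif_centers)
```

With motif centers at 100, 500 and 1000 and peaks at 120, 580 and 990, two of the three motifs fall within 50 bp of a peak.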
Comprehensiveness of ChIPSeq
• Virtually all strong
canonical NRSE motif
instances are
detectably occupied.
• Most of the sites
previously studied by
transfection analysis
are also detected.
Motif Visualization
Motif Discovery
• Two new kinds of motifs are discovered:
– A noncanonical motif with variable spacing
between the left and right half sites of the
canonical motif
– Half-site motifs
• The enrichment of both kinds of motifs is
highly statistically significant.
• The authors are able to tell a nice
evolutionary story about them.
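To make the idea of variable spacing between half sites concrete, the sketch below scans a sequence for a left half site followed, after a gap of bounded length, by a right half site. The two half-site strings are placeholders chosen for illustration and are NOT the real NRSE half sites; the gap limits are also assumptions.

```python
import re

LEFT_HALF = "ACGTAC"   # placeholder, not the real NRSE left half site
RIGHT_HALF = "GGATCC"  # placeholder, not the real NRSE right half site


def find_spaced_sites(sequence, left=LEFT_HALF, right=RIGHT_HALF,
                      min_gap=0, max_gap=10):
    """Return (start, gap_length) for each left...right pair in the sequence.

    A lookahead is used so overlapping candidates are all reported; the lazy
    quantifier reports the shortest gap for each start position.
    """
    pattern = re.compile(
        f"(?=({re.escape(left)})([ACGTN]{{{min_gap},{max_gap}}}?)({re.escape(right)}))"
    )
    return [(m.start(), len(m.group(2))) for m in pattern.finditer(sequence)]
```

A gap of 0 corresponds to the two halves being adjacent, and larger gaps to the noncanonical arrangement. Exact string matching is a simplification; real motif scans typically score sites against a position weight matrix.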
GO enrichment analysis
• As expected, NRSF-bound loci are highly
enriched in gene ontology (GO) terms related to
neurons and their development.
• A group of genes encoding transcription factors
that are critical in driving islet cell development
in the pancreas is newly identified among the bound loci.
• Sequence counts for this group are modest but
comfortably above the threshold of 13.
• The authors are able to provide strong
arguments on the significance of this discovery.
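GO term enrichment of a gene set is commonly scored with a one-sided hypergeometric test. The slides do not say which test the authors used, so the sketch below is an assumption about the general technique rather than a description of the paper's method; the function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_pvalue(population, annotated, sample, overlap):
    """P(X >= overlap) under the hypergeometric distribution.

    population: number of genes considered in total
    annotated:  genes in the population carrying the GO term
    sample:     genes in the set of interest (e.g. NRSF-bound genes)
    overlap:    genes in the sample carrying the GO term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(comb(annotated, k) * comb(population - annotated, sample - k)
               for k in range(overlap, upper + 1))
    return tail / total
```

A small p-value means the term appears among the bound genes more often than random sampling would explain. In practice the p-values are also corrected for the number of GO terms tested.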
Discussions
What makes this a Science paper?