* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download transcription factor binding site
Comparative genomic hybridization wikipedia , lookup
Genealogical DNA test wikipedia , lookup
Nucleic acid analogue wikipedia , lookup
United Kingdom National DNA Database wikipedia , lookup
Nucleic acid double helix wikipedia , lookup
No-SCAR (Scarless Cas9 Assisted Recombineering) Genome Editing wikipedia , lookup
Extrachromosomal DNA wikipedia , lookup
History of genetic engineering wikipedia , lookup
Human genome wikipedia , lookup
Deoxyribozyme wikipedia , lookup
Cell-free fetal DNA wikipedia , lookup
Cre-Lox recombination wikipedia , lookup
Molecular Inversion Probe wikipedia , lookup
Non-coding DNA wikipedia , lookup
Therapeutic gene modulation wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Genome evolution wikipedia , lookup
Genome editing wikipedia , lookup
Pathogenomics wikipedia , lookup
DNA sequencing wikipedia , lookup
Helitron (biology) wikipedia , lookup
Epigenomics wikipedia , lookup
Site-specific recombinase technology wikipedia , lookup
Human Genome Project wikipedia , lookup
Bisulfite sequencing wikipedia , lookup
Exome sequencing wikipedia , lookup
Whole genome sequencing wikipedia , lookup
Genomic library wikipedia , lookup
ChIP-seq data analysis Harri Lähdesmäki Department of Computer Science Aalto University January 8, 2015 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Motivation: transcription factor binding site (TFBS) prediction ! ! ! ! Last time we studied computational methods for predicting transcription factor (TF) binding sites To understand transcriptional regulation in general, we would like to analyze/predict TF binding sites in any cell type/biological condition/. . . A critical limitation of most of the standard TFBS prediction methods is that they are condition independent That is, TF binding predictions would be the same for all cell types/biological conditions/. . . ! ! How can one model and explain variation between hundreds of different cell types? Some methods do exist which make condition dependent binding predictions but typically those methods require some condition dependent data anyway ! We will get back to these later . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ChIP-seq ! So, for any given condition, how do we find the genomic locations where DNA binding proteins bind? ! Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is the current state-of-the-art method ! The basic principle: ! ! Use a specific antibody to detect a protein of interest (or a post-translationally modified histone protein in nucleosomes) ChIP-seq procedure enriches DNA fragments that are bound to these proteins or nucleosomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ChIP-seq protocol ! ChIP-seq steps: ! Crosslink DNA-binding proteins with DNA in vivo Shear the chromatin into small fragments (200-600 bp) amenable for sequencing (sonication) Immunoprecipitate the DNA-protein complex with a specific antibody Reverse the crosslinks Assay enriched DNA to determine the sequences bound by the protein of interest ! ! ! ! Figure from (Visel et al., 2009) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ChIP-seq protocol again Biometrics, March 2011 152 REVI Figure from (Zhang 2011)to its in. vivo Figure 1. Details of a ChIPseq experiment. A DNA-binding proteinetis al., cross-linked . genomic . . . .DNA . .targets, . . . and . . . . . the chromatin (a complex of DNA and protein) is isolated (1). The DNA with the bound. proteins is. extracted from the cells, . . . . . . . . . . . . . . . a Roche/454, Life/APG, Polonator and is sheared by sonication into fragments of average length 500–1,000 bp (2). DNA fragments that are cross-linked to the Emulsion PCR protein of interest are enriched by immunoprecipitation with an antibody that specifically binds that protein (3–4). After the One DNA bead. Clonal tothe thousands copies occurs in microreactors in an emulsion immunoprecipitation step,molecule the DNA per is separated from amplification the protein (5), resultingof suspension of IP-enriched DNA is size selected on a gel, and only smaller fragments (e.g., 100–300 bp) are retained (6). Then, the size-selected, IP-enriched DNA is sequenced to generate millions of short reads, each of which represents or end (7–8). (In an alternative PCReither a fragment start Break Template “paired end” experiment that is rarely used for ChIP-seq, a read isamplification generated from each emulsion end of each DNA fragment.) After dissociation read sequences are aligned to a reference genome, read positions can be used to infer binding site positions. (8) shows two binding sites, with the right-hand site more enriched in DNA fragments. Fragments that do not align with a binding site reflect biases like nonspecific immunoprecipitation, misalignment, etc. This figure appears in color in the electronic version of this article. . . . . . . . 100–200 NGS sequencing – recap ! immunoprecipitated “treatment” sample, and then using an Primer, template, analysis method that considers the treatment profile reladNTPs and polymerase tive to the control profile (Kharchenko et al., 2008; Nix, Courdy, and Boucher, 2008; Rozowsky et al., 2009). Control data can be used to help identify false positives, assess numerical background models, and estimate a threshb Illumina/Solexa Template immobilization Solid-phase amplification One DNA molecule per cluster old for segmenting a read density or enrichment profile in order to identify a subset of significantly enriched regions. Analysis methods are described as “two-sample” when a control data set is available and as “single sample” when only treatment data are available. As with ChIPchip, there are various ways to generate control samples, Chemica linked to c Helicos BioSciences: one-pass sequ Single molecule: primer immobilize Cluster growth Sample preparation DNA (5 µg) Template dNTPs and polymerase 100–200 million molecular clusters Bridge amplification Billions of primed, single-molecule t Figure from (Metzker, 2010) d Helicos BioSciences: two-pass sequencing e Pacific Biosciences, Life/Visigen, LI-COR Biosciences Single molecule: template immobilized Single molecule: polymerase immobilized . Billions of primed, single-molecule templates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Thousands of primed, single-molecule templates Figure 1 | Template immobilization strategies. In emulsion PCR (emPCR) (a), a reaction mixture consisting Nature Review NGS sequencing – recap ! Sequencing by synthesis, R E V I E W Sresulting in from millions to billions of reads a Illumina/Solexa — Reversible terminators F F F G F C c Helicos BioSciences — Reversible terminators F F F T F A C F G Incorporate all four nucleotides, each label with a different dye C F C F G F T F F G F C G T F F F Wash, fourcolour imaging Cleave dye and terminating groups, wash C G A C G C G T G A F G A F G A F T A Incorporate single, dye-labelled nucleotides T C F C F F C F F C G G C F G F G Each cycle, add a different dye-labelled dNTP T C F F F G G F C F F F C C C G C C G C F G Wash, onecolour imaging T C Cleave dye and inhibiting groups, cap, wash Repeat cycles G Repeat cycles b d C A T G C T A G C T A G Top: CATCGT Bottom: CCCCCC Top: CTAGTG Bottom: CAGCTA . . . . . Figure from (Metzker, 2010) . . . . . . . . . . . . . . . Figure 2 | Four-colour and one-colour cyclic reversible termination methods. a | The four-colour cyclic reversible 23,101 . . . . 3`-O-azidomethyl . . . . . reversible . . . terminator . . . .chemistry . . (BOX . 1) .using termination (CRT) method uses Illumina/Solexa’s Nature Reviews | Genetics solid-phase-amplified template clusters (FIG. 1b, shown as single templates for illustrative purposes). Following imaging, a cleavage step removes the fluorescent dyes and regenerates the 3`-OH group using the reducing agent tris(2-carboxyethyl)phosphine (TCEP)23. b | The four-colour images highlight the sequencing data from two clonally amplified templates. c | Unlike Illumina/Solexa’s terminators, the Helicos Virtual Terminators33 are labelled with the same dye and dispensed individually in a predetermined order, analogous to a single-nucleotide addition method. Following total internal reflection fluorescence imaging, a cleavage step removes the fluorescent dye and inhibitory groups using TCEP to permit the addition of the next Cy5-2`-deoxyribonucleoside triphosphate (dNTP) analogue. The free sulphhydryl groups are then capped with iodoacetamide before the next nucleotide addition33 (step not shown). d | The one-colour images highlight the sequencing data from two single-molecule templates. Short read alignment – recap One-base-encoded probe ! An oligonucleotide sequence in which one interrogation base is associated with a particular dye (for example, A in the first position corresponds to a green dye). An example of a one-base degenerate probe set is ‘1-probes’, which indicates that the first nucleotide is the interrogation base. The remaining bases consist of either degenerate (four possible bases) or universal bases. Caenorhabditis elegans genome. From a single HeliScope run using only 7 of the instrument’s 50 channels, approximately 2.8 Gb of high-quality data were generated in 8 days from >25-base consensus reads with 0, 1 or 2 errors. Greater than 99% coverage of the genome was reported, and for regions that showed >5-fold coverage, the consensus accuracy was 99.999% (J. W. Efcavitch, personal communication). Key first steps in ChIP-seq data analysis include ! ! adapter removal, quality control, etc. mapping short reads to a reference genome Sequencing by ligation. SBL is another cyclic method that differs from CRT in its use of DNA ligase35 and either one-base-encoded probes or two-base-encoded probes. In its simplest form, a fluorescently labelled probe hybridizes to its complementary sequence adjacent to the primed template. DNA ligase is then added to join the dye-labelled probe to the primer. Non-ligated probes are washed away, followed by fluorescence 36 | JANUARY 2010 | VOLUME 11 www.nature.com/reviews/genetics ! Alignment is conceptually simple: given each of the short reads at a time, align the read locally to a reference genome using e.g. Smith-Waterman algorithm, BLAST or even exact string matching ! Given a large number reads generated in a single experiment, S-W and BLAST would be too slow in practice Several computationally more efficient methods have been developed, which use e.g. methods based on ! ! ! Hash tables Suffix/prefix trees (Burrows-Wheeler transform) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . REVIEWS Poisson model A probability distribution that is often used to model the number of random events in a fixed interval. Given an average number of events in the interval, the probability of a given number of occurrences can be calculated. alignment, as all subsequent results are based on the aligned reads. Owing to the large number of reads, the use of conventional alignment algorithms can take hundreds or thousands of processor hours; therefore, a new generation of aligners has been developed57, and more are expected soon. Every aligner is a balance between accuracy, speed, memory and flexibility, and no aligner can be best suited for all applications. Alignment for ChIP–seq should allow for a small number of mismatches due to sequencing errors, SNPs and indels or the difference between the genome of interest and the reference genome. This is simpler than in RNA–seq, for example, in which large gaps corresponding to introns must be considered. Popular aligners include: Eland, an efficient and fast aligner for short reads that was developed by Illumina and is the default aligner on that platform; Mapping and Assembly with Qualities (MAQ)58, a widely used aligner with a more exhaustive algorithm and excellent capabilities for detecting SNPs; and Bowtie59, an extremely fast mapper that is based on an algorithm that was originally developed for file compression. These methods use the quality score that accompanies each base call to indicate its reliability. For the SOLiD di-base sequencing technology, in which two consecutive bases are read at a time, modified aligners have been developed60,61. Many current analysis pipelines discard non-unique tags, but studies involving the repetitive regions of the genome27,62–64 require careful handling of these non-unique tags. Strand specificity and read density visualization ! A “data view” of protein-DNA binding Protein or nucleosome of interest 5 Positive strand 3 3 Negative strand 5 5 ends of fragments are sequenced Identification of enriched regions. After sequenced reads are aligned to the genome, the next step is to identify regions that are enriched in the ChIP sample relative to the control with statistical significance. Several ‘peak callers’ that scan along the genome to identify the enriched regions are currently available24,26,38,48,65–70. In early algorithms, regions were scored by the number of tags in a window of a given Short reads size and then assessed by a set of criteria based on facare aligned tors such as enrichment over the control and minimum tag density. Subsequent algorithms take advantage of the directionality of the reads71. As shown in FIG. 5, the fragments are sequenced at the 5` end, and the locaDistribution of tags Reference genome tions of mapped reads should form two distributions, is computed one on the positive strand and the other on the negative strand, with a consistent distance between the peaks of the distributions. In these methods, a smoothed profile of each strand is constructed65,72 and the combined profile is calculated either by shifting each distribution Profile is generated from combined tags towards the centre or by extending each mapped position For example, each mapped Peak identification into an appropriately oriented ‘fragment’ and then location is extended can be performed adding the fragments together. The latter approach with a fragment of on either profile estimated size should result in a more accurate profile with respect to the width of the binding, but it requires an estimate of the fragment size as well as the assumption that Fragments are added fragment size is uniform. Given a combined profile, peaks can be scored in several ways. A simple fold ratio of the signal for the ChIP sample relative to that of the control sample around the peak (FIG. 3B) provides important information, but it is not adequate. A fold ratio of 5 estimated from 50 and 10 tags (from the ChIP and control experiments, respecFigure 5 | Strand-specific profiles at enriched sites. DNA fragments from a . | Genetics . . . tively) . . has . a different . . . statistical . . . significance . . . to the.same. Nature Figure experiment from (Park, 2009) chromatin immunoprecipitation are sequenced from theReviews 5` end. ratio estimated from, for example, 500 and 100 tags. Therefore, the alignment of these tags to the genome results in two peaks . (one . on . . . A.Poisson . . model . . for. the. tag . distribution . . . is. an effective . . each strand) that flank the binding location of the protein or nucleosome of interest. approach that accounts for the ratio as well as the absoThis strand-specific pattern can be used for the optimal detection of enriched 27 lute tag numbers , and it can also be modified to account regions. To create an approximate distribution of all fragments, each tag location can for regional bias in tag density due to the chromatin be extended by an estimated fragment size in the appropriate orientation and the structure, copy number variation or amplification bias67. number of fragments can be counted at each position. 676 | O CTOBER 2009 | VOLUME 10 . . . www.nature.com/reviews/genetics Single-end vs. paired-end sequencing )''0DXZd`ccXeGlYc`j_\ijC`d`k\[%8cci`^_kji\j\im\[ ! Sequencing by synthesis always proceeds from 5’ to 3’ ! So-called single-end sequencing technology sequences only one of the ends of a DNA fragment ! Either reverse or forward strand is sequenced ! Current used sequencing technologies can read only a limited number of nucleotides from the 5’ end, e.g., from 36 to 200 nucleotides ! Because DNA fragments have size variation, we do not get information about the whole fragment ! Paired-end sequencing chemistry, where both end of a DNA fragment is sequenced, is becoming more popular . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Identification of binding sites from ChIP-seq data ! Given a read density (or read densities on both strands) along genome, the actual data analysis task involves identification of the protein binding sites ! Given the above information, we should expect to see two “signal peaks” on opposite DNA strands within a proper distance ! But how much signal (how many reads) are considered enough to call a protein-DNA interaction site? What affects the signal strength? ! ! ! ! ! ! Protein binding in the first place Sequencing efficiency and depth Chromatin accessibility Fragmentation efficiency Mappability (i.e., uniqueness) of a local genomic region . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ChIP-seq controls ! The best way to assess significance of a signal at putative binding sites is to use a control for ChIP-seq ! ! Input-DNA: sequencing data of the genomic DNA. More/Less accessible regions are sequenced more/less ChIP-seq experiment with an unspecific antibody which does not detect any specific protein ! ChIP-seq controls can be used to account for many of the biases which affect the signal strength ! Input-DNA is currently considered the best control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . R E V I E WDetecting S A ! binding sites from ChIP-seq data Early methods used a single cut-off for signal strength or a Ba Not statistically significant log-fold enrichment (for a given putative binding Enrichment ratio: 1.5 region/window) 15 ChIP A Considering all statistically significant peaks Fraction of peaks recovered Considering all statistically significant peaks Fraction of peaks recovered score = log # ChIP-seqControl reads 10in a window # Input DNA reads in a window Bb Statistically significant 1 Ba Considering only peaks with fold enrichment 0above a threshold Enrichment ratio: 4 Not statistically significant Enrichment ratio: 1.5 0 1 Considering only peaks with fold enrichment above a threshold ChIP 15 Control 10 ChIP Control 20 5 Enrichment ratio: 1.5 150 100 Figure1 from (Park, 2009) Fraction of reads sampled from the data Bb Statistically significant Figure ! 3 | Depth of sequencing. A | To determine whether enough tags have been sequenced, a simulation can be carried out to characterize the fraction of the peaks that would be recovered if a smaller numberNature of tags had been Reviews | Genetics Enrichment Enrichment sequenced. In many cases, new statistically significant are discovered at a steady rate with an increasing number ratio:peaks 4 ratio: 1.5 of tags (solid curve) — that is, there is no saturation of binding sites. However, when a minimum threshold is imposed for the enrichment ratio between chromatin (ChIP)150 and input DNA peaks, the rate at which new ChIP immunoprecipitation 20 peaks are discovered slows down (dashed curve) — that is, saturation of detected binding sites can occur when only sufficiently prominent binding positions are considered. For a given data set, multiple curves corresponding to . . . . . . . . . . . . . . 5 the threshold at which the curve. becomes different thresholds can be examined to identify sufficiently flat to meet .the . Control 100 . . . . . . . . . . . . . . . . . . desired saturation criteria (defined by the intersection of the orange lines on the graph). We refer to such a threshold as 0 the minimum saturation1 enrichment ratio (MSER). The MSER can serve as a measure for the depth of sequencing 0 Fraction of reads sampled from in thea data achieved data set: a high MSER, for example, might indicate that the data set was undersampled, as only the more prominent peaks were saturated (see REF. 48 for details). Ba | A peak that is not statistically significant — the Figure 3 | Depth of sequencing. A | To determine whetherthe enough been sequenced,isalow simulation enrichment ratio between ChIPtags and have control experiments (1.5) andcan thebe number of tag counts (shown under carried out to characterize the the fraction of the peaks that would be recovered if a smaller number of tags had been Nature Reviews | Genetics peaks) is also low. Bb | Two ways in which a peak can be statistically significant. On the left, although the number of sequenced. In many cases, newtag statistically peaks are discovered at a steady rate with an increasing number counts issignificant low, the enrichment ratio between the ChIP and control experiments is high (4). On the right, the peaks of tags (solid curve) — that is, there no same saturation of binding sites. However, when a minimum threshold imposed haveisthe enrichment ratio as those in a but have a larger number of is tag counts;for this example shows that the enrichment ratio between chromatin (ChIP) andprominent input DNApeaks peaks,becoming the rate atstatistically which newsignificant and that there might not continuedimmunoprecipitation sequencing might lead to less peaks are discovered slows down (dashed curve) — that is, point saturation of detected binding sites can occur when only Figure S6. Workflow chart of MACS. ! necessarily be a saturation after which no further binding sites are discovered. sufficiently prominent binding positions are considered. For a given data set, multiple curves corresponding to different thresholds can be examined to identify the threshold at which the curve becomes sufficiently flat to meet the desired saturation criteria (defined by the intersection of the orange lines on the graph). We refer to such a threshold as the minimum saturation enrichment Thecontinued MSER can serve a measure for the depth sequencing ! ratio more and (MSER). more sites to beasfound at a steady largeofsample size increases the statistical power and achieved in a data set: a high MSER, for example, might indicate that the data set was undersampled, as only the more pace with additional sequencing (FIG. 3A, lower curve). causes features that have small effect sizes to attain staprominent peaks were saturated (see REF. 48 for details). Ba | A peak that is not statistically significant — the In another study 38, human RNA polymerase II targets tistical significance. In the study discussed above48, we enrichment ratio between the ChIP and control experiments is low (1.5) and the number of tag counts (shown under wereinshown saturate but forsignificant. signal transducer that each ChIP–seq data set could be annothe peaks) is also low. Bb | Two ways whichto a peak can quickly, be statistically On the left,proposed although the number of and activator ofthe transcription 1 (STAT1), the number tated a minimal tag counts is low, the enrichment ratio between ChIP and control experiments is high (4). On thewith right, the peakssaturated enrichment ratio (MSER) rise steadily. that,example — a shows point that at which saturation occurs — to give a sense have the same enrichment ratioofastargets those incontinued a but haveto a larger number This of tagsuggests counts; this continued sequencing might lead to lessin prominent peaksthere becoming statistically there might not at least some cases, might not be asignificant satura- and of that the sequencing depth achieved. We also found that necessarily be a saturation point after which no further are discovered. tion point that can bebinding used tosites determine the number there is a linear relationship between the number of Current state-of-the-art methods are probabilistic in nature . . . . . Model-based Analysis of ChIP-Seq (MACS) A commonly used method for detecting TF binding sites from ChIP-seq data: MACS (Zhang et al, 2008) Workflow: of tags to be sequenced if peaks are found based on reads and the MSER, when properly scaled. This makes statistical significance. it possible to predict how many more reads are needed However, does exist a fixed the when a particular more and more sites continued to be foundaatsaturation a steady point large sample sizeifincreases statistical powerlevel and of MSER is desired. Although threshold is imposed on the fold enrichment between these concepts and tools should be tested on more pace with additional sequencing (FIG. 3A, lower curve). causes features that have small effect sizes to attain stathe peaks in theIIChIP experiment the peaksIn inthe thestudy data sets, they provide 48 In another study 38, human RNA polymerase targets tisticaland significance. discussed above , wea framework for understanding — that is, saturation when depth-of-sequencing issues in ChIP–seq experiments. were shown to saturate quickly,control but for experiment signal transducer proposed thatoccurs each ChIP–seq data set could be annoonly prominent peaks (as defined by minimum fold and activator of transcription 1 (STAT1), the number tated with a minimal saturated enrichment ratio (MSER) enrichment) are considered. Whenat all peaks are occurs Multiplexing. For small genomes, including those of of targets continued to rise steadily. This suggests that, — a point which saturation — to give a sense even peaks with small enrichment Saccharomyces cerevisiae, at least in some cases, thereconsidered, might not be a saturaof the sequencing depthcan achieved. We also found that Caenorhabditis elegans and become statistically significant as more tags accumulate D. melanogaster, theofnumber of reads generated in tion point that can be used to determine the number there is a linear relationship between the number (FIG. 3B) therefore the number of significant peaksproperly a sequencing unitmakes (for example, one of eight lanes on of tags to be sequenced if peaks are and found based on reads and the MSER, when scaled. This may continue to rise with more sequencing. This is an Illumina Genome Analyzer) may be several times statistical significance. it possible to predict how many more reads are needed similar whatifhappens genome-wide association greater than the number of reads needed to provide However, a saturation point doestoexist a fixed in when a particular level of MSER is desired. Although studies and other genomic these investigations which sufficient coverage of the genome at a suitable depth threshold is imposed on the fold enrichment between concepts in and toolsashould be tested on more the peaks in the ChIP experiment and the peaks in the data sets, they provide a framework for understanding control experiment — that is, saturation occurs when depth-of-sequencing in ChIP–seq . . Figure issues from (Zhang et al.,experiments. 2008) . . . . . . . . . . . . . . . 674 |prominent O CTOBER 2009 10 by minimum fold . . . www.nature.com/reviews/genetics . . . . . . . . . . . . . . . only peaks| VOLUME (as defined enrichment) are considered. When all peaks)''0DXZd`ccXeGlYc`j_\ijC`d`k\[%8cci`^_kji\j\im\[ are Multiplexing. For small genomes, including those of considered, even peaks with small enrichment can Saccharomyces cerevisiae, Caenorhabditis elegans and become statistically significant as more tags accumulate D. melanogaster, the number of reads generated in (FIG. 3B) and therefore the number of significant peaks a sequencing unit (for example, one of eight lanes on may continue to rise with more sequencing. This is an Illumina Genome Analyzer) may be several times . . . . . Model-based Analysis of ChIP-Seq (MACS) ! Find model peaks: ! ! mfoldlow and mfoldhigh : high confidence fold-enrichment parameters MACS slides 2 × bandwidth windows across the genome to find regions with tags (=reads) more than mfoldlow (but smaller than mfoldhigh to avoid artefacts) enriched relative to a random tag genome distribution ! bandwidth = assumed sonicated fragment size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Model-based Analysis of ChIP-Seq (MACS) Model the shift size of ChIP-seq tags 0.5 0.4 0.3 Tag percentage (%) 0.0 0.1 0.2 0.4 0.3 0.2 0.1 100 200 300 −200 −100 0 100 2.8 . . . . . . . . . . . . . . . 2.7 . . Average tag number in control / kb 500 . . . . . 200 Location with respect to FKHR motif (bp) (d) 600 Figure from (Zhang et al., 2008) −300 2.6 0 2.5 −100 2.4 Tag percentage (%) Watson tag Crick tags d = 122 bp 0.0 −200 Location with respect to the center of Watson and Crick peaks (bp) (c) Volume 9, Issue 9, Article R137 0.6 0.5 Watson tags Crick tags d = 126 bp −300 400 ! 300 ! Take 1000 high-quality peaks randomly Separate Watson and Crick tags http://genomebiology.com/2008/9/9/R137 Genome Biology 2008, Aligns the tags by the mid point Find d: distance between the modes of the Watson and Crick peaks in the alignment (a) (b) 200 ! 100 ! Control tag number (normalized) / 10 kb ! . . . . . . . . . . . . . . . . . . 300 Model-based Analysis of ChIP-Seq (MACS) ! Shift all the tags by d/2 toward the 3’ ends to the most likely protein-DNA interaction sites ! Remove redundant tags: ! ! ! ! ! Sometimes the same tag can be sequenced repeatedly, more than expected from a random genome-wide tag distribution Such tags might arise from biases during ChIP-DNA amplification and sequencing library preparation (PCR duplicates) These are likely to add noise to the final peak calls MACS removes duplicate tags in excess of what is warranted by the sequencing depth (binomial distribution p-value < 10−5 ) For example, for the 3.9 million ChIP-seq tags, MACS allows each genomic position to contain no more than one tag and removes all the redundancies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Model-based Analysis of ChIP-Seq (MACS) ! Identifying the most likely binding sites ! ! For experiments with a control, MACS linearly scales the total control tag count to be the same as the total ChIP tag count Counting process: if reads were sampled independently from a population with given, fixed fractions of genomic locations, the read counts x in each genomic location/window would follow a multinomial distribution (binomial for a single location), which can be approximated by the Poisson distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Multinomial and binomial distributions ! n independent trials ! Each trial leads to a success for exactly one of k categories ! Each category has a fixed success probability pk ! Multinomial gives probability of any particular combination of numbers of successes for the k categories ! Binomial is a special case of 2 categories; can be approximated with a Poisson for larger n . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Model-based Analysis of ChIP-Seq (MACS) ! Let x denote the number of sequencing reads in a window (genomic region) ! Analysing each genomic window independently, if all genomic regions would be equally likely, then λxBG −λBG x ∼ Poisson(·|λBG ) = e , x! x = 0, 1, 2, . . . where λBG is the rate of observing reads in the control sample ! Because ChIP-seq data has several bias sources which vary across the genome, it is better to model the data using a local/dynamic Poisson λlocal = max(λBG , [λ1K ], λ5K , λ10K ) ! λX K is estimated from the X K window of the control sample (e.g. input-DNA) centered at a specific location . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Model-based Analysis of ChIP-Seq (MACS) Assessing statistical significance of x reads (in a genomic region i) using hypothesis testing ! ! The p-value is the probability of observing x many reads or more, assuming the null hypothesis is true: p − value = k=x The location with the highest fragment pileup (summit) is http://genomebiology.com/2008/9/9/R137 Genome Biology 2008, predicted as the precise binding location The ratio between the ChIP-seq tag count and λlocal is reported as the fold enrichment . . . . . . . . . . . . Tag percentage (%) 0.3 Tag percentage (%) . . . . bp. d .= 122 . . . 0.0 0.0 The read count in ChIP versus control in 10 kb windows across the genome. Each dot represents a 10 kb window; red dots are windows containing ChIP peaks and black (d) dots are (c) windows containing control peaks used for FDR calculation. −300 −200 −100 0 100 200 300 −300 −200 −100 2.7 Average tag number in control / kb 200 400 600 800 1,000 −20 −10 FoxA1 ChIP−Seq tag number / 10 kb 70 Spatial resolution for FoxA1 Ch . . . . . . . . . . . . . . . Spatial resolution (nt) . . MACS Without local lambda Without tag shifting 45 MACS Without local lambda Without tag shifting 65 60 10 (f) Figure from (Zhang et al., 2008) . 0 Distance to FoxA1 peak center (k Motif occurrence in peak centers for FoxA1 ChIP-Seq 55 100 2.3 0 FKHR motif (%) 0 Location with respect to FKHR motif 2.8 600 500 400 300 200 100 0 Control tag number (normalized) / 10 kb Location with respect to the center of Watson and Crick peaks (bp) (e) . 0.1 0.1 0.2 Model-based Analysis of ChIP-Seq (MACS) . . . . . . . . 0.5 . 0.4 . 0.3 Watson tags . Crick . . tags . . . 0.4 . . 0.2 d = 126 bp 0.6 (b) 0.5 (a) ! Volume 9, Issue 9, Ar 2.6 ! Poisson(x|λlocal ) 2.5 ! ∞ ! 2.4 ! H0 : the ith location is not a binding site H1 : the ith location is a binding site . . . . . . . . . . . . 40 ! . . . . . . . . . . ChIP-seq peak: Illustration ! An illustration of a strong TF binding site Figure from http://www.nature.com/nmeth/journal/v6/n4/images/nmeth.f.247-F2.jpg . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Multiple testing ! Multiple testing problem occurs when a test needs to be applied many times ! Nominal p-values are valid for a single test ! Hypothesis testing will lead to many false positives if the p-values are not corrected for multiple testing ! Multiple testing is an issue in many bioinformatics application e.g. ! ! ! ! ! Differential gene expression analysis TF binding site prediction Detection of protein binding sites from ChIP-seq Detecting disease associated genomic variant ... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Multiple testing ! Type I error ! ! ! Null hypothesis H0 is true but it is rejected in favour of H1 E.g.: a genomic region i is not a binding site but it is declared as a binding site Assume m independent tests for which the null hypothesis is true, then the probability that any of the hypothesis will be rejected is α = 1 − (1 − α)m i.e., the probability of making one or more type I errors ! This is also called the family-wise error rate (FWER) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bonferroni correction (1) (m) ! Let H0 , . . . , H0 be a collection of hypotheses and p1 , . . . , pm the corresponding p-values ! Let I0 be the subset of the m0 = |I0 | (unknown) true null hypotheses ! Recall the Bonferroni correction: ! ! Given the original significance level α and the number of statistical tests m, then Bonferroni correction will reject only those null hypothesis i for which pi ≤ α/m For the Bonferroni correction FWER ≤ α because ⎛ ⎞ $ α⎠ ! ' α( α ⎝ FWER = P pi ≤ ≤ P pi ≤ = m0 ≤= α m m m I0 ! I0 The Bonferroni correction is conservative . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . False discovery rate ! False discovery rate (FDR) is defined as the proportion of false positives among all positives #false positives #false positives + #true positives Formally FDR is defined as the expectation of the above quantity The Benjamini-Hochberg (BH) step-up procedure is commonly used in bio applications Let α be given and p(1) , p(2) , . . . , p(m) be the ordered (from smallest to largest) list of the m p-values, then the BH procedure works as follows FDR = ! ! ! 1. Find the largest k such that p(k) ≤ 2. Then reject all H(i) for i = 1, . . . , k ! k mα For BH, the probability of expected proportion of false positives ≤ α . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Multiple correction in MACS ! For a ChIP-seq experiment with controls, MACS empirically estimates the false discovery rate (FDR) ! At each p-value, MACS uses the same parameters to find ChIP peaks over control and control peaks over ChIP (that is, a sample swap) ! The empirical FDR is defined as empirical FDR = #control peaks #ChIP peaks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Differential binding ! MACS can also be applied to differential binding between two conditions by treating one of the samples as the control ! Differential binding analysis in MACS will only work with two samples, i.e., one biological replicate per condition ! Empirical FDR control will not work in such a scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Summary ! ChIP-seq is a powerful way to detect TF binding sites (or e.g. enriched regions of histone modifications; next lecture) ! ChIP-seq approaches are limited in that ! ! Only a subset of all TFs have a chip-grade antibody A single experiment will profile a single protein . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References ! ! ! ! ! Metzker ML (2010) Sequencing technologies - the next generation, Nat Rev Genet. 11(1):31-46. Park PJ (2009) ChIP-seq: advantages and challenges of a maturing technology, Nat Rev Genet. 10(10):669-80. Axel Visel, Edward M. Rubin & Len A. Pennacchio (2009) Genomic views of distant-acting enhancers,” Nature 461, 199-205. Zhang Y et al. (2008), Model-based analysis of ChIP-Seq (MACS), Genome Biol. 9(9):R137. Zhang X et al. (2011) PICS: probabilistic inference for ChIP-seq, Biometrics, 67(1): 151-163. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .