* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download RNA-Seq Analysis Practicals
Pathogenomics wikipedia , lookup
Long non-coding RNA wikipedia , lookup
Whole genome sequencing wikipedia , lookup
Minimal genome wikipedia , lookup
Genomic library wikipedia , lookup
Metagenomics wikipedia , lookup
Human Genome Project wikipedia , lookup
Epigenetics of neurodegenerative diseases wikipedia , lookup
Human genome wikipedia , lookup
Polycomb Group Proteins and Cancer wikipedia , lookup
Genome evolution wikipedia , lookup
Genomic imprinting wikipedia , lookup
Transgenerational epigenetic inheritance wikipedia , lookup
Oncogenomics wikipedia , lookup
Epigenetics of human development wikipedia , lookup
Epitranscriptome wikipedia , lookup
Epigenetics wikipedia , lookup
Epigenetic clock wikipedia , lookup
Behavioral epigenetics wikipedia , lookup
Cancer epigenetics wikipedia , lookup
Secreted frizzled-related protein 1 wikipedia , lookup
Epigenetics of depression wikipedia , lookup
Epigenomics wikipedia , lookup
Nutriepigenomics wikipedia , lookup
Epigenetics in stem-cell differentiation wikipedia , lookup
Epigenetics of diabetes Type 2 wikipedia , lookup
DNA methylation wikipedia , lookup
Visualising and Exploring BS-Seq Data Simon Andrews [email protected] @simon_andrews 1 Starting Data Read 1 Read 2 Read 3 Genome L001_bismark_bt2_pe.deduplicated.bam CHG_OB_L001_bismark_bt2_pe.deduplicated.txt.gz CHG_OT_L001_bismark_bt2_pe.deduplicated.txt.gz CHH_OB_L001_bismark_bt2_pe.deduplicated.txt.gz CHH_OT_L001_bismark_bt2_pe.deduplicated.txt.gz CpG_OB_L001_bismark_bt2_pe.deduplicated.txt.gz CpG_OT_L001_bismark_bt2_pe.deduplicated.txt.gz 2 Starting Data • Mapped Reads – BAM file format – Contains positions of mapped reads – Useful for looking at coverage biases • Methylation Calls – Delimited text file – Individual calls for single cytosines – Useful for looking at methylation patterns 3 Checking Coverage 4 Coverage Outliers Around 600x average genome density 5 Coverage Outliers 6 Coverage Outliers • Normally the result of mis-mapping repetitive sequences not in the genome assembly • Centromeric / temomeric sequences are common • Can be a significant proportion of all data • Can throw off calculations of overall methylation • Should be removed 7 Which data to use? • Methylation contexts – CpG: Only generally relevant context for mammals – CHG: Only known to be relevant in plants – CHH: Generally unmethylated • Methylation strands – CpG methylation is generally symmetric – Normally makes sense to merge OT / OB strands 8 Quantitating Methylation Total methylated calls = 15 Total unmethylated calls = 10 Methylation level = (15/(15+10))*100 = 60% 9 Quantitating Methylation Total methylated calls = 20 Total unmethylated calls = 5 Methylation level = (20/(20+5))*100 = 80% 10 Quantitating Methylation Total methylated calls = 12 Total unmethylated calls = 14 Methylation level = (12/(12+14))*100 = 46%? 11 Quantitating Methylation 100% 100% 100% 100% 0% Methylation level = (100+100+100+100+0)/5 = 80% 12 Quantitating Methylation 100% 0% 100% 100% 0% 0% 100% 0% 100% 0% Methylation level = (100*5)+(0*5)/10 = 50% ? Methylation level = (100*5)/5 = 100% 13 Quantitating Methylation 57% 33% Common = 300/6 = 50% in both 14 More Complex Methods • Factors to consider during quantitation – Surrounding methylation levels • Assume that methylation doesn’t change over very short distances – Coverage of individual bases • Down-weight very low or high values – Density of CpGs • Apply the level to all bases within a region 15 Where to make measures • Per base – Very large number of measures – Poor accuracy for individual bases • Unbiased windows – Tiled over whole genome – Need to decide how they will be defined • Targeted regions – Which regions – What context 16 Unbiased analysis • What basis do you use for creating the windows? – Fixed size? • Uneven CpG distribution can be problematic – Fixed CpG count? • Different resolution • Need to tailor the window sizes to the data density • Can relate windows to features later 17 Fixed size windows 300 CpG content /5kb 150 0 Differences between replicates 1kb 2kb 5kb 10kb 25kb 50kb Must balance technical considerations and biology 18 Fixed CpG windows • More even noise profile – Fairer statistics – More comparable methylation values – Length differences are not very dramatic 50 CpG window lengths 19 Viewing quantitated methylation 20 Large sample numbers 21 Large sample numbers 22 Targeted Quantitation • Measure over features – CpG islands • Be careful where you get your locations – Promoters • Should probably split into CpG island and non-CpG island – Gene bodies • Filter by biotype to remove small RNA genes? 23 Viewing comparisons 24 Viewing comparisons 25 Viewing differences 26 Viewing differences 27 Trends • • • • Effects at individual loci can be subtle Want to find more generalised effect Collate information across whole genome Look for general trend 28 Considerations for trend plots • What features to use – Fixed vs relative scale • How much context – Variable scales • How to calculate base measures – What window size – Aligned vs unaligned windows • Missing values • Scale 29 Simple Example Methylation profile centred on CpG islands +/- 10kb 30 More complex example Methylation profile over genes +/- 5kb 31 Clustering 100 Percentage Methylation 90 80 70 60 Probe A 50 Probe B 40 Probe C 30 20 10 0 Sample 1 Sample 2 Sample 3 32 Clustering • Correlation Clustering – – – – Focusses on the differences between conditions Absolute values not important Look for similar trends Show median normalised values • Euclidean Clustering – Focusses on absolute differences between conditions – Look for similar levels – Show raw values 33 Clustering 34