Exploring the package TopHat-CuffDiff
Jean-François Taly
Bioinformatics Core Facilities
Group meeting, October 2nd 2012

RNAseq expression data analysis
1. TopHat for mapping reads to the reference
   – Reads directionality
2. CuffDiff for the differential enrichment
   – Statistics with version 2.0.0 or 2.0.1
3. Enrichment threshold
   – Which transcripts are present in mitochondria?

MitomiR project
[Diagram labels: miRNP?, miRNAs, mRNAs, PNPASE, mito proteins, regulation of mitochondrial translation]
Question 1: Are nuclear DNA-encoded miRNAs imported into mitochondria?
Slide from MitomiR_EU0183

MitomiR project
[Diagram labels: miRNP?, miRNAs, mRNAs, proteins, regulation of mitochondrial translation]
Question 2: Do miRNAs exist in the mitochondrial genome?
Slide from MitomiR_EU0183

One cell, two DNAs
Nucleus
- 23 chromosome pairs
- Human DNA: 2.9 billion base pairs
- Between 20,000 and 25,000 human protein-coding genes
- "Junk" DNA or non-coding DNA
- Non-coding functional RNA (tRNA, rRNA, miRNA…)
The human genome may encode over 1000 miRNAs, which may target about 60% of mammalian genes.
Mitochondria
- Circular DNA
- Human mitochondrial genome (mtDNA) = 16.6 kb
- 13 genes for subunits of respiratory complexes I, III, IV and V
- 22 genes for mitochondrial tRNA
- 2 genes for rRNA
* One mitochondrion can contain two to ten copies of its DNA
* Exceptions to the universal genetic code (UGC) in mitochondria
From Lung et al., 2006
Slide from MitomiR_EU0183

RNAseq libraries
• Short insert size: searching for miRNAs
  – No poly-A selection
  – No fragmentation
  – Size selected: 18-36 nt
  – Stranded
• Long insert size: searching for lncRNAs
  – No poly-A selection
  – Fragmented
  – Size selected: 200 nt
  – Stranded

2 Conditions
• Total fraction (tot)
  – Full cell lysate
• Mitochondrial fraction (mit)
  – RNA extracted from mitochondria

RNAseq expression data analysis
1. TopHat for mapping reads to the reference
   – Reads directionality
2. CuffDiff for the differential enrichment
   – Statistics with version 2.0.0 or 2.0.1
3. Enrichment threshold
   – Which transcripts are present in mitochondria?

Stranded RNAseq: Vocabulary
[Diagram: two DNA strands, 5’ to 3’ (Forward) and 3’ to 5’ (Reverse), each carrying a coding region]
• Forward = the strand whose 5’ end is closest to the centromere
• In human, 50% of the genes are coded on the forward strand
• Forward / Reverse = Plus / Minus
• Coding / Template = Sense / Anti-sense
http://www.biostars.org/post/show/3423/forward-and-reverse-strand-conventions/

Orientation of reads?
[Diagram: coding DNA and template DNA; transcription gives RNA; reverse transcription gives RNA plus cDNA; duplication gives coding DNA plus cDNA]
• First strand sequencing: dUTP, NSR, NNSR
• Second strand sequencing: Directional Illumina (Ligation), Standard SOLiD

Proper TopHat option?
--library-type:
• fr-unstranded: Default, Standard Illumina Reads
• fr-firststrand: dUTP, NSR, NNSR
• fr-secondstrand: Directional Illumina (Ligation), Standard SOLiD
We mapped the reads with both the unstranded (default) and the fr-secondstrand settings for comparison.

How can we evaluate directionality?
• Reads mapping to the forward (F) strand should align with genes coded on the F strand as well.
• Bitwise FLAG of the BAM file:
  – How many reads in forward?
      samtools view -c -F 16 accepted_hits.bam
  – How many reads in reverse?
      samtools view -c -f 16 accepted_hits.bam

                                   Total number of reads   Percentage of Forward Mapping (PFM)
  --library-type fr-secondstrand   173,219,584             55%
  default                          173,196,005             55%

How can we evaluate directionality? (2)
• Gene by gene
  – Bitwise FLAG + gene strand annotation

  default
                          Transcripts in the (+) strand   Transcripts in the (-) strand   Transcripts in both strands
  Number of transcripts   82,782                          80,648                          163,430
  Average PFM             77%                             24%                             51%
  Median PFM              92%                             1%                              55%

  --library-type fr-secondstrand
                          Transcripts in the (+) strand   Transcripts in the (-) strand   Transcripts in both strands
  Number of transcripts   82,868                          80,693                          163,561
  Average PFM             77%                             24%                             51%
  Median PFM              92%                             1%                              54%

A small number of genes received a huge number of mis-mapped reads!
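The Percentage of Forward Mapping (PFM) reported above comes from counting reads with and without bit 16 of the SAM FLAG. The sketch below is only an illustration of that bit test, not the pipeline used for these slides: the function names are hypothetical, and the input is assumed to be a list of integer FLAG values (for example the second column of `samtools view` output).

```python
REVERSE_STRAND_BIT = 0x10  # SAM FLAG value 16: read aligned to the reverse strand


def is_forward(flag: int) -> bool:
    """True when bit 16 is unset, i.e. what `samtools view -F 16` keeps."""
    return (flag & REVERSE_STRAND_BIT) == 0


def percent_forward_mapping(flags) -> float:
    """Percentage of records whose FLAG has the reverse-strand bit unset."""
    flags = list(flags)
    if not flags:
        raise ValueError("no reads given")
    forward = sum(1 for flag in flags if is_forward(flag))
    return 100.0 * forward / len(flags)


if __name__ == "__main__":
    # Hypothetical FLAG values: 0 and 256 are forward, 16 and 272 are reverse.
    example_flags = [0, 0, 16, 0, 16, 256, 272, 0]
    print(percent_forward_mapping(example_flags))  # 5 forward out of 8 records
```

Restricting the input to the reads overlapping one annotated transcript would give the per-gene PFM, which can then be compared with the strand given in the annotation.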
Example of mis-aligned reads
• AC097532.1: chr2:133038647-133038738
  – miRNA automatically annotated in E67 but retired from E68;
  – the CIGAR string of some reads spans 26 kb;
  – 11,000,115 reads mapped (6% of total);
  – 8,205,667 mapped to position 133,038,644;
  – NCBI BLAST of the major sequence:
    • hit on the opposite strand but with 100% coverage and 100% identity to the 28S ribosomal RNA.

RNAseq expression data analysis
1. TopHat for mapping reads to the reference
   – Reads directionality
2. CuffDiff for the differential enrichment
   – Statistics with version 2.0.0 or 2.0.1
3. Enrichment threshold
   – Which transcripts are present in mitochondria?

CuffDiff needs a special GTF
• CuffDiff needs a GTF with the 2 following tags:
  – tss_id: The ID of this transcript's inferred start site.
  – p_id: The ID of the coding sequence this transcript contains.
• You can produce a compatible GTF with CuffCompare:
    cuffcompare -s /path/to/genome_seqs.fa -CG -r annotation.gtf annotation.gtf

Effect of CuffCompare
[Figure: CuffCompare + CuffDiff V2.0.2 compared with CuffDiff V2.0.2 alone]

Effect of CuffDiff Version
[Figure: CuffDiff V2.0.2 compared with CuffDiff V2.0.1]

Highly sensitive statistics
• Reproducibility?
• Version effect?
• CuffCompare effect?
• Genome annotation effect?
From 902 differentially expressed genes with V2.0.1, we went to 15 with V2.0.2!!!

RNAseq expression data analysis
1. TopHat for mapping reads to the reference
   – Reads directionality
2. CuffDiff for the differential enrichment
   – Statistics with version 2.0.0 or 2.0.1
3. Enrichment threshold
   – Which transcripts are present in mitochondria?
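Section 3 works with the ratio log2(FPKMtot/FPKMmit). A minimal sketch of that computation follows, under stated assumptions: the function names are hypothetical, non-numeric CuffDiff entries such as HIDATA or NOTEST are returned as None (the back-up table reports 0.00 for HIDATA rows instead), and the threshold is a caller-supplied parameter because the slides do not state a value. The example uses the long insert FPKMs listed for MT-ND6 in the back-up table (1820 in mit, 377 in tot).

```python
import math
from typing import Optional


def log2_tot_over_mit(fpkm_tot, fpkm_mit) -> Optional[float]:
    """Return log2(FPKM_tot / FPKM_mit), or None when it cannot be computed."""
    try:
        tot = float(fpkm_tot)
        mit = float(fpkm_mit)
    except (TypeError, ValueError):
        # Assumption: status strings such as HIDATA or NOTEST are skipped.
        return None
    if tot <= 0 or mit <= 0:
        return None
    return math.log2(tot / mit)


def is_mito_enriched(log2_ratio: Optional[float], threshold: float) -> bool:
    """Negative ratios mean more signal in the mitochondrial fraction.

    The threshold is hypothetical and must be chosen by the analyst.
    """
    return log2_ratio is not None and log2_ratio <= threshold


if __name__ == "__main__":
    ratio = log2_tot_over_mit(377, 1820)  # MT-ND6, long insert library
    print(round(ratio, 2))  # -2.27, as listed in the back-up table
    print(is_mito_enriched(ratio, threshold=-1.0))  # -1.0 is illustrative only
```

With this sign convention, genes enriched in the mitochondrial fraction have negative values, which is consistent with the negative DE means reported for the mitochondrial gene set.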
Expression data reflects expectations

  Ensembl Ids       Gene      Length (shortest)   qPCR(tot)/qPCR(mit) 21-07-2011   qPCR 29-07-2011   RNA seq ShortIS   RNA seq LongIS
  ENSG00000198899   MT-ATP6   681                 0.600                            0.500             -                 0.18
  ENSG00000198840   MT-ND3    346                 0.400                            0.400             -                 0.21
  ENSG00000111640   GAPDH     390                 416.000                          362.000           -                 7.1
  ENSG00000089157   RPLP0     402                 611.000                          446.000           -                 8.6

The statistics may not be trustworthy, but the fold change is!
Define an enrichment threshold based on log2(FPKMtot/FPKMmit)
Reference gene sets: Cytosol, Vicinity of mitochondria, Mitochondrial genes

Compartmentalized genes
• Cytosolic genes:
  – UniProt: experimentally observed in cytosol
  – Ensembl: no automatic annotations
• Vicinity of mitochondria:
  – Paper from Kang et al. 2012
• Mitochondrial genes
  – The 37 genes in the mitochondrial chromosome

Log2(Fold Change) distributions for the long insert library
[Figure: distributions of log2(fold change) per gene set]

Summary

               All     Cyt Ensembl67   Cyt UniProt   Mitochondrial   Kang2012 VicinityMit
  ShortIS
    DE Mean    1.7     0.41            -             -0.6            -
    DE Median  2.05    0.46            -             -0.65           -
    SeqNumb    2117    9               0             22              0
  LongIS
    DE Mean    0.46    1.05            0.9           -2.21           1.94
    DE Median  0.5     1.14            0.96          -2.27           2.2
    SeqNumb    21030   1664            127           34              13

Significantly enriched genes

  Method            Short Insert   Long Insert
  CuffDiff V2.0.1   988            908
  Threshold         309            714
  Intersection      22             99

Back Up slides

Mitochondrial genome
[Figure]

Mitochondrial genome – first 3 genes
[Figure]

Mitochondrial genes: FPKM and log2(tot/mit) in the short and long insert libraries

                                         Short                                  Long
  Ensembl Ids       Gene        Length   FPKM mit   FPKM tot   log2(tot/mit)   FPKM mit   FPKM tot   log2(tot/mit)
  ENSG00000198695   MT-ND6      525      81         23         -1.81           1820       377        -2.27
  ENSG00000198712   MT-CO2      684      459        169        -1.44           4063       764        -2.41
  ENSG00000198727   MT-CYB      1141     159        144        -0.15           2332       504        -2.21
  ENSG00000198763   MT-ND2      1042     172        59         -1.53           1559       285        -2.45
  ENSG00000198786   MT-ND5      1812     129        58         -1.15           2153       437        -2.30
  ENSG00000198804   MT-CO1      1542     154        58         -1.42           4186       766        -2.45
  ENSG00000198840   MT-ND3      346      226        66         -1.77           2890       610        -2.24
  ENSG00000198886   MT-ND4      1378     166        56         -1.56           3400       698        -2.28
  ENSG00000198888   MT-ND1      956      150        92         -0.71           1183       233        -2.35
  ENSG00000198899   MT-ATP6     681      94         26         -1.83           2357       431        -2.45
  ENSG00000198938   MT-CO3      784      270        269        -0.01           2037       401        -2.34
  ENSG00000209082   J01415.1    75       39041      34034      -0.20           56409      9045       -2.64
  ENSG00000210049   J01415.2    71       179164     80467      -1.15           257938     55524      -2.22
  ENSG00000210077   J01415.3    69       96298      67810      -0.51           2524440    682409     -1.89
  ENSG00000210082   J01415.4    1559     1546       642        -1.27           HIDATA     27286      0.00
  ENSG00000210100   J01415.5    69       10163      12512      0.30            63087      11058      -2.51
  ENSG00000210107   J01415.6    72       75946      35617      -1.09           2191       455        -2.27
  ENSG00000210112   J01415.7    68       171524     97116      -0.82           67897      22503      -1.59
  ENSG00000210117   J01415.8    68       11418      7479       -0.61           7944       2424       -1.71
  ENSG00000210127   J01415.9    69       1932       1427       -0.44           13615      3971       -1.78
  ENSG00000210135   J01415.10   73       20509      12667      -0.70           1864       196        -3.25
  ENSG00000210140   J01415.11   66       12550      7616       -0.72           77355      13629      -2.50
  ENSG00000210144   J01415.12   66       9804       5234       -0.91           74448      11999      -2.63
  ENSG00000210151   J01415.13   69       5078       1809       -1.49           NOTEST     NOTEST     NOTEST
  ENSG00000210154   J01415.14   68       5943       3392       -0.81           1800       760        -1.24
  ENSG00000210156   J01415.15   70       28619      32650      0.19            1734       345        -2.33
  ENSG00000210164   J01415.16   68       5627       3232       -0.80           5572       1972       -1.50
  ENSG00000210174   J01415.17   65       7569       10780      0.51            11149      4206       -1.41
  ENSG00000210176   J01415.18   69       43092      28770      -0.58           150713     34863      -2.11
  ENSG00000210184   J01415.19   59       1175590    395027     -1.57           735380     208681     -1.82
  ENSG00000210191   J01415.20   71       67641      36817      -0.88           70081      14281      -2.29
  ENSG00000210194   J01415.21   69       157602     115972     -0.44           603010     124182     -2.28
  ENSG00000210195   J01415.22   66       71836      77279      0.11            19871      4777       -2.06
  ENSG00000210196   J01415.23   68       45761      30983      -0.56           121678     15826      -2.94
  ENSG00000211459   J01415.24   954      943        583        -0.69           HIDATA     29151      0.00
  ENSG00000212907   MT-ND4L     297      412        141        -1.54           9230       1991       -2.21
  ENSG00000228253   J01415.25   207      735        160        -2.20           36590      8531       -2.10

Mitochondrial genes: annotation

  Ensembl Ids       Gene        Type             Status   Level
  ENSG00000198695   MT-ND6      protein_coding   KNOWN    3
  ENSG00000198712   MT-CO2      protein_coding   KNOWN    3
  ENSG00000198727   MT-CYB      protein_coding   KNOWN    3
  ENSG00000198763   MT-ND2      protein_coding   KNOWN    3
  ENSG00000198786   MT-ND5      protein_coding   KNOWN    3
  ENSG00000198804   MT-CO1      protein_coding   KNOWN    3
  ENSG00000198840   MT-ND3      protein_coding   KNOWN    3
  ENSG00000198886   MT-ND4      protein_coding   KNOWN    3
  ENSG00000198888   MT-ND1      protein_coding   KNOWN    3
  ENSG00000198899   MT-ATP6     protein_coding   KNOWN    3
  ENSG00000198938   MT-CO3      protein_coding   KNOWN    3
  ENSG00000209082   J01415.1    Mt_tRNA          NOVEL    3
  ENSG00000210049   J01415.2    Mt_tRNA          NOVEL    3
  ENSG00000210077   J01415.3    Mt_tRNA          NOVEL    3
  ENSG00000210082   J01415.4    Mt_rRNA          KNOWN    3
  ENSG00000210100   J01415.5    Mt_tRNA          NOVEL    3
  ENSG00000210107   J01415.6    Mt_tRNA          NOVEL    3
  ENSG00000210112   J01415.7    Mt_tRNA          NOVEL    3
  ENSG00000210117   J01415.8    Mt_tRNA          NOVEL    3
  ENSG00000210127   J01415.9    Mt_tRNA          NOVEL    3
  ENSG00000210135   J01415.10   Mt_tRNA          NOVEL    3
  ENSG00000210140   J01415.11   Mt_tRNA          NOVEL    3
  ENSG00000210144   J01415.12   Mt_tRNA          KNOWN    3
  ENSG00000210151   J01415.13   Mt_tRNA          NOVEL    3
  ENSG00000210154   J01415.14   Mt_tRNA          NOVEL    3
  ENSG00000210156   J01415.15   Mt_tRNA          NOVEL    3
  ENSG00000210164   J01415.16   Mt_tRNA          NOVEL    3
  ENSG00000210174   J01415.17   Mt_tRNA          NOVEL    3
  ENSG00000210176   J01415.18   Mt_tRNA          NOVEL    3
  ENSG00000210184   J01415.19   Mt_tRNA          NOVEL    3
  ENSG00000210191   J01415.20   Mt_tRNA          NOVEL    3
  ENSG00000210194   J01415.21   Mt_tRNA          KNOWN    3
  ENSG00000210195   J01415.22   Mt_tRNA          NOVEL    3
  ENSG00000210196   J01415.23   Mt_tRNA          NOVEL    3
  ENSG00000211459   J01415.24   Mt_rRNA          KNOWN    3
  ENSG00000212907   MT-ND4L     protein_coding   KNOWN    3
  ENSG00000228253   J01415.25   protein_coding   KNOWN    3

Cellular metabolism regulation (E2C slide)
[Diagram labels: Glucose, Glycolysis, Pyruvate, Lactate, 2 ATP, O2, CO2, OXPHOS, 36 ATP, amino acids, nucleotides, mitochondrial dysfunction, differentiation, Warburg effect]
• Proliferative cells, undifferentiated cells: biosynthesis efficiency
• Working cells, differentiated cells: energetic efficiency
MCF7
• MCF7 is a breast cancer cell line able to grow in OXPHOS conditions.
• Cells grown in different metabolic conditions might represent a unique way to distinguish the RNA subpopulations expressed in mitochondria (ncRNA and … miRNA?)
Experimental design
[Diagram: stable MCF-7 cell lines and metabolic shifts between J0 and J1]
• Conditions: OXPHOS (0 mM glucose), Low Glucose, High Glucose
• Stable MCF-7 cell lines (min 3 weeks): MCF7 OXPHOS, MCF7 High Gluc
• MCF7 source: AGB:CH3854 ATCC:HTB-22
• Shifts: MCF7 OXPHOS shifted to High Glucose; MCF7 High Gluc shifted to OXPHOS
• Total cells and mito extraction, followed by TLDA and RNA-seq
• N = 3 to 4 independent batches
• TLDA = Microfluidic miRNA qPCR

Exon
[Figure labels: Exon 1, Exon 2]