RED-ML: a novel, effective RNA editing detection method based on machine learning

Heng Xiong1,2*, Dongbing Liu1,2*, Qiye Li1,2, Mengyue Lei1,2, Liqin Xu1,2, Liang Wu1,2, Zongji Wang1,2, Shancheng Ren3, Wangsheng Li1,2, Min Xia1,2, Lihua Lu1,2, Haorong Lu1,2, Yong Hou1,2,4, Shida Zhu1,2,4, Xin Liu1,2, Yinghao Sun3, Jian Wang1,5, Huanming Yang1,5, Kui Wu1,2,4, Xun Xu1,2#, and Leo J Lee1,6#

Data description
Sequencing
    Transcriptome sequencing of two prostate samples on Hiseq 2000 platform
    Transcriptome sequencing of Hela on Hiseq 2000 platform
    RNA sequencing of two prostate samples and YH on Ion Proton platform
    RNA sequencing of Hela on Ion Proton platform
    Whole genome sequencing of two prostate samples on Hiseq 2000 platform
Data processing and analysis
    RNA-seq analysis on Hiseq 2000 platform
        Data cleaning
        BWA alignment
        Tophat2 alignment
    DNA-seq analysis on Hiseq 2000 platform
    RNA-seq analysis on Ion Proton platform
    Validation of RNA editing sites by Ion Proton sequencing
    Genomic variant evaluation
    Training data construction
    Adaptation of RES-Scanner
    Feature importance analysis
Extra RED-ML results
    Results based on Tophat2
    Other results
References

Data description

Our RNA editing detection method RED-ML (RNA Editing Detection based on Machine Learning) was developed using the previously published Han Chinese individual RNA-seq and DNA-seq data (aka the YH dataset) [1]. To evaluate its performance, it was further applied to RNA-seq data derived from two prostate tumor samples (CH24T, CH62T) and the Hela cell line, sequenced on the Illumina Hiseq platform. We also had the DNA of the two prostate samples sequenced and employed the YH DNA-seq data [1] to study the influence of genomic variants on RNA editing detection. The RNA of all four samples was further subjected to Ion Proton sequencing in order to evaluate the accuracy of the RNA editing sites detected by RED-ML.
Sequencing information of all datasets is listed in Tables S1-S3.

Sequencing

Transcriptome sequencing of two prostate samples on Hiseq 2000 platform

5 μg of total RNA was used as the starting material and treated with DNase I for 30 min at 37 °C to remove residual DNA. rRNA was removed with the Ribo-Zero™ Magnetic Gold Kit. The RNA was then fragmented using 5× first strand buffer. First-strand cDNA was synthesized with dNTPs, DTT, RNase Inhibitor and SuperScript® II Reverse Transcriptase. The first-strand cDNA (ss-cDNA) was incubated with 5× second strand buffer, 40 mM dNTPs, 25 U DNA Polymerase I and 1 U RNase H to synthesize double-stranded cDNA (ds-cDNA). The ds-cDNA was repaired by T4 Polynucleotide Kinase, T4 DNA polymerase and Klenow fragment with dNTPs to create phosphorylated blunt-end termini. The end-repaired ds-cDNA was then ligated to synthetic A and DNA adaptors. The adaptor-ligated ds-cDNA was purified with Ampure XP beads to remove unincorporated adaptors. Uracil-N-Glycosylase was then used to digest the second strand and PCR was performed. The amplified PCR libraries were purified with Ampure XP beads. Sequencing was performed on the Illumina Hiseq 2000 platform with 90 bp paired-end reads.

Transcriptome sequencing of Hela on Hiseq 2000 platform

Total RNA of the HeLa S3 cell population was extracted with an RNeasy Plus Mini Kit (Qiagen) according to the manufacturer’s instructions. 5 ng of total RNA was used to produce cDNA following the SMART-seq2 protocol [2]. Amplified cDNA products were purified with 1× Agencourt AMPure XP beads (Beckman Coulter). A total of 2 ng of purified cDNA products was used as the starting amount for library preparation with the TruePrep™ Mini DNA Sample Prep Kit (Vazyme Biotech) according to the instruction manual. Quality control of the library was performed with an Agilent 2100 Bioanalyzer and qPCR. The constructed library was sequenced on the Hiseq 2000 platform using 2 × 91 bp paired-end reads.

RNA sequencing of two prostate samples and YH on Ion Proton platform

We used 2 μg of total RNA as the starting material to enrich mRNA with the Dynabeads® mRNA Purification Kit (#61006, Life Technologies) according to the manufacturer’s protocol. The mRNA was then fragmented using 5× first strand buffer and 0.1 ng N6 random primer. First-strand cDNA was synthesized with dNTPs, DTT, RNase Inhibitor and SuperScript® II Reverse Transcriptase (#18064014, Life Technologies). The first-strand cDNA (ss-cDNA) was incubated with 5× second strand buffer (#10812014, Life Technologies), 20 mM dNTPs, 25 U DNA Polymerase I (#P7050L, Enzymatics) and 1 U RNase H (#Y922L, Enzymatics) to synthesize double-stranded cDNA (ds-cDNA). The ds-cDNA was repaired by T4 Polynucleotide Kinase, T4 DNA polymerase and Klenow fragment with dNTPs to create phosphorylated blunt-end termini. The end-repaired ds-cDNA was then ligated to synthetic A and P adaptors. The adaptor-ligated ds-cDNA was purified with Ampure XP beads (#A63882, Beckman) to remove unincorporated adaptors. The purified libraries were size-selected by agarose gel electrophoresis, followed by purification with the QIAquick Gel Extraction Kit (#28706, Qiagen). The size-selected libraries, with inserts of 150-220 bp, were then subjected to PCR in a final 25 μl reaction solution containing 1 U Platinum® Pfx DNA Polymerase (#C11708-021, Invitrogen), 1× Pfx buffer, MgSO4, dNTPs, A primer and P primer. The amplified PCR libraries were purified with Ampure XP beads and eluted in TE buffer. Emulsion PCR was performed using the One Touch system (Life Technologies). Beads were prepared using the One Touch 2 Template Kit v3 (#4488318). Sequencing was performed using the Ion Proton 200 sequencing kit v3 (#4488315) on the P1 Ion chip. Data were collected using the Torrent Suite v4.0 software.

RNA sequencing of Hela on Ion Proton platform

Cell sorting with FACS. All samples were stained with 1 μg/ml fluorescein diacetate (FDA) for 5 min.
The non-polar, non-fluorescent FDA enters live cells, where it is enzymatically hydrolyzed by acetylesterase to polar, fluorescent fluorescein, which rapidly accumulates in the cytoplasm. These cells appear green when viewed under incident ultraviolet illumination. 100 live cells were selected by FACS and sorted into 4 μl of cell lysis buffer (see the SMART-seq2 protocol [2]). Two to three replicates were prepared for each sample.

RNA-seq. All the mRNA of the 100 cells was amplified following the SMART-seq2 protocol. Briefly, the cells in the lysis buffer were lysed at 72 °C for 3 min to release mRNA. The mRNA was then reverse transcribed with an oligo-dT primer, after which a TSO primer was added to the 5’ end of the cDNA. After 18 cycles of PCR with a primer annealing to the cDNA, qualified cDNA was used to make libraries with the TruePrep™ DNA Library Prep Kit V2 (Vazyme), with modifications. We changed the adapters and primers to make them fit the Ion Proton sequencer (Life Technologies; sequences are given in Table S9). Adapter1 was annealed using Tn5MErev and Tn5ME-A-Ion; adapter2 was annealed using Tn5MErev and Tn5ME-P-Ion. Four primers (IonPrimer-P, IonPrimer-A, IonAdapter-P, IonAdapter-A) were used for library PCR. Qualified libraries were sent to the sequencer. 12 samples with different indexes were pooled on one chip, and each library was sequenced to obtain at least 5 million reads.

Whole genome sequencing of two prostate samples on Hiseq 2000 platform

The genomic DNA from tissues was prepared using the QIAamp DNA Mini Kit (Qiagen) following the manufacturer’s instructions. Prior to library construction, 2-3 μg of genomic DNA from each sample was fragmented using a Covaris sonication system to mean sizes of ~500 bp. After fragmentation, libraries were constructed according to the Illumina Paired-End protocol.
Briefly, the purified, randomly fragmented DNA was treated with a mix of T4 DNA polymerase, Klenow fragment, T4 polynucleotide kinase and a nucleotide triphosphate mix to repair the ends by blunting and phosphorylation. The blunted DNA fragments were subsequently 3’-adenylated using the Klenow fragment (3’-5’ exo) and ligated by T4 DNA ligase to BGI-designed PE Index Adaptors that had been synthesized with 5’-methylcytosine in place of cytosine. After each step, the DNA was purified using the QIAquick PCR Purification Kit (Qiagen). All the constructed libraries were sequenced on the HiSeq 2000 platform using 2 × 100 bp paired-end reads.

Data processing and analysis

RNA-seq analysis on Hiseq 2000 platform

Data cleaning

Sequencing reads were discarded if they: 1) contained adaptor sequence; 2) were of low quality (more than 5% of bases with quality < 5); or 3) contained more than 5% Ns. High-quality paired-end reads were aligned to the UCSC human reference genome (hg19) using Tophat2 [3] and BWA [4].

BWA alignment

When reads are aligned by BWA (recommended), one should first build a new reference that combines the reference genome (hg19) with exonic sequences surrounding all known splice junctions; the detailed method is the same as in Ramaswami et al [5] and Wang et al [6]. SAMtools [7] (0.1.19) was used to sort the alignment file and to remove PCR duplicate reads.

Tophat2 alignment

TopHat2 is a fast splice junction mapper for RNA-seq reads, so the cleaned reads were mapped to the reference genome (hg19) directly with default parameters. Picard (v1.84; http://broadinstitute.github.io/picard/) was used to sort the alignment and to remove duplicate reads induced by PCR, and base quality score recalibration was carried out by GATK [8].

DNA-seq analysis on Hiseq 2000 platform

Sequencing reads were discarded if they: 1) contained adaptor sequence; 2) were of low quality (more than 5% of bases with quality < 5); or 3) contained more than 5% Ns.
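
The three discard criteria above can be sketched as a single per-read predicate. This is only an illustration under stated assumptions: the function name `keep_read`, the exact-substring adaptor check and the `adapter` argument are hypothetical and not taken from the actual cleaning pipeline, and the text does not say how the two mates of a pair are combined.

```python
def keep_read(seq, quals, adapter,
              max_lowq_frac=0.05, min_qual=5, max_n_frac=0.05):
    """Return True if a read passes the three cleaning criteria.

    seq     -- read bases as a string
    quals   -- per-base Phred quality scores, same length as seq
    adapter -- adaptor sequence to screen for (hypothetical input;
               exact substring matching is a simplification)
    """
    if not seq or len(seq) != len(quals):
        return False
    # 1) discard reads containing adaptor sequence
    if adapter and adapter in seq:
        return False
    # 2) discard reads with more than 5% of bases below quality 5
    low_quality = sum(1 for q in quals if q < min_qual)
    if low_quality / len(seq) > max_lowq_frac:
        return False
    # 3) discard reads with more than 5% Ns
    if seq.upper().count("N") / len(seq) > max_n_frac:
        return False
    return True
```

Both fractions use a strict "more than" comparison, so a read with exactly 5% low-quality bases or exactly 5% Ns is kept.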
High-quality paired-end reads were aligned to the UCSC human reference genome (hg19) using BWA (0.5.9). Picard (v1.84; http://broadinstitute.github.io/picard/) was used to mark duplicate reads induced by PCR. Local realignment of the BWA-aligned reads and base quality score recalibration were then carried out by GATK [8].

RNA-seq analysis on Ion Proton platform

The raw sequencing reads obtained by Ion Proton sequencing were discarded if either: 1) the read length was < 20; or 2) more than 50% of bases had base quality < 5. After that, clean reads were mapped to the human reference genome (hg19) using TMAP (https://github.com/iontorrent/TS/tree/master/Analysis/TMAP). Post-processing of BAM files included removal of PCR duplicates by Picard (v1.84; http://broadinstitute.github.io/picard/) and base quality score recalibration by GATK v2.8-1 [8].

Validation of RNA editing sites by Ion Proton sequencing

Read-support information was extracted from the TMAP-aligned BAM file for each RNA editing site detected by RED-ML. If the read coverage of a RED-ML detected RNA editing site was >= 20X and the same variant base was observed with at least two supporting reads in Ion Proton sequencing, the RNA editing site was regarded as validated.

Figure S1. Validation of RNA editing sites identified by Peng et al [1], binned according to Ion Proton sequencing coverage depth. The validation rate increases with coverage and roughly saturates at 20X depth.

Genomic variant evaluation

We have developed a stringent yet convenient method to evaluate whether RED-ML can distinguish SNPs from RNA editing sites. We performed a pileup on the DNA BAM files at the RNA editing sites identified by RED-ML. If a site has mapped reads >= 4 and variant-supporting reads >= 2 in the DNA BAM file, we regard this site as a candidate genomic variant.

Training data construction

Negative:
1. 150 randomly sampled sites with frequency less than 0.1.
2. 150 randomly sampled sites located in homopolymer regions.
3. 150 randomly sampled sites with fewer than 2 reads supporting the variant allele.
4. 150 randomly sampled sites which exhibit strand bias.
5. 150 randomly sampled sites located in read ends.
6. 150 randomly sampled sites located in simple repeat regions.
7. 150 randomly sampled sites which do not pass the binomial test.
8. 300 sampled sites which are filtered out by BWA and not validated by Ion Proton.
9. 1200 randomly sampled sites which are in dbSNP 138.
10. 110 (41139), 47 (6136), 36 (6314), 150 (304952) and 32 (1684) sites which are not validated by Ion Proton sequencing. The numbers in brackets correspond to numbers in the Venn diagram of Fig. S2.

Positive:
1. Ion Proton validated sites in 2960 (1334).
2. Ion Proton validated sites in 6314, except those with frequency < 0.1 or not passing the binomial test (141).

Figure S2. Overlap of detected RNA editing sites among three methods (RES-Scanner, Peng et al [1], Ramaswami et al [5]).

Adaptation of RES-Scanner

The biggest modification we made to adapt and optimize RES-Scanner on the YH dataset was the use of a different alignment strategy. By default, RES-Scanner uses BWA to align RNA-seq reads and BLAT for re-alignment, the same as in Ramaswami et al [5]. In order to further stabilize our positive training set, which is mostly obtained by overlapping the RNA editing sites detected by three different methods, we also incorporated the popular TopHat2 software in our alignment step. Specifically, we now use TopHat2 to align RNA-seq reads and BWA for re-alignment. We also slightly adjusted the thresholds of some hard filters; full details are provided in Table S8.

Feature importance analysis

In order to show the relative importance of each feature, we normalize the magnitude of each feature on the training set before training a LR classifier. This is accomplished by normalizing the nonnegative features to the range [0, 1] and the remaining ones to the range [-1, 1].
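
A minimal sketch of this kind of normalization is given here for illustration. It assumes max-abs scaling, which sends nonnegative columns to [0, 1] and signed columns to [-1, 1]; the text gives only the target ranges, so the scaling rule, the function names and the feature names used with them are hypothetical. The second helper orders features by the absolute value of their fitted weights.

```python
def scale_feature(column):
    """Scale one feature column by its largest absolute value.

    Nonnegative columns end up in [0, 1]; columns containing negative
    values end up in [-1, 1]. Max-abs scaling is an assumption here.
    """
    max_abs = max(abs(v) for v in column)
    if max_abs == 0:
        # A constant-zero feature carries no scale information.
        return [0.0] * len(column)
    return [v / max_abs for v in column]


def rank_by_abs_weight(weights):
    """Return feature names ordered by decreasing absolute weight.

    weights -- dict mapping a (hypothetical) feature name to the weight
               fitted by a classifier on the normalized features
    """
    return sorted(weights, key=lambda name: abs(weights[name]), reverse=True)
```

Because every feature is brought to a comparable scale before fitting, a larger absolute weight can be read as a larger influence on the classifier output, which is what makes the weights comparable across features.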
The relative importance of each feature can then be viewed by comparing the absolute values of the weights (details in Table S10), which are plotted in Fig. S3.

Figure S3. The significance of features for RNA editing detection by RED-ML. The features are sorted according to their weights.

Extra RED-ML results

Results based on Tophat2

Since the current version of RED-ML has also been optimized for TopHat2, we applied RED-ML to the YH, CH24T, CH62T and Hela BAM files aligned by TopHat2. Basic properties of the RNA editing sites detected by RED-ML based on Tophat2 are listed in Table S5.2, and results parallel to those for the BWA-aligned BAM files are shown in the following figures (Figures S4-S8).

Figure S4. Comparison of RED-ML and RES-Scanner (Tophat2). To ensure a fair comparison, we compared RED-ML and RES-Scanner on the same alignment file (Tophat2, because RES-Scanner also accepts Tophat2 alignment files [6]). a. The numbers of editing sites identified by RED-ML are larger than those of RES-Scanner (Tophat2) in CH24T, CH62T and Hela. b. The Ion Proton validation rates of editing sites identified by RED-ML are higher than those by RES-Scanner (Tophat2) in CH24T, CH62T and Hela. RED-ML has a greater advantage since TopHat2 has been used in constructing our LR classifier while RES-Scanner has only been optimized for BWA.

Figure S5. The number of RNA editing sites and Ion Proton validation rate under different thresholds. As the threshold increases, the number of detected editing sites decreases and the Ion Proton validation rate increases. a, b and c show the CH24T, CH62T and Hela samples, respectively.

Figure S6. Comparison with known RNA editing databases. (a)-(c): The overlap of detected sites with two curated RNA editing databases (DARNED and RADAR) in the CH24T, CH62T and Hela samples is shown as Venn diagrams.
Significant portions of RED-ML detected sites are in neither of the existing databases (37.5%, 36.5% and 59.2% for the CH24T, CH62T and Hela samples respectively), probably because these are not normal tissues.

Figure S7. Ion Proton validation rates for different classes of sites (defined in the main text) in the three samples. The number of Ion Proton validated sites in each class is indicated on top of each bar. There are no significant differences in validation rates among the categories in all three samples, which demonstrates that RED-ML performs quite consistently, independent of existing RNA editing databases.

Figure S8. SNP evaluation. The percentage of genomic variants in detected RNA editing sites, as quantified by matching DNA sequencing data. The percentage of genomic variants in RED-ML detected RNA editing sites is quite low (no more than 2%).

Other results

Since the method of Ramaswami et al used very loose filters to detect RNA editing sites in Alu regions, it detected quite a large number of RNA editing sites (355,365), with the vast majority of them (97%) in Alu regions and a significant number (45,746) with low editing frequencies (<0.1). RED-ML is not able to detect RNA editing sites with very low frequencies by design, as explained in the main text. Nonetheless, it is still possible to adjust RED-ML’s threshold to match the number of RNA editing sites detected by Ramaswami et al, and both methods then achieve similar validation rates (both quite low, <0.6). However, such a threshold (~0.0035) is too small to be used for a LR classifier in practice. To make the comparison between RED-ML and Ramaswami et al more meaningful, we checked the RNA editing sites detected in non-Alu regions for both methods; the results are given in the main text.

STAR [9] with default settings was used to align the RNA-seq data of CH24T. SAMtools was used to sort the alignment file and to remove the PCR duplicate reads.
Then, based on the resulting BAM file, RED-ML was used for RNA editing detection. A total of 246,879 sites were identified as RNA editing sites using the default threshold of 0.5, while the Ion Proton validation rate was 0.34. The proportions of A-to-I editing and of editing in Alu regions were 0.79 and 0.63, respectively.

HISAT2 [10] with default settings was used to align the RNA-seq data of CH24T. SAMtools was used to sort the alignment file and Picard was used to remove the PCR duplicate reads. Then, based on the resulting BAM file, RED-ML was used for RNA editing detection. A total of 36,374 sites were identified as RNA editing sites, and the Ion Proton validation rate was 0.85. The proportions of A-to-I editing and of editing in Alu regions were 0.93 and 0.78, respectively.

When RED-ML was applied to the ant RNA-seq and DNA-seq data from Li et al [11], we set the Alu features to 0 for all candidate RNA editing sites and detected a total of 15,354 sites. There were 53 sites in the Sanger datasets from Li et al [11], and all of them were validated. However, the proportion of A-to-I editing was only 60.5%.

References

1. Peng Z, Cheng Y, Tan BC, Kang L, Tian Z, Zhu Y, Zhang W, Liang Y, Hu X, Tan X et al: Comprehensive analysis of RNA-Seq data reveals extensive RNA editing in a human transcriptome. Nature biotechnology 2012, 30(3):253-260.
2. Picelli S, Faridani OR, Bjorklund AK, Winberg G, Sagasser S, Sandberg R: Full-length RNA-seq from single cells using Smart-seq2. Nature protocols 2014, 9(1):171-181.
3. Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL: TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome biology 2013, 14(4):R36.
4. Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009, 25(14):1754-1760.
5. Ramaswami G, Lin W, Piskol R, Tan MH, Davis C, Li JB: Accurate identification of human Alu and non-Alu RNA editing sites. Nature Methods 2012, 9(6):579-581.
6. Wang Z, Lian J, Li Q, Zhang P, Zhou Y, Zhan X, Zhang G: RES-Scanner: a software package for genome-wide identification of RNA-editing sites. GigaScience 2016, 5(1):37.
7. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Genome Project Data Processing S: The Sequence Alignment/Map format and SAMtools. Bioinformatics 2009, 25(16):2078-2079.
8. DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M et al: A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature genetics 2011, 43(5):491-498.
9. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR: STAR: ultrafast universal RNA-seq aligner. Bioinformatics 2013, 29(1):15-21.
10. Kim D, Langmead B, Salzberg SL: HISAT: a fast spliced aligner with low memory requirements. Nature Methods 2015, 12(4):357-360.
11. Li Q, Wang Z, Lian J, Schiott M, Jin L, Zhang P, Zhang Y, Nygaard S, Peng Z, Zhou Y et al: Caste-specific RNA editomes in the leaf-cutting ant Acromyrmex echinatior. Nature communications 2014, 5:4943.