RED-ML: a novel, effective RNA editing detection method
based on machine learning
Heng Xiong1,2*, Dongbing Liu1,2*, Qiye Li1,2, Mengyue Lei1,2, Liqin Xu1,2, Liang Wu1,2, Zongji
Wang1,2, Shancheng Ren3, Wangsheng Li1,2, Min Xia1,2, Lihua Lu1,2, Haorong Lu1,2, Yong
Hou1,2,4, Shida Zhu1,2,4, Xin Liu1,2, Yinghao Sun3, Jian Wang1,5, Huanming Yang1,5, Kui Wu1,2,4,
Xun Xu1,2#, and Leo J Lee1,6#
Data description
Sequencing
  Transcriptome sequencing of two prostate samples on Hiseq 2000 platform
  Transcriptome sequencing of Hela on Hiseq 2000 platform
  RNA sequencing of two prostate samples and YH on Ion Proton platform
  RNA sequencing of Hela on Ion Proton platform
  Whole genome sequencing of two prostate samples on Hiseq 2000 platform
Data processing and analysis
  RNA-seq analysis on Hiseq 2000 platform
    Data cleaning
    BWA alignment
    Tophat2 alignment
  DNA-seq analysis on Hiseq 2000 platform
  RNA-seq analysis on Ion Proton platform
  Validation of RNA editing sites by Ion Proton sequencing
  Genomic variant evaluation
  Training data construction
  Adaptation of RES-Scanner
  Feature importance analysis
Extra RED-ML results
  Results based on Tophat2
  Other results
References
Data description
Our RNA editing detection method RED-ML (RNA Editing Detection based on
Machine Learning) was developed using the previously published RNA-seq and
DNA-seq data of a Han Chinese individual (the YH dataset) [1]. To evaluate its
performance, it was further applied to RNA-seq data derived from two prostate
tumor samples (CH24T, CH62T) and the Hela cell line, sequenced on the Illumina
Hiseq platform. We also had the DNA of the two prostate samples sequenced and
employed the YH DNA-seq data [1] to study the influence of genomic variants on
RED. The RNA of all four samples was further subjected to Ion Proton sequencing
in order to evaluate the accuracy of RNA editing sites detected by RED-ML.
Sequencing information for all datasets is listed in Tables S1-S3.
Sequencing
Transcriptome sequencing of two prostate samples on Hiseq
2000 platform
5 μg of total RNA was used as the starting material and treated with DNase I for 30 min
at 37℃ to remove residual DNA. rRNA was removed with the Ribo-Zero™ Magnetic
Gold Kit. The RNA was then fragmented using 5× first strand buffer. First-strand
cDNA was synthesized with dNTPs, DTT, RNase Inhibitor and
SuperScript® II Reverse Transcriptase. The first-strand cDNA (ss-cDNA) was
incubated with 5× second strand buffer, 40 mM dNTPs, 25 U DNA Polymerase I and 1 U
RNase H to synthesize double-stranded cDNA (ds-cDNA). The ds-cDNA was repaired by
T4 Polynucleotide Kinase, T4 DNA polymerase and Klenow fragment with dNTPs to
create phosphorylated blunt-end termini. The end-repaired ds-cDNA was then ligated
to synthetic A and DNA adaptors. The adaptor-ligated ds-cDNA was purified with
Ampure XP beads to remove unincorporated adaptors. Uracil-N-Glycosylase was then
used to digest the second strand and PCR was performed. The amplified PCR libraries
were purified with Ampure XP beads. Sequencing was performed on the Illumina
Hiseq 2000 platform with 90 bp paired-end reads.
Transcriptome sequencing of Hela on Hiseq 2000 platform
Total RNA of a HeLa S3 cell population was extracted with an RNeasy Plus Mini Kit
(Qiagen) according to the manufacturer’s instructions. 5 ng of total RNA was used to
produce cDNA following the SMART-seq2 protocol [2]. Amplified cDNA products
were purified with 1 × Agencourt AMPure XP beads (Beckman Coulter). A total of 2 ng
of purified cDNA products was used as the starting amount for library preparation
with the TruePrep™ Mini DNA Sample Prep Kit (Vazyme Biotech) according to the
instruction manual. Quality control of the library was completed by Agilent 2100
Bioanalyzer and qPCR. The constructed library was sequenced on the Hiseq 2000
platform using 2 × 91 bp paired-end reads.
RNA sequencing of two prostate samples and YH on Ion
Proton platform
We used 2 μg of total RNA as the starting material to enrich mRNA with the Dynabeads®
mRNA Purification Kit (#61006, Life Technologies) according to the manufacturer’s
protocol. The mRNA was then fragmented using 5× first strand buffer and 0.1 ng N6
random primer. First-strand cDNA was synthesized with dNTPs, DTT,
RNase Inhibitor and SuperScript® II Reverse Transcriptase (#18064014, Life
Technologies). The first-strand cDNA (ss-cDNA) was incubated with 5× second
strand buffer (#10812014, Life Technologies), 20 mM dNTPs, 25 U DNA Polymerase I
(#P7050L, Enzymatics) and 1 U RNase H (Y922L, Enzymatics) to synthesize
double-stranded cDNA (ds-cDNA). The ds-cDNA was repaired by T4 Polynucleotide Kinase, T4
DNA polymerase and Klenow fragment with dNTPs to create phosphorylated
blunt-end termini. The end-repaired ds-cDNA was then ligated to synthetic A and P
adaptors. The adaptor-ligated ds-cDNA was purified with Ampure XP beads
(#A63882, Beckman) to remove unincorporated adaptors. The purified libraries
were size selected by agarose gel electrophoresis, followed by purification with the
QIAquick Gel Extraction Kit (#28706, Qiagen). The size-selected libraries, with
insert templates of 150-220 bp, were then subjected to PCR in a final 25 μl
reaction solution containing 1 U Platinum® Pfx DNA Polymerase (#C11708-021,
Invitrogen), 1 × Pfx buffer, MgSO4, dNTPs, A primer and P primer. The amplified
PCR libraries were purified with Ampure XP beads and eluted in TE buffer. Emulsion
PCR was performed using the One Touch system (Life Technologies). Beads were
prepared using the One Touch 2 Template Kit v3 (#4488318). Sequencing was
performed using the Ion Proton 200 sequencing kit v3 (#4488315) on the P1 Ion chip.
Data were collected using the Torrent Suite v4.0 software.
RNA sequencing of Hela on Ion Proton platform
Cell sorting with FACS.
All samples were stained with 1 μg/ml fluorescein diacetate (FDA) for 5 min. The
non-polar, non-fluorescent FDA enters live cells, where it is enzymatically hydrolyzed
by acetylesterase to polar, fluorescent fluorescein, which rapidly accumulates in the
cytoplasm. These cells appear green when viewed under incident ultraviolet
illumination. 100 live cells were selected by FACS and sorted into 4 μl of cell lysis buffer
(see SMART-seq2 [2]), with 2-3 replicates for each sample.
RNA-seq.
The mRNA of the 100 cells was amplified following the SMART-seq2 protocol [2]. Briefly, the cells
in the lysis buffer were lysed at 72℃ for 3 min to release mRNA. The mRNA
was then reverse transcribed with an oligo-dT primer, after which a TSO primer
was added to the 5’ end of the cDNA. After 18 cycles of PCR with primers annealing
to the cDNA, qualified cDNA was used to make libraries with the TruePrep™ DNA
Library Prep Kit V2 (Vazyme), with modifications. We changed the adapters and primers
to fit the Ion Proton sequencer (Life Technologies; sequences in Table S9).
Adapter1 was annealed using Tn5MErev and Tn5ME-A-Ion, and adapter2 was annealed
using Tn5MErev and Tn5ME-P-Ion. Four primers (IonPrimer-P, IonPrimer-A,
IonAdapter-P, IonAdapter-A) were used for library PCR. Qualified libraries were sent
to the sequencer. Twelve samples with different indexes were pooled on one chip, and
each library was sequenced to obtain at least 5 million reads.
Whole genome sequencing of two prostate samples on Hiseq
2000 platform
The genomic DNA from tissues was prepared using the QIAamp DNA Mini Kit
(Qiagen) following the manufacturer’s instructions. Prior to library construction,
2-3 μg of genomic DNA from each sample was fragmented using a Covaris
sonication system to a mean size of ~500 bp. After fragmentation, libraries were
constructed according to the Illumina Paired-End protocol. Briefly, the purified,
randomly fragmented DNA was treated with a mix of T4 DNA polymerase, Klenow
fragment, T4 polynucleotide kinase and a nucleotide triphosphate mix to repair the
ends by blunting and phosphorylation. The blunted DNA fragments were
subsequently 3’-adenylated using the Klenow fragment (3’-5’ exo) and ligated by T4
DNA ligase to BGI-designed PE Index Adaptors that had been synthesized with
5’-methylcytosine in place of cytosine. After each step, the DNA was purified using
the QIAquick PCR Purification Kit (Qiagen). All the constructed libraries were
sequenced on the HiSeq 2000 platform using 2×100-bp paired-end reads.
Data processing and analysis
RNA-seq analysis on Hiseq 2000 platform
Data cleaning
Sequencing reads were discarded if they: 1) contained adaptor sequences; 2) were of
low quality (more than 5% of bases with quality < 5); or 3) contained more than 5% Ns.
High quality paired-end reads were aligned to the UCSC human reference genome (hg19)
using Tophat2 [3] and BWA [4].
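The three cleaning criteria can be sketched as a small Python check. This is only an illustration of the rule as stated above, not the pipeline actually used: the function name, the exact-substring adaptor test and the example adaptor sequence are assumptions, and how the two mates of a pair are handled is not specified in the text.

```python
def passes_hiseq_filter(seq, quals, adaptor=None,
                        low_qual=5, max_low_qual_frac=0.05, max_n_frac=0.05):
    """Return True if a read survives the three cleaning criteria.

    seq     -- read sequence as a string
    quals   -- per-base quality scores (integers), same length as seq
    adaptor -- adaptor sequence to look for (exact substring match is an
               assumption; the original tool may allow mismatches)
    """
    if not seq or len(seq) != len(quals):
        raise ValueError("seq and quals must be non-empty and of equal length")
    # 1) reads containing adaptor sequence are discarded
    if adaptor and adaptor in seq:
        return False
    # 2) more than 5% of bases with quality < 5
    low = sum(1 for q in quals if q < low_qual)
    if low / len(quals) > max_low_qual_frac:
        return False
    # 3) more than 5% Ns
    if seq.upper().count("N") / len(seq) > max_n_frac:
        return False
    return True


if __name__ == "__main__":
    read = "ACGT" * 25                      # 100 bases
    print(passes_hiseq_filter(read, [30] * 100))
```

Under this sketch a read with exactly 5% low-quality bases is kept, since the text says "more than 5%".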
BWA alignment
When reads are aligned by BWA (recommended), one should first build a new
reference which combines the reference genome (hg19) and exonic sequences
surrounding all known splice junctions; the detailed method is the same as in
Ramaswami et al [5] and Wang et al [6]. SAMtools [7] (0.1.19) was used to sort the
alignment file and to remove PCR duplicate reads.
Tophat2 alignment
TopHat2 is a fast splice junction mapper for RNA-seq reads, so the cleaned reads
were mapped to the reference genome (hg19) directly with default parameters. Picard
(v1.84; http://broadinstitute.github.io/picard/) was used to sort the alignment and to
remove duplicate reads induced by PCR, and base quality score recalibration was
carried out by GATK [8].
DNA-seq analysis on Hiseq 2000 platform
Sequencing reads were discarded if they: 1) contained adaptor sequences; 2) were of
low quality (more than 5% of bases with quality < 5); or 3) contained more than 5% Ns.
High quality paired-end reads were aligned to the UCSC human reference genome (hg19)
using BWA (0.5.9). Picard (v1.84; http://broadinstitute.github.io/picard/) was used to
mark duplicate reads induced by PCR. Local realignment of the BWA-aligned
reads and base quality score recalibration were then carried out by GATK [8].
RNA-seq analysis on Ion Proton platform
The raw sequencing reads obtained by Ion Proton sequencing were discarded if either:
1) the read length was < 20; or 2) more than 50% of bases had base quality < 5. After
that, clean reads were mapped to the human reference genome (hg19) using TMAP
(https://github.com/iontorrent/TS/tree/master/Analysis/TMAP). Post-processing of
BAM files included removal of PCR duplicates by Picard (v1.84;
http://broadinstitute.github.io/picard/) and base quality score recalibration by GATK
v2.8-1 [8].
Validation of RNA editing sites by Ion Proton sequencing
Read support information was extracted from the TMAP-aligned BAM file for
each RNA editing site detected by RED-ML. If the read coverage of a RED-ML
detected RNA editing site was >= 20X and the same variant base was observed with
at least two supporting reads in Ion Proton sequencing, then the RNA editing site was
regarded as validated.
Figure S1. Validation of RNA editing sites identified by Peng et al [1] binned
according to Ion Proton sequencing coverage depth. Validation rate increases with
increased coverage, and it roughly saturates at 20X depth.
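As a minimal sketch, the validation rule (coverage >= 20X and at least two reads carrying the same variant base) can be written as follows. The function names and the input layout are hypothetical. The rate calculation additionally assumes that sites below 20X are left out of the denominator rather than counted as failures, which the text does not state explicitly.

```python
def is_validated(depth, variant_reads, min_depth=20, min_support=2):
    """Apply the stated rule to one site in the Ion Proton data."""
    return depth >= min_depth and variant_reads >= min_support


def validation_rate(sites, min_depth=20, min_support=2):
    """sites: iterable of (depth, variant_reads) pairs, one per RED-ML site.

    Assumption: only sites with sufficient Ion Proton coverage are assessed.
    Returns None when no site reaches the coverage cutoff.
    """
    assessable = [(d, v) for d, v in sites if d >= min_depth]
    if not assessable:
        return None
    validated = sum(1 for d, v in assessable if is_validated(d, v, min_depth, min_support))
    return validated / len(assessable)


if __name__ == "__main__":
    example = [(25, 3), (40, 1), (10, 5), (20, 2)]   # made-up counts
    print(validation_rate(example))
```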
Genomic variant evaluation
We have developed a stringent yet convenient method to evaluate whether RED-ML
can distinguish SNPs from RNA editing sites. We used the RNA editing sites identified
by RED-ML to perform a pileup on the DNA BAM files. If a site has mapped reads >= 4 and
variant-supporting reads >= 2 in the DNA BAM file, we regard this site as a candidate
genomic variant.
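The rule above lends itself to a short illustration. The sketch below assumes the DNA pileup has already been reduced to per-site counts; the dictionary layout, the function names and the choice of all detected sites as the denominator are assumptions made for this example only.

```python
def is_candidate_genomic_variant(dna_depth, dna_variant_reads,
                                 min_depth=4, min_variant_reads=2):
    """Flag a site whose DNA pileup supports the variant base."""
    return dna_depth >= min_depth and dna_variant_reads >= min_variant_reads


def genomic_variant_percentage(pileup_counts):
    """pileup_counts: dict mapping (chrom, pos) -> (dna_depth, dna_variant_reads).

    Returns the percentage of detected editing sites flagged as candidate
    genomic variants (denominator: all sites passed in, an assumption).
    """
    if not pileup_counts:
        return 0.0
    flagged = sum(1 for depth, var in pileup_counts.values()
                  if is_candidate_genomic_variant(depth, var))
    return 100.0 * flagged / len(pileup_counts)


if __name__ == "__main__":
    counts = {                      # made-up counts for illustration
        ("chr1", 100): (30, 0),
        ("chr1", 200): (12, 6),
        ("chr2", 50): (3, 3),
        ("chr3", 7): (4, 2),
    }
    print(genomic_variant_percentage(counts))
```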
Training data construction
Negative:
1. 150 randomly sampled sites with frequency less than 0.1.
2. 150 randomly sampled sites located in homopolymer regions.
3. 150 randomly sampled sites with fewer than 2 reads supporting the variant
allele.
4. 150 randomly sampled sites which exhibit strand bias.
5. 150 randomly sampled sites located at read ends.
6. 150 randomly sampled sites located in simple repeat regions.
7. 150 randomly sampled sites which do not pass the binomial test.
8. 300 sampled sites which are filtered out by BWA and not validated by Ion
Proton.
9. 1200 randomly sampled sites which are in dbSNP 138.
10. 110 (41139), 47 (6136), 36 (6314), 150 (304952) and 32 (1684) sites
which are not validated by Ion Proton sequencing. The numbers in
brackets correspond to numbers in the Venn diagram of Fig. S2.
Positive:
1. Ion Proton validated sites in 2960 (1334).
2. Ion Proton validated sites in 6314, except those with frequency < 0.1 or not
passing the binomial test (141).
Figure S2. Overlapping detected RNA editing sites among three methods
(RES-Scanner, Peng et al [1], Ramaswami et al [5]).
Adaptation of RES-Scanner
The biggest modification we made to adapt and optimize RES-Scanner on the YH
dataset was the use of a different alignment strategy. By default, RES-Scanner uses BWA
to align RNA-seq reads and BLAT for re-alignment, the same as in Ramaswami et al.
In order to further stabilize our positive training set, which is mostly obtained by
overlapping the RNA editing sites detected by three different methods, we also
incorporated the popular TopHat2 software in our alignment step. Specifically, we
now use TopHat2 to align RNA-seq reads and BWA for re-alignment. We also
slightly adjusted the thresholds of some hard filters; full details are provided in
Table S8.
Feature importance analysis
In order to show the relative importance of each feature, we normalize the magnitude
of each feature on the training set before training an LR classifier. This is accomplished
by scaling the nonnegative features to the range [0, 1] and the remaining
ones to [-1, 1]. The relative importance of each feature can then be viewed
by comparing the absolute values of the weights (details in Table S10), which are
plotted below in Fig. S3.
Figure S3. The significance of features for RNA editing detection by RED-ML.
The features are sorted according to their weights.
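The normalization and ranking can be illustrated with a short Python sketch. The exact scaling formula is not given here, so the sketch assumes min-max scaling for nonnegative features and division by the largest absolute value for the others. The function names, feature names and weights are hypothetical and are not taken from Table S10.

```python
def scale_feature(values):
    """Scale one feature column measured on the training set.

    Nonnegative features are mapped to [0, 1] (min-max, an assumption);
    features that take negative values are mapped to [-1, 1] by dividing
    by the largest absolute value (also an assumption).
    """
    lo, hi = min(values), max(values)
    if lo >= 0:
        span = hi - lo
        return [(v - lo) / span if span else 0.0 for v in values]
    peak = max(abs(lo), abs(hi))     # > 0 because lo < 0
    return [v / peak for v in values]


def rank_by_weight(weights):
    """Sort features by the absolute value of their LR weight, largest first."""
    return sorted(((name, abs(w)) for name, w in weights.items()),
                  key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    print(scale_feature([0, 5, 10]))
    print(scale_feature([-2, 0, 4]))
    # Hypothetical weights of an already trained LR classifier
    print(rank_by_weight({"feature_a": 0.2, "feature_b": -1.5, "feature_c": 0.7}))
```

Putting all features on a comparable scale before training is what makes the absolute weights comparable across features; without it, a weight would also reflect the units of its feature.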
Extra RED-ML results
Results based on Tophat2
Since the current version of RED-ML has also been optimized for TopHat2, we
applied RED-ML to YH, CH24T, CH62T and Hela BAM files aligned by TopHat2.
Basic properties of RNA editing sites detected by RED-ML based on Tophat2 are
listed in Table S5.2, and the results parallel to those for the BWA-aligned BAM files
are shown in the following figures (Figures S4-S8).
Figure S4. Comparison of RED-ML and RES-Scanner (Tophat2). To ensure
a fair comparison, we compared RED-ML and RES-Scanner based on the same
alignment file (Tophat2, because RES-Scanner also accepts Tophat2 alignment
files [6]). a. The numbers of editing sites identified by RED-ML are larger than those
of RES-Scanner (Tophat2) in CH24T, CH62T and Hela. b. The Ion Proton validation
rates of editing sites identified by RED-ML are higher than those by RES-Scanner
(Tophat2) in CH24T, CH62T and Hela. RED-ML has a greater advantage since
TopHat2 has been used in constructing our LR classifier while RES-Scanner has only
been optimized for BWA.
Figure S5. The number of RNA editing sites and Ion Proton validation rate
under different thresholds. As the threshold increases, the numbers of detected
editing sites decrease and the Ion Proton validation rates increase. a, b and c show the
CH24T, CH62T and Hela samples, respectively.
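The kind of threshold sweep summarized in Figure S5 can be sketched as below. This is an illustration only: the input layout (one classifier probability and one validation flag per site), the use of ">=" at the cutoff and the function name are assumptions, and the numbers are made up.

```python
def sweep_thresholds(sites, thresholds):
    """sites: list of (probability, validated) pairs, where probability is
    the classifier output for a candidate site and validated is a bool.

    Returns one (threshold, n_detected, validation_rate) row per threshold;
    the rate is None when no site passes the threshold.
    """
    rows = []
    for t in thresholds:
        called = [ok for p, ok in sites if p >= t]
        rate = sum(called) / len(called) if called else None
        rows.append((t, len(called), rate))
    return rows


if __name__ == "__main__":
    toy_sites = [(0.9, True), (0.8, True), (0.6, False), (0.55, True), (0.3, False)]
    for row in sweep_thresholds(toy_sites, [0.5, 0.7, 0.95]):
        print(row)
```

By construction the number of detected sites can only stay the same or drop as the threshold rises; whether the validation rate rises depends on the data.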
Figure S6. Comparison with known RNA editing databases. (a)-(c): The overlap of
detected sites with two curated RNA editing databases (DARNED and RADAR) in
CH24T, CH62T and Hela samples is shown as Venn diagrams. Significant portions
of RED-ML detected sites are in neither of the existing databases (37.5%, 36.5% and
59.2% for CH24T, CH62T and Hela samples respectively), probably because these
are not normal tissues.
Figure S7. Ion Proton validation rates for different classes of sites (defined in the
main text) in the three samples. The number of Ion Proton validated sites in each
class is also indicated on top of each bar. There are no significant differences in
the validation rates among the categories in all three samples, which demonstrates that
RED-ML performs quite consistently, independent of existing RNA editing
databases.
Figure S8. SNP evaluation. The percentage of genomic variants in detected RNA
editing sites as quantified by matching DNA sequencing data. The percentage of
genomic variants in RED-ML detected RNA editing sites is quite low (no more than
2%).
Other results
Since the method of Ramaswami et al used very loose filters to detect RNA editing
sites in Alu regions, it detected quite a large number of RNA editing sites (355,365),
with the vast majority of them (97%) in Alu regions and a significant number (45,746)
with low editing frequencies (<0.1). RED-ML is not able to detect RNA editing sites
with very low frequencies by design, as explained in the main text. Nonetheless, it is
still possible to adjust RED-ML’s threshold to match the number of RNA editing sites
detected by Ramaswami et al, in which case both methods achieve similar validation rates
(both quite low, <0.6). However, such a threshold (~0.0035) is too small to be used for
an LR classifier in practice. To make the comparison between RED-ML and
Ramaswami et al more meaningful, we checked RNA editing sites detected in
non-Alu regions for both methods, and the results are given in the main text.
STAR [9] with default settings was used to align the RNA-seq data of CH24T.
SAMtools was used to sort the alignment file and to remove the PCR duplicate reads.
Then based on the resulting BAM file, RED-ML was used for RNA editing detection.
A total of 246,879 sites were identified as RNA editing sites using the default
threshold of 0.5, while the Ion Proton validation rate was 0.34. The percentages of
A-to-I editing and editing in Alu regions were 0.79 and 0.63, respectively.
HISAT2 [10] with default settings was used to align the RNA-seq data of CH24T.
SAMtools was used to sort the alignment file and Picard was used to remove the PCR
duplicate reads. Then based on the resulting BAM file, RED-ML was used for RNA
editing detection. A total of 36,374 sites were identified as RNA editing sites, and the
Ion Proton validation rate was 0.85. The percentages of A-to-I editing and editing in
Alu regions were 0.93 and 0.78, respectively.
When RED-ML was applied to ant RNA-seq and DNA-seq data from Li et al [11], we
set Alu features to 0 for all candidate RNA editing sites and detected a total of 15,354
sites. There were 53 sites in Sanger datasets from Li et al [11], and all sites were
validated. However, the proportion of A-to-I editing was only 60.5%.
References
1. Peng Z, Cheng Y, Tan BC, Kang L, Tian Z, Zhu Y, Zhang W, Liang Y, Hu X,
Tan X et al: Comprehensive analysis of RNA-Seq data reveals extensive
RNA editing in a human transcriptome. Nature biotechnology 2012,
30(3):253-260.
2. Picelli S, Faridani OR, Bjorklund AK, Winberg G, Sagasser S, Sandberg R:
Full-length RNA-seq from single cells using Smart-seq2. Nature protocols
2014, 9(1):171-181.
3. Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL: TopHat2:
accurate alignment of transcriptomes in the presence of insertions,
deletions and gene fusions. Genome biology 2013, 14(4):R36.
4. Li H, Durbin R: Fast and accurate short read alignment with
Burrows-Wheeler transform. Bioinformatics 2009, 25(14):1754-1760.
5. Ramaswami G, Lin W, Piskol R, Tan MH, Davis C, Li JB: Accurate
identification of human Alu and non-Alu RNA editing sites. Nature
Methods 2012, 9(6):579-581.
6. Wang Z, Lian J, Li Q, Zhang P, Zhou Y, Zhan X, Zhang G: RES-Scanner: a
software package for genome-wide identification of RNA-editing sites.
GigaScience 2016, 5(1):37.
7. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G,
Abecasis G, Durbin R, Genome Project Data Processing S: The Sequence
Alignment/Map format and SAMtools. Bioinformatics 2009,
25(16):2078-2079.
8. DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C,
Philippakis AA, del Angel G, Rivas MA, Hanna M et al: A framework for
variation discovery and genotyping using next-generation DNA
sequencing data. Nature genetics 2011, 43(5):491-498.
9. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P,
Chaisson M, Gingeras TR: STAR: ultrafast universal RNA-seq aligner.
Bioinformatics 2013, 29(1):15-21.
10. Kim D, Langmead B, Salzberg SL: HISAT: a fast spliced aligner with low
memory requirements. Nat Methods 2015, 12(4):357-360.
11. Li Q, Wang Z, Lian J, Schiott M, Jin L, Zhang P, Zhang Y, Nygaard S, Peng Z,
Zhou Y et al: Caste-specific RNA editomes in the leaf-cutting ant
Acromyrmex echinatior. Nature communications 2014, 5:4943.