MeRIP-PF Manual
Released 2013-01-24
Yuli Li, Shuhui Song et al.
Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing
Contact: [email protected]; [email protected]

CONTENTS
1 OVERVIEW
  1.1 Background
  1.2 Summary of MeRIP-PF
  1.3 Implementation
  1.4 Availability
2 INSTALLATION
  2.1 Perl and R
  2.2 BWA
  2.3 SAMtools
  2.4 BEDTools
3 USAGE
  3.1 Preparing reference files
  3.2 Running MeRIP-PF
    3.2.1 Setting up the Config file
  3.3 Output files format
  3.4 Lite version of MeRIP-PF: MeRIP-PF_Lite
  3.5 Produce wig plots
    3.5.1 Usage

1 OVERVIEW

1.1 Background
Next-generation parallel sequencing has made m6A-specific methylated RNA immunoprecipitation followed by sequencing (MeRIP-Seq) a popular strategy for studying transcriptome-wide RNA modifications, but it also creates challenges for analysis, especially in peak finding. To date, no tools or software have been available for MeRIP-Seq data analysis. Here we present MeRIP-PF, an efficient and easy-to-use analysis pipeline that is publicly available, open source, and developed specifically for MeRIP-Seq peak calling with control samples. MeRIP-PF detects and annotates m6A regions and produces graphical displays that are useful for further study.

1.2 Summary of MeRIP-PF
MeRIP-PF integrates four modules (mapping, testing, annotating and plotting) into one program that completes the whole analysis. The pipeline requires two FASTQ-formatted data files (a control sample and an m6A IP sample), the genome reference sequence of the corresponding species, and several annotated BED files containing gene structure information. It outputs four result files that together give the complete information of the modification profile.

1.3 Implementation
The pipeline is written in Perl and was run on a Linux cluster in which each node has 8 cores with 2.00 GHz processors and 16 GB of RAM. MeRIP-PF requires Perl and R to be installed, together with BWA, SAMtools and BEDTools.

1.4 Availability
The MeRIP-PF package, including an example dataset, is available at http://software.big.ac.cn/MeRIP-PF.html

2 INSTALLATION

2.1 Perl and R
Ensure that Perl is installed. Install R from http://www.r-project.org/
!!!NOTE: Perl and R need to be installed at the root pathway.

2.2 BWA
Download any version of BWA (e.g. bwa-0.6.2.tar.bz2) from http://sourceforge.net/projects/bio-bwa/files/
> bzip2 -d bwa-0.6.2.tar.bz2
> tar -xf bwa-0.6.2.tar
> cd bwa-0.6.2/
> make

2.3 SAMtools
Download the installation file from http://sourceforge.net/projects/samtools/files/samtools/0.1.18/
> bzip2 -d samtools-0.1.18.tar.bz2
> tar -xf samtools-0.1.18.tar
> cd samtools-0.1.18/
> make

2.4 BEDTools
Download the installation file from http://code.google.com/p/bedtools/downloads/list
> gunzip BEDTools.v2.16.2.tar.gz
> tar -xf BEDTools.v2.16.2.tar
> cd BEDTools-Version-2.16.2/
> make clean
> make all

3 USAGE

3.1 Preparing reference files
You need to prepare several BED files. Download them from the UCSC Table Browser at http://genome.ucsc.edu/cgi-bin/hgTables. Human is used as the example species here.
Download settings:
*clade: Mammal
*genome: Human
*assembly: Feb 2009 (GRCh37/hg19)
*group: Genes and Gene Prediction Tracks
*track: RefSeq Gene
*table: refGene
*region: genome
*output format: BED - browser extensible data
*output file: hg19
*file type returned: plain text
Click "get output". Then choose "Whole Gene", "Exons", "Introns", "5' UTR Exons", "Coding Exons" and "3' UTR Exons" in turn, clicking "get BED" and downloading the result each time.
Name the downloaded files "hg19_gene.bed", "hg19_exon.bed", "hg19_intron.bed", "hg19_utr-5.bed", "hg19_cds.bed" and "hg19_utr-3.bed", respectively.
Put "hg19_cds.bed", "hg19_intron.bed", "hg19_utr-5.bed" and "hg19_utr-3.bed" into one directory and name it Protein_Coding/.
Put "hg19_exon.bed" and "hg19_intron.bed" into another directory and name it NonProtein_Coding/.
Finally, change the output format from "BED - browser extensible data" to "all fields from selected table", set the output file name to "hg19_all_field", and download it.
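
The directory layout described above can be set up by hand, or with a short script. The following is a minimal sketch in Python, not part of the MeRIP-PF package; the function names `reference_layout` and `build_reference_dirs` are hypothetical, and it assumes the six BED files have already been downloaded into one directory under the names given above. Files are copied rather than moved because the intron file belongs in both directories.

```python
import shutil
from pathlib import Path


def reference_layout(prefix="hg19"):
    """Map each reference directory to the BED files the manual places in it."""
    return {
        "Protein_Coding": [
            f"{prefix}_cds.bed",
            f"{prefix}_intron.bed",
            f"{prefix}_utr-5.bed",
            f"{prefix}_utr-3.bed",
        ],
        "NonProtein_Coding": [
            f"{prefix}_exon.bed",
            f"{prefix}_intron.bed",
        ],
    }


def build_reference_dirs(download_dir, prefix="hg19"):
    """Create both directories inside download_dir and copy the BED files in.

    Hypothetical helper: assumes the files named in reference_layout()
    already exist directly inside download_dir.
    """
    download_dir = Path(download_dir)
    for dirname, filenames in reference_layout(prefix).items():
        target = download_dir / dirname
        target.mkdir(exist_ok=True)
        for name in filenames:
            # Copy, not move: the intron file is needed in both directories.
            shutil.copy2(download_dir / name, target / name)
```

For a mouse reference, the same sketch would be called with `prefix="mm9"`, matching the file names used in the example config of Section 3.2.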

3.2 Running MeRIP-PF
> perl MeRIP-PF.pl Sample.config

The format of the Sample.config file:
**********INPUT FILES**********
Genome Sequence in Fasta: XXX/mm9/bwa_index/chr19.fa
Fastq File of Sample Control: XXX/DemoCell/demo-ctrl.fq
Fastq File of Sample IP (m6A): XXX/DemoCell/demo-m6a.fq
**********UCSC REFERENCE FILES********
File of Whole Transcripts: XXX/mm9/mm9_gene.bed
Directory of Protein-Coding-Genes Reference: XXX/mm9/Protein_Coding
Directory of NonProtein-Coding-Genes Reference: XXX/mm9/NonProtein_Coding
File of Gene Function: XXX/mm9/mm9_all_field
**********OUTPUT*********
Output Directory: XXX/DemoCell/out
***********OPTIONS**********
Peak Size: 200
Length of Fastq Sequence: 36
Reads Length after Clipping: 36
Fisher's Exact Test Cutoff: 0.05
FDR Cutoff: 0.05
PBS Jobs Running Queue Name: bioque
Tracking Queue Name: bioque
**********ADDITIONAL OPTIONS**********
Directory of BWA Installation: XXX/bwa-0.6.2
Directory of Samtools Installation: XXX/samtools
Directory of BEDTools Installation: XXX/BEDTools-Version-2.16.2/bin
File of 'submit_scripts_to_PBS.pl': XXX/submit_scripts_to_PBS.pl
File of 'Fisher_Test': XXX/Fisher_Test_Genome_Left.pl

!!!NOTE: MeRIP-PF currently analyzes single-end sequencing data only. If you have paired-end data, use either end alone, or combine both ends into one file and treat it as single-end.
Prepare the config file before running the pipeline; its settings are described below.

3.2.1 Setting up the Config file
!!!NOTE: Directories in ADDITIONAL OPTIONS and OUTPUT should be absolute paths.
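
The absolute-path rule is easy to overlook, so it can help to check a config file before submitting jobs. The following Python sketch is one way to do that; it is not part of MeRIP-PF, and the function names `parse_config` and `relative_path_entries` are hypothetical. It assumes only what the sample file shows: section banners made of asterisks and one "Key: value" entry per line. How MeRIP-PF.pl itself parses the file is not documented in this manual.

```python
import os

# Sections whose values the manual says must be absolute paths.
ABSOLUTE_PATH_SECTIONS = ("ADDITIONAL OPTIONS", "OUTPUT")


def parse_config(text):
    """Return {section: {key: value}} from banner lines and 'Key: value' lines."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and '#' explanation lines
        if line.startswith("*"):
            # Banner such as '**********OUTPUT*********' names a section.
            current = line.strip("* ")
            sections.setdefault(current, {})
        elif ":" in line and current is not None:
            key, value = line.split(":", 1)
            sections[current][key.strip()] = value.strip()
    return sections


def relative_path_entries(sections):
    """List (section, key, value) entries that should be absolute but are not."""
    problems = []
    for name in ABSOLUTE_PATH_SECTIONS:
        for key, value in sections.get(name, {}).items():
            if not os.path.isabs(value):
                problems.append((name, key, value))
    return sorted(problems)
```

A typical use would be to read the file with `open("Sample.config").read()`, pass the text to `parse_config`, and refuse to start the run if `relative_path_entries` returns anything.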
**********INPUT FILES**********
Genome Sequence in Fasta: filepath/filename
#filepath/filename: the genome sequence of the species in FASTA format
Fastq File of Sample Control: filepath/filename
#filepath/filename: the FASTQ file of your control sample
Fastq File of Sample IP (m6A): filepath/filename
#filepath/filename: the FASTQ file of your m6A sample
**********UCSC REFERENCE FILES********
File of Whole Transcripts: filepath/filename
#filepath/filename: filepath/hg19_gene.bed
Directory of Protein-Coding-Genes Reference: filepath
#filepath: filepath/Protein_Coding/
Directory of NonProtein-Coding-Genes Reference: filepath
#filepath: filepath/NonProtein_Coding/
File of Gene Function: filepath/filename
#filepath/filename: filepath/hg19_all_field
***********OPTIONS**********
Peak Size: integer
#integer: if your library fragments are ~100 bp long, you can set this option to 200
Length of Fastq Sequence: integer
#integer: the length of your sequencing reads
Fisher's Exact Test Cutoff: float
#float: the p-value cutoff for Fisher's exact test
FDR Cutoff: float
#float: the q-value cutoff for the Benjamini–Hochberg method
PBS Jobs Running Queue Name: QueueName1
#QueueName1: the name of the queue in which PBS jobs run
Tracking Queue Name: QueueName2
#QueueName2: the name of the tracking queue, which may be the same as QueueName1
**********ADDITIONAL OPTIONS**********
Directory of BWA Installation: filepath
#filepath: Path/bwa-0.6.2
Directory of Samtools Installation: filepath
#filepath: Path/samtools-0.1.18
Directory of BEDTools Installation: filepath
#filepath: Path/BEDTools-Version-2.16.2/bin
File of 'submit_scripts_to_PBS.pl': filepath/submit_scripts_to_PBS.pl
#filepath/filename: Path/submit_scripts_to_PBS.pl
**********OUTPUT*********
Output Directory: filepath
#filepath: make a new directory for output files and temp files

3.3 Output files format

FILE1: Reads_Overview.txt
This file summarizes the basic status of the two sequencing datasets, including read mapping status, the transcriptome-wide distribution of m6A peaks, the read distribution among different transcript regions in the Control Sample and the MeRIP Sample respectively, and Control Sample gene expression regardless of reads mapped to junctions.

FILE2: Peak_All.xls
This is a tab-separated file containing the location of each peak.
Column1=Chromosome: chromosome on which the peak resides
Column2=PeakStart: position at which the peak starts
Column3=PeakEnd: position at which the peak ends
Column4=PeakSize: range the peak spans
Column5-Column6-Column7: genomic region with which the peak overlaps
Column8: transcript in which the peak is located

FILE3: Gene_List.xls
This file gives the peak annotation in terms of genes (transcripts).
Column1= GeneID
Column2= Peak_Cnt: count of peaks located in this gene
Column3= Peak_Start: position at which every peak starts
Column4= Peak_Size: range every peak spans
Column5= Peak_Region: genomic region (cds, intergenic, intron, utr3, utr5) in which every peak resides
Column6= Fraction: the fraction of each peak that overlaps the corresponding genomic region
Column7= CL_Rds_Cnt: read count of every peak located in this gene in Sample Control
Column8= CL_RPM: RPM (reads per million) of every peak located in this gene in Sample Control
Column9= IP_Rds_Cnt: read count of every peak located in this gene in Sample m6A
Column10= IP_RPM: RPM (reads per million) of every peak located in this gene in Sample m6A
Column11= Enrichment: the enrichment score of every peak (the ratio of MeRIP sample reads to non-IP sample reads within the area of a peak, each normalized to the number of uniquely mapped reads within the sample)
Column12= Strand: the strand on which the gene is located
Column13= Chr: chromosome on which this gene resides
Column14= GeneName
!!!NOTE: Columns 3-11 are all semicolon-separated when Column2>1.

FILE4: Plot_Fig.pdf
This file shows the transcriptome-wide distribution of m6A peaks (Figure 1A).
Pie charts show the percentage of non-IP reads (top) and m6A peaks (bottom) within distinct regions of RNA; NP stands for non-protein-coding genes, while PR stands for protein-coding genes.
It also shows the distribution of m6A peaks along mRNA (Figure 1B). The 5'UTRs, CDSs and 3'UTRs of every transcript are separately binned into regions spanning 1% of their total lengths; the y-coordinates represent the percentage of m6A peaks located in each bin.
Finally, it shows the correlation between gene expression level and m6A peak enrichment (Figure 1C). The peak enrichment value is plotted against the abundance of the transcript within the input RNA.

[Figure 1 The output file of "Plot_Fig.pdf". Panel A: pie charts of reads in the Control Sample and of m6A peaks across 5'UTR, CDS, 3'UTR, PR_Intronic, NP_Exonic, NP_Intronic and Intergenic regions. Panel B: percentage of m6A peaks (%) along the 5'UTR, CDS and 3'UTR. Panel C: peak enrichment versus RPKM.]

3.4 Lite version of MeRIP-PF: MeRIP-PF_Lite
This version takes BED files as input, which means that you can use mapping software other than BWA, such as Bowtie 2, TopHat 2 or GSNAP.
Usage:
> perl MeRIP-PF_Lite.pl Config_Lite N1 N2
N1: the total number of raw reads in the control sample.
N2: the total number of raw reads in the MeRIP sample.
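
If N1 and N2 are not already known from the sequencing report, they can be counted from the original FASTQ files. The following Python sketch shows one way; the function name `count_fastq_reads` is hypothetical and not part of MeRIP-PF. It assumes the common four-lines-per-read FASTQ layout with unwrapped sequence and quality lines, and it ignores blank lines.

```python
def count_fastq_reads(lines):
    """Count reads in an iterable of FASTQ lines (for example an open file).

    Assumes four lines per read: header, sequence, '+', quality.
    Blank lines are ignored; wrapped (multi-line) records are not supported.
    """
    n_lines = sum(1 for line in lines if line.strip())
    if n_lines % 4 != 0:
        raise ValueError(
            f"expected a multiple of 4 non-blank lines, found {n_lines}"
        )
    return n_lines // 4


# Example use with the demo file names from the sample config (paths assumed):
#   with open("demo-ctrl.fq") as handle:
#       n1 = count_fastq_reads(handle)
#   with open("demo-m6a.fq") as handle:
#       n2 = count_fastq_reads(handle)
```

The two counts are then passed on the command line in the order shown in the usage line above.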
The format of Config_Lite is as follows:
**********INPUT FILES**********
Genome Sequence in Fasta: XXX/mm9/bwa_index/chr19.fa
Bed File of Sample Control: XXX/DemoCell/demo-ctrl.bed
Bed File of Sample IP (m6A): XXX/DemoCell/demo-m6a.bed
**********UCSC REFERENCE FILES********
File of Whole Transcripts: XXX/mm9/mm9_gene.bed
Directory of Protein-Coding-Genes Reference: XXX/mm9/Protein_Coding
Directory of NonProtein-Coding-Genes Reference: XXX/mm9/NonProtein_Coding
File of Gene Function: XXX/mm9/mm9_all_field
**********OUTPUT*********
Output Directory: XXX/DemoCell/out
***********OPTIONS**********
Peak Size: 200
Length of Fastq Sequence: 36
Fisher's Exact Test Cutoff: 0.05
FDR Cutoff: 0.05
PBS Jobs Running Queue Name: bioque
Tracking Queue Name: bioque
**********ADDITIONAL OPTIONS**********
Directory of BWA Installation: XXX/bwa-0.6.2
Directory of Samtools Installation: XXX/samtools
Directory of BEDTools Installation: XXX/BEDTools-Version-2.16.2/bin
File of 'submit_scripts_to_PBS.pl': XXX/submit_scripts_to_PBS.pl
File of 'Fisher_Test': XXX/Fisher_Test_Genome_Left.pl

3.5 Produce wig plots
We also provide a program that produces wig plots of all transcripts with m6A peaks in a single run, which is helpful for further study. Figure 2 shows an example of transcripts in a wig plot. The y-coordinates show the read coverage at every position in the transcript. Different rectangles stand for different regions of the transcript, and blank ones are intronic regions. A red triangle indicates the peak position.

[Figure 2 Wig plots]

3.5.1 Usage
> perl bed2wig_plot.pl <OPTIONS>
OPTIONS:
-d1 the output directory
-d2 the MeRIP-PF output directory
-pm1 the program of submit_to_PBS.pl
-pm2 the program of bed2wig_per_gene.pl
-que the queue
-tool the absolute path to BEDTools (XXX/BEDTools-Version-2.16.2/bin/)
!!!NOTE: All of the options above are required for wig plotting.
In the wig plots, the x-coordinates are the absolute positions of the transcripts, and the y-coordinates are the read coverage of each base.
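
To make the y-axis of the wig plots concrete, the following Python sketch computes per-base read coverage over a region from a list of mapped read intervals. It illustrates the general idea only; how bed2wig_plot.pl and bed2wig_per_gene.pl compute coverage internally is not described in this manual. The function name `per_base_coverage` is hypothetical, and the intervals are assumed to use BED-style coordinates (0-based start, exclusive end).

```python
def per_base_coverage(reads, region_start, region_end):
    """Return a list with the number of reads covering each base of a region.

    reads: iterable of (start, end) pairs, 0-based start and exclusive end.
    The result has one entry per base from region_start to region_end - 1.
    """
    length = region_end - region_start
    # Difference array: +1 where a read begins, -1 just past where it ends.
    diff = [0] * (length + 1)
    for start, end in reads:
        lo = max(start, region_start) - region_start
        hi = min(end, region_end) - region_start
        if lo < hi:  # the read overlaps the region
            diff[lo] += 1
            diff[hi] -= 1
    coverage = []
    running = 0
    for delta in diff[:length]:
        running += delta
        coverage.append(running)
    return coverage
```

Plotting the returned list against positions `region_start` to `region_end - 1` gives a coverage track of the kind shown in Figure 2. The difference-array approach is used here because it visits each read once rather than incrementing every covered base.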