MeRIP-PF Manual
Released 2013-01-24
Yuli Li, Shuhui Song et al.
Beijing Institute of Genomics,
Chinese Academy of Sciences, Beijing
Contact: [email protected]; [email protected]
1 OVERVIEW
1.1 Background
1.2 Summary of MeRIP-PF
1.3 Implementation
1.4 Availability
2 INSTALLATION
2.1 Perl and R
2.2 BWA
2.3 SAMtools
2.4 BEDTools
3 USAGE
3.1 Preparing reference files
3.2 Running MeRIP-PF
3.2.1 Setting up the Config file
3.3 Output files format
3.4 Lite version of MeRIP-PF: MeRIP-PF_Lite
3.5 Produce wig plots
3.5.1 Usage
1 OVERVIEW
1.1 Background
Next-generation sequencing technologies have made m6A-specific methylated RNA
immunoprecipitation followed by sequencing (MeRIP-Seq) a popular strategy for studying
transcriptome-wide RNA modifications, but they also create challenges for analysis, especially in
peak finding. So far, no tools or software have been available for MeRIP-Seq data analysis.
Here, we present MeRIP-PF, an efficient and easy-to-use analysis pipeline that is publicly
available, open source and developed specifically for MeRIP-Seq peak calling with control
samples. MeRIP-PF detects and annotates m6A regions and produces graphical displays that
are useful for further study.
1.2 Summary of MeRIP-PF
MeRIP-PF integrates four modules (mapping, testing, annotating and plotting) into one program
that completes the whole analysis. The pipeline requires two FASTQ files (the control and IP
samples), the genome reference sequence of the corresponding species and several annotated
BED files with the gene structure information. It outputs four result files that together give the
complete information of the modification profile.
1.3 Implementation
The pipeline is written in Perl and was run on a Linux cluster in which each node has 8 cores
with 2.00 GHz processors and 16 GB of RAM. MeRIP-PF requires Perl, R, BWA, SAMtools and
BEDTools to be installed.
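The dependencies listed above can be checked before a run. The following is only a sketch: the executable names (for example "Rscript", or "intersectBed" as one of the BEDTools binaries) are assumptions, since this manual does not state which binaries MeRIP-PF calls.

```python
import os
import shutil


def find_tool(name, install_dir=None):
    """Return the path of an executable, or None if it cannot be found.

    If install_dir is given, look only in that directory (similar to the
    installation directories set in the config file); otherwise search PATH.
    """
    if install_dir is not None:
        candidate = os.path.join(install_dir, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
        return None
    return shutil.which(name)


if __name__ == "__main__":
    # Hypothetical list of executables; adjust it to your installation.
    for tool in ("perl", "Rscript", "bwa", "samtools", "intersectBed"):
        print(f"{tool:14s} {find_tool(tool) or 'NOT FOUND'}")
```

Passing an installation directory, e.g. find_tool("bwa", "XXX/bwa-0.6.2"), checks a tool that was built locally with make but not added to PATH.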
1.4 Availability
The MeRIP-PF package including an example dataset is available at
http://software.big.ac.cn/MeRIP-PF.html
2 INSTALLATION
2.1 Perl and R
Ensure that Perl is installed.
Install R from http://www.r-project.org/
!!!Note: Perl and R need to be installed at the root pathway.
2.2 BWA
Download any version of BWA (e.g. bwa-0.6.2.tar.bz2) from
http://sourceforge.net/projects/bio-bwa/files/
> bzip2 -d bwa-0.6.2.tar.bz2
> tar -xf bwa-0.6.2.tar
> cd bwa-0.6.2/
> make
2.3 SAMtools
Download the installation file from http://sourceforge.net/projects/samtools/files/samtools/0.1.18/
> bzip2 -d samtools-0.1.18.tar.bz2
> tar -xf samtools-0.1.18.tar
> cd samtools-0.1.18/
> make
2.4 BEDTools
Download the installation file from http://code.google.com/p/bedtools/downloads/list
> gunzip BEDTools.v2.16.2.tar.gz
> tar -xf BEDTools.v2.16.2.tar
> cd BEDTools-Version-2.16.2/
> make clean
> make all
3 USAGE
3.1 Preparing reference files
You need to prepare several BED files. Download them from the UCSC Table Browser at
http://genome.ucsc.edu/cgi-bin/hgTables; human is used as the example here.

Download settings:


*clade: Mammal
*genome: Human
*assembly: Feb 2009(GRCh37/hg19)
*group: Genes and Gene Prediction Tracks
*track: RefSeq Gene
*table: refGene
*region: genome
*output format: BED-browser extensible data
*output file: hg19
*file type returned: plain text
Click "get output".
Then choose "Whole Gene", "Exons", "Introns", "5' UTR Exons", "Coding Exons" and "3' UTR
Exons" in turn, clicking "get BED" each time to download the file.

Name the files "hg19_gene.bed", "hg19_exon.bed", "hg19_intron.bed", "hg19_utr-5.bed",
"hg19_cds.bed" and "hg19_utr-3.bed", respectively.
Put "hg19_cds.bed", "hg19_intron.bed", "hg19_utr-5.bed" and "hg19_utr-3.bed" into one
directory and name it Protein_Coding/.
Put "hg19_exon.bed" and "hg19_intron.bed" into another directory and name it
NonProtein_Coding/.
Finally, change the output format from "BED-browser extensible data" to "all fields from
selected table", set the output file name to "hg19_all_field" and download it.
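The file layout described above can be verified with a short script before running the pipeline. This is a sketch under one assumption taken from the sample config in section 3.2: the gene BED file and the all-fields file sit directly in the reference directory, next to Protein_Coding/ and NonProtein_Coding/. The function name and the prefix argument are hypothetical.

```python
import os

# Expected files per sub-directory; "{p}" is the assembly prefix (e.g. hg19).
EXPECTED = {
    "": ["{p}_gene.bed", "{p}_all_field"],
    "Protein_Coding": ["{p}_cds.bed", "{p}_intron.bed",
                       "{p}_utr-5.bed", "{p}_utr-3.bed"],
    "NonProtein_Coding": ["{p}_exon.bed", "{p}_intron.bed"],
}


def missing_reference_files(ref_dir, prefix="hg19"):
    """Return the paths of expected reference files that do not exist."""
    missing = []
    for subdir, names in EXPECTED.items():
        for name in names:
            path = os.path.join(ref_dir, subdir, name.format(p=prefix))
            if not os.path.isfile(path):
                missing.append(path)
    return missing
```

An empty list means every file named in this section is in place; otherwise the returned paths show what still has to be downloaded or moved.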
3.2 Running MeRIP-PF
perl MeRIP-PF.pl Sample.config
The format of Sample.config file:
**********INPUT FILES**********
Genome Sequence in Fasta: XXX/mm9/bwa_index/chr19.fa
Fastq File of Sample Control: XXX/DemoCell/demo-ctrl.fq
Fastq File of Sample IP (m6A): XXX/DemoCell/demo-m6a.fq
**********UCSC REFERENCE FILES********
File of Whole Transcripts: XXX/mm9/mm9_gene.bed
Directory of Protein-Coding-Genes Reference: XXX/mm9/Protein_Coding
Directory of NonProtein-Coding-Genes Reference: XXX/mm9/NonProtein_Coding
File of Gene Function: XXX/mm9/mm9_all_field
**********OUTPUT*********
Output Directory: XXX/DemoCell/out
***********OPTIONS**********
Peak Size: 200
Length of Fastq Sequence: 36
Reads Length after Clipping: 36
Fisher's Exact Test Cutoff: 0.05
FDR Cutoff: 0.05
PBS Jobs Running Queue Name: bioque
Tracking Queue Name: bioque
**********ADDITIONAL OPTIONS**********
Directory of BWA Installation: XXX/bwa-0.6.2
Directory of Samtools Installation: XXX/samtools
Directory of BEDTools Installation: XXX/BEDTools-Version-2.16.2/bin
File of 'submit_scripts_to_PBS.pl': XXX/submit_scripts_to_PBS.pl
File of 'Fisher_Test': XXX/Fisher_Test_Genome_Left.pl
!!!NOTE:

MeRIP-PF analyzes single-end sequencing data only. If you have paired-end data, you can
use either end alone or combine both ends into one file and treat it as single-end data.

Prepare the config file first; how to set it up is shown below.
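The Sample.config layout above is a list of "Key: value" lines grouped under starred banners. The sketch below reads such a file into a dictionary, for example to check the settings before submitting a run. It is not how MeRIP-PF.pl itself parses the file (that is not documented here), and the function name is hypothetical.

```python
def parse_config(text):
    """Parse 'Key: value' lines of a MeRIP-PF-style config into a dict."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, '*****SECTION*****' banners, '#' comments
        # and '!!!NOTE' lines.
        if not line or line[0] in "*#!":
            continue
        key, sep, value = line.partition(":")
        if sep:  # ignore lines that contain no colon at all
            settings[key.strip()] = value.strip()
    return settings


if __name__ == "__main__":
    example = """\
**********OUTPUT*********
Output Directory: XXX/DemoCell/out
***********OPTIONS**********
Peak Size: 200
FDR Cutoff: 0.05
"""
    for key, value in parse_config(example).items():
        print(f"{key!r} -> {value!r}")
```

All values are kept as strings; converting "Peak Size" to an integer or the cutoffs to floats is left to the caller.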
3.2.1 Setting up the Config file
!!!NOTE: Directories in ADDITIONAL OPTIONS and OUTPUT should be absolute paths.
**********INPUT FILES**********
Genome Sequence in Fasta: filepath/filename
#filepath/filename: the genome sequence of the species in the format of fasta
Fastq File of Sample Control: filepath/filename
#filepath/filename: the fastq file of your control sample
Fastq File of Sample IP (m6A): filepath/filename
#filepath/filename: the fastq file of your m6A sample
**********UCSC REFERENCE FILES********
File of Whole Transcripts: filepath/filename
#filepath/filename: filepath/hg19_gene.bed
Directory of Protein-Coding-Genes Reference: filepath
#filepath: filepath/Protein_Coding/
Directory of NonProtein-Coding-Genes Reference: filepath
#filepath: filepath/NonProtein_Coding/
File of Gene Function: filepath/filename
#filepath/filename: filepath/hg19_all_field
***********OPTIONS**********
Peak Size: integer
#integer: if the length of your library fragments is ~100 bp, you can set this option to 200.
Length of Fastq Sequence: integer
#integer: the length of your sequencing reads
Fisher's Exact Test Cutoff: float
#float: the p-value cutoff for Fisher's exact test
FDR Cutoff: float
#float: the q-value cutoff (Benjamini–Hochberg method)
PBS Jobs Running Queue Name: QueueName1
#QueueName1: the name of the PBS queue in which the jobs run
Tracking Queue Name: QueueName2
#QueueName2: the name of the tracking queue, which can be the same as QueueName1
**********ADDITIONAL OPTIONS**********
Directory of BWA Installation: filepath
#filepath: Path/bwa-0.6.2
Directory of Samtools Installation: filepath
#filepath: Path/samtools-0.1.18
Directory of BEDTools Installation: filepath
#filepath: Path/BEDTools-Version-2.16.2
File of 'submit_scripts_to_PBS.pl': filepath/submit_scripts_to_PBS.pl
#filepath/filename: Path/submit_scripts_to_PBS.pl
**********OUTPUT*********
Output Directory: filepath
#filepath: make a new directory for output files and temp files
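The two cutoffs in OPTIONS refer to a p-value from Fisher's exact test and a q-value from the Benjamini–Hochberg method. The sketch below shows both calculations with the standard library only. How MeRIP-PF builds the 2x2 table for each window is not described in this manual; the layout used here (IP and control reads inside versus outside a window) and the one-sided test are assumptions.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null with fixed margins.
    Assumed layout: a = IP reads in the window, b = IP reads elsewhere,
    c = control reads in the window, d = control reads elsewhere.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, upper + 1))
    return tail / comb(n, col1)


def benjamini_hochberg(pvalues):
    """Return BH-adjusted q-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues
```

With the sample settings, a window would be kept when its p-value is at most 0.05 and its q-value is at most 0.05. The exact binomial sums become slow for genome-scale read counts; a statistics library such as SciPy (scipy.stats.fisher_exact) is the practical choice there.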
3.3 Output files format
FILE1: Reads_Overview.txt
This file summarizes the basic statistics of the two sequencing datasets: read mapping status,
the transcriptome-wide distribution of m6A peaks, the distribution of reads among different
transcript regions in the Control Sample and the MeRIP Sample, and gene expression in the
Control Sample, regardless of reads mapped to junctions.
FILE2: Peak_All.xls
This tab-separated file contains the location of each peak.
Column1=Chromosome: chromosome on which the peak resides
Column2=PeakStart: position at which the peak starts
Column3=PeakEnd: position at which the peak ends
Column4=PeakSize: length of the region the peak spans
Column5-Column6-Column7: genomic region with which the peak overlaps
Column8: transcript in which the peak is located
FILE3: Gene_List.xls
This file gives the peak annotation in terms of genes (transcripts).
Column1= GeneID
Column2= Peak_Cnt: number of peaks located in this gene
Column3= Peak_Start: position at which each peak starts
Column4= Peak_Size: length of the region each peak spans
Column5= Peak_Region: genomic region (cds, intergenic, intron, utr3, utr5) in which each
peak resides
Column6= Fraction: the fraction by which each peak overlaps the corresponding genomic region
Column7= CL_Rds_Cnt: read count of each peak located in this gene in Sample Control
Column8= CL_RPM: RPM (reads per million) of each peak located in this gene in Sample Control
Column9= IP_Rds_Cnt: read count of each peak located in this gene in Sample m6A
Column10= IP_RPM: RPM (reads per million) of each peak located in this gene in Sample m6A
Column11= Enrichment: the enrichment score of each peak (the ratio of MeRIP sample reads to
non-IP sample reads within the area of a peak, each normalized to the number of
uniquely mapped reads within the sample)
Column12= Strand: the strand on which the gene is located
Column13= Chr: chromosome on which this gene resides
Column14= GeneName
!!!NOTE: Columns 3-11 are all semicolon-separated when Column2>1.
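Because Columns 3-11 hold one semicolon-separated value per peak, a Gene_List.xls row has to be expanded before per-peak analysis. The sketch below does that and also writes out one reading of the Column11 description. It assumes tab-separated rows with no header line; the function names are hypothetical, and the enrichment formula is an interpretation of the text, not code taken from MeRIP-PF.

```python
# Names of the per-peak columns (Columns 3-11), in file order.
PER_PEAK = ["Peak_Start", "Peak_Size", "Peak_Region", "Fraction",
            "CL_Rds_Cnt", "CL_RPM", "IP_Rds_Cnt", "IP_RPM", "Enrichment"]


def expand_peaks(line):
    """Split one tab-separated Gene_List.xls row into one dict per peak."""
    fields = line.rstrip("\n").split("\t")
    gene_id, peak_cnt = fields[0], int(fields[1])
    columns = [field.split(";") for field in fields[2:11]]  # Columns 3-11
    peaks = []
    for k in range(peak_cnt):
        peak = {"GeneID": gene_id}
        for name, values in zip(PER_PEAK, columns):
            peak[name] = values[k]  # values are kept as strings
        peaks.append(peak)
    return peaks


def enrichment(ip_reads, ip_mapped, ctrl_reads, ctrl_mapped):
    """IP-to-control read ratio within a peak, each count scaled by the
    number of uniquely mapped reads in its own sample (assumed formula).
    Raises ZeroDivisionError when the control has no reads in the peak."""
    return (ip_reads / ip_mapped) / (ctrl_reads / ctrl_mapped)
```

For a gene with Peak_Cnt equal to 1 the fields contain no semicolons and expand_peaks returns a single record.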
FILE4: Plot_Fig.pdf
This file shows

Transcriptome-wide distribution of m6A peaks (Figure 1A). Pie charts show the percentage of
non-IP reads (top) and m6A peaks (bottom) within distinct regions of RNA; NP stands for
non-protein coding genes, while PR stands for protein coding genes.

Distribution of m6A peaks along mRNA (Figure 1B). 5’UTRs, CDSs and 3’UTRs of every
transcript are separately binned into regions spanning 1% of their total lengths;
Y-coordinates represent percentage of m6A peaks located in every bin.

Correlation between gene expression level and m6A peak enrichment (Figure 1C). Plotted is
the peak enrichment value relative to the abundance of the transcript within the input RNA.
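The binning used for Figure 1B can be illustrated as follows. This is a simplified sketch: it treats each region (5'UTR, CDS or 3'UTR) as one contiguous plus-strand interval and assigns a peak by a single position, whereas real transcripts are spliced and stranded. The function names are hypothetical.

```python
def bin_index(position, region_start, region_end, n_bins=100):
    """Return the 0-based bin of `position` within [region_start, region_end).

    With n_bins=100, each bin spans 1% of the region length.
    """
    length = region_end - region_start
    if length <= 0 or not (region_start <= position < region_end):
        raise ValueError("position is outside the region")
    return (position - region_start) * n_bins // length


def peak_percentages(bin_indices, n_bins=100):
    """Percentage of peaks that fall into each bin."""
    counts = [0] * n_bins
    for b in bin_indices:
        counts[b] += 1
    total = len(bin_indices)
    return [100.0 * c / total if total else 0.0 for c in counts]
```

Plotting the three lists for 5'UTR, CDS and 3'UTR side by side gives a profile of the kind described for Figure 1B.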
[Figure 1, panels A-C. A: pie charts of reads in the Control Sample and of m6A peaks across the 5'UTR, CDS, 3'UTR, PR_Intronic, NP_Exonic, NP_Intronic and Intergenic regions. B: percentage of m6A peaks (%) along the 5'UTR, CDS and 3'UTR. C: peak enrichment plotted against RPKM.]
Figure 1 The output file of “Plot_Fig.pdf”.
3.4 Lite version of MeRIP-PF: MeRIP-PF_Lite
In this version, you can take BED files as input, which means that you can use mapping
software other than BWA, such as Bowtie 2, TopHat 2 or GSNAP.
Usage:
perl MeRIP-PF_Lite.pl Config_Lite N1 N2
N1: the number of total raw reads in control sample.
N2: the number of total raw reads in MeRIP sample.
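If the original FASTQ files are still available, N1 and N2 can be obtained by counting records. The sketch below assumes the usual four-line FASTQ record layout, with optional gzip compression; the function name is hypothetical.

```python
import gzip


def count_fastq_reads(path):
    """Count reads in a FASTQ file, assuming four lines per record."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4
```

The two counts are then passed on the command line in the order shown above, control sample first and MeRIP sample second.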
The format of Config_Lite is as follows:
**********INPUT FILES**********
Genome Sequence in Fasta: XXX/mm9/bwa_index/chr19.fa
Bed File of Sample Control: XXX/DemoCell/demo-ctrl.bed
Bed File of Sample IP (m6A): XXX/DemoCell/demo-m6a.bed
**********UCSC REFERENCE FILES********
File of Whole Transcripts: XXX/mm9/mm9_gene.bed
Directory of Protein-Coding-Genes Reference: XXX/mm9/Protein_Coding
Directory of NonProtein-Coding-Genes Reference: XXX/mm9/NonProtein_Coding
File of Gene Function: XXX/mm9/mm9_all_field
**********OUTPUT*********
Output Directory: XXX/DemoCell/out
***********OPTIONS**********
Peak Size: 200
Length of Fastq Sequence: 36
Fisher's Exact Test Cutoff: 0.05
FDR Cutoff: 0.05
PBS Jobs Running Queue Name: bioque
Tracking Queue Name: bioque
**********ADDITIONAL OPTIONS**********
Directory of BWA Installation: XXX/bwa-0.6.2
Directory of Samtools Installation: XXX/samtools
Directory of BEDTools Installation: XXX/BEDTools-Version-2.16.2/bin
File of 'submit_scripts_to_PBS.pl': XXX/submit_scripts_to_PBS.pl
File of 'Fisher_Test': XXX/Fisher_Test_Genome_Left.pl
3.5 Produce wig plots
We also provide a program that produces wig plots of all transcripts with m6A peaks at once,
which is helpful for further study.
Figure 2 shows

An example of a transcript in a wig plot. Y-coordinates show the read coverage of every
position in the transcript. Different rectangles stand for different regions of the transcript,
and blank ones are intronic regions. The red triangle indicates the peak position.
Fig 2 Wig plots
3.5.1 Usage
perl bed2wig_plot.pl <OPTIONS>
OPTIONS:
-d1    the output directory
-d2    the MeRIP-PF output directory
-pm1   the program of submit_to_PBS.pl
-pm2   the program of bed2wig_per_gene.pl
-que   the queue
-tool  BEDTools absolute pathway (XXX/BEDTools-Version-2.16.2/bin/)
!!!NOTE:

All of the options above are required for wig plotting.

In wig plots, the x-coordinates are the absolute positions of transcripts and the
y-coordinates are the read coverage of each base.
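The y-values of a wig plot are per-base read depths. The sketch below computes them from BED-style read intervals with a difference array. It illustrates the idea only; bed2wig_plot.pl relies on BEDTools for this step, and its exact procedure is not documented here. The function name is hypothetical.

```python
def coverage(reads, start, end):
    """Per-base read depth over [start, end) from (start, end) read intervals.

    BED intervals are 0-based and half-open, so a read (10, 46) covers
    positions 10..45.
    """
    diff = [0] * (end - start + 1)
    for r_start, r_end in reads:
        # Clip each read to the plotted window.
        lo, hi = max(r_start, start), min(r_end, end)
        if lo < hi:
            diff[lo - start] += 1
            diff[hi - start] -= 1
    depth, running = [], 0
    for delta in diff[:-1]:
        running += delta
        depth.append(running)
    return depth
```

The returned list has one entry per base of the window, so entry i corresponds to absolute position start + i on the x-axis.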