Tutorial/Exercise 5, Friday May 14th 2010
Introduction
High throughput sequencing technology is not limited to DNA sequencing, where the goal is, for
example, the discovery of nucleotide variants. It can also be applied to RNA sequencing, where it
can be used both for discovery and for quantification. In contrast to microarrays, which provide
information about relative abundance only, RNA sequencing has the potential to provide
information about absolute abundance as well. In this exercise, we will be mapping and
visualizing RNA-seq data, and identifying spliced reads.
Dataset
For this exercise, we will use an RNA-Seq dataset from the budding yeast, S. cerevisiae. The data
file is:
S_cerevisiae_RNA_Seq.fastq
This file contains ~16 million sequence reads generated from polyA+ purified mRNA.
Approach
For mapping and visualization of the RNA-Seq data, our approach will be to use the bowtie
software to map the sequence reads to the genome, and to generate a sorted bam file, which we
can then visualize in GenomeView. For looking at spliced reads, we will use the tophat software,
which takes reads that cannot be mapped directly to the genome, and looks to see if
they can be mapped to two discontinuous regions of the genome. Today, in the interests of time,
we will not use cufflinks to calculate transcript abundance, but you should look at the cufflinks
website (see the resources section at the end of this document), as well as read the manuscript.
Note that rather than RPKM (reads per kilobase of exon model per million mapped reads), which we
discussed in yesterday's lecture, cufflinks reports FPKM. Can you think why it uses FPKM instead
of RPKM?
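As a reminder of the arithmetic behind RPKM, here is a minimal sketch. It assumes the usual definition (reads mapped to a gene, scaled to a gene length of 1 kilobase and to 1 million mapped reads in total). The function name rpkm and the numbers are made up for illustration and do not come from today's dataset.

```shell
# rpkm READS GENE_LENGTH_BP TOTAL_MAPPED_READS
# RPKM = reads * 10^9 / (gene length in bp * total mapped reads)
rpkm() {
    awk -v c="$1" -v len="$2" -v n="$3" \
        'BEGIN { printf "%.2f\n", (c * 1e9) / (len * n) }'
}

# Hypothetical gene: 500 reads, 2,000 bp long, 10 million mapped reads in total
rpkm 500 2000 10000000    # prints 25.00
```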
Exercise
First, open a new terminal window, and change your location to the directory with the data for
today’s exercise:
cd Desktop/S_cerevisiae_RNA
To begin our analysis, just like yesterday when we were using bwa for mapping, we first have to
create an index on our genome for bowtie to use to map the sequence reads:
/data/software/bowtie-0.12.5/bowtie-build S288c.fa S288c
It will take a minute or two to complete (and you will see a lot of messages scroll up your screen).
When it's finished, list your directory contents. You should see a number of new files with the
suffix ebwt, which will be used by bowtie for the read mapping.
To map the data using bowtie, we will now use the following command (note, this should be
entered on a single line, so only press enter after you have typed the whole command; also note,
your computer might become a little sluggish, as we are asking bowtie to use 2 threads, meaning it
will use both processors, which speeds it up by a factor of up to 2):
/data/software/bowtie-0.12.5/bowtie --solexa1.3-quals --threads 2 --sam S288c S_cerevisiae_RNA_Seq.fastq S_cerevisiae_RNA_Seq.sam
This will take a few minutes. While the above command is running, you should take the
opportunity to look through the web pages for bowtie, tophat and cufflinks, listed in the
resources section of this document.
What percentage of reads mapped? Why might some reads not map?
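Once the sam file exists, one way to check the mapped percentage yourself is to look at the FLAG column (column 2) of each alignment line: in the SAM format, bit 0x4 of the flag is set for an unmapped read. The sketch below assumes that unmapped reads are written to the sam file as well as mapped ones. The function name mapped_fraction and the four-read demo file are made up for illustration.

```shell
# mapped_fraction SAM_FILE: prints "mapped total percent"
mapped_fraction() {
    awk -F '\t' '
        /^@/ { next }                       # skip header lines
        { total++ }
        int($2 / 4) % 2 == 0 { mapped++ }   # FLAG bit 0x4 clear: read is mapped
        END {
            if (total == 0) { print "0 0 0.0"; exit }
            printf "%d %d %.1f\n", mapped, total, 100 * mapped / total
        }' "$1"
}

# Tiny hypothetical sam file: three mapped reads and one unmapped read (flag 4)
demo_sam=$(mktemp)
printf '@HD\tVN:1.0\tSO:unsorted\n' > "$demo_sam"
printf 'read1\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tIIII\n' >> "$demo_sam"
printf 'read2\t16\tchr1\t200\t255\t4M\t*\t0\t0\tACGT\tIIII\n' >> "$demo_sam"
printf 'read3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n' >> "$demo_sam"
printf 'read4\t0\tchr2\t50\t255\t4M\t*\t0\t0\tACGT\tIIII\n' >> "$demo_sam"
mapped_fraction "$demo_sam"    # prints 3 4 75.0
rm -f "$demo_sam"
```

On the real data you would run mapped_fraction S_cerevisiae_RNA_Seq.sam instead.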
When the command has finished running, you will have generated a sam file, which contains the
mapping for the reads in the fastq file. We now want to convert this into an indexed, sorted
bamfile, so that we can, for example, visualize the data:
samtools view -bS -t index.txt -o S_cerevisiae_RNA_Seq.bam S_cerevisiae_RNA_Seq.sam
samtools sort S_cerevisiae_RNA_Seq.bam S_cerevisiae_RNA_Seq_sorted
samtools index S_cerevisiae_RNA_Seq_sorted.bam
Now that you have a sorted bam file, you can visualize the data using GenomeView. As a reminder,
instructions for using GenomeView are at the end of this document. Because this is RNA
sequencing, rather than DNA sequencing, you will have to scroll to the right, away from the end of
the chromosome, before you can see anything useful.
Based on what you observe, what can you say about the RNA sequencing protocol?
Take a look at some genes – for example, go to chromosome_6, position 90,000. What do you
observe here? What do you think it means?
Also on chromosome 6, go to position 54,000. What do you observe about this particular gene?
How does this manifest in the RNA sequencing data?
Browse around the data - if you look at the mapped data for annotated genes, do the reads map only
within the boundaries of the genes? What can you infer based on the answer to this question?
Splice junctions
We are now going to try to find reads that map to splice junctions, using tophat. Tophat uses
bowtie underneath to do initial mapping, and from this, builds a database of possible splice
junctions, and then maps currently unmapped reads against them to confirm their mapping.
TopHat takes longer to run than simply running bowtie, so rather than running tophat on all 16
million reads, which would take too long for this exercise, we will first create a file containing
just a subset of the reads. To do this, we use the unix head command. You give this command the
number of lines that you want, which it takes from the top of the file, and then redirect its
output to a new file. Each read occupies 4 lines in a fastq file, so in this case we ask for
500,000 reads, which are contained in the first 2 million lines of the starting fastq file:
head -2000000 S_cerevisiae_RNA_Seq.fastq > S_cerevisiae_RNA_Seq_trunc.fastq
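To check that a truncated file holds the number of reads you expect, you can count its lines and divide by 4. This is a minimal sketch that assumes every read occupies exactly 4 lines (no wrapped sequence lines). The function name fastq_read_count and the three-read demo file are made up for illustration.

```shell
# fastq_read_count FILE: number of reads, assuming exactly 4 lines per read
fastq_read_count() {
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# Tiny hypothetical fastq file with 3 reads
demo_fq=$(mktemp)
for i in 1 2 3; do
    printf '@read%s\nACGT\n+\nhhhh\n' "$i" >> "$demo_fq"
done

# Keep the first 2 reads, which is the first 2 * 4 = 8 lines
head -n 8 "$demo_fq" > "$demo_fq.trunc"

fastq_read_count "$demo_fq"          # prints 3
fastq_read_count "$demo_fq.trunc"    # prints 2
rm -f "$demo_fq" "$demo_fq.trunc"
```

Running fastq_read_count S_cerevisiae_RNA_Seq_trunc.fastq on the real truncated file should report 500000.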
There is an issue on the machines we are using with the tophat installation, so we have to reinstall
it. First, go to:
http://tophat.cbcb.umd.edu/
and click on the link to TopHat 1.0.13 (BETA), which will download a gzipped tar file with the
source code. Once it is downloaded, right click on the file on the Downloads window in Firefox,
and ask it to “Open Containing Folder”. You should see the file (tophat-1.0.13.tar.gz) in the
window that opens. Drag that file to the Desktop, then right click on it, and ask to “Extract here”.
A tophat-1.0.13 folder should be created. Now, open a new Terminal window and do:
cd Desktop/tophat-1.0.13
then type:
./configure CFLAGS="-m32" CXXFLAGS="-m32"
then type:
make; make install
You should see a bunch of stuff scroll up your screen while the software compiles, which takes a
couple of minutes, at the end of which we should hopefully have a working tophat program. Close
this terminal window, and return to your previous terminal window. We now also have to add the
path to the tophat program, as well as to the bowtie program to our PATH variable:
export PATH=$PATH:/data/software/bowtie-0.12.5/:/home/user/Desktop/tophat-1.0.13/bin/
We can now run tophat itself:
tophat -p 2 --solexa1.3-quals -a 4 -o spliced_reads S288c S_cerevisiae_RNA_Seq_trunc.fastq
We will now want to visualize the data output in the sam file, by creating a sorted and indexed
bam file. However, there is a bug in tophat that means that its output cannot be correctly
processed unless we fix the file by getting rid of a particular header line in it that begins with
'@PG'. To do this, we will use the grep command, in conjunction with the -v option, which tells
grep to exclude lines matching a pattern. We anchor the pattern to the start of the line with ^,
so that reads that merely contain the letters PG somewhere (for example in their quality string)
are kept:
grep -v '^@PG' spliced_reads/accepted_hits.sam > spliced_reads/accepted_hits.fixed.sam
and then rename the file back to the original name:
mv spliced_reads/accepted_hits.fixed.sam spliced_reads/accepted_hits.sam
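The exact pattern given to grep -v matters when filtering sam files. A pattern that is not tied to the start of the line, such as PG, also removes any read whose sequence name or quality string happens to contain those two letters, whereas ^@PG only matches the header line. The small demonstration below uses a made-up four-line sam file; the content of its @PG line is hypothetical and not copied from real tophat output.

```shell
# Tiny hypothetical sam file: two header lines and two reads.
# The quality string of read1 happens to contain the letters "PG".
demo_sam=$(mktemp)
printf '@HD\tVN:1.0\n' > "$demo_sam"
printf '@PG\tID:TopHat\n' >> "$demo_sam"
printf 'read1\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\thhPG\n' >> "$demo_sam"
printf 'read2\t0\tchr1\t200\t255\t4M\t*\t0\t0\tACGT\thhhh\n' >> "$demo_sam"

# -c together with -v counts the lines that would be kept
unanchored=$(grep -v -c 'PG' "$demo_sam")
anchored=$(grep -v -c '^@PG' "$demo_sam")

echo "lines kept with pattern PG:   $unanchored"    # 2 (read1 is lost)
echo "lines kept with pattern ^@PG: $anchored"      # 3 (only the @PG line is dropped)
rm -f "$demo_sam"
```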
Now we can convert the fixed sam file to an indexed, sorted bam file, using the following
commands:
samtools view -bS -t index.txt -o accepted_hits.bam spliced_reads/accepted_hits.sam
samtools sort accepted_hits.bam accepted_hits_sorted
samtools index accepted_hits_sorted.bam
Now look at this file (accepted_hits_sorted.bam.bai) in GenomeView, in conjunction with the other
data that you already loaded - if you go back to chromosome 6, position 54,000, what do you now see
there?
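You can also count spliced alignments directly in a sam file. In the SAM format, the CIGAR string (column 6) uses the letter N for a stretch of the reference that the read skips over, which is how an intron appears in a spliced alignment. The sketch below assumes that spliced reads in the tophat output are marked this way. The function name count_spliced and the demo file are made up for illustration.

```shell
# count_spliced SAM_FILE: number of alignments whose CIGAR string contains N
count_spliced() {
    awk -F '\t' '!/^@/ && $6 ~ /N/ { n++ } END { print n + 0 }' "$1"
}

# Tiny hypothetical sam file: read2 spans a 300 bp gap in the reference
demo_sam=$(mktemp)
printf '@HD\tVN:1.0\n' > "$demo_sam"
printf 'read1\t0\tchr6\t100\t255\t4M\t*\t0\t0\tACGT\thhhh\n' >> "$demo_sam"
printf 'read2\t0\tchr6\t200\t255\t2M300N2M\t*\t0\t0\tACGT\thhhh\n' >> "$demo_sam"
printf 'read3\t16\tchr6\t900\t255\t4M\t*\t0\t0\tACGT\thhhh\n' >> "$demo_sam"
count_spliced "$demo_sam"    # prints 1
rm -f "$demo_sam"
```

On the real data you would run count_spliced spliced_reads/accepted_hits.sam.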
Discussion
There are a lot of tools for analyzing RNA-Seq data, as well as for visualizing the data, and you
should certainly check them out. Note, as we discovered today, because many of these tools are
written by different authors, there are often errors or subtle incompatibilities, and it is often a
case of either trial and error, googling error messages, or emailing the tool authors to solve these
problems (I had to use all three to get past some of the road blocks in preparing this exercise).
Also note, as discussed yesterday, it is really useful to learn a scripting language, such as Perl or
Python, to “glue” together various tools into a pipeline that you can use over and over again. This
not only has the advantage of not needing to type in complex command lines over and over again,
but can also document what parameters you use for the tools. The O’Reilly “Learning Perl” or
“Learning Python” books are a good place to start. Finally, much of this type of software requires
that you use the unix command line. It is definitely useful to learn enough unix to be able to get
around the command line with reasonable ease.
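As a small illustration of the "glue" idea, here is a minimal sketch of a wrapper around the mapping steps from this exercise. It is written as a shell script to match the commands in this handout; the same structure carries over to Perl or Python. The names run, map_reads and DRY_RUN are made up for illustration, and the script assumes that bowtie and samtools are on your PATH and that index.txt is in the current directory. By default it only prints the commands, so that you can check them before running anything.

```shell
#!/bin/sh
# Hypothetical wrapper around the mapping steps used in this exercise.
# With DRY_RUN=1 (the default) commands are printed but not executed.
DRY_RUN=${DRY_RUN:-1}

# run CMD ARGS...: print the command, and execute it unless DRY_RUN is 1
run() {
    echo "+ $*"
    if [ "$DRY_RUN" != "1" ]; then
        "$@" || return 1
    fi
}

# map_reads INDEX FASTQ: map reads, then make a sorted, indexed bam file
map_reads() {
    index=$1                  # bowtie index basename, e.g. S288c
    fastq=$2                  # input reads
    prefix=${fastq%.fastq}    # strip the .fastq suffix to name the outputs
    run bowtie --solexa1.3-quals --threads 2 --sam "$index" "$fastq" "$prefix.sam" &&
    run samtools view -bS -t index.txt -o "$prefix.bam" "$prefix.sam" &&
    run samtools sort "$prefix.bam" "${prefix}_sorted" &&
    run samtools index "${prefix}_sorted.bam"
}

map_reads S288c S_cerevisiae_RNA_Seq.fastq
```

Setting DRY_RUN=0 before calling map_reads executes the commands. Because the parameters are written down in the script, it also serves as a record of how the data were processed.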
Resources
Bowtie: http://bowtie-bio.sourceforge.net/
TopHat: http://tophat.cbcb.umd.edu/
Cufflinks: http://cufflinks.cbcb.umd.edu/
Visualizing Data
GenomeView
In Firefox, go to:
http://genomeview.sourceforge.net/
Click on the "Launch" button on the top left.
In the dialog that appears, click "Ok".
Wait a minute or so (it appears that nothing is happening, but don’t worry, it just takes a little),
then click "Run" on the next dialog.
When the software has opened, in the menus, go to:
File->Load Features
Select “Local file” (it should already be selected), then click OK. Navigate to Desktop/XXX, then
select:
S288c.fasta
Repeat these steps to load:
saccharomyces_cerevisiae.gff
S_cerevisiae_RNA_Seq_sorted.bam.bai
The gff file loads a lot of different "tracks" of annotations. Most of them are not useful for this
exercise, and they take up a lot of space, meaning that you can't really see the sequence data,
which is scrolled off the bottom of the screen. To stop most of these showing up on the display,
you can turn them off by clicking on the green ticks on the panel on the right hand side, which
converts them into red crosses. Click on the green ticks to turn off everything except:
Gene Structure
Ruler
gene
Short reads
As you’re turning these off, you’ll see the sequence read data scroll into view from below.
To navigate the data, you can use the arrow keys. The left and right arrow keys scroll the
display left and right, while the up and down arrow keys zoom in and out.
To switch to a different chromosome, click the “Entry” box at the top, and select a different
chromosome. Explore the data.