Tutorial/Exercise 5, Friday May 14th 2010

Introduction

High throughput sequencing technology is used not only for DNA sequencing, for example to discover nucleotide variants, but also for RNA sequencing, where it can be used for both discovery and quantification. In contrast to microarrays, which provide information only about relative abundance, RNA sequencing has the potential to provide information about absolute abundance as well. In this exercise, we will map and visualize RNA-Seq data, and identify reads that span splice junctions.

Dataset

For this exercise, we will use an RNA-Seq dataset from the budding yeast, S. cerevisiae. The data file is:

    S_cerevisiae_RNA_Seq.fastq

This file contains ~16 million sequence reads generated from polyA+ purified mRNA.

Approach

For mapping and visualization of the RNA-Seq data, we will use the bowtie software to map the sequence reads to the genome and generate a sorted bamfile, which we can then visualize in GenomeView.

For looking at splice reads, we will use the tophat software, which takes reads that cannot be mapped directly to the genome and checks whether they can instead be mapped to two discontinuous regions of the genome.

Today, in the interests of time, we will not use cufflinks to calculate transcript abundance, but you should look at the cufflinks website (see the resources section at the end of this document) and read the manuscript. Note that rather than RPKM, which we discussed in yesterday’s lecture, cufflinks uses FPKM. Can you think why they use FPKM instead of RPKM?
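Before mapping, it can be worth confirming the size of the dataset described above. The following is a minimal sketch, not part of the original exercise: the function name count_fastq_reads is made up for illustration, and it assumes the fastq file uses exactly four lines per read (no wrapped sequence or quality lines).

```shell
#!/bin/sh
# count_fastq_reads FILE
# Hypothetical helper: prints the number of reads in a fastq file,
# assuming each read occupies exactly four lines.
count_fastq_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Example, run from the directory that holds the data:
#   count_fastq_reads S_cerevisiae_RNA_Seq.fastq
```

For the file used here, the result should be close to the ~16 million reads quoted in the dataset description.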
Exercise

First, open a new terminal window, and change to the directory with the data for today’s exercise:

    cd Desktop/S_cerevisiae_RNA

To begin our analysis, just like yesterday when we were using bwa for mapping, we first have to create an index of our genome for bowtie to use when mapping the sequence reads:

    /data/software/bowtie-0.12.5/bowtie-build S288c.fa S288c

It will take a minute or two to complete (and you will see a lot of messages scroll up your screen). When it’s finished, list your directory contents. You should see a number of new files with the suffix ebwt, which bowtie will use for the read mapping.

To map the data using bowtie, use the following command. Note that it should be entered on a single line: only press enter after you have typed the whole command. Also note that your computer might become a little sluggish, because we are asking bowtie to use 2 threads, meaning it will use both processors, which should make it roughly twice as fast:

    /data/software/bowtie-0.12.5/bowtie --solexa1.3-quals --threads 2 --sam S288c S_cerevisiae_RNA_Seq.fastq S_cerevisiae_RNA_Seq.sam

This will take a few minutes. While the command is running, take the opportunity to look through the web pages for bowtie, tophat and cufflinks, listed in the resources section of this document.

What percentage of reads mapped? Why might some reads not map?

When the command has finished running, you will have generated a sam file, which contains the mapping for the reads in the fastq file.
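One way to approach the question about the percentage of mapped reads is to count alignment lines in the sam file directly. The sketch below is an illustration rather than part of the original exercise: percent_mapped is a made-up helper name. It relies on the SAM convention that header lines start with @ and that bit 0x4 of the FLAG field (column 2) marks an unmapped read, and it assumes one alignment line per read, as for single-end reads with one reported alignment each.

```shell
#!/bin/sh
# percent_mapped SAM_FILE
# Hypothetical helper: reports how many alignment lines are flagged as mapped.
percent_mapped() {
    awk -F '\t' '
        /^@/ { next }                        # skip header lines
        { total++ }                          # every other line is one read
        int($2 / 4) % 2 == 0 { mapped++ }    # FLAG bit 0x4 clear: read is mapped
        END {
            if (total > 0)
                printf "%d of %d reads mapped (%.1f%%)\n", mapped, total, 100 * mapped / total
            else
                print "no reads found"
        }
    ' "$1"
}

# Example:
#   percent_mapped S_cerevisiae_RNA_Seq.sam
```

Bowtie should also print its own summary of processed and aligned reads when it finishes, which can be compared against this count.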
We now want to convert this into an indexed, sorted bamfile, so that we can, for example, visualize the data:

    samtools view -bS -t index.txt -o S_cerevisiae_RNA_Seq.bam S_cerevisiae_RNA_Seq.sam
    samtools sort S_cerevisiae_RNA_Seq.bam S_cerevisiae_RNA_Seq_sorted
    samtools index S_cerevisiae_RNA_Seq_sorted.bam

Now that you have a sorted bamfile, you can visualize the data using GenomeView. As a reminder, there are instructions for using GenomeView at the end of this document.

Because this is RNA sequencing, rather than DNA sequencing, you will have to scroll to the right, away from the end of the chromosome, before you can see anything useful. Based on what you observe, what can you say about the RNA sequencing protocol?

Take a look at some genes. For example, go to chromosome_6, position 90,000. What do you observe here? What do you think it means?

Also on chromosome 6, go to position 54,000. What do you observe about this particular gene? How does this manifest in the RNA sequencing data?

Browse around the data. If you look at the mapped data for annotated genes, do the reads map only within the boundaries of the genes? What can you infer based on the answer to this question?

Splice junctions

We are now going to try to find reads that map to splice junctions, using tophat. Tophat uses bowtie underneath to do the initial mapping, builds a database of possible splice junctions from the result, and then maps the still unmapped reads against those junctions to confirm them.

TopHat takes longer to run than simply running bowtie, so rather than running tophat on all 16 million reads, which would take too long for this exercise, we will first create a file containing just a subset of the reads. To do this, we use the unix head command. You give head the number of lines that you want, which it takes from the top of the file, and then redirect the output to a new file.
In this case, we will ask for 500,000 reads, which are contained in the first 2 million lines of the starting fastq file (each read takes up 4 lines):

    head -2000000 S_cerevisiae_RNA_Seq.fastq > S_cerevisiae_RNA_Seq_trunc.fastq

There is an issue with the tophat installation on the machines we are using, so we have to reinstall it. First, go to:

    http://tophat.cbcb.umd.edu/

and click on the link to TopHat 1.0.13 (BETA), which will download a gzipped tar file with the source code. Once it is downloaded, right click on the file in the Downloads window in Firefox, and choose “Open Containing Folder”. You should see the file (tophat-1.0.13.tar.gz) in the window that opens. Drag that file to the Desktop, then right click on it, and choose “Extract here”. A tophat-1.0.13 folder should be created.

Now, open a new Terminal window and do:

    cd Desktop/tophat-1.0.13

then type (using plain straight double quotes, not curly ones):

    ./configure CFLAGS="-m32" CXXFLAGS="-m32"

then type:

    make; make install

You should see a lot of output scroll up your screen while the software compiles, which takes a couple of minutes, at the end of which we should hopefully have a working tophat program. Close this terminal window, and return to your previous terminal window.

We now also have to add the paths to the tophat and bowtie programs to our PATH variable:

    export PATH=$PATH:/data/software/bowtie-0.12.5/:/home/user/Desktop/tophat-1.0.13/bin/

We can now run tophat itself:

    tophat -p 2 --solexa1.3-quals -a 4 -o spliced_reads S288c S_cerevisiae_RNA_Seq_trunc.fastq

We will now want to visualize the data output in the sam file, by creating a sorted and indexed bamfile. However, there is a bug in tophat that means that its output cannot be correctly processed unless we fix the file by getting rid of a particular line in it that begins with ‘@PG’.
To do this, we will use the grep command with the -v option, which tells grep to exclude lines that match a pattern. The pattern '^@PG' matches only lines that begin with @PG, so alignment lines that merely happen to contain the letters PG (for example in a quality string) are kept:

    grep -v '^@PG' spliced_reads/accepted_hits.sam > spliced_reads/accepted_hits.fixed.sam

and then rename the file back to the original name:

    mv spliced_reads/accepted_hits.fixed.sam spliced_reads/accepted_hits.sam

Now we can convert the fixed sam file to an indexed, sorted bam file, using the following commands:

    samtools view -bS -t index.txt -o accepted_hits.bam spliced_reads/accepted_hits.sam
    samtools sort accepted_hits.bam accepted_hits_sorted
    samtools index accepted_hits_sorted.bam

Now look at this file (accepted_hits_sorted.bam.bai) in GenomeView, in conjunction with the other data that you already loaded. If you go back to chromosome 6, position 54,000, what do you now see there?

Discussion

There are a lot of tools for analyzing RNA-Seq data, as well as for visualizing the data, and you should certainly check them out. Note, as we discovered today, that because many of these tools are written by different authors, there are often errors or subtle incompatibilities. Solving these problems is often a case of trial and error, googling error messages, or emailing the tool authors (I had to use all three to get past some of the road blocks in preparing this exercise).

Also note, as discussed yesterday, that it is really useful to learn a scripting language, such as Perl or Python, to “glue” together various tools into a pipeline that you can use over and over again. This not only saves you from typing complex command lines repeatedly, but also documents the parameters you used for each tool. The O’Reilly “Learning Perl” or “Learning Python” books are a good place to start.

Finally, much of this type of software requires that you use the unix command line. It is definitely useful to learn enough unix to be able to get around the command line with reasonable ease.
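As a small illustration of the “glue” idea, the steps from this exercise can be collected into a plain shell script, which keeps a record of the parameters used. This is a minimal sketch under stated assumptions, not a definitive pipeline: the function names and the DRY_RUN switch are made up for illustration, the samtools invocations simply mirror the ones typed above (newer samtools releases use a different syntax for sort), and the @PG filter is anchored to the start of the line so that only that header line is removed.

```shell
#!/bin/sh
# Hypothetical helpers that wrap the commands used in this exercise.
# Set DRY_RUN=1 to print the samtools commands instead of running them.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# subset_fastq IN OUT N_READS: keep the first N_READS reads (4 lines each).
subset_fastq() {
    head -n "$(( $3 * 4 ))" "$1" > "$2"
}

# strip_pg_header IN OUT: drop only the header lines that begin with @PG.
strip_pg_header() {
    grep -v '^@PG' "$1" > "$2"
}

# sam_to_sorted_bam SAM PREFIX: create PREFIX.bam, PREFIX_sorted.bam and its
# index, using the same samtools commands as in the exercise.
sam_to_sorted_bam() {
    run samtools view -bS -t index.txt -o "${2}.bam" "$1" &&
    run samtools sort "${2}.bam" "${2}_sorted" &&
    run samtools index "${2}_sorted.bam"
}

# Example for the tophat output:
#   subset_fastq S_cerevisiae_RNA_Seq.fastq S_cerevisiae_RNA_Seq_trunc.fastq 500000
#   strip_pg_header spliced_reads/accepted_hits.sam spliced_reads/accepted_hits.fixed.sam
#   sam_to_sorted_bam spliced_reads/accepted_hits.fixed.sam accepted_hits
```

Running with DRY_RUN=1 first is a cheap way to check the generated command lines before committing to a long run.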
Resources

Bowtie: http://bowtie-bio.sourceforge.net/
TopHat: http://tophat.cbcb.umd.edu/
Cufflinks: http://cufflinks.cbcb.umd.edu/

Visualizing Data

GenomeView

In Firefox, go to:

    http://genomeview.sourceforge.net/

Click on the "Launch" button on the top left. In the dialog that appears, click "Ok". Wait a minute or so (it appears that nothing is happening, but don’t worry, it just takes a little while), then click "Run" on the next dialog.

When the software has opened, in the menus, go to:

    File->Load Features

Select “Local file” (it should already be selected), then click OK. Navigate to Desktop/XXX, then select:

    S288c.fasta

Repeat these steps to load:

    saccharomyces_cerevisiae.gff
    S_cerevisiae_RNA_Seq_sorted.bam.bai

The gff file loads a lot of different “tracks” of annotations. Most of them are not useful for this exercise, and they take up so much space that the sequence data are pushed off the bottom of the screen. To stop most of these tracks showing up on the display, turn them off by clicking on the green ticks in the panel on the right hand side, which converts them into red crosses. Click on the green ticks to turn off everything except:

    Gene Structure
    Ruler
    gene
    Short reads

As you’re turning these off, you’ll see the sequence read data scroll into view from below.

To navigate the data, you can use the arrow keys. The left and right arrow keys scroll the display left and right, while the up and down arrow keys zoom in and out. To switch to a different chromosome, click the “Entry” box at the top, and select a different chromosome. Explore the data.