* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Document
Essential gene wikipedia , lookup
Epigenetics of neurodegenerative diseases wikipedia , lookup
X-inactivation wikipedia , lookup
Oncogenomics wikipedia , lookup
Long non-coding RNA wikipedia , lookup
Human genome wikipedia , lookup
Transposable element wikipedia , lookup
Genetic engineering wikipedia , lookup
Epigenetics of diabetes Type 2 wikipedia , lookup
Vectors in gene therapy wikipedia , lookup
Copy-number variation wikipedia , lookup
Gene therapy wikipedia , lookup
Public health genomics wikipedia , lookup
Pathogenomics wikipedia , lookup
Nutriepigenomics wikipedia , lookup
History of genetic engineering wikipedia , lookup
Gene nomenclature wikipedia , lookup
Ridge (biology) wikipedia , lookup
Therapeutic gene modulation wikipedia , lookup
Metagenomics wikipedia , lookup
Epigenetics of human development wikipedia , lookup
Gene desert wikipedia , lookup
Biology and consumer behaviour wikipedia , lookup
Minimal genome wikipedia , lookup
Genomic imprinting wikipedia , lookup
Genome editing wikipedia , lookup
Helitron (biology) wikipedia , lookup
Gene expression programming wikipedia , lookup
Microevolution wikipedia , lookup
Site-specific recombinase technology wikipedia , lookup
Genome (book) wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Genome evolution wikipedia , lookup
Gene expression profiling wikipedia , lookup
RNA-Seq Lab Radhika S. Khetani Powerpoint by Casey Hanson RNA-Seq Lab v5 | Radhika S. Khetani 1 Exercise 1. Use the Tuxedo Suite to: a. Align RNA-Seq reads using TopHat(splice-aware aligner). b. Perform reference-based transcriptome assembly with CuffLinks. c. Obtain a new transcriptome using CuffLinks & CuffMerge. d. Use CuffDiff to obtain a list of differentially expressed genes. e. Report a list of significantly expressed genes. 2. Use a genome browser and visualization tool to observe the aligned data and the new transcriptome. RNA-Seq Lab v5 | Radhika S. Khetani 2 Premise Question: Is their a difference in our results if the Tuxedo Suit is run 2 different ways? 1. Procedure: Run 1A: Allow TopHat to select splice junctions de novo and proceed through the steps without giving the software known genes/gene models. Run 1B: Force TopHat to use only known splice junctions (i.e. known genes/gene models) and proceed through the steps making sure we are doing our analysis in the context of these gene models. 2. Evaluation: a. 2 metrics: # of mapped reads and # of significantly different identified gene s b. Compare new transcriptome to known genes. RNA-Seq Lab v5 | Radhika S. Khetani 3 Data Sources RNA-Seq: 100 bp, single end data sample replicate # fastq name # reads control Replicate 1 thrombin_control.txt 10,953 experimental Replicate 1 thrombin_expt.txt 12,027 Genome & gene information name description chr22.fa Fasta file with the sequence of chromosome 22 from the human genome (hg19 – UCSC) genes-chr22.gtf GTF file with gene annotation, known genes (hg19 – UCSC) RNA-Seq Lab v5 | Radhika S. Khetani 4 Step 0A: Accessing the IGB Biocluster Open Putty.exe In the hostname textbox type: biocluster.igb.Illinois.edu Click Open If popup appears, Click Yes Enter login credentials assigned to you; example, user class45. Now you are all set! RNA-Seq Lab v5 | Radhika S. Khetani 5 Step 0B: Lab Setup The lab is located in the following directory: ~/mayo/khetani This directory contains the finished version of the lab (i.e. the version of the lab after the tutorial). Consult it if you unsure about your runs. You don’t have write permissions to the lab directory. Create a working directory of this lab in your home directory for your output to be stored. Note ~ is a symbol in unix paths referring to your home directory. Copy the files Make sure you login to a machine on the cluster using the qsub command. The exact syntax for this command is given below. This particular command will login you into a reserved computer (denoted by classroom) with 4 cpus with an interactive session. You only need to do this once. $ mkdir ~/khetani $ cp # Make working directory in your home directory, ~/mayo/khetani/data/* $ qsub –I –l ncpus=4 ~/khetani # Copy data to your working directory. # Login to a computer on cluster. RNA-Seq Lab v5 | Radhika S. Khetani 6 Step 0C: Shared Desktop Directory For viewing and manipulating files on the classroom computers, we provide a shared directory in the following folder on the desktop: classes/mayo In today’s lab, we will be using the following folder in the shared directory: classes/mayo/khetani RNA-Seq Lab v5 | Radhika S. Khetani 7 Step 1: Access the Biocluster We will login to the biocluster and examine various files. $ qsub –I –l ncpus=4 $ cd ~/khetani/ # Open session on node with 4 cpus. SKIP IF DONE # Change to the directory of today’s lab session. $ head chr22.fa # Examine first 10 lines of chr22 sequence file. $ head –n 12 thrombin_control.txt # Examine first 12 lines (first 3 reads) from control. $ tail –n 12 thrombin_expt.txt # Examine last 12 lines (last 3 reads) from experimental sample. $ head genes-chr22.gtf # Examine first 10 lines of chr22 genes file. RNA-Seq Lab v5 | Radhika S. Khetani 8 Step 2: Create an Index using Bowtie Assume the data is of good quality and quality trimming has taken place. We will now use Bowtie in the Tuxedo Suite to build a chromosome index of chr22. $ module load tophat2/2.0.8 # Load Tophat2 and dependencies. $ bowtie2-build chr22.fa chr22 # Make Bowtie index for chr22. # This index is essential for the rest of the exercise. # Make sure you use the correct bowtie version (type bowtie-build). RNA-Seq Lab v5 | Radhika S. Khetani 9 . Run 1A: de novo Alignment In this exercise, we will be aligning RNA-Seq reads to a reference genome in the absence of gene models. Splice junctions will be found de novo. RNA-Seq Lab v5 | Radhika S. Khetani 10 Step 3A: Align Reads Using TopHat We will now align RNA-Seq reads to chr22 using mostly default parameters. Note: We are not providing gene information. TopHat will find splice junctions de novo. $ tophat –p 4 –o ctrl chr22 thrombin_control.txt # Run TopHAT using chr22 as reference and sequences in the control. # -p indicates the number of cpus (4). # -o indicates the output directory (ctrl). $ tophat –p 4 –o expt chr22 thrombin_expt.txt # Run TopHAT using chr22 as reference and sequences in the experimental sample. # -o indicates the output directory (expt). RNA-Seq Lab v5 | Radhika S. Khetani 11 Step 4A: Evaluate de novo Alignment We will now evaluate our alignment by observing how many reads DID NOT align to the reference genome chr22. $ samtools view –c ctrl/unmapped.bam # -c instructs the view tool to count the unmapped reads. # The result should be 101 unmapped reads. $ samtools view –c expt/unmapped.bam # The result should be 163 unmapped reads. RNA-Seq Lab v5 | Radhika S. Khetani 12 Step 5A: Create Index for Mapped Reads For the purpose of visualization, we will create an index for the mapped reads. $ samtools index ctrl/accepted_hits.bam # Create index of mapped reads in control. $ samtools index expt/accepted_hits.bam # Create index of mapped reads in experimental sample. RNA-Seq Lab v5 | Radhika S. Khetani 13 . Run 1B: Informed Alignment In this exercise, we will be aligning RNA-Seq reads to a reference genome in the presence of gene information. This obviates the need for TopHat to find splice junctions de novo. RNA-Seq Lab v5 | Radhika S. Khetani 14 Step 3B: Align Reads Using Gene Info We will now align RNA-Seq reads to chr22 using mostly default parameters for TopHat and information on the location of genes on chr22. $ tophat –p 4 –G genes-chr22.gtf –o ctrl-genes chr22 thrombin_control.txt # Run TopHAT using chr22 as reference and sequences in the control. # -p indicates the number of cpus (4). # -o indicates the output directory (ctrl-genes). # -G indicates the gene file to use to aid in alignment. $ tophat –p 4 –G genes-chr22.gtf –o expt-genes chr22 thrombin_expt.txt # Run TopHAT using chr22 as reference and sequences in the experimental sample. # -o indicates the output directory (expt-genes). RNA-Seq Lab v5 | Radhika S. Khetani 15 Step 4B: Evaluate Informed Alignment We will now evaluate our informed alignment by observing how many reads DID NOT align to the reference genome chr22. $ samtools view –c ctrl-genes/unmapped.bam # -c instructs the view tool to count the unmapped reads. # The result should be 27 unmapped reads. $ samtools view –c expt-genes/unmapped.bam # The result should be 39 unmapped reads. RNA-Seq Lab v5 | Radhika S. Khetani 16 Step 5B: Create Index for Mapped Reads For the purpose of visualization, we will create an index for the mapped reads in the informed alignment. $ samtools index ctrl-genes/accepted_hits.bam # Create index of mapped reads in control. $ samtools index expt-genes/accepted_hits.bam # Create index of mapped reads in experimental sample. RNA-Seq Lab v5 | Radhika S. Khetani 17 Checkpoint 1: Comparison of Alignments sample # fastq name # reads control thrombin_control.txt experimental thrombin_expt.txt Unmapped Reads de novo (Run 1A) Informed (Run 1B) 10,953 101 27 12,027 63 39 Conclusions There are fewer unmapped reads with the informed alignment, or Run 1B (i.e. when we use the known genes, and known splice sites)! TopHat’s prediction of splice junctions de novo is not working very well for this dataset. (This is likely due to the low number of reads in our dataset.) RNA-Seq Lab v5 | Radhika S. Khetani 18 . Finding Differentially Expressed Genes Next, we will utilize our RNA-Seq alignments to assembly gene transcripts, thereby permitting us to get relative gene abundances between the two samples (control and experimental). RNA-Seq Lab v5 | Radhika S. Khetani 19 Step 6: Assemble Transcripts using Cufflinks For the de-novo alignment (Run 1A) , we will run the program CuffLinks in order to obtain gene transcripts from our aligned RNA-Seq reads . There is no need to conduct this step for the informed alignment (Run 1B) because we have the locations of known genes already. $ module load cufflinks/2.1.1 # Load cufflinks v2 and dependencies $ cufflinks –p 4 –o cuff-ctrl ctrl/accepted_hits.bam # -p indicates the number of processors to use (4) # -o indicates the output directory (cuff-ctrl) $ cufflinks -p 4 –o cuff-expt expt/accepted_hits.bam # -o indicates the output directory (cuff-expt) RNA-Seq Lab v5 | Radhika S. Khetani 20 Step 7: Merge Transcripts Using CuffMerge For the de-novo alignment (Run 1A) , we will run the program CuffMerge in order to merge our assembled transcripts. There is no need to conduct this step for Run 1B. $ echo -e "cuff-ctrl/transcripts.gtf\ncuff-expt/transcripts.gtf" > gtf.list.txt # Create a text file named gtf.list.txt with the following contents: cuff-ctrl/transcripts.gtf cuff-expt/transcripts.gtf $ cuffmerge –o cuffmerge gtf.list.txt # -o indicates the output directory (cuffmerge) RNA-Seq Lab v5 | Radhika S. Khetani 21 Step 8A: Gene Expression Using CuffDiff For both alignments, de-novo (Run 1A) and informed (Run 1B), we aim to collected the abundances of the expressed genes. To do this, we will utilize the CuffDiff program. We need only a gene (.gtf) file and alignment (.bam) files to calculate differentially expressed genes between the different sample groups (control and experimental). $ cuffdiff –p 4 –o cuffdiff cuffmerge/merged.gtf expt/accepted_hits.bam ctrl/accepted_hits.bam # -p indicates the number of processors to use (4) # -o indicates the output directory(cuffdiff) $ cuffdiff –p 4 –o cuffdiff-genes genes-chr22.gtf expt/accepted_hits.bam ctrl/accepted_hits.bam # -o indicates the output directory (cuffdiff-genes) RNA-Seq Lab v5 | Radhika S. Khetani 22 Step 8B: Gene Expression Using CuffDiff For both alignments, de-novo (Run 1A) and informed (Run 1B), we aim to collected the abundances of the expressed genes. To do this, we will utilize the CuffDiff program. $ head cuffdiff-genes/gene_exp.diff # Examine first 10 lines of file. # We want all rows where the 14th column is yes. The awk command is convenient for file parsing. We’ll see a Galaxy interface for awk in a later lab. $ awk ‘{if ($14==“yes”) print $0}’ cuffdiff/gene_exp.diff > cuffdiff/gene_exp.SIG.diff $ awk ‘{if ($14==“yes”) print $0}’ cuffdiff-genes/gene_exp.diff > cuffdiffgenes/gene_exp.SIG.diff RNA-Seq Lab v5 | Radhika S. Khetani 23 Step 8C: Gene Expression Using CuffDiff For both alignments, de-novo (Run 1A) and informed (Run 1B), we aim to collected the abundances of the expressed genes. To do this, we will utilize the CuffDiff program. We will count the lines in the .SIG.DIFF files to see how many genes are differentially expressed in each of the two alignments. $ wc –l cuffdiff/gene_exp.SIG.diff cuffdiff-genes/gene_exp.SIG.DIFF # Count the number of lines in each .SIG.DIFF files # Output 3 cuffdiff/gene_exp.SIG.diff 0 cuffdiff-genes/gene_exp.SIG.diff Only the de-novo alignment (Run 1A) reports any differentially expressed genes between experiment and control ! RNA-Seq Lab v5 | Radhika S. Khetani 24 . Visualization Using IGV The Integrative Genomics Viewer (IGV) is a tool that supports the visualization of mapped reads to a reference genome, among other functionalities. We will use it to observe where hits were called for the de-novo alignment (Run 1A) for the two samples (control and experimental), the new transcriptome generated by CuffMerge, and the differentially expressed genes. RNA-Seq Lab v5 | Radhika S. Khetani 25 Step 9: Start IGV In this step, we will start IGV and load the chr22.fa file, the known genes file (genes-chr22.gtf), the hits for both sample groups, and the merged transcriptome. These files are located in classes/mayo/khetani on the desktop. Graphical Instruction: Load Genome 1. Within IGV, click the FILE tab on the menu bar. 2. Click the the ‘Load Genome from File’ option. 3. In the browser window, select chr22.fa (genome). Graphical Instruction: Load Other Files 1. Within IGV, click the FILE tab on the menu bar. 2. Click the ‘Load from File’ option. 3. Select the genes-chr22.gtf file (known genes file). 4. Perform Steps 1-3 for the files to the right. RNA-Seq Lab v5 | Radhika S. Khetani Files to Load genes-chr22.gtf ctrl_accepted_hits.bam expt_accepted_hits.bam merged.gtf 26 Step 10A: Visualization With IGV Your browser window should look similar to the picture below: RNA-Seq Lab v5 | Radhika S. Khetani 27 Step 10B: Visualization With IGV Click here and type the following location of a differentially expressed gene: chr22:19960675-19963235 Move to the left and right of the gene. What do you see? RNA-Seq Lab v5 | Radhika S. Khetani 28 Step 10C: Visualization with IGV Looks like the new transcriptome (merged.gtf) compares poorly to the known gene models. This is very likely due to the very low number of reads in our dataset. We can see that there are many more reads for one dataset compared to the other. Hence, it makes sense that the gene was called as being differentially expressed. Note the intron spanning reads. RNA-Seq Lab v5 | Radhika S. Khetani 29 Conclusion Today we did the following: 1. Used the Tuxedo Suite to: a. Aligned RNA-Seq reads using TopHat(splice-aware aligner). b. Performed reference-based transcriptome assembly with CuffLinks. c. Obtained a new transcriptome using CuffLinks & CuffMerge. d. Used CuffDiff to obtain a list of differentially expressed genes. e. Reported a list of significantly expressed genes. 2. Used a genome browser and visualization tool to observe the aligned data and the new transcriptome. RNA-Seq Lab v5 | Radhika S. Khetani 30