Download Introduction Exercise 1: Measuring gene expression

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Transposable element wikipedia , lookup

Copy-number variation wikipedia , lookup

Epigenetics in learning and memory wikipedia , lookup

Epigenetics of neurodegenerative diseases wikipedia , lookup

Biology and consumer behaviour wikipedia , lookup

Oncogenomics wikipedia , lookup

Vectors in gene therapy wikipedia , lookup

Gene therapy wikipedia , lookup

Polycomb Group Proteins and Cancer wikipedia , lookup

History of genetic engineering wikipedia , lookup

Gene nomenclature wikipedia , lookup

Genomics wikipedia , lookup

NEDD9 wikipedia , lookup

Gene therapy of the human retina wikipedia , lookup

Public health genomics wikipedia , lookup

Gene desert wikipedia , lookup

Minimal genome wikipedia , lookup

Ridge (biology) wikipedia , lookup

Pathogenomics wikipedia , lookup

Genome editing wikipedia , lookup

Gene wikipedia , lookup

Long non-coding RNA wikipedia , lookup

Genome (book) wikipedia , lookup

Genomic imprinting wikipedia , lookup

Metagenomics wikipedia , lookup

Epigenetics of diabetes Type 2 wikipedia , lookup

Epigenetics of human development wikipedia , lookup

Microevolution wikipedia , lookup

Therapeutic gene modulation wikipedia , lookup

Site-specific recombinase technology wikipedia , lookup

Genome evolution wikipedia , lookup

Nutriepigenomics wikipedia , lookup

Mir-92 microRNA precursor family wikipedia , lookup

Designer baby wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Gene expression programming wikipedia , lookup

Gene expression profiling wikipedia , lookup

RNA-Seq wikipedia , lookup

Transcript
Introduction
In this workshop we will demonstrate a typical analysis done by a bioinformatician working with
RNA-seq data using Galaxy. This involves quality control, aligning sequencing data to a known
reference genome, doing differential gene expression analysis and visualization.
•
•
•
Open a browser and go to: http://bioinf-galaxian.erasmusmc.nl/galaxy
Register to gain access to data libraries and workflows using the credentials given by the
lecturer
All the data that belongs to this pactical – including results and answers – can be found in:
◦ Shared Data
▪ Data Libraries
• EMC Galaxy Training Materials
◦ EMC Galaxy Training – 6: RNA-Seq Basics (non-tuxedo)
We will refer further in the practical refer to “EMC Galaxy Training – 6: RNA-Seq Basics (nontuxedo)” directory with “the data-library”. To prepare yourself on the RNA-Seq topic, obtain the
presentation from the “01 Slides & Manuals” directory within the data library, and go through it. If
you have questions about this topic, ask an assistant for help.
Exercise 1: Measuring gene expression
In the first exercise we will start with raw paired-end sequencing data and process it to gene
expression levels and interactive genome browsing. The input data is a small selection of reads that
will align to just a small region on the human genome. We have chosen to reduce this data because
processing is a time and resource consuming task.
•
Import the following files from:
◦ [02 Input Data]
▪ [Exercise 1: .Measuring gene expression]:
• ucsc refseq.gtf
• miR-23b_1.fq
• miR-23b_2.fq
These are paired-end sequencing data files in the same format provided by the manufacturer. Take a
look at one of the FASTQ files by pressing the eye-icon. Each read is represented by four lines: a
header, the sequence itself, a “+” and the quality scores. We're going to get an impression of the
quality of the reads:
• Run FastQC on both FASTQ files.
◦ Hint: When selecting input files, you can choose multiple datasets and run in parallel.
The tool produces two output files per input file. We're only interested in the 'website' because the
data is presented in a human understandable format. We can remove the history items named
“FastQC on …: RawData”. When we look at the output of the QC steps, you will notice a lot of
warnings and errors, which arise partially from the fact that we work with a (artificial) very small
dataset. Take a look at on of the two websites and answer the questions.
Questions:
• What is the total number of sequences?
• What is the quality encoding? (Write this down, you will need it later on)
• Take a look at the figures “Per base sequence quality”, “Per sequence quality scores” and
“Sequence Length Distribution”. What is the reason for the warning at the “Per base
sequence quality”?
◦ Ask for help if you're not sure
There are – unfortunately – several derivatives of the FASTQ file format. The differences are in
particular in the quality encoding. The current 'standard' is the FASTQ-Sanger format. To convert
the FASTQ files into FASTQ-Sanger, you should run the tool FASTQ Groomer on both files.
Therefore you need to know the quality encoding of our two files (hint: results of FastQC). If you
need more help, go to: http://en.wikipedia.org/wiki/FASTQ_format#Encoding or ask for it. Thus,
run FASTQ Groomer on both miR-23b_2.fq and miR-23b_2.fq. After grooming, rename the files to
something meaningful (e.g. “..._1_groomed” and “..._2_groomed”).
•
Check if the files are converted correctly: press the pencil-icon. Go to the “Datatype” tab
and confirm the New Type is equal to fastqsanger (not fastq, not fastqcssanger!!)
When both files are converted, we can use Sickle for trimming low quality bases at the ends of the
reads. Select the tool Sickle and make sure:
• you select the appropriate type of reads (paired-end)
• make sure you use the right quality type (fastsqsanger)
After running the tool, browse through on of the files and scroll down.
•
•
What difference do you see when you look at the newly generated FASTQ files?
What are the “singletons”?
Run FastQC again, but now on either the forward or reverse files from Sickle.
•
If you run FastQC again, which metrics are improved / changed?
To put our high quality reads into a biological context, we align the sequencing data to the human
reference genome. The most widely used human reference genome at the moment is hg19. Please
read the first paragraph (3 scentences) of: http://en.wikipedia.org/wiki/Reference_genome
•
On how many individuals is hg19 based?
For RNA-Seq we need specialized RNA aligners, able to cope with gaps that originate from
splicing. For this exercise we will make use of a tool called RNA-STAR. Load the tool n galaxy
(under the name 'rnastar') and make sure:
•
•
•
•
single ended or mate-pair ended reads in this library = paired-end
forward reads = forward reads from sickle
reverse reads = reverse reads from sickle
built-in index = hg19
From the results only keep the “rnastarrun_starmapped.bam”, since this is the alignment. This file
is in a binary compressed format and visualizing its file contents is not very useful. Instead, we can
visualise it interactively in Galaxy. Visualise the BAM file by going to Trackster, and save it
immediately. When Trackster is loaded, press the [+] in the right corner and add refseq_genes.gtf
and save it again.
Questions:
• Can you find evidence for alternative splicing in region chr16:15696870-15745667
•
◦ Hint: Change the visualisation mode of the alignment to “pack”.
Can you find mismatches in the alignment (or possibly even variants)?
To measure the expression levels per gene, we lookup the coordinates of all known genes in a
database (e.g. refseq). For each gene, we look at it's regions in which we expect to find reads and
count them. To do this, load the tool featureCounts and select:
•
•
alignment file: rnastarrun_starmapped.bam
GFF/GTF source: Use reference from history
◦ Gene annotation file: ucsc_refseq.gtf
Leave the other settings default and execute. This will result into two files; a summary that contains
statistics on the counting and the actual counts. Take a look at the non-summary file; the actual
counts. The second gene, WASH7P, which has a read-count of 0 (2nd column) has a 'length' of
1769bp. Yet, when we look at the UCSC genome browser it's genomic location is given as
chr1:14407-29370 and similarly gene-cards says the gene's length is “Size: 15,445 bases”.
•
Why is there a difference in length?
◦ Hint: take a look at the GTF file's 3rd column or visualize it e.g. within the BAM
visualization.
The featureCounts output has many 0-values. This is because we make use of a artificially
shortened trimmed dataset. To show only the relevant content, we can filter the 0 value lines out.
Proceed with the tool “Filter data on any column using simple expression” and set:
•
•
•
File: the featureCounts non-summary output file
Condition: c2!=0
Number of lines to skip: 1
Take a look at the result:
•
Which gene has the highest readcount?
Exercise 2: Differential expression analysis.
Gene expression levels don't mean at to much unless they're put in a relative context. In the
previous exercise we found that ANXA2 had the highest readcount but what if this gene has a high
readcount in any sample? To understand what expression levels mean in a relative context we need
normalization and apply statistical testing. A very popular tool that allows to do this with RNA-Seq
data is EdgeR. In the following analyses we're going to find the differentially expressed genes
in the MCF7-cell line between samples that have been treated with the hormone β-estradiol
(E2) and those that were left as control. These samples are thus taken from the same cell line (in
vitro). Look up where MCF7 cell lines originate from if you don't know this.
Reference: http://www.ncbi.nlm.nih.gov/pubmed/24319002
In this experiment the RNA-Seq datasets have been subsampled. For each replicate, we have the
full sequencing depth (~30M), 10M and 5M.
Subsampled datasets
Create a new histroy, preferably called “DGE MCF7 (SubSampled)”. Import from the “02 Input
Data/Exercise 02: Differential expression analysis” directory, available in the data-library, the
following items into your history:
As described before, we have two classes we want to compare to each other. One class is treated
with estradiol, called “E2”, and the other class wsa left as a control, called “Control”. Take a look at
the design_matrix and see if you can find samples that belong to those classes.
• Take a look at GSE51403_expression_matrix_5M_coverage.txt and find how many columns
you see (apart from column gene-id), divide this number through 2 and write this number
down in Table 01 (bottom of the practical).
• Do the same for GSE51403_expression_matrix_10M_coverage.txt.
To run the differential gene expression testing, load the galaxy tool:
edgeR: Differential Gene(Expression) Analysis RNA-Seq gene expression analysis using edgeR (R
package).
We will first test the dataset that has the lower sequencing depth – 5 million reads:
• Select GSE51403_expression_matrix_5p_coverage as expression matrix.
• Select the GSE51403_design_matrix_subsampled as the design matrix.
• The more complicated part is the biologcial question we want to ask. This question is asked
as a mathematical formulation in a format described by a well known R package limma. For
two class problems it's very simple: classTreated-classnormal, which in our case is:
◦ Control-E2 (case sensitive!)
• Don't select addition output files.
Take a look at the file “edgeR DGE on 1: GSE51403_design_matrix_subsampled.txt - differentially
expressed genes”. If everything is correct, the gene GREB1 is located in the top of the file. Please
check it’s corresponding gene cards page:
http://www.genecards.org/cgi-bin/carddisp.pl?gene=GREB1
Question:
Can you find on the gene cards page a regulatory factor of the gene that relates to the E2 treatment?
Hint: what was E2 again?
Question:
Can you find on the gene cards page an association with MCF-7 cells?
Hint: what was MCF-7 for type of cell line again?
Each line in the file represents one gene; the 2nd column represents the gene symbol. The Pvalue is
a probability that represents the chance to find the expression values that belong to the gene given
that they are coming from the same distribution. The FDR is a multiple testing correction of the
Pvalue and is usually used instead of the Pvalue. The lower this value, the less likely it is that they
are derived from the same distribution by chance. Thus, differentially expressed genes will have a
low FDR or PValue. To distinguish between differences still caused by chance and differences that
are so large that they are considered significant, we make use of a cut-off, commonly set to < 0.01.
To find how many genes are significantly differentially expressed, go to “Filter data on any column
using simple expressions”:
• File: edgeR DGE on 1: GSE51403_design_matrix_subsampled.txt - differentially expressed
genes
• Condition: c7<0.01
• Header lines to skip: 1
Question:
How many genes are differentially expressed between Control and E2?
• Write this number down in Table 01 (bottom of the practical)
Question:
Would you expect more or less differentially expressed genes if the experiment was done on
individual patient samples instead of cell-line replicates?
Run edgeR: Differential Gene(Expression) Analysis again and leave everything the same, but use
GSE51403_expression_matrix_10M_coverage.txt instead. Run the filter to find the number of
significant genes (Condition: c7<0.01) and Write this number down in Table 01 (bottom of the
practical).
Question:
Do we find more or less differentially expressed genes (FDR < 0.01) between the Control and E2
using 10M instead of 5M raw reads per sample?
• Hint: the fact that the numbers of significantly expressed genes are different, indicates that it
relates to the statstical power of the experiment.
Full resolution datasets
Create a new histroy, ideally called “DGE MCF7 (Full Resolution)”.
Import:
•
•
GSE51403 expression matrix full.txt
GSE51403 design matrix full depth.txt
Please find out the number of replicates by inspecting GSE51403_design_matrix_full_depth.txt, and
write this number down on the next row in Table 01 (bottom of the practical).
Run edgeR: Differential Gene(Expression) Analysis again and leave everything the same, but use
GSE51403_expression_matrix_10M_coverage.txt instead. Run the filter to find the number of
significant genes ( Condition: c7<0.01 ) and write this number down in Table 01 (bottom of the
practical).
We have now ran several tests with the same number of replicates but with different sequencing
depths. To see what the effect is of the amount of replicates, we should run the same analysis with a
lower number of replicates. We can reduce this number very easily: go to the tool:
edgeR: Concatenate Expression Matrices Create a full expression matrix...
Add an expression table and select the following 5 replicates per sample:
This will create a truncated version of the expression matrix, only including the 5+5 replicates.
Rename the new expression matrix to: GSE51403_expression_matrix_full__5+5_replicates.txt
Run edgeR: Differential Gene(Expression) Analysis again and leave everything the same, but use
GSE51403_expression_matrix_full__5+5_replicates.txt instead. Enable the optional output: “MDSplot (logFC-method)”. Run the filter to find the number of significant genes ( Condition: c7<0.01 ).
Count the lines and write it down in Table 01 on the last row.
Table 01: results
Replicates
Seq. depth (million)
Significant diff. expressed genes
0
0
0
...
5000000
…...
...
10000000
…...
...
30000000
…...
5
30000000
…...
The number of significant genes somehow reflect the statstical power of the experiment; if you
wouldn't find significant genes at all you have a dataset in which you also can't distinguish beteen
your classes.
Final question
Can you create a tab delimited file of Table 01 and upload it as a 'tabular' file within Galaxy?
Try to open it in Galaxy as Scatterplot (visualization on the history item) and discuss with other
people about the impact of removing these 2 replicates on the statistical power.