RNA-Seq Lab I: Analyzing cuffdiff output from an RNA-seq dataset¹
Bio 461 Developmental Biology Lab
Saint Louis University
Dr. Judith Ogilvie
Objectives:
• Obtain GO terms and other gene attributes for differentially expressed genes
• Obtain genomic DNA and mRNA sequences for candidate genes
• Edit and annotate sequences
In this training module you will analyze cuffdiff output from an Illumina RNA-seq experiment that my lab conducted
to identify differentially expressed mRNAs isolated from postnatal day 4 (P4) and P6 mouse retinas, from both
wild type (wt) and mutant rd1 mice. In Part A we will run the list of differentially expressed genes
through the DAVID database and the Ensembl BioMart database to find Gene Ontology (GO) terms and other
gene attributes. In Part B we will use the UCSC Genome Browser to obtain sequence information for your genes of
interest. In Part C we will import sequence information for candidate genes into a sequence editing and
annotation program called ApE.
The RNA-seq data were analyzed using the Tuxedo protocol with four different 2-way comparisons: (1)
wt P4 compared to wt P6, (2) rd1 P4 compared to rd1 P6, (3) wt P4 compared to rd1 P4, and (4) wt P6
compared to rd1 P6 (see data spreadsheet). During this lab activity you will sort through the differentially
expressed genes and pick out a handful of your choice for in-class validation using quantitative reverse
transcriptase PCR (qRT-PCR).
The Excel spreadsheet listing the genes that are significantly differentially expressed contains the following
columns:
• Transcript: transcript ID #
• Nearest Ref Id: Ensembl transcript ID #
• Gene: gene identifier #
• Alias: gene name abbreviation
• fold change: fold change of P6 wt/P6 rd1 FPKM
• Direction: Up or Down regulated
• P6 wt FPKM: fragments per kilobase of exon per million mapped fragments for each gene in the P6 wt sample
• P6 rd1 FPKM: fragments per kilobase of exon per million mapped fragments for each gene in the P6 rd1 sample
• Q-value: p-value adjusted for multiple testing (false discovery rate); the smaller the q-value, the more likely the observed expression change is “real”
• Description: gene name
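If you want to check how the fold change column relates to the two FPKM columns, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the FPKM values are made up, and fold_change is a hypothetical helper name, not part of cuffdiff or the spreadsheet.

```python
import math

def fold_change(wt_fpkm, rd1_fpkm):
    """Return (fold change, log2 fold change) of wt over rd1 FPKM."""
    ratio = wt_fpkm / rd1_fpkm
    return ratio, math.log2(ratio)

# Hypothetical FPKM values, not taken from the lab spreadsheet.
ratio, log2_ratio = fold_change(wt_fpkm=40.0, rd1_fpkm=10.0)
print(ratio, log2_ratio)
```

A ratio above 1 (a positive log2 value) means the gene is expressed at a higher level in wt than in rd1; a ratio below 1 means the opposite.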
Part A: Assigning full gene names and Gene Ontology or GO Terms
Genes can be sorted using the Sort function under the “Data” tab in Excel. The initial spreadsheet is sorted first
by direction (downregulated genes above upregulated genes), then by q-value, and within a significance range by
fold change. You can change this order. You may also want a shorter list to work with. Copy your list into a
new sheet in the Excel spreadsheet. Sort in a way that separates the genes you are interested in and delete
the rest. Be sure to rename the sheet to indicate what is in this list. For example, you may want to include only
genes that are downregulated or only genes with more than a 2-fold change. Note
that the second of our analysis tools prefers lists of no more than 500 genes. If your total list is shorter
than this, you probably want to work with the complete list.
¹ Adapted from a lab developed by Ray Enke at James Madison University
Cold Spring Harbor Laboratory, DNA Learning Center, 1 Bungtown Road, Cold Spring Harbor, NY 11724
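The same filtering and sorting can be done outside Excel. The sketch below assumes the spreadsheet has already been read into a list of rows. The gene names, values, and the helper name shortlist are hypothetical, and the code assumes the fold change column stores a magnitude greater than 1 for both directions, which you should confirm in your own sheet.

```python
# Hypothetical rows mirroring the spreadsheet columns; all values are made up.
genes = [
    {"alias": "GeneA", "direction": "Down", "fold_change": 3.2, "q_value": 0.001},
    {"alias": "GeneB", "direction": "Up", "fold_change": 5.0, "q_value": 0.020},
    {"alias": "GeneC", "direction": "Down", "fold_change": 1.4, "q_value": 0.004},
    {"alias": "GeneD", "direction": "Down", "fold_change": 2.5, "q_value": 0.001},
]

def shortlist(rows, direction="Down", min_fold_change=2.0):
    """Keep rows with the given direction and a fold change above the cutoff."""
    kept = [
        r for r in rows
        if r["direction"] == direction and r["fold_change"] > min_fold_change
    ]
    # Smallest q-value first; within equal q-values, largest fold change first.
    return sorted(kept, key=lambda r: (r["q_value"], -r["fold_change"]))

for row in shortlist(genes):
    print(row["alias"], row["fold_change"], row["q_value"])
```

Changing the direction or min_fold_change arguments gives a different sublist without touching the original data, which is the equivalent of copying the list to a new sheet before deleting rows.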
To pick “interesting” genes out of the list, we need to get some additional information about each of them. A
gene ontology or GO term is a short descriptor of a gene product’s function. There are three different kinds of
GO terms: Biological Process, Molecular Function, and Cellular Component. We are most interested in the
biological process. Why? There are two very useful free tools for identifying GO terms. The results should be
the same, but the output appears in very different formats. We will start with a database called DAVID
(Database for Annotation, Visualization and Integrated Discovery), which will allow us to select only GO
terms associated with biological processes.
• Navigate to DAVID (http://david.abcc.ncifcrf.gov/tools.jsp)
• On the left side, click the “upload” tab.
• From your CuffDiff output file, select and copy the entire column A (Nearest Ref ID) and paste it into
the window labeled “A. Paste a list.”
• Select Identifier>>>Ensembl Transcript ID; List type>>>Gene list; click “Submit List.”
You now have a number of DAVID tools you can use to analyze the data.
• Click on “Gene Functional Classification Tool.” The top row will show an “enrichment score” for each
group of functionally related genes. The higher the score, the more enriched the cluster.
You can download the data as a tab-delimited file that can be imported into an Excel spreadsheet.
When you are done with this data, go back to the previous screen.
• Click on “Functional Annotation Tool.”
• Annotation Summary Results>>>uncheck “Check defaults”>>>Clear All>>>expand “Gene_Ontology”
by clicking on the plus sign>>>check GOTERM_BP_FAT to include only GO terms for Biological
Processes.
• You will have 3 options. You can click on all three and download the files.
  o Functional Annotation Clustering provides clusters of related GO terms, similar to the clusters
  of functionally related genes above. The second column lists the GO terms in that cluster. The
  “count” represents how many of your genes have been annotated with that GO term, and the
  P_value tells you how significant the enrichment is (smaller values are more significant).
  o Functional Annotation Chart includes each of the GO terms in the previous analysis, sorted by
  P-value, without clustering.
  o Functional Annotation Table provides a table listing each gene and all of the GO terms
  associated with that gene.
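To get a feel for what an enrichment P-value measures, here is a minimal sketch that asks how likely it is to see at least a given number of genes with a GO term if your gene list had been drawn at random from the background. It uses a plain hypergeometric tail probability. DAVID reports a more conservative modified Fisher exact P-value (the EASE score), so these numbers will not match DAVID's output exactly. All counts and the function name are hypothetical.

```python
from math import comb

def enrichment_p_value(list_hits, list_size, background_hits, background_size):
    """P(X >= list_hits) when list_size genes are drawn at random from the background."""
    max_hits = min(list_size, background_hits)
    total = comb(background_size, list_size)
    tail = sum(
        comb(background_hits, k)
        * comb(background_size - background_hits, list_size - k)
        for k in range(list_hits, max_hits + 1)
    )
    return tail / total

# Hypothetical counts: 5 of 100 submitted genes carry a GO term that
# annotates 200 of 20,000 background genes (about 1 expected by chance).
p = enrichment_p_value(list_hits=5, list_size=100,
                       background_hits=200, background_size=20000)
print(f"{p:.2e}")
```

The more the observed count exceeds the count expected by chance, the smaller the P-value becomes.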
The output may be cumbersome to sort through to identify interesting genes, so we will also use a database
called Ensembl BioMart to assign GO terms to each differentially expressed gene. This tool gives a very
nice spreadsheet, but does not allow you to select what kind of GO terms to include. BioMart prefers lists of
fewer than 500 genes.
• Navigate to Ensembl BioMart (http://useast.ensembl.org/biomart/martview/)
• Choose database>>>Ensembl Genes 78>>>Choose dataset>>>Mus musculus genes (GRCm38.p3)
• Filters>>>Gene>>>check ID list limit>>>select Associated gene names from dropdown
These commands tell the database that we are going to filter a list of gene name abbreviations, labeled Alias in
your spreadsheet, through the annotated mouse genome (Mus musculus). In the RNA-seq spreadsheet, copy
the entire column B (or a subset that you have selected) and paste the gene aliases into the BioMart search
window. The next set of commands will tell the database what information we want back from our search.
There are many options you can select (see list under Attributes). For this exercise we will select the full gene
name, a description of the gene, and associated GO terms for each gene as our output:
• Attributes>>>Gene
• Check only the following boxes under Gene: Description, Associated gene name
• Check only the following box under External: GO Term name
• Select Results (top tab)>>>for Export results to, select File>>>XLS>>>check Unique results
only>>>Go. If it times out, simply repeat.
You now have a new spreadsheet with the Alias/Associated gene name, Description/Full gene name and GO
terms for each of the genes. Note that genes with multiple GO terms are repeated in multiple rows. For
example, using this information we now know that Cngb1 is a cyclic nucleotide-gated ion channel that is
involved in phototransduction and smell. Copy or move this spreadsheet into the existing RNA-seq data
spreadsheet as a new sheet. Use this spreadsheet to search for keywords of interest. This list will include all of
the GO terms. You may want to start by sorting by GO term and deleting the terms that are not of interest, such
as cellular components. Some terms will be so similar that you may want to delete one of them to simplify the list (e.g.,
“phototransduction” and “phototransduction, visible light” in the example below).
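Because each gene is repeated once per GO term, it can help to collapse the export into one entry per gene. The sketch below assumes the export has been reduced to (gene name, GO term name) pairs. The rows are hypothetical examples in the shape of the BioMart output, and go_terms_by_gene is an illustrative helper name.

```python
from collections import defaultdict

# Hypothetical rows in the shape of the BioMart export: (gene name, GO term name).
rows = [
    ("Cngb1", "phototransduction"),
    ("Cngb1", "sensory perception of smell"),
    ("Cngb1", "phototransduction"),
    ("Rho", "phototransduction"),
]

def go_terms_by_gene(rows):
    """Collapse one-row-per-GO-term records into gene -> sorted unique terms."""
    terms = defaultdict(set)
    for gene, term in rows:
        terms[gene].add(term)
    return {gene: sorted(found) for gene, found in terms.items()}

for gene, terms in go_terms_by_gene(rows).items():
    print(gene, "|", "; ".join(terms))
```

Using a set per gene also removes exact duplicate rows, so each GO term is listed only once.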
What kind of genes do you want to find to validate? My lab is most interested in photoreceptor development
and degeneration. What keywords would you search for if you were interested in this process?
• Ctrl+F in Excel>>>enter search term>>>Find next
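A keyword search like the one done with Find in Excel can also be sketched in Python, which makes it easy to try several keywords at once. The rows, the placeholder gene GeneX, and the helper name find_genes are all hypothetical.

```python
def find_genes(rows, keywords):
    """Return genes whose description or GO term contains any keyword (case-insensitive)."""
    wanted = [k.lower() for k in keywords]
    hits = []
    for gene, description, go_term in rows:
        text = f"{description} {go_term}".lower()
        if any(k in text for k in wanted) and gene not in hits:
            hits.append(gene)
    return hits

# Hypothetical rows: (gene name, description, GO term name).
rows = [
    ("Cngb1", "cyclic nucleotide gated channel beta 1", "phototransduction"),
    ("Cngb1", "cyclic nucleotide gated channel beta 1", "sensory perception of smell"),
    ("GeneX", "hypothetical example gene", "cell cycle"),
]
print(find_genes(rows, ["photo", "retina"]))
```

Partial words such as "photo" match longer terms such as "phototransduction" and "photoreceptor", so short keywords cast a wider net.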
Working in groups of 3, decide what genes you are going to study. Before you leave today, groups will select 1
process or gene family to study. Each person in the group will select at least 2 genes to work on. Once you’ve
decided on your genes, write them on the board so we do not have duplicates.
Once your group has selected several genes, find them in the RNA-seq spreadsheet and copy those specific
rows of data into the workbook tab named “genes to validate” so they are organized together. Add columns for
the GO terms of interest.
Part B: Obtaining Sequences from the UCSC Genome Browser
Next we will find genomic DNA and mRNA sequences for each of our genes of interest. The UCSC Genome
Browser consists of a suite of tools for viewing and mining genomic data. We will use some of the basic
features of the browser to collect sequences for the mouse Rhodopsin gene. Navigate to the UCSC Genome
Browser homepage: http://genome.ucsc.edu/
• Select Genomes or Genome Browser
• In the pull-down menus select group>>>mammal; genome>>>mouse; assembly>>>2011; enter
Rhodopsin as the search term>>>submit
• Select Rho at chr6 from the result page to access the genome browser view.
This takes you to a view of the entire Rhodopsin gene on mouse chromosome 6 with multiple other tracks
showing data corresponding to this genetic location. For simplicity, first deselect all tracks and start from
scratch.
• Directly under the viewer, select the hide all option to hide all tracks
• Under Genes and Gene Predictions, select the Ensembl Genes and RefSeq Genes options
with full display options
• Select Refresh
The Rhodopsin gene with annotated exons (solid bars) and introns (arrowed lines) is now displayed in the
viewer with corresponding genome coordinates (Ensembl annotation in red, RefSeq annotation in blue).
The direction of the arrowed line indicates which strand the gene is encoded on. Arrows pointing to the right
indicate the gene is coded 5’ to 3’ on the top strand (left to right in this view); arrows pointing to the left indicate
the gene is coded 5’ to 3’ on the bottom strand (right to left in this view). Looking at the viewer you can easily
see that Rho is coded on the top strand with 5 exons and 4 introns (exon 1 on the far left).
If you are interested in exploring the many additional features and tracks available in the genome browser, the
site contains an excellent tutorial that I encourage you to check out (http://www.openhelix.com/ucsc).
Obtaining sequence information:
To obtain sequence information for a gene or a genetic region, click on the gene name or gene ID on the left
side of the viewer (e.g., “Rho”). This brings you to a page index where you can access more information about your
gene. Under the Links to sequence heading you have options to view the genomic DNA, mRNA, or protein
sequence for this region. We will collect gDNA and mRNA sequences for Rhodopsin. Select the Genomic
sequence link first to go to a sequence formatting page. Get the Rho sequence with the following formatting
options and paste it into a Word file.
• 5’UTRs, CDS exons, 3’UTRs, introns
• One FASTA record per gene
• Exons in upper case, everything else in lower case
• Submit
This outputs the Rhodopsin genomic DNA sequence with all exons in upper case and introns and everything
else in lower case. Visually, you should be able to pick out the 5 exons by seeing where the upper case letters
are separated from lower case. For now, copy/paste the entire sequence into an MS Word file. Go back to the
Rho index page and select the mRNA sequence link. This outputs the mRNA sequence (with Ts instead of
Us), that is, all of the exon sequences stitched together with the intronic sequences spliced out.
Copy/paste this sequence into a Word file as well.
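Because exons are written in upper case and everything else in lower case, the exon boundaries can be read off programmatically. The sketch below is a minimal illustration that assumes a FASTA record formatted with the options above. The toy record is invented (it is not the real Rho sequence), and exons_from_case and spliced are hypothetical helper names.

```python
import re

def exons_from_case(fasta_text):
    """Return (sequence, exons); each exon is (start, end, length), 1-based inclusive."""
    # Drop FASTA header lines (they start with ">") and join the sequence lines.
    seq = "".join(
        line.strip()
        for line in fasta_text.splitlines()
        if line.strip() and not line.startswith(">")
    )
    exons = [
        (m.start() + 1, m.end(), m.end() - m.start())
        for m in re.finditer(r"[A-Z]+", seq)
    ]
    return seq, exons

def spliced(seq):
    """Join the upper-case (exon) runs, mimicking the intron-free mRNA sequence."""
    return "".join(re.findall(r"[A-Z]+", seq))

# Toy record, NOT the real Rho sequence: two exons separated by one intron.
toy = ">toy_gene\nATGGCCgtaagtcc\nttcagGTACATTAA\n"
seq, exons = exons_from_case(toy)
print(exons)
print(spliced(seq))
```

Joining the upper-case runs reproduces what the mRNA sequence link gives you: the exons stitched together with the introns removed.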
Part C: Editing and annotating sequences in ApE
For your take-home assignment you will import and annotate these sequences using a program called A
plasmid Editor (ApE). ApE is freeware for sequence analysis developed by Wayne Davis at the
University of Utah. The program is installed on all of the #3033 lab computers, but you will also need to install
it on a computer that you can use outside of class. It can be easily downloaded at
http://biologylabs.utah.edu/jorgensen/wayned/ape/ and installed onto your personal computer (note: there are
slightly different installation instructions for Mac users).
Creating a new sequence file
Open ApE and copy/paste the gDNA and mRNA sequences that you obtained from UCSC into separate new
DNA entries. Be sure not to paste in the FASTA sequence tags. You'll notice the software warns you if you try
to paste in illegal letters (i.e., not ATGC) and will remove them.
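If you prefer to clean a sequence before pasting it, the idea can be sketched as follows. This is not how ApE itself is implemented; it is a hypothetical helper (clean_sequence) that drops FASTA tag lines and any character other than A, C, G, or T, while keeping the upper/lower case that marks exons.

```python
def clean_sequence(text):
    """Return (sequence, removed): header lines dropped, non-ACGT characters stripped."""
    kept, removed = [], []
    for line in text.splitlines():
        if line.startswith(">"):  # FASTA tag line, not sequence
            continue
        for ch in line:
            if ch.upper() in "ACGT":
                kept.append(ch)
            elif not ch.isspace():
                removed.append(ch)
    return "".join(kept), removed

# Toy input with a tag line, digits, a space, and an ambiguous base (N).
seq, removed = clean_sequence(">toy_record\nATGc1 2gtN\nTTAA\n")
print(seq, removed)
```

Returning the removed characters lets you see what was stripped instead of losing it silently.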
To find a particular sequence, press "Command F" or click on the "binoculars" icon or select "Find" under
the "Edit" menu. Input the sequence you're looking for (type or copy/paste) into the search field and click "Find
next". Use this feature to search either of your sequences for the 3rd exon of the Rho gene:
MmRho 3rd exon:
GTACATCCCTGAGGGCATGCAATGTTCATGCGGGATTGACTACTACACACTCAAGCCTGAGGTCAACA
ACGAATCCTTTGTCATCTACATGTTCGTGGTCCACTTCACCATTCCTATGATCGTCATCTTCTTCTGCTATG
GGCAGCTGGTCTTCACAGTCAAGGAG
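The same search can be sketched in Python if you want to double-check the positions that ApE reports. The sequences below are toy examples, not the real Rho gene, and find_feature is a hypothetical helper. Positions are reported 1-based and inclusive, the way sequence coordinates are usually quoted.

```python
def find_feature(sequence, feature):
    """Return 1-based inclusive (start, end) of feature in sequence, or None if absent."""
    index = sequence.upper().find(feature.upper())
    if index == -1:
        return None
    return index + 1, index + len(feature)

# Toy sequences, NOT the real Rho gene: the "exon" follows an 8-nt lower-case intron.
gdna = "ATGGCCgtaagtccGTACATCCCTGAGtaa"
exon = "GTACATCCCTGAG"
start, end = find_feature(gdna, exon)
print(start, end, end - start + 1)
```

The search ignores case, so it works on both the mixed-case gDNA sequence and the mRNA sequence. In the mRNA sequence, the start position of an exon marks its junction with the preceding exon.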
Annotating sequence features
A nucleic acid sequence looks like nothing more than a bunch of random As Cs Gs and Ts. To make sense of
it we will annotate our sequences, meaning we will highlight a few features as points of reference. To annotate,
highlight a portion of sequence with the cursor then select Features>>>New Feature. Give the feature a name
(eg “exon 3”) and select a Forward color to highlight your sequence. (Note, for annotating features on the
reverse strand,such as reverse primers, select the “Rev-Com” option on the top right of the edit feature
screen. Hit OK. You should see exon 3 annotated in your sequence as whatever color you selected. You can
select areas of your sequence to find precise nt location or size of a highlighted region using the metrics at the
top of your sequence.
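“Rev-Com” stands for reverse complement: the sequence of the opposite strand, read 5’ to 3’. A minimal sketch of that operation, assuming only the four standard bases and using a hypothetical helper name:

```python
# Translation table for complementing bases, preserving upper/lower case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

print(reverse_complement("GTACATCC"))  # prints GGATGTAC
```

A reverse primer anneals to the top strand, so its own sequence appears in your top-strand view only as its reverse complement. That is why such features need the Rev-Com option.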



• At which nucleotides does exon 3 start and end in the gDNA sequence?
• How big is exon 3 in bp?
• Which nt represents the junction between exon 2 and exon 3 in the mRNA sequence?
To view and print your annotated sequence map, select Enzymes>>>Text Map. Keep the default settings for
configurations and hit OK. (This can also be done by selecting the “text map” icon in the tool bar above the
sequence; 3rd icon from the left.) This command gives you your sequence + annotations in a printable format.
Right click to print OR save your annotated text map sequence to a Word document as a screenshot and print
via Word. There are a number of other features not described in this guide that you can explore on your own if
you like.
Due in next class March 19 (individual assignment)
• Text map printout of your MmRho genomic DNA sequence with exon 2 and exon 3 annotated (printout
does not have to be in color)
• Indicate the nt that each exon starts at and how big each exon is (mark these numbers in pen or pencil
on your printouts)
• Text map printout of your MmRho mRNA sequence with exon 2 and exon 3 annotated (does not have
to be in color)
• Don’t forget to put your name on the printouts!
• See examples below
Example ApE mRNA sequence:
Here’s the same sequence in “text map” view: