Exploring the Human Transcriptome
Claudia Neuhauser
University of Minnesota
Informatics Institute
From DNA to Proteins
Source: Wikipedia (http://en.wikipedia.org/wiki/Alternative_splicing)
RNA: Ribonucleic Acid
• Types of RNA
– Ribosomal RNA (rRNA): catalytic component of
ribosomes (about 80-85%)
– Transfer RNA (tRNA): transfers amino acids to
polypeptide chain at the ribosomal site of protein
synthesis (about 15%)
– Messenger RNA (mRNA): carries information about a
protein sequence to the ribosomes (about 5%)
– Other types
• miRNA, siRNA, snRNA, dsRNA, …
Transcriptome
• The transcriptome is the set of all RNA
produced in a cell (or population of cells)
• The transcriptome of a cell varies over time
and with environmental conditions
• The mRNA transcripts reflect which genes are
actively expressed
– Microarray technology
– RNA-seq technology
Exploring Transcriptomes
Both microarray and RNA-seq compare mRNA and provide quantification of gene transcripts
From: Functional Genomics (G. Meroni and F. Petrera, 2012). Accessed through INTECH
(http://www.intechopen.com/books/functionalgenomics/beyond-the-gene-list-exploringtranscriptomics-data-in-search-for-gene-functiontrait-mechanisms-and)
Comparing Microarray and RNA-Seq
Wang, Zhong, Mark Gerstein, and Michael Snyder. "RNA-Seq: a revolutionary tool for
transcriptomics." Nature Reviews Genetics 10.1 (2009): 57-63.
RNA seq Experiment
By Boraas (Own work) [Public domain], via Wikimedia Commons
http://commons.wikimedia.org/wiki/File%3ARNA_Seq_Experiment.png
RNA seq Alignment
Malone, John H., and Brian Oliver. "Microarrays, deep sequencing and the true measure of the
transcriptome." BMC Biology 9.1 (2011): 34.
Figure 4: Correlation of gene expression based on RPKM by RNA-Seq and
protein abundance by label-free method. (A) MS1 based quantification by
msInspect plotted against RPKM, log transformed. (B) Normalized MS2 spectral
counts (NSAF) plotted against RPKM, log transformed. Data for mouse
mitochondrial genes in brainstem tissue. Protein abundance by msInspect is
based on top 3 normalized peptide area intensities.
Source: Ning, Kang, Damian Fermin, and Alexey I. Nesvizhskii. "Comparative analysis
of different label-free mass spectrometry based protein abundance estimates and
their correlation with RNA-Seq gene expression data." Journal of proteome
research 11.4 (2012): 2261-2271.
Resources
• ReCount
– http://bowtie-bio.sourceforge.net/recount/
– Online resource of RNA-seq gene count datasets from 18 different
studies
• Ensembl
– http://www.ensembl.org/index.html
– Genome database (automated gene annotation system)
• RefSeq
– http://www.ncbi.nlm.nih.gov/refseq/
– NCBI Reference Sequence Database (manually curated)
• Expression Atlas
– http://www.ebi.ac.uk/gxa/home
– Information on gene expression patterns under different biological
conditions
The Data
• ReCount
– http://bowtie-bio.sourceforge.net/recount/
• “ReCount is an online resource consisting of RNA-seq gene count datasets built using the raw data
from 18 different studies. […] By taking care of
several preprocessing steps and combining many
datasets into one easily-accessible website, we
make finding and analyzing RNA-seq data
considerably more straightforward.”
From ReCount to Excel I
• Wang, ET, et al. (2008):
http://www.ncbi.nlm.nih.gov/pubmed?term=18978772
• Count tables can be accessed by clicking on the
“link”
• Ctrl-a
• Ctrl-c
• Open Excel
– Click on Cell A1
– Ctrl-v
From ReCount to Excel II
• Click on the Data tab in your spreadsheet and click on
Text to Columns in the ribbon under Data Tools. The
Convert Text to Columns Wizard will guide you through the
next steps.
• Your original data are separated by spaces. Click on
Delimited to choose the original data type, and click
Next.
• Click Space in the Delimiters box. You should see how
the data will be displayed in the data preview. If it looks
correct, click Finish.
• Save your file or use the ones uploaded to the site.
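For readers who prefer a script to the spreadsheet wizard, the same space-delimited split can be sketched in Python. This is only an illustration: the text in `raw_text` is made-up sample data, and the helper name `parse_count_table` is hypothetical, not part of ReCount.

```python
import io

# Hypothetical excerpt of a space-delimited count table (values are made up).
raw_text = """gene sampleA sampleB
geneX 5 0
geneY 12 7
"""

def parse_count_table(handle):
    """Split each line on whitespace; return (sample_ids, {gene_id: [counts]})."""
    header = handle.readline().split()
    sample_ids = header[1:]              # the first header field names the gene column
    counts = {}
    for line in handle:
        fields = line.split()
        if not fields:                   # skip blank lines
            continue
        counts[fields[0]] = [int(x) for x in fields[1:]]
    return sample_ids, counts

sample_ids, counts = parse_count_table(io.StringIO(raw_text))
print(sample_ids)
print(counts["geneY"])
```

To read a saved file instead, pass an open file handle in place of the `io.StringIO` object.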
The Data
• Layout of the count table: the first column (gene) holds the Gene ID, and each
remaining column holds the number of Reads for one Sample ID
• Sample IDs shown: SRX003935, SRX003921, SRX003924, SRX003923
• Gene IDs shown (rows 1 to 10): ENSG00000000003, ENSG00000000005,
ENSG00000000419, ENSG00000000457, ENSG00000000460, ENSG00000000938,
ENSG00000000971, ENSG00000001036, ENSG00000001084, ENSG00000001167
• Read counts in the excerpt range from 0 to 404, and many entries are 0
Exercises 1 & 2: The Wang et al. Data
• Open the Heart tab
• Explore the genes
– Pick a gene ID and search in your browser for the
gene ID
– Explore the gene on the Ensembl website
• Explore the read count distribution
– What percentage of genes are expressed?
– What is the distribution of read counts?
– Detailed instructions are in workbook
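The two read-count questions above can also be explored outside the workbook. The sketch below assumes that "expressed" means at least one mapped read and uses arbitrary bin boundaries; the counts in `read_counts` are invented for illustration and are not taken from the Wang et al. data.

```python
from collections import Counter

# Hypothetical read counts for one sample (one entry per gene).
read_counts = [0, 0, 3, 17, 0, 250, 1, 42, 0, 8]

# Assumption: a gene counts as "expressed" if it has at least one read.
expressed = [c for c in read_counts if c > 0]
percent_expressed = 100.0 * len(expressed) / len(read_counts)

def count_bin(c):
    """Assign a read count to a coarse bin (boundaries are illustrative)."""
    if c == 0:
        return "0"
    if c < 10:
        return "1-9"
    if c < 100:
        return "10-99"
    return "100+"

distribution = Counter(count_bin(c) for c in read_counts)

print(f"{percent_expressed:.1f}% of genes expressed")
for label in ["0", "1-9", "10-99", "100+"]:
    print(label, distribution[label])
```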
From Raw Counts to Interpretation
• What affects the magnitude of the number of
reads assigned to a specific gene?
– Exon model
– Expression level
– Length of gene
– Sequencing depth
Normalizing Raw Counts I
• Raw Data
gene              length   SRX003929
ENSG00000104936   12836    2323
ENSG00000161016   2823     2319
• Similar number of reads but different lengths
• To compare genes within a sample, divide raw
count by length of gene
$$\text{length-normalized expression} = \frac{\text{raw count}}{\text{length}} = \frac{2{,}323}{12{,}836} = 0.1810$$
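The same division can be checked for both genes on this slide with a few lines of Python. The lengths and counts are the ones shown for sample SRX003929; the variable names are only illustrative.

```python
# (length in base pairs, raw read count) for the two genes shown for SRX003929.
genes = {
    "ENSG00000104936": (12836, 2323),
    "ENSG00000161016": (2823, 2319),
}

# Divide the raw count by the gene length to compare genes within one sample.
length_normalized = {
    gene_id: count / length for gene_id, (length, count) in genes.items()
}

for gene_id, value in length_normalized.items():
    print(gene_id, round(value, 4))
```

Although the raw counts are nearly equal, the shorter gene ends up with a length-normalized value roughly 4.5 times larger.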
Normalizing Raw Counts II
• Find the total number of reads N
• For gene i, calculate
$$\lambda_i = \frac{\text{raw count}/\text{length}}{\text{total count}} = \frac{q_i / L_i}{N}$$
• These numbers are very small
– The median is around $4 \times 10^{-10}$
• Multiply by $10^9$ = 1,000,000,000
• This new quantity is called RPKM (or FPKM)
– Reads per kilobase pair per million mapped reads
Normalizing Raw Counts III
• Calculating RPKM
$$\mathrm{RPKM}_i = \frac{\text{raw count}}{\left(\frac{\text{length}}{1{,}000}\right)\left(\frac{\text{total count}}{1{,}000{,}000}\right)} = \frac{q_i}{\left(\frac{L_i}{10^3}\right)\left(\frac{N}{10^6}\right)}$$
• This quantity can be used for within sample
analysis
• Note: gene annotation and length come from
an ‘exon model’
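A minimal Python sketch of the RPKM calculation follows. The count and length are those of ENSG00000104936 from the raw data slide, while the total of 10,000,000 mapped reads is an assumed value chosen only to make the arithmetic concrete.

```python
def rpkm(raw_count, length_bp, total_reads):
    """Reads per kilobase per million mapped reads."""
    return raw_count / ((length_bp / 1e3) * (total_reads / 1e6))

# Gene values from the raw data slide; the total read count is an assumption.
q_i = 2323
L_i = 12836
N = 10_000_000

value = rpkm(q_i, L_i, N)

# Equivalent form: the per-read, per-base rate (q_i / L_i) / N scaled by 10^9.
scaled = 1e9 * (q_i / L_i) / N

print(round(value, 2), round(scaled, 2))
```

Both forms give the same number, which is why multiplying the small per-base rate by $10^9$ yields RPKM.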
Exercise 3
• Heart Length tab
• Calculate RPKM
• Plot RPKM as a function of length
• Find genes that are strongly expressed in the
heart and go to the Expression Atlas to
confirm
• Detailed instructions are in workbook
Exercise 4
• The Heart-Liver tab has RNA-seq read counts
for two tissue types, the heart and the liver.
We will use this data set to learn about
differential expression.
• How many genes are expressed in both the
heart and the liver, in one but not the other,
and in neither tissue?
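One way to tally the four categories is sketched below in Python. The gene names and counts are hypothetical, and a gene is treated as expressed in a tissue when its read count is above zero, which is an assumption rather than a rule stated in the workbook.

```python
# Hypothetical (heart, liver) read counts per gene.
heart_liver_counts = {
    "geneA": (120, 0),
    "geneB": (0, 0),
    "geneC": (15, 40),
    "geneD": (0, 9),
    "geneE": (3, 1),
}

tally = {"both": 0, "heart only": 0, "liver only": 0, "neither": 0}

for heart, liver in heart_liver_counts.values():
    # Assumption: any nonzero read count means the gene is expressed.
    if heart > 0 and liver > 0:
        tally["both"] += 1
    elif heart > 0:
        tally["heart only"] += 1
    elif liver > 0:
        tally["liver only"] += 1
    else:
        tally["neither"] += 1

print(tally)
```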
Normalizing Raw Counts IV
• To compare across samples, we need to
account for sequencing depth
• For each sample, find the total number of
reads
• For gene i in sample k, calculate
$$\lambda_{ik} = \frac{\text{raw count}/\text{length}}{\text{total count}} = \frac{q_{ik} / L_i}{N_k}$$
• Sum over all genes i in sample to obtain
normalizing factor Λk
Normalizing Raw Counts V
• For each gene i in sample k, divide λik by Λk
$$\frac{\lambda_{ik}}{\Lambda_k} = \frac{q_{ik} / L_i}{N_k} \cdot \frac{1}{\sum_j \dfrac{q_{jk} / L_j}{N_k}}$$
• This quantity, called relative abundance, can
be used to compare across samples
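The relative abundance calculation can be sketched in Python as follows. The lengths and counts are invented, and the sketch assumes that the total count for a sample is the sum of the counts in its column.

```python
def relative_abundance(counts, lengths):
    """Return lambda_ik / Lambda_k for every gene in one sample."""
    total = sum(counts)                                   # N_k (assumed: column sum)
    lam = [(q / L) / total for q, L in zip(counts, lengths)]
    big_lambda = sum(lam)                                 # normalizing factor Lambda_k
    return [x / big_lambda for x in lam]

# Hypothetical gene lengths (bp) and read counts for two samples.
lengths = [1000, 2000, 500]
heart = [100, 200, 0]
liver = [10, 10, 30]

heart_rel = relative_abundance(heart, lengths)
liver_rel = relative_abundance(liver, lengths)

print(heart_rel)
print(liver_rel)
```

Within each sample the values sum to 1, so they can be compared across samples with different sequencing depths.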
Exercise 5
• The Heart Liver Length tab has an additional
column (Column C) with the length of each gene.
We will compare the relative importance of each
gene.
• Determine the total number of reads N for each
tissue.
• Calculate relative abundance for each tissue
• Graph the cumulative distribution function of the
relative abundance as a function of the number
of genes.
• Detailed instructions are in workbook
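A possible way to build the cumulative curve is sketched below. It assumes that genes are ranked from most to least abundant before accumulating, which is one reasonable reading of the exercise. The relative abundances are hypothetical.

```python
from itertools import accumulate

# Hypothetical relative abundances for one tissue (they sum to 1).
rel = [0.05, 0.60, 0.10, 0.25]

# Assumption: rank genes from most to least abundant, then accumulate.
ranked = sorted(rel, reverse=True)
cumulative = list(accumulate(ranked))

# Each pair is (number of genes included, cumulative share of abundance).
for n_genes, share in enumerate(cumulative, start=1):
    print(n_genes, round(share, 2))
```

Plotting the second column against the first shows how few genes account for most of the abundance.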
Exercise 6
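• Calculate the log fold change
$$\text{fold change} = \frac{\mathrm{RPKM}_{i1}}{\mathrm{RPKM}_{i2}}$$
• In the spreadsheet, use ‘=ABS(LOG(ratio,2))’, which gives the absolute
value of the base-2 logarithm of the ratio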
• Graph the log fold change as a function of
relative abundance for each tissue type
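The spreadsheet formula has a direct Python counterpart, sketched here. Returning `None` when either RPKM value is zero is an assumption made for this illustration, since the ratio or its logarithm is undefined in that case.

```python
import math

def abs_log2_fold_change(rpkm_1, rpkm_2):
    """Absolute base-2 log of the ratio, mirroring =ABS(LOG(ratio,2))."""
    if rpkm_1 <= 0 or rpkm_2 <= 0:
        return None          # assumption: skip genes with a zero in either tissue
    return abs(math.log2(rpkm_1 / rpkm_2))

# Hypothetical RPKM values for one gene in two tissues.
print(abs_log2_fold_change(8.0, 2.0))
print(abs_log2_fold_change(2.0, 8.0))
print(abs_log2_fold_change(0.0, 5.0))
```

Because of the absolute value, a fourfold increase and a fourfold decrease give the same result, so the direction of change has to be tracked separately if it matters.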