Comprehensive Analysis of RNA-Seq Data
Ewing’s Sarcoma Study
Introduction
RNA sequencing (RNA-Seq) opens the door to many new discoveries. It enables exploration of:
1. Differential gene expression
2. Transcript expression
3. Alternative splicing
4. Detection of new exons, transcripts, and splice variants
5. Allele specific expression
6. Coding SNPs
7. Pathways and biological interpretation
This case study demonstrates the RNA-Seq analysis of a Ewing’s Sarcoma data set. The samples were sequenced on an Ion Torrent™ PGM sequencer, with analysis performed in Partek® Flow®, Partek® Genomics Suite®, and Partek® Pathway™ software.
This experiment compares two tumor cell lines derived from Ewing’s sarcoma, a primary and a metastatic cell line, with non-tumor fibroblasts (later referred to as “normal”) derived from the same patients. Genomic changes between normal and tumor tissue are explored, as well as the mechanisms that turn a primary tumor metastatic, allowing it to spread to other parts of the body.
What is Ewing’s Sarcoma?
Ewing’s sarcoma is a rare cancer that can form in bone and soft tissue. It is the second most frequent primary malignant bone cancer found in young people. Patients usually experience extreme bone pain, and the tumor can quickly become metastatic and spread to the lungs, other bones, or the bone marrow. The overall five-year survival rate for Ewing’s sarcoma is approximately 60%.
FIGURE 1: Distribution of Ewing’s Sarcoma.
Data Import and Quality Control
Partek Flow is used to directly import the sequencing reads from TorrentSuite™ software via the Partek Flow Uploader, which is integrated with TorrentSuite™ software. The data set has three replicates for each of the three cell lines; nine samples in total are imported.
FIGURE 2: Base composition histogram shows this study does not have adaptor sequences.
To identify if there are adaptor sequences that require
trimming prior to performing alignment, the base composition
histogram in figure 2 is used. Based on this graph it is determined that there are no adaptor sequences to be trimmed.
Additionally, Phred scores, which give an average quality
score per position, are used to investigate the need to trim
away reads with low quality bases. In this experiment reads
are trimmed to exclude bases below a score of 13 (figure 3).
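For reference, a Phred score Q corresponds to an estimated base-calling error probability of 10^(-Q/10), so the threshold of 13 used here is roughly a 5% error rate. The sketch below shows how a per-position average might be computed from FASTQ quality strings. It assumes the common Phred+33 ASCII encoding, and the function names are illustrative rather than part of any Partek software.

```python
def phred_to_error_probability(q):
    """Convert a Phred quality score to an estimated error probability."""
    return 10 ** (-q / 10)


def mean_quality_per_position(quality_strings, offset=33):
    """Average Phred score at each read position.

    Assumes Phred+33 encoded FASTQ quality strings (an assumption;
    check the encoding of your own data). Reads may differ in length.
    """
    totals, counts = [], []
    for qual in quality_strings:
        for i, ch in enumerate(qual):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


if __name__ == "__main__":
    reads = ["IIII5", "II5.", "I5."]  # toy quality strings, invented
    print(mean_quality_per_position(reads))
    print(round(phred_to_error_probability(13), 3))
```

Positions where the average drops toward the threshold are the ones a plot like figure 3 would flag as candidates for trimming.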
Trimming enables removal of low-quality sequence ends. Advanced trimming options allow low-quality 3’ ends to be removed on a per-read basis: each base is screened from the 3’ end until the threshold is met. A minimum accepted read length is specified, and reads that trimming shortens below this length are discarded.
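The 3’ trimming rule described above can be sketched as follows. This is a minimal illustration, not Partek Flow's implementation: the function name is hypothetical, and the default minimum length of 25 is an assumed placeholder because the text does not state the value used in this study.

```python
def trim_3prime(sequence, qualities, min_quality=13, min_length=25):
    """Trim low-quality bases from the 3' end of one read.

    qualities: list of integer Phred scores, one per base.
    min_quality: bases below this score are trimmed (13 in this study).
    min_length: hypothetical minimum read length; not stated in the text.
    Returns (sequence, qualities), or None if the read is discarded.
    """
    end = len(sequence)
    # Walk inward from the 3' end until a base meets the threshold.
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1
    if end < min_length:
        return None  # too short after trimming: discard the read
    return sequence[:end], qualities[:end]
```

Note that only the 3’ end is screened: a low-quality base in the middle of a read is kept as long as a later base meets the threshold.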
FIGURE 3: Phred score diagram helps identify the quality threshold for trimming reads.
Alignment, Quantification, and Normalization
Partek Flow, which offers a wide selection of alignment algorithms, is used to align the reads. For this study, the TMAP algorithm is used in combination with TopHat2 to capture junction read information. In working with next generation sequencing (NGS) data, Partek scientists have found that a combination of multiple aligners can at times improve results.
Quantification and normalization are performed with an expectation maximization (EM) algorithm. Quantification maps all reads to the exon structure of a chosen transcriptome database such as RefSeq, GENCODE, Ensembl, or AceView. The quantification process calculates estimated levels of each transcript in all samples.
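To illustrate the general idea behind EM-based quantification, the sketch below estimates transcript proportions when some reads are compatible with more than one transcript. It is a simplified teaching example under stated assumptions: transcript length and mapping quality are ignored, the function name is hypothetical, and it is not the algorithm as implemented in Partek Flow.

```python
def em_quantify(read_compat, n_transcripts, n_iter=100):
    """Estimate transcript proportions from ambiguous read assignments.

    read_compat: list of lists; each inner list holds the indices of the
    transcripts a read is compatible with.
    Simplification: transcript length and mapping quality are ignored.
    """
    theta = [1.0 / n_transcripts] * n_transcripts  # uniform start
    for _ in range(n_iter):
        expected = [0.0] * n_transcripts
        # E-step: split each read across its compatible transcripts
        # in proportion to the current abundance estimates.
        for transcripts in read_compat:
            total = sum(theta[t] for t in transcripts)
            if total == 0:
                continue  # no support left for this read; skip it
            for t in transcripts:
                expected[t] += theta[t] / total
        # M-step: re-estimate proportions from the expected counts.
        n_reads = sum(expected)
        theta = [e / n_reads for e in expected]
    return theta


# Toy data: 6 reads unique to transcript 0, 2 unique to transcript 1,
# and 4 reads compatible with both.
reads = [[0]] * 6 + [[1]] * 2 + [[0, 1]] * 4
proportions = em_quantify(reads, n_transcripts=2)
```

In this toy case the ambiguous reads end up shared in the same 3:1 ratio as the unique reads, which is the behavior that makes EM useful for genes with overlapping isoforms.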
Differential Transcript Expression
To avoid any prior assumptions about the data, the Partek-developed Gene Specific Analysis (GSA) statistical model is used. Although it is widely accepted that not all genes behave in the same way, are affected by the same experimental factors, or have the same data distribution, researchers have frequently applied a single statistical algorithm to all genes within their studies. In comparison, the novel GSA algorithm selects the best statistical model for each gene. This acknowledges the fact that each gene is influenced by different factors and that each gene has a unique data distribution. Two important advantages of GSA are that it gives more statistical power through a better model fit and that it provides more information about which experimental factors are influencing gene expression.
This study shows that not all genes have the same distribution. Figure 4 shows that 76.43% of genes exhibit a Poisson distribution, 20.73% a normal distribution (to which GLM statistics are applicable), and 2.84% a negative binomial distribution. The gene expression profiles in this study clearly display different distributions and benefit from applying different statistical tests for differential transcript expression.
FIGURE 4: GSA output shows that the genes in this study follow three different statistical data distributions.
Revealing Data Patterns
Principal Components Analysis (PCA) is a data reduction method that allows visual investigation of sample grouping in a 3D scatter plot. The PCA results in figure 5 show the distinct grouping of cell line types. Separation along the X-axis, the first principal component (PC1), describes 30.52% of the variance and divides the primary and metastatic tumor cell lines from the normal. The gap between the primary and metastatic cell lines indicates important gene expression changes in the evolution of Ewing’s Sarcoma diseased cells.
FIGURE 5: Unsupervised PCA separates samples by cell line.
This pattern is also apparent along the vertical dendrogram of the Hierarchical Clustering results (figure 6). Hierarchical clustering enables visualization of differential expression and identifies groups of transcripts/genes of interest in this study. Three distinct gene groups are highly expressed (red) and separated along the horizontal top dendrogram (purple, blue and green). Detailed investigation of these genes helps unveil how a normal cell becomes a primary cancer cell, then a metastatic cell.
FIGURE 6: Hierarchical clustering points out interesting gene groups.
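A figure such as the share of variance described by PC1 is the largest eigenvalue of the covariance structure of the mean-centered data divided by the total variance. The sketch below computes that ratio with plain power iteration, using only the standard library. It is an illustration of the calculation, not the routine used by Partek software; the function name and the toy inputs are hypothetical.

```python
def pc1_variance_fraction(samples, n_iter=500):
    """Fraction of total variance captured by the first principal component.

    samples: list of rows, one per sample, each a list of expression values.
    Uses power iteration on the sample-by-sample Gram matrix of the
    mean-centered data, which shares its nonzero eigenvalues with the
    covariance matrix (up to a constant factor that cancels in the ratio).
    """
    n, p = len(samples), len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in samples]
    gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in centered]
            for r1 in centered]
    total = sum(gram[i][i] for i in range(n))
    if total == 0:
        return 0.0  # all samples identical: no variance to explain
    # Power iteration: repeatedly apply the matrix and renormalize.
    v = [float(i + 1) for i in range(n)]
    for _ in range(n_iter):
        w = [sum(gram[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:
            return 0.0
        v = [x / norm for x in w]
    # Rayleigh quotient gives the largest eigenvalue for unit vector v.
    gv = [sum(gram[i][k] * v[k] for k in range(n)) for i in range(n)]
    top = sum(a * b for a, b in zip(v, gv))
    return top / total
```

For example, four toy samples at (2, 0), (-2, 0), (0, 1), and (0, -1) spread four times as much variance along the first axis as along the second, so the function returns a PC1 fraction of about 0.8.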
Self Organizing Maps (SOM) shown in figure 7 are used
to identify even more distinct gene groups. Groups four and
seven show genes with growing expression levels between
normal, primary, and metastatic cells. These genes may be
important in the evolution of cancer cells. The opposite trend
is observed in group nine with gene expression levels dropping between normal, primary, and metastatic. Other groups,
such as three, help identify genes that are active only in the
primary cells, hence unique to this status. Visually interpreting
data with SOM unveils truly interesting biology.
Not surprisingly, the Ewing’s Sarcoma gene EWSR1 appears as a highly relevant gene for this data set. Indeed, EWSR1 is recognized as a master regulator in the development of Ewing’s Sarcoma. Figure 8 highlights that the EWSR1 gene is overexpressed in tumor versus normal.
FIGURE 7: SOM results identify interesting gene profiles across disease stages.
Alternative Splicing
Partek Flow uses alternative splicing analysis to highlight
transcript isoforms that are differentially expressed between
experimental conditions. A target hit gene from the alternative
splicing analysis is shown in figure 9. This is Sept9, a DNA
methylation biomarker that is known to be involved in breast
cancer, colorectal cancer, as well as leukemia.
Interestingly, the results show that out of three alternatively spliced forms of Sept9, one transcript is expressed only in the normal fibroblasts (highlighted in figure 9) and not in the cancer cell lines. Further investigation is needed to determine if Sept9 could be utilized as a biomarker for Ewing’s Sarcoma.
FIGURE 8: Close-up of EWSR1 gene. Top three tracks are sequencing reads for metastatic, normal, and primary cell lines. The bottom track shows the available RefSeq transcripts for the EWSR1 gene. EWSR1 is highly expressed in the diseased cell lines.
FIGURE 9: Isoform proportion plot of gene Sept9 displays a different distribution of reads and one exon (highlighted by the red arrows), which is only available in the normal cells, clearly displaying the alternative splice event.
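The quantity behind an isoform proportion plot is simply each transcript's share of the reads assigned to its gene. A minimal sketch, assuming per-isoform read counts are already available; the isoform names and counts below are invented for illustration and are not the Sept9 values from this study.

```python
def isoform_proportions(transcript_counts):
    """Share of a gene's reads assigned to each transcript isoform.

    transcript_counts: dict mapping isoform name -> read count for one
    sample or condition. Returns a dict of proportions summing to 1,
    or all zeros if the gene has no reads.
    """
    total = sum(transcript_counts.values())
    if total == 0:
        return {name: 0.0 for name in transcript_counts}
    return {name: count / total for name, count in transcript_counts.items()}


# Toy counts for a gene with three isoforms; the values are invented.
normal = isoform_proportions({"iso1": 30, "iso2": 50, "iso3": 20})
tumor = isoform_proportions({"iso1": 40, "iso2": 60, "iso3": 0})
# Isoforms detected in normal but absent in tumor.
normal_only = [n for n in normal if normal[n] > 0 and tumor[n] == 0]
```

Comparing the two dictionaries side by side shows a shift in isoform usage even when the gene's total expression is unchanged.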
Detect Novel Regions, Exons, and Transcripts
The analysis of novel regions provides several very interesting targets. The colored reads in figure 10 reflect expression of an area within the RYR2 gene that has not been annotated in the GENCODE or RefSeq databases. This could be an entirely new exon as well as a possible new, undiscovered transcript. Interestingly, these reads are only expressed in the tumor samples. Previous studies have already reported mutations of the RYR2 gene in lung cancer.
Coding SNPs
Coding SNPs are easily identified and visualized with Partek Genomics Suite. In figure 11, a coding SNP in the human leukocyte antigen HLA-A gene is identified. HLA molecules are critical to the interaction between tumor cells and the immune system and are components of both innate and adaptive immunity. The highlighted SNP is heterozygous in normal cells, but homozygous in both the primary and metastatic Ewing’s sarcoma cells.
FIGURE 10: Discovery of new transcript for gene RYR2. To be validated.
FIGURE 11: Coding SNP visualization of HLA-A gene.
Allele Specific Expression
With Partek software you can associate SNPs with allele specific expression. Figure 12 displays a SNP in the CNP gene. CNP has been shown to be a biomarker in cancer types such as glioblastoma and is associated with infiltration of the tumor activated by Wnt signaling, a pathway highly involved in tumor progression. This study reveals differential expression of the allele based on normal/tumor phenotype. This may imply some involvement in tumor progression.
FIGURE 12: Allele specific expression of a CNP gene SNP. In blue, normal samples express the A allele predominantly, whereas the tumor samples predominantly express the G allele.
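Allele specific expression at a SNP can be summarized as the fraction of covering reads that carry each allele, compared between groups. The sketch below is a minimal illustration with invented read counts; the function names are hypothetical, and it is not the method used by Partek software.

```python
def allele_fraction(ref_count, alt_count):
    """Fraction of reads at a SNP that carry the first (reference) allele."""
    total = ref_count + alt_count
    if total == 0:
        return None  # no coverage at this position
    return ref_count / total


def mean_allele_fraction(samples):
    """Average first-allele fraction over samples with coverage.

    samples: list of (ref_count, alt_count) tuples, one per sample.
    """
    fractions = [allele_fraction(r, a) for r, a in samples]
    fractions = [f for f in fractions if f is not None]
    return sum(fractions) / len(fractions) if fractions else None


# Invented read counts for the A (first) and G (second) alleles.
normal_samples = [(45, 5), (40, 10), (48, 2)]
tumor_samples = [(5, 45), (10, 40), (0, 0)]
shift = mean_allele_fraction(normal_samples) - mean_allele_fraction(tumor_samples)
```

A large positive shift, as in this toy data, corresponds to the pattern in figure 12: one allele dominating in the normal samples and the other in the tumor samples. A real analysis would also test whether the difference is statistically significant.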
Pathways and Biological Interpretation
Partek software takes this study from raw sequence data to biological interpretation, including gene ontology (figure 13) and pathway analysis (figure 14).
Gene ontology analysis shows activity expected in cancer, such as higher activation of the cell cycle, DNA repair, DNA replication, cell division, and cellular developmental processes. It is typical for tumor markers to show high proliferation and high replication, as highlighted in the DNA Replication pathway in figure 14.
FIGURE 13: Gene ontology results.
FIGURE 14: DNA replication pathway. Red highlights upregulated genes and green highlights downregulated genes.
Conclusion
In this RNA-Seq experiment, Partek Flow, Partek Genomics Suite, and Partek Pathway reveal the biology. This example has demonstrated a complete analysis of an RNA-Seq data set, including:
• Data import and quality control of FastQ bases
• Alignment, quantification, and normalization
• Differential transcript expression
• Alternative splicing
• Detection of novel regions, exons, and transcripts
• Coding SNPs
• Allele specific expression
• Pathways and biological interpretation
Contact Us
North America
Sales: +1 314.878.2329
Europe
Sales: +44 (0) 2075 588491
Asia/Australasia
Sales: +65 64789730
Partek Corporate Offices
St. Louis, Missouri USA
+1 314.878.2329 (office)
+1 314.275.8453 (fax)
www.partek.com
Try it Today!
Download a free, no-obligation, 14-day trial of Partek software at www.partek.com.
Trial software is fully functional
and supported by Partek’s friendly and knowledgeable customer support team.
Copyright © 2014 Partek Incorporated. All rights reserved.
Partek Genomics Suite, Partek Flow and Partek Pathway are trademarks of Partek Incorporated.
All other trademarks are property of their respective owners.