Comprehensive Analysis of RNA-Seq Data
Ewing’s Sarcoma Study

Introduction

RNA sequencing (RNA-Seq) opens the door to many new discoveries. RNA-Seq supports explorations such as:

1. Differential gene expression
2. Transcript expression
3. Alternative splicing
4. Detection of new exons, transcripts, and splice variants
5. Allele specific expression
6. Coding SNPs
7. Pathways and biological interpretation

This case study demonstrates the RNA-Seq analysis of a Ewing’s sarcoma data set. The samples were sequenced on an Ion Torrent™ PGM sequencer with analysis performed in Partek® Flow®, Partek® Genomics Suite®, and Partek® Pathway™ software. This experiment compares two tumor cell lines derived from Ewing’s sarcoma, a primary and a metastatic cell line, with non-tumor fibroblasts (later referred to as “normal”) derived from the same patients. Genomic changes between normal and tumor tissue are explored, as well as the mechanisms that turn a primary tumor metastatic, allowing it to spread to other parts of the body.

What is Ewing’s Sarcoma?

Ewing’s sarcoma is a rare cancer that can form in bone and soft tissue. It is the second most frequent primary malignant bone cancer found in young people. Patients usually experience extreme bone pain, and the tumor can quickly become metastatic and spread to the lungs, other bones, or the bone marrow. The overall five-year survival rate for Ewing’s sarcoma is approximately 60%.

FIGURE 1: Distribution of Ewing’s Sarcoma.

Data Import and Quality Control

Partek Flow is used to directly import the sequencing reads from TorrentSuite™ software via the Partek Flow Uploader, which is integrated with TorrentSuite™ software. The data has three replicates for each of the three cell lines; nine samples in total are imported.

FIGURE 2: Base composition histogram shows this study does not have adaptor sequences.
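As a rough illustration of what a base composition histogram such as the one in Figure 2 summarizes, the sketch below computes the fraction of each base at every read position. It is not Partek Flow's implementation; the function name `base_composition` and the plain list-of-strings input are assumptions made for this example. A run of positions dominated by a single fixed sequence would be a hint that adaptor sequence is present.

```python
from collections import Counter


def base_composition(reads):
    """Return one dict per read position with the fraction of A, C, G, T and N.

    `reads` is a list of read sequences (strings). Reads may differ in length;
    each position is normalized by the number of reads that reach it.
    """
    if not reads:
        return []

    max_len = max(len(read) for read in reads)
    counts = [Counter() for _ in range(max_len)]

    for read in reads:
        for pos, base in enumerate(read.upper()):
            # Anything that is not A, C, G or T is counted as N.
            counts[pos][base if base in "ACGT" else "N"] += 1

    composition = []
    for position_counts in counts:
        total = sum(position_counts.values())
        composition.append({b: position_counts[b] / total for b in "ACGTN"})
    return composition
```

Plotting these per-position fractions as stacked bars gives a histogram of the kind the figure describes.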
To identify whether there are adaptor sequences that require trimming before alignment, the base composition histogram in figure 2 is used. Based on this graph, it is determined that there are no adaptor sequences to be trimmed. Additionally, the Phred score diagram, which gives the average quality score per position, is used to investigate the need to trim low-quality bases. In this experiment reads are trimmed to exclude bases below a score of 13 (figure 3). Trimming removes sequence ends that are of low quality. Advanced trimming options allow low-quality 3’ ends to be removed on a per-read basis: each base is screened from the 3’ end until the threshold is met. A minimum accepted read length is specified, and reads that trimming shortens below this length are discarded.

FIGURE 3: Phred score diagram helps identify threshold of quality reads for trimming.

Alignment, Quantification, and Normalization

Partek Flow, which offers a wide selection of alignment algorithms, is used to align the reads. For this study the TMAP algorithm is used in combination with TopHat2 to capture junction read information. In working with next generation sequencing (NGS) data, Partek scientists have found that a combination of multiple aligners can at times improve results.

Quantification and normalization are performed with an expectation maximization (EM) algorithm. Quantification maps all reads to the exon structure of a chosen transcriptome database such as RefSeq, GENCODE, Ensembl, or AceView. The quantification process calculates the estimated levels of each transcript in all samples.

Differential Transcript Expression

To avoid any prior assumptions about the data, the Partek-developed Gene Specific Analysis (GSA) statistical model is used. Although it is widely accepted that not all genes behave in the same way, are affected by the same experimental factors, or have the same data distribution, researchers have frequently applied a single statistical algorithm to all genes within their studies. In comparison, the GSA algorithm selects the best statistical model for each gene. This acknowledges the fact that each gene is influenced by different factors and that each gene has a unique data distribution. Two important advantages of GSA are that it gives more statistical power through a better model fit and that it provides more information about which experimental factors are influencing gene expression.

This study shows that not all genes have the same distribution. Figure 4 shows that 76.43% of genes exhibit a Poisson distribution, 20.73% a normal distribution (to which GLM statistics are applicable), and 2.84% a negative binomial distribution. The gene expression profiles in this study clearly display different distributions and benefit from applying different statistical tests for differential transcript expression.

FIGURE 4: GSA output shows that the genes in this study follow three different statistical data distributions.

Revealing Data Patterns

Principal Components Analysis (PCA) is a data reduction method that allows visual investigation of sample grouping in a 3D scatter plot. The PCA results in figure 5 show the distinct grouping of cell line types. Separation along the X-axis, the first principal component (PC1), describes 30.52% of the variance and divides the primary and metastatic tumor cell lines from the normal. The gap between the primary and metastatic cell lines indicates important gene expression changes in the evolution of Ewing’s sarcoma diseased cells.

FIGURE 5: Unsupervised PCA separates samples by cell line.

This pattern is also apparent along the vertical dendrogram of the hierarchical clustering results (figure 6). Hierarchical clustering enables visualization of differential expression and identifies groups of transcripts/genes of interest in this study. Three distinct gene groups are highly expressed (red) and separated along the horizontal top dendrogram (purple, blue, and green). Detailed investigation of these genes helps unveil how a normal cell becomes a primary cancer cell, then a metastatic cell.

FIGURE 6: Hierarchical clustering points out interesting gene groups.

Self Organizing Maps (SOM), shown in figure 7, are used to identify even more distinct gene groups. Groups four and seven show genes with growing expression levels from normal to primary to metastatic cells. These genes may be important in the evolution of cancer cells. The opposite trend is observed in group nine, with gene expression levels dropping from normal to primary to metastatic. Other groups, such as three, help identify genes that are active only in the primary cells and hence unique to this status. Visually interpreting data with SOM unveils truly interesting biology.

FIGURE 7: SOM results identify interesting gene profiles across disease stages.

Not surprisingly, the Ewing’s sarcoma gene EWSR1 appears as a highly relevant gene for this data set. Indeed, EWSR1 is recognized as a master regulator in the development of Ewing’s sarcoma. Figure 8 highlights that the EWSR1 gene is overexpressed in tumor versus normal.

FIGURE 8: Close-up of EWSR1 gene. Top three tracks are sequencing reads for metastatic, normal, and primary cell lines. The bottom track shows the available RefSeq transcripts for the EWSR1 gene. EWSR1 is highly expressed in the diseased cell lines.

Alternative Splicing

Partek Flow uses alternative splicing analysis to highlight transcript isoforms that are differentially expressed between experimental conditions. A target hit gene from the alternative splicing analysis is shown in figure 9. This is Sept9, a DNA methylation biomarker that is known to be involved in breast cancer, colorectal cancer, and leukemia. Interestingly, the results show that out of three alternatively spliced forms of Sept9, one transcript is expressed only in the normal fibroblasts (highlighted in figure 9) and not in the cancer cell lines. Further investigation is needed to determine if Sept9 could be utilized as a biomarker for Ewing’s sarcoma.

FIGURE 9: Isoform proportion plot of gene Sept9 displays different distribution of reads and one exon (highlighted by the red arrows), which is only available in the normal cells, clearly displaying the alternative splice event.

Detect Novel Regions, Exons, and Transcripts

The analysis of novel regions provides several very interesting targets. The colored reads in figure 10 reflect expression of an area within the RYR2 gene that has not been annotated in the GENCODE or RefSeq databases. This could be an entirely new exon or a possible new, undiscovered transcript. Interestingly, these reads are only expressed in the tumor samples. Mutations of the RYR2 gene have already been reported in lung cancer in previous studies.

FIGURE 10: Discovery of new transcript for gene RYR2. To be validated.

Coding SNPs

Coding SNPs are easily identified and visualized with Partek Genomics Suite. In figure 11, a coding SNP in the human leukocyte antigen HLA-A gene is identified. HLA molecules are critical to the interaction between tumor cells and the immune system and are components of both the innate and adaptive immune response. The highlighted SNP is heterozygous in normal cells, but homozygous in Ewing’s sarcoma in both the primary and metastatic cells.

FIGURE 11: Coding SNP visualization of HLA-A gene.

Allele Specific Expression

With Partek software you can associate SNPs with allele specific expression. Figure 12 displays a SNP in the CNP gene.
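Both the zygosity comparison for the HLA-A SNP and allele specific expression rest on counting the reads that support each allele at a SNP position. The following is a minimal sketch of that idea, not the method used by Partek software: the function names and the `min_minor_fraction` cutoff of 0.2 are assumptions chosen for illustration. Because the counts come from RNA-Seq reads, they describe which alleles are expressed rather than the underlying genotype.

```python
def allele_fractions(allele_counts):
    """Convert per-allele read counts, e.g. {"A": 40, "G": 10}, to fractions."""
    total = sum(allele_counts.values())
    if total == 0:
        return {}
    return {allele: n / total for allele, n in allele_counts.items()}


def call_zygosity(allele_counts, min_minor_fraction=0.2):
    """Label a SNP by how many alleles have meaningful read support.

    An allele counts as supported when its read fraction is at least
    `min_minor_fraction` (a hypothetical, illustrative threshold).
    """
    fractions = allele_fractions(allele_counts)
    if not fractions:
        return "no coverage"
    supported = [a for a, f in fractions.items() if f >= min_minor_fraction]
    return "heterozygous" if len(supported) >= 2 else "homozygous"
```

Applied per sample, the fractions can be compared between normal and tumor groups to see whether one allele predominates in a phenotype.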
CNP is shown to be a biomarker in cancer types such as glioblastoma and is associated with infiltration of the tumor activated by Wnt signaling, a pathway highly involved in tumor progression. This study reveals differential expression of the allele based on normal/tumor phenotype. This may imply some involvement in tumor progression.

FIGURE 12: Allele specific expression of a CNP gene SNP. In blue, normal samples express the A allele predominantly, whereas the tumor samples express the G allele.

Pathways and Biological Interpretation

Partek software takes this study from raw sequence data to biological interpretation, including gene ontology (figure 13) and pathway analysis (figure 14). Gene ontology analysis shows activity expected in cancer, such as higher activation of the cell cycle, DNA repair, DNA replication, cell division, and cellular developmental processes. It is typical for tumor markers to show high proliferation and high replication, as highlighted in the DNA Replication pathway in figure 14.

FIGURE 13: Gene ontology results.

FIGURE 14: DNA replication pathway. Red highlights upregulated genes and green highlights downregulated genes.

Conclusion

In this RNA-Seq experiment, Partek Flow, Partek Genomics Suite, and Partek Pathway reveal the biology. This example has demonstrated a complete analysis of an RNA-Seq data set, including:

• Data import and quality control of FastQ bases
• Alignment, quantification, and normalization
• Differential transcript expression
• Alternative splicing
• Detected novel regions, exons, and transcripts
• Coding SNPs
• Allele specific expression
• Pathways and biological interpretation
Contact Us

North America Sales: +1 314.878.2329
Europe Sales: +44 (0) 2075 588491
Asia/Australasia Sales: +65 64789730

Partek Corporate Offices
St. Louis, Missouri USA
+1 314.878.2329 (office)
+1 314.275.8453 (fax)
www.partek.com

Try it Today!

Download a free, no-obligation, 14-day trial of Partek software at www.partek.com. Trial software is fully functional and supported by Partek’s friendly and knowledgeable customer support team.

Copyright © 2014 Partek Incorporated. All rights reserved. Partek Genomics Suite, Partek Flow and Partek Pathway are trademarks of Partek Incorporated. All other trademarks are property of their respective owners.