Maximizing the Value of NGS and Gene Expression Experiments (Synopsis): Strategies to Streamline Data Analysis and Interpretation for Actionable Research Outcomes

HIGHLIGHTS

• Research projects require large sums of money to produce data, but the investment and funds decrease once the data needs to be analyzed. Data analysis becomes the greatest roadblock in research due to the amount of data now being produced and the time it takes to analyze the data. Scientists are looking for the most effective, cheapest ways to streamline their gene expression data analysis in the shortest amount of time. IPA is the industry-leading gene expression analysis software, enabling labs to quickly narrow in on relevant information and examine data with biological references. In this paper, we discuss how IPA (a commercial product) compares to open-source tools and the resulting ROI (return on investment).

• The ROI is compared by calculating the time to analyze RNA-seq and microarray data sets with a commercial product (IPA) and with open-source tools.

• Gene expression datasets were analyzed to calculate the time required to complete the three most common scientific tasks: (1) research an unfamiliar gene to create experimental hypotheses, (2) analyze gene expression data (transcription factors, pathway and biological function effects) and (3) identify target genes regulated by microRNAs.

• The time saved on these common analyses by one product over the other corresponds to the ROI.

• Not only did IPA outperform open-source tools in the time it took to analyze gene expression data, but IPA also exceeded their analysis capabilities. IPA was found to save over 30 hours per data analysis over open-source tools. Also, with the powerful combination of the Ingenuity Knowledge Base and IPA, researchers gain a deeper understanding of the underlying biology of experimental systems and models.

• Scientists who use IPA can now go into greater, deeper biological detail, have access to scientific findings and perform more frequent analyses. Scientists can make more informed decisions on the next steps to take in their research studies.

INTRODUCTION

Today, scientists are able to leverage high-throughput techniques such as microarrays, proteomics or NGS to measure levels for nearly every mRNA, protein or DNA sequence variation, and easily produce tens or hundreds of thousands of data points. The resulting data analysis is exponentially more complicated and time consuming, and scientists grapple with the daunting task of analyzing these data to determine what actually occurred in the experiment. Gaining the full value from the experiment demands thorough biological interpretation to understand cause and effect. Scientists often look for “upstream” regulatory molecules such as transcription factors or microRNAs that may be responsible for gene expression changes observed in the experiment. To fully understand the effects of the experimental results, scientists must analyze the data for molecular pathways, biological functions and known toxicities, and identify particular gene(s) for further research (i.e. candidate targets or biomarkers).

Historically, scientists could rely on their own expertise and published literature to perform simple analysis tasks. With the proliferation of published literature, this becomes increasingly challenging and results in an analysis and information bottleneck. Today, scientists can turn to internet-based software tools, including informational websites (e.g. PubMed) and open-source or commercial analysis software (e.g. DAVID, Ingenuity IPA) for help interpreting data. Conducting literature searches, reading as many papers as possible and investigating “top” genes with the highest changes in expression are common practices used to interpret high-throughput data and design hypotheses for next experiments.

Unfortunately, the chance of missing a critical facet of biology is increased greatly, simply due to the abundance of available information in molecular databases and publications coupled with the amount of experimental data. This complexity is further increased with RNA-sequencing data, which provides a more precise measurement of the level of transcripts and isoforms than microarray technologies. Tools that help understand the data from different perspectives, bring different sources of information together and allow the scientist flexibility to explore avenues of interest are critical to understanding experimental results.

In this study, we compared a commercially available tool, IPA, with several open-source tools that are commonly used for analysis of ‘omics data from high-throughput experiments, for time investment and effectiveness of results. Three representative tasks were performed with the goal of determining which was the most efficient analysis for biological interpretation: 1) research an unfamiliar gene, 2) analyze gene expression data to identify regulatory transcription factors, pathway and biological function effects and 3) find target genes regulated by microRNAs in a data set. Time to completion and information gained were recorded, including possible benefits or liabilities of a particular method. Consolidated tools with efficient workflows, such as IPA and DAVID, provided the highest ROI compared to using search websites such as PubMed or miRBase. In addition, this exercise identified key drivers to consider with any biological data analysis solution purchase.

DISCUSSION

DETERMINING THE RETURN ON INVESTMENT (ROI)

The goal of this study was to quantify the time required for analyzing gene expression data resulting from microarray or RNA-sequencing experiments using both commercial and open-source tools. Based on the analysis times, time savings could be determined and a return on investment could be calculated. For purposes of this study the ROI compares the net benefits per scientist of implementing the software versus its total cost per scientist. The ROI is calculated from the net benefits divided by the software costs and expressed as a percentage. The ROI was calculated for a single year because IPA is an annual subscription.

To determine the return on investment for a single year of IPA, the average number of datasets uploaded and analyzed and the average amount of time a user spends in IPA were collected from the production system logs. IPA is the market-leading commercial pathway analysis tool and maintains a user base large enough to quantify utilization of the system by the average user. In 2011 the average IPA user uploaded and analyzed 20 data files from gene expression, metabolomics and proteomic experiments. This average user spent 62 hours in IPA analyzing this data, which equates to 1.6 weeks per year per person. For the simplistic test analysis conducted in this business brief, IPA was found to exceed the capabilities of the combined use of 3 open-source tools (DAVID, PathVisio and PubMed), saving almost 30 hours for a single data file and analysis (Figure 5). This is a 60% time savings per person per analysis over other tools. Extrapolating to the number of data files uploaded by the average IPA user gives 600 hours saved over the course of a year (30 hours x 20 datasets analyzed). At a fully loaded hourly rate of $100/hour, this equates to savings of $60,000 per person per year. Taking a conservative approach to the cost of an annual subscription for IPA, the average cost per user was determined from commercial account licensing fees. The resulting return on investment was determined to be 253% for one year.

Figure 1. Comparison of IPA and open-source data analysis workflows.

A. IPA Data Analysis Workflow
   Steps: Upload Data; Run Core Analysis; Pathways (overlay); Functional Effects; Transcription Regulators; Research Genes of Interest; Save, Export
   IPA analysis time: 32.65 min
   Review findings (research 10 genes): 2.33 h (average/gene) x 10
   Total: 20.87 hours

B. Open-Source Data Analysis Workflow*
   DAVID (Upload Data; Run Analysis; Functions Analysis*; Save, Export): 26.47 min
   PathVisio (Upload Data; Pathway (overlay); Save, Export; 10 pathways): 8.32 min x 10
   PubMed (Research Genes of Interest; research 10 genes): 4.75 h (average/gene) x 10
   Total: >49.29 hours
   *Direction of effect and transcription regulators prediction not included.

A) The conservative estimate of time for IPA data analysis is 20.87 h, including high-value functional effects and transcription regulator prediction. B) The open-source workflow requires 3 tools to conduct a complete analysis including functions and pathways, view pathways with data overlay, and follow up with manual gene research to interpret effects. This analysis workflow still lacks significant benefits only available in IPA, including predicted activation state of upstream transcription factors, the microRNA-mRNA Target Filter, and directional downstream effects.

To maximize the value of large-scale gene expression experiments, researchers must understand their data from a biological perspective. IPA helps scientists understand the biology most relevant to their experimental results and generate more confident hypotheses. We have identified five key drivers important for a biological data analysis solution to drive value and ensure successful ROI. In addition, indirect benefits can be realized through successful implementation of these drivers, such as a deeper understanding of the underlying biology of experimental systems and models, an increased level of experimental confidence, an improvement in researchers' ability to prioritize work and make decisions, and an overall improvement in innovation among research teams.

The five key drivers important for a biological data analysis solution:

1. Analysis: Elucidate cause and effect of observed gene expression changes. The ability to predict the activation state of upstream causes of gene expression changes, including transcription factors, microRNA, and other molecules that are upstream of the genes in a dataset, is key to a better understanding of the biological system and impact of the experiment.

2. Analysis: Enable a systems biology approach through network exploration. Using an iterative exploratory approach enables a deeper understanding of the biological system being studied. Tools need to accommodate multiple data types such as microRNA, metabolomics and proteomics in addition to gene expression data. Scientists can then use tools to generate networks and further explore the biology associated with the data, such as building second messenger cascades, identifying clinically validated biomarkers associated with the data, or determining which pathways are significantly impacted by selected molecules.

3. Platform: Support for cutting-edge research. The research community is fast paced, especially with the advent of next-generation sequencing technologies and the race to better understand the biology associated with the data. The best analysis solutions will be at the forefront of these technologies, embracing new ways to interpret data from these complex studies. Comprehensive tools should aid researchers using RNA-Seq in understanding experimental results at the isoform level and provide the ability to visualize specific biology associated with splice variants and their impacted protein domains. In addition, look for a comprehensive tool to identify and prioritize microRNA-mRNA target pairings by biological context such as pathways or disease.

4. Integration: Time cost of integrating all of the results. Even if you use multiple pieces of software to get different types of insights, it takes an extremely long time to integrate and interpret the results, since other software is usually designed to answer a single question about your data and is not designed for integration with other types of information.

5. Content: Comprehensive and timely quality content. The quality and timeliness of the content in the database supporting the analytical tools is critical. In addition, how the manually extracted facts are organized is crucial to enable computation and inferencing, semantic and linguistic consistency, and directional predictions. Comprehensive content can be incorporated from published literature and third-party databases for maximum coverage of biological and chemical interactions, functional annotations, protein domains, biomarkers, mutations, and microRNA-mRNA relationships, to name a few. Manual review by experts to ensure the content is accurate and detailed is key to providing confidence in the information and resulting analytics.

CONCLUSION

Researchers and laboratories invest thousands of dollars in instrumentation to produce data, but that investment can be lost or misguided if the data analysis is lacking or haphazard. For the large amounts of data produced from microarray or RNA-Seq experiments, biological interpretation aided by software such as IPA is key to enabling scientists to quickly narrow in on relevant information and examine data within a consistent set of biological references. Examining the results from an RNA-Seq or microarray dataset in the context of established and known molecular relationships, and identifying upstream causes of those expression changes and the downstream effects on biological processes and disease, provides a faster, more reliable and replicable way to identify key insights from complex data. Using a commercial tool such as IPA leads to a savings of over 30 hours per data analysis, which can mean optimization of resources leading to new testable hypotheses in a shorter amount of time.

When considering an analysis strategy, identifying the tools that give the highest return on investment is critical to maximizing the value of each experiment and of the investment in reagents and instrumentation. This paper systematically described and calculated the time required for three critical steps within the biological analysis of microarray data, resulting in a significant ROI of 253% for the commercial tool IPA. In addition, five key drivers were outlined to ensure a successful environment for achieving the maximum ROI and other benefits.

FULL VERSION

This is a synopsis of the paper Maximizing the Value of NGS and Gene Expression Experiments: Strategies to Streamline Data Analysis and Interpretation for Actionable Research Outcomes. To access the unabridged version please go to: http://www.ingenuity.com/products/ipa#/?tab=resources

ACKNOWLEDGEMENTS

The authors are grateful to the many scientists at EBI, Ingenuity, NIH-NCBI and Stanford Labs for working to provide gene, microRNA, and protein databases and analysis tools that help solve research problems.

ENDNOTES

1. Shendure J, Ji H. 2008. Next-generation DNA sequencing. Nat. Biotechnol. 26:1135-1145.
2. Fuller CW, et al. 2009. The challenges of sequencing by synthesis. Nat. Biotechnol. 27:1013-1023.
3. Metzker ML. 2010. Sequencing technologies – the next generation. Nat. Rev. Genet. 11:31-46.
4. NCBI PubMed. http://www.ncbi.nlm.nih.gov/pubmed/
5. Huang DW, Sherman BT, Lempicki RA. 2009. Systematic and integrative analysis of large gene lists using DAVID Bioinformatics Resources. Nature Protoc. 4(1):44-57.
6. Huang DW, Sherman BT, Lempicki RA. 2009. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 37(1):1-13.
7. Ingenuity Systems, Redwood City, CA. http://www.ingenuity.com.
8. Kelder T, Conklin BR, Evelo CT, Pico AR. 2010. Finding the Right Questions: Exploratory Pathway Analysis to Enhance Biological Discovery in Large Datasets. PLoS Biol 8(8): e1000472. doi:10.1371/journal.pbio.1000472
9. This 2004 white paper (Life Science Insights, an IDC company) concluded that IPA has a significant ROI for organizations in terms of productivity, cost savings, and innovation. Paper is available at http://www.ingenuity.com/products/ROI_IDC_LSI_7_04.pdf.

Ingenuity Systems, Inc., 1700 Seaport Blvd., Third Floor, Redwood City, CA 94063
Tel. +1 650 381 5100, Fax. +1 650 381 5190
www.ingenuity.com
© 2013 Ingenuity Systems, Inc. All Rights Reserved.
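
WORKED EXAMPLE: ROI ARITHMETIC

The ROI arithmetic described in the Discussion can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' own calculation. The function names are hypothetical. The synopsis does not state the per-user subscription cost, so the cost below is back-calculated from the reported 253% ROI, assuming "net benefits divided by the software costs" means ROI = (savings - cost) / cost. The workflow totals are the ones stated in Figure 1.

```python
# Sketch of the ROI arithmetic from the Discussion section.
# All function names are illustrative; the subscription cost is NOT given
# in the synopsis and is only inferred here from the reported 253% ROI.

def hours_saved(open_source_hours: float, ipa_hours: float) -> float:
    """Time saved per analysis when using IPA instead of open-source tools."""
    return open_source_hours - ipa_hours


def percent_saved(open_source_hours: float, ipa_hours: float) -> float:
    """Time saved per analysis as a percentage of the open-source time."""
    return 100.0 * hours_saved(open_source_hours, ipa_hours) / open_source_hours


def annual_savings(hours_per_analysis: float, datasets_per_year: int,
                   hourly_rate: float) -> float:
    """Yearly savings in dollars per person."""
    return hours_per_analysis * datasets_per_year * hourly_rate


def roi_percent(savings: float, cost: float) -> float:
    """ROI as net benefit over cost, in percent (assumed definition)."""
    return 100.0 * (savings - cost) / cost


def implied_cost(savings: float, roi_pct: float) -> float:
    """Cost that would yield roi_pct under the assumed ROI definition."""
    return savings / (1.0 + roi_pct / 100.0)


if __name__ == "__main__":
    # Totals stated in Figure 1: >49.29 h (open source) vs 20.87 h (IPA).
    print(f"Hours saved per analysis: {hours_saved(49.29, 20.87):.2f}")
    print(f"Percent saved per analysis: {percent_saved(49.29, 20.87):.1f}%")

    # Figures stated in the text: 30 h x 20 datasets x $100/hour.
    savings = annual_savings(30, 20, 100)
    print(f"Annual savings per person: ${savings:,.0f}")

    # Inferred, not stated in the synopsis.
    cost = implied_cost(savings, 253)
    print(f"Implied annual cost per user: ${cost:,.2f}")
    print(f"ROI check: {roi_percent(savings, cost):.0f}%")
```

Using the Figure 1 totals, the per-analysis saving comes out slightly under the rounded 30 hours and 60% quoted in the text. The implied cost is only as reliable as the assumed ROI definition. If ROI were instead defined as savings divided by cost, the implied cost would be different.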