Can a successful bioinformatics service become profitable? The case of R parallel
Toni Espinosa, Universitat Autònoma de Barcelona

What's in the R
- Statistical analysis package
- Open-source counterpart to SAS, SPSS or Matlab
- Analysis environment for life sciences: new algorithms/protocols are published in journals like Bioinformatics
- Source code available as a package in Bioconductor
- Any scientist can replicate recently published protocols with their data in R at no cost

Successful domain-specific language
- "It allows statisticians to do very intricate and complicated analyses without knowing the blood and guts of computing systems"
- Daryl Pregibon, a research scientist at Google

Relevance of R platform
Source: kdnuggets.com

Cheap sequencing: molecular biology becomes big data

Genomics data boom
- Experimentation costs drop by orders of magnitude
- Every few days, a new data set of terabytes is available
- Extending from big research centers to every biomedical group, and next to hospitals
- Great need to process large amounts of experimental data on any available system

Users need help
- Some data scientists have the programming ability to divide datasets and then join results
- But most of them lack any knowledge of current multi-core, GPU and Amazon AWS systems and how to program them
- And they have a large set of data to process on someone's disks

HPC required
- Lack of HPC resources in common scientific labs
- No parallel version of statistical/data mining software
- No experience of HPC in the expert users' field
- It is sometimes not easy to justify a project to adapt algorithms, or research funding for computing hours in the cloud

HPC in R community
- High-Performance and Parallel Computing with R packages
- http://cran.r-project.org/web/views/HighPerformanceComputing.html
- Explicit/implicit parallelism
- Cloud computing
- Hadoop
- Resource managers
- GPUs
- Super-R HPC workgroup

From high productivity to high performance: R parallel
- Effectively adapt R to high-performance and high-throughput computing technologies
- Allow the execution of R applications iterating on large data volumes in parallel
- Specifically applied to large sets of DNA data
- Support for independent and heterogeneous computers with multicore processors
- Extension to non-dedicated computation environments

1st wave of R tool users: success
- Hundreds of installation requests
- Dozens of daily requests for a basic code adaptation tutorial
- Some requests for further collaboration
- Heroku, University of Arkansas, Linkstar, …
- Cease-and-desist request by a US company
- Still a top-5 most-forwarded paper of all time in the bioinformatics journal

The problem: tool development work is a PhD thesis
- Good idea
- Good timing
- Promising prototype
- Users are interested, technology is instantly (wildly) used and help is requested: installation, bug correction, features, …
- Publications and the PhD thesis get written: the mail inbox is not managed
- Tool is abandoned

More life than publishing: code quality
- Standard software development practices, like documentation and testing, tend to fall by the wayside.
- Quality control of production software: stability, security, performance, documentation
- Management of early users as a community: eliminate entry barriers, produce new features often
- New functionalities requested by industry: initial business models

Bioinformatics tool from academia to profitable service
- Shared use by the (public sector) community: free testbed
- Everyday examples of enhanced productivity
- Good environment for fast code maturity and new feature R+D: freemium model
- Explore business cases of industry needs: big data, complex computational models, adaptation to specific platforms, …
- Relevant EU tools: T-Coffee, GEM mapper, SeqAn, …

Longer life for successful tools
- Need for standard industrial software development practices
- Need for medium- to long-term positions in charge of product development, not classic PhD students
- Need for professional developers like those of a small software company
- Opportunities to obtain early customers and own relevant intellectual property

Existing initiatives
- Berkeley Institute for Data Science (BIDS): 5-year, $37.8 M
- Saul Perlmutter (Lawrence Berkeley National Lab), Supercomputing 2013 keynote
  - Objective is to create software tools for data handling and analysis that can be widely shared
  - Creating new, long-term career paths for the data scientists that develop such tools
  - "Universities are losing much of the top data science talent they produce to industry. We need them back at the universities, working on the world's most important science problems, not trying to make people click on ads."
- http://bits.blogs.nytimes.com/2013/11/12/program-seeks-to-nurture-data-science-culture-at-universities/
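
Illustrative sketch (not part of the original slides)
The "divide datasets and then join results" pattern from the "Users need help" slide, which R parallel aims to automate, can be sketched roughly as follows. This is a minimal illustration written in Python rather than R, and it is not R parallel's API. All function names (split_into_chunks, gc_fraction, process_chunk, parallel_gc) are hypothetical, and the per-sequence GC-content computation is only an assumed stand-in for a real DNA analysis step.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Divide a list into at most n_chunks contiguous, near-equal pieces."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional item each.
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty chunks when there are few items
            chunks.append(items[start:end])
        start = end
    return chunks


def gc_fraction(sequence):
    """Fraction of G and C bases in one DNA sequence (0.0 if empty)."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


def process_chunk(chunk):
    """Per-chunk work: apply the same analysis to every sequence."""
    return [gc_fraction(seq) for seq in chunk]


def parallel_gc(sequences, workers=4):
    """Divide the data, process chunks concurrently, join in input order."""
    chunks = split_into_chunks(sequences, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in the order of the inputs.
        partial_results = list(pool.map(process_chunk, chunks))
    # Join step: flatten the per-chunk lists back into one list.
    return [value for part in partial_results for value in part]
```

A call such as parallel_gc(["GGCC", "ATAT", "ACGT"], workers=2) returns one value per input sequence, in the original order. A thread pool is used here only so the sketch runs anywhere without setup; for CPU-bound work in CPython, concurrent.futures.ProcessPoolExecutor offers the same map interface and would be the more likely choice.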