+
Can a successful bioinformatics service become profitable?
The case of R parallel
Toni Espinosa, Universitat Autònoma de Barcelona
+
What’s in R
- Statistical analysis package
- Open source alternative to SAS, SPSS or Matlab
- Analysis environment for life sciences: new algorithms/protocols are
  - published in journals like Bioinformatics
  - available as source code in a Bioconductor package
  - replicable: any scientist can rerun recently published protocols with their own data in R at no cost
+
Successful domain-specific language
- “It allows statisticians to do very intricate and complicated analyses without knowing the blood and guts of computing systems”
  Daryl Pregibon, a research scientist at Google
+
Relevance of R platform
Source: kdnuggets.com
+
Cheap sequencing: molecular biology becomes big data
+
Genomics data boom
- Experimentation costs drop by orders of magnitude
- Every few days, a new data set of terabytes becomes available
- Extending from big research centers to every biomedical group, and next to hospitals
- Great need to process large amounts of experimental data on any available system
+
Users need help
- Some data scientists have the programming skills to divide datasets and then join the results
- But most of them lack any knowledge of current multi-core, GPU and Amazon AWS systems and how to program them
- And they have a large set of data to process, sitting on someone’s disks
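
The divide-and-join pattern mentioned on this slide can be sketched in a few lines. This is only an illustration in Python rather than R, under stated assumptions: the helper names (`split_into_chunks`, `partial_stats`, `join_stats`, `chunked_mean`) and the toy computation (a mean) are hypothetical, and a thread pool stands in for whatever multi-core or cloud backend is actually available.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(values, n_chunks):
    """Divide a list into at most n_chunks contiguous, near-equal pieces."""
    size, extra = divmod(len(values), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks get one additional element.
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty pieces when n_chunks > len(values)
            chunks.append(values[start:end])
        start = end
    return chunks


def partial_stats(chunk):
    """Work done independently on each piece: element count and sum."""
    return len(chunk), sum(chunk)


def join_stats(partials):
    """Join step: combine per-chunk counts and sums into a global mean."""
    total_n = sum(n for n, _ in partials)
    total_sum = sum(s for _, s in partials)
    return total_sum / total_n


def chunked_mean(values, n_chunks=4):
    """Mean of a non-empty list, computed chunk by chunk (n_chunks >= 1)."""
    chunks = split_into_chunks(values, n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(partial_stats, chunks))
    return join_stats(partials)
```

The join step is where hand-written versions tend to go wrong: averaging the per-chunk means gives a wrong answer when chunks differ in size, which is why each piece returns its count together with its sum.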
+
HPC required
- Lack of HPC resources in common scientific labs
- No parallel version of statistical/data mining software
- No HPC experience in the expert users’ field
- Sometimes it is not easy to justify a project to adapt algorithms, or research spending on computing hours in the cloud
+
HPC in the R community
- High-Performance and Parallel Computing with R packages
- http://cran.r-project.org/web/views/HighPerformanceComputing.html
- Explicit/implicit parallelism
- Cloud computing
- Hadoop
- Resource managers
- GPUs
- Super-R HPC workgroup
+
From high productivity to high performance: R parallel
- Effectively adapt R to high-performance and high-throughput computing technologies
- Allow R applications that iterate over large data volumes to execute in parallel
- Specifically applied to large DNA data sets
- Support for independent and heterogeneous computers with multicore processors
- Extension to non-dedicated computation environments
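
As a rough picture of what "iterating on large data volumes in parallel" means, the sketch below runs the same independent loop body over a list of DNA sequences, first sequentially and then through a worker pool. It is not the R parallel implementation, and it is written in Python rather than R. The function names (`gc_content`, `sequential_loop`, `parallel_loop`) and the GC-content computation are assumptions chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def gc_content(sequence):
    """Fraction of G and C bases in one DNA sequence (0.0 if empty)."""
    if not sequence:
        return 0.0
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def sequential_loop(sequences):
    """The loop a user would normally write: one iteration per sequence."""
    return [gc_content(s) for s in sequences]


def parallel_loop(sequences, workers=4, executor_cls=ThreadPoolExecutor):
    """Same iterations, handed to a pool; result order matches input order."""
    with executor_cls(max_workers=workers) as pool:
        return list(pool.map(gc_content, sequences))
```

The loop can be distributed this way only because each iteration depends on nothing but its own input. In CPython a thread pool will not speed up CPU-bound work like this; passing `concurrent.futures.ProcessPoolExecutor` as `executor_cls` is the usual swap when real multicore speedup is wanted.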
+
1st wave of R tool users: success
- Hundreds of installation requests
- Dozens of daily requests for the basic code adaptation tutorial
- Some requests for further collaboration
- Heroku, University of Arkansas, Linkstar, …
- Cease-and-desist request from a US company
- Still among the top 5 most forwarded papers of all time in the Bioinformatics journal
+
The problem: tool development work is a PhD thesis
- Good idea
- Good timing
- Promising prototype
- Users are interested, technology is instantly (wildly) used and help is requested: installation, bug correction, features, …
- Writing publications and the PhD thesis takes over: the mail inbox goes unmanaged
- Tool is abandoned
+
More life than publishing: code quality
- Standard software development practices, like documentation and testing, tend to fall by the wayside.
- Quality control of production software: stability, security, performance, documentation
- Management of early users as a community: eliminate entry barriers, produce new features often
- New functionalities requested by industry: initial business models
+
Bioinformatics tool from academia to profitable service
- Shared use by the (public sector) community: free testbed
- Everyday examples of enhanced productivity
- Good environment for fast code maturity and new feature R&D: freemium model
- Explore business cases of industry needs: big data, complex computational models, adaptation to specific platforms, …
- Relevant EU tools: T-Coffee, GEM mapper, SeqAn, …
+
Longer life for successful tools
- Need for standard industrial software development practices
- Need for mid- to long-term positions in charge of product development, not classic PhD students
- Need for professional developers like those in a small software company
- Opportunities to obtain early customers and own relevant intellectual property
+
Existing initiatives
- Berkeley Institute for Data Science (BIDS): 5-year, $37.8 M
- Saul Perlmutter (Lawrence Berkeley National Lab), Supercomputing 2013 keynote
  - Objective is to create software tools for data handling and analysis that can be widely shared
  - Creating new, long-term career paths for the data scientists that develop such tools
“Universities are losing much of the top data science talent they
produce to industry. We need them back at the universities,
working on the world’s most important science problems — not
trying to make people click on ads.”
http://bits.blogs.nytimes.com/2013/11/12/program-seeks-to-nurture-data-scienceculture-at-universities/