Project C
Sage Infrastructure Tools Project
• Carole Goble, University of Manchester, UK
• Ted Liefeld, Broad Institute
• Alex Pico, Gladstone Institutes
• Marc Hadfield, Alitora
Tools Afternoon Session
• Review of developments to date
– Creating Semantic Model for Sage Networks
– Storing Sage Networks with Alitora for Search & Visualization
– Performing Key Driver Analysis with GenePattern
– Taverna workflow for annotating and analyzing the network model
– Working with Sage Networks in Cytoscape
• Other network model tools
– Additional tool providers discuss integrating with Sage
• Looking forward
– open questions and gaps
– breakout sessions
Project Workstream C: Tools
[Diagram labels: Raw Datasets; Annotated & Standardized; Network Inference; Infrastructure Tools; Access & Analysis]
Core principles
1. Maximize access
2. Maximize use
3. Maximize reuse
• Distribute multiple file formats
• Make use of existing standards and tools
• Design for flexible, extensible solutions
• Support collaboration and community annotation
The SAGE Pipeline
[Diagram labels: Network; Data; FORMAT; R-Script; Re-integrate; Cytoscape; Visualisation]
Session for Project C: Tools
1. Sage Semantic Ontology (Data Model)
2. Direct Download: just give me the data
3. Search and Browse: web interface
4. Interactive Analysis: extensible workflows
A. Gene Pattern Workflow
B. Taverna Workflow
C. Cytoscape Workflow
5. Related Tools: related communities
A. SCF/SWAN – Tim Clark
B. Bio2RDF – Michel Dumontier
RDF (Semantic) Standard
triple: base unit of “meaning”…
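To make the triple idea concrete, here is a minimal sketch that models each statement as a (subject, predicate, object) tuple and answers simple pattern queries. The gene names and the `regulates` and `memberOf` predicates are invented for illustration; they are not taken from the Sage ontology, and a real deployment would use an RDF store rather than a Python list.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical statements, for illustration only (not from the Sage ontology).
TRIPLES = [
    Triple("geneA", "regulates", "geneB"),
    Triple("geneA", "regulates", "geneC"),
    Triple("geneB", "memberOf", "network1"),
]


def match(triples: List[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]
```

For example, `match(TRIPLES, subject="geneA")` returns every statement made about `geneA`, whatever the predicate.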
Semantic LinkedData
Sage Ontology (OWL)
Tools and Semantics
Tools and LinkedData
Direct Download
1. Go to http://sagebase.org/commons
2. Access standardized datasets and networks contributed to Sage Commons
3. Download networks as:
A. Formatted text files (.tab)
B. Simple interaction files (.sif)
C. Cytoscape session files (.cys)
D. Semantic OWL files (.owl)
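Of the formats listed above, the simple interaction file (.sif) is the easiest to read programmatically: each line is a source node, an interaction type, and one or more target nodes. The sketch below is one possible reader, not a Sage-provided tool. It chooses the delimiter per line (tab if the line contains one, whitespace otherwise), which is a simplification, and the node names in the usage note are made up.

```python
from typing import Iterable, List, Tuple

Edge = Tuple[str, str, str]  # (source, interaction, target)


def parse_sif(lines: Iterable[str]) -> Tuple[List[Edge], List[str]]:
    """Parse SIF lines into a list of edges and a list of isolated nodes."""
    edges: List[Edge] = []
    isolated: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        # Simplification: pick the delimiter line by line.
        fields = line.split("\t") if "\t" in line else line.split()
        if len(fields) == 1:
            isolated.append(fields[0])  # a node with no interactions
            continue
        if len(fields) == 2:
            raise ValueError(f"SIF line has no target node: {line!r}")
        source, interaction, *targets = fields
        for target in targets:
            edges.append((source, interaction, target))
    return edges, isolated
```

Passing an open file handle works as well as a list of strings, e.g. `parse_sif(open("network.sif"))`, where `network.sif` stands for any downloaded file.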
Repository of Sage Networks
Web App
Plug Ins
Alitora’s Semantic Repository
Repository of Semantic Data
Copyright Alitora Systems, Inc. 2009
Semantic Repository
Graph Database
Designed for network storage & query
Scalable to billions of data objects
• Federated
• Cloud-deployable
• Web-scale
• Indexing 1 billion RDF triples/hour
• 1000 QPS/CPU: “semantic select”
• Clustering Algorithms in graph elements
• Queries can focus on relevant Cluster(s)
• Typical Query is 1-to-1 to relevant Cluster
• Worst case query performance is inverted index
• As per semantic queries, there are no “joins”
Full Pathway Queries
Knowledge Relevancy
Algorithms help determine which knowledge is important across billions of facts.
Sage “KDA” is an example of an algorithm to find important “nodes” in the networks.
Relevancy can be based on Graph Topology
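One simple topology-based relevancy measure is node degree: nodes with more connections rank higher. The sketch below illustrates that idea only. It is not the Sage KDA algorithm or Alitora's ranking, neither of which is specified here, and the function name is chosen for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def rank_by_degree(edges: Iterable[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Rank nodes of an undirected edge list by how many edges touch them."""
    degree: Counter = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    # Highest degree first; ties broken alphabetically for a stable order.
    return sorted(degree.items(), key=lambda item: (-item[1], item[0]))
```

Degree is only the crudest topology signal; it is used here because it needs no library beyond the standard one.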
Collaborative Interface
SageCommons Web Demo
Search and Browse
1. Go to http://saas.alitora.com/sagedemo/
2. Access web interface to semantic database
A. Anonymous access
B. Login to store and share findings
C. Identify networks for download, visualization and workflows
Sage Commons Demo
Open API
Web interface
Cytoscape plugin
Interactive Analysis
Extensible workflows direct Sage Commons networks through customizable pipelines for analysis and visualization
1. Access semantic database of networked data
2. Perform Key Driver Analysis (KDA)
3. Write results back to database
4. Visualize network and results in Cytoscape
GenePattern Workflow
An integrative genomics analysis platform with
• Comprehensive repository of tools
• Construction of flexible, reproducible analysis workflows
• Ability to add new tools easily
• Interface accessible to many levels of user
• Configurable to available compute resources
www.genepattern.org
GenePattern: A platform for integrative genomics
[Diagram: Module Repository (KNN, PCA, GISTIC, GSEA, SVM, NMF, FLAME, CBS); Module Integrator; Pipeline Environment; Client User Interfaces (Web, Programming). Example pipeline on all_aml_train and all_aml_test: Preprocess, SOM Clustering, Class Neighbors, Weighted Voting Cross-Val, Weighted Voting Train/Test, with SOM Cluster Viewer, Marker Selection Viewer, Prediction Results Viewer and Visualizer. Golub and Slonim et al. 1999]
GenePattern Software
Release Information
• Originally released 2004
• Current version 3.2.1, released November 2009
• Currently 12,000+ users, 500+ organizations, ~90 countries
Availability
• Freely available, runs on Windows, Mac OS, and Linux platforms
Resources
• http://www.genepattern.org
• User workshops, documentation, email help desk, online user forum
• Reich et al. (2006) Nature Genetics
• Collaborations with 2 NIH Biomedical Computing Roadmap Centers and NCI’s cancer biomedical informatics grid (caBIG)
GenePattern is a winner of the 2005 BioIT World Best Practices Award
Web 2.0 community to share diverse computational tools
www.genomespace.org
6 Seed Tools: Cytoscape, Galaxy, GenePattern, Genomica, IGV, UCSC Browser
3 Driving Biological Projects (DBPs): Cancer, lincRNAs, Stem cell circuits
Outreach: new tools
Outreach: new DBPs
Partner Institutions
Performing Key Driver Analysis in GenePattern
• Sage provided R scripts that perform the KDA analysis
• These were wrapped as a GenePattern (GP) module
– GP generated a web user interface and web service for KDA
– This web service was used to integrate KDA into Taverna
• A demonstration GenePattern pipeline (workflow)
– Calculate differentially expressed genes in a TCGA dataset
– Perform KDA using a Sage breast cancer network model and the gene list from the differentially expressed genes
– Reformat the KDA output for Cytoscape
– Launch Cytoscape to visualize the results
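The reformatting step in the demonstration pipeline might look roughly like the sketch below. The real KDA output format and the GenePattern module's interface are not described here, so this assumes the simplest possible input (a list of network nodes and a list of key-driver gene names) and produces a tab-delimited table that could be imported as node attributes. The function name and column headers are hypothetical.

```python
import csv
import io
from typing import Iterable


def key_driver_table(network_nodes: Iterable[str],
                     key_drivers: Iterable[str]) -> str:
    """Build a tab-delimited node table flagging which genes are key drivers."""
    drivers = set(key_drivers)
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    # Hypothetical column names; any header understood by the importer works.
    writer.writerow(["gene", "is_key_driver"])
    for node in sorted(set(network_nodes)):
        writer.writerow([node, "true" if node in drivers else "false"])
    return buffer.getvalue()
```

Writing the returned string to a file gives a table in which the `is_key_driver` column can drive a visual style, for example to highlight key drivers in the network view.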
Key Driver Analysis Demo
Taverna Workflow
A suite of tools for bioinformatics
• Fully featured, extensible and scalable scientific workflow management system
– Workbench, server, portal
– Standards-compliant provenance collection
– Immediate ingest of web services
– Grid services, Beanshell scripts, R-scripts, BioMOBY services…
• Web 2.0 social collaboration environments (“E-Labs”) for sharing
– Methods, workflows
– Systems biology data, models and SOPs
– Statistical methods
• Curated catalogue of Web Services
Taverna Open Suite of Tools
[Diagram labels: Workflow Repository; Workflow GUI Workbench; Client User Interfaces; Third Party Tools; Service Catalogue; Provenance Store; Workflow Server; Web Portal; Activity and Service Plug-in Manager; Open Provenance Model; Secure Service Access; Programming and APIs]
Taverna Software
Release Information
• Taverna first released 2004.
• Current versions 1.7.2 and Taverna 2.1.2
• Currently 1,500+ users per month, 350+ organizations, ~40 countries, 80,000+ downloads across versions
Availability
• Freely available, open source LGPL
• On Windows, Mac OS, and Linux platforms
Resources
• http://www.taverna.org.uk, http://www.mygrid.org.uk
• User and developer workshops, documentation, email help desk
• Collaborations with numerous groups including NCI’s cancer biomedical informatics grid (caBIG), EMBL-EBI, NCBI, Concept Web Alliance, Bio2RDF
myExperiment
• A Web 2.0 community for sharing, discovering and reusing workflows and other scientific methods.
• A platform for launching workflows
• Launched late 2007.
• Currently: 3272 members, 223 groups, 1024 workflows, 306 files and 97 packs, 56 different countries.
• 10+ workflow systems: Taverna, Pipeline Pilot, BioExtract, Kepler
• ~3000 unique hits per month
• REST APIs
• Linked Open Data
• Software: open source, BSD
Systems Biology and myGrid
SysMO-SEEK
• e-Laboratory for interlinking and sharing data, models, SOPs and workflows for Systems Biology in Europe
• ISA-TAB & SBML/MIRIAM compliant
• http://www.sysmo-db.org/
ONDEX
• Network based analysis environment for Systems Biology
• Uses Taverna workflows and text mining
• http://www.ondex.org/
Performing Taverna KDA and Pathways pipeline
• A demonstration Taverna Pipeline (workflow)
• Calculate differentially expressed genes in a TCGA dataset
• Perform KDA using a Sage breast cancer network model and the gene list from the differentially expressed genes
• Reformat the KDA output for Cytoscape
• Launch Cytoscape to visualize the results
• Extract gene names from TCGA dataset
• Find pathways for these genes in KEGG using workflow deposited in myExperiment.
Taverna pathway pipeline demo
Cytoscape Workflow
Cytoscape is an open source software platform for integrating, visualizing, and analyzing measurement data in the context of networks
Cytoscape is a collaboration between
• University of California, San Diego
• Institute for Systems Biology
• Memorial Sloan-Kettering Cancer Center
• Institut Pasteur
• Agilent Technologies
• University of Toronto
• Gladstone Institute for Cardiovascular Disease
• University of California, San Francisco
• Unilever
• National Center for Integrative Biomedical Informatics
Free from: http://www.cytoscape.org
• 60,000+ downloads for 2.x release; 27,000 downloads in the last year; 2,300/month
• 340+ published articles citing Cytoscape; 135 articles in the last year
• 50+ registered plugins, developed by leading research groups
Applications of Networks in Disease
• Identification of disease subnetworks – identification of disease subnetworks that are transcriptionally active in disease
• Subnetwork-based diagnosis – source of biomarkers for disease classification, identify interconnected genes whose aggregate expression levels are predictive of disease state
• Network-based gene association – map common pathway mechanisms affected by collection of genotypes (SNP, CNV)
Plugins shown: Agilent Literature Search; Mondrian, MSKCC; PinnacleZ, UCSD
Cytoscape Plugin
Open API
Web interface
Cytoscape plugin
Connecting to Your Memory
KDA Plugin
SCF/SWAN
Tim Clark
Instructor in Neurology, Harvard Medical School
Director of Informatics, MassGeneral Institute for Neurodegenerative Disease
Core Member, Harvard Initiative in Innovative Computing
Bio2RDF
Michel Dumontier
Associate Professor
Department of Biology
School of Computer Science
Institute of Biochemistry
Carleton University, Canada
Implications for Sage infrastructure
Lessons Learned:
1. Standard network & gene list file formats are critical to the success of infrastructure tools.
2. Current dataset and network repositories fall short of providing a community resource with adequate standards and extensible tools.
Challenges Ahead:
1. Preparing for increasing scale and scope of data
2. Preparing for future data types and analyses
[Diagram labels: Formats; Identifiers; Services; Map to standards; Appropriate interfaces]
[Diagram, labels repeated in two columns. Semantics: Domain Semantics, Ontologies, Custom Data Objects, Information models. Syntax: Configuration, Invocation model, Interface, Data format, Data identity]
Keep It Simple.
Open Source.
Web 2.0 Development Patterns
1. The Long Tail: Leverage scientist self-service to reach out to the long tail.
2. Users Add Value: Involve colleagues and other scientists, both implicitly and explicitly, in adding value to your application.
3. Network Effects by Default: Set inclusive defaults for aggregating user data as a side-effect of their use of the application.
4. Perpetual Beta: Don't package up new features into monolithic releases. Add them on a regular basis as part of the normal user experience.
5. Cooperate, Don't Control: Design for mash ups. Offer web services interfaces and content syndication, and re-use the services of others.
6. Some Rights Reserved: Benefits come from collective adoption. Make sure that barriers to adoption are low. Follow existing standards. Use licenses with as few restrictions as possible. Design for "hackability" and "remixability."
7. Data is the Next Intel Inside: Applications are increasingly data-driven. For competitive advantage, seek to own a unique, hard-to-recreate source of data – workflows are data and data sources.
8. Software Above the Level of a Single Device: Design your application from the get-go to integrate and launch services across any interface.
Adapted from Tim O’Reilly’s Web 2.0 2005
This afternoon
• Drill down into demos and experiences
• Guests
– Tim Clark – SWAN, Web 3.0, neurodegeneration
– Michel Dumontier – Bio2RDF
• Audience participation!
– Opportunities, Barriers and Incentives
– Platforms, datasets, services and tools
– Technologies and Standards
– Directions for Sage Bionetworks
Questions for Afternoon
1. Are there specific gene list and network model databases, tools and platforms that we want to integrate with the Sage Data?
• e.g. MSigDB gene lists
2. What form of integrated analysis would be most useful for finding new biological insights using the Sage models and KDA?
• e.g. Would we like to be able to create lists of mutations from TCGA to use as inputs to KDA and the Sage models?
• What model annotations are necessary to make this useful – context?
Questions for Afternoon
1. Provenance - what is needed at Sage to ensure provenance of network models is preserved for future reference? E.g. do models need unique, persistent, referenceable identifiers? Will they be versioned? If models change due to new data, or updated algorithms, how can we easily rerun analyses? What privacy software do we need and could leverage?
2. Will SageCommons need to be ‘replicable’ at other sites to support privacy - e.g. Pharma and Biotech who do not want their use of the models to be potentially snooped on the ‘net?
Audit of Tools