Biological information
management and analysis
as illustrated by malaria research
1. Problems
2. Managing data context
3. Managing and analyzing data
Factors in combating malaria
Economic
Political/Ethical
Scientific: biology,
ecology, chemistry, etc.
Cultural/Sociological
Environmental
Scientific layers
Psychology/Sociology: Emergent properties of brain and populations
Biology: All complexity of element interactions (macromolecules, cells, brain, populations)
Chemistry: Properties of "simple" element interactions
Physics: Properties without inter-element interactions
The labyrinth of biological research
- Which direction to follow and in what way?
- What relevant information is available?
- How to keep a good record of the path?
- How to find useful collaborators?
- What do the results imply?
Researchers are drowning in the sea of information…
Problems with "physicalization" of biology
• Data richness
• Data sharing and integration
• Model-data correspondence
• Understanding bioresearch problems
• Understanding bioresearch constraints
Information problems in the Nature journal
• Tim Berners-Lee, James Hendler. Scientific publishing on the
'semantic web'.
Nature Debates, April 2001.
• Jonathan Knight. Negative Results: Null and void.
Nature, April 2003.
• Who'd want to work in a team (Editorial).
Nature, July 2003.
• Declan Butler. Open-access row leads paper to shed authors.
Nature, September 2003.
Information management needs of Anopheles GPH
• Inform scientific community
(publications, database submissions, conferences…)
• Prevent loss of information
(unpublished results, method details, …)
• Report to administration
(progress, problems, …)
• Share and manage supplies
(materials, equipment, …)
• Share informational resources
(protocols, bibliography, …)
• Facilitate collaboration
(share information, co-author documents, …)
Sites (diagram labels): IP, Senegal; IP, Madagascar; IP, Korea; IP, France; …; Columbia University, USA (outside collaboration); …
Sources of Research Information: Status quo
Temporary, individual
information (100%) :
Notebooks
Computer Files
Permanent, shared
information (< 30%) :
Journals
Databases
Sources of Research Information: Ideal
Permanent, shared information (100%):
Integrated Repositories of Structured Data
1. Problems
2. Managing data context
3. Managing and analyzing data
Flow of research information: at present
[Diagram nodes: Administration, Advisor, Scientific community, Researcher, Research group, Collaborators]
Flow of research information: proposed
[Diagram nodes: Administration, Advisor, Researcher, Scientific community, Database, Research group, Collaborators]
2 types of information
Structured information (forms):
GenBank, Medline, Employee database, Invoice database, …
Unstructured information (documents):
Research notes, Contracts, Project reports, Clinical trials documentation, …
Methods of contributing written information
• Traditional documents
- hard to search and manipulate
• Traditional forms
- overly constraining, hard to create documents
• Structured documents (New!)
- best of both worlds
Problems with forms
Project: Measurements of response to …
The ability to resist Plasmodium falciparum malaria is an
important adaptive trait of human populations living in …
Experiment: Entomological Observations of …
The results of our comparative study show consistent
interethnic differences in P. falciparum infection …
Method: Observations
Malaria surveys were carried out in two rural
villages near the town of Ziniaré (35 km northeast
of Ouagadougou) in a shrubby savanna of the
Mossi plateau. An intense P. falciparum transmission is detected …
Different response to P. …
Summary 1
• Biological function is based on an infinity of interactions
between basic elements
• Biologists are drowning in the complexity of
information
• Need to understand biological problems and
constraints before applying analytical approaches
• Need to resolve the problem of information storage
and retrieval
Form constraints
1. Limited number of categories
2. Limited number of fields per category
3. Constrained field space
4. Limited editing (copy, move, delete, etc.)
5. No coherent document representation
6. Unable to represent complex hierarchical
information
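Constraint 6 can be illustrated with a minimal sketch of a structured document that nests the Project, Experiment, and Method levels from the form example. The element names (`project`, `experiment`, `method`) and the `title` attribute are hypothetical, chosen for illustration; they are not the schema of the iPad system.

```python
import xml.etree.ElementTree as ET


def build_record() -> ET.Element:
    """Build a nested record; tag and attribute names are illustrative assumptions."""
    project = ET.Element("project", title="Measurements of response to …")
    experiment = ET.SubElement(
        project, "experiment", title="Entomological Observations of …"
    )
    method = ET.SubElement(experiment, "method", title="Observations")
    # Free text lives inside the structure, so it is not squeezed into a fixed field.
    method.text = "Malaria surveys were carried out in two rural villages …"
    return project


def outline(element: ET.Element, depth: int = 0) -> list:
    """Return one indented line per element, walking the hierarchy depth-first."""
    lines = ["  " * depth + f"{element.tag}: {element.get('title')}"]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


record = build_record()
print("\n".join(outline(record)))
```

The hierarchy can be as deep as the research record requires, which a flat form with a fixed set of fields cannot express.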
"3-tier" architecture of the iPad system
iPad Editor
iPad Web Portal
iPad middle-layer server
Database
iPad Demo
Major Benefits
Monetary savings:
+ Less lost work
+ Resource optimization
Time savings:
+ Faster search
+ Faster communication and formatting
+ Less lost work
Increase in the quality and quantity of research:
+ Useful perspectives
+ Improved collaboration
+ Improved project management
+ More information given to the Institute community
+ More information given to the scientific community (in the future)
+ A tool to structure scientific data (in the near future)
Drawbacks
- Learning new software (very simple)
- Changing habits (will go away over time, gradual adoption)
Support for structured documents
1. WWW Consortium, industry analysts
2. General systems within the past year
(Microsoft, Arbortext, Altova, etc.)
3. Specific systems in the military
Evolution of information (Tim Berners-Lee)
First Consulting Group, "XML and Pharmaceutical
Industry" (2003) :
"In order to be profitable and competitive as they
serve our global healthcare needs, drug companies
require information systems to help them work
efficiently to deliver a high-quality product. With that in
mind, momentum is growing to leverage XML
technology in the content management and publishing
systems, being used by the pharmaceutical industry
throughout the drug development lifecycle."
* Interest from Aventis Pharma, Sopra Group, Genset
Gilbane Report, "XML for Content" (2003):
"So what's the biggest problem with XML content?
Authoring it… The authoring tools are becoming
more capable and people are starting to figure out
that the ease of processing XML content can
outweigh the pain of creating it, but there is still
some way to go."
1. Problems
2. Managing data context
3. Managing and analyzing data
Summary 2
• Data context is important both for information
management and for data interpretation
• iPad can be used to structure data context
using XML markup
• Structuring data context is the precursor for better
structuring of data.
3 Steps to "Paradise"
1. Agree on standard organizational categories
SB-UML
Gene Ontology
Bioprocess ontology
"Dynamic" ontologies
…
3 Steps to "Paradise"
1. Agree on standard organizational categories
- "Dynamic" ontologies, Gene Ontology, Bioprocess ontology, …,
SB-UML.
2. Sort information into the ontological categories
- Data mining algorithms, Electronic forms, Semantic markup.
<protein>p53</protein><interaction>activates</interaction><gene>CD95</gene>
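A minimal sketch of how the marked-up statement above could be read back by a program. The wrapping `statement` element is an assumption added here, because an XML parser requires a single root element and the fragment on the slide has three top-level elements.

```python
import xml.etree.ElementTree as ET

# The fragment exactly as shown on the slide.
MARKED_UP = (
    "<protein>p53</protein>"
    "<interaction>activates</interaction>"
    "<gene>CD95</gene>"
)


def extract_statement(fragment: str) -> list:
    """Return (category, text) pairs in document order."""
    # Hypothetical <statement> wrapper supplies the single root the parser needs.
    root = ET.fromstring(f"<statement>{fragment}</statement>")
    return [(child.tag, child.text or "") for child in root]


pairs = extract_statement(MARKED_UP)
print(pairs)
```

Each tag name plays the role of an ontological category, so sorting information into categories reduces to reading the tags.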
Dynamic ontology
Entity
Property
Relation
name
alternative names
BioStructure
Process
Data
Method
type
value
Molecule
MolecularComplex
Organelle
Organ
Tissue
Organism
Data markup
X protein activates Y gene in A. gambiae salivary glands.
Molecule (name: X, type: protein)
Relation (name: activates, type: molecular interaction)
Molecule (name: Y, type: gene)
Entity (name: A. gambiae, alt. name: Anopheles gambiae, type: organism)
Entity (name: salivary glands, type: organ)
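The annotations on this slide can be sketched as plain records. The `Annotation` class and its field names are assumptions made for illustration; only the categories, names, alternative names, and types are taken from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    category: str  # "Molecule", "Entity", or "Relation" on the slide
    name: str
    type: str
    alt_names: list = field(default_factory=list)


SENTENCE = "X protein activates Y gene in A. gambiae salivary glands."

ANNOTATIONS = [
    Annotation("Molecule", "X", "protein"),
    Annotation("Relation", "activates", "molecular interaction"),
    Annotation("Molecule", "Y", "gene"),
    Annotation("Entity", "A. gambiae", "organism", ["Anopheles gambiae"]),
    Annotation("Entity", "salivary glands", "organ"),
]


def find_by_type(annotations, wanted_type):
    """Names of all annotations carrying the given type."""
    return [a.name for a in annotations if a.type == wanted_type]


def lookup(annotations, query):
    """Annotations whose name or alternative name equals the query."""
    return [a for a in annotations if query in [a.name, *a.alt_names]]


print(find_by_type(ANNOTATIONS, "organism"))
print(lookup(ANNOTATIONS, "Anopheles gambiae")[0].name)
```

Because alternative names are stored with the entity, a search for the full species name still finds text that uses the abbreviation.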
3 Steps to "Paradise"
1. Agree on standard organizational categories
- "Dynamic" ontologies, Gene Ontology, Bioprocess ontology, …,
SB-UML.
2. Sort information into the ontological categories
- Data mining algorithms, Electronic forms, Semantic markup.
3. Develop search, visualization, and analysis tools
- Blast, Bioprocess and molecular modeling, Concept network, …
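As a rough sketch of what a concept network might look like once statements are marked up, the snippet below indexes (subject, relation, object) triples in both directions. The triple layout and function names are assumptions; the two statements are the ones used as examples earlier in the talk.

```python
from collections import defaultdict

# Statements taken from the markup examples in this talk.
TRIPLES = [
    ("p53", "activates", "CD95"),
    ("X", "activates", "Y"),
]


def build_network(triples):
    """Index triples by subject (outgoing) and by object (incoming)."""
    outgoing = defaultdict(list)
    incoming = defaultdict(list)
    for subject, relation, obj in triples:
        outgoing[subject].append((relation, obj))
        incoming[obj].append((relation, subject))
    return outgoing, incoming


outgoing, incoming = build_network(TRIPLES)
print(outgoing["p53"])   # what p53 acts on
print(incoming["CD95"])  # what acts on CD95
```

Each key is a concept node, and the lists attached to it are the edges a search or visualization tool would follow.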
Concept node
- Better global picture to see where to go
- Helpful info along the way
- Organized research process
- Better ways to share data
- Broader impact of results
- Modeling and simulation tools
Summary 1
• Biological function is based on an infinity of interactions
between basic elements
• Biologists are drowning in the complexity of
information
• Need to understand biological problems and
constraints before applying analytical approaches
• Need to resolve the problem of information storage
and retrieval
Summary 2
• Data context is important both for information
management and for data interpretation
• iPad can be used to structure data context
using XML markup
• Structuring data context is the precursor for better
structuring of data.
Summary 3
• 2 steps for structuring data: ontology + methods for
data entry
• Simple "dynamic" ontologies can be used to derive
standard "static" ontologies
• iPad-like system can be used to simplify structuring
biological data
• Data analysis, modeling, and simulation tools need
to be data-driven, generic, and easy to use.