The EXPO Ontology:
Describing Scientific Experiments
Ross D. King
Department of Computer Science
University of Wales, Aberystwyth
What is e-Science?

E-science is computationally intensive science. It is
also the type of science that is carried out in highly
distributed network environments, or science that
uses immense data sets that require grid computing.
Examples of this include social simulations, particle
physics, earth sciences and bio-informatics. ... (Wikipedia)
I DISAGREE

“eScience is about global collaboration in key areas
of science and the next generation of infrastructure
that will enable it.” John Taylor (UK e-Science)
I AGREE
Standard e-Science Vision

Research is done in Lab X.
– All results, metadata, programs, etc. are stored
electronically in an internationally agreed standard
format, and openly published.
– An open access paper is published with links to
the results, metadata, programs, etc.

Research lab Y wishes to replicate/build-on the
published work of Lab X.
– This is easy because all the results, metadata,
programs, etc. are publicly available.
Standard e-Science Projects

Most e-Science is based around building software
infrastructure.
– Software: “web services”
– Databases: digital archiving, standards, etc.
– GRID computing: Globus, Condor
– Communication: Access Grid
– Open Access publishing: UK, NIH, etc.
– Services: Text mining, Bioinformatic, etc.
My View of e-Science
I am interested in the formalisation and automation of
scientific research.

I and my colleagues have two related projects in this
area:
– EXPO, an ontology of scientific experiments.
– The Robot Scientist Project.
Formalization of Science

The goal of science is to increase our knowledge of
the natural world through the performance of
experiments.

This knowledge should, ideally, be expressed in a
formal logical language.

Formal languages promote semantic clarity, which in
turn supports the free exchange of scientific
knowledge and simplifies scientific reasoning.
Motivation




The formal description of experiments for efficient
analysis, annotation, and sharing of results is a
fundamental objective of science.
Ontologies are required to achieve this goal.
A few subject-specific ontologies of experiments
currently exist. However, despite the unity of
science, there is no general ontology of scientific
experiments.
We propose the ontology EXPO to meet this need.
Ontologies
An ontology is “a concise and unambiguous
description of what principal entities are relevant to
an application domain and the relationship between
them”*.
*Schulze-Kremer, S., 2001, Computer and Information Sci. 6(21)
The Unity of Science




We aim to formalise generic knowledge about
scientific experimental design, methodology, and
results representation.
Such a common ontology is both feasible and
desirable because all the sciences follow the same
experimental principles.
Despite their different subject matters, they all
organise, execute, and analyse experiments in
similar ways.
They use related instruments and materials; they
describe experimental results in identical formats,
dimensional units, etc.
Ontologies for Experiments


The formal description of experiments for efficient
analysis, annotation, and sharing of results is a
fundamental objective of science.
Ontologies are required to achieve this goal.

A few subject-specific ontologies of experiments
currently exist. The most notable of these is the
MGED Ontology (MO). It was designed to provide
descriptors required by MIAME (Minimum Information
About a Microarray Experiment).

We have developed the ontology EXPO to meet the
need for a general ontology of scientific experiments.
Soldatova & King (2005) Nature Biotechnology
Soldatova & King (2006) Royal Society Interface
Advantages of Ontologies

The utilisation of a common standard ontology for the
annotation of scientific experiments would:
– Make scientific knowledge more explicit.
– Help detect errors.
– Enable the sharing and reuse of common
knowledge.
– Remove redundancies in domain-specific
ontologies.
– Promote the interchange and reliability of
experimental methods and conclusions.
Our Approach to Ontology Building

Explicitly list the principles of an ontology's
design, its constraints, along with definitions and
axioms.

Provide compliance with a standard upper
ontology (SUO) developed by IEEE P1600.1.

Keep domain-dependent and domain-independent
knowledge separate, as well as declarative and
procedural knowledge.

Build ontologies so that they are purpose-independent and therefore future-proof.
The Position of EXPO
[Figure: layered diagram of ontologies at three levels: Upper level, Generic level, Domain level. Labels in the diagram: SUMO; EXPO; Bibliographic Data Ontology; BiblioReference; Mes.Unit; SubjectOfExp.; ObjectOfExp.; Domain Model; Plant ontology; Measurement ontology; PSI; MO; FuGO; MSI; ChEBI.]
Small Section of EXPO
Generic ontology of experiments
e-Science
• Controlled vocabulary of scientific experiments;
• Formalized electronic representation of scientific experiments;
• Unified standards for representation, annotation, storage, and access to experimental results;
• Reasoning over experimental data and conclusions.
Ontology of science (formalization of scientific methods, technologies, infrastructure of science)
EXPO: Ontology of scientific experiments
concepts: 218
language: OWL
Scientific Experiment:
– Experimental results
– Experimental goal
– Experimental action
– Experimental design
– Classification of experiments
– Experimental object
– Admin info about experiment
EXPO description
EXPO v.1
Concepts: ~200
Language: OWL
Tool: Hozo Ontology Editor
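As a rough illustration of how the top-level parts of a scientific experiment named on the previous slide might be held in a program, here is a minimal Python sketch. The class and field names are assumptions that simply mirror the slide labels; they are not the OWL identifiers used in EXPO, and the example record is a placeholder rather than a real annotation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AdminInfo:
    # Administrative information about the experiment (title, authors, ...).
    title: str
    authors: List[str] = field(default_factory=list)


@dataclass
class ScientificExperiment:
    # Field names follow the slide labels; they are illustrative only.
    admin_info: AdminInfo
    classification: List[str] = field(default_factory=list)
    goal: str = ""
    design: str = ""
    objects: List[str] = field(default_factory=list)
    actions: List[str] = field(default_factory=list)
    results: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Return the names of top-level parts that are still empty."""
        parts = {
            "classification": self.classification,
            "goal": self.goal,
            "design": self.design,
            "objects": self.objects,
            "actions": self.actions,
            "results": self.results,
        }
        return [name for name, value in parts.items() if not value]


# Placeholder record, not taken from any real paper.
example = ScientificExperiment(
    admin_info=AdminInfo(title="Example experiment", authors=["A. Author"]),
    goal="Placeholder goal",
    actions=["placeholder action"],
)
print(example.missing_fields())
```

A check such as `missing_fields` hints at one practical benefit of a fixed schema: an incomplete annotation can be detected mechanically.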
Scientific Publication 1

The traditional way of presenting scientific knowledge
in scientific papers has many limitations.

The most important and obvious of these is the use
of natural language to describe knowledge - albeit
augmented by various formalisms and mathematics.

This is problematic because natural language is
notorious for its imprecision and ambiguity.
Scientific Publication 2

Use of Natural Language is a great hindrance when
using computers to store and analyse data – hence
the growing importance of text-mining.

We argue that the content of scientific papers should
increasingly be expressed in formal languages.

Is writing a scientific paper closer to writing poetry or
a computer program?
Applications of EXPO

Phylogenetics

Particle Physics

Structural Biology

Drug Screening and Design

Physical Chemistry

Robot Scientist
Solenodons
Solenodons are endangered insectivores
from Hispaniola and Cuba.
Phylogenetic Example

Random paper selected from Nature: Roca, A.L.,
Bar-Gal, G.K., Eizirik, E., Helgen, M.K., Maria, R.
Mesozoic origin for West Indian insectivores. Nature,
429, 649-651 (2004).

Paper investigates the phylogenetic status of the
mammalian species Solenodon cubanus and
Solenodon paradoxus. i.e. the evolutionary
relationship of these animals with all others.

Conclusion - Solenodons diverged in the Cretaceous.
Solenodon Annotation
EXPO: A scientific experiment is a research method which permits the investigation of cause-effect relations between known and unknown (target) variables of the field of study (domain). An experimental result cannot be known with certainty in advance.
EXPO: A classification of experiments is a hierarchical system of categories – types of experiments – according to their domains or used models of experiments.
EXPO: A null hypothesis is an experimental hypothesis that states that a known controlled variable or variables does not have a specified effect on the unknown (target) variable or variables of the domain.
Scientific Experiment: Hypothesis-forming, Hypothesis-driven
Admin info about experiment:
Title: Mesozoic Origin of West Indian Insectivores
Author: Roca, A.L., Bar-Gal, G.K., Eizirik, E., Helgen, M.K.,………………
Organisation: 1. National Cancer Institute, Frederick, USA ………………
Status: public academic
Reference: Roca, A.L., Bar-Gal, G.K., Eizirik, E., Helgen, M.K., Maria, R. et al. Mesozoic origin for West Indian insectivores. Nature, 429, 649-651 (2004).
Classification of experiment:
Taxonomy
DDC(Dewey): 575 Evolution and Genetics
Library of Congress: QH 367.5 molecular phylogenetics
Zoology
DDC(Dewey): 599: mammalogy
Library of Congress: QL351-QL352 Zoology-Classification
Experimental goal:
To discover the phylogeny of the species: Solenodon paradoxus
and Solenodon cubanus
Null hypothesis H01: explicit
Representation style: text
Linguistic expression: natural language
“Some have suggested a close relationship to soricids (shrews) but not to talpids”
Linguistic expression: artificial language: predicate calculus
…………………………
experimental action 1.1.1 extraction and purification of DNA
object: sample
parent group: DNA from Solenodon paradoxus
sampling: random sampling
instrument: Qiagen DNA cleanup kit
…………………………
experimental action 1.1.2 DNA amplification
Experimental Conclusions (Formed Hypotheses)
C1) Hypothesis
Representation style: text
Linguistic expression: natural language
There existed a mammal that is the ancestor of: Solenodons, Soricoidea, Talpoidea, Erinaceidea, and which is not the ancestor of any other mammal.
Linguistic expression: artificial language: predicate calculus
…………………………

Prolog:
    instantiation(solenodon, So),
    instantiation(soricoidea, Sh),
    instantiation(talpoidea, T),
    instantiation(mammalia, An),
    shared_ancestor([So, Sh], [T], An).

    % shared_ancestor(Shared, Not_shared, An).
    % \+ is negation as failure (written "not" on the slide).
    shared_ancestor([X], [Y], An) :-
        ancestor(An, X),
        \+ ancestor(An, Y).
    shared_ancestor([X|Lx], Ly, An) :-
        shared_ancestor(Lx, Ly, An),
        ancestor(An, X).
    shared_ancestor(Lx, [Y|Ly], An) :-
        shared_ancestor(Lx, Ly, An),
        \+ ancestor(An, Y).

XML:
    <rdfs:Class rdf:ID="classification of experiments">
      <rdfs:label>classification of experiments</rdfs:label>
      <rdfs:subClassOf rdf:resource="#classification" />
      <rdfs:comment>
        Def: A classification of experiments is a hierarchical system of categories - types of experiments according to their domains or used models of experiments.
        Axiom:
      </rdfs:comment>
    </rdfs:Class>
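To make the intent of the shared_ancestor predicate on this slide easier to follow, the sketch below mirrors it roughly in Python: an ancestor must cover every taxon in the first list and none in the second. The tree is a toy assumption made up for illustration. The nodes "clade_x" and "other_mammals" are hypothetical placeholders, and the parent links are not a phylogeny taken from the paper.

```python
from typing import Dict, List, Optional

# Hypothetical toy tree: child -> parent. "clade_x" and "other_mammals" are
# invented placeholder nodes, not taxa or relationships from the paper.
PARENT: Dict[str, Optional[str]] = {
    "mammalia": None,
    "clade_x": "mammalia",
    "other_mammals": "mammalia",
    "solenodon": "clade_x",
    "soricoidea": "clade_x",
    "talpoidea": "clade_x",
    "erinaceidea": "clade_x",
}


def ancestor(an: str, x: str) -> bool:
    """True if `an` is a strict ancestor of `x` in the toy tree."""
    node = PARENT.get(x)
    while node is not None:
        if node == an:
            return True
        node = PARENT.get(node)
    return False


def shared_ancestor(shared: List[str], not_shared: List[str], an: str) -> bool:
    """`an` is an ancestor of every taxon in `shared` and of none in `not_shared`."""
    return all(ancestor(an, x) for x in shared) and not any(
        ancestor(an, y) for y in not_shared
    )


# Pattern of null hypothesis H01: solenodons and shrews share an ancestor
# that talpids do not have.
h01 = shared_ancestor(["solenodon", "soricoidea"], ["talpoidea"], "clade_x")

# Pattern of conclusion C1: one ancestor for all four groups and for no
# other mammal.
c1 = shared_ancestor(
    ["solenodon", "soricoidea", "talpoidea", "erinaceidea"],
    ["other_mammals"],
    "clade_x",
)
print(h01, c1)
```

In this toy tree the H01 pattern fails and the C1 pattern holds. That outcome comes entirely from the invented parent links, so it illustrates the predicate and is not evidence about solenodons.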
Problems Highlighted
by Annotation 1

The use of EXPO makes explicit the different
hypotheses described in the paper.

The statements we have annotated as <research conclusion>
are not presented as hypotheses in the text.

This contrasts with what we identify as the seven
null-hypotheses, which are mentioned explicitly in the
main text. – sub-optimal statistically.
Problems Highlighted
by Annotation 2

Another aspect of the research that the use of EXPO
would have highlighted is that the DNA sequences
produced during the experiment were stored in the
EMBL database using the taxonomic term
“Insectivora”.

This taxon is now generally recognised to be
polyphyletic, and its use contradicts the actual
conclusions of the paper.
Problems Highlighted
by Annotation 3

We formalised the knowledge behind the authors’
argument that “Cuban Solenodons should be
classified in a distinct genus, Atopogale”.

Our analysis indicates that it would be more internally
consistent for the authors to have classified Cuban
Solenodons as a distinct family.

etc…..
High-energy/particle physics
Another random paper selected from the same Nature
issue: D0 Collaboration. A precision measurement of
the mass of the top quark. Nature, 429, 639-642
(2004). (~350 scientists)
Experimental Equipment!
EXPO D0 Example 1
<scientific experiment>:
<computational experiment>: <simulation>
<admin info about experiment>:
<title>:
A precision measurement of the mass of the top quark
<classification by domain>:
<domain of experiment>:
High Energy Physics / Particle Physics
<DDC(Dewey) classification>:
539.7 Atomic and nuclear physics
<Library of Congress classification>: QC 770-798 Atomic, Nuclear, Particle Physics
<related domain>:
Computational Statistics
<DDC(Dewey) classification>:
519 Probabilities and Applied Mathematics
<Library of Congress classification>: QA 273-274 Probabilities
<research hypothesis>: <representation style>: <text>
<linguistic expression>:<natural language>:
Given the same observed data: use of the new statistical method M1 will
produce a more accurate estimate of Mtop than the original method M0.
<linguistic expression>:<artificial language>:
M0(∀ D0 observations ∧ ∀ relevant background knowledge) ↦ E0
M1(∀ D0 observations ∧ ∀ relevant background knowledge) ↦ E1
estimation_error(E0, Mtop) ↦ Error0
estimation_error(E1, Mtop) ↦ Error1
Error0 > Error1
<alternative hypothesis>:
<subject effect>:
<experimenter bias>:
<linguistic expression>:
<artificial language>
P(The Standard Model) is high
The Standard Model → Higgs boson
(Mtop = 173.3) → Higgs boson mass best fit mass estimate is experimentally
excluded.
∴ Mtop > 173.3
Problems Highlighted
by Annotation 1

Poor science, even though published in Nature!

This annotation makes it explicit that the experiment
was somewhat unusual in not generating any new
observational data. Instead, it presents the results of
applying a new statistical analysis method to existing
data (a set of putative top quark pair decay events
involving e+jets and μ+jets).
Problems Highlighted
by Annotation 2




No explicit hypothesis.
We argue that the paper’s implicit experimental
hypothesis was: given the same observed data, use of
the new statistical method will produce a more
accurate estimate of Mtop than the original method.
This is based on the authors’ statement “here we
report a technique that extracts more information from
each top-quark event and yields a greatly improved
precision when compared to previous measurements”.
We prefer the term “accuracy” to “precision”.
Problems Highlighted
by Annotation 3

The Carnap principle: All relevant knowledge should
be used to decide a scientific question:
– 91 candidate events were used to calculate the
old value, but only 22 of these were used for the
new value!
– The old method estimate of Mtop is: 173.3 ± 5.6
(stat) ± 5.5 (sys) GeV/c²
– The new method estimate of Mtop is: 180.1 ± 3.6
(stat) ± 3.9 (sys) GeV/c².
– The current (June 2005) best estimate for Mtop is
174.3 ± 3.4 GeV/c²
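The three estimates above can be compared with a few lines of arithmetic. The sketch below assumes that the statistical and systematic errors are independent and can be combined in quadrature; this is a common convention, but the slide does not say how the errors should be combined. The function and variable names are illustrative.

```python
import math


def total_uncertainty(stat: float, syst: float) -> float:
    """Combine statistical and systematic errors in quadrature (an assumption)."""
    return math.sqrt(stat**2 + syst**2)


# Values quoted on the slide, in GeV/c^2.
old = {"value": 173.3, "stat": 5.6, "syst": 5.5}
new = {"value": 180.1, "stat": 3.6, "syst": 3.9}
reference = 174.3  # June 2005 best estimate quoted on the slide

for name, est in (("old", old), ("new", new)):
    total = total_uncertainty(est["stat"], est["syst"])
    offset = est["value"] - reference
    print(f"{name}: total = {total:.1f}, offset from reference = {offset:+.1f}")
# old: total = 7.8, offset from reference = -1.0
# new: total = 5.3, offset from reference = +5.8
```

Under this assumption the new estimate carries the smaller quoted uncertainty, while the old central value lies closer to the June 2005 figure.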
Problems Highlighted
by Annotation 4




The paper concluded that Mtop is higher than previously
estimated, which deductively implies a higher mass for the
Higgs Boson. As the Higgs Boson has not yet been
observed, even at energies above its previously predicted
maximum likelihood mass, the newly inferred higher Mtop
lent support to the existence of the Higgs Boson.
However, it would have been possible to argue validly the
other way: that the Higgs Boson is thought highly likely to
exist, therefore its non-observation makes more probable
a higher value of Mtop.
This argument was not explicit in the paper, but may have
existed implicitly as a motivation.
The paper would have benefited from making this
argument explicit, even if not used.
An Ontology for
Drug Screening & Design




Funded BBSRC project started in April 2007.
Extend EXPO to formalise meta-data for drug
screening and design.
We are developing our own Drug screening and Drug
design “Robot Scientist” - Eve.
Collaborating with industry. Working with Pfizer to
develop an ontology and an experiment annotation system.
Especially important for merging data sets after
corporate mergers.
Structural Biology



Structural Biology was once a leader in the
development of standards for the preservation and
sharing of data.
This lead has been lost.
– The main data standard, mmCIF, does not meet
state-of-the-art standards in biology for ontologies.
– The main database, PDB, is not “relational”
although it is meant to be.
We have proposed a way forward using EXPO.
Nature Biotechnology (2007) 25, 437-442
ART

An Ontology Based Tool for the Translation of Papers
into Semantic Web Format.

Focussed on physical chemistry – very structured
publications.

Funded by JISC, in collaboration with the Royal
Society of Chemistry, and UKOLN.
ART 2




Tool to add value to papers and data stored in a
repository.
The tool will lead authors through a process where:
experimental goals, hypotheses, methodologies,
results, etc. are described and linked to the text and
external data.
The result will be an article in OWL format that can be
archived with the original text version.
The OWL version will be more formalised and useful
for computer processing, e.g. text mining.
SIG/ISMB07 Ontology Workshop / BMC Bioinformatics
[Figure: ART processing pipeline, with the user interacting through Request / Confirm / Explain.]
Input: free text article
Convert: to SciXML article
Markup: paper metadata (title, author,…) using DC, PRISM, … (domain independent)
Markup: domain concepts (molecule, bond,…) using ChEBI, FIX, REX, … (domain dependent; named entity recognition)
Recognition of generic scientific concepts (goal, hypothesis,…)
Markup: generic scientific concepts using EXPO, OBI, ECO, … (domain independent)
Generate: Summary, RSS feed
Output: xml/ owl article
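The staged design of the ART pipeline (convert, then successive markup passes) can be sketched as a chain of functions over one article record. Everything in this sketch is a stand-in: the two tiny vocabularies are hypothetical substitutes for real ontologies, the keyword matching is far simpler than named entity recognition, and the function names are not taken from the ART tool.

```python
from typing import Any, Callable, Dict, List, Set

Article = Dict[str, Any]

# Tiny hypothetical vocabularies standing in for real ontologies.
DOMAIN_TERMS: Set[str] = {"molecule", "bond"}
GENERIC_TERMS: Set[str] = {"goal", "hypothesis"}


def convert(text: str) -> Article:
    """Stand-in for conversion of free text to a structured article."""
    return {"text": text, "markup": {}}


def mark_terms(label: str, vocabulary: Set[str]) -> Callable[[Article], Article]:
    """Build a stage that records which vocabulary terms occur in the text."""

    def stage(article: Article) -> Article:
        words = {w.strip(".,;:").lower() for w in article["text"].split()}
        article["markup"][label] = sorted(words & vocabulary)
        return article

    return stage


# Stages run in order; each one adds its own layer of markup.
PIPELINE: List[Callable[[Article], Article]] = [
    mark_terms("domain_concepts", DOMAIN_TERMS),
    mark_terms("generic_concepts", GENERIC_TERMS),
]


def run(text: str) -> Article:
    article = convert(text)
    for stage in PIPELINE:
        article = stage(article)
    return article


result = run("Our hypothesis is that the bond in this molecule is weak.")
print(result["markup"])
```

Keeping each markup pass as a separate stage means a domain-dependent vocabulary can be swapped without touching the domain-independent passes.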
The Concept of a Robot Scientist
We have developed the first computer system that is
capable of originating its own experiments, physically doing
them, interpreting the results, and then repeating the cycle*.
[Figure: the Robot Scientist cycle. Labels in the diagram: Background Knowledge, Analysis, Hypothesis Formation, Consistent Hypotheses, Experiment selection, Experiment, Robot, Results, Interpretation, Final Theory.]
*King et al. (2004) Nature, 427, 247-252.
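The cycle of hypothesis formation, experiment selection, experiment, and interpretation can be illustrated with a toy loop. This is only a sketch under strong assumptions and not the system described in King et al. (2004): hypotheses are plain labels, each experiment is a yes/no test of whether the true answer lies in a given set, and the selection rule (pick the most even split) is a simple heuristic chosen for illustration. All names and data are hypothetical.

```python
from typing import Callable, List, Set, Tuple


def select_experiment(consistent: List[str], experiments: List[Set[str]]) -> Set[str]:
    """Pick the experiment that splits the consistent hypotheses most evenly."""

    def imbalance(exp: Set[str]) -> int:
        inside = sum(1 for h in consistent if h in exp)
        return abs(2 * inside - len(consistent))

    return min(experiments, key=imbalance)


def run_cycle(
    hypotheses: List[str],
    experiments: List[Set[str]],
    run_experiment: Callable[[Set[str]], bool],
) -> Tuple[List[str], int]:
    """Repeat select -> run -> interpret until one hypothesis is left."""
    consistent = list(hypotheses)
    performed = 0
    while len(consistent) > 1:
        exp = select_experiment(consistent, experiments)
        outcome = run_experiment(exp)  # the "robot" step
        performed += 1
        # Interpretation: keep only hypotheses that predicted the outcome.
        remaining = [h for h in consistent if (h in exp) == outcome]
        if len(remaining) == len(consistent):
            break  # no available experiment discriminates any further
        consistent = remaining
    return consistent, performed


# Hypothetical toy problem: four candidate explanations, one hidden truth.
candidates = ["g1", "g2", "g3", "g4"]
tests = [{"g1", "g2"}, {"g1", "g3"}, {"g1"}]
truth = "g3"

final, n_experiments = run_cycle(candidates, tests, lambda exp: truth in exp)
print(final, n_experiments)
```

Because each selected experiment splits the remaining hypotheses, the loop either narrows the set or stops, so it always terminates.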
Motivation 1: Philosophical

What is Science?

The question whether it is possible to automate the
scientific discovery process seems to me central to
understanding science.

There is a strong philosophical position which holds
that we do not fully understand a phenomenon unless
we can make a machine which reproduces it.
Motivation 2: Technological

In many areas of science our ability to generate data
is outstripping our ability to analyse the data.

One scientific area where this is true is functional
genomics, where data is now being generated on an
industrial scale.

The analysis of scientific data needs to become as
industrialised as its generation.
The Application Domain

Systems Biology

Yeast (S. cerevisiae) – best understood eukaryotic
organism.

Strain libraries, e.g. EUROFAN 2 has knocked out
each of the 6,000 genes.

Task to learn models of yeast metabolism using
selected mutant strains and quantitative growth
experiments.
Movie
Some Example Growth Curves
Soldatova et al., CS Dept., Aberystwyth, UK
The need for a Robot Scientist
ontology (EXPO-RS)

The robot requires a detailed and formalized
description of its domains, background knowledge,
experimental methods, technologies, and rules for
hypothesis formation and experiment design.

Integrity of data and metadata.

Open access of the RS experimental data and
metadata to the scientific community.
Soldatova, Sparkes, Clare, & King (2006) Bioinformatics
EXPO-RS

Formalization of the entities involved in Robot Scientist
experiments.

A controlled vocabulary for all the participants of the
project.

Identification of metadata essential for the experiment's
description and repeatability.

Coordination of the planning of experiments, their
execution, access to the results, technical support of the
robot, etc.

Modelling a database for storing experiment data
and tracking experiment execution.
Conclusions

The unity of science implies that an accepted general
ontology of experiments is both possible and desirable.

Such an ontology would promote the sharing of results
within and between subjects, reducing both the
duplication and loss of knowledge.

It is also an essential step in formalising science, and
fully exploiting computer reasoning in science.
We propose EXPO as a general ontology for scientific
experiments.
We have demonstrated the utility of EXPO on
applications in phylogenetics, high-energy physics,
chemistry, and high-throughput Systems Biology.


Acknowledgements

Larisa Soldatova, Aberystwyth
Amanda Schierz, Aberystwyth

Ken Whelan, Aberystwyth
Amanda Clare, Aberystwyth
Mike Young, Aberystwyth
Jem Rowland, Aberystwyth
Andrew Sparkes, Aberystwyth
Wayne Aubrey, Aberystwyth
Emma Byrne, Aberystwyth
Larisa Soldatova, Aberystwyth
Magda Markham, Aberystwyth

Steve Oliver, Manchester
Riichiro Mizoguchi, Osaka

BBSRC, JISC