www.pwc.com/technologyforecast
Technology Forecast: Remapping the database landscape
Issue 1, 2015
The promise of graph
databases in public health
By Alan Morrison
One of the main advantages of a NoSQL graph store
is web-scale discovery.
When a multinational biotech firm needed an
advisory board member, it wanted a certain
top US physician in its field—someone who
was already a busy advisor to the firm’s parent
company. “Placing him on the board in Latin
America was a lower priority for the global
group,” says Muriel Siadak, a medical affairs
director who conducted the global search for
the right person. “Obviously, it was a high
priority for the Latin American affiliate.”
Siadak turned to Zephyr Health, which
aggregates thousands of life science data
sources about people, places, pills, and other
things, and provides a software-as-a-service
(SaaS) platform with search tools. Zephyr
Health converts its sources into document
database format, and then layers a graph store
(or database) on top to easily traverse and
search the underlying data. Siadak launched
a search to find just the right fit for the Latin
American affiliate. The richness of the diverse
detail she could explore with relative ease was
a key to quickly finding the perfect needle in
the haystack—two, actually.
“I found two fairly frequent users of the
company’s products, [pharmaceutical]
investigators with a lot of experience who had
done their medical training in Latin America,
so they obviously spoke Spanish,” she says.
“They would be able and would have interest,
because they grew up there, to be a part of the
Latin American effort. The affiliate was very
pleased to have someone [Siadak] on a global
level come back and say, ‘Look, it’s not just the
top name that you need.’”
If you could continually integrate thousands of
external enterprise data sources, add internal
ones on a custom basis by request, and tailor
the whole so it’s appropriately accessible to
a range of business users through a single
application platform, what could you do that
you haven’t been able to? What Zephyr Health
enables in the life sciences illustrates one of
the myriad possibilities.
The power of this level of data aggregation is
just now becoming apparent. In the biotech
industry, specific skills and knowledge are
always at a premium. Companies in public
health are starting to use graph-facilitated
SaaS to solve business problems. Some
of the world’s most knowledge-intensive
organizations, including multinational
banks, media companies, space agencies,
and logistics companies, are also using graph
databases, and intelligence agencies have
been using them for a decade. Others
will follow.
The graph store is one of many innovations
creating a sea change in database technology.
This issue of the Technology Forecast explores
the promise and upheaval caused by these
new technologies. This article provides a
deeper look at graph technology and how it is
similar to other NoSQL1 data stores, which are
explored in an earlier article.
What is a graph store?
At Zephyr Health, the pivotal technology
is a Neo4j graph store used to find and
traverse relationships between entities in
data originally ingested using MongoDB, a
document database. Document databases
are useful for unstructured data, but Zephyr
Health had troubles with indexing and latency
at web scale, which is why it added the graph
database.2
The standard corporate data storehouse,
the relational database management system
(RDBMS), cannot begin to provide the speedy,
flexible search support of a graph. An RDBMS
needs absolute consistency among its rows
and columns. The difference between a “join”
in an RDBMS and in a graph store is like
the difference between a precision dovetail
joint in woodwork and a freeform Tinkertoy
construct. Graphs only need to join or connect
at a single point to have useful meaning in
searches.
1 Structured query language, or SQL, is the dominant query language associated with relational databases. NoSQL stands for not only
structured query language. In practice, the term NoSQL is used loosely to refer to non-relational databases designed for distributed
environments, rather than the associated query languages. PwC uses the term NoSQL, despite its inadequacies, to refer to non-relational
distributed stores because it has become the default term of art. See the section “Database evolution becomes a revolution” in the article
“Enterprises hedge their bets with NoSQL databases,” PwC Technology Forecast 2015, Issue 1, http://www.pwc.com/us/en/technologyforecast/2015/remapping-database-landscape/features/enterprises-nosql-databases.jhtml, for more information on relational versus
non-relational database technology.
2 Mahesh Chaudhari, “Avoiding Deadlocks: Lessons Learned with Zephyr Health Using Neo4j and MongoDB,” GraphConnect 2013
presentation, http://watch.neo4j.org/video/76891392, accessed March 24, 2015.
Effectively using graph technology poses its
own challenges: the technology is relatively
new and only recently capable of web-scale
integration tasks. To take advantage of the
technology, Zephyr Health needed to resolve
some scalability and latency issues associated
with web-scale graph environments.
A NoSQL graph store contains malleable
maps of entities (named people, places, and
things) and how they’re related. Entities
become the nodes and relationships become
the connections (developers call them edges)
in these maps, which can take any shape. If
you’re modeling your extended organization,
for example, the relationships that appear in
a native graph store can add to or become
your model.
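To make the node-and-edge idea concrete, here is a minimal sketch in Python of an in-memory graph with labeled connections. It is an illustration only, not how Neo4j or any other product stores data; the class name, the relationship labels, and the sample entities ("Dr. A", "Drug X") are hypothetical.

```python
from collections import defaultdict


class Graph:
    """A tiny in-memory graph: named nodes joined by labeled, directed edges."""

    def __init__(self):
        self.nodes = set()
        # node name -> list of (relationship label, target node name)
        self.edges = defaultdict(list)

    def add_edge(self, source, label, target):
        """Connect two entities; nodes are created on first mention."""
        self.nodes.update((source, target))
        self.edges[source].append((label, target))

    def neighbors(self, source, label=None):
        """Targets reachable from source, optionally filtered by label."""
        return [t for (l, t) in self.edges[source] if label is None or l == label]


# Hypothetical sample data, loosely modeled on an expertise search.
g = Graph()
g.add_edge("Dr. A", "trained_in", "Argentina")
g.add_edge("Dr. A", "investigates", "Drug X")
g.add_edge("Argentina", "part_of", "Latin America")

print(g.neighbors("Dr. A"))
print(g.neighbors("Dr. A", label="trained_in"))
```

Because nodes are created the first time an edge mentions them, the map can take any shape without an up-front schema, which is the malleability described above.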
What’s different in a graph store from a
database perspective is the sheer volume of
connections, or relationships—how people,
places, and things relate to one another
through those interactions.
If your data is rich, you’ll see lots of
relationships between the entities in native
graph form. Older database technologies
place less emphasis on relationships, resulting
in less context. Graphs offer the chance for
richer context through more connections
and any-to-any data models rather than the
usual tabular or hierarchical models. Graphs
can store converted tabular and hierarchical
document object information, too, but the
graph’s uniqueness and power is in its ability
to map any-to-any relationships. Relationship
richness of this kind boosts the integration
potential and the contextual relevance of the
data being represented.
The verb in the sentence diagrammed in
the graphic below is where the richness lies.
Entities can also have attributes, such as keys
to identify them, but verb-style relationships
in this sense describe the entities in a richer
way than simple identifiers or other labels
do. And those relationships can be mined in
ways most conventional analytics techniques
haven’t explored yet, because they’re not
optimized for graphs.
The verbs carry much of the power. The more
that relationships can be mapped in the online
world, the more high-level understanding can
be tapped about those entities, the various
clusters of those entities, and their nature. For
example, social networks mapped in a graph
can capture who’s married to whom, who
works for whom, who went to school with
whom, and so on.
Graph databases place more focus on verbs
In the classic entity-relationship model that graphs articulate, relationships
connect two or more entities, as in the following example, which shows how
entities in an expertise location scenario are related to one another.
[Figure: (Santiago Ramón y Cajal) --attended--> (University of Zaragoza)]
Each added relationship enriches the
profile of the persons, or entities, that are
connected. Graphs can describe not only social
relationships, but any relationship, whether
it’s an investigator’s use of an experimental
drug to treat a malignancy, the effect of that
treatment on a given patient, or the cost of a
single dosage of a given amount of the drug.
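A short Python sketch can show how verb-style relationships are stored and queried as subject-verb-object statements. Only the first statement comes from the example above; the other entities ("Investigator A", "Drug X", "Patient 1") and the verb names are hypothetical placeholders.

```python
# Each statement is a (subject, verb, object) tuple.
triples = [
    ("Santiago Ramón y Cajal", "attended", "University of Zaragoza"),
    # Hypothetical placeholders for the drug-treatment relationships:
    ("Investigator A", "used", "Drug X"),
    ("Drug X", "treats", "Malignancy Y"),
    ("Investigator A", "treated", "Patient 1"),
]


def match(statements, subject=None, verb=None, obj=None):
    """Return the statements that fit a pattern; None acts as a wildcard."""
    return [
        (s, v, o)
        for (s, v, o) in statements
        if (subject is None or s == subject)
        and (verb is None or v == verb)
        and (obj is None or o == obj)
    ]


# Everything known about one entity: its profile grows with each added verb.
print(match(triples, subject="Investigator A"))
# Who attended a given institution?
print(match(triples, verb="attended", obj="University of Zaragoza"))
```

Adding one more tuple enriches the profile of both entities it connects, without changing any existing structure.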
In the online environment, anything
can be described in terms of entities and
relationships, but design considerations
factor in heavily because graphs can be
computer-memory hungry and don’t like
to be partitioned across separate machines.
The most efficient way to analyze, traverse,
or “walk” a big graph out to the nodes and
connections you’re looking for can be to
load the whole graph into the main memory
(RAM) of a single physical or virtual server.
Because of the memory required, the scale
of data amenable to whole graph analysis
has historically been smaller than what
other database structures are able to handle.
However, some vendors have been developing
workarounds that help users deal with just a
piece of the graph at a time.
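The idea of walking a graph held in memory, and of limiting the walk to one piece of the graph, can be sketched as a breadth-first traversal with a hop limit. This is a generic textbook technique written in Python, not any vendor's workaround; the function name `walk` and the sample adjacency map are assumptions for illustration.

```python
from collections import deque


def walk(adjacency, start, max_hops):
    """Breadth-first walk from start; returns {node: hops} within max_hops."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops


# Hypothetical adjacency map: node -> directly connected nodes.
adjacency = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": ["e"],
}

print(walk(adjacency, "a", 2))
```

With the whole adjacency map in RAM, each hop is a dictionary lookup. Capping `max_hops` keeps the visited neighborhood small even when the full graph is large.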
The structural uniqueness
of graphs
The main difference between document
trees and graphs is the degree of structure.
Graphs are a further step forward when more
structure is needed. While document objects
in JSON (JavaScript Object Notation) and
XML (Extensible Markup Language) have
parents and children, graphs also have other
relatives, friends, and acquaintances.
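The contrast between a document tree and a graph can be sketched in Python. A nested object gives every element exactly one parent, while an edge list can add cross-links on top of the same elements. The document content and the extra links below are hypothetical.

```python
# A document object: every element has exactly one parent.
document = {
    "clinic": {
        "physician": {"name": "Dr. A"},
        "trial": {"drug": "Drug X"},
    }
}


def parent_child_pairs(obj, parent="root"):
    """Flatten a nested document into (parent, child) pairs."""
    pairs = []
    for key, value in obj.items():
        pairs.append((parent, key))
        if isinstance(value, dict):
            pairs.extend(parent_child_pairs(value, key))
    return pairs


def incoming(edges, node):
    """All nodes with an edge pointing at the given node."""
    return [source for (source, target) in edges if target == node]


tree_edges = parent_child_pairs(document)
# A graph keeps the tree's pairs and adds links the tree cannot hold.
graph_edges = tree_edges + [("physician", "trial"), ("drug", "physician")]

print(incoming(tree_edges, "trial"))   # one parent only
print(incoming(graph_edges, "trial"))  # parent plus another relative
```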
Relational, document, and graph data models compared
[Figure: the three data models arranged from relationship sparseness to relationship richness.]
Relational: Row and column headers and up-front taxonomies. Static, selective, fragmented, labor intensive.
Document: Nested, cumulative hierarchies. Additive, index friendly, immutable versioning possible.
Graph: Any-to-any relationships. More dynamic, more inclusive, more integrated, more machine assisted.
Taxonomies versus ontologies
Taxonomies categorize exclusively
according to parent-child hierarchies,
a powerful approach that sometimes
doesn’t go far enough.
Ontologies categorize in an any-to-any,
3-D fashion that is higher definition and
more suited for graph modeling.
Another way to think about structure is to
remember that taxonomies, like document
objects, are hierarchical and have just parents
and children. If you need a richer classification
scheme, you would use an ontology, which is
a flexible schema, or data domain description,
that articulates specific data contexts.
Document objects3 are thus taxonomic and tree-like
in their parent-child hierarchies. The data model
of a graph is a path to richer and more realistic
descriptions. Graphs are ontological in nature, in
the sense that meaning can reside in any described
relationship between any two entities or
nodes. In the semantic world of Resource
Description Framework (RDF)4 graphs, ontologies
are stored alongside instance data in the
same “web.”
And here’s why taxonomies and ontologies are
important in “schemaless” NoSQL stores: the
schema or classification scheme, taxonomy, or
ontology can be part of or derived from a data
environment of rich data and metadata, and it
can evolve as that environment evolves.
Ontologies have historically been hand-built, but the potential is opening up for rich
domain description discovered in the data and
its relationship metadata through machine-assisted or inferred relationship mapping. The
connections and interactions between things
are where most of the contextually based
meaning in data resides. The connections in a
large data aggregation platform such as that
of Zephyr Health might be sparse as initially
constituted, but new relationships can be
inferred over time, and machine learning
could help to derive additional context.
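As a rough illustration of inferred relationships, the Python sketch below derives a new "went to school with" connection from existing "attended" connections. It is a simple hand-written rule, not machine learning and not Zephyr Health's method; the function name, the verb labels, and the sample people and schools are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def infer_schoolmates(triples):
    """Infer (a, 'went_to_school_with', b) for people who attended the same school."""
    people_by_school = defaultdict(list)
    for subject, verb, obj in triples:
        if verb == "attended":
            people_by_school[obj].append(subject)

    inferred = []
    for school, people in people_by_school.items():
        # Sorting makes the output order deterministic.
        for a, b in combinations(sorted(people), 2):
            inferred.append((a, "went_to_school_with", b))
    return inferred


# Hypothetical, initially sparse data.
triples = [
    ("Person A", "attended", "School X"),
    ("Person B", "attended", "School X"),
    ("Person C", "attended", "School Y"),
]

print(infer_schoolmates(triples))
```

A real platform would need to weigh such inferences, for example by checking attendance dates, before treating them as facts.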
As long as you have sufficient computing
horsepower, a graph can model a multivariate
problem with greater accuracy than a tree
can. The relationship metadata provides the
descriptive power. Conceivably, autonomous
software agents designed to manage the data
model could evolve the relationship-driven
model as the use case and the data change.
3 See “Using document stores in business model transformation,” PwC Technology Forecast 2015, Issue 1, http://www.pwc.com/nosql, for
more information on JSON and XML document objects.
4 RDF is Resource Description Framework, an official W3C Recommendation for Semantic Web data models.
How graphs are similar
to other NoSQL stores
Instead of simply storing data as values
with keys or as document objects or tables,
graph stores contain nodes and connections.
Fundamentally, keys (or identifiers) and
values (which could be any groupings of data)
are the atomic building blocks for key-value,
wide-column, and document stores, and these
can also be the building blocks for graphs.
Tables, documents, and graphs provide
additional structure, in different shapes, with
an increasing number of interconnections.
And, as stated earlier, graphs can have many
more interconnections.
All data structures, from simple keys to
hierarchies to graphs, can be represented to
machines in the form of tags associated with
the data.
With so much potential, why
aren’t more people using graphs?
Why are other types of NoSQL stores
predominant, while graph stores are
sometimes considered a niche? Tables are
familiar to most, trees are familiar to some,
but graphs are unfamiliar and often puzzling
to many.
That’s mainly due to lack of maturity and
familiarity, which can be overcome with more
exposure to the different, more powerful kind
of modeling that graphs present for systems-level optimization challenges that are so core
to business. Another part of the problem is
that graph stores haven’t been easy to use. The
current generation of graph stores is resolving
that issue.
Graph store types
Three main types of graph stores have
emerged during the past decade, each driven
by a different data perishability and reuse
category:
• Property graphs, the most popular, consist
of key-value pairs for each element. They
don’t require standard semantics and
are thus simpler to get started with. It is
possible, however, to store semantic triples
in property graphs.
• Native RDF or semantic triple stores
support the use of Internationalized
Resource Identifiers (IRIs)5 for detailed,
standard semantics, which some developers
have found difficult to use. RDF has
a classic entity-relationship data model,
which consists of subjects (entities),
predicates or verbs (relationships), and
objects (also entities). Adhering to the RDF
standard offers the potential for global
reuse, content mining, and a means of
using the web for content management via
dynamic semantic publishing, among other
advantages.6
• Dynamic graphs can add new relationships
at scale. These offer the most efficiency for
labor-saving reusability and the chance to
operationalize and integrate heterogeneous
data environments, but they are the most
immature.
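The difference between the first two types, and the point that semantic triples can be stored in or derived from a property graph, can be sketched in Python. The dictionary layout is an assumption for illustration rather than any product's storage format, and `http://example.org/` is a placeholder namespace, not a real vocabulary.

```python
BASE = "http://example.org/"  # placeholder namespace for illustration only

# Property graph style: each element carries its own key-value pairs.
nodes = [
    {"id": "person/1", "properties": {"name": "Investigator A"}},
    {"id": "school/1", "properties": {"name": "School X"}},
]
edges = [
    {"source": "person/1", "label": "attended", "target": "school/1"},
]


def to_triples(nodes, edges, base=BASE):
    """Express a property graph as subject-verb-object triples with IRI-style names."""
    triples = []
    for node in nodes:
        subject = base + node["id"]
        for key, value in node["properties"].items():
            # Node properties become triples whose object is a plain value.
            triples.append((subject, base + "prop/" + key, value))
    for edge in edges:
        # Connections become triples whose object is another identified node.
        triples.append(
            (base + edge["source"], base + "rel/" + edge["label"], base + edge["target"])
        )
    return triples


for triple in to_triples(nodes, edges):
    print(triple)
```

The property graph form is quicker to start with because the keys are local names. The triple form spells out every name as a web-style identifier, which is the basis for wider reuse.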
The less perishable the data, the more long-term investment comes into view for further
articulating the data model to achieve better
integration and reuse potential. As a general
rule, data the furthest to the right on this
lifecycle continuum would justify investment
in RDF, a dynamic graph capability, or
both. Large enterprises should explore all
three kinds of graph store technology at this
point and consider how use cases differ for
each. They should remember that data-driven
apps and big data analytics are only in their
infancy, and that graph databases will play a
much bigger role once those areas evolve.
5 The IRI standard extends Uniform Resource Identifiers or URIs (a superset of URLs) with Universal Character Set support for languages
such as Chinese, Japanese, Korean and Arabic.
6 See “Semantic Web in the enterprise,” PwC Technology Forecast, Spring 2009, http://www.pwc.com/us/en/technology-forecast/spring2009/index.jhtml,
for a detailed discussion of RDF and other semantic web standards and use cases from an enterprise
perspective.
Three graph store types on the data lifecycle continuum
[Figure: the property, RDF, and dynamic graph store types placed along the data lifecycle stages: plan, capture/collate, stream, aggregate, transact, network, reuse.]
Data set uses: simple persistence, then immediate usability, then long-term enterprise reusability.
Characteristics at the left of the continuum: raw, less structured, perishable, single use, massively scalable.
Characteristics at the right of the continuum: refined, structured, less perishable, reusable, less scalable.
Components of graph store types

Type: Property
Main components: Network: Nodes and connections with properties
Details: Each property has a key and a value, and each node can have more than one property.
Example products: Neo4j, OrientDB*, Titan

Type: RDF or semantic triple or quad store
Main components: Subject-verb-object triples: Nodes as subjects and objects and connections as verbs
Details: Subjects, verbs, and objects are all identified via Internationalized Resource Identifiers (IRIs, or unique web-style addresses).
Example products: AllegroGraph, Ontotext, Stardog

Type: Dynamic
Main components: Changing nodes and connections
Details: Nodes can enter or exit the graph, and relationships can be inserted, updated, or removed.
Example products: EnterpriseWeb**

* OrientDB is a document/graph database
** EnterpriseWeb is a full-stack integration platform based on a dynamic graph object store
Outlook: The least mature NoSQL
type, but the most promise
Not every enterprise faces the challenge of a
large-scale aggregation like the one at Zephyr
Health. Depending on the use case, native
graph stores can be overkill. If the immediate
purpose is to capture or cache the data,
then a key-value or column store is more
appropriate. If the purpose is aggregation,
then a document store may be best, at least
for initial data ingestion. If transactional
integrity and concurrency are critical
requirements, then an RDBMS or a NewSQL
store fits best.
Much depends on the business role played,
the point in the data lifecycle, and the use
case. If the challenge is to model or integrate
large, networked systems and to monitor and
optimize the interconnections, that’s when
graph stores come into play. The current
generation of graph stores is most helpful
in thoughtful, considered systems-level
analytics at the back half of the data lifecycle.
Operational analytics and dynamic process
integration along the lines of EnterpriseWeb
are just emerging.
In-memory technology provides one solution
to a nagging latency problem for large-scale
graph stores. Other solutions include cache
sharding and other distributed graph store
design innovations. Hybrid key-value/graph
stores (such as Titan on Cassandra, Sqrrl
on Accumulo, or OrientDB) or document/
graph stores (such as MarkLogic) promise
other advantages but may also introduce
complexities. The summary article for this
issue of the Technology Forecast and the
interviews examine some scenarios of how
the emerging technology may affect business
environments.7
7 See http://www.pwc.com/nosql for more background and current thinking on distributed graph store applications, hybrid database
technologies, the emerging Spark stack, and future NoSQL data store scenarios.
To have a deeper conversation
about remapping the database
landscape, please contact:
Gerard Verweij
Principal and US Technology
Consulting Leader
+1 (617) 530 7015
Chris Curran
Chief Technologist
+1 (214) 754 5055
Oliver Halter
Principal, Data and Analytics Practice
+1 (312) 298 6886
Bo Parker
Managing Director
Center for Technology and Innovation
+1 (408) 817 5733
About PwC’s Technology Forecast
Published by PwC’s Center for Technology
and Innovation (CTI), the Technology Forecast
explores emerging technologies and trends
to help business and technology executives
develop strategies to capitalize on technology
opportunities.
Recent issues of the Technology Forecast have
explored a number of emerging technologies
and topics that have ultimately become
many of today’s leading technology and
business issues. To learn more about the
Technology Forecast, visit www.pwc.com/technologyforecast.
About PwC
PwC US helps organizations and individuals
create the value they’re looking for. We’re a
member of the PwC network of firms in 157
countries with more than 195,000 people.
We’re committed to delivering quality in
assurance, tax and advisory services. Find
out more and tell us what matters to you by
visiting us at www.pwc.com.
© 2015 PricewaterhouseCoopers LLP, a Delaware limited liability partnership. All rights reserved. PwC refers to the US member firm, and may
sometimes refer to the PwC network. Each member firm is a separate legal entity. Please see www.pwc.com/structure for further details. This
content is for general information purposes only, and should not be used as a substitute for consultation with professional advisors. MW-15-1351 LL