
www.pwc.com/technologyforecast

Technology Forecast: Remapping the database landscape
Issue 1, 2015

The promise of graph databases in public health

By Alan Morrison

One of the main advantages of a NoSQL graph store is web-scale discovery.

When a multinational biotech firm needed an advisory board member, it wanted a certain top US physician in its field, someone who was already a busy advisor to the firm’s parent company. “Placing him on the board in Latin America was a lower priority for the global group,” says Muriel Siadak, a medical affairs director who conducted the global search for the right person. “Obviously, it was a high priority for the Latin American affiliate.”

Siadak turned to Zephyr Health, which aggregates thousands of life science data sources about people, places, pills, and other things, and provides a software-as-a-service (SaaS) platform with search tools. Zephyr Health converts its sources into document database format, and then layers a graph store (or database) on top to easily traverse and search the underlying data.

Siadak launched a search to find just the right fit for the Latin American affiliate. The richness of the diverse detail she could explore with relative ease was a key to quickly finding the perfect needle in the haystack. In fact, she found two.

“I found two fairly frequent users of the company’s products, [pharmaceutical] investigators with a lot of experience who had done their medical training in Latin America, so they obviously spoke Spanish,” she says. “They would be able and would have interest, because they grew up there, to be a part of the Latin American effort.
The affiliate was very pleased to have someone [Siadak] on a global level come back and say, ‘Look, it’s not just the top name that you need.’”

If you could continually integrate thousands of external enterprise data sources, add internal ones on a custom basis by request, and tailor the whole so it’s appropriately accessible to a range of business users through a single application platform, what could you do that you haven’t been able to? What Zephyr Health enables in the life sciences illustrates one of the myriad possibilities. The power of this level of data aggregation is just now becoming apparent.

In the biotech industry, specific skills and knowledge are always at a premium. Companies in public health are starting to use graph-facilitated SaaS to solve business problems. Some of the world’s most knowledge-intensive organizations, including multinational banks, media companies, space agencies, and logistics companies, are also using graph databases, and intelligence agencies have been using them for a decade. Others will follow.

The graph store is one of many innovations creating a sea change in database technology. This issue of the Technology Forecast explores the promise and upheaval caused by these new technologies. This article provides a deeper look at graph technology and how it is similar to other NoSQL [1] data stores, which are explored in an earlier article.

What is a graph store?

At Zephyr Health, the pivotal technology is a Neo4j graph store used to find and traverse relationships between entities in data originally ingested using MongoDB, a document database. Document databases are useful for unstructured data, but Zephyr Health had troubles with indexing and latency at web scale, which is why it added the graph database. [2]

The standard corporate data storehouse, the relational database management system (RDBMS), cannot begin to provide the speedy, flexible search support of a graph. An RDBMS needs absolute consistency among its rows and columns.
The difference between a “join” in an RDBMS and in a graph store is like the difference between a precision dovetail joint in woodwork and a freeform Tinkertoy construct. Graphs only need to join or connect at a single point to have useful meaning in searches.

[1] Structured query language, or SQL, is the dominant query language associated with relational databases. NoSQL stands for not only structured query language. In practice, the term NoSQL is used loosely to refer to non-relational databases designed for distributed environments, rather than the associated query languages. PwC uses the term NoSQL, despite its inadequacies, to refer to non-relational distributed stores because it has become the default term of art. See the section “Database evolution becomes a revolution” in the article “Enterprises hedge their bets with NoSQL databases,” PwC Technology Forecast 2015, Issue 1, http://www.pwc.com/us/en/technologyforecast/2015/remapping-database-landscape/features/enterprises-nosql-databases.jhtml, for more information on relational versus non-relational database technology.

[2] Mahesh Chaudhari, “Avoiding Deadlocks: Lessons Learned with Zephyr Health Using Neo4j and MongoDB,” GraphConnect 2013 presentation, http://watch.neo4j.org/video/76891392, accessed March 24, 2015.

Effectively using graph technology poses its own challenges: the technology is relatively new and only recently capable of web-scale integration tasks. To take advantage of the technology, Zephyr Health needed to resolve some scalability and latency issues associated with web-scale graph environments.

A NoSQL graph store contains malleable maps of entities (named people, places, and things) and how they’re related. Entities become the nodes and relationships become the connections (developers call them edges) in these maps, which can take any shape.
If you’re modeling your extended organization, for example, the relationships that appear in a native graph store can add to or become your model. What’s different in a graph store from a database perspective is the sheer volume of connections, or relationships—how people, places, and things relate to one another through those interactions. If your data is rich, you’ll see lots of relationships between the entities in native graph form. Older database technologies place less emphasis on relationships, resulting in less context. Graphs offer the chance for richer context through more connections and any-to-any data models rather than the usual tabular or hierarchical models. Graphs can store converted tabular and hierarchical document object information, too, but the graph’s uniqueness and power is in its ability to map any-to-any relationships. Relationship richness of this kind boosts the integration potential and the contextual relevance of the data being represented. The verb in the sentence diagrammed in the graphic below is where the richness lies. Entities can also have attributes, such as keys to identify them, but verb-style relationships in this sense describe the entities in a richer way than simple identifiers or other labels do. And those relationships can be mined in ways most conventional analytics techniques haven’t explored yet, because they’re not optimized for graphs. The verbs carry much of the power. The more that relationships can be mapped in the online world, the more high-level understanding can be tapped about those entities, the various clusters of those entities, and their nature. For example, social networks mapped in a graph can capture who’s married to whom, who works for whom, who went to school with whom, and so on. 
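The point that the verbs carry much of the power can be sketched in a few lines of Python. This is only an illustrative sketch: the people, organizations, relationship names, and the helper functions `build_index`, `related`, and `who` are hypothetical, and they do not represent Zephyr Health's data or the API of any graph product.

```python
from collections import defaultdict

# Hypothetical sample data: each relationship is (subject, verb, object).
RELATIONSHIPS = [
    ("Ana", "married_to", "Luis"),
    ("Ana", "works_for", "Acme Biotech"),
    ("Luis", "works_for", "Acme Biotech"),
    ("Ana", "went_to_school_with", "Marta"),
    ("Marta", "works_for", "City Hospital"),
]


def build_index(relationships):
    """Index outgoing relationships by subject, keeping the verb on each edge."""
    index = defaultdict(list)
    for subject, verb, obj in relationships:
        index[subject].append((verb, obj))
    return index


def related(index, subject, verb):
    """Return the entities that `subject` reaches through `verb`."""
    return [obj for v, obj in index.get(subject, []) if v == verb]


def who(relationships, verb, obj):
    """Return, in sorted order, every subject linked to `obj` by `verb`."""
    return sorted(s for s, v, o in relationships if v == verb and o == obj)


index = build_index(RELATIONSHIPS)
print(related(index, "Ana", "works_for"))
print(who(RELATIONSHIPS, "works_for", "Acme Biotech"))
```

Because the verb is stored on every connection, the same small data set answers both "where does Ana work?" and "who works for Acme Biotech?" without a predefined table layout.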
Figure: Graph databases place more focus on verbs
In the classic entity-relationship model that graphs articulate, relationships connect two or more entities, as in the following example, which shows how entities in an expertise location scenario are related to one another.

Santiago Ramón y Cajal → attended → University of Zaragoza

Each added relationship enriches the profile of the persons, or entities, that are connected. Graphs can describe not only social relationships, but any relationship, whether it’s an investigator’s use of an experimental drug to treat a malignancy, the effect of that treatment on a given patient, or the cost of a single dosage of a given amount of the drug.

In the online environment, anything can be described in terms of entities and relationships, but design considerations factor in heavily because graphs can be computer-memory hungry and don’t like to be partitioned across separate machines. The most efficient way to analyze, traverse, or “walk” a big graph out to the nodes and connections you’re looking for can be to load the whole graph into the main memory (RAM) of a single physical or virtual server. Because of the memory required, the scale of data amenable to whole graph analysis has historically been smaller than what other database structures are able to handle. However, some vendors have been developing workarounds that help users deal with just a piece of the graph at a time.

The structural uniqueness of graphs

The main difference between document trees and graphs is the degree of structure. Graphs are a further step forward when more structure is needed. While document objects in JSON (JavaScript Object Notation) and XML (Extensible Markup Language) have parents and children, graphs also have other relatives, friends, and acquaintances.
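A rough Python sketch of that contrast follows. The entity names, the `tree_links` and `cross_links` maps, and the `merge` and `walk` helpers are assumptions made for illustration; the whole graph is held in memory as a plain dictionary, which is far simpler than the storage and traversal engines of real graph stores.

```python
from collections import deque

# Parent-to-child links, as a nested JSON or XML document would express them.
tree_links = {
    "hospital": ["oncology", "cardiology"],
    "oncology": ["dr_ruiz"],
    "cardiology": ["dr_vega"],
}

# Extra any-to-any links (the "friends and acquaintances") that a graph adds.
cross_links = {
    "dr_ruiz": ["dr_vega"],   # hypothetical co-investigator link
    "dr_vega": ["trial_42"],  # hypothetical link to a clinical trial
}


def merge(*link_maps):
    """Combine several adjacency maps into one graph."""
    merged = {}
    for links in link_maps:
        for node, neighbors in links.items():
            merged.setdefault(node, []).extend(neighbors)
    return merged


def walk(graph, start, goal):
    """Breadth-first walk; returns the shortest path from start to goal, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


graph = merge(tree_links, cross_links)
print(walk(tree_links, "dr_ruiz", "dr_vega"))  # hierarchy alone offers no route
print(walk(graph, "dr_ruiz", "trial_42"))      # cross-links make the walk possible
```

With only the parent-child links, the two physicians are unreachable from each other; once the cross-links are added, the walk finds a route through them.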
Figure: Relational, document, and graph data models compared (ordered from relationship sparseness to relationship richness)

Relational: Row and column headers and up-front taxonomies. Static; selective; fragmented; labor intensive.
Document: Nested, cumulative hierarchies. Additive; index friendly; immutable versioning possible.
Graph: Any-to-any relationships. More dynamic; more inclusive; more integrated; more machine assisted.

Taxonomies versus ontologies

Taxonomies categorize exclusively according to parent-child hierarchies, a powerful approach that sometimes doesn’t go far enough. Ontologies categorize in an any-to-any, 3-D fashion that is higher definition and more suited for graph modeling.

Another way to think about structure is to remember that taxonomies, like document objects, are hierarchical and have just parents and children. If you need a richer classification scheme, you would use an ontology, which is a flexible schema, or data domain description, that articulates specific data contexts.

Document objects [3] are thus taxonomic in their parent-child hierarchies and treelike. The data model of a graph is a path to richer and more realistic descriptions. Graphs are ontological in nature, in the sense that meaning can reside in any described relationship between any two entities or nodes. In the semantic world of Resource Description Framework (RDF) [4] graphs, ontologies are stored alongside instance data in the same “web.”

And here’s why taxonomies and ontologies are important in “schemaless” NoSQL stores: the schema or classification scheme, taxonomy, or ontology can be part of or derived from a data environment of rich data and metadata, and it can evolve as that environment evolves. Ontologies have historically been hand-built, but the potential is opening up for rich domain description discovered in the data and its relationship metadata through machine-assisted or inferred relationship mapping.
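The difference between a taxonomy and ontology-style relationships, and the idea of inferring a new relationship from existing ones, can be sketched as follows. Everything here is hypothetical: the terms, the facts, the `trained_with` relationship, and the single hand-written inference rule stand in for what an actual ontology language or machine-assisted mapping tool would provide.

```python
from itertools import combinations

# Taxonomy: every term has exactly one parent (hypothetical terms).
taxonomy_parent = {
    "oncologist": "physician",
    "cardiologist": "physician",
    "physician": "person",
}


def ancestors(term):
    """Follow parent links up the hierarchy until the root is reached."""
    chain = []
    while term in taxonomy_parent:
        term = taxonomy_parent[term]
        chain.append(term)
    return chain


# Ontology-style facts: any entity may relate to any other through a named verb.
facts = {
    ("dr_ruiz", "trained_at", "univ_a"),
    ("dr_vega", "trained_at", "univ_a"),
    ("dr_ruiz", "investigates", "drug_x"),
}


def infer_shared(facts, verb, new_verb):
    """Infer `new_verb` between every pair of subjects that share an object for `verb`."""
    by_object = {}
    for subject, v, obj in facts:
        if v == verb:
            by_object.setdefault(obj, set()).add(subject)
    inferred = set()
    for subjects in by_object.values():
        for a, b in combinations(sorted(subjects), 2):
            inferred.add((a, new_verb, b))
    return inferred


print(ancestors("oncologist"))
print(infer_shared(facts, "trained_at", "trained_with"))
```

The taxonomy can only answer "what kind of thing is this?", while the fact set supports a rule that adds a relationship nobody entered by hand.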
The connections and interactions between things are where most of the contextually based meaning in data resides. The connections in a large data aggregation platform such as that of Zephyr Health might be sparse as initially constituted, but new relationships can be inferred over time, and machine learning could help to derive additional context.

As long as you have sufficient computing horsepower, a graph can model a multivariate problem with greater accuracy than a tree can. The relationship metadata provides the descriptive power. Conceivably, autonomous software agents designed to manage the data model could evolve the relationship-driven model as the use case and the data change.

[3] See “Using document stores in business model transformation,” PwC Technology Forecast 2015, Issue 1, http://www.pwc.com/nosql, for more information on JSON and XML document objects.

[4] RDF is the Resource Description Framework, an official W3C Recommendation for Semantic Web data models.

How graphs are similar to other NoSQL stores

Instead of simply storing data as values with keys or as document objects or tables, graph stores contain nodes and connections. Fundamentally, keys (or identifiers) and values (which could be any groupings of data) are the atomic building blocks for key-value, wide-column, and document stores, and these can also be the building blocks for graphs. Tables, documents, and graphs provide additional structure, in different shapes, with an increasing number of interconnections. And, as stated earlier, graphs can have many more interconnections. All data structures, from simple keys to hierarchies to graphs, can be represented to machines in the form of tags associated with the data.

With so much potential, why aren’t more people using graphs? Why are other types of NoSQL stores predominant, while graph stores are sometimes considered a niche?
Tables are familiar to most, trees are familiar to some, but graphs are unfamiliar and often puzzling to many. That’s mainly due to lack of maturity and familiarity, which can be overcome with more exposure to the different, more powerful kind of modeling that graphs present for the systems-level optimization challenges that are so core to business. Another part of the problem is that graph stores haven’t been easy to use. The current generation of graph stores is resolving that issue.

Graph store types

Three main types of graph stores have emerged during the past decade, each driven by a different data perishability and reuse category:

• Property graphs, the most popular, consist of key-value pairs for each element. They don’t require standard semantics and are thus simpler to get started with. It is possible, however, to store semantic triples in property graphs.
• Native RDF or semantic triple stores support the use of Internationalized Resource Identifiers (IRIs) [5] for detailed, standard semantics, which some developers have found difficult to use. RDF has a classic entity-relationship data model, which consists of subjects (entities), predicates or verbs (relationships), and objects (also entities). Adhering to the RDF standard offers the potential for global reuse, content mining, and a means of using the web for content management via dynamic semantic publishing, among other advantages. [6]
• Dynamic graphs can add new relationships at scale. These offer the most efficiency for labor-saving reusability and the chance to operationalize and integrate heterogeneous data environments, but they are the most immature.

The less perishable the data, the more long-term investment comes into view for further articulating the data model to achieve better integration and reuse potential. As a general rule, data the furthest to the right on the data lifecycle continuum would justify investment in RDF, a dynamic graph capability, or both.
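To make the first two store types more concrete, here is a minimal Python sketch that holds one fact in a property-graph style (nodes and connections with key-value properties) and then rewrites it as subject-verb-object triples identified by IRIs. The `http://example.org/` namespace, the identifiers, and the `to_triples` helper are assumptions for illustration; real RDF stores and property graph products each have their own data formats and APIs.

```python
BASE = "http://example.org/"  # hypothetical namespace used only for this sketch

# Property-graph style: each node and each connection carries key-value properties.
nodes = {
    "cajal": {"name": "Santiago Ramón y Cajal"},
    "zaragoza": {"name": "University of Zaragoza"},
}
edges = [
    # (source node, verb, target node, connection properties)
    ("cajal", "attended", "zaragoza", {"source": "sample"}),
]


def iri(local_name):
    """Build a web-style identifier from a local name."""
    return BASE + local_name


def to_triples(nodes, edges):
    """Rewrite the property graph as (subject, predicate, object) triples.

    Node properties become triples with literal values. Connection properties
    are dropped here, because a plain triple has no slot for them.
    """
    triples = []
    for node_id, props in nodes.items():
        for key, value in props.items():
            triples.append((iri(node_id), iri(key), value))
    for source, verb, target, _props in edges:
        triples.append((iri(source), iri(verb), iri(target)))
    return triples


for triple in to_triples(nodes, edges):
    print(triple)
```

The property form is quicker to start with, while the triple form gives every subject, verb, and object a globally unique identifier, which is what makes reuse across data sets possible.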
Large enterprises should explore all three kinds of graph store technology at this point and consider how use cases differ for each. They should remember that data-driven apps and big data analytics are only in their infancy, and that graph databases will play a much bigger role once those areas evolve.

[5] The IRI standard extends Uniform Resource Identifiers or URIs (a superset of URLs) with Universal Character Set support for languages such as Chinese, Japanese, Korean and Arabic.

[6] See “Semantic Web in the enterprise,” PwC Technology Forecast, Spring 2009, http://www.pwc.com/us/en/technology-forecast/ spring2009/index.jhtml, for a detailed discussion of RDF and other semantic web standards and use cases from an enterprise perspective.

Figure: Three graph store types on the data lifecycle continuum
Graph store types, left to right: Property, RDF, Dynamic.
Lifecycle stages, left to right: Plan, Capture/collate, Stream, Aggregate, Transact, Network, Reuse.
Data set uses, left to right: Simple persistence, Immediate usability, Long-term enterprise reusability.
Characteristics at the left end: Raw, Less structured, Perishable, Single use, Massively scalable.
Characteristics at the right end: Refined, Structured, Less perishable, Reusable, Less scalable.

Table: Components of graph store types

Type: Property
Main components: Network: nodes and connections with properties
Details: Each property has a key and a value, and each node can have more than one property.
Example products: Neo4j, OrientDB*, Titan

Type: RDF or semantic triple or quad store
Main components: Subject-verb-object triples: nodes as subjects and objects and connections as verbs
Details: Subjects, verbs, and objects are all identified via Internationalized Resource Identifiers (IRIs, or unique web-style addresses).
Example products: AllegroGraph, Ontotext, Stardog

Type: Dynamic
Main components: Changing nodes and connections
Details: Nodes can enter or exit the graph, and relationships can be inserted, updated, or removed.
Example products: EnterpriseWeb**

* OrientDB is a document/graph database
** EnterpriseWeb is a full-stack integration platform based on a dynamic graph object store

Outlook: The least mature NoSQL type, but the most promise

Not every enterprise faces the challenge of a large-scale aggregation like the one at Zephyr Health. Depending on the use case, native graph stores can be overkill. If the immediate purpose is to capture or cache the data, then a key-value or column store is more appropriate. If the purpose is aggregation, then a document store may be best, at least for initial data ingestion. If transactional integrity and concurrency are critical requirements, then an RDBMS or a NewSQL store fits best.

Much depends on the business role played, the point in the data lifecycle, and the use case. If the challenge is to model or integrate large, networked systems and to monitor and optimize the interconnections, that’s when graph stores come into play. The current generation of graph stores is most helpful in thoughtful, considered systems-level analytics at the back half of the data lifecycle. Operational analytics and dynamic process integration along the lines of EnterpriseWeb are just emerging.

In-memory technology provides one solution to a nagging latency problem for large-scale graph stores. Other solutions include cache sharding and other distributed graph store design innovations. Hybrid key-value/graph stores (such as Titan on Cassandra, Sqrrl on Accumulo, or OrientDB) or document/graph stores (such as MarkLogic) promise other advantages but may also introduce complexities.
The summary article for this issue of the Technology Forecast and the interviews examine some scenarios of how the emerging technology may affect business environments. [7]

[7] See http://www.pwc.com/nosql for more background and current thinking on distributed graph store applications, hybrid database technologies, the emerging Spark stack, and future NoSQL data store scenarios.

To have a deeper conversation about remapping the database landscape, please contact:

Gerard Verweij, Principal and US Technology Consulting Leader, +1 (617) 530 7015
Chris Curran, Chief Technologist, +1 (214) 754 5055
Oliver Halter, Principal, Data and Analytics Practice, +1 (312) 298 6886
Bo Parker, Managing Director, Center for Technology and Innovation, +1 (408) 817 5733

About PwC’s Technology Forecast

Published by PwC’s Center for Technology and Innovation (CTI), the Technology Forecast explores emerging technologies and trends to help business and technology executives develop strategies to capitalize on technology opportunities. Recent issues of the Technology Forecast have explored a number of emerging technologies and topics that have ultimately become many of today’s leading technology and business issues. To learn more about the Technology Forecast, visit www.pwc.com/technologyforecast.

About PwC

PwC US helps organizations and individuals create the value they’re looking for. We’re a member of the PwC network of firms in 157 countries with more than 195,000 people. We’re committed to delivering quality in assurance, tax and advisory services. Find out more and tell us what matters to you by visiting us at www.pwc.com.

© 2015 PricewaterhouseCoopers LLP, a Delaware limited liability partnership. All rights reserved. PwC refers to the US member firm, and may sometimes refer to the PwC network. Each member firm is a separate legal entity. Please see www.pwc.com/structure for further details.
This content is for general information purposes only, and should not be used as a substitute for consultation with professional advisors. MW-15-1351 LL
