www.pwc.com/technologyforecast
Technology Forecast: Remapping the database landscape, Issue 1, 2015

How enterprise graph databases are maturing

Martin Van Ryswyk and Marko Rodriguez of DataStax explore the challenges and benefits of big data analytics with graphs. Interview conducted by Alan Morrison and Bo Parker.

Martin Van Ryswyk is an executive vice president at DataStax. Marko Rodriguez is chief of engineering and a co-founder of Aurelius, acquired in February 2015 by DataStax.

PwC: Do customers still think of graph databases as a niche technology, or are attitudes changing?

MVR: The reason we acquired Aurelius was very customer driven. We had more than 30 customers telling us they had graph use cases they needed to scale. They wanted us to do something with the Titan graph database that Aurelius created—either support it commercially or come up with our own version. We took a long look and were really surprised at how mainstream graph databases had become. The Aurelius team was seeing use cases in fraud detection and recommendation engines, evidence that our big enterprise customers had already identified graph as the right modeling framework to solve their problems. These enterprise customers were really just looking to us to make sure we could get them an enterprise-grade solution.

PwC: Graph theory is quite old. What has been inhibiting adoption of graph technologies?

MR: It took a long time for people to realize that many of the data problems they were trying to solve were graph problems. So although the theory is relatively old, enterprises just didn’t have the terminology to understand what they were getting themselves into or what their problem was. The graph is actually a nice way to represent enterprise data and metadata and to solve enduring data problems. In addition to the conceptual challenge, graph technologies lacked a certain level of enterprise readiness. Take Titan, for example. Aurelius didn’t have enough resources for enterprise support and enterprise testing, and that really hindered adoption for very large customers. What’s nice about DataStax is that now we’re able to deliver the outreach that helps overcome the conceptual challenge while also providing the support our enterprise customers require. For very large customers with terabytes upon terabytes of data, there is no graph database that supports their needs right now.

PwC: Are the use cases entirely different from other NoSQL database options, such as Cassandra?

MVR: They’re somewhat adjacent. Titan has the ability to use Cassandra underneath it as one way to persist the data. Our customers wanted to have both their wide-column database model and a graph model all in the same store. That was a common theme we heard.

PwC: What’s unique about the graph approach from an enterprise perspective?

MVR: One of the constraints and benefits of a graph is that it already has the precomputed join. In the SQL world you have tables and columns, and you can arbitrarily join tables based on various columns. In a graph database, we would say that this person knows that person, or this person is related to that person. It scales nicely in that sense, because every relationship already acts as a join. That’s why we can get better scaling with a graph database.
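[Editor's note: The "precomputed join" point can be made concrete with a small sketch. The snippet below is plain Python with made-up names, not the DSE Graph or TinkerPop API; it contrasts a relational-style join, resolved at query time, with a graph-style adjacency, where each edge already points directly at the related record.]

```python
# Sketch only: relational join vs. graph adjacency (illustrative Python,
# not DataStax's actual API or storage model).

# Relational style: "knows" relationships live in a separate table and must
# be joined against the person table at query time.
people = {1: "Martin", 2: "Marko", 3: "Alan"}
knows_table = [(1, 2), (2, 3)]  # (person_id, knows_person_id)

def friends_sql_style(person_id):
    # Scan the join table, then look up each matching row in the person table.
    return [people[b] for (a, b) in knows_table if a == person_id]

# Graph style: each vertex stores its outgoing "knows" edges directly, so
# following a relationship is a pointer hop rather than a join.
graph = {
    "Martin": {"knows": ["Marko"]},
    "Marko":  {"knows": ["Alan"]},
    "Alan":   {"knows": []},
}

def friends_graph_style(name):
    return graph[name]["knows"]

print(friends_sql_style(1))           # ['Marko']
print(friends_graph_style("Martin"))  # ['Marko']
```

[In a graph database this adjacency is maintained per vertex in storage, which is what lets a multi-hop traversal avoid repeated join work as the data set grows.]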
PwC: How can established enterprises benefit from those advantages?

MVR: We’ve seen a number of good use cases across different sectors. For example, with improved relationship analytics, utilities can better predict when they will have peaks in usage or equipment failures. Large retailers can do better targeting for club cards and coupon recommendations. Banks can detect more instances of fraud or insider trading.

PwC: When we think about these kinds of use cases in relational technology, we typically look for querying and reporting capability. Is that how to think about graph databases?

MVR: There will be analysts who will run queries or who want some really nice graphics and visualizations out of graphs. For the most part, that’s not our target market. That’s the OLAP [online analytical processing] side of things. Our functionality is meant to be accessed programmatically as part of an application in an OLTP [online transaction processing] context. If I am checking out at a grocery store, the store would want data about me plus what I just put in the cart. They would need to run the data through the graph database, so they could figure out all sorts of information in near real time about Martin. Let’s say he’s a football fan, he’s in California, and there’s a game this week. Maybe we’ll offer him a beer coupon. I’m making all of that up. But they’re taking a lot of pieces of data and trying to make very fast analytic decisions, and that’s the big thing with DataStax Enterprise (DSE) Graph. It’s the real-time component.

MR: In the OLTP space Martin is talking about, when you perform a graph analysis, you’re just doing a particular traversal for a real-time query. You’re touching only a subset of the full data set. You’re starting at the Martin vertex and you’re walking around. You’re trying to solve a problem. And the less data you touch, the faster the traversal will be. But in an OLAP query, you’re typically touching the whole graph or large subgraphs. There are multiple threads touching many things and, as a result, touching the disk heavily. [Retrieving more data from disk means more latency.] DSE Graph has both OLTP and OLAP capabilities.

PwC: Does the OLTP approach help with the partitioning problem that graph databases have suffered from?

MR: For sure. That’s the biggest problem in graphs. It’s impossible to get a perfect cut across machines, so what you’re trying to do is limit cross-machine communication. The more data that will be co-retrieved you can put on the same machine, the better off you’ll be. That is typically not a general function of some abstract algorithm, but rather a function of understanding your data. For example, on a social network, people who communicate with each other tend to be located close to each other geographically. You can think of your machines as being laid out like a world map; people in the same country will map to a particular machine.
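[Editor's note: As a rough illustration of the partitioning point, the sketch below compares random vertex placement with placement by a locality attribute such as country, and counts how many edges end up crossing machines. The data, vertex names, and placement policies are hypothetical; this is not how DSE Graph or Cassandra actually assigns data to nodes.]

```python
# Sketch: why co-locating related vertices reduces cross-machine traffic.
# Hypothetical data and placement policies, for illustration only.
import random

edges = [("alice", "bob"), ("bob", "carol"), ("dave", "erin"), ("erin", "frank")]
country = {"alice": "US", "bob": "US", "carol": "US",
           "dave": "DE", "erin": "DE", "frank": "DE"}
machines = ["m1", "m2"]

def random_placement(vertices):
    # Assign each vertex to a machine at random, ignoring the data's structure.
    return {v: random.choice(machines) for v in vertices}

def country_placement(vertices):
    # Map each country to one machine, like the "world map" analogy above.
    by_country = {"US": "m1", "DE": "m2"}
    return {v: by_country[country[v]] for v in vertices}

def cross_machine_edges(placement):
    # Count edges whose endpoints live on different machines.
    return sum(1 for a, b in edges if placement[a] != placement[b])

vertices = set(country)
random.seed(0)
print("random placement crosses:", cross_machine_edges(random_placement(vertices)))
print("country placement crosses:", cross_machine_edges(country_placement(vertices)))  # 0
```

[Traversals that stay within one machine avoid the network hops that dominate latency in a distributed graph, which is why placement informed by the data tends to beat a generic cut.]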
PwC: What are some other considerations to take into account when using graph databases?

MR: A key concern is relationship density. Although it sounds counterintuitive, you might actually want to avoid relationship density as much as possible. Take a shopping site, for example. You’ve purchased a lot of products over the years, and so the overall graph is dense around you. If you’re doing a query, most of that graph is irrelevant information. What you’re really interested in at the current moment is very, very specific. With graphs, you try to be very particular and filter, filter, filter to only certain types of relationships. You really want to contextualize your traversal, so it meets the semantics of your ultimate problem.

PwC: Do users struggle with overly dense graphs?

MR: Yes, they do. It’s the God node problem. Network science papers and graph theory papers have examined this problem. For example, we had a project with a customer who was parsing arbitrary text. They were looking at people communicating, and they were creating links between two words that both occurred in the same text. We realized that the word “time” became this super node. Everything linked to “time,” and when everything links to “time,” there is no information in the concept of time. With too many linkages, there’s no information. But with no linkages, there is also no information. You want to have connectivity, but not too much connectivity. You really want to have contextualized links between your nodes and various levels (or groupings) of nodes, because that will give you a more accurate representation of the world — where there are structures within structures. If everything is connected to everything in every possible way, there is no form, and that is not an accurate representation of the reality that we share (though at some level of awareness, it is correct).

To have a deeper conversation about remapping the database landscape, please contact:

Gerard Verweij
Principal and US Technology Consulting Leader
+1 (617) 530 7015
[email protected]

Chris Curran
Chief Technologist
+1 (214) 754 5055
[email protected]

Oliver Halter
Principal, Data and Analytics Practice
+1 (312) 298 6886
[email protected]

Bo Parker
Managing Director, Center for Technology and Innovation
+1 (408) 817 5733
[email protected]

About PwC’s Technology Forecast

Published by PwC’s Center for Technology and Innovation (CTI), the Technology Forecast explores emerging technologies and trends to help business and technology executives develop strategies to capitalize on technology opportunities. Recent issues of the Technology Forecast have explored a number of emerging technologies and topics that have ultimately become many of today’s leading technology and business issues. To learn more about the Technology Forecast, visit www.pwc.com/technologyforecast.

© 2015 PricewaterhouseCoopers LLP, a Delaware limited liability partnership. All rights reserved. PwC refers to the US member firm, and may sometimes refer to the PwC network. Each member firm is a separate legal entity. Please see www.pwc.com/structure for further details. This content is for general information purposes only, and should not be used as a substitute for consultation with professional advisors.