Technology Forecast: Remapping the database landscape, Issue 1, 2015
www.pwc.com/technologyforecast

Database futures: How Apache Spark[1] fits into a larger unified data architecture

Mike Franklin of the University of California, Berkeley, discusses the goals behind Spark and a more unified cloud-data ecosystem. Interview conducted by Alan Morrison, Bo Parker, and Bud Mathaisel.

Mike Franklin is Director of the Algorithms, Machines and People Lab (AMPLab) and Chair of the Computer Science Division at UC Berkeley.

PwC: What are you seeing in terms of the data technology that's coming online, and how is that changing?

MF: I was trained as a database person. Most of my research and work in industry has been along the lines of traditional relational databases, object-oriented databases, and so on. That's my perspective.

I think things are in the process of changing in a fundamental way. In the 30 years or so that I've been involved in this area in various ways, this is the biggest change I've seen since the relational database revolution. There wasn't a generally accepted alternative to the relational database in its heyday. You either figured out how to use it, or you were on your own. There were attempts over the years to change that, such as object-oriented databases, XML-oriented databases, and the like. None of them really got the traction that we're starting to see now in some of these newer systems.

More recently, the appreciation of the potential value in data has spread everywhere, across all industries. Basically, every department at UC Berkeley, for instance, is doing data-driven work. That's led to an explosion in the use cases and the types of data that people want to store and analyze. I'm not sure that the formats of the data have changed all that much, but the variety of the data that people want to store is what's really changed.

PwC: How has that impacted the nature of databases?
MF: Users need flexibility more than anything else, so you can take different approaches to answering your question. One approach is an idea expressed as "store first, schema later." The first step in a traditional database environment is to analyze your application and organization, then do a data design and schema design, where you figure out what each piece of data has to look like and how all of the data are interrelated. Only after you've gotten through that process can you start thinking about putting your system together and loading any data into it.

Now you just collect as much data as you can, because the alternatives in storage are incredibly cheap relative to what they used to be. So you just store everything you can get your hands on. Some people call this a data lake.[2] About ten years ago, my colleagues and I wrote a paper about something that we called dataspaces, which is exactly the same idea: you don't impose structure and don't require data to conform to a structure in order to be able to store it. With that concept, you store the data and then figure out how much structure you need in order to make sense of the data and do what you need to with it. The biggest gain in the flexibility of data management is that you can store anything, then try to make sense of it and work with it.

PwC: For the past couple of years, the assumption has been that you've got this heterogeneous environment and there are different tools that you can use, depending on what your needs are. You could use a document store here, a graph database there, while continuing to use a relational database for critical transaction support. Is there a more unified approach, or do we have to accept database heterogeneity long-term?

MF: From my perspective, it's obvious that heterogeneity is not the right way to do things.
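The "store first, schema later" idea Franklin describes can be sketched in plain Python, standing in for a data-lake pipeline. The record fields and amounts here are invented purely for illustration: raw records are stored exactly as they arrive, and structure is imposed only at read time, when a particular analysis needs it.

```python
import json

# Ingest raw records as-is ("store first"): no schema is imposed at load time.
raw_records = [
    '{"user": "ana", "amount": 12.5, "tags": ["new"]}',
    '{"user": "bo", "amount": "30"}',           # amount arrives as a string
    '{"user": "cy", "note": "no amount field"}',  # amount missing entirely
]

# Later, impose only as much structure as the analysis needs ("schema later").
def read_amount(line):
    """Extract a numeric amount if one can be recovered, else None."""
    record = json.loads(line)
    try:
        return float(record["amount"])
    except (KeyError, TypeError, ValueError):
        return None

# Nonconforming records are tolerated at read time, not rejected at load time.
amounts = [a for a in (read_amount(r) for r in raw_records) if a is not None]
total = sum(amounts)
```

A traditional schema-first pipeline would have rejected the second and third records at load time; here they are kept, and each analysis decides how strictly to interpret them.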
If you think about the history of the Spark project and the ecosystem we've built around it, Spark was originally a research project. We didn't worry too much about things that you would worry about if you were building a product from Day One. But our view of the Spark ecosystem is that you don't need all those isolated data systems. Once you've got the data into our Spark environment, you should be able to treat that data any way you want to. If you want to run SQL [structured query language] queries over it, we'll let you do that. If you want to look at the data in the form of a graph, we have a system called GraphX that will let you do that. If you want to use machine learning, along with data clustering and recommendation systems, we've got libraries that will let you do that. And if you want to write low-level code to do something to the data we haven't even imagined, you can do that, too.

When we created the Spark ecosystem, we had the advantage of not having to hit revenue targets to please impatient investors. We had the time to build out a comprehensive ecosystem. We did it in an open way, so that we're not the only ones contributing and extending the ecosystem for the benefit of all. Of course, any organization is limited by the technologies it has and knows how to use. But there's no technical reason why you need a dozen different systems to do a dozen different things. There's enough commonality in the patterns of data acquisition and analytics to enable more effective management of data with a single enterprisewide data management framework.

PwC: What about meeting the requirements of operational data, which tends to need more structure?
MF: Data structure poses interesting questions: When do you need it, and how much do you need? I look at structure as a spectrum, from completely unstructured data, like just a bag of bytes, all the way to a relational schema, or maybe even something more sophisticated in terms of structure, at the opposite extreme.

You can think of data structure options as a choice about what tradeoff you'll accept to meet your needs. On the unstructured side, you get incredible flexibility. You get the ability to bring in whatever data you find, keep it and, hopefully, do something useful with it. But as you move toward the increasingly structured side of data management, you gain more confidence about what your data mean and how that data can be applied to benefit your business. As you move to operational systems that house valuable data, where consistency in the structure and management of that data is the top priority, you have to give up flexibility in the way that you structure and manage the data. You apply rules and constraints to get more predictable results.

What's changed today is that the same system can support different points along the data-structure spectrum. With systems like Spark and some other NoSQL [not only structured query language] environments, you get to pick different points along the data-structure spectrum in the same environment. At least that's the goal, because I'm not sure that capability fully exists yet.

PwC: How does the Spark ecosystem address the operational aspect?

MF: Spark is focused on analytics, but we're in the process of building out capabilities that will be more operational. Spark is one component of something we're building called the Berkeley Data Analytics Stack [BDAS]. And Spark is the middle, or main, part of that stack. But we're building a new component of BDAS called Velox.
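The spectrum Franklin describes, from a bag of bytes to a fully declared schema, can be illustrated with a short Python sketch. The payload and the `OrderLine` schema here are hypothetical; each stage trades away flexibility in exchange for stronger guarantees about what the data means.

```python
import json
from dataclasses import dataclass

payload = b'{"sku": "A-100", "qty": 3}'

# Point 1: unstructured. A bag of bytes: anything can be stored, nothing is guaranteed.
raw = payload

# Point 2: semi-structured. Parse on read; fields may or may not be present,
# and consumers must cope with whatever shape arrives.
doc = json.loads(raw)
qty_guess = doc.get("qty", 0)

# Point 3: fully structured. A declared schema with constraints; violations
# are rejected up front, which buys predictability at the cost of flexibility.
@dataclass
class OrderLine:
    sku: str
    qty: int
    def __post_init__(self):
        if self.qty < 0:
            raise ValueError("qty must be non-negative")

line = OrderLine(sku=doc["sku"], qty=int(doc["qty"]))
```

Franklin's point is that systems like Spark aim to let all three points coexist over the same stored data, rather than forcing a separate system for each.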
And Velox, if you look at the architecture diagram [see below], sits right next to Spark in the middle of the architecture. The whole point of Velox is to handle more operational processes. It will handle data that is updated as it arrives, rather than loaded in bulk, as is common in analytics. Our machine learning models also get updated in real time as they process data.

PwC: What are the problems you're trying to solve on the operational side?

MF: We're trying to create a system that lets you move all your data from one environment to another as you go from the operational side to analytics. When we succeed, you'll be able to move your data in a closed loop between operations and analytics, so that analytics can directly inform your operations.

PwC: How do you support different user groups?

MF: The users of our tool include both front-line analysts in a security operations center and more advanced security investigators and incident handlers. Often, certain sensitive types of data are not available to the front-line analysts, but the more advanced investigators would be able to see all the data.

[Figure: The missing piece in BDAS. Architecture diagram of the Berkeley Data Analytics Stack, showing Velox (management and serving) alongside Spark, together with BlinkDB, Spark Streaming, MLbase, GraphX, Spark SQL, the ML library, Mesos, Tachyon, and HDFS/S3. Source: Dan Crankshaw, UC Berkeley AMPLab, http://www.slideshare.net/dscrankshaw/velox-at-sf-data-mining-meetup, 2015]

PwC: Some people think that business logic should not be contained in a database. For example, say someone decides not to order supplies from a specific vendor for fear of violating compliance regulations, even though database systems have long been used to assert compliance. Now the practice of asserting compliance through a database is being questioned because the proof of business logic is in its application, rather than inside a database.
Would you agree that putting business logic in the database itself was a bad idea to start with, something we should move away from?

MF: Your question makes me wonder if there's a third alternative: a shared system for business logic other than the database. Maybe the right idea all along was to pull business logic out of individual applications, so you had a single source of truth for business logic. In that case, putting business logic in the database would have been the wrong way to achieve that. Maybe there is an ideal shared repository for business logic that isn't the database. I can see that. You may have just given me my new research project.

PwC: What caused the rise of NoSQL? Is NoSQL enough?

MF: My view on the rise of NoSQL is that database systems require too much upfront work before you can get anything done. The first time I hand-typed an HTML [HyperText Markup Language] page and then pointed a browser at it, the page wasn't quite right, but a lot of the page looked okay. That was a revelation to me as a database guy, because if you did that to a database, it would just return a syntax error. It wouldn't show you anything until you did it perfectly right. The NoSQL systems are much more forgiving.

But at some point you run into a wall with NoSQL. The truth is that once you reach a certain stage, SQL is a really good tool for a lot of what people want to do with data. A flexible and incremental approach to structure is the most valuable. You start by loading whatever data you want. Just throw it in. Make it super easy. Then, as you need more structure and definitions to get the results you need, and comply with any mandatory procedures, you start imposing control on your data.

[1] Apache Spark is a big data processing engine originally conceived as an in-memory alternative to MapReduce. It can run in-memory workloads up to 100 times faster than Hadoop MapReduce.
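Franklin's contrast between forgiving and strict systems can be reproduced with Python's standard library alone: a malformed HTML page (an unclosed tag) still yields usable text, while a single typo in SQL yields nothing but a syntax error. This is a minimal illustration, using sqlite3 to stand in for a full database.

```python
import sqlite3
from html.parser import HTMLParser

# The forgiving side: a browser-style parser tolerates malformed markup.
class TextCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.text = []
    def handle_data(self, data):
        self.text.append(data)

collector = TextCollector()
collector.feed("<p>Hello <b>world</p>")   # <b> is never closed; no error raised
page_text = "".join(collector.text)       # the text still comes through

# The strict side: one misspelled keyword and the database shows nothing at all.
conn = sqlite3.connect(":memory:")
try:
    conn.execute("SELEC 1")               # misspelled SELECT
    strict_result = "ok"
except sqlite3.OperationalError:
    strict_result = "syntax error"
```

The incremental approach Franklin advocates sits between these poles: accept data leniently at first, then tighten constraints as the application's needs become clear.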
It can store results in any Hadoop-compatible file system or database, such as HDFS [Hadoop Distributed File System], HBase, and Cassandra, or in Amazon's Simple Storage Service [S3].

[2] See "Data lakes and the promise of unsiloed data" at http://www.pwc.com/us/en/technology-forecast/2014/cloud-computing/features/data-lakes.jhtml for more information on data lakes.

To have a deeper conversation about remapping the database landscape, please contact:

Gerard Verweij, Principal and US Technology Consulting Leader, +1 (617) 530 7015, [email protected]
Chris Curran, Chief Technologist, +1 (214) 754 5055, [email protected]
Oliver Halter, Principal, Data and Analytics Practice, +1 (312) 298 6886, [email protected]
Bo Parker, Managing Director, Center for Technology and Innovation, +1 (408) 817 5733, [email protected]

About PwC's Technology Forecast
Published by PwC's Center for Technology and Innovation (CTI), the Technology Forecast explores emerging technologies and trends to help business and technology executives develop strategies to capitalize on technology opportunities. Recent issues of the Technology Forecast have explored a number of emerging technologies and topics that have ultimately become many of today's leading technology and business issues. To learn more about the Technology Forecast, visit www.pwc.com/technologyforecast.

© 2015 PricewaterhouseCoopers LLP, a Delaware limited liability partnership. All rights reserved. PwC refers to the US member firm, and may sometimes refer to the PwC network. Each member firm is a separate legal entity. Please see www.pwc.com/structure for further details. This content is for general information purposes only, and should not be used as a substitute for consultation with professional advisors.