www.pwc.com/technologyforecast

Technology Forecast: Remapping the database landscape
Issue 1, 2015

In database evolution, two directions of development are better than one

By Alan Morrison

NoSQL database technology is maturing, but the newest Apache analytics stacks have triggered another wave of database innovation.

Previous articles in Remapping the database landscape explore the nooks and crannies of NoSQL databases,[1] how they are best used, and their benefits. This article examines the business implications of NoSQL and the evolution of non-relational database technology in big data analytics platforms. It concludes the PwC Technology Forecast 2015, Issue 1, which looks at the innovations creating a sea change in database technology.

NoSQL was a response to the scaling and flexibility demands of interactive online technology. Without the growth of an interactive online customer environment, there wouldn't be a need for new database technology. The use of NoSQL especially benefits three steps in a single business process:

1. Observe and diagnose problems customers are having.
2. Respond quickly and effectively to changes in customers' behavior and thus serve them better.
3. Repeat steps 1 and 2 to improve responsiveness continually.

New distributed database technologies essentially enable better, faster, and cheaper feedback loops that can be installed in more places online to improve responsiveness to customers. Consider these examples described in depth in the previous five articles of this series:

Problem observed: E-commerce website users searching for 17-inch laptops retrieved results that had nothing to do with laptops.
Response: Codifyd indexes massive product catalogs by each individual entity, enabling faceted search.
New capability added: Enriched data description at scale, made possible by NoSQL document databases.
Payoff: Users find the products they're looking for and are more likely to buy at the site they've just searched.[2]

Problem observed: Developers had trouble using conventional electronic medical record databases for longitudinal studies.
Response: Avera feeds the EMR data into a bridging database for longitudinal studies.
New capability added: Long-term studies of patient population outcomes, made easier and faster by using NoSQL document databases.
Payoff: The hospital chain serves patients better and shifts to a pay-for-performance business model.[3]

Problem observed: Pharma companies had trouble finding experts on new drugs in the pipeline.
Response: Zephyr Health continually integrates thousands of data sets containing expertise location information.
New capability added: Comprehensive integration, search, and retrieval via NoSQL graph databases.
Payoff: Companies quickly find more suitable experts and don't overload them.[4]

Problem observed: Advertisers couldn't assess web inventory at scale.
Response: GumGum searches and categorizes the web photo inventory at thousands of publisher sites.
New capability added: High-volume image recognition, categorization, and storage via NoSQL wide-column stores.
Payoff: The startup pioneered in-image advertising, a new business model.[5]

Problem observed: A room reservation website for a hotel chain consortium faced scaling challenges.
Response: Room Key handles high-volume streams of transactional data, thanks to immutable database technology.
New capability added: Decoupled writes and reads, which reduce dependencies and wait times and enable the site to record an average of 1.5 million events every 24 hours and handle peak loads of 30 events per second.
Payoff: The site scales, improving space utilization and monetization across the consortium.[6]
[1] Structured query language, or SQL, is the dominant query language associated with relational databases. NoSQL stands for "not only structured query language." In practice, the term NoSQL is used loosely to refer to non-relational databases designed for distributed environments, rather than the associated query languages. PwC uses the term NoSQL, despite its inadequacies, to refer to non-relational distributed databases because it has become the default term of art.
[2] "Solving a familiar e-commerce search problem with a NoSQL document store," PwC Technology Forecast 2015, Issue 1, http://www.pwc.com/us/en/technology-forecast/2015/remapping-database-landscape/interviews/mark-unak-sanjay-agarwal-interview.jhtml.
[3] "Using document stores in business model transformation," PwC Technology Forecast 2015, Issue 1, http://www.pwc.com/us/en/technology-forecast/2015/remapping-database-landscape/features/document-stores-business-model-transformation.jhtml.
[4] "The promise of graph databases in public health," PwC Technology Forecast 2015, Issue 1, http://www.pwc.com/us/en/technology-forecast/2015/remapping-database-landscape/features/public-health-graph-databases.jhtml.
[5] "Scaling online ad innovations with the help of a NoSQL wide-column database," PwC Technology Forecast 2015, Issue 1, http://www.pwc.com/en_US/us/technology-forecast/2015/remapping-database-landscape/interviews/vaibhav-puranik-ken-weiner-interview.jhtml.
[6] "The rise of immutable data stores," PwC Technology Forecast 2015, Issue 1, http://www.pwc.com/us/en/technology-forecast/2015/remapping-database-landscape/features/immutable-data-stores-rise.jhtml.

Although NoSQL is maturing, advances in database technology promise new forms of heterogeneity in application platforms. For mobile and web developer teams, polyglot persistence and NoSQL heterogeneity have been the watchwords. For integrators and data scientists, unified big data platforms that have had a primarily analytic slant are gaining additional general utility and making possible a new class of data-driven, more operational applications. At the same time, more NoSQL databases and database functionality are being incorporated into platforms.

Operational and analytic directions of evolution

Two separate directions of development have existed for decades in big data architecture. The first was on the developer side and began with operational data, such as the data generated by large e-commerce sites. E-commerce developers used MySQL or other relational databases for years. But more and more, they've found value in non-relational NoSQL databases.

"If I use an OLTP [online transaction processing] database, such as Oracle, the result will be data spread across several tables. So when I write the transaction, I must be worried about who might be trying to read it while I'm writing it," says Ritesh Ramesh, chief technologist for PwC's global data and analytics organization.

"In the case of NoSQL," Ramesh points out, "consistency is implicit, because you'll use the order ID as a key to write the entire receipt into the associated value field. It's almost like you're writing the transaction in a single action, because that's how you're going to retrieve it. It's already ID'd and date-stamped. Enterprises will soon figure out that NoSQL delivers implicit consistency."[7]
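To make the pattern Ramesh describes concrete, here is a minimal sketch, assuming a Redis key-value store accessed through the redis-py client; the key scheme and order fields are hypothetical illustrations, not drawn from any company profiled in this series.

```python
import json
from datetime import datetime, timezone

import redis  # redis-py client; assumes a Redis server on localhost

r = redis.Redis(host="localhost", port=6379)

# Write the entire receipt as one value, keyed by the order ID. The whole
# transaction lands in a single write, already ID'd and date-stamped,
# in exactly the shape it will later be retrieved in.
order_id = "order:20150708:12345"  # hypothetical key scheme
receipt = {
    "order_id": order_id,
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "customer": "c-98765",
    "lines": [
        {"sku": "laptop-17in", "qty": 1, "price": 899.00},
        {"sku": "laptop-bag", "qty": 1, "price": 39.00},
    ],
    "total": 938.00,
}
r.set(order_id, json.dumps(receipt))

# Reading it back is a single lookup by the same key: no joins across
# tables, and no partially written rows for a concurrent reader to see.
stored = json.loads(r.get(order_id))
print(stored["total"])
```

Because the receipt is written and read as a single value, the store never exposes a half-written order to a concurrent reader, which is the implicit consistency Ramesh refers to.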
Relational databases continue to be good for analytics of structured data that has a stable data model, but more and more, web and mobile app developers at retailers, for example, benefit from the time-to-market advantages of a key-value or a document store. "NoSQL doesn't need a normalized data model, which makes it possible for developers to focus on building solutions and to own the process of storing and retrieving data themselves," Ramesh says. Less time is consumed throughout the development and deployment process. And if developers want a flexible but comprehensive modeling capability to support a complex service that will need to scale, then graph stores, either standalone or embedded in a platform, provide an answer.

The latest twist in non-relational database technologies is the rise of so-called immutable databases such as Datomic, which can support high-volume event processing. Immutability isn't a new concept, but decoupling reads and writes does make sense from a scalability perspective. Declarative, functional programming is more common in these applications that process many events. Immutable, log-centric data stores also lend themselves to microservices architectures and continuous delivery goals.[8]
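A toy sketch can make that decoupling concrete. This is not Datomic's API, just a minimal append-only store in the same spirit: writers only ever append immutable facts, and readers fold the log into a view as of any point in time.

```python
from collections import defaultdict

# A minimal illustration of an immutable, log-centric store (a toy, not
# Datomic): writes only ever append facts; nothing is updated in place.
log = []

def append_fact(entity, attribute, value, tx):
    """Record a new immutable fact; history is never overwritten."""
    log.append({"entity": entity, "attribute": attribute,
                "value": value, "tx": tx})

def current_view(as_of=None):
    """Readers fold the log into a point-in-time view, fully decoupled
    from writers: they can replay the log as of any transaction."""
    view = defaultdict(dict)
    for fact in log:
        if as_of is None or fact["tx"] <= as_of:
            view[fact["entity"]][fact["attribute"]] = fact["value"]
    return view

append_fact("room-42", "status", "available", tx=1)
append_fact("room-42", "status", "booked", tx=2)

print(current_view(as_of=1)["room-42"]["status"])  # available
print(current_view()["room-42"]["status"])         # booked
```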
The second direction of database evolution has been Hadoop. Some form of Hadoop has existed for a decade,[9] and the paradigm is promising for large-scale discovery analytics. As the next section will explain, Hadoop has evolved, and more complete big data architectures are emerging in open source and in the startup community. The toolsets and the elements of these newest stream-plus-batch processing platforms aren't yet mature, but the goal is to create a single architecture for stream, analytics, and transaction processing. As these vectors of database evolution come closer together, two types of hybrids are appearing: hybrid multi-model databases and hybrid stream-plus-batch analytics stacks that have more operational capabilities.

[7] "The realities of polyglot persistence in mainstream enterprises," PwC Technology Forecast 2015, Issue 1, http://www.pwc.com/us/en/technology-forecast/2015/remapping-database-landscape/interviews/ritesh-ramesh-interview.jhtml.
[8] "The rise of immutable data stores," PwC Technology Forecast 2015, Issue 1, http://www.pwc.com/us/en/technology-forecast/2015/remapping-database-landscape/features/immutable-data-stores-rise.jhtml.
[9] Hadoop is a low-cost, highly scalable distributed computing platform. See "Making sense of Big Data," PwC Technology Forecast 2010, Issue 3, http://www.pwc.com/us/en/technology-forecast/2010/issue3/index.jhtml, for more background on Hadoop.

[Figure: Distributed data architecture timelines and database evolution. One track shows the siloed OLTP/OLAP web app database paradigms: MySQL (1995), NoSQL (various, 2009), NewSQL (various, 2011), and immutable stores (Datomic and others, 2013). The other shows ad hoc batch/microbatch analytics and the unsiloed data lake paradigm: GFS and MapReduce (Google, 2003), HDFS and MapReduce (Yahoo, 2005), Spark and Mesos (UC Berkeley, 2009), YARN (Hortonworks, 2011), the Kappa architecture (LinkedIn, 2013), the Lambda architecture (Twitter, 2013), and BDAS (UC Berkeley, 2015).]

Hybrids and the polyglot-driven evolution of NoSQL

It's been six years since Scott Leberknight, then a chief architect and co-founder of Near Infinity, coined the term polyglot persistence. It is the database equivalent of polyglot programming, a term Neal Ford coined in 2006 to describe using the right programming language for the right job.[10] As of July 2015, DB-Engines listed 279 databases on its monthly ranking website.[11] Of those, 128 were NoSQL key-value, document, RDF graph, property graph, or wide-column databases, and 35 were object-oriented, time-series, content, or event databases. The remaining 116 consisted of relational, multivalue, and search-specific data stores.

Hybrid operational/analytic data technology stacks

Database proliferation continues, and the questions become: Will polyglot persistence become the norm? Or will efforts to unify distributed data stores counter the polyglot trend? Mainstream enterprises often try to minimize the number of databases their developers use so as not to overwhelm their data management groups. In August 2014, Mike Bowers, enterprise data architect at the Church of Jesus Christ of Latter-day Saints (LDS), described the chilly reception LDS IT leadership initially gave MarkLogic. Over time, Bowers and the rest of the data management staff grew to appreciate the value of a NoSQL alternative in the mix, but they drew the line at a single hybrid NoSQL database. Afterward, rogue instances of other NoSQL databases started to appear that did not gain IT support.[12]

[10] "Polyglot Persistence," NFJS Lone Star Software Symposium, Austin, Texas, July 10–12, 2009, http://www.nofluffjuststuff.com/conference/austin/2009/07/session?id=14357, accessed June 2, 2015.
[11] "DB-Engines Ranking," DB-Engines, http://db-engines.com/en/ranking, accessed July 8, 2015.
[12] Mike Bowers, "AM2: Modeling for NoSQL and SQL Databases," presentation at NoSQL Now, August 19, 2014, http://nosql2014.dataversity.net/sessionPop.cfm?confid=81&proposalid=6356, accessed June 2, 2015.

NoSQL[13] (non-relational, distributed) databases began as a more scalable alternative to MySQL and the other relational databases that developers had been using as the basis for web applications. Use cases for these databases tended to have a near-term operational or a tactical analytics slant. So far, the most popular NoSQL databases for these use cases have been single-model key-value, wide-column, document, and graph stores. Single-model databases offer simplicity for app developers who have one main use case in mind. In July 2015, the DB-Engines popularity analysis listed 12 single-model NoSQL databases among the top 50 databases; MongoDB, Cassandra, and Redis were in the top 10.[14]

Hybrid multi-model databases emerged in the early 2010s. Examples include the key-value and document store DynamoDB, the document and graph store MarkLogic, and the document and graph store OrientDB. In July 2015, these three multi-model databases ranked in the top 50 in popularity, according to DB-Engines. Multi-model databases will become more common. Cassandra distributor DataStax, for example, plans to offer a graph database successor to Aurelius Titan as a part of DataStax Enterprise.[15] MongoDB points out that with its version 3.2, some graph traversal and relational joins will be possible with the help of the $lookup operator, which is used to bring data in from other collections.
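For illustration, here is a minimal sketch of the $lookup stage in MongoDB 3.2's aggregation pipeline, using the pymongo client; the database, collection, and field names are hypothetical.

```python
from pymongo import MongoClient  # assumes MongoDB 3.2+ and pymongo

client = MongoClient("localhost", 27017)
db = client["shop"]  # hypothetical database and collections

# A left outer join expressed in the aggregation pipeline: for each order,
# $lookup pulls in the matching customer documents from another collection.
pipeline = [
    {"$lookup": {
        "from": "customers",          # collection to join against
        "localField": "customer_id",  # field in the orders collection
        "foreignField": "_id",        # field in the customers collection
        "as": "customer_docs",        # joined documents land in this array
    }},
    {"$unwind": "$customer_docs"},
    {"$project": {"total": 1, "customer_docs.name": 1}},
]

for doc in db.orders.aggregate(pipeline):
    print(doc)
```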
In general, MongoDB uses a single physical data model but can expose multiple logical data models via its query engine.

Will the adoption of hybrids slow the growth of polyglot persistence? Given the pace of new database introduction and the increasing popularity of single-model databases, such a scenario seems unlikely. But that doesn't mean the new unified stacks won't prove useful.

Database technology in new application stacks

Previous articles in this PwC Technology Forecast series didn't focus much on relational databases, because the new capabilities in NoSQL databases, such as data modeling features, scalability, and ease of use, are the key disruptors. With the exception of PostgreSQL, an object-relational database that can handle tables as large as 32 terabytes, the top NoSQL databases ranked by the DB-Engines popularity analysis saw the highest growth of any databases ranked in the top 10: MongoDB grew nearly 49 percent year over year, Redis grew 26 percent, and Cassandra grew 31 percent.[16]

Relational technology is well understood, including the so-called NewSQL technologies that seek to address the shortfalls of conventional web databases in distributed computing environments. And relational technology continues to be the default. As of July 2015, most database users think primarily in terms of tightly linked tables and up-front schemas. Of the top 30 databases ranked in DB-Engines, 17 relational databases received 86 percent of the ranking points. In contrast, nine NoSQL databases received 11 percent of the points. The dedicated search engines Elasticsearch, Solr, and Splunk (counted as databases in the DB-Engines analysis) together received 3 percent of the points.

NoSQL is not the only growth area. Excluded from the DB-Engines ranking are Hadoop and other comparable distributed file systems, such as Ceph and GlusterFS, and their associated data processing engines. These systems, as explained later, are becoming the foundation for unified analytics and operational platforms that incorporate not only other database functions but also stream and batch processing functions in a single environment.

[13] Structured query language, or SQL, is the dominant query language associated with relational databases. NoSQL stands for "not only structured query language." In practice, the term NoSQL is used loosely to refer to non-relational databases designed for distributed environments, rather than the associated query languages. PwC uses the term NoSQL, despite its inadequacies, to refer to non-relational distributed databases because it has become the default term of art. See the section "Database evolution becomes a revolution" in the article "Enterprises hedge their bets with NoSQL databases," PwC Technology Forecast 2015, Issue 1, http://www.pwc.com/us/en/technology-forecast/2015/remapping-database-landscape/features/enterprises-nosql-databases.jhtml, for more information on relational versus non-relational database technology.
[14] "DB-Engines Ranking," DB-Engines, http://db-engines.com/en/ranking, accessed July 8, 2015.
[15] "How enterprise graph databases are maturing," PwC interview with Martin Van Ryswyk and Marko Rodriguez of DataStax, PwC Technology Forecast 2015, Issue 1, http://www.pwc.com/us/en/technology-forecast/2015/remapping-database-landscape/interviews/martin-van-ryswyk-marko-rodriguez-interview.jhtml.
[16] "DB-Engines Ranking," DB-Engines, http://db-engines.com/en/ranking, accessed July 8, 2015.
Some unified analytics and operational platforms have NoSQL at their core. These unified platforms offer the ability to close the big data feedback loop for the first time, enabling near-real-time responsiveness, even in legacy systems that previously seemed hopelessly outdated and unresponsive.

The Berkeley Data Analytics Stack

Hadoop data infrastructure has been the starting point for many of these unified big data frameworks. The Berkeley Data Analytics Stack (BDAS), an open-source framework created over several years at the University of California, Berkeley, AMPLab (Algorithms, Machines, and People) in collaboration with industry, takes the batch processing of Hadoop, accelerates it with the help of in-memory technologies, and places it in an analytics framework optimized for machine learning.

The capabilities of BDAS have been paying off in notable ways for more than a year. Consider the case of Joshua Osborn, a 14-year-old with an immune system deficiency who was suffering from brain swelling in 2014. Doctors couldn't diagnose the cause of the swelling by using conventional bioinformatics systems. Eventually they turned to an application the AMPLab pioneered called the Scalable Nucleotide Alignment Program (SNAP), which runs on BDAS. Researchers at the University of California, San Francisco, took the available DNA from the patient, sequenced 3 million fragments of genomic information within two days, and put aside the human genome information with the help of SNAP. Then they matched the sequences that were left with the sequences of known pathogens within 90 minutes.[17] It turned out Joshua was infected with Leptospira, a potentially lethal bacterium that had caused his brain to swell. Doctors gave him doses of penicillin, which immediately reduced the swelling. Two weeks later, Joshua was walking, and soon he fully recovered.

At the heart of BDAS is Apache Spark, which serves as a more accessible and accelerated in-memory replacement for MapReduce, the data processing engine for Hadoop. Key to Spark is the concept of a resilient distributed dataset, or RDD. An RDD is an abstraction that allows data to be manipulated in memory across a cluster. Spark tracks the lineage of data transformations and can rebuild data sets without replication if an RDD suffers a data loss.[18] The input and lineage data can be stored in the Hadoop Distributed File System (HDFS) or a range of databases. RDDs are one reason Spark can outperform MapReduce in traditional Hadoop clusters. In the 2014 Daytona GraySort competition, Apache Spark sorted 100 terabytes at 4.27 terabytes per minute on 207 nodes. In contrast, the 2013 winner was Yahoo, whose 2,100-node Hadoop cluster sorted 102 terabytes at 1.42 terabytes per minute.[19]
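A short PySpark sketch illustrates the lineage idea: transformations are recorded lazily, an action triggers computation, and a lost partition can be recomputed from its recorded parents rather than restored from a replica. The data here is synthetic.

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-lineage-sketch")

# An RDD is built as a lineage of transformations, not as materialized
# copies: each step records *how* to compute its partitions from its
# parent, so a lost partition can be recomputed on demand.
events = sc.parallelize(range(1000000))        # base RDD
squares = events.map(lambda x: x * x)          # lineage step: map
evens = squares.filter(lambda x: x % 2 == 0)   # lineage step: filter

evens.cache()  # keep this dataset in memory across the cluster

# Nothing has executed yet; actions such as count() trigger the pipeline.
print(evens.count())
print(evens.toDebugString())  # prints the recorded lineage graph

sc.stop()
```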
Spark uses HDFS and other Hadoop-compatible disk-based file systems for disk storage. Tachyon, which the AMPLab announced in 2014, augments the capabilities of these file systems by allowing data sharing in memory, another way to reduce latency in complex machine-learning pipelines.[20] In 2015, the AMPLab added Velox to BDAS. Velox allows BDAS to serve up predictions generated by machine-learning models and to manage the model retraining process, completing the model-driven feedback loop that Spark begins.

[17] Carl Zimmer, "In a First, Test of DNA Finds Root of Illness," The New York Times, June 4, 2014, http://www.nytimes.com/2014/06/05/health/in-first-quick-dna-test-diagnoses-a-boys-illness.html, accessed June 4, 2015.
[18] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica, "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing," NSDI 2012, http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf, accessed June 4, 2015.
[19] Daytona GraySort Benchmark, http://sortbenchmark.org/, accessed June 4, 2015.
[20] Haoyuan Li, "AMP Camp 5: Tachyon – Haoyuan Li," UC Berkeley AMPLab video, https://www.youtube.com/watch?v=8yPdhIbfP7A, accessed June 4, 2015.

According to Michael Franklin, head of the AMPLab and chair of the computer science department at the University of California, Berkeley, "If you look at the architecture diagram [included here], Velox sits right next to Spark in the middle of the architecture. The whole point of Velox is to handle more operational processes. It will handle data that gets updated as it is available, rather than loading data in bulk, as is common when it's analyzed. Our models also get updated on a real-time basis when using computers that learn as they process data." In other words, an entire test-driven, machine-learning-based services environment could be built using this stack.

[Figure: The missing piece in BDAS. The BDAS architecture diagram shows BlinkDB, MLbase (training), Spark Streaming, GraphX, Spark SQL, and MLlib layered on Spark, with Velox providing management and serving alongside, all running over Tachyon and Mesos with HDFS, S3, and other storage underneath. Source: Dan Crankshaw, UC Berkeley AMPLab, 2015, http://www.slideshare.net/dscrankshaw/velox-at-sf-data-mining-meetup]

Kafka- and Samza-based unified big data platforms

Interestingly, while BDAS relies heavily on in-memory technology up front and maximizes the use of random-access memory (RAM) to reduce latency in various parts of the stack, the proponents of platforms that use Apache Kafka and Apache Samza think a disk-based storage or file-system-first approach is best, particularly in a log-file-centric system such as Kafka. What's stored on disk gets cached in memory without duplication, and the logic that manages the synchronization of the two resides entirely in the operating system. The main-memory-first approach, they assert, risks greater memory inefficiencies, slowness, and unreliability because of duplication and a lack of headroom.[21]

Apache Kafka and Apache Samza were born of LinkedIn's efforts to move to a less monolithic, microservices-based system architecture. Kafka is an event-log-based messaging system designed to maximize throughput and scalability and minimize dependencies. Services publish or subscribe to the message stream in Kafka, which can handle "hundreds of megabytes of reads and writes per second."[22] Samza processes the message stream and enables decoupled views of the writes in Kafka. Samza and Kafka together constitute a distributed, log-centric, immutable messaging queue and data store that can serve as the center of a big data analytics platform.[23]
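Here is a minimal sketch of that publish/subscribe pattern, using the kafka-python client; the topic name, broker address, and event fields are placeholders.

```python
import json

from kafka import KafkaProducer, KafkaConsumer  # kafka-python client

# A service publishes events to a named topic; Kafka appends them to a
# durable, replayable log rather than updating state in place.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # placeholder broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("page-views", {"user": "u-123", "url": "/laptops/17-inch"})
producer.flush()

# Any number of downstream services subscribe independently; each keeps
# its own offset into the log, which is what decouples readers from writers.
consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # new consumers can replay from the start
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    print(message.offset, message.value)
    break  # sketch only; a real consumer would keep polling
```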
[21] "4.2 Persistence," Apache Kafka, https://kafka.apache.org/08/design.html, accessed June 5, 2015.
[22] Apache Kafka, http://kafka.apache.org/, accessed June 4, 2015.
[23] "The rise of immutable data stores," PwC Technology Forecast 2015, Issue 1, http://www.pwc.com/us/en/technology-forecast/2015/remapping-database-landscape/features/immutable-data-stores-rise.jhtml.

But many pieces of that platform are only now being put into place, including serving databases and associated querying capabilities. Confluent is a stream processing platform startup founded by former LinkedIn employees who had worked on Kafka, Samza, and other open-source projects. Kafka can synchronize the data from various sources, and Confluent has envisioned how those data sources could feed into a Kafka-centric enterprise architecture and how stream analytics could be accomplished with the help of Apache Flink, an emerging alternative to Apache Spark that claims true stream and batch processing using the same engine.[24]

Others have also built their own stacks using Kafka and Samza. Metamarkets, a big data analytics service,[25] has built what it calls a real-time analytics data (RAD) stack[26] that includes Kafka, Samza, Hadoop, and its own open-sourced Druid time-series data store to query the Kafka data and serve up the results. Metamarkets says its RAD stack can ingest a million events per second and query trillions of events, but it has found it necessary to instrument its stack and establish a separate monitoring data cluster to keep such a high-throughput system in good working order.

The Confluent and Metamarkets platforms could be considered implementations of what have been called the Kappa and Lambda architectures. In the Kappa architecture, stream and batch coexist within the same infrastructure. Confluent asserts that the separate stream processing pipeline originally specified in the Lambda architecture was necessary only because stream processing was weak at the time. Now stream and batch can be together. Metamarkets continues to use a Lambda architecture, but it has recently added Samza to the mix.

[Figure: Kappa architecture: A log-centric stream processing system that can also be used for batch processing. Events feed a replayable log; the log drives both a speed (real-time) layer and batch computation (Hadoop), and results are exposed through a serving layer. Source: www.codefutures.com]

[24] Kostas Tzoumas and Stephan Ewen, "Real-time stream processing: The next step for Apache Flink," Confluent blog, May 6, 2015, http://blog.confluent.io/2015/05/06/real-time-stream-processing-the-next-step-for-apache-flink/, accessed June 4, 2015.
[25] "The nature of cloud-based data science," PwC interview with Mike Driscoll of Metamarkets, PwC Technology Forecast 2012, Issue 1, http://www.pwc.com/us/en/technology-forecast/2012/issue1/interviews/interview-mike-driscoll-ceo-metamarkets.jhtml.
[26] "A RAD Stack: Kafka, Storm, Hadoop, and Druid," August 27, 2014, http://lambda-architecture.net/stories/2014-08-27-kafka-storm-hadoop-druid/, and "Dogfooding with Druid, Samza, and Kafka: Metametrics at Metamarkets," June 3, 2015, https://metamarkets.com/2015/dogfooding-with-druid-samza-and-kafka-metametrics-at-metamarkets/, accessed June 5, 2015.
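The Kappa claim, that a separate batch pipeline becomes unnecessary once the log is replayable, can be illustrated with a small sketch; this is a toy, not Confluent's or Metamarkets' code. The same processing function serves live events and batch reprocessing, because batch is just a replay of the log from an earlier offset.

```python
# A toy illustration of the Kappa idea: one processing function covers
# both "stream" and "batch," because both are reads of the same
# replayable log, just from different offsets.

log = []  # stands in for a durable Kafka topic

def process(event, view):
    """The single processing path: fold one event into the serving view."""
    view[event["page"]] = view.get(event["page"], 0) + 1
    return view

def replay(from_offset=0):
    """Batch = replaying the log from the beginning (or any offset)."""
    view = {}
    for event in log[from_offset:]:
        view = process(event, view)
    return view

# Live stream = applying the same function as events arrive.
serving_view = {}
for event in [{"page": "/laptops"}, {"page": "/bags"}, {"page": "/laptops"}]:
    log.append(event)
    serving_view = process(event, serving_view)

# To change the logic, redeploy process() and replay the log;
# no separate batch codebase is needed.
assert replay() == serving_view
print(serving_view)  # {'/laptops': 2, '/bags': 1}
```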
In either case, these are two very high-throughput analytics platforms designed for cost-effective scalability and flexibility. Metamarkets has a number of online and mobile advertising company clients, including OpenX, an online ad exchange that handles hundreds of billions of transactions each month. OpenX connects buyers and sellers in this online exchange. The Metamarkets service provides visibility into the ad exchange marketplace and includes custom dashboards built for OpenX on the front end. The Lambda-architecture-based back end queues up the streams to be analyzed, provides the processing, and serves up the results.[27]

Conclusion: More capable databases and big data platforms

NoSQL databases emerged in the late 2000s out of necessity. As big data grew, the need arose to spread data processing across commodity clusters to make processing timely and cost-effective. As big data became more complex and interactive, developers and analysts needed access to a range of data models. As relational database technology improved, NewSQL responded to retake some of the territory lost earlier. Hybrids and immutable data stores are further expanding the choices available to enterprises.

But the mid-2010s saw expansion in a different, unanticipated direction: entirely new data-driven application stacks built on distributed file systems, particularly HDFS. Stacks such as BDAS promise to close the feedback-response loop in a machine-learning environment. Separate databases will continue to be used in these environments to provide persistence. Theoretically, machines doing unsupervised learning could continuously refine a model given a closed feedback loop. Doctors confronting illnesses that are difficult to diagnose could turn to a standard process of genomics testing to identify pathogens in a patient's body, as described in the University of California, San Francisco, example. Similarly, enterprises could monitor operational processes, focus on those of particular concern, and use machine learning to identify elements that could cause disruption, helping management diagnose a problem and prescribe a solution.

Just as BDAS has become more complete with the addition of Velox, other alternative stacks are emerging, such as the message-log-centric Kafka-plus-Samza environment. Stream processing is rapidly maturing with dataflow engines such as Apache Flink, so organizations can create very high-capacity online transactional environments that use the same cluster and engine for Hadoop-style batch analytics. What's possible with the newest technology is the ability to close the big data feedback loop for the first time, enabling near-real-time responsiveness, even in legacy systems that previously seemed hopelessly outdated and unresponsive.

The distributed database can still be used on its own for a mobile or web app, but more and more, its real power comes when the data store is embedded as part of one of these newly built stacks. So-called deep or unsupervised machine learning, an upcoming PwC Technology Forecast research topic, pays off most when analytics and operational systems people work together to build one continuously improving, cloud-native system. The technology that makes such a system feasible no longer requires a forklift upgrade.

[27] "OpenX case study," Metamarkets, https://metamarkets.com/case-studies/openx/, accessed June 5, 2015.
To have a deeper conversation about remapping the database landscape, please contact:

Gerard Verweij
Principal and US Technology Consulting Leader
+1 (617) 530 7015
[email protected]

Chris Curran
Chief Technologist
+1 (214) 754 5055
[email protected]

Oliver Halter
Principal, Data and Analytics Practice
+1 (312) 298 6886
[email protected]

Bo Parker
Managing Director, Center for Technology and Innovation
+1 (408) 817 5733
[email protected]

About PwC's Technology Forecast

Published by PwC's Center for Technology and Innovation (CTI), the Technology Forecast explores emerging technologies and trends to help business and technology executives develop strategies to capitalize on technology opportunities. Recent issues of the Technology Forecast have explored a number of emerging technologies and topics that have ultimately become many of today's leading technology and business issues. To learn more about the Technology Forecast, visit www.pwc.com/technologyforecast.

About PwC

PwC US helps organizations and individuals create the value they're looking for. We're a member of the PwC network of firms in 157 countries with more than 195,000 people. We're committed to delivering quality in assurance, tax and advisory services. Find out more and tell us what matters to you by visiting us at www.pwc.com.

© 2015 PwC. All rights reserved. PwC refers to the US member firm or one of its subsidiaries or affiliates, and may sometimes refer to the PwC network. Each member firm is a separate legal entity. Please see www.pwc.com/structure for further details. This content is for general information purposes only, and should not be used as a substitute for consultation with professional advisors.