www.pwc.com/technologyforecast
Technology Forecast: Remapping the database landscape
Issue 1, 2015
In database evolution,
two directions of
development are
better than one
By Alan Morrison
NoSQL database technology is maturing, but the newest
Apache analytics stacks have triggered another wave of
database innovation.
Previous articles in Remapping the database
landscape explore the nooks and crannies of
NoSQL databases,1 how they are best used,
and their benefits. This article examines
the business implications of NoSQL and
the evolution of non-relational database
technology in big data analytics platforms.
It concludes the PwC Technology Forecast
2015, Issue 1, which looks at the innovations
creating a sea change in database technology.
NoSQL was a response to the scaling and
flexibility demands of interactive online
technology. Without the growth of an
interactive online customer environment,
there wouldn’t be a need for new database
technology. The use of NoSQL especially
benefits three steps in a single business
process:
1. Observe and diagnose problems customers are having.
2. Respond quickly and effectively to changes in customers' behavior and thus serve them better.
3. Repeat steps 1 and 2 to improve responsiveness continually.
New distributed database technologies
essentially enable better, faster, and cheaper
feedback loops that can be installed in more
places online to improve responsiveness
to customers. Consider these examples
described in depth in the previous five
articles of this series:
Problem observed: E-commerce website users searching for 17-inch laptops retrieved results that had nothing to do with laptops.
Response: Codifyd indexes massive product catalogs by each individual entity, enabling faceted search.
New capability added: Enriched data description at scale, made possible by NoSQL document databases.
Payoff: Users find the products they're looking for and are more likely to buy at the site they've just searched.2

Problem observed: Developers had trouble using conventional electronic medical record databases for longitudinal studies.
Response: Avera feeds the EMR data into a bridging database for longitudinal studies.
New capability added: Long-term studies of patient population outcomes, made easier and faster by using NoSQL document databases.
Payoff: The hospital chain serves patients better and shifts to a pay-for-performance business model.3

Problem observed: Pharma companies had trouble finding experts on new drugs in the pipeline.
Response: Zephyr Health continually integrates thousands of data sets containing expertise location information.
New capability added: Comprehensive integration, search, and retrieval via NoSQL graph databases.
Payoff: Companies quickly find more suitable experts and don't overload them.4

Problem observed: Advertisers couldn't assess web inventory at scale.
Response: GumGum searches and categorizes the web photo inventory at thousands of publisher sites.
New capability added: High-volume image recognition, categorization, and storage via NoSQL wide-column stores.
Payoff: The startup pioneered in-image advertising, a new business model.5

Problem observed: A room reservation website for a hotel chain consortium faced scaling challenges.
Response: Room Key handles high-volume streams of transactional data, thanks to immutable database technology.
New capability added: Decoupled writes and reads, which lead to reduced dependencies and wait times and enable the site to record an average of 1.5 million events every 24 hours and handle peak loads of 30 events per second.
Payoff: The site scales, improving space utilization and monetization across the consortium.6
1 Structured query language, or SQL, is the dominant query language associated with relational databases. NoSQL stands for not only
structured query language. In practice, the term NoSQL is used loosely to refer to non-relational databases designed for distributed
environments, rather than the associated query languages. PwC uses the term NoSQL, despite its inadequacies, to refer to non-relational distributed databases because it has become the default term of art.
2 “Solving a familiar e-commerce search problem with a NoSQL document store,” PwC Technology Forecast 2015, Issue 1, http://www.
pwc.com/us/en/technology-forecast/2015/remapping-database-landscape/interviews/mark-unak-sanjay-agarwal-interview.jhtml.
3 “Using document stores in business model transformation,” PwC Technology Forecast 2015, Issue 1, http://www.pwc.com/us/en/
technology-forecast/2015/remapping-database-landscape/features/document-stores-business-model-transformation.jhtml.
4 “The promise of graph databases in public health,” PwC Technology Forecast 2015, Issue 1, http://www.pwc.com/us/en/technology-forecast/2015/remapping-database-landscape/features/public-health-graph-databases.jhtml.
5 “Scaling online ad innovations with the help of a NoSQL wide-column database,” PwC Technology Forecast 2015, Issue 1, http://www.
pwc.com/en_US/us/technology-forecast/2015/remapping-database-landscape/interviews/vaibhav-puranik-ken-weiner-interview.jhtml.
6 “The rise of immutable data stores,” PwC Technology Forecast 2015, Issue 1, http://www.pwc.com/us/en/technology-forecast/2015/
remapping-database-landscape/features/immutable-data-stores-rise.jhtml.
Although NoSQL is maturing, advances in
database technology promise new forms of
heterogeneity in application platforms. For
mobile and web developer teams, polyglot
persistence and NoSQL heterogeneity have
been the watchwords. For integrators and
data scientists, unified big data platforms
that have primarily had a more analytics slant
are gaining additional general utility and
making possible a new class of data-driven,
more operational applications. At the same
time, more NoSQL databases and database
functionality are being incorporated into
platforms.
Operational and analytic
directions of evolution
Two separate directions of development have
existed for decades in big data architecture.
The first was on the developer side and
began with operational data, such as the
data generated by large e-commerce sites.
E-commerce developers used MySQL or other
relational databases for years. But more and
more, they’ve found value in non-relational
NoSQL databases. “If I use an OLTP [online
transaction processing] database, such as
Oracle, the result will be data spread across
several tables. So when I write the transaction,
I must be worried about who might be trying
to read it while I’m writing it,” says Ritesh
Ramesh, chief technologist for PwC’s global
data and analytics organization.
“In the case of NoSQL,” Ramesh points out,
“consistency is implicit, because you’ll use the
order ID as a key to write the entire receipt
into the associated value field. It’s almost
like you’re writing the transaction in a single
action, because that’s how you’re going to
retrieve it. It’s already ID’d and date-stamped.
Enterprises will soon figure out that NoSQL
delivers implicit consistency.” 7
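The pattern Ramesh describes, keying an entire receipt by order ID so the write is a single action, can be sketched in a few lines of Python. The in-memory store and the receipt fields below are hypothetical stand-ins, not any particular product's API:

```python
import json

# A minimal in-memory stand-in for a key-value store (hypothetical;
# a real system would be Redis, Riak, DynamoDB, and so on).
store = {}

def write_order(order_id, receipt):
    """Write the entire receipt as one value under the order-ID key.
    The write is a single action, so readers never see a partial order
    spread across half-written tables."""
    store[order_id] = json.dumps(receipt)

def read_order(order_id):
    """Retrieve the receipt exactly as it was written."""
    return json.loads(store[order_id])

write_order("order-1001", {
    "date": "2015-07-08",
    "items": [{"sku": "LAPTOP-17", "qty": 1, "price": 899.00}],
    "total": 899.00,
})
receipt = read_order("order-1001")
```

Because the receipt is stored the way it will be retrieved, there is no cross-table write for a concurrent reader to catch in an inconsistent state.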
Relational databases continue to be good
for analytics of structured data that has a
stable data model, but more and more, web
and mobile app developers at retailers, for
example, benefit from the time-to-market
advantages of a key-value or a document
store. “NoSQL doesn’t need a normalized
data model, which makes it possible for
developers to focus on building solutions and
to own the process of storing and retrieving
data themselves,” Ramesh says. Less time
is consumed throughout the development
and deployment process. And if developers
want a flexible but comprehensive modeling
capability to support a complex service that
will need to scale, then graph stores, either
standalone or embedded in a platform,
provide an answer.
The latest twist in non-relational database
technologies is the rise of so-called immutable
databases such as Datomic, which can
support high-volume event processing.
Immutability isn’t a new concept, but
decoupling reads and writes does make sense
from a scalability perspective. Declarative,
functional programming is more common in
these applications that process many events.
Immutable, log-centric data stores also lend
themselves to microservices architectures and
continuous delivery goals.8
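The decoupling that makes immutable, log-centric stores attractive can be shown with a minimal sketch, assuming nothing about Datomic's actual API: writes only append facts, and reads fold the log into a derived view.

```python
from collections import defaultdict

# Illustrative sketch of an immutable, log-centric store. Writes append
# facts to a log and never update in place; reads query a view derived
# from the log, so the two paths are decoupled.
log = []  # append-only list of (entity, attribute, value) facts

def append_fact(entity, attribute, value):
    log.append((entity, attribute, value))  # never overwrite

def current_view():
    """Fold the log into the latest value per (entity, attribute)."""
    view = defaultdict(dict)
    for entity, attribute, value in log:
        view[entity][attribute] = value
    return view

append_fact("room-42", "status", "available")
append_fact("room-42", "status", "booked")  # a new fact, not an update

view = current_view()
```

The reader's view reflects the latest facts, while the full history stays in the log for audit or replay.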
The second direction of database evolution
has been Hadoop. Some form of Hadoop has
existed for a decade,9 and the paradigm is
promising for large-scale discovery analytics.
As the next section will explain, Hadoop
has evolved, and more complete big data
architectures are emerging in open source and
in the startup community. The toolsets and
all the elements of these newest stream-plus-batch processing platforms aren't mature, but
the goal is to create a single architecture for
stream, analytics, and transaction processing.
As these vectors of database evolution come
closer together, there are two types of hybrids:
hybrid multi-model databases and hybrid
stream-plus-batch analytics stacks that have
more operational capabilities.
7 “The realities of polyglot persistence in mainstream enterprises,” PwC Technology Forecast 2015, Issue 1, http://www.pwc.com/us/en/
technology-forecast/2015/remapping-database-landscape/interviews/ritesh-ramesh-interview.jhtml.
8 “The rise of immutable data stores,” PwC Technology Forecast 2015, Issue 1, http://www.pwc.com/us/en/technology-forecast/2015/
remapping-database-landscape/features/immutable-data-stores-rise.jhtml.
9 Hadoop is a low-cost, highly scalable distributed computing platform. See “Making sense of Big Data,” PwC Technology Forecast 2010,
Issue 3, http://www.pwc.com/us/en/technology-forecast/2010/issue3/index.jhtml, for more background on Hadoop.
[Figure: Distributed data architecture timelines and database evolution. One timeline traces the siloed OLTP/OLAP web app database paradigm from MySQL (1995) through various NoSQL (2009) and NewSQL (2011) databases to immutable stores such as Datomic (2013). A second timeline traces ad hoc batch/microbatch analytics and the unsiloed data lake paradigm: GFS and MapReduce (Google, 2003), HDFS and MapReduce (Yahoo, 2005), Spark and Mesos (UC Berkeley, 2009), YARN (Hortonworks, 2011), the Kappa (LinkedIn, 2013) and Lambda (Twitter, 2013) architectures, and the BDAS architecture (UC Berkeley, 2015).]
Hybrids and the polyglot-driven evolution of NoSQL
It’s been six years since Scott Leberknight,
then a chief architect and co-founder of Near
Infinity, coined the term polyglot persistence.
It is the database equivalent of polyglot
programming, coined by Neal Ford in 2006
in reference to using the right programming
language for the right job.10 As of July 2015,
DB-Engines listed 279 databases on its
monthly ranking website.11 Of those, 128 were
NoSQL key-value, document, RDF graph,
property graph, or wide-column databases
and 35 were object-oriented, time-series,
content, or event databases. The remainder—116 databases—consisted of relational, multi-value, and search-specific data stores.
Hybrid operational/analytic
data technology stacks
Database proliferation continues, and the
questions become: Will polyglot persistence
become the norm? Or will efforts to unify
distributed data stores counter the polyglot
trend? Mainstream enterprises often try
to minimize the number of databases their
developers use so as not to overwhelm their
data management groups. In August 2014,
Mike Bowers, enterprise data architect at the
Church of Jesus Christ of Latter-day Saints
(LDS), described the chilly reception LDS IT
leadership initially gave MarkLogic. Over time,
Bowers and the rest of the data management
staff grew to appreciate the value of a NoSQL
alternative in the mix, but they drew the line
at a single hybrid NoSQL database. Afterward, rogue instances of other NoSQL databases that did not gain IT support started to appear.12
10 “Polyglot Persistence,” NFJS Lone Star Software Symposium, Austin, Texas, July 10–12, 2009,
http://www.nofluffjuststuff.com/conference/austin/2009/07/session?id=14357, accessed June 2, 2015.
11 “DB-Engines Ranking,” DB-Engines, http://db-engines.com/en/ranking, accessed July 8, 2015.
12 Mike Bowers, “AM2: Modeling for NoSQL and SQL Databases,” presentation at NoSQL Now, August 19, 2014,
http://nosql2014.dataversity.net/sessionPop.cfm?confid=81&proposalid=6356, accessed June 2, 2015.
NoSQL13 (non-relational, distributed)
databases began as a more scalable alternative
to MySQL and the other relational databases
that developers had been using as the basis
for web applications. Use cases for these
databases tended to have a near-term
operational or a tactical analytics slant. So far,
the most popular NoSQL databases for these
use cases have been single-model key-value,
wide-column, document, and graph stores.
Single-model databases offer simplicity for
app developers who have one main use case in
mind. In July 2015, the DB-Engines popularity
analysis listed 12 single-model NoSQL
databases among the top 50 databases;
MongoDB, Cassandra, and Redis were in
the top 10.14
Hybrid multi-model databases emerged in
the early 2010s. Examples include key-value
and document store DynamoDB, document
and graph store MarkLogic, and document
and graph store OrientDB. In July 2015, these
three multi-model databases ranked in the top
50 in popularity, according to DB-Engines.
Multi-model databases will become more
common. Cassandra distributor DataStax,
for example, plans to offer a graph database
successor to Aurelius Titan as a part of
DataStax Enterprise.15 MongoDB points out
that with its version 3.2, some graph traversal
and relational joins will be possible with the
help of the $lookup operator, used to bring
data in from other collections. In general,
MongoDB uses a single physical data model,
but can expose multiple logical data models
via its query engine.
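MongoDB's $lookup stage attaches matching documents from another collection as an array field, in the manner of a left outer join. A rough in-memory simulation of that behavior (not MongoDB's implementation; the collections and field names are invented):

```python
def lookup(docs, foreign, local_field, foreign_field, as_field):
    """Simulate MongoDB's $lookup stage: for each input document,
    attach every foreign document whose foreign_field matches the
    document's local_field, as an array under as_field."""
    joined = []
    for doc in docs:
        matches = [f for f in foreign
                   if f.get(foreign_field) == doc.get(local_field)]
        joined.append({**doc, as_field: matches})
    return joined

orders = [{"_id": 1, "product_id": "p1"}, {"_id": 2, "product_id": "p9"}]
products = [{"_id": "p1", "name": "17-inch laptop"}]

result = lookup(orders, products, "product_id", "_id", "product")
# Order 1 gains its matching product; order 2 gets an empty array,
# as in a left outer join.
```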
Will the adoption of hybrids slow the growth
of polyglot persistence? Given the pace of
new database introduction and the increasing
popularity of single-model databases, such a
scenario seems unlikely. But that doesn’t mean
the new unified stacks won’t prove useful.
Database technology in
new application stacks
Previous articles in this PwC Technology
Forecast series didn’t focus much on relational
databases, because the new capabilities in
NoSQL databases—such as data modeling
features, scalability, and ease of use—are
the key disruptors. With the exception of PostgreSQL, an object-relational database that can handle tables as large as 32 terabytes, the top NoSQL databases saw the highest growth of any databases ranked in the DB-Engines top 10: MongoDB grew nearly 49 percent year over year, Redis grew 26 percent, and Cassandra grew 31 percent.16
Relational technology is well understood,
including the so-called NewSQL technologies
that seek to address the shortfalls of
conventional web databases in distributed
computing environments. And relational
technology continues to be the default. As of
July 2015, most database users think primarily
in terms of tightly linked tables and up-front
schemas. Of the top 30 databases ranked in
DB-Engines, 17 relational databases received
86 percent of the ranking points. In contrast,
nine NoSQL databases received 11 percent
of the points. Dedicated search engines
(called databases in the DB-Engines analysis)
Elasticsearch, Solr, and Splunk together received 3 percent of the points.
NoSQL is not the only growth area. Excluded
from the DB-Engines ranking are Hadoop and
other comparable distributed file systems,
such as Ceph and GlusterFS and their
associated data processing engines. These
systems, as explained later, are becoming
the foundation for unified analytics and
operational platforms that incorporate not
only other database functions but also stream
and batch processing functions in a single
environment.
13 Structured query language, or SQL, is the dominant query language associated with relational databases. NoSQL stands for not only
structured query language. In practice, the term NoSQL is used loosely to refer to non-relational databases designed for distributed
environments, rather than the associated query languages. PwC uses the term NoSQL, despite its inadequacies, to refer to non-relational
distributed databases because it has become the default term of art. See the section “Database evolution becomes a revolution” in
the article “Enterprises hedge their bets with NoSQL databases,” PwC Technology Forecast 2015, Issue 1, http://www.pwc.com/us/
en/technology-forecast/2015/remapping-database-landscape/features/enterprises-nosql-databases.jhtml, for more information on
relational versus non-relational database technology.
14 “DB-Engines Ranking,” DB-Engines, http://db-engines.com/en/ranking, accessed July 8, 2015.
15 “How enterprise graph databases are maturing,” PwC interview with Martin Van Ryswyk and Marko Rodriguez of DataStax, PwC
Technology Forecast 2015, Issue 1, http://www.pwc.com/us/en/technology-forecast/2015/remapping-database-landscape/interviews/
martin-van-ryswyk-marko-rodriguez-interview.jhtml.
16 “DB-Engines Ranking,” DB-Engines, http://db-engines.com/en/ranking, accessed July 8, 2015.
Some unified analytics and operational
platforms have NoSQL at their core. These
unified platforms offer the ability to close
the big data feedback loop for the first time,
enabling near-real-time responsiveness—
even in legacy systems that previously seemed
hopelessly outdated and unresponsive.
The Berkeley Data Analytics Stack
Hadoop data infrastructure has been the
starting point for many of these unified
big data frameworks. The Berkeley Data
Analytics Stack (BDAS)—an open-source
framework created over several years at the
University of California, Berkeley, AMPLab
(Algorithms, Machines, and People) in
collaboration with industry—takes the batch
processing of Hadoop, accelerates it with the
help of in-memory technologies, and places
it in an analytics framework optimized for
machine learning.
The capabilities of BDAS have been
paying off in notable ways for more
than a year. Consider the case of Joshua
Osborn, a 14-year-old with an immune
system deficiency who was suffering from
brain swelling in 2014. Doctors couldn’t
diagnose the cause of the swelling by using
conventional bioinformatics systems.
Eventually they turned to an application
the AMPLab pioneered called Scalable
Nucleotide Alignment Program (SNAP) that
runs on BDAS.
Researchers at the University of California,
San Francisco, took the available DNA from
the patient, sequenced 3 million fragments
of genomic information within two days, and
put aside the human genome information
with the help of SNAP. Then they matched the
sequences that were left with the sequences of
known pathogens within 90 minutes.17
It turned out Joshua was infected with
Leptospira, a potentially lethal bacterium that
had caused his brain to swell. Doctors gave
him doses of penicillin, which immediately
reduced the swelling. Two weeks later, Joshua
was walking, and soon he fully recovered.
At the heart of BDAS is Apache Spark, which
serves as a more accessible and accelerated
in-memory replacement for MapReduce, the
data processing engine for Hadoop. Key to
Spark is the concept of a resilient distributed
dataset, or RDD. An RDD is an abstraction
that allows data to be manipulated in memory
across a cluster. Spark tracks the lineage of
data transformations and can rebuild data
sets without replication if an RDD suffers a
data loss.18 The input and lineage data can be
stored in the Hadoop Distributed File System
(HDFS) or a range of databases.
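The lineage idea can be illustrated without Spark: rather than replicating a derived data set, record the parent and the transformation that produced it, and replay that recipe if a partition is lost. A toy model, not Spark's API:

```python
class ToyRDD:
    """Toy model of an RDD: a partitioned data set that remembers its
    lineage (parent plus transformation) instead of replicating data."""
    def __init__(self, partitions, parent=None, transform=None):
        self.partitions = partitions
        self.parent = parent
        self.transform = transform

    def map(self, fn):
        """Derive a new data set and record how it was computed."""
        new_parts = [[fn(x) for x in part] for part in self.partitions]
        return ToyRDD(new_parts, parent=self, transform=fn)

    def recover_partition(self, i):
        """Rebuild a lost partition by replaying the lineage."""
        source = self.parent.partitions[i]
        return [self.transform(x) for x in source]

base = ToyRDD([[1, 2], [3, 4]])
doubled = base.map(lambda x: x * 2)
doubled.partitions[1] = None                       # simulate a loss
doubled.partitions[1] = doubled.recover_partition(1)  # recompute, no replica
```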
RDDs are one reason Spark can outperform
MapReduce in traditional Hadoop clusters.
In the 2014 Daytona GraySort competition,
Apache Spark sorted 100 terabytes at 4.27
terabytes per minute on 207 nodes. In
contrast, the 2013 winner was Yahoo, whose
2,100-node Hadoop cluster sorted 102
terabytes at 1.42 terabytes per minute.19
Spark uses HDFS and other Hadoop-compatible disk-based file systems for
disk storage. Tachyon, which the AMPLab
announced in 2014, augments the capabilities
of these file systems by allowing data sharing
in memory, another way to reduce latency in
complex machine-learning pipelines.20
In 2015, the AMPLab added Velox to BDAS.
Velox allows BDAS to serve up predictions
generated by machine-learning models and
to manage the model retraining process,
completing the model-driven feedback loop
that Spark begins.
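This article doesn't detail Velox's interfaces, but the serve-observe-retrain loop it completes can be caricatured with a one-parameter model that is updated online as outcomes arrive; every name below is hypothetical:

```python
class ServingLayer:
    """Caricature of a model-serving loop: serve predictions, collect
    observed outcomes, and fold them back into the model online."""
    def __init__(self, weight=0.0, lr=0.1):
        self.weight = weight
        self.lr = lr

    def predict(self, x):
        return self.weight * x

    def observe(self, x, y):
        """Retrain incrementally on an observed outcome: one gradient
        step for a one-parameter linear model."""
        error = self.predict(x) - y
        self.weight -= self.lr * error * x

layer = ServingLayer()
for _ in range(200):          # feedback loop: true relation is y = 3x
    layer.observe(2.0, 6.0)
after = layer.predict(2.0)    # predictions improve as outcomes arrive
```

The point is the loop, not the model: serving and retraining share one component, so the system keeps refining itself as data flows through.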
17 Carl Zimmer, “In a First, Test of DNA Finds Root of Illness,” The New York Times, June 4, 2014, http://www.nytimes.com/2014/06/05/
health/in-first-quick-dna-test-diagnoses-a-boys-illness.html, accessed June 4, 2015.
18 Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker,
and Ion Stoica, “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing,” NSDI 2012,
http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf, accessed June 4, 2015.
19 Daytona GraySort Benchmark, http://sortbenchmark.org/, accessed June 4, 2015.
20 Haoyuan Li, “AMP Camp 5: Tachyon – Haoyuan Li,” UC Berkeley AMPLab video, https://www.youtube.com/watch?v=8yPdhIbfP7A,
accessed June 4, 2015.
According to Michael Franklin, head of the
AMPLab and chair of the computer science
department at the University of California,
Berkeley, “If you look at the architecture
diagram [included here], Velox sits right next
to Spark in the middle of the architecture.
The whole point of Velox is to handle more
operational processes. It will handle data that
gets updated as it is available, rather than
loading data in bulk, as is common when it’s
analyzed. Our models also get updated on a
real-time basis when using computers that
learn as they process data.”
In other words, an entire test-driven, machine-learning-based services environment could be built using this stack.

[Figure: The missing piece in BDAS. The stack layers BlinkDB, Spark Streaming, GraphX, Spark SQL, and MLlib (with MLbase for training) over Spark, with Velox alongside for management and serving, all running on Mesos and Tachyon atop HDFS, S3, and other storage. Source: Dan Crankshaw, UC Berkeley AMPLab, 2015, http://www.slideshare.net/dscrankshaw/velox-at-sf-data-mining-meetup]

Kafka- and Samza-based unified big data platforms

Interestingly, while BDAS relies heavily on in-memory technology up front and maximizes the use of random-access memory (RAM) to reduce latency in various parts of the stack, the proponents of platforms that use Apache Kafka and Apache Samza think a disk-based storage or file system first approach is best, particularly in a Kafka log-file-centric system. What's stored on disk gets cached in memory without duplication, and the logic that manages the synchronization of the two resides entirely in the operating system. The main-memory-first approach, they assert, risks greater memory inefficiencies, slowness, and unreliability because of duplication and a lack of headroom.21

Apache Kafka and Apache Samza were born of LinkedIn's efforts to move to a less monolithic, microservices-based system architecture. Kafka is an event-log-based messaging system designed to maximize throughput and scalability and minimize dependencies. Services publish or subscribe to the message stream in Kafka, which can handle “hundreds of megabytes of reads and writes per second.”22 Samza processes the message stream and enables decoupled views of the writes in Kafka. Samza and Kafka together constitute a distributed, log-centric, immutable messaging queue and data store that can serve as the center of a big data analytics platform.23
21 “4.2 Persistence,” Apache Kafka, https://kafka.apache.org/08/design.html, accessed June 5, 2015.
22 Apache Kafka, http://kafka.apache.org/, accessed June 4, 2015.
23 “The rise of immutable data stores,” PwC Technology Forecast 2015, Issue 1, http://www.pwc.com/us/en/technology-forecast/2015/remapping-database-landscape/features/immutable-data-stores-rise.jhtml.
But many pieces of that platform are
only now being put into place, including
serving databases and associated querying
capabilities. Confluent is a stream processing
platform startup founded by former LinkedIn
employees who had worked on Kafka, Samza,
and other open-source projects. Kafka can
synchronize the data from various sources,
and Confluent has envisioned how those
data sources could feed into a Kafka-centric
enterprise architecture and how stream
analytics could be accomplished with the help
of Apache Flink, an emerging alternative to
Apache Spark that claims true stream and
batch processing using the same engine.24
Others have also built their own stacks using
Kafka and Samza. Metamarkets, a big data
analytics service,25 has built what it calls a
real-time analytics data (RAD) stack 26 that
includes Kafka, Samza, Hadoop, and its own
open-sourced Druid time-series data store
to query the Kafka data and serve up the
results. Metamarkets says its RAD stack can
ingest a million events per second and query
trillions of events, but has found it necessary
to instrument its stack and establish a
separate monitoring data cluster to maintain
such a high-throughput system in good
working order.
The Confluent and Metamarkets platforms
could be considered implementations of what
have been called the Kappa and Lambda
architectures. In the Kappa architecture,
stream and batch coexist within the same
infrastructure. Confluent asserts that the
separate stream processing pipeline originally
specified in the Lambda architecture was
necessary only because stream processing was
weak at the time. Now stream and batch can
be together. Metamarkets continues to use a
Lambda architecture, but has recently added
Samza to the mix.
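The Kappa idea reduces to a few lines: keep a replayable log and run one processing function over both the live stream and any later batch recomputation. An illustrative sketch, not any product's API:

```python
log = []            # replayable, append-only event log
live_counts = {}    # view maintained from the live stream

def process(event, counts):
    """One processing function, shared by the stream and batch paths."""
    counts[event["type"]] = counts.get(event["type"], 0) + 1

def ingest(event):
    log.append(event)            # durable, replayable record
    process(event, live_counts)  # update the live view immediately

def replay():
    """'Batch' processing in Kappa: replay the whole log through the
    same code, for example after changing the logic in process()."""
    counts = {}
    for event in log:
        process(event, counts)
    return counts

for t in ["click", "click", "purchase"]:
    ingest({"type": t})

batch_counts = replay()
```

Because both paths run the identical function over the same log, rebuilding a view after a logic change is just a replay; no separate batch pipeline is specified, which is the Kappa argument against Lambda's dual code paths.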
[Figure: Kappa architecture: a log-centric stream processing system that can also be used for batch processing. Events flow into a replayable log, which feeds both a real-time speed layer and batch computation (Hadoop); both feed the serving layer. Source: www.codefutures.com]
24 Kostas Tzoumas and Stephan Ewen, “Real-time stream processing: The next step for Apache Flink,” Confluent blog, May 6, 2015, http://blog.confluent.io/2015/05/06/real-time-stream-processing-the-next-step-for-apache-flink/, accessed June 4, 2015.
25 “The nature of cloud-based data science,” PwC interview with Mike Driscoll of Metamarkets, PwC Technology Forecast 2012, Issue 1, http://www.pwc.com/us/en/technology-forecast/2012/issue1/interviews/interview-mike-driscoll-ceo-metamarkets.jhtml.
26 “A RAD Stack: Kafka, Storm, Hadoop, and Druid,” August 27, 2014, http://lambda-architecture.net/stories/2014-08-27-kafka-storm-hadoop-druid/, and “Dogfooding with Druid, Samza, and Kafka: Metametrics at Metamarkets,” June 3, 2015, https://metamarkets.com/2015/dogfooding-with-druid-samza-and-kafka-metametrics-at-metamarkets/, accessed June 5, 2015.
In either case, these are two very high-throughput analytics platforms designed
for cost-effective scalability and flexibility.
Metamarkets has a number of online and
mobile advertising company clients, including
OpenX, an online ad exchange that handles
hundreds of billions of transactions each
month. OpenX connects buyers and sellers
in this online exchange. The Metamarkets
service provides visibility into the ad exchange
marketplace and includes custom dashboards
built for OpenX on the front end. The Lambda-architecture-based back end queues up
the streams to be analyzed, provides the
processing, and serves up the results.27
Conclusion: More capable
databases and big data
platforms
NoSQL databases emerged in the late 2000s
out of necessity. As big data grew, the need
arose to spread data processing across
commodity clusters to make processing timely
and cost-effective. As big data became more
complex and interactive, developers and
analysts needed access to a range of data
models. As relational database technology
improved, NewSQL responded to retake
some of the territory lost earlier. Hybrids and
immutable data stores are further expanding
the choices available to enterprises.
But the mid-2010s saw expansion in a
different, unanticipated direction—entirely
new data-driven application stacks built
on distributed file systems, particularly
HDFS. Stacks such as BDAS promise to close
the feedback-response loop in a machine-learning environment. Separate databases
will continue to be used in these environments
to provide persistence. Theoretically,
machines doing unsupervised learning
could continuously refine a model given a
closed feedback loop. Doctors confronting
illnesses difficult to diagnose could turn to
a standard process of genomics testing to
identify pathogens in a patient’s body, as
described in the University of California, San
Francisco, example. Similarly, enterprises
could monitor operational processes, focus on
those of particular concern, and use machine
learning to identify elements that could cause
disruption, helping management to diagnose
a problem and prescribe a solution.
Just as BDAS has become more complete
with the addition of Velox, other alternative
stacks are emerging, such as the message-log-centric Kafka plus Samza environment.
Stream processing is rapidly maturing with
dataflow engines such as Apache Flink, so
organizations can create very high-capacity
online transactional environments that use
the same cluster and engine for Hadoop-style
batch analytics.
What’s possible with the newest technology
is the ability to close the big data feedback
loop for the first time, enabling near-real-time
responsiveness—even in legacy systems that
previously seemed hopelessly outdated and
unresponsive. The distributed database can
still be used on its own for a mobile or web
app, but more and more, its real power comes
when the data store is embedded as part of
one of these newly built stacks. So-called
deep or unsupervised machine learning—an
upcoming PwC Technology Forecast research
topic—pays off most when analytics and
operational systems people work together
to build one continuously improving, cloud-native system. The technology that makes
such a system feasible no longer requires a
forklift upgrade.
27 “OpenX case study,” Metamarkets, https://metamarkets.com/case-studies/openx/, accessed June 5, 2015.
To have a deeper conversation
about remapping the database
landscape, please contact:
Gerard Verweij
Principal and US Technology
Consulting Leader
+1 (617) 530 7015
[email protected]
Chris Curran
Chief Technologist
+1 (214) 754 5055
[email protected]
Oliver Halter
Principal, Data and Analytics Practice
+1 (312) 298 6886
[email protected]
Bo Parker
Managing Director
Center for Technology and Innovation
+1 (408) 817 5733
[email protected]
About PwC’s Technology Forecast
Published by PwC’s Center for Technology
and Innovation (CTI), the Technology Forecast
explores emerging technologies and trends
to help business and technology executives
develop strategies to capitalize on technology
opportunities.
Recent issues of the Technology Forecast have
explored a number of emerging technologies
and topics that have ultimately become
many of today’s leading technology and
business issues. To learn more about the
Technology Forecast, visit www.pwc.com/
technologyforecast.
About PwC
PwC US helps organizations and individuals
create the value they’re looking for. We’re a
member of the PwC network of firms in 157
countries with more than 195,000 people.
We’re committed to delivering quality in
assurance, tax and advisory services. Find
out more and tell us what matters to you by
visiting us at www.pwc.com.
© 2015 PwC. All rights reserved. PwC refers to the US member firm or one of its subsidiaries or affiliates, and may sometimes refer to the PwC
network. Each member firm is a separate legal entity. Please see www.pwc.com/structure for further details. This content is for general information
purposes only, and should not be used as a substitute for consultation with professional advisors. MW-15-1351 LL