www.pwc.com/technologyforecast
Technology Forecast: Remapping the database landscape
Issue 1, 2015
The rise of immutable
data stores
By Alan Morrison
Some innovators are abandoning long-held database
principles. Why?
The website for Room Key, a joint venture of
six hotel chains to help travelers find and book
lodging, collects data from as many as
17 million pages per month, records an
average of 1.5 million events every 24 hours,
and handles peak loads of 30 events per
second. To process that onslaught of complex
information, its database records each event
without waiting for some other part of the
system to do something first.
The Room Key data store1 doesn’t much
resemble the relational database typically
used for transaction processing. One
important difference is the way databases
record new entries. In a relational database,
files are mutable, which means a given cell
can be overwritten when there are changes to
the data relevant to that cell. Room Key has an
accumulate-only file system that overwrites
nothing. Each file is immutable, and any
changes are recorded as separate timestamped
files. The method lends itself not only to
faster and more capable stream processing,
but also to various kinds of historical time-series analysis—that is, both ends of the data
lifecycle continuum.
The traditional method—overwriting mutable
files—originated during the era of smaller
data when storage was expensive and all
systems were transactional. Now disk storage
is inexpensive, and enterprises have lots of
data they never had before. They want to
minimize the latency caused by overwriting
data and to keep all data they write in a full-fidelity form so they can analyze and learn
from it. In the future, mutable could become
the exception rather than the rule. But as of
2015, the use of immutable files for recording
data changes is by far the exception.
The more expansive use of files in a database
context is one of the many innovations
creating a sea change in database technology.
[Figure: Immutable, append-only data stores and the data lifecycle. Immutable data stores could be useful in any stage of the data lifecycle, but have only recently been used in transactional systems. The diagram plots data store types (NoDBs; key-value/column; document; stream; database as a service; overlays and hybrids; NewSQL; graph; network) against lifecycle stages (plan; capture/collate; aggregate; transact; reuse) and levels of use (current, emerging, established), and contrasts data set uses and characteristics: from simple persistence of raw, less structured, perishable, single-use, massively scalable data, through immediate usability, to long-term enterprise reusability of refined, structured, less perishable, reusable, less scalable data.]
1 “Room Key,” Cognitect, http://www.datomic.com/room-keys-story.html, accessed May 18, 2015.
This PwC Technology Forecast series, Remapping the database landscape, explores the promise and upheaval caused by these
new technologies. This article examines the
evolution of immutable databases and what
it means for enterprise databases outside
the tech sector. Previous articles2 focused on
NoSQL3 data models, graph databases, and
document stores.
A growing collection of
immutable facts
Whereas transactional databases are
systems of record, most new data stores are
systems of engagement. These new data
stores are designed for analysis, whether
they house Internet of Things (IoT) data,
social-media comments, or other structured
and unstructured data that have current
or anticipated analytical value. The new
data stores are built on cloud-native
architectures, and immutable files are more
consistent with the cloud mentality. At this
early stage in data analytics, any web-scale
architecture is a candidate for a data store
that has immutable files.
This evolving technology makes sense for
many niches within mainstream enterprises.
Those niches are the emerging applications
that enterprises might not be using much now,
but will be in the future. Forward-leaning
enterprises should plant one foot in the future.
They need to understand the immutable
option and learn to work with it.
Consider this IoT example: The Bosch Group
is launching sensor technology that connects
railroad cars to the Internet and gathers
information while the train is moving. The
data will provide insight into temperature,
noise, vibrations, and other conditions
useful for understanding what is happening
to the train and its freight in transit. Data is
transmitted wirelessly to servers, evaluated by
control logistics processes, and presented in
a data portal, integrated with the customer’s
business processes.4
Such a system would transmit a near-steady
stream of data. The users would want the
entire data set intact as recorded to study,
analyze, learn from, and study again; they
would not want anything overwritten by
changes. Immutable files are ideal for this
use case.
Martin Kleppmann, a serial entrepreneur,
developer, and author on sabbatical from
LinkedIn, thinks the only reason databases
still have mutable state is inertia. Mutable
state, he says, is the enemy, something
software engineers have tried to root out of
every part of the system except databases.
Now storage is inexpensive, which makes immutable data storage feasible at scale.
Given the economics, Kleppmann says a
database should be “an always-growing
collection of immutable facts,” rather than a
technology that can overwrite any given cell.
2 Other feature articles in this series (PwC Technology Forecast 2015, Issue 1, http://www.pwc.com/us/en/technology-forecast/2015/
remapping-database-landscape.jhtml) have included the following:
“Enterprises hedge their bets with NoSQL databases” described the big-picture basics, comparing and contrasting relational and non-relational databases and how NoSQL databases have been filling a gap opened by the growth of heterogeneous data for customer-facing
systems of engagement.
“Using document stores in business model transformation” considered the flexibility, capacity, and search and retrieval capability of
document stores in applications such as Codifyd’s faceted search in web-scale e-commerce catalogs.
“The promise of graph databases in public health” explored the nature of graph stores, their analytics potential, and how software-as-a-service providers such as Zephyr Health are using them to integrate thousands of different data sources.
“How NoSQL key-value and wide-column stores make in-image advertising possible” explored how the speed and scalability of these
database types make innovations such as GumGum’s in-image online advertising possible.
3 Structured query language, or SQL, is the dominant query language associated with relational databases. NoSQL stands for not only
structured query language. In practice, the term NoSQL is used loosely to refer to non-relational databases designed for distributed
environments, rather than the associated query languages. PwC uses the term NoSQL, despite its inadequacies, to refer to non-relational
distributed databases because it has become the default term of art. See the section “Database evolution becomes a revolution” in
the article “Enterprises hedge their bets with NoSQL databases,” PwC Technology Forecast 2015, Issue 1, http://www.pwc.com/us/
en/technology-forecast/2015/remapping-database-landscape/features/enterprises-nosql-databases.jhtml, for more information on
relational versus non-relational database technology.
4 The Bosch Group, “Bosch brings freight trains to the internet,” news release, May 5, 2015, http://www.bosch-presse.de/presseforum/
details.htm?txtID=7216&tk_id=108, accessed May 27, 2015.
Besides the expense, overwriting also
harkens back to a time when data was more
predictable. Kleppmann, by his own definition
a database rebel, questions the conventional
wisdom of sticking with traditional data
technologies that overwrite, instead of leaving
data written the way it is the first time and
documenting changes in separate files.
He points out that conventional database
replication mechanisms already rely on
streams of immutable events.
He refers to Apache Samza, a project he helped start at LinkedIn and contributes to, as “a distributed stream processing framework,” but admits that it’s really “a surreptitious attempt to take the database architecture we know and turn it inside out.”5

Kleppmann, Pat Helland of Salesforce, and Jay Kreps, formerly of LinkedIn and now CEO of Confluent, are three of the most recent and vocal advocates for the concept of immutability in database technology. They share these views for transactional systems as well as analytics systems.

At present, Hadoop data lakes are immutable, but there are few other examples. Datomic (Room Key’s partner) and the Microsoft Tango object database6 are among the other data stores currently available or being developed that claim the ability to support consistent, high-volume transactions and write guarantees without mutable state. Combinations of Apache Kafka (a messaging broker described later in this article) and Samza may be headed in that direction as well.

What is an immutable or log-centric database, anyway?

Conventional database wisdom says you need to overwrite a cell in a table, or a collection of related cells in multiple tables, and lock what’s affected until the write takes hold. This action must be taken to guarantee integrity within and across related data tables. Write locking by definition builds in contention or dependency among locked data entries, and thus the potential for delay in overwriting the data. One part of the database system needs to wait for cells to be unlocked before writing to them. The need to ensure consistency in transactions involving overwrites, part of the ACID guarantees7 that relational databases are known for, means that write locking is necessary. Some database vendors (Couchbase, for example) claim an ability to perform nonblocking writes, but that’s within the context of eventual rather than immediate consistency.
[Figure: A log-centric or immutable-file database decouples writes and reads, avoiding the resource contention associated with writes. A data source appends writes to a numbered log (entries 0 through 12); destination system A reads the log as of position 7, while destination system B reads as of position 11. Source: Jay Kreps of LinkedIn, 2015]
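The decoupling the figure describes can be sketched in a few lines of Python. This is an illustration of the pattern only, not Kafka’s or Datomic’s actual interface; the class and method names are invented for the example.

```python
import threading

# A minimal append-only log with independent reader positions: the writer
# only ever appends, and each destination system reads from its own offset
# without locking any cell.
class AppendOnlyLog:
    def __init__(self):
        self._entries = []
        self._lock = threading.Lock()   # guards only the append itself

    def append(self, record):
        with self._lock:
            self._entries.append(record)
        return len(self._entries) - 1   # the record's immutable position

    def read_from(self, offset):
        """Return every record at or after offset. Existing entries never
        change, so no read ever contends with an in-place overwrite."""
        return self._entries[offset:]

log = AppendOnlyLog()
for event in ["booked", "cancelled", "rebooked"]:
    log.append(event)

system_a = log.read_from(0)   # a full-history consumer
system_b = log.read_from(2)   # a consumer that starts at position 2
```

Because each consumer tracks only its own position, adding a new destination system never requires coordinating with the writer or with other readers.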
5 See Martin Kleppmann, “Turning the database inside-out with Apache Samza,” transcript and video of presentation at StrangeLoop 2014,
http://blog.confluent.io/2015/03/04/turning-the-database-inside-out-with-apache-samza/, including his references to writings from Jay
Kreps, Pat Helland, and Microsoft Research. Thanks to Michael Hausenblas of Mesosphere for referring us to this blog post.
6 Mahesh Balakrishnan et al., Tango: Distributed Data Structures over a Shared Log, Microsoft Research, University of California, San
Diego, Cornell University, and Tel Aviv University, http://www.cs.cornell.edu/~taozou/sosp13/tangososp.pdf, accessed May 14, 2015.
7 ACID stands for atomicity, consistency, isolation, and durability, which are principles that database designers have historically adhered to
for mission-critical transactional systems.
Analytics has been the latest focus of immutable-file, or log-centric, databases. LinkedIn uses Apache Samza
for various resource metrics related to site
performance. The monitoring takes the
form of a data-flow graph.
As described earlier, traditional databases
store data in tables. Some cells in a table
or tables are overwritten whenever
changes need to be made. But not all data
persistence requirements fit this pattern.
A different pattern is a long list of time-ordered facts, such as a system of log files in which each individual fact never changes. Some business uses of the data are concerned only with understanding that history of facts. Other uses, however, also need to know the current state of the business, which a log of facts does not expose directly.
For example, a rental owner would record a
history of tenants, but also be interested in
who the current tenants are. This scenario
sounds like a transaction requiring an
overwrite of the field “current tenant.”
Instead, when there is a change, the system
records that change as a separate record in a
separate file. In other words, the files taken
together reflect all the changes, whereas in
the tabular database, each table must reflect
all changes related to it.
Kreps says the immutable equivalent to a
table is a log. Kreps observes that the “log is
the simplest storage abstraction,” and points
out that “a log is not that different from a
file or a table. A file is an array of bytes, a
table is an array of records, and a log is really
just a kind of table or file where records are
sorted by time.” So why not store all your
records in log form, as Kreps suggests? When
a record changes, just store the change, and
that becomes a separate timestamped record.
Storage is inexpensive enough that this option
is possible.8
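Applied to the rental example, Kreps’ suggestion might look like the following Python sketch, in which the record layout and function names are hypothetical:

```python
from datetime import date

# An append-only log of tenancy facts; nothing is ever overwritten.
# Each change is simply a new timestamped record.
lease_log = [
    (date(2013, 1, 1), "unit-4", "Alice"),
    (date(2014, 6, 1), "unit-4", "Bob"),    # a change: a new fact, not an overwrite
    (date(2014, 6, 1), "unit-7", "Carol"),
]

def tenants_as_of(log, as_of):
    """Replay the facts recorded up to a date; the latest fact per unit wins."""
    tenants = {}
    for when, unit, tenant in sorted(f for f in log if f[0] <= as_of):
        tenants[unit] = tenant
    return tenants
```

Asking who the current tenants are is just a replay up to today’s date; asking who they were at the end of 2013 is the same replay with an earlier cutoff, a question an overwrite-in-place table cannot answer from its current state alone.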
Why is immutability in big
data stores significant?
Immutable databases promise these
advantages:
• Fewer dependencies: Immutable
files reduce dependencies or resource
contention, which means one part of the
system doesn’t need to wait for another
to do its thing. That’s a big deal for large,
distributed systems that need to scale and
evolve quickly. Web companies are highly
focused on reducing dependencies. Helland
of Salesforce says, “We can now afford
to keep immutable copies of lots of data,
and one payoff is reduced coordination
challenge.” 9
• Higher-volume data handling and
improved site-response capabilities:
Room Key, with the help of Datomic, can
present additional options to site visitors
who leave before booking a room. Room
Key can also handle more than 160 million
urgent alerts per month to shoppers,
making sure they know when room
availability is diminishing and giving them
a chance to book rooms quickly during
periods of high demand.
• More flexible reads and faster writes:
Michael Hausenblas of Mesosphere
observes that writing the data without
structuring it beforehand means that “you
can have both fast reads and writes,” as well
as more flexibility in how you view the data.
• Compatibility with microservices
architecture, log-based messaging
protocols, and Hadoop: LinkedIn’s
Apache Samza and Apache Kafka, a
simplified, high-volume messaging
queue also designed at LinkedIn, are
symbiotic and compatible with the Hadoop
Distributed File System (HDFS), a popular
method of distributed storage for less-structured data.10
8 Jay Kreps, “The Log: What every software engineer should know about real-time data’s unifying abstraction,” LinkedIn blog, https://
engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying,
December 16, 2013, accessed May 18, 2015.
9 Pat Helland, “Immutability Changes Everything,” Conference on Innovative Data Systems Research (CIDR), January 2015.
10 See “Agile coding in enterprise IT,” PwC Technology Forecast 2014, Issue 1, http://www.pwc.com/en_US/us/technology-forecast/2014/
cloud-computing/features/microservices.jhtml, for more information on messaging in a microservices context.
• Suitability for auditability and
forensics, especially in data-driven, fully
instrumented online environments: Log-centric databases and the transactional
logs of many traditional databases share
a common design approach that stresses
consistency and durability (the C and D in
ACID). But only the fully immutable shared
log systems preserve the history that is most
helpful for audit trails and forensics.
Most of the technologies that make such log
systems possible are not new. Systems such
as the transaction logs mentioned earlier,
which hold timestamped log and other files
that contain immutable data, have existed
for decades.
Data storage as an immutable series of
append-only files has been the norm for more
than 15 years in some truly distributed, big
data analytics computing environments.
Examples include the Google File System
(GFS) in the mid-2000s and, more recently, HDFS, a clone of GFS, in the 2000s and 2010s.
Earlier append-only log-file systems include
the Zebra striped network file system from
the mid-1990s. Among NoSQL databases,
CouchDB (a document store) also stores
database files in append-only mode.11
Functional programming languages such
as Erlang, Haskell, and LISP have seen a
resurgence of interest because of the growth
of parallel or cluster computing, and these
all embrace the principle of immutability to
simplify how state, for example, is handled.
Rich Hickey, creator of Clojure (a LISP dialect that runs on the Java Virtual Machine) and the founder of
Datomic, boils the notion of state down to a
value associated with an identity at a given
point in time.12 “Immutable stores aren’t new,”
Dave Duggal, founder of full-stack integration
provider EnterpriseWeb, points out. “They are
common for high-level programming—that’s
the reason why immutability proponents are
often data-driven application folks.”
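Hickey’s formulation, in which state is simply the value an identity has at a point in time, can be sketched as follows. The Python class below is a hypothetical illustration, not Datomic’s API; a simple logical counter stands in for real transaction timestamps.

```python
import bisect
import itertools

class Identity:
    """A named identity whose state is an append-only history of
    (time, value) facts; asking for its state always means asking
    for its value as of some point in time."""

    _clock = itertools.count()   # a toy logical clock (an assumption of
                                 # this sketch, not how Datomic keeps time)

    def __init__(self, name):
        self.name = name
        self._history = []       # append-only list of (time, value) facts

    def set(self, value):
        t = next(Identity._clock)
        self._history.append((t, value))
        return t                 # the time at which this fact was recorded

    def value_at(self, t):
        """The value associated with this identity as of time t."""
        times = [fact[0] for fact in self._history]
        i = bisect.bisect_right(times, t)
        return self._history[i - 1][1] if i else None

temp = Identity("sensor-42-temperature")
t0 = temp.set(20.5)
t1 = temp.set(22.0)
```

Nothing is ever overwritten: updating the temperature appends a new fact, and earlier values remain queryable by time.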
What is new is that Kleppmann and others
advocate the immutable, append-only
approach as the norm for online analytical
processing (OLAP) and online transaction
processing (OLTP) environments. On the
OLAP side, LinkedIn’s Kreps advocates a
unified data architecture that supports stream
and batch processing. On the OLTP side, the
designers of the Microsoft Tango metadata
object store claim fast transactions across
partitions. However, the intended purpose
of that data store is metadata services, not
transactional services, according to the
research paper.
Conclusion: Renewed
immutability debates and
other considerations
It’s interesting to note that the proponents of these
immutable, log-centric databases generally
don’t have database backgrounds. Instead,
they tend to be data-driven application
and large-scale system engineers. It’s also
interesting to note that more traditional
database designers can be immutability’s most
vocal opponents.
Take, for example, Baron Schwartz, one of
the contributors to MySQL, an open-source
relational database. Schwartz wrote a cogent
critique of databases such as Datomic,
RethinkDB, and CouchDB in 2013. Among
Schwartz’s arguments:
• Maintaining access to old facts comes at a
high price, like the cost of infinitely growing
storage.
• Even in a solid-state environment, entities
are spread out across physical storage,
slowing things down.
• Disks eventually get full, which means you
must save the old database and start a new
one in the case of CouchDB, and reserve
enough space to do so. Running out of space
can make a database such as Datomic totally
unavailable, he says.
11 For more on Hadoop in a data lake integration context, see “Data lakes and the promise of unsiloed data,” PwC Technology Forecast
2014, Issue 1, http://www.pwc.com/us/en/technology-forecast/2014/cloud-computing/features/data-lakes.jhtml. For more on CouchDB
and RethinkDB database design, see “Background: Current CouchDB storage,” http://www.couchbase.com/wiki/display/couchbase/
Generational+Append-Only+Storage+Files, accessed May 6, 2015. See “Using document stores in business model transformation,”
PwC Technology Forecast 2015, Issue 1, http://www.pwc.com/us/en/technology-forecast/2015/remapping-database-landscape/
features/document-stores-business-model-transformation.jhtml, for a case study involving Avera and CouchDB.
12 Rich Hickey, “Persistent Data Structures and Managed References,” InfoQ QCon presentation, October 1, 2009, http://www.infoq.com/
presentations/Value-Identity-State-Rich-Hickey, accessed May 26, 2015.
• Relational databases do have short-term
immutability that takes the form of old rows
maintained in a history list. In this way and
others, designs have long steered clear of
the shortfalls that log-centric databases
seem to be facing now.13
Schwartz made some valid points, but these
points didn’t seem to factor in the broader
data storage landscape and its more recent
impact. The fact is that an immutable
approach has been in place for a number of
years now, in Hadoop clusters with HDFS,
for example. That approach continues to
evolve. NoSQL database immutability will
presumably improve along similar lines.
Schwartz’s points about ACID-compliant
databases and how they preserve consistency
and availability are well taken, and those
databases continue to be well suited to
traditional core transactional systems.
Those systems aren’t going to change much
anytime soon.
Schwartz also doesn’t seem to think much about the data (or the metadata between the data) or about how big data analytics systems
are designed to look at the same data sets
or aggregations of data sets from different
vantage points. Immutable databases take
snapshots of points in time, and analysts
can mine that data in many different ways
for different purposes.14 Conventional
transactional systems are rather single-minded and singular in purpose by comparison.
To pursue innovation, companies engaged
in transformation efforts must do different
things with the data they’ve collected.
Demand for immutability in databases will be
strong in certain industries—for applications
such as patient records that value the long-term audit trail and a full history. Avera’s use
of CouchDB for its longitudinal studies is a
prime example.15
The potential impact of web-company
innovations can’t be discounted. The
database landscape has seen the intrusion
of these engineers before, in the case of
NoSQL, Hadoop, and related big data
analytics technologies. These technologies
had a considerable disruptive impact, and
that impact continues. Roberto Zicari, a
professor at J. W. Goethe University Frankfurt,
Germany, and editor of ODBMS.org, a site that
tracks database evolution, described the birth
of NoSQL in an interview with PwC:
MySQL came out. And at one point, if
you weren’t using MySQL as your back
end, VCs [venture capitalists] would take
their money away from you. It became
standard operating procedure. But they
didn’t question the fundamental value of
the relational model at that time. It was like
violating a law.
Then the next big challenge was that the
scale of these web companies reached the
point where they faced problems staying
within the relational model. Driven more by
necessity than insight, they started creating
solutions to match the scale of the problems
they faced. These solutions were produced
through open-source processes and became
the solve-this-problem databases: keyvalue, column, and document stores.
Since then, the broader storage environment
has become heterogeneous, multitiered,
and aligned with cloud application and
infrastructure technologies. Taken together,
Apache Kafka and Samza aren’t just another
sort of database; they’re components in an
entirely redesigned ecosystem that LinkedIn
uses. They fit with the architectural principles
of microservices. Samza works with the
input from Apache Kafka, which is already a
popular, high-volume, high-speed messaging
queue in native cloud architectures. And
equivalents exist in Apache Spark and Storm
when used in conjunction with HDFS.
13 Baron Schwartz, “Immutability, MVCC, and garbage collection,” Xaprb (blog), December 28, 2013, http://www.xaprb.com/
blog/2013/12/28/immutability-mvcc-and-garbage-collection/, accessed May 19, 2015.
14 See “Creating a big data canvas with NoSQL,” PwC Technology Forecast 2015, Issue 1, http://www.pwc.com/us/en/technology-forecast/2015/remapping-database-landscape/interviews/tom-foth-interview.jhtml. Tom Foth makes this point: “As soon as we clean the
data, we’ve lost data, and that can be a problem. Let me give you an example. For a power grid, let’s say data is coming in from a bunch
of smart meters from people’s homes. That can be a really noisy environment. If I can analyze the noise, there’s a chance that I can have
a lot of other insight about the operation of that grid because of the way the noise is generated.”
15 See the Avera case study in “Using document stores in business model transformation,” PwC Technology Forecast 2015, Issue 1,
http://www.pwc.com/us/en/technology-forecast/2015/remapping-database-landscape/features/document-stores-business-model-transformation.jhtml, for more information.
The Apache Spark stack, or Berkeley
Data Analytics Stack, incorporates its own
distributed storage system called Tachyon.
This data store is designed to complement
distributed file systems or object stores
such as HDFS, GlusterFS, or S3 and act
as a checkpointing repository for the data
sets in memory.
Hundreds of companies are developing
and using applications based on Spark and
Storm. At a more tactical level, there’s the
database tier that’s used on its own. For
more strategic and ad hoc views across silos
by using analytics technologies, there’s the
data lake tier—the locus of much of the
meaningful big data–related innovation
happening in open source.
Mutable data, particularly in core
transactional systems, will continue to
have a place in database management.
Sometimes users will want to take old records
offline to ensure superseded products are
not accidentally made available in online
operational systems. Recalls of automotive
parts, for example, would require a new part
number in ordering systems to replace the old
part. This kind of critical identifier is handled
most reliably via a well-established ACID-compliant transactional system.
With big data analytics, a new approach
demands new structures and methods for
collecting, recording, and analyzing enterprise
data. Machine learning, for example, thrives
on more data, so smart machines can learn
more and faster. Immutable files and their
more loosely coupled nature will help humans
and machines to wrestle all the data they
acquire into usable form.
About PwC’s Technology Forecast
Published by PwC’s Center for Technology
and Innovation (CTI), the Technology Forecast
explores emerging technologies and trends
to help business and technology executives
develop strategies to capitalize on technology
opportunities.
Recent issues of the Technology Forecast have
explored a number of emerging technologies
and topics that have ultimately become
many of today’s leading technology and
business issues. To learn more about the
Technology Forecast, visit www.pwc.com/
technologyforecast.
About PwC
PwC US helps organizations and individuals
create the value they’re looking for. We’re a
member of the PwC network of firms in 157
countries with more than 195,000 people.
We’re committed to delivering quality in
assurance, tax and advisory services. Find
out more and tell us what matters to you by
visiting us at www.pwc.com.
© 2015 PricewaterhouseCoopers LLP, a Delaware limited liability partnership. All rights reserved. PwC refers to the US member firm, and may
sometimes refer to the PwC network. Each member firm is a separate legal entity. Please see www.pwc.com/structure for further details. This
content is for general information purposes only, and should not be used as a substitute for consultation with professional advisors. MW-15-1351 LL