www.pwc.com/technologyforecast
Technology Forecast: Remapping the database landscape
Issue 1, 2015
How NoSQL key-value
and wide-column
stores make in-image
advertising possible
By Alan Morrison
Online ad innovators must process hundreds of terabytes
a day at the lowest possible cost. How do they do it?
To be useful, sometimes big data just needs a good traffic cop. NoSQL1 key-value stores and
wide-column stores serve that function. For example, they enable fast, scalable, targeted, and
more compelling online ad placement. Consider the following online ad by GumGum, a pioneer
of in-image advertising:
You’re surfing the web, looking at home improvement
and garden ideas. You land on an article about lawns.
The first thing you see is a close-up photo, front and
center, of a lush patch of green lawn with a pop-up
sprinkler at work. Wait, now the photo fades to black.
A new ad type called a canvas starts to appear
over the photo, showing pieces of lawn equipment
popping up one at a time, until the whole canvas
is complete. After a moment, it changes again.
The canvas collapses into an ad called a studio,
which has an interactive bar. You can see the photo
again. Out of curiosity, you click the Learn More
button and watch a video that shows a modular
approach to yard equipment—one chassis and
engine for several equipment types—a lawn mower,
a leaf blower, a snow blower. The message: This
modular solution takes up much less space in your
garage than three larger, standalone pieces.2
1 Structured query language, or SQL, is the dominant query language associated with relational databases. NoSQL stands for not only
structured query language. In practice, the term NoSQL is used loosely to refer to non-relational databases designed for distributed
environments, rather than the associated query languages. PwC uses the term NoSQL, despite its inadequacies, to refer to non-relational
distributed stores because it has become the default term of art. See the section “Database evolution becomes a revolution” in the article
“Enterprises hedge their bets with NoSQL databases,” PwC Technology Forecast 2015, Issue 1, http://www.pwc.com/us/en/technologyforecast/2015/remapping-database-landscape/features/enterprises-nosql-databases.jhtml, for more information on relational versus
non-relational database technology.
2 Source: GumGum demo, April 3, 2015
The scenario just described is contextually
relevant, in-image advertising. Each in-image
ad complements the existing image on the
web page. In this case, a canvas ad for a
suite of yard-care equipment appears with a
photo of a lawn in an article about lawn care.
Now think about all the big data storage,
processing, and retrieval requirements
GumGum’s system must meet just to be able
to do what you’ve watched online:
• Awareness and understanding of all
the photos and pages on websites
from thousands of publishers. These
photos are the available inventory that
GumGum matches ads to. Awareness
and understanding at scale implies image
recognition and text-mining capability—
that is, a way for machines to read and
recognize text and images.
• Awareness and understanding of the target
sub-audiences for each ad as conveyed by an
analysis of anonymous data from publishers
about their readers.
• Inexpensive but highly capable cloud-based, petabyte-scale3 storage, processing,
and retrieval of the data previously
described to support ad placement and
serving decisions in near real time.
A wide-column or key-value store such as
Cassandra, Amazon DynamoDB, Redis, or
Riak is central to each of these requirements.
“Latency is very important to any advertising,”
says GumGum CTO Ken Weiner in an
interview with PwC. “We must select and
show ads to users in as little time as possible.
GumGum also participates in real-time
bidding integrations with other companies,
where we have only milliseconds to make
decisions and to figure out what ad we’ll
serve. So we need to be able to access that
data quickly.”
Wide-column stores and key-value stores
are equally suitable for use cases that
require storing, organizing, and quickly
retrieving huge amounts of data that require
little analysis or modeling; for example,
personalizing retail website experiences and
organizing Internet of Things (IoT) data from
sensors.
Wide-column stores and key-value stores,
the topic of this article, are among many
innovations creating a sea change in database
technology. This issue of the Technology
Forecast explores the promise and upheaval
caused by these various new technologies.
3 Digital advertising companies process huge amounts of data, and GumGum is no exception. In 2014 in VentureBeat, John Koetsier
reported about AdRoll’s data storage and processing requirements: “Retargeting leader AdRoll announced that it is processing a
massive 130 terabytes of advertising data daily and has reached 10 petabytes of stored data from the last 12 months. That’s 90 times the
volume of data generated by all the stock exchanges in the U.S.” (John Koetsier, “AdRoll hits gigantic 130 terabytes of ad data processed
daily, says size matters,” VentureBeat, November 12, 2014, http://venturebeat.com/2014/11/12/adroll-hits-gigantic-130-terabytes-of-addata-processed-daily-says-size-matters/, accessed April 3, 2015.)
What are key-value and
wide-column stores?
A key-value store is a highly simplified version
of a database that is also highly optimized
for a few primary capabilities. It stores
individual elements, which could be a digital
representation of any type of value, large
or small. Key-value stores are optimized for
speed and scalability and are often used in
caching applications that require extremely
high throughput. They derive their speed and
scalability from a simple data model and low
overall database complexity.
Figure: Key-value store. Key-value stores offer very high speed via the
least complicated data model: anything can be stored as a value, as long
as each value is associated with a key or name. (Diagram labels: key, value.)
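The key-value model described above can be sketched in a few lines of Python. This is a minimal, in-memory illustration only, not the design of any product named in this article; the class name KeyValueStore and the example keys are hypothetical.

```python
from typing import Any, Dict, Optional


class KeyValueStore:
    """Toy in-memory key-value store: a value is reachable only by its key."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        # The store never inspects the value, so any object is accepted.
        self._data[key] = value

    def get(self, key: str) -> Optional[Any]:
        # A lookup is a single hash-table probe: no joins, no query planning.
        return self._data.get(key)

    def delete(self, key: str) -> None:
        # Deleting a missing key is a no-op.
        self._data.pop(key, None)


# Hypothetical keys: values of different types live side by side.
store = KeyValueStore()
store.put("image:123:tags", ["lawn", "sprinkler"])
store.put("image:123:thumbnail", b"\x89PNG")
print(store.get("image:123:tags"))
```

The simplicity is the point: because the only operations are put, get, and delete by key, the data can be partitioned across many machines by key without coordinating a shared schema.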
Wide-column stores are also fast and are
nearly as simple as key-value stores. They
include a primary key, an optional secondary
key, and any grouping of digital bits stored
as a value. A wide-column store can be
considered part of the larger family of key-value stores. A wide-column store is also a row
store, despite the name. Each row contains a
different record.
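One way to picture the wide-column model is as a two-level map: a primary key selects a row, and a secondary key selects a value within that row. The sketch below assumes that simplified view. The class WideColumnStore and the sample keys are hypothetical, and real products add partitioning, on-disk ordering, and persistence.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple


class WideColumnStore:
    """Toy wide-column model: primary key -> {secondary key -> value}."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, Any]] = defaultdict(dict)

    def put(self, primary_key: str, secondary_key: str, value: Any) -> None:
        # Each row holds only the secondary keys actually written to it.
        self._rows[primary_key][secondary_key] = value

    def get(self, primary_key: str, secondary_key: str) -> Any:
        # Returns None when either the row or the secondary key is absent.
        return self._rows.get(primary_key, {}).get(secondary_key)

    def row(self, primary_key: str) -> List[Tuple[str, Any]]:
        # Return the row's entries ordered by secondary key.
        return sorted(self._rows.get(primary_key, {}).items())


pages = WideColumnStore()
pages.put("example.com/lawn-care", "topic", "lawns")
pages.put("example.com/lawn-care", "image_count", 3)
pages.put("example.com/recipes", "topic", "cooking")  # a sparser row
print(pages.row("example.com/lawn-care"))
```

Rows need not share the same secondary keys, which is why such a store can hold records that are sparse in some rows and numerous in others.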
Key-value and wide-column store designers
see advantages inherent in the simplest, most
stripped-down data models for certain use
cases that require application development
speed or flexibility. Consider the example
of Basho Technologies Riak, a key-value
store. Riak groups keys and values into
“buckets,” which act as namespaces for opaque binary data objects. The
values can be of any data type. Basho
Technologies asserts that a simple data model
simplifies application development, and
that “new features to the application will
not require updating a schema or changing
the data model, ideal for applications where
rapid iterations are required and changes in
the underlying data model are undesirable.” 4
However, data ingested into a key-value store
must be in the form of key-value pairs.
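The schema-flexibility claim can be illustrated with a small sketch. It assumes values are serialized as JSON and addressed by a (bucket, key) pair. The functions put and get below are hypothetical helpers written for this illustration and are not Riak's client API.

```python
import json
from typing import Dict, Optional, Tuple

# (bucket, key) -> serialized value. The store treats each value as opaque bytes.
_objects: Dict[Tuple[str, str], bytes] = {}


def put(bucket: str, key: str, record: dict) -> None:
    # Serialize on the way in; the store itself knows nothing about the fields.
    _objects[(bucket, key)] = json.dumps(record).encode("utf-8")


def get(bucket: str, key: str) -> Optional[dict]:
    raw = _objects.get((bucket, key))
    return None if raw is None else json.loads(raw.decode("utf-8"))


# An older record and a newer one with an extra field coexist in one bucket.
put("ads", "ad-1", {"format": "canvas"})
put("ads", "ad-2", {"format": "studio", "has_video": True})
print(get("ads", "ad-2"))
```

Adding the has_video field required no schema migration. The cost is that the application, not the database, must cope with records that lack the field.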
Figure: Wide-column store. Wide-column stores are also fast and are nearly as simple as key-value stores. They include a primary
key, an optional secondary key, and anything stored as a value. (Diagram labels: primary key, secondary key, values. Keys and values can be sparse or numerous.)
4 Relational to Riak, Basho Technologies white paper, http://www.basho.com/assets/RelationaltoRiak.pdf, accessed April 3, 2015.
Data storage options for wide-column and key-value stores
Wide-column and key-value stores vary and
are designed for particular classes of use cases
that determine their cost and performance
tradeoffs. Some of the main distinctions and
tradeoffs include these:
• Hard disk drive: Hard disk drive (HDD)
storage on large collections of distributed
compute clusters has become quite
inexpensive, and the viability of some
companies’ business models depends on
choosing the lowest-cost storage options.
As a small startup, GumGum, for example,
wants to minimize storage cost per bit and
yet capture the essential big data about all
the pages and photos in inventory produced
by the publishers it represents. So it uses a
public cloud infrastructure service with an
open-source version of Apache Cassandra
that is optimized for spinning disk and a
distributed environment. In Cassandra,
once the rows are committed to disk,
they’re immutable—later changes appear in
subsequent rows. With this approach, writes
to disk are non-blocking and faster.
• DRAM and SSD with a caching layer:
In-memory databases are optimized to store
data in random access memory, or RAM,
for better performance. Databases can be
more or less “in-memory” depending on
the use case and storage architecture. The
lower-cost options rely more on disk plus an
in-memory caching layer.
DataStax Enterprise offers a distribution
of Cassandra with three options: spinning
disk, solid-state disk (SSD), or dynamic
RAM (DRAM) caching. Similarly, Apache
Accumulo adds two caching layers to the
spinning disk storage option, and Oracle
Berkeley DB and Amazon DynamoDB
include a caching layer with an SSD option.
• DRAM and/or SSD all in-memory: Some
databases load the entire key-value store
into main memory. That extra speed sought
through optimizing for RAM comes at a
cost, because RAM is much more expensive
than disk, and big data analytics implies
huge data volumes. With their in-memory
solutions, products such as Aerospike,
Redis, and Riak often claim latency of a
millisecond or less, and Aerospike offers
either DRAM with spinning disk persistence
or a hybrid DRAM/SSD alternative to
address the volatility associated with DRAM
(that is, the data loss potential of DRAM
if a power loss occurs). The latter takes
advantage of a proprietary file structure
designed to access flash RAM.
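The immutable-write approach mentioned in the hard disk drive option above can be sketched as an append-only log. This is a simplified illustration of the general technique, not Cassandra's actual storage engine; the class AppendOnlyStore and its methods are hypothetical.

```python
from typing import Any, Dict, List, Optional, Tuple


class AppendOnlyStore:
    """Toy append-only store: writes never modify earlier entries."""

    def __init__(self) -> None:
        self._log: List[Tuple[str, Any]] = []   # stands in for sequential disk writes
        self._index: Dict[str, int] = {}        # key -> position of the latest entry

    def write(self, key: str, value: Any) -> None:
        # An update is just another append; the older entry stays untouched.
        self._log.append((key, value))
        self._index[key] = len(self._log) - 1

    def read(self, key: str) -> Optional[Any]:
        # The most recent entry for a key wins.
        pos = self._index.get(key)
        return None if pos is None else self._log[pos][1]

    def history(self, key: str) -> List[Any]:
        # Every version ever written for the key, oldest first.
        return [value for k, value in self._log if k == key]


db = AppendOnlyStore()
db.write("page:1", {"images": 2})
db.write("page:1", {"images": 3})  # supersedes the first entry
print(db.read("page:1"), db.history("page:1"))
```

Because a write only appends, it avoids seeking to and rewriting existing data, which suits spinning disks. The tradeoff in this sketch is that superseded entries accumulate until something reclaims them.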
Other considerations
There are many other factors to weigh when
thinking about how a key-value or wide-column store might complement an existing
data architecture. These include strategies to
support overall data management goals and
cost versus performance.
In-Hadoop database designs. With Hadoop,
organizations can preserve many petabytes of
heterogeneous data in its original, full-fidelity
format in a unified, low-cost, distributed
computing environment. This capability
makes Hadoop attractive for a data lake
scenario5 and enterprise-wide, exploratory
analytics across silos. Some wide-column
stores—such as Apache HBase, Apache
Accumulo, and MapR-DB—are designed
to work on top of the Hadoop Distributed
File System (HDFS) and in conjunction
with other services in the Hadoop stack,
such as ZooKeeper and Thrift. In-Hadoop
databases do introduce a level of management
complexity that can be considerable, because
they’re built on the rapidly evolving, multi-tier
Hadoop stack.
GumGum, for instance, started with Apache
HBase on Hadoop, but then encountered
some perplexing troubleshooting challenges.
“HBase uses HDFS and ZooKeeper,” says
Vaibhav Puranik, director of engineering at
GumGum. “HBase runs multiple processes
on a node [region server], so whenever there
was a problem, we didn’t know whether the
HBase processes, the Hadoop processes,
or something else caused the problem. To
maintain HBase, you must maintain three or
four pieces of software together, whereas with
Cassandra, we have just one simple process
running on every single node.”
After a single-point failure caused data loss
(a known risk of using Hadoop 1.0 since
rectified by version 2.0), GumGum decided to
move to Cassandra alone. The main Hadoop
distributors do claim newer HBase and
Hadoop management functionality, which
may address some of the other problems
GumGum encountered.6
Cost-performance tradeoffs that affect
database performance. Other related
variables that have cost-performance tradeoffs
include networked versus direct-attached
storage, virtualized versus physical hardware,
and how the data is written or ingested.
Conclusion: Match the use
case to the key-value or wide-column store alternative
Since 2006, when Google first published its
paper on Bigtable,7 the database world has
seen a proliferation of wide-column and key-value store options that reflect a range of
alternatives. Users can match each specific
use case to the database that best suits it.
The data model simplicity of the key-value
store family helps companies that need to
process very large volumes of perishable,
structured data. They can spread out the
processing over large, distributed, commodity
computer clusters, and in-memory options
lower the level of latency inherent in
networked systems. Those clusters can scale
out linearly, allowing companies to affordably
process more and more data as the business
scales up.
5 See “Data lakes and the promise of unsiloed data,” PwC Technology Forecast 2014, Issue 1, http://www.pwc.com/us/en/technologyforecast/2014/cloud-computing/features/data-lakes.jhtml, for more information on data lake architecture and strategy.
6 For more information, see “MapR-DB In-Hadoop NoSQL Database” at https://www.mapr.com/products/mapr-db-in-hadoop-nosql,
“Cloudera Manager Backup and Disaster Recovery” at http://www.cloudera.com/content/cloudera/en/documentation/clouderamanager/v5-0-0/PDF/Cloudera-Manager-Backup-Data-Recovery.pdf, and “Apache HBase” at http://hortonworks.com/hadoop/hbase/,
accessed April 7, 2015.
7 Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes,
and Robert E. Gruber, Bigtable: A Distributed Storage System for Structured Data, Google research paper, 2006, http://static.
googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/bigtable-osdi06.pdf, accessed April 7, 2015.
Case studies such as the GumGum example
described earlier help to highlight how many
different tradeoffs and capabilities factor into
choosing a database. In a polyglot persistence
environment, several database types are
used, each matched to the needs of a particular
function. For example, users tolerate somewhat
more latency in a shopping cart function than
they would in a gaming session. Some features of a
database—such as many different Redis data
types—lend themselves to numerical data,
whereas Cassandra seems more suited for
non-numerical data.8 Atomicity, consistency,
isolation, durability (ACID) compliance
might make sense for situations when a key-value store is used to augment a relational
database directly.
The remaining modules of this issue of the
Technology Forecast will examine some of
the more forward-looking trends, including
immutability, hybrids and the use of
innovative data store technology in data-driven application stacks. Some of these
frameworks and platforms offer dynamic data
models in which the model mapping occurs
in-memory. The next generation of NoSQL
and NewSQL databases promises to build on
the first.
8 Itamar Haber, “The Top 3 Game Changing Redis Use Cases,” Redis Labs blog, April 3, 2014, https://redislabs.com/blog/the-top-3-gamechanging-redis-use-cases, accessed April 7, 2015.
To have a deeper conversation
about remapping the database
landscape, please contact:
Gerard Verweij
Principal and US Technology
Consulting Leader
+1 (617) 530 7015
Chris Curran
Chief Technologist
+1 (214) 754 5055
Oliver Halter
Principal, Data and Analytics Practice
+1 (312) 298 6886
Bo Parker
Managing Director
Center for Technology and Innovation
+1 (408) 817 5733
About PwC’s Technology Forecast
Published by PwC’s Center for Technology
and Innovation (CTI), the Technology Forecast
explores emerging technologies and trends
to help business and technology executives
develop strategies to capitalize on technology
opportunities.
Recent issues of the Technology Forecast have
explored a number of emerging technologies
and topics that have ultimately become
many of today’s leading technology and
business issues. To learn more about the
Technology Forecast, visit www.pwc.com/
technologyforecast.
About PwC
PwC US helps organizations and individuals
create the value they’re looking for. We’re a
member of the PwC network of firms in 157
countries with more than 195,000 people.
We’re committed to delivering quality in
assurance, tax and advisory services. Find
out more and tell us what matters to you by
visiting us at www.pwc.com.
© 2015 PricewaterhouseCoopers LLP, a Delaware limited liability partnership. All rights reserved. PwC refers to the US member firm, and may
sometimes refer to the PwC network. Each member firm is a separate legal entity. Please see www.pwc.com/structure for further details. This
content is for general information purposes only, and should not be used as a substitute for consultation with professional advisors. MW-15-1351 LL