
www.pwc.com/technologyforecast

Technology Forecast: Remapping the database landscape
Issue 1, 2015

How NoSQL key-value and wide-column stores make in-image advertising possible

By Alan Morrison

Online ad innovators must process hundreds of terabytes a day at the lowest possible cost. How do they do it?

To be useful, sometimes big data just needs a good traffic cop. NoSQL [1] key-value stores and wide-column stores serve that function. For example, they enable fast, scalable, targeted, and more compelling online ad placement.

Consider the following online ad by GumGum, a pioneer of in-image advertising:

You’re surfing the web, looking at home improvement and garden ideas. You land on an article about lawns. The first thing you see is a close-up photo, front and center, of a lush patch of green lawn with a pop-up sprinkler at work. Wait, now the photo fades to black. A new ad type called a canvas starts to appear over the photo, showing pieces of lawn equipment popping up one at a time, until the whole canvas is complete. After a moment, it changes again. The canvas collapses into an ad called a studio, which has an interactive bar. You can see the photo again. Out of curiosity, you click the Learn More button and watch a video that shows a modular approach to yard equipment: one chassis and engine for several equipment types (a lawn mower, a leaf blower, a snow blower). The message: This modular solution takes up much less space in your garage than three larger, standalone pieces. [2]

[1] Structured query language, or SQL, is the dominant query language associated with relational databases. NoSQL stands for not only structured query language. In practice, the term NoSQL is used loosely to refer to non-relational databases designed for distributed environments, rather than the associated query languages. PwC uses the term NoSQL, despite its inadequacies, to refer to non-relational distributed stores because it has become the default term of art.
See the section “Database evolution becomes a revolution” in the article “Enterprises hedge their bets with NoSQL databases,” PwC Technology Forecast 2015, Issue 1, http://www.pwc.com/us/en/technologyforecast/2015/remapping-database-landscape/features/enterprises-nosql-databases.jhtml, for more information on relational versus non-relational database technology.

[2] Source: GumGum demo, April 3, 2015

The scenario just described is contextually relevant, in-image advertising. Each in-image ad complements the existing image on the web page. In this case, a canvas ad for a suite of yard-care equipment appears with a photo of a lawn in an article about lawn care.

Now think about all the big data storage, processing, and retrieval requirements GumGum’s system must meet just to be able to do what you’ve watched online:

• Awareness and understanding of all the photos and pages on websites from thousands of publishers. These photos are the available inventory that GumGum matches ads to. Awareness and understanding at scale implies image recognition and text-mining capability; that is, a way for machines to read and recognize text and images.
• Awareness and understanding of the target sub-audiences for each ad as conveyed by an analysis of anonymous data from publishers about their readers.
• Inexpensive but highly capable cloud-based, petabyte-scale [3] storage, processing, and retrieval of the data previously described to support ad placement and serving decisions in near real time.

A wide-column or key-value store such as Cassandra, Amazon DynamoDB, Redis, or Riak is central to each of these requirements.

“Latency is very important to any advertising,” says GumGum CTO Ken Weiner in an interview with PwC.
“We must select and show ads to users in as little time as possible. GumGum also participates in real-time bidding integrations with other companies, where we have only milliseconds to make decisions and to figure out what ad we’ll serve. So we need to be able to access that data quickly.”

Wide-column stores and key-value stores are equally suitable for use cases that require storing, organizing, and quickly retrieving huge amounts of data that require little analysis or modeling; for example, personalizing retail website experiences and organizing Internet of Things (IoT) data from sensors.

Wide-column stores and key-value stores, the topic of this article, are among many innovations creating a sea change in database technology. This issue of the Technology Forecast explores the promise and upheaval caused by these various new technologies.

[3] Digital advertising companies process huge amounts of data, and GumGum is no exception. In 2014 in VentureBeat, John Koetsier reported about AdRoll’s data storage and processing requirements: “Retargeting leader AdRoll announced that it is processing a massive 130 terabytes of advertising data daily and has reached 10 petabytes of stored data from the last 12 months. That’s 90 times the volume of data generated by all the stock exchanges in the U.S.” (John Koetsier, “AdRoll hits gigantic 130 terabytes of ad data processed daily, says size matters,” VentureBeat, November 12, 2014, http://venturebeat.com/2014/11/12/adroll-hits-gigantic-130-terabytes-of-addata-processed-daily-says-size-matters/, accessed April 3, 2015.)

What are key-value and wide-column stores?

A key-value store is a highly simplified version of a database that is also highly optimized for a few primary capabilities. It stores individual elements, which could be a digital representation of any type of value, large or small.
Key-value stores are optimized for speed and scalability and are often used in caching applications that require extremely high throughput. They derive their speed and scalability from a simple data model and low overall database complexity.

Figure: Key-value store. Key-value stores offer very high speed via the least complicated data model: anything can be stored as a value, as long as each value is associated with a key or name. (Diagram labels: Key, Value.)

Wide-column stores are also fast and are nearly as simple as key-value stores. They include a primary key, an optional secondary key, and any grouping of digital bits stored as a value. A wide-column store can be considered part of the larger family of key-value stores. A wide-column store is also a row store, despite the name. Each row contains a different record.

Key-value and wide-column store designers see advantages inherent in the simplest, most stripped-down data models for certain use cases that require application development speed or flexibility. Consider the example of Basho Technologies Riak, a key-value store. Riak groups keys and values into “buckets,” and the stored values are opaque binary objects that can hold any data type. Basho Technologies asserts that a simple data model simplifies application development, and that “new features to the application will not require updating a schema or changing the data model, ideal for applications where rapid iterations are required and changes in the underlying data model are undesirable.” [4] However, data ingested into a key-value store must be in the form of key-value pairs.

Figure: Wide-column store. Wide-column stores are also fast and are nearly as simple as key-value stores. They include a primary key, an optional secondary key, and anything stored as a value. Keys and values can be sparse or numerous. (Diagram labels: Primary key, Secondary key, Values.)

[4] Relational to Riak, Basho Technologies white paper, http://www.basho.com/assets/RelationaltoRiak.pdf, accessed April 3, 2015.
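The two data models described above can be sketched in a few lines of Python. This is only an in-memory illustration under stated assumptions, not how Riak, Cassandra, or any other product is implemented. The class names, method names, and sample keys (`KeyValueStore`, `WideColumnStore`, `page:lawn-article`, and so on) are hypothetical and chosen for this example.

```python
from typing import Any, Dict, Hashable, Optional


class KeyValueStore:
    """Toy in-memory key-value store: each key maps to one opaque value."""

    def __init__(self) -> None:
        self._items: Dict[Hashable, Any] = {}

    def put(self, key: Hashable, value: Any) -> None:
        # Any value is accepted, as long as it is associated with a key.
        self._items[key] = value

    def get(self, key: Hashable, default: Any = None) -> Any:
        return self._items.get(key, default)


class WideColumnStore:
    """Toy wide-column store: a primary key selects a row, and an
    optional secondary key selects a value within that row."""

    def __init__(self) -> None:
        self._rows: Dict[Hashable, Dict[Optional[Hashable], Any]] = {}

    def put(self, primary_key: Hashable, value: Any,
            secondary_key: Optional[Hashable] = None) -> None:
        # Rows are created on first write, so they can be sparse:
        # different rows may hold entirely different secondary keys.
        self._rows.setdefault(primary_key, {})[secondary_key] = value

    def get(self, primary_key: Hashable,
            secondary_key: Optional[Hashable] = None,
            default: Any = None) -> Any:
        return self._rows.get(primary_key, {}).get(secondary_key, default)

    def row(self, primary_key: Hashable) -> Dict[Optional[Hashable], Any]:
        # Return a copy so callers cannot mutate the stored row.
        return dict(self._rows.get(primary_key, {}))


# Hypothetical keys and values, for illustration only.
kv = KeyValueStore()
kv.put("image:lawn-closeup", b"placeholder image bytes")
kv.put("page:lawn-article", {"topic": "lawn care"})

wc = WideColumnStore()
wc.put("page:lawn-article", "lawn care", secondary_key="topic")
wc.put("page:lawn-article", 3, secondary_key="image_count")
wc.put("page:garden-article", "gardening", secondary_key="topic")

print(kv.get("page:lawn-article"))
print(wc.row("page:lawn-article"))
print(wc.get("page:garden-article", "image_count", default="(not set)"))
```

The wide-column sketch is a key-value store whose values are themselves small maps, which is one way to read the statement that wide-column stores belong to the larger key-value family. The last lookup shows sparseness: the second row simply has no `image_count` entry.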
Data storage options for wide-column and key-value stores

Wide-column and key-value stores vary and are designed for particular classes of use cases that determine their cost and performance tradeoffs. Some of the main distinctions and tradeoffs include these:

• Hard disk drive: Hard disk drive (HDD) storage on large collections of distributed compute clusters has become quite inexpensive, and the viability of some companies’ business models depends on choosing the lowest-cost storage options. As a small startup, GumGum, for example, wants to minimize storage cost per bit and yet capture the essential big data about all the pages and photos in inventory produced by the publishers it represents. So it uses a public cloud infrastructure service with an open-source version of Apache Cassandra that is optimized for spinning disk and a distributed environment. In Cassandra, once the rows are committed to disk, they’re immutable; later changes appear in subsequent rows. With this approach, writes to disk are non-blocking and faster.
• DRAM and SSD with a caching layer: In-memory databases are optimized to store data in random access memory, or RAM, for better performance. Databases can be more or less “in-memory” depending on the use case and storage architecture. The lower-cost options rely more on disk plus an in-memory caching layer. DataStax Enterprise offers a distribution of Cassandra with three options: spinning disk, solid-state disk (SSD), or dynamic RAM (DRAM) caching. Similarly, Apache Accumulo adds two caching layers to the spinning disk storage option, and Oracle Berkeley DB and Amazon DynamoDB include a caching layer with an SSD option.
• DRAM and/or SSD all in-memory: Some databases load the entire key-value store into main memory. The extra speed gained by optimizing for RAM comes at a cost, because RAM is much more expensive than disk, and big data analytics implies huge data volumes. With their in-memory solutions, products such as Aerospike, Redis, and Riak often claim latency of a millisecond or less. Aerospike offers either DRAM with spinning disk persistence or a hybrid DRAM/SSD alternative to address the volatility associated with DRAM (that is, the data loss potential of DRAM if a power loss occurs). The latter takes advantage of a proprietary file structure designed to access flash memory.

Other considerations

There are many other factors to weigh when thinking about how a key-value or wide-column store might complement an existing data architecture. These include strategies to support overall data management goals and cost versus performance.

In-Hadoop database designs. With Hadoop, organizations can preserve many petabytes of heterogeneous data in its original, full-fidelity format in a unified, low-cost, distributed computing environment. This capability makes Hadoop attractive for a data lake scenario [5] and enterprise-wide, exploratory analytics across silos. Some wide-column stores, such as Apache HBase, Apache Accumulo, and MapR-DB, are designed to work on top of the Hadoop Distributed File System (HDFS) and in conjunction with other services in the Hadoop stack, such as ZooKeeper and Thrift.

In-Hadoop databases do introduce a level of management complexity that can be considerable, because they’re built on the rapidly evolving, multi-tier Hadoop stack. GumGum, for instance, started with Apache HBase on Hadoop, but then encountered some perplexing troubleshooting challenges. “HBase uses HDFS and ZooKeeper,” says Vaibhav Puranik, director of engineering at GumGum.
“HBase runs multiple processes on a node [region server], so whenever there was a problem, we didn’t know whether the HBase processes, the Hadoop processes, or something else caused the problem. To maintain HBase, you must maintain three or four pieces of software together, whereas with Cassandra, we have just one simple process running on every single node.”

After a single-point failure caused data loss (a known risk of Hadoop 1.0 that version 2.0 has since rectified), GumGum decided to move to Cassandra alone. The main Hadoop distributors do claim newer HBase and Hadoop management functionality, which may address some of the other problems GumGum encountered. [6]

Cost-performance tradeoffs that affect database performance. Other related variables that have cost-performance tradeoffs include networked versus direct-attached storage, virtualized versus physical hardware, and how the data is written or ingested.

Conclusion: Match the use case to the key-value or wide-column store alternative

Since 2006, when Google first published its paper on Bigtable, [7] the database world has seen a proliferation of wide-column and key-value store options that reflect a range of alternatives. Users can match each specific use case to the database that best suits it. The data model simplicity of the key-value store family helps companies that need to process very large volumes of perishable, structured data. They can spread out the processing over large, distributed, commodity computer clusters, and in-memory options lower the level of latency inherent in networked systems. Those clusters can scale out linearly, allowing companies to affordably process more and more data as the business scales up.

[5] See “Data lakes and the promise of unsiloed data,” PwC Technology Forecast 2014, Issue 1, http://www.pwc.com/us/en/technologyforecast/2014/cloud-computing/features/data-lakes.jhtml, for more information on data lake architecture and strategy.
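The idea of spreading key-value data over a cluster of commodity nodes can be illustrated with simple hash partitioning. The article does not describe how any particular product assigns keys to nodes, so the sketch below is an assumption-laden toy: the function `node_for_key`, the node names, and the `page:` keys are all hypothetical.

```python
import hashlib
from collections import Counter
from typing import List


def node_for_key(key: str, nodes: List[str]) -> str:
    """Pick a node for a key by hashing the key (hypothetical helper).

    A stable hash is used instead of Python's built-in hash(), which is
    randomized per process for strings and so would not route the same
    key to the same node across restarts.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


# Hypothetical four-node cluster and ten thousand page keys.
nodes = ["node-a", "node-b", "node-c", "node-d"]
keys = [f"page:{i}" for i in range(10_000)]

# Count how many keys each node would own.
load = Counter(node_for_key(key, nodes) for key in keys)
for name in nodes:
    print(name, load[name])
```

Because each key can be routed without consulting the others, reads and writes for different keys can proceed on different nodes independently, which is what makes the simple data model easy to scale out. One caveat: with plain modulo assignment, changing the number of nodes remaps most keys, which is why distributed stores such as Cassandra rely on consistent hashing instead.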
[6] For more information, see “MapR-DB In-Hadoop NoSQL Database” at https://www.mapr.com/products/mapr-db-in-hadoop-nosql, “Cloudera Manager Backup and Disaster Recovery” at http://www.cloudera.com/content/cloudera/en/documentation/clouderamanager/v5-0-0/PDF/Cloudera-Manager-Backup-Data-Recovery.pdf, and “Apache HBase” at http://hortonworks.com/hadoop/hbase/, accessed April 7, 2015.

[7] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber, Bigtable: A Distributed Storage System for Structured Data, Google research paper, 2006, http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/bigtable-osdi06.pdf, accessed April 7, 2015.

Case studies such as the GumGum example described earlier help to highlight how many different tradeoffs and capabilities factor into choosing a database. In a polyglot persistence environment, several database types are used. Users tolerate somewhat more latency in a shopping cart function than they would in a gaming session. Some features of a database, such as the many different Redis data types, lend themselves to numerical data, whereas Cassandra seems more suited for non-numerical data. [8] Atomicity, consistency, isolation, durability (ACID) compliance might make sense for situations when a key-value store is used to augment a relational database directly.

The remaining modules of this issue of the Technology Forecast will examine some of the more forward-looking trends, including immutability, hybrids, and the use of innovative data store technology in data-driven application stacks. Some of these frameworks and platforms offer dynamic data models in which the model mapping occurs in memory. The next generation of NoSQL and NewSQL databases promises to build on the first.
[8] Itamar Haber, “The Top 3 Game Changing Redis Use Cases,” Redis Labs blog, April 3, 2014, https://redislabs.com/blog/the-top-3-gamechanging-redis-use-cases, accessed April 7, 2015.

To have a deeper conversation about remapping the database landscape, please contact:

Gerard Verweij, Principal and US Technology Consulting Leader, +1 (617) 530 7015
Chris Curran, Chief Technologist, +1 (214) 754 5055
Oliver Halter, Principal, Data and Analytics Practice, +1 (312) 298 6886
Bo Parker, Managing Director, Center for Technology and Innovation, +1 (408) 817 5733

About PwC’s Technology Forecast

Published by PwC’s Center for Technology and Innovation (CTI), the Technology Forecast explores emerging technologies and trends to help business and technology executives develop strategies to capitalize on technology opportunities. Recent issues of the Technology Forecast have explored a number of emerging technologies and topics that have ultimately become many of today’s leading technology and business issues. To learn more about the Technology Forecast, visit www.pwc.com/technologyforecast.

About PwC

PwC US helps organizations and individuals create the value they’re looking for. We’re a member of the PwC network of firms in 157 countries with more than 195,000 people. We’re committed to delivering quality in assurance, tax and advisory services. Find out more and tell us what matters to you by visiting us at www.pwc.com.

© 2015 PricewaterhouseCoopers LLP, a Delaware limited liability partnership. All rights reserved. PwC refers to the US member firm, and may sometimes refer to the PwC network. Each member firm is a separate legal entity. Please see www.pwc.com/structure for further details. This content is for general information purposes only, and should not be used as a substitute for consultation with professional advisors. MW-15-1351 LL
