www.pwc.com/technologyforecast
Technology Forecast: Remapping the database landscape
Issue 1, 2015
Scaling online ad
innovations with the
help of a NoSQL wide-column database
Vaibhav Puranik and Ken Weiner of GumGum discuss
the challenges and benefits of open source databases
for in-image advertising.
Interview conducted by Alan Morrison, Bo Parker, and Tom Foth
PwC: What does GumGum do?
KW: GumGum sells advertising via its
in-image ad platform to brand advertisers in
the Fortune 500. In-image advertising is a
hybrid between display and native advertising;
it’s a way to overlay ads on a photo or an
image. These ads are usually contextually
targeted to complement the images.
Vaibhav Puranik
Vaibhav Puranik is director of
engineering, big data at GumGum.
GumGum works with a few thousand
publishers, and that’s how we secure
our digital inventory. We’re basically
able to sell ad impressions on images to
different advertisers and agencies.
PwC: How are NoSQL databases
important to your business?
KW: For GumGum to target ads properly, we
need an understanding of all of the photos
and images that we see on websites and of all
of the pages that those photos fit on. We also
need some anonymous targeting data that
we might associate with all the users who
look at all those photos and images. So we
need a large database to look up information
that we’ve already computed about each
photo, each page, and each user. Our ad
server uses that information in real time to
make decisions about which ads to serve.
Ken Weiner
Ken Weiner is CTO of GumGum.
VP: To give an example, a photograph of
actress Jennifer Lawrence might appear
on a particular web page. Our software
recognizes automatically that this is
Jennifer Lawrence’s photograph. Once it
does, we can display a trailer ad for The
Hunger Games on that photograph.
PwC: And you store that information
in the NoSQL database?
VP: In an Apache Cassandra database,1 we
save the information that this particular photo
is of Jennifer Lawrence. And then we can use
that information in real time to serve the ads.
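To make the lookup concrete, here is a minimal Python sketch of the wide-column idea: each row is addressed by a key (here an image URL) and carries a sparse set of named columns. It uses a plain in-memory dictionary rather than Cassandra itself, and every name in it (`ImageStore`, `put`, `get`, the `celebrity` column, the example URL, and the `CAMPAIGNS` mapping) is hypothetical, not GumGum's schema or code.

```python
from collections import defaultdict
from typing import Dict, Optional


class ImageStore:
    """Toy stand-in for a wide-column table keyed by image URL (hypothetical)."""

    def __init__(self) -> None:
        # row key -> {column name -> value}; rows may carry different columns
        self._rows: Dict[str, Dict[str, str]] = defaultdict(dict)

    def put(self, image_url: str, column: str, value: str) -> None:
        self._rows[image_url][column] = value

    def get(self, image_url: str, column: str) -> Optional[str]:
        # .get on the outer dict so a miss does not create an empty row
        return self._rows.get(image_url, {}).get(column)


# Hypothetical mapping from a recognized subject to an ad campaign.
CAMPAIGNS = {"Jennifer Lawrence": "The Hunger Games trailer"}


def pick_ad(store: ImageStore, image_url: str) -> Optional[str]:
    """Return a campaign for the image, or None if nothing is known about it."""
    subject = store.get(image_url, "celebrity")
    return CAMPAIGNS.get(subject) if subject else None


store = ImageStore()
store.put("https://example.com/photo.jpg", "celebrity", "Jennifer Lawrence")
print(pick_ad(store, "https://example.com/photo.jpg"))
```

The expensive step, recognizing who is in the photo, happens ahead of time; at serving time the ad server only performs a keyed read, which is what keeps the decision fast.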
PwC: Is latency another factor?
KW: Yes, it’s definitely a factor. Low latency is
very important to any advertising, because a
user’s attention is fleeting. If the ad isn’t served
really close in time to when the image appears,
it may never be seen. So we must select and
show ads to users in as little time as possible.
GumGum also participates in real-time bidding
integrations with other companies, where
we have only milliseconds to make decisions
and to figure out what ad we’ll serve.
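One common way to honor a millisecond budget like this is to score candidate ads until a deadline passes and then answer with the best one seen so far. The Python sketch below illustrates that pattern only; it is an assumption about how such a loop could look, not a description of GumGum's ad server. The names `choose_ad`, `FakeClock`, and the candidate scores are hypothetical, and the fake clock exists just to make the example deterministic.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def choose_ad(
    candidates: Iterable[Tuple[str, Callable[[], float]]],
    budget_ms: float,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Score candidates until the budget runs out; return the best seen so far."""
    deadline = clock() + budget_ms / 1000.0
    best_ad: Optional[str] = None
    best_score = float("-inf")
    for ad_id, score_fn in candidates:
        if clock() >= deadline:
            break  # out of time: answer with whatever we have
        score = score_fn()
        if score > best_score:
            best_ad, best_score = ad_id, score
    return best_ad


class FakeClock:
    """Deterministic clock for the example: each call advances by `step` seconds."""

    def __init__(self, step: float) -> None:
        self.now = 0.0
        self.step = step

    def __call__(self) -> float:
        current = self.now
        self.now += self.step
        return current


# Hypothetical candidates: (ad id, function that computes a relevance score).
ads = [("a", lambda: 0.2), ("b", lambda: 0.9), ("c", lambda: 0.99)]

# Each clock reading costs 4 ms against a 10 ms budget, so "c" is never scored.
print(choose_ad(ads, budget_ms=10, clock=FakeClock(step=0.004)))
```

With the fake clock, the deadline is reached before the third candidate is scored, so the function returns the best of the first two. The trade-off is deliberate: a slightly worse ad delivered on time beats the best ad delivered too late to be seen.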
PwC: What was the challenge
GumGum faced that caused you
to move to a NoSQL database
such as Cassandra?
VP: In 2013, we were using another NoSQL
database called HBase. HBase uses the
Hadoop Distributed File System [HDFS]2 and
ZooKeeper. HBase runs multiple processes
on a node [region server], so whenever there
was a problem, we didn’t know whether the
HBase processes, the Hadoop processes,
or something else caused the problem. To
maintain HBase, you must maintain three
or four pieces of software together, whereas
with Cassandra, we have just one simple
process running on every single node.
PwC: How do you query the
data in Cassandra?
VP: We have apps that would query the data
programmatically. For ad hoc purposes, we use
a tool called Presto, which allows us to write
SQL [structured query language] queries.
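As a rough illustration of what an ad hoc SQL question might look like, the Python sketch below runs an aggregate query against a tiny table. It uses the standard library's in-memory SQLite purely as a stand-in so the example runs anywhere; it is not Presto and not Cassandra, and the `impressions` table, its columns, and the sample rows are all hypothetical.

```python
import sqlite3

# In-memory SQLite is only a stand-in so the example is runnable;
# the table, columns, and rows below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE impressions (image_url TEXT, subject TEXT, served INTEGER)"
)
conn.executemany(
    "INSERT INTO impressions VALUES (?, ?, ?)",
    [
        ("https://example.com/a.jpg", "Jennifer Lawrence", 1),
        ("https://example.com/b.jpg", "Jennifer Lawrence", 0),
        ("https://example.com/c.jpg", "unknown", 1),
    ],
)

# Ad hoc question: how many ads were served per recognized subject?
rows = conn.execute(
    "SELECT subject, SUM(served) AS ads_served "
    "FROM impressions GROUP BY subject ORDER BY subject"
).fetchall()
print(rows)
```

The point of a SQL layer is that an analyst can ask a one-off question like this without writing an application, while production code keeps querying the database programmatically.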
PwC: Are you also looking at
in-memory databases?
VP: One other thing we are looking into
is how we could use Apache Spark in
conjunction with Cassandra. Spark would
allow ad hoc querying on top of Cassandra.
Spark can load Cassandra data into memory
and then execute really, really fast queries on
top of it. Because Spark can work in memory,
it can perform 100 times faster. Spark can
also provide a query processing engine for
Cassandra.
PwC: Does Cassandra come with an
in-memory capability to begin with?
VP: Cassandra does come with in-memory
capability in its enterprise version.
Unfortunately, we are not using that
enterprise version right now, but rather the
Apache license version of Cassandra. I know
people who are using the enterprise version,
and they’re pretty happy with it.
PwC: If you went back in time 10
years and you didn’t have access to
these NoSQL options, what would you
have done? How dependent are you
on the new big data technologies just
to execute your business models?
KW: I think it might have been possible
to do 10 years ago, but it wasn’t as cost-effective. There were solutions back then for
big, vertically powered databases, and you
could get a really powerful, expensive, single
machine. But beyond a certain point, I’m not
sure exactly how that would have worked out.
VP: The reason data is growing so fast is that
you can store it and process it in cheaper ways.
Ten years ago, most companies would not
process that much data because the cost of
processing that data was too high. And now
they are processing much more data, because
they can do it less expensively.
1 Apache Cassandra is a wide-column NoSQL database. For more on key-value and wide-column stores, see “How NoSQL key-value and
wide-column stores help manage data in high-volume environments,” PwC Technology Forecast 2015, Issue 1, http://www.pwc.com/nosql.
2 See “Data lakes and the promise of unsiloed data,” PwC Technology Forecast 2014, Issue 1, http://www.pwc.com/us/en/technologyforecast/2014/cloud-computing/features/data-lakes.jhtml, for more information on Hadoop and HDFS.
To have a deeper conversation about remapping the database
landscape, please contact:
Gerard Verweij
Principal and US Technology
Consulting Leader
+1 (617) 530 7015
[email protected]
Chris Curran
Chief Technologist
+1 (214) 754 5055
[email protected]
Oliver Halter
Principal, Data and Analytics Practice
+1 (312) 298 6886
[email protected]
Bo Parker
Managing Director
Center for Technology and Innovation
+1 (408) 817 5733
[email protected]
About PwC’s Technology Forecast
Published by PwC’s Center for Technology
and Innovation (CTI), the Technology
Forecast explores emerging technologies
and trends to help business and technology
executives develop strategies to capitalize on
technology opportunities.
Recent issues of the Technology Forecast have
explored a number of emerging technologies
and topics that have ultimately become
many of today’s leading technology and
business issues. To learn more about the
Technology Forecast, visit www.pwc.com/technologyforecast.
© 2015 PricewaterhouseCoopers LLP, a Delaware limited liability partnership. All rights reserved. PwC refers to the US member firm, and may sometimes refer to the PwC network. Each member firm
is a separate legal entity. Please see www.pwc.com/structure for further details. This content is for general information purposes only, and should not be used as a substitute for consultation with
professional advisors. MW-15-1351 LL