www.pwc.com/technologyforecast
Technology Forecast: Remapping the database landscape
Issue 1, 2015
Database futures: How
Apache Spark1 fits in to
a larger unified data
architecture
Mike Franklin of the University of California, Berkeley,
discusses the goals behind Spark and a more unified
cloud-data ecosystem.
Interview conducted by Alan Morrison, Bo Parker and Bud Mathaisel
PwC: What are you seeing in terms
of the data technology that’s coming
online, and how is that changing?
Mike Franklin
Mike Franklin is Director of the
Algorithms, Machines and People
Lab (AMPLab), and Chair of the
Computer Science Division at UC
Berkeley.
MF: I was trained as a database person. Most
of my research and work in industry has
been along the lines of traditional relational
databases, object-oriented databases,
and so on. That’s my perspective. I think
things are in the process of changing in a
fundamental way. In the 30 years or so that
I’ve been involved in this area in various
ways, this is the biggest change I’ve seen
since the relational database revolution.
There wasn’t a generally accepted alternative
to the relational database in its heyday.
You either figured out how to use it, or you
were on your own. There were attempts
over the years to try to change that, such as
object-oriented databases, XML-oriented
databases, and the like. None of them really
got the traction that we’re starting to see
now in some of these newer systems.
More recently, the appreciation of the potential
value in data has spread everywhere, across
all industries. Basically, every department at
UC Berkeley, for instance, is doing data-driven
work. That’s led to an explosion in the sort of
use cases and the types of data that people
want to store and analyze. I’m not sure that
the formats of the data have changed all that
much, but the variety of the data that people
want to store is what’s really changed.
PwC: How has that impacted
the nature of databases?
MF: Users need flexibility more than anything
else, so you can take different approaches
to answering your question. One approach
is an idea expressed as store first, schema
later. The first step in a traditional database
environment is to analyze your application
and organization, then do a data design and
schema design—where you figure out what
each piece of data has to look like and how
all of the data are interrelated. Only after
you’ve gotten through that process can you
start thinking about putting your system
together and loading any data into it.
Now you just collect as much data as
you can, because the alternatives in
storage are incredibly cheap relative to
what they used to be. So you just store
everything you can get your hands on.
Some people call this a data lake.2
About ten years ago, my colleagues and I
wrote a paper about something that we called
data spaces, which is exactly the same idea,
where you don’t impose structure and don’t
require data to conform to a structure in order
to be able to store it. With that concept, you
store the data and then figure out how much
structure you need in order to make sense of
the data and do what you need to with it.
The biggest gain in the flexibility of data
management is that you can store anything
first, and then try to make sense of it and
work with it.
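The store-first, schema-later pattern can be sketched in a few lines of Python; the record fields and the in-memory "store" here are invented for illustration:

```python
import json

# Ingest: append raw records as JSON lines; no schema is enforced up front.
raw_store = []
raw_store.append(json.dumps({"user": "alice", "clicks": 3}))
raw_store.append(json.dumps({"user": "bob", "clicks": 5, "region": "EU"}))  # extra field, still accepted

# Schema later: impose only as much structure as one question requires.
def total_clicks(store):
    """Project out the single field needed, tolerating missing values."""
    return sum(json.loads(line).get("clicks", 0) for line in store)

print(total_clicks(raw_store))  # 8
```

The point is the ordering: storage accepts anything, and structure is applied at read time, per question, rather than as a prerequisite for loading.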
PwC: For the past couple of years,
the assumption has been that you’ve
got this heterogeneous environment
and there are different tools that
you can use, depending on what your
needs are. You could use a document
store here, a graph database there,
while continuing to use a relational
database for critical transaction
support. Is there a more unified
approach, or do we have to accept
database heterogeneity long-term?
MF: From my perspective it’s obvious that
heterogeneity is not the right way to do
things. If you think about the history of the
Spark project and the ecosystem we’ve built
around it, Spark was originally a research
project. We didn’t worry too much about
things that you would worry about if you
were building a product from Day One.
But our view of the Spark ecosystem is that you
don’t need all those isolated data systems. Once
you’ve got the data into our Spark environment,
you should be able to treat that data any
way you want to. If you want to run SQL
[structured query language] queries over it,
we’ll let you do that. If you want to look at the
data in the form of a graph, we have a system
called GraphX that will let you do that. If you
want to apply machine learning, including
clustering and recommendation
systems, we’ve got libraries that will let you
do that. And if you want to write low-level
code to do something to the data we haven’t
even imagined, you can do that, too.
PwC Technology Forecast Database futures: How Apache Spark fits in to a larger unified data architecture
When we created the Spark ecosystem, we
had the advantage of not having to hit revenue
targets to please impatient investors. We
had the time to build out a comprehensive
ecosystem. We did it in an open way, so that
we’re not the only ones contributing and
extending the ecosystem for the benefit of all.
Of course, any organization is limited by
the technologies it has and knows how to
use. But there’s no technical reason why
you need a dozen different systems to do
a dozen different things. There’s enough
commonality in the patterns of data
acquisition and analytics to enable more
effective management of data with a single
enterprisewide data management framework.
PwC: What about meeting the
requirements of operational data,
which tends to need more structure?
MF: Data structure poses interesting questions:
When do you need it, and how much do you
need? I look at structure as a spectrum, from
completely unstructured data, like just a bag
of bytes, all the way to a relational schema or
maybe even something more sophisticated in
terms of structure at the opposite extreme.
You can think of data structure options as
a choice about what tradeoff you’ll accept
to meet your needs. On the unstructured
side, you get incredible flexibility. You get
the ability to bring in whatever data you
find, keep it and, hopefully, do something
useful with it. But as you move towards
the increasingly structured side of data
management, you gain more confidence
about what your data mean, and how that
data can be applied to benefit your business.
As you move to operational systems that
house valuable data, where consistency
in the structure and management of that
data is the top priority, you have to give
up flexibility in the way that you structure
and manage the data. You apply rules and
constraints to get more predictable results.
What’s changed today is that the same
system can support different points along
the data-structure spectrum. With systems
like Spark and some other NoSQL [not only
structured query language] environments,
you get to pick different points along
the data-structure spectrum in the same
environment. At least that’s the goal, because
I’m not sure that capability fully exists yet.
PwC: How does the Spark ecosystem
address the operational aspect?
MF: Spark is focused on analytics, but we’re
in the process of building out capabilities
that will be more operational. Spark is one
component of something we’re building called
the Berkeley Data Analytics Stack [BDAS]. And
Spark is the middle, or main, part of that stack.
But we’re building a new component of BDAS
called Velox. And Velox, if you look at the
architecture diagram [see below], sits right
next to Spark in the middle of the architecture.
The whole point of Velox is to handle more
operational processes. It will handle data
that is updated as it becomes available, rather
than loaded in bulk, as is common
for analytics. Machine learning models also
get updated in real time as new
data is processed.
PwC: What are the problems
you’re trying to solve on
the operational side?
MF: We’re trying to create a system that lets
you move all your data from one environment
to another as you go from the operational
side to analytics. When we succeed, you’ll
be able to move your data in a closed loop
between operations and analytics, so that
analytics can directly inform your operations.
PwC: How do you support
different user groups?
MF: The users of our tool include both
front-line analysts in a security operations center
and more advanced security investigators and
incident handlers. Often, certain sensitive
types of data are not available to the front-line
analysts, but the more advanced investigators
would be able to see all the data.
The missing piece in BDAS
[Figure: the Berkeley Data Analytics Stack. On the training side, BlinkDB, Spark Streaming, MLbase, GraphX, Spark SQL, and the ML library run on Spark; on the management and serving side, Velox sits alongside Spark. Both rest on Mesos and Tachyon, over storage such as HDFS and S3.]
Source: Dan Crankshaw, UC Berkeley AMPLab, http://www.slideshare.net/dscrankshaw/velox-at-sf-data-mining-meetup, 2015
PwC: Some people think that business
logic should not be contained in a
database. For example, say someone
decides not to order supplies from a
specific vendor for fear of violating
compliance regulations, even though
database systems have long been
used to assert compliance. Now the
practice of asserting compliance
through a database is being
questioned because the proof of
business logic is in its application,
rather than inside a database. Would
you agree that putting business logic
in the database itself was a bad idea
to start with, something we should
move away from?
MF: Your question makes me wonder if there’s
a third alternative, a shared system for business
logic other than the database. Maybe the right
idea all along was to pull business logic out
of individual applications, so you had a single
source of truth for business logic. In that case,
putting business logic in the database would
have been the wrong way to achieve that.
Maybe there is an ideal shared repository for
business logic that isn’t the database. I can
see that. You may have just given me my new
research project.
PwC: What caused the rise of
NoSQL? Is NoSQL enough?
MF: My view on the rise of NoSQL is that
database systems require too much upfront
work before you can get anything done. The
first time I hand-typed an HTML [HyperText
Markup Language] page, and then pointed
a browser at it, the page wasn’t quite right.
But a lot of the page looked okay. That was a
revelation to me as a database guy. Because if
you did that to a database, it would just return
a syntax error. It wouldn’t show you anything
until you did it perfectly right.
The NoSQL systems are much more forgiving.
But at some point you run into a wall with
NoSQL. The truth is that once you reach a
certain stage, structured query language is a
really good tool for a lot of what people want
to do with data.
A flexible and incremental approach to
structure is the most valuable. You start by
loading whatever data you want. Just throw it
in. Make it super easy. Then, as you need more
structure and definitions to get the results
you need, and comply with any mandatory
procedures, you start imposing control on
your data.
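That incremental path, from "just throw it in" to SQL, can be sketched with Python's standard library; sqlite3 stands in for the structured side here, and the field names are invented:

```python
import json
import sqlite3

# Stage 1: collect loosely structured records with no upfront schema.
raw = [json.dumps(r) for r in [
    {"item": "widget", "qty": 2},
    {"item": "gadget", "qty": 7, "note": "rush order"},  # extra field, still accepted
]]

# Stage 2: once a question demands structure, project the needed fields
# into a schema and switch to SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (item TEXT, qty INTEGER)")
for line in raw:
    rec = json.loads(line)
    conn.execute("INSERT INTO orders VALUES (?, ?)", (rec["item"], rec["qty"]))

total = conn.execute("SELECT SUM(qty) FROM orders").fetchone()[0]
print(total)  # 9
```

Records that don't fit the schema stay in the raw store untouched; only the fields a question actually needs are promoted into structure.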
1 Apache Spark is a big data processing engine originally conceived as an in-memory alternative to MapReduce. For some in-memory workloads, it can run
up to 100 times faster than Hadoop MapReduce. It can store results in any Hadoop-compatible file system or database, such as HDFS [Hadoop Distributed
File System], HBase, and Cassandra, or in Amazon’s Simple Storage Service [S3].
2 See “Data lakes and the promise of unsiloed data” at http://www.pwc.com/us/en/technology-forecast/2014/cloud-computing/features/
data-lakes.jhtml for more information on data lakes.
To have a deeper conversation about remapping the database
landscape, please contact:
Gerard Verweij
Principal and US Technology
Consulting Leader
+1 (617) 530 7015
[email protected]
Chris Curran
Chief Technologist
+1 (214) 754 5055
[email protected]
Oliver Halter
Principal, Data and Analytics Practice
+1 (312) 298 6886
[email protected]
Bo Parker
Managing Director
Center for Technology and Innovation
+1 (408) 817 5733
[email protected]
About PwC’s Technology Forecast
Published by PwC’s Center for Technology
and Innovation (CTI), the Technology
Forecast explores emerging technologies
and trends to help business and technology
executives develop strategies to capitalize on
technology opportunities.
Recent issues of the Technology Forecast have
explored a number of emerging technologies
and topics that have ultimately become
many of today’s leading technology and
business issues. To learn more about the
Technology Forecast, visit www.pwc.com/
technologyforecast.
© 2015 PricewaterhouseCoopers LLP, a Delaware limited liability partnership. All rights reserved. PwC refers to the US member firm, and may sometimes refer to the PwC network. Each member firm
is a separate legal entity. Please see www.pwc.com/structure for further details. This content is for general information purposes only, and should not be used as a substitute for consultation with
professional advisors. MW-15-1351 LL