Architecting the data layer
for analytic applications
A look at solutions designed to work
in today’s data-intensive, high-volume, read-only environment
Spring 2011
Advisory Services
Table of contents
The heart of the matter
Strong analytical capabilities—
at the core of today’s business
An in-depth discussion
Exploring architectural alternatives—
Finding the right fit for your
unique organization
What this means for your business
Leveraging valid, timely data—
Seizing a competitive advantage
The heart of the matter
Strong analytical
capabilities—at the core
of today’s business
The large volumes of both structured
and unstructured data currently
bombarding global organizations are
intensifying their need for stronger
analytical capabilities. Analytics are
a major component of any Business
Intelligence (BI) application, but traditional data strategies no longer work
in today’s data-intensive, high-volume,
read-only environment for large-scale
data analysis (BigData) and very large
data warehouses.
As a result, data architects are struggling to tailor a data processing and
storage strategy that not only fits their
organization’s unique analytic needs
and circumstances, but is also aligned
with management’s overall business
goals. This is not an easy task. While
the emergence of non-traditional niche
architectures has expanded the scope
of possibilities from which to select, it
has also added more complexity to the
mix. Today’s challenging environment
requires efficient query throughput
and bulk data loading within
acceptable latency limits. The mingling of
structured and unstructured data, along
with the advent of Semantic Web technologies, requires experimentation with
some of these emerging technologies as
data processing and storage models.
As we see it, in today’s dynamic IT environment, strong analytical capabilities
are at the core of successful business
initiatives. Those organizations that
carefully plan their data-management
strategy with their overall business
goals in mind are the ones that will
find themselves positioned to seize
a competitive advantage in today’s
dynamic global marketplace. It is
imperative for the business to leverage
this information as a true asset. To
support smart business decision-making and maintain a competitive
edge, managers are now demanding
access to operational data, historical
data and data collected from extended
enterprises and multiple channels in
near real-time. It is incumbent
on IT to provide faster access and
better visualization tools to empower
business users to effectively analyze
and discover or mine this data.
Business and IT strategy must be
aligned, and enterprise architects must
provide the bridge between the business and technology.
So where do you begin? There is no
one-size-fits-all strategy, nor is there
one obvious “best-choice” technology
to implement. Before setting out to
design any BI application, it is critical
to put in the necessary time and effort
up front to work your way through the
maze. Ask yourselves some key questions to determine which features are
most important to your company. With
those answers top of mind, it’s time
to compare the pros and cons of new
alternative technologies. Only then will
you be ready to plan your strategy and
implement the solution that is the best
fit for your unique company.
To help you navigate through the
maze of available alternatives, this
paper examines the data or persistence
layer in the BI stack, comparing and
contrasting some of the key architectures now available to you. Among
the options we address are: Parallel
Database Management Systems
(DBMS), Map-Reduce (MR) distributed
processing systems, In-Memory databases (IMDB), Distributed Hash Tables
(DHT), Hadoop Distributed File Systems
(HDFS) and Column Databases.
An in-depth discussion
Exploring architectural
alternatives—Finding
the right fit for your
unique organization
Asking—and answering—the right questions
Traditional data strategies no longer
work in the current data-intensive,
high-volume, read-only environment,
which requires efficient query
throughput and bulk data loading
within acceptable latency limits. The
large volume of data flooding organizations today has intensified their need to
strengthen their BI analytical capabilities. New alternative architectures are
emerging to meet this challenge. The
mingling of structured and unstructured data—coupled with the advent
of Semantic Web technologies—is
driving data architects to experiment
with some of these new alternatives to
explore their potential as effective data
processing and storage models.
This paper takes a look at the data
layer to compare and contrast some
of the key architecture choices now
available to you. But to design the
right solution—one that will best meet
your organization’s unique needs and
circumstances—you must first ask
yourselves some key questions.
Data storage—Choosing
between Shared-Nothing or
Shared-Disk architecture
Traditionally, clustered solutions
have provided scalability and High
Availability (HA) or fail-over capabilities for data storage and retrieval.
Clustering is typically enabled by
deploying either Shared-Nothing
or Shared-Disk architecture for
data storage—the choice depends
upon whether scalability, consistency or failover is important for
your application.
Before designing any BI application, it is important to consider what
matters most for your organization and what new technologies are out
there to meet those needs. Ask yourselves these and other architectural questions:
• What are some of the emerging trends and specialized data
storage techniques that are being tested and used by companies to
drive innovation and competitive advantage?
• When should we use clustered solutions for data stores and data
services?
• What applications are appropriate for the MR framework—
or would Parallel DBMS or DHTs be a better choice for our
organization?
• For analytical applications such as data warehousing, would it be
best to use column databases or appliances? Or would an IMDB,
Object Database or NoSQL solution be the best fit?
• Will Consistency, Availability, Partitioning (CAP) characteristics be sufficient for our needs, allowing us to sacrifice Atomicity,
Consistency, Isolation and Durability (ACID) characteristics?
To help you formulate answers to key architectural questions such as
these, we discuss the pros and cons of the alternative approaches now
available to you, presenting our point of view on their suitability under
various circumstances.
• Shared-Nothing is a data architecture for distributed data storage in a
clustered environment where data is
partitioned across a set of clustered
nodes and each node manages its
own data. Typical shared-nothing
deployments utilize hundreds of
nodes with private storage.
• Shared-Disk architecture is used
in clustered solutions for data
processing to enable HA of the
services provided by the cluster.
Although the services are run on
separate nodes, the actual data is
on a shared Storage Area Network
(SAN) storage location. Using
disk-mirroring techniques further
enhances the services failover.
One can implement HA by utilizing
multiple nodes and creating Active-Active or Active-Passive configurations.
HA clusters usually use a heartbeat
private network connection to monitor
the health and status of each node in
the cluster. Most commercial databases
have a HA deployment configuration.
In the following chart, we compare and contrast these two architectures:1

Storage
• Shared-Nothing: Each node has its own mass storage as well as main memory.
• Shared-Disk: Each node has its own main memory, but all nodes share mass storage—usually a storage area network. In practice, each node usually has separate multiple CPUs, memory and disks connected through a high-speed interconnect.

Data partitioning
• Shared-Nothing: A careful partitioning strategy across nodes is needed to implement distributed transactions and two-phase commit.
• Shared-Disk: A single shared disk presents issues with partitioning data that must be handled during data loading.

Write performance
• Shared-Nothing: Data partitioning alleviates write contention, since writes can be isolated to single nodes.
• Shared-Disk: There are issues with distributed locking, as write requests from multiple nodes need coordination.

Query performance
• Shared-Nothing: This architecture works really well if all the data needed by a query is available on a single node.
• Shared-Disk: When querying data from a Shared-Disk deployment, there is a possibility of Input/Output (I/O) disk contention from multiple nodes. This necessitates careful data partitioning during data load.

Favorable deployment model
• Shared-Nothing: A favorable deployment model when scalability is important, but it requires careful data partitioning across nodes.
• Shared-Disk: A favorable deployment model when data consistency is important.

Typical use
• Shared-Nothing: This architecture is used for parallel data processing.
• Shared-Disk: This architecture is used for failover of data services.

Consistency
• Shared-Nothing: Can be used when data consistency can be sacrificed.
• Shared-Disk: The better approach where data consistency is required.
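To make the partitioning requirement above concrete, here is a minimal sketch, in Python, of hash-partitioning rows across the nodes of a shared-nothing cluster. The node count, record layout and key names are illustrative assumptions, not a reference to any particular product.

    # Illustrative sketch: hash-partitioning rows across shared-nothing nodes.
    # The node count and record layout are assumptions for this example only.
    from collections import defaultdict
    import hashlib

    NUM_NODES = 4  # hypothetical cluster size

    def node_for_key(partition_key: str) -> int:
        """Map a partition key to one of the cluster nodes."""
        digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
        return int(digest, 16) % NUM_NODES

    def partition(rows, key_field):
        """Group rows by the node that should own them."""
        nodes = defaultdict(list)
        for row in rows:
            nodes[node_for_key(str(row[key_field]))].append(row)
        return nodes

    orders = [
        {"customer_id": "C001", "amount": 120.0},
        {"customer_id": "C002", "amount": 80.5},
        {"customer_id": "C001", "amount": 33.0},
    ]

    # Every row for a given customer lands on the same node, so a query
    # restricted to one customer can be answered by a single node.
    for node, rows in sorted(partition(orders, "customer_id").items()):
        print(f"node {node}: {rows}")

A poorly chosen partition key would instead force most queries to touch every node, which is exactly the coordination cost the chart above warns about.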
A shared-nothing architecture requires an efficient parallel-processing application service, such as a Parallel DBMS or the MR programming model, for parallel data processing. Both of these approaches provide a Shared-Nothing architecture and shield end users from the internal storage and access methods. Both achieve a high degree of parallelism for various operations—such as loading data, building indexes and evaluating queries—by dividing the data across different nodes to facilitate parallel processing. As robust, high-performance computing platforms, both provide a high-level programming environment that is inherently parallelizable, and both use a distributed and partitioned data storage strategy.

1 The Case for Shared Nothing, http://db.cs.berkeley.edu/papers/hpts85-nothing.pdf
In the chart below, we compare these two platform approaches looking at cost,
complexity, and architecture fit (including latency and access patterns) in an IT organization. As you
consider these alternatives, it is important to keep in mind the complexity of
the transaction model as well as data-partitioning techniques, and also to focus
on the update and query capabilities required from the application.
MR Systems versus Parallel DBMS

Architecture fit—schemas and data models
• MR systems: Since MR programs use key-value pair formats, they do not require users to define a schema for their data. Complex data structures, such as those involving compound keys, must be built into application programs.
• Parallel DBMS: Parallel DBMS require pre-defined schemas and data models representing data structures that can be queried by application programs.

Architecture fit—fault tolerance
• MR systems: Since MR systems maintain cluster information in a single name node, there is a single point of failure. Failover of the name node requires another node that is mostly idle.
• Parallel DBMS: Database services are fault tolerant using Parallel DBMS.

Architecture fit—data integrity
• MR systems: Data integrity must be enforced by the MR programs.
• Parallel DBMS: Data integrity constraints are abstracted from programs, since the database engine enforces the data integrity defined by the database catalogue and database structures, e.g., constraints and Primary and Foreign Keys.

Architecture fit—storage layer
• MR systems: MR requires setup of a distributed file system such as HDFS/GFS for data storage as well as data compression and persistence. Additionally, HDFS allows data partitioning and data replication (for redundancy) across multiple nodes. NOTE: a DHT, discussed later, is an interesting option that can be explored as an alternative to HDFS.
• Parallel DBMS: Use the file system recommended by the database vendor.

Complexity—programming interfaces
• MR systems: MR can use native interfaces in languages such as C, C++, Java, PHP and Perl. Other high-level interfaces that simplify writing MR jobs include Hive, Pig, Scope and Dryad. For example, Hive provides a SQL interface to MR. Hadoop Streaming allows the use of languages that support reading and writing to standard input/output.
• Parallel DBMS: Parallel DBMS use the ubiquitous ANSI SQL and a procedural language provided by database vendors. Use of query optimizers and query parallelization is transparent to the SQL developer.

Architecture fit—flexibility and shared development
• MR systems: Shared-Nothing makes MR systems very flexible due to the lack of constraints. For this reason, it is less suitable for shared-development environments.
• Parallel DBMS: This approach is less flexible when data structures need to be shared, requiring consensus among developers in the design of data structures and tables.

Complexity—data movement among nodes
• MR systems: MR does introduce movement of partitioned data across the network nodes, as all records for a particular key are sent to the same reducer. Hence, this is a poor platform choice when the job must communicate extensively among nodes.
• Parallel DBMS: This is a more favorable approach for data processing when a job must communicate extensively among nodes, which is often necessary due to poor data partitioning.

Complexity—indexing
• MR systems: Any indexes required to improve query performance must be built by the MR programmer.
• Parallel DBMS: This architecture leverages indexes that are shared by applications.

Cost
• MR systems: MR systems are open-source projects with no software licensing fees, which often presents a challenge for product support given that one has to rely on the open-source developer community. MR systems can be deployed on relatively inexpensive commodity hardware. This combination provides a cheaper, scalable, fault-tolerant and reliable storage plus data processing solution.
• Parallel DBMS: Parallel DBMS require more expensive hardware and software, but come with vendor support.

Architecture fit—ETL
• MR systems: MR systems are efficient ETL systems that can be used to quickly load and process large amounts of data in an ad hoc manner. As such, MR complements rather than competes with DBMS technology.
• Parallel DBMS: Parallel DBMS are not ETL systems, although some vendors would like to use the database engine for transformation when using the ELT approach.

Architecture fit—typical workloads
• MR systems: MR systems are popular for search, indexing and applications for the creation of social graphs and recommendation engines, and for storing and processing semi-structured data such as web logs that can be easily represented as key-value pairs. MR systems are efficient for many data-mining, data-clustering and machine-learning applications where the program must make multiple passes over the data; the output of one MR job can be used as the input of another MR job. This computing platform allows developers to build complex analytical applications and complex event-processing applications that combine structured data (in RDBMS), such as user profiles, with unstructured data, such as log data, video/audio data or streaming data that tracks user actions. This “mashup” can provide input to machine-learning algorithms and can also provide a richer user experience and insights into data that neither dataset can provide on its own. Such applications cannot be structured as single SQL aggregate queries; instead, they require a complex data-flow program where the output of one part of the application is the input of another. MR systems can also be used for batch processing and non-interactive SQL, for example ad hoc queries on data, and they efficiently handle streaming data such as the analysis and sessionization of web events for click-stream analysis, the calculation of charts and web-log analysis.
• Parallel DBMS: Parallel DBMS are suitable for OLTP and analytical warehouse environments, making this the better choice for query-intensive workloads, but NOT the best choice for unstructured data. See the commentary on column databases later in this paper.

Architecture fit—latency and throughput
• MR systems: MR systems are good for batch applications that can tolerate high latency.
• Parallel DBMS: Parallel DBMS are good for applications requiring high throughput and low latency.

Complexity—serialization and communication
• MR systems: Developers must learn alternate data serialization and node-to-node communication strategies such as protocol buffers.
• Parallel DBMS: Parallel DBMS are less complex.

Architecture fit—ACID versus CAP
• MR systems: MR systems can be used when CAP properties are more important. See the commentary later in this paper on ACID versus CAP.
• Parallel DBMS: This architecture can be used when ACID properties are more important.
[Figure: The Hadoop Ecosystem—ETL tools, BI reporting and RDBMS sit on top of Pig (data flow), Hive (SQL) and Sqoop; MapReduce (job scheduling/execution system) and HBase (column DB) run over HDFS (Hadoop Distributed File System), with Avro (serialization) and Zookeeper (coordination) alongside. Source: Cloudera, “Apache Hadoop: An Introduction,” Todd Lipcon, Gluecon 2010, slide 40, http://www.slideshare.net/cloudera/apachehadoop-an-introduction-todd-lipcomgluecon-2010]
MR systems and Parallel
DBMS—Complementary
architectures
Although it might seem that MR
systems and Parallel DBMS are
different, it is possible to write almost
any parallel-processing task as either
a set of database queries or a set of MR
jobs. In both cases, one brings data
close to the processing engine. So the
debate centers on whether a Parallel
DBMS or MR approach is the best fit.
Parallel DBMS improve processing
and input/output speeds by using
Central Processing Units (CPUs) and
disks in parallel. They apply horizontal
partitioning of relational tables, along
with the partitioned execution of SQL
queries using Query Optimizers.
Most of these architectural differences are the result of the different
focuses of the two classes of systems.
MR—A major component
of the Hadoop Ecosystem
Hadoop, which grew out of the
distributed-computing techniques
popularized by Google, is now an
open-source project under development
by Yahoo! and the Apache Software
Foundation. Hadoop consists of two
major components: 1) a file depository
or filestore, called HDFS, and 2)
a distributed processing system
(MR). HDFS is a fault-tolerant
(via replication), reliable, Shared-Nothing,
scalable and elastic storage layer
that is rack-aware and built using
commodity hardware. MR is a data
processing system that allows work
to be split across a large number of
servers and carried out in parallel.
MR allows one to parallelize
relational algebra operations that
are also the building blocks of SQL
such as: group-by; order or sort;
intersect and union; aggregation
and joins. Each of these operations
can be mapped using key-value
pairs for input and output between
the map and reduce steps. Hadoop
Streaming allows communication
between the map and reduce steps
using standard input and output
techniques familiar to Unix administrators. Some of the other components shown in the diagram above
are abstractions and higher-level
interfaces that are meant to assist
developers with routine tasks such
as workflow. Details on the inner
workings of MR can be found in
Hadoop: The Definitive Guide (O’Reilly,
Second Edition).
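As an illustration of the key-value flow between the map and reduce steps described above, the following is a minimal sketch of a SQL-style group-by/sum written as a pair of Hadoop Streaming functions in Python. The tab-separated input format and the invocation shown in the comments are assumptions for the example, not a prescription.

    # Minimal sketch of a group-by/sum as a Hadoop Streaming job (Python).
    # mapper() and reducer() read stdin and write stdout, as Streaming expects;
    # the "key<TAB>amount" input format is an assumption for this example.
    import sys
    from itertools import groupby

    def mapper(lines):
        """Emit key<TAB>value pairs, one per input record."""
        for line in lines:
            key, amount = line.rstrip("\n").split("\t")
            print(f"{key}\t{amount}")

    def reducer(lines):
        """Sum values per key; Hadoop delivers reducer input sorted by key."""
        parsed = (line.rstrip("\n").split("\t") for line in lines)
        for key, group in groupby(parsed, key=lambda kv: kv[0]):
            total = sum(float(value) for _, value in group)
            print(f"{key}\t{total}")

    if __name__ == "__main__":
        # Run as "mapper" or "reducer" depending on the first argument,
        # e.g. a Streaming job with -mapper "python job.py mapper"
        # and -reducer "python job.py reducer".
        {"mapper": mapper, "reducer": reducer}[sys.argv[1]](sys.stdin)

Conceptually this is the same aggregation a Parallel DBMS would express as a single GROUP BY query; here the grouping, sorting and movement of key-value pairs between nodes are visible to the programmer.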
Defining the relevant acronyms
ACID: Atomicity, Consistency, Isolation, and Durability
BI: Business Intelligence
BLOB: Binary large object that can hold a variable amount of data
CAP: Consistency, Availability, Partitioning
CouchDB: Cluster of Unreliable Commodity Hardware
CPU: Central Processing Unit
DBMS: Database Management System
DHT: Distributed Hash Tables
ETL: Extract, Transform, Load
HA: High Availability
HDFS: Hadoop Distributed File System
IMDB: In-Memory Database
I/O: Input/Output
MR: Map-Reduce
NoSQL: Databases without SQL interfaces
OD: Object Database
OLTP: Online Transaction Processing
RDBMS: Relational Database Management Systems (SQL-compliant)
SAN: Storage Area Network
SLA: Service Level Agreement
While Parallel DBMS excel at efficient
querying of large data sets, MR
systems excel at complex analytics and
ETL tasks. The two technologies are
complementary, in that neither is good
at what the other does well and many
complex analytical problems require
the capabilities provided by both
systems. This requirement drives the
need for interfaces between Parallel
DBMS and MR systems, which will
allow each to do what it is good at.
From where we stand, a “hybrid”
solution could be your best bet—that
is, interfacing an MR framework with a
DBMS so that MR handles the complex
analytics and data processing typical
of an ETL program, while the DBMS
handles embedded queries. HadoopDB, Hive, Aster,
Greenplum, Cloudera, and Vertica all
have commercially available products or prototypes in this category.
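The sketch below illustrates this hybrid pattern in miniature: an MR-style map and reduce (simulated in-process rather than on a cluster) produces aggregates that are then loaded into a relational table for embedded SQL queries. SQLite stands in for the warehouse, and the event data, table and column names are invented for illustration only.

    # Illustrative hybrid pipeline: MR-style aggregation feeding a DBMS.
    # SQLite stands in for the warehouse; all names here are hypothetical.
    import sqlite3
    from collections import defaultdict

    raw_events = [
        ("page_view", "us"), ("page_view", "uk"),
        ("purchase", "us"), ("page_view", "us"),
    ]

    # "Map" step: emit ((event_type, country), 1); "reduce" step: sum per key.
    counts = defaultdict(int)
    for event_type, country in raw_events:
        counts[(event_type, country)] += 1

    # Load the reduced output into a relational table for embedded SQL queries.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE event_counts (event_type TEXT, country TEXT, n INTEGER)")
    conn.executemany(
        "INSERT INTO event_counts VALUES (?, ?, ?)",
        [(etype, country, n) for (etype, country), n in counts.items()],
    )

    # The DBMS now does what it is good at: declarative queries over the aggregates.
    for row in conn.execute(
        "SELECT event_type, SUM(n) FROM event_counts GROUP BY event_type"
    ):
        print(row)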
Facebook has implemented a large data-warehouse system using MR technology, Hive and HiveQL. HiveQL is a SQL interface that was developed by Facebook and is now available to the open-source community.2 This enables analysts who are familiar with SQL to access HDFS and Hive data stores. SQL queries are translated to MR jobs in the background.

2 Hadoop Summit 2010, hosted by Yahoo!, June 29, 2010. http://developer.yahoo.com/events.hadoopsummit2010
A look at a few alternative
strategies
Several web companies have opted to
use key-value stores as a substitute for a
relational database—that is, an associative array where the hash of a key (array
of bytes) is mapped to a value (aka a
BLOB). Below, we provide a bird’s-eye
view of some of the alternatives available to you today—DHTs, Column
Databases and In-Memory Databases
(IMDB)—along with considerations to
bear in mind when choosing between
ACID and CAP solutions.
The DHT approach
More complex installations use a DHT
approach, thereby allowing a group
of distributed nodes to collectively
manage a mapping between keys and
values, without any fixed hierarchy and
with only minimal management overhead. A DHT allows one to store a key
and value pair and then look up a value
with the key. Values are not always
persisted on disk, although one could
certainly base a DHT on top of a persistent hash table, such as Berkeley DB.
The main characteristic of a DHT is
that storage and lookups are distributed among multiple machines. Unlike
existing master/slave database replication architectures, all nodes are peers
that can join and leave the network
freely. DHTs have been popular when
building peer-to-peer applications such
as file-sharing services and web caches.
Some implementations of DHT include
Cassandra, CouchDB and Voldemort.
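To show what distributing storage and lookups among multiple machines means in practice, here is a deliberately simplified, in-process sketch of a DHT: each key is hashed to select an owning node plus replicas, and a lookup consults the same hash, so no central coordinator is needed. The node names and replica count are assumptions; real systems such as Cassandra or Voldemort add consistent hashing, cluster membership and failure handling that this toy version omits.

    # Toy DHT sketch: hash a key to pick its owning node plus replicas.
    # Node names and the replica count are illustrative assumptions.
    import hashlib

    NODES = ["node-a", "node-b", "node-c", "node-d"]
    REPLICAS = 2  # write each value to this many nodes
    storage = {node: {} for node in NODES}  # per-node in-memory stores

    def nodes_for(key: str):
        """Deterministically choose the nodes responsible for a key."""
        start = int(hashlib.sha1(key.encode()).hexdigest(), 16) % len(NODES)
        return [NODES[(start + i) % len(NODES)] for i in range(REPLICAS)]

    def put(key: str, value: bytes):
        """Write the value to every replica node for the key."""
        for node in nodes_for(key):
            storage[node][key] = value

    def get(key: str):
        """Read from the first replica that still holds the key."""
        for node in nodes_for(key):
            if key in storage[node]:  # tolerate a missing replica
                return storage[node][key]
        return None

    put("user:42", b"profile blob")
    print(nodes_for("user:42"), get("user:42"))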
For example, CouchDB is a schema-free (no tables), document-oriented
database used for applications such
as forums, bug tracking, wikis and
TagClouds. A CouchDB document is
an object that consists of named fields.
Field values are lists that may contain
strings, numbers and dates. A Couch
database is a flat collection of these
documents.
The NoSQL movement has popularized some of these DHT-based databases. DHT solutions require clients
to perform complex programming in
order to handle dirty reads and to rebalance the cluster and its hashing. Typically,
DHT implementations do not have
compression, which affects I/O.
Memory-cached key-value stores are
also being used by several web applications to improve applications’ performance by removing load from the
databases. Note that this technology
must be deployed carefully because
it lacks ACID support. There are both
advantages and disadvantages to using
a DHT approach.
DHT advantages:
• Decentralization: There is no
central coordinator or master node,
which provides independence and failure
transparency. This design distributes
and minimizes shared
components and eliminates the single
point of failure.
• Scalability: Hash tables remain
efficient with a large number of nodes
and are fault tolerant, even when nodes
drop and join.
• Simplicity: This is a key-value store
without any SQL access and with no
schemas to worry about. Hash tables
have a lower memory footprint.
• Highly available for write. Hash
tables provide the ability to write to
many nodes at once. Depending on
the number of replicas maintained
(which is configurable), one should
always be able to write somewhere,
thereby avoiding write failures.
• Suitable for query applications.
Hash tables are especially efficient
for query applications where there is
a need to return sorted data quickly
using MR.
• Performance: Performance level
is high when using hash tables and
key-value pairs.
DHT disadvantages:
• Not ACID-compliant. Where ACID
compliance is a must, the use of a DHT
is limited.
• Lack of complex data structures
required for joins (obtaining data
from multiple hashes) and analysis.
This makes a DHT solution unsuitable for data warehouses that must
support queries that aggregate and
retrieve small data results from large
datasets.
• Lack of queryable schemas. This lack
limits the usefulness of a DHT.
• Complex for clients. A DHT solution
requires clients to perform complex
programming. And, because this
technology lacks ACID support, it
must be carefully deployed.
ACID or CAP?
When architecting applications for large-scale data, it is necessary to consider whether ACID properties—atomicity, consistency, isolation and durability of data and transactions—are more important for your business than are CAP characteristics—consistency, availability and partitioning of data. Relational database vendors focus on implementing ACID properties, which may be harder to implement in a distributed environment where there is a possibility of node failure. In some respects, this may limit the scalability of these ACID-compliant databases for update and query purposes.

Alternatively, there is a class of applications that can make do with CAP characteristics alone, thus allowing the sacrifice of ACID properties. Since these systems do not use relational database technologies, but instead use technologies such as DHT and HDFS, they are very scalable and continue to function during node or network failures.

The question becomes whether you will need to manage complex transactions, or whether a simple model with no transaction support will be sufficient to meet your needs. It is important that you determine whether or not it is possible to deploy a data-partitioning strategy, such as is used in HDFS and DHTs, to minimize the network traffic and make the nodes self-sufficient.

Column databases
Column databases3 serialize data stored in columns instead of in a complete row, as is the case with row-oriented databases. In other words, attribute values belonging to the same column are stored contiguously, compressed and densely packed, as opposed to traditional database systems that store entire records (rows).

3 C-Store: A Column-oriented DBMS, Stonebraker et al., Proceedings of the 31st VLDB Conference, Trondheim, Norway, 2005

For many years, Sybase® IQ was the only commercially available product in the column-oriented DBMS class. Recently, however, significant academic research has rejuvenated this technology. Database architects have been debating whether column databases or row databases offer better performance for large-scale data analysis. Vendors like Oracle® (with its Exadata product) and Greenplum are striving to achieve convergence between column-store and row-store technologies.

Another approach, widely used to boost query performance, is the use of a vertical partitioning scheme, where all columns of a row-store are indexed, allowing the columns to be accessed independently.

Column stores claim to be better suited for BI applications and data warehouses that typically involve a smaller number of highly complex queries requiring full fact-table scans, since they only read the columns required by the query. Column-database storage is read-optimized for analytical applications and uses compression techniques. However, there are some issues with column stores. To cite just two examples, we have seen problems with:
• Database insert/update operations that involve separating individual columns in a row and writing to different locations on disk
• Row construction, e.g., combining data from multiple columns into ‘rows’ during query execution
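The following sketch contrasts the two layouts: the same records held row-wise and column-wise, with an aggregate that touches only one column. The table and column names are invented for illustration; real column stores add compression, dictionary encoding and vectorized execution on top of this basic idea.

    # Row-store vs. column-store layout for the same data (illustrative only).
    rows = [
        {"order_id": 1, "region": "EMEA", "amount": 120.0},
        {"order_id": 2, "region": "APAC", "amount": 80.5},
        {"order_id": 3, "region": "EMEA", "amount": 33.0},
    ]

    # Column-wise layout: values of each attribute stored contiguously.
    columns = {
        "order_id": [r["order_id"] for r in rows],
        "region":   [r["region"] for r in rows],
        "amount":   [r["amount"] for r in rows],
    }

    # An analytic aggregate needs only the "amount" column, so a column
    # store reads one compact, compressible array instead of whole rows.
    total_from_columns = sum(columns["amount"])

    # A row store must touch every field of every record to do the same.
    total_from_rows = sum(r["amount"] for r in rows)

    assert total_from_columns == total_from_rows
    print(total_from_columns)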
In-memory databases (IMDB)
In this class of database technology, the
entire database is stored in memory.
IMDB repositories typically have a
low memory and CPU footprint. The
mainstream adoption of 64-bit architectures, coupled with lowering of memory
costs, has enabled larger databases to
be created in memory. Since disk I/O is
eliminated, IMDB repositories provide
superior query performance. Most
traditional databases use disk caching
to reduce this I/O, but they end up
using more memory and CPU. For this
reason, it is important that architects
responsible for BI applications do not
overlook this emerging technology—
which is rapidly gaining momentum for
enterprise analytical applications—as a
viable solution. As with other database
technologies, IMDB repositories are
proprietary and have practical limitations on the database size.
Given database volatility issues, IMDB
repositories have limited ACID properties since they do not use transaction
and re-do logs. However, one can alleviate this durability issue by utilizing
replication and periodic writes to disk
using save-point techniques. Further,
these databases do not require design
of OLAP models or constant building
of indexes. ETL jobs for loading these
databases are generally faster.
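Here is a minimal sketch of the save-point idea mentioned above: the working data lives entirely in memory and is periodically flushed to disk so that a restart can recover the last snapshot. The file name and the trigger for taking a snapshot are assumptions; production in-memory databases combine this with replication and, in some products, logging.

    # Sketch of save-point persistence for an in-memory store (illustrative).
    import json, os, tempfile

    SNAPSHOT_FILE = os.path.join(tempfile.gettempdir(), "imdb_snapshot.json")
    table = {}  # the "database" lives entirely in memory

    def save_point():
        """Write a snapshot to disk, then atomically swap it into place."""
        tmp = SNAPSHOT_FILE + ".tmp"
        with open(tmp, "w") as f:
            json.dump(table, f)
        os.replace(tmp, SNAPSHOT_FILE)  # atomic on the same filesystem

    def recover():
        """Reload the last snapshot after a restart (later updates are lost)."""
        if os.path.exists(SNAPSHOT_FILE):
            with open(SNAPSHOT_FILE) as f:
                table.update(json.load(f))

    recover()
    table["session:9001"] = {"user": "analyst", "queries_run": 42}
    save_point()  # in practice triggered on a timer or after N updates
    print(table)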
Traditionally, in-memory databases were popular in embedded systems such as set-top boxes and routers using real-time operating systems. Now, however, financial markets, e-commerce and social networking sites are using IMDB repositories for special database applications and in-memory analysis that require speed.
In-memory technology is increasingly being applied to BI, data warehouses and analytics. QlikView, from BI vendor QlikTech, uses in-memory database repositories. MicroStrategy 9 includes standard in-memory ROLAP, as does the Spotfire product from Tibco. All of the mega vendors—Oracle® (TimesTen), IBM® (TM1), SAP® (NetWeaver Business Warehouse Accelerator), and Microsoft® (project Gemini)—have in-memory database and BI offerings.
Long story short…
There are no shortcuts. To design
the best data-management strategy,
leverage the sophisticated emerging
technologies and implement the right
analytic solutions for your unique
company, you have to do your homework. Those companies that put in
significant time and effort up front
will reap the results of that investment—the ability to efficiently slice,
dice and analyze the flood of data that
passes through the organization daily.
The ability to quickly access valid
information increases corporate agility,
resulting in a competitive advantage in
today’s dynamic global marketplace.
What’s new? Niche
products for the BI market
Traditional approaches to data
analysis have relied on relational
database technologies and parallel
database technologies offered
by major database vendors such
as Teradata, Oracle and DB2. In
recent years, however, companies such as Netezza, Greenplum,
Vertica and Aster Data have come
up with niche products for the BI
market. Similarly, the Microsoft
Research Project (Dryad) is investigating programming models for
writing parallel and distributed
programs to scale from a small
cluster to a large data center.
The growth in data volumes and
increased SLAs around ETL has
resulted in popularity of alternate
approaches, especially among
emerging web companies. There
has been a lot of buzz in the
industry around the use of distributed architectures and cluster
computing in order to increase
system throughput and improve
scalability. MR is one popular
approach, and Hadoop is its best-known
open-source implementation. As
mentioned earlier, Hadoop grew out
of techniques popularized by Google
and is now an open-source project
under development by Yahoo! and the
Apache Software Foundation.
What’s more, social media companies such as Facebook have
implemented a data warehouse
using MR instead of a traditional
RDBMS.
What this means for your business
Leveraging valid,
timely data—Seizing a
competitive advantage
Traditionally, organizations have
been more comfortable with proven,
well-supported enterprise-class
software. Until recently, most have
been unwilling to work with the more
disruptive open-source technologies.
Now, however—given the deluge of
data bombarding companies today—
data architects are recognizing that
traditional analytical solutions no
longer work. In response to the challenge of analyzing a very large volume
of data in a timely and meaningful
manner, several cutting-edge companies have become pioneers in this
space, and they have made significant
headway using the new open-source
technology.
In our view, it is a mistake to ignore
the emerging solutions that are now
dotting the landscape. Disruptive
though they may be, the new open-source technologies provide a very
scalable approach to large-scale data
analysis. It’s time to explore the potential of these innovations to identify
the one that would be the best fit with
your organization’s unique needs and
circumstances. Your firm can benefit in
important ways.
Reaping the benefits of a
robust open-source solution
It pays to take a distributed-architecture approach to processing BigData.
Let’s look at just some of the key
benefits that we see accruing to early
adopters of these emerging solutions:
First, distributed architecture is a very
scalable approach. Unlike traditional
analytic technologies that require very
expensive high-end servers, some of
the open-source technologies discussed
in this paper are built to operate on less
expensive, commodity hardware.
What’s more, while most companies
using a traditional solution are accustomed to anticipating and building
their data processing capacity in
advance—a practice that can be costly
in terms of overestimating the need
and tying up capital sooner than
necessary—a distributed-architecture
approach allows you to add clusters
to an existing rack as your processing
needs grow. The new addition simply
gets folded into the existing cluster
and begins processing.
Using a distributed-architecture
approach is less risky. Open-source
technologies have become very robust
and can easily handle any failovers.
Consider the risk involved in the traditional approach where you might have
only two high-end servers processing
huge amounts of data between them.
Where would you be if one of those
were to crash? In contrast, having
numerous less-expensive servers
sharing the load is much safer. Were
one or more of those to crash, the
others could easily pick up the slack.
Further, with a distributed architecture approach, you don’t have servers
sitting idle, waiting for disaster events
to occur.
The time to act is now
We believe that it is a mistake to ignore
the new solutions specifically designed
to work in today’s data-intensive, high-volume, read-only environment. It’s
time to explore the possibilities.
As you set out to design and implement
the data-management solution that
will most effectively strengthen your
organization’s analytical capabilities,
take the time up front to ask yourselves
the right questions. Only then can you
be sure that the IT strategy you design
will encompass all of your growing BI
needs, and that it will be aligned with
the organization’s overall business
goals.
The bottom line: Those organizations that carefully plan their IT
strategy with their business goals
in mind are the ones that will find
themselves positioned to seize the
competitive advantage in today’s
dynamic global marketplace.
Some non-technical thoughts on open-source Web data processing systems
Alan Morrison, PwC Center for Technology and Innovation, San Jose CA
While it’s certainly essential to take a long, hard look at the pros and cons of each technology, it’s also important,
during any technology due diligence effort, to weigh other factors that may not be sufficiently illuminated by a
comparison of technical pros and cons.
In Making Sense of Big Data, Technology Forecast 2010, No. 2,4 PwC evaluates the benefits and challenges
of open-source Big Data—with a focus on Hadoop and complementary non-relational methods. The Forecast
considers several case studies, as well as the changing data analysis environment, development-method
strengths and weaknesses, and other factors. Following are some of the conclusions that we drew during the
research effort for that publication.
Hadoop, MapReduce and related NoSQL methods have significant momentum. This momentum is due
to a successful open-source effort that’s emerged to address the significant growth of Web data. Many
of the largest Web companies in the world—among them Yahoo, Twitter, Facebook, LinkedIn, and Amazon, not
to mention vendors and consultancies such as Cloudera and Concurrent—have collaborated for years inside the
Apache community on developing and refining Hadoop, MapReduce, Hive, Pig, Lucene, Cascading and numerous
other NoSQL tools that are part of the ecosystem. The most popular tools benefit from the efforts of some talented
developers at these companies. The size and quality of this developer community matters a great deal.
A heterogeneous data environment often demands different tools for different purposes. In general,
Hadoop, MapReduce and NoSQL tools were conceived primarily for Web environments. That’s not to say that
they can’t be used elsewhere, but their heritage is rooted in the Google® Cluster Architecture and the tools
Google developed internally to analyze the petabytes of data it collects. Much can be learned from Google and
the examples of similar Web companies, but their approaches to large-scale data analysis should be considered a
complement to, rather than a replacement for, relational data techniques.
Successful use of open source depends on a sound understanding of how the community operates. In
any open-source community, there are deep, strong currents and shallow, weak currents. Those who succeed
take the time to study those currents. They pick the tools that benefit from the highest level of participation, and
provide feedback to the community on how the software can be improved. That plays a large role in reaping the
benefits. Remember that most open-source efforts are essentially first efforts that haven’t yet gained popularity
or refinement.
Be wary of methods that may never be widely used. Many methods, including some launched by developers
with very good ideas and intentions, may never gain traction. User communities have biases that can keep
well-designed software from being widely used regardless of how technically viable it is. Meanwhile, some less-worthy software can gain followers for reasons that don’t have anything to do with pure technical viability.
Open source implies a different culture of development, and its users tend to be less passive, with more internal
development capability than those who just subscribe to services from large commercial Software-as-a-Service
(SaaS) providers or buy licenses from packaged software vendors. In the area of data analysis, open source users
will be more likely to be the ones who experiment and explore. Those who aren’t accustomed to these methods
should remember that it’s important to encourage this exploratory mentality regardless of the source of the software. It’s an essential component of any successful Big Data strategy.
4 Making Sense of Big Data, Technology Forecast 2010, No.2. Issue 3. PricewaterhouseCoopers. http://www.pwc.com/us/en/
technology-forecast/2010/issue3/index.jhtml
www.pwc.com/us/advisory
For a deeper discussion around
identifying the right data
management solutions for your
organization, please contact:
David Patton
267 330 2635
[email protected]
Sanjeev Taran
408 817 1285
[email protected]
© 2011 PwC. All rights reserved. “PwC” and “PwC US” refer to PricewaterhouseCoopers LLP, a Delaware limited liability partnership, which is a member firm of
PricewaterhouseCoopers International Limited, each member firm of which is a separate legal entity. This document is for general information purposes only,
and should not be used as a substitute for consultation with professional advisors. NY-11-0809