Big Data Analytics Architecture
Putting All Your Eggs in Three Baskets
By Neil Raden
Hired Brains, Inc.
Sponsored by:
Table of Contents
Executive Summary
The “Big Data” Question
Types of Analytics
Positioning Hadoop
The Core of Hadoop is MapReduce
Integrated Data Warehouse
Case for Hybrid Relational + MapReduce Systems
Conclusion
About the Author
Executive Summary
The big story in analytics and information management in 2011 was big data. In 2012, the trend is accelerating. At the center of discussions about managing huge amounts of novel sources of information is Hadoop. One can perceive this effect as ousting data warehouses as the premier location for the gathering and analyzing of data, but it is only partially true. While the capabilities and applications of Hadoop have been clearly demonstrated for organizations that deal with massive amounts of raw data as their primary business, especially Web-oriented ones¹, where it fits in other kinds of organizations is not quite as clear. The core of Hadoop is a programming framework, MapReduce, which crunches data from other sources and can deliver analysis (or sometimes just aggregated data to be used in other analytical systems). Because Hadoop consumes “used” data², its application may tend to overlap with the way analytical processing has been done for decades – with data integration tools, data warehouses and a variety of reporting, analytical and decision support tools. It is for that reason that some Hadoop proponents see it as a replacement for the current analytic architecture, such as data warehousing.

To put this in perspective, it is important to understand what analytics means and how it is used. In this paper we present a formal definition of analytical “types” in order to exemplify the kinds of analytics and analytical processes that organizations employ. This helps to pinpoint where Hadoop is appropriate and where existing data warehousing and business intelligence environments are appropriate. There is a third option as well, with the emergence of new, hybrid systems that are relational databases with MapReduce capabilities, such as the Teradata® Aster MapReduce Platform. Others are also emerging, and competitors to Hadoop itself are already in business.
1 Actually, data warehousing and business intelligence, though impressive, have for the most part been limited to subject areas involving finance, marketing, supply chain, customers and other intra-enterprise domains (though there are countless exceptions too). Big data includes totally new areas such as genomics, machine-generated data and the vast volumes of data created online, especially social networking.
2 “Used data” is not an industry-standard term; rather, it’s a construction used by this author to distinguish data stored in the primary systems that capture it from data that is re-used for other purposes. A CDR (Call Detail Record) in a telephone billing system is primary data. When it is pulled for analytical purposes, it is used data.
The “Big Data” Question
The problem of “big data” is hardly new. When we see charts like that in Figure 1 (the data is just representational), the explosive growth of data today is quite alarming.

It would appear from this graph that the amount of data in 2005 was practically insignificant compared to today. However, consider the exponential growth in data from the perspective of the year 2005, when enterprise systems and customer-level data were flooding into data warehouses, causing large-scale rethinking of methodologies. That curve looked just as daunting as it does today (see Figure 2). So as alarming as it may have seemed in 2005, it clearly was the tip of the iceberg for what was to follow. But analytic environments didn’t collapse; they evolved to deal with the rising tide of data. Today, they continue to evolve to meet the challenge of big data.

Something has changed, though, and changed dramatically. When data warehouse sizes grew from 50GB to 1TB to 10TB, the data was still the same – structured data culled from internal operational systems, largely composed of historical data over time. It would be an exaggeration to say the volume, variety and velocity³ of the data was predictable, but it was at least comprehensible. The problem facing companies was how to expand their relationally-based data warehouses to accommodate these new requirements and to provide the needed support for business intelligence and, in a larger sense, analytics.

Today, the problem is very different. Data sources are unpredictable, multi-structured (emanating from organized systems) and massive. Many are external to the enterprise. The techniques for mining data from these sources, and even the platforms most appropriate for doing so, are now somewhat in question. With the entry of Hadoop to the market – an approach that is separate from the relational data warehouse – the issue facing decision makers today is where and when to deploy these technologies for performing useful analytics. But first, a little clarity on what analytics means.
Figure 1. 2001-2012: Exponential data warehouse data growth, in terabytes. Big data analytics has contributed significantly to this growth during the last three years of this period.

Figure 2. 2000-2005: Years of massive adoption of traditional business analytics (data warehouse data in terabytes).
3 Doug Laney, “3D Data Management: Controlling Data Volume, Velocity and Variety,” META Group, 2001, though the title obviously suggests that the 3 Vs were meant in a different context from big data today.
Types of Analytics
The term “analytics” is not particularly precise, so before fitting analytics into the correct technology, it helps to have some precision. Hired Brains employs a four-type definition as follows:

1. Type I Analytics: Quantitative Research and Development
The creation of theory and development of algorithms for all forms of quantitative analysis deserves the title Type I. This is the preserve of mathematicians, statisticians, and other pure quantitative scientists. These are the people who delve into the workings and manifestations of Hidden Markov Support Vector Machines, Linear Dynamical Systems, Spectral Clustering, machine learning algorithms like canopy clustering, k-means, Naïve Bayes, and a host of other exotic models. The discovery and enhancement of computer-based algorithms for these concepts is mostly the realm of academia and other research institutions. Commercial, governmental and other organizations (Wall Street, for example) may employ some staff with these very advanced skills, but in general, most organizations are able to conduct their necessary analytics without them, or by employing the results of their research. An obvious example is the FICO score, developed by Type I experts but employed widely in credit-granting institutions and even Human Resource organizations.

2. Type II Analytics: Data Scientists
More practical than theoretical, Type II is the incorporation of advanced analytical approaches derived from Type I activities. This includes commercial software companies, vertical software implementations, and even the heavy “quants” who work in industry, who apply these methods specifically to the work they do, since they operate in much the same way as commercial software companies, but for just one customer (though they often start their own software companies, too).

In fact, Type II could actually be broken down into two subtypes, Type II-A and Type II-B. While both perform roughly the same function, providing guidance and expertise in the application of quantitative analysis, they are differentiated by the sophistication of the techniques applied. Type II-A practitioners understand the mathematics behind the analytics, and may apply very complex tools such as Lucene wrapper, loopy logic, path analysis, root cause analysis, synthetic time series or Naïve Bayes derivatives that are understood by a small number of practitioners. What differentiates the Type II-A from Type I is not necessarily the depth of knowledge they have about the formal methods of analytics (it is not uncommon for Type IIs to have a PhD, for example); it is that they also possess the business domain knowledge they apply, and their goal is to develop specific models for the enterprise, not for the general case as Type Is usually do.

Type II-Bs, on the other hand, may work with more common and well-understood techniques such as logistic regression, ANOVA, and CHAID, and approach their analytical problems with more conventional best practices and/or packaged analytical solutions from third parties.

We titled this category “Data Scientist,” which is a relatively new title for quantitatively adept people with accompanying business skills. The ability to formulate and apply tools to classification, prediction and even optimization, coupled with a fairly deep understanding of the business itself, is clearly in the realm of Type II efforts. However, it seems pretty likely that most so-called data scientists will lean more towards the IT aspects of quantitative and data-oriented subjects than business planning and strategy. The reason for this is that the term data scientist emerged from businesses like Google or Facebook where the data actually is the business; so understanding the data is equivalent to understanding the business. This is clearly not the case for most organizations. We see very few Type II data scientists with the in-depth knowledge of the whole business of, say, actuaries in the insurance business, whose extensive training should be a model for the newly designated data scientists (see our blog on this topic at www.informationweek.com/blog/228901009).

It is absolutely essential that someone in the organization has the role of chief communicator – someone comfortable working with quants, analysts and programmers, deconstructing their methodologies and processes, distilling them, and then rendering them in language that other stakeholders understand.
Companies often fail to see that there is almost never anything to be gained by trying to put a PhD statistician into the role of managing a group of analysts and developers.

3. Type III Analytics: Operational Analytics
This is the part of analytics we’re most interested in and most familiar with. For example, a Type II expert may develop a scoring model for his/her company. In Type III activity, parameters are chosen by the Type III analyst and are input into the model, generating the scores calculated by the Type II models and embedded into an operational system that, say, generates offers for credit cards. Models developed by Type IIs can be applied and embedded in an almost infinite number of ways today. The application of Type II models to real work is the realm of Type III. In very complex applications, real-time data can be streamed into applications based on Type II models, with outcomes instantaneously derived through decision-making tools such as rules engines⁴.
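To make this division of labor concrete, here is a minimal sketch, in Java, of a hypothetical credit-offer flow: the score() formula and its coefficients stand in for a model delivered by a Type II modeler, while the cutoff is the kind of parameter a Type III analyst would choose and tune. All names and values here are invented for illustration.

    // Hypothetical illustration of Type III analytics: a Type II-built scoring
    // model embedded in an operational offer-generation flow. Coefficients and
    // thresholds are invented for illustration only.
    public class OfferEngine {

        record Applicant(double income, double debtRatio, int delinquencies) {}

        // Stands in for a model delivered by a Type II modeler.
        static double score(Applicant a) {
            return 600.0
                 + 0.002 * a.income()
                 - 150.0 * a.debtRatio()
                 - 25.0  * a.delinquencies();
        }

        // The cutoff is the parameter the Type III analyst chooses and tunes.
        static boolean qualifiesForCreditCardOffer(Applicant a, double cutoff) {
            return score(a) >= cutoff;
        }

        public static void main(String[] args) {
            Applicant a = new Applicant(55_000, 0.30, 1);
            System.out.println(score(a));                             // 640.0
            System.out.println(qualifiesForCreditCardOffer(a, 620));  // true
        }
    }

The point of the sketch is the separation of roles: the model’s internals are fixed by the Type II work, while the operational decision is governed by parameters the Type III analyst controls.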
Decision-making systems that are reliant on quantitative methods that are not well understood by the operators can lead to trouble. They must be carefully designed (and improved) to avoid overly burdening the recipients with useless or irrelevant information. This was a lesson learned in the early days of data mining: generating “interesting” results without understanding what was relevant usually led to flagging interest in the technology. In today’s business environment, time is perhaps the scarcest commodity of all. Whether a decision-making system notifies people or machines, it must confine its messages to those that are the most relevant and useful.

False negatives are quite a bit more problematic, as they can lead to transactions passing through that should not have. Large banks have gone under by not catching trades that cost billions of dollars. Think of false negatives as being asleep at the wheel.

Packaged applications that embed quantitative methods such as predictive modeling or optimization are also Type III in that the intricacies and the operation of the statistical or stochastic method are mostly hidden in a sort of “black box.” As analytics using advanced quantitative methods becomes more acceptable to management over time, these packages become more popular.

4. Type IV Analytics: Business Intelligence and Discovery
Type III analytics aren’t of much value if their application in real business situations cannot be evaluated for their effectiveness. This is the analytical work we are most familiar with via reports, OLAP, dashboards and visualizations. This includes almost any activity that reviews information to understand what happened or how something performed, or to scan and free-associate what patterns appear from analysis. The mathematics involved is simple.
Type I – Quantitative R&D. Quantitative sophistication/numeracy: PhD or equivalent. Sample roles: creation of theory, development of algorithms; academic/research; often employed in business or government for very specialized roles.

Type II – Data Scientist or Quantitative Analyst. Quantitative sophistication/numeracy: advanced math/statistics, not necessarily PhD. Sample roles: internal expert in statistical and mathematical modeling and development, with solid business domain knowledge.

Type III – Operational Analytics. Quantitative sophistication/numeracy: good business domain knowledge, background in statistics optional. Sample roles: running and managing analytical models; strong skills in and/or project management of analytical systems implementation.

Type IV – Business Intelligence/Discovery. Quantitative sophistication/numeracy: data- and numbers-oriented, but no special advanced statistical skills. Sample roles: reporting, dashboard, OLAP and visualization use, possibly design; performing posterior analysis of results driven by quantitative methods. Spreadsheets prevail; newer “business discovery” tools are gaining traction.

Figure 3: Summary of analytical types.
4 Taylor and Raden, “Smart Enough Systems,” Prentice Hall, 2007.
Positioning Hadoop
Hadoop “data warehouses” do not resemble the data warehouse/analytics environments that are common in organizations today. They exist in businesses like Google and Amazon for web log parsing, indexing, and other batch data processing, as well as for storing enormous amounts of unfiltered data. Petabyte-size data warehouses in Hadoop are not data warehouses as we know them; they are a collection of files on a distributed file system designed for parallel processing. To call these file systems “a data warehouse” is misleading, because a data warehouse exists to serve a broad swath of uses and people, particularly in business intelligence, which is both interactive and iterative. MapReduce is a programming paradigm with a single data flow type that takes the form of a directed acyclic graph of operators⁵. These platforms lack built-in support for iterative programs, quite different from the operations of a relational database. To put it in layman’s terms, there are things that Hadoop is exceptionally well designed for that relational databases would struggle to do. Conversely, a relational database-based data warehouse performs a multitude of useful functions that Hadoop does not yet possess.

Hadoop is described as a solution to a myriad of applications: web log analysis, visitor behavior, image processing, search indexes, analyzing and indexing textual content, research in natural language processing and machine learning, scientific applications in physics, biology and genomics, and all forms of data mining (see Figure 4). While it is demonstrable that Hadoop has been applied to all of these domains and more, it is important to distinguish between supporting these applications and actually performing them. Hadoop comes out of the box with no facilities at all to do most of this analysis. Instead, it requires the application of libraries available either through the open source community at forge.com or from the commercial distributions of Hadoop. In no case can these be considered a seamless bundle of software that is easy to deploy in the enterprise. A more accurate description is that Hadoop facilitates these applications by grinding through data sources that were previously too expensive to mine. In many cases, the end result of a MapReduce job is the creation of a new data set that is either loaded into a data warehouse or used directly by programs such as SAS or Tableau.
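As noted above, MapReduce’s single, acyclic data flow has no built-in iteration, so iterative algorithms must be driven from outside. The plain-Java sketch below (a deliberate simplification, not Hadoop API code) shows that shape for k-means clustering: every iteration is a complete map/reduce pass over the full input, with results written out between passes.

    import java.util.*;

    // Conceptual sketch: k-means expressed as one map/reduce pass per
    // iteration, driven by an external loop, mirroring how a Hadoop driver
    // must resubmit a whole job for each iteration.
    public class KMeansDriver {
        public static void main(String[] args) {
            List<double[]> points = List.of(
                new double[]{1.0, 1.0}, new double[]{1.2, 0.8},
                new double[]{8.0, 8.0}, new double[]{8.2, 7.9});
            double[][] centroids = {{0.0, 0.0}, {10.0, 10.0}};  // initial guesses

            // Driver loop: the framework has no iteration primitive.
            for (int iter = 0; iter < 10; iter++) {
                // "Map" phase: tag each point with the id of its nearest centroid.
                Map<Integer, List<double[]>> grouped = new HashMap<>();
                for (double[] p : points) {
                    grouped.computeIfAbsent(nearest(p, centroids),
                                            k -> new ArrayList<>()).add(p);
                }
                // "Reduce" phase: recompute each centroid as the mean of its group.
                for (Map.Entry<Integer, List<double[]>> e : grouped.entrySet()) {
                    centroids[e.getKey()] = mean(e.getValue());
                }
                // On a cluster, the new centroids would be written to HDFS and
                // the entire input re-scanned by the next job; no state carries over.
            }
            System.out.println(Arrays.deepToString(centroids));
        }

        static int nearest(double[] p, double[][] centroids) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int i = 0; i < centroids.length; i++) {
                double dx = p[0] - centroids[i][0], dy = p[1] - centroids[i][1];
                double d = dx * dx + dy * dy;
                if (d < bestDist) { bestDist = d; best = i; }
            }
            return best;
        }

        static double[] mean(List<double[]> group) {
            double sx = 0, sy = 0;
            for (double[] p : group) { sx += p[0]; sy += p[1]; }
            return new double[]{sx / group.size(), sy / group.size()};
        }
    }

On a real cluster, each of those passes is a separately submitted job that re-reads the entire data set from disk, which is precisely the workload pattern that interactive, iterative analytic platforms are built to avoid.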
Figure 4. Best-of-breed big data architecture. The diagram positions three complementary platforms: Hadoop (batch) for ingest, transform and archive work – fast data loading, ELT/ETL, image processing, online archival – serving engineers and data scientists (~5 concurrent users); Teradata Aster (interactive) for discover-and-explore work – data discovery and investigative analytics over multi-structured data with SQL and MapReduce – serving data scientists and statisticians (~25 concurrent users); and Teradata (active) for analyze-and-execute work – ad-hoc/OLAP, predictive analytics, spatial/temporal analysis and active execution – serving business analysts (~100+ concurrent users).
5 Without getting too technical, this simply means that MapReduce works in a flow pattern that, unlike a relational database, does not lend itself to varied and dynamic workloads that are interactive
and iterative.
EB-6513
>
0212
>
PAGE 7 OF 13
© 2012, Hired Brains, Inc. No portion of this report may be reproduced or stored without prior written permission.
The Core of Hadoop is MapReduce
The MapReduce architecture provides automatic parallelization and distribution, fault recovery, I/O scheduling, monitoring, and status updates. It is both a programming model and a framework for massively parallel processing of large datasets across many low-end nodes. In operation, it is analogous to “Group By” aggregation in relational databases. Its ability to spread very large jobs across a cluster of ordinary servers is perhaps its best feature. In addition, it has excellent retry/failure semantics.

MapReduce at the programming level (or embedded in SQL, as Teradata SQL and some others have implemented) is simple and easy to use. Programmers code only Map() and Reduce() functions and are not involved with how the job is distributed. There is no data model, and there is no schema. The subject of a MapReduce job can be any irregular data. Because the assumption is that MapReduce clusters are composed of commodity hardware, and there are so many of them, it is normal for faults to occur during a job, and Hadoop handles a few faults automatically, shifting the work to other resources.
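The canonical word-count example, written here in the style of the Apache Hadoop tutorials with the job-configuration boilerplate omitted, shows how little of this the programmer sees: the two functions below are essentially all of the custom code, and the reduce step produces exactly the kind of aggregate a “Group By” would.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Map(): emit (word, 1) for every word in the input split.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reduce(): the framework groups by word; sum the counts per word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) sum += val.get();
                result.set(sum);
                context.write(key, result);
            }
        }
    }

Everything else – splitting the input, shipping the code to the data, shuffling and sorting intermediate pairs, retrying failed tasks – is handled by the framework.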
But there are some drawbacks. Because MapReduce is a single fixed data flow and lacks a schema, indexes and a high-level language, one could consider it a hammer, not a precision machine tool. It requires data parsing and full scans in its operation; it sacrifices disk I/O to avoid schemas, indexes, and optimizers; intermediate results are materialized on local disks. Runtime scheduling is based on speculative execution, considerably less sophisticated than that of today’s relational analytical platforms.

Even though Hadoop is evolving, and the community is adding capabilities rapidly, it lacks most of the security, resource management, concurrency, reliability and interactive capabilities of a data warehouse. Hadoop’s most basic components – the Hadoop Distributed File System (HDFS) and the MapReduce framework – are purpose-built for understanding and processing multi-structured data. The file system is crude in comparison to a mature relational database system, which, when compared to the universal use of SQL, is a limiting factor. However, its capabilities, which have just begun to be appreciated, override these limitations, and tremendous energy is apparent in the community that continues to enhance and expand Hadoop.

See Figure 5 for a sampling of tasks compatible with MapReduce processing in Hadoop.
Hadoop MapReduce with HDFS is not an integrated data management system. In fact, though it processes data across multiple nodes in parallel, it is not a complete massively parallel processing (MPP) system. It lacks almost every characteristic of an MPP system, with the exception of scalability and reliability. Hadoop stores multiple copies of the data it is processing, and the failure of a node can roll over to another node with the same data, though there is also a single point of failure at the HDFS Name Node, which the Hadoop community is looking to address in the long term. Today, NetApp provides a hardware-centric fail-over solution for the Name Node. Hadoop lacks security, load balancing and an optimizer. Data warehouse operators today will find Hadoop to be primitive and brittle to set up and operate, and users will find its performance lacking. In fact, its interactive features are limited to a pseudo-relational database, Hive, whose performance would be unacceptable to those accustomed to today’s data warehouse standards. In fairness, MapReduce was never conceived as an interactive knowledge worker tool, and the Hadoop community is making progress, but HDFS, which is the core data management feature of Hadoop, is simply not architected to provide the services that relational databases do today. And those relational database platforms for analytics are innovating just as rapidly, with:

• Hybrid row and columnar orientation.
• Temporal and spatial data types.
• Dynamic workload management.
• Large memory and solid-state drives.
• Hot/warm/cold storage.
• Almost limitless scalability.
Web crawling and search indexes:
• Crawl blog posts and later process them
• Process documents from a continuous web crawl and distributed training of support vector machines
• Image-based video copyright protection
• Image processing and conversions
• Parse and index mail logs for search

Visitor behavior; analyzing and indexing textual information:
• Recommender systems for behavioral targeting
• Session analysis and report generation
• Analyzing similarities of users’ behavior
• Filtering and indexing listings, processing log analysis for recommendation data
• Storage, log analysis, and pattern analysis
• Logs/streaming queues
• Filter and index listings, removing exact duplicates and grouping similar ones

Web log analysis; scientific applications:
• Charts calculation and web log analysis
• Process clickstream and demographic data to create web analytic reports
• Research for ad systems and web search
• Analyze users’ actions, click flow, and links
• Process data relating to people on the web
• Aggregate, store, and analyze data related to in-stream viewing of Internet video
• Particle physics, genomics, disease research, astronomy (NASA), etc.

Research on natural language processing and machine learning:
• Crawling, processing, and log analysis
• Store copies of internal log and dimension data sources and use as sources for reporting/analytics and machine learning

Image processing:
• Image content-based advertising and auto-tagging for social media
• Facial similarity and recognition
• Gathering WWW DNS data to discover content distribution networks and configuration issues

Data mining:
• Build scalable machine learning algorithms like canopy clustering, k-means, Naïve Bayes, etc.

Figure 5. Partial list of tasks compatible with MapReduce processing.
Integrated Data Warehouse
In determining the right architecture for your analytical work, your decision should turn primarily not on technical details or functionality checklists, but on the business (or science) problem you are trying to solve. These four use cases clarify the different approaches:

• Enterprise reporting of internal and external information for a broad cross-section of stakeholders, both inside and beyond the firewall, with extensive security, load balancing, dynamic workload management and scalability to hundreds of terabytes. Recommended: relational database data warehouse.

• Capturing large amounts of data in native format (without schema) for storage and staging for analysis; batch processing, primarily for data transformations, as well as understanding/investigation of novel internal and external (though mostly external) data via data scientists skilled in programming, analytical methods and data management, with sufficient domain expertise to communicate findings to the organization in business terms. Recommended: Hadoop.

• Deep data discovery and investigative analytics via data scientists and business users with SQL skills, integrating typical enterprise data with novel multi-structured data from web logs, sensors, social networks, etc. Recommended: hybrid system such as Teradata Aster with SQL-MapReduce®.

• Analysis of non-standard datasets, such as unstructured and semi-structured data, with or without integration with enterprise structured data, by a wide audience familiar with SQL or SQL-generating tools such as MicroStrategy. Recommended: hybrid system such as Teradata Aster with SQL-MapReduce.

Figure 6. Use cases.
Case for Hybrid Relational + MapReduce Systems
Hadoop is appealing because it is open source; therefore, there is no software license fee. However, most enterprises that are considering Hadoop beyond the proof-of-concept phase are turning to vendors that offer an enterprise version (e.g., Cloudera and Hortonworks), which, though still reasonably priced, is not free. In addition, the large, sometimes very large, clusters of servers needed to process data, while providing Hadoop with scalability and redundant processors, can add up to significant cost due to the sheer number of “commodity” servers used. It is becoming increasingly clear that big data processing has a natural affinity for the cloud, but there are many cases where companies will choose to use and/or deploy their own clusters. Storage vendors like NetApp – a Teradata partner – among others, provide storage and server infrastructure for small or large clusters, providing enterprise scale and reliability. At the moment, configuring and managing a Hadoop environment is complicated, with many configuration settings and often missing or conflicting documentation. Clearly, the time it takes to write and test a MapReduce program is longer than the time it takes to create a query in SQL. Indeed, most SQL is not written at all; it is emitted by BI tools interacting with business users.

In addition, those features that are common in today’s relational platforms for analytics are lacking in Hadoop. File compression is just one example; MapReduce cannot transparently process compressed files (there are some workarounds, but they require manual effort). Compression, security, load balancing, workload management – all of these are expected from a modern platform for data warehousing. Hadoop provides a great set of functionality and processing power, but it is far behind in operator “creature comforts.”
Path Analysis – discover patterns in rows of sequential data:
• nPath: complex sequential analysis for time series and behavioral patterns.
• nPath extensions: count entrants, track exit paths, count children, and generate subsequences.
• Sessionization: identifies sessions from time-series data in a single pass.

Graph and Relational Analysis – analyze patterns across rows of data:
• Graph analysis: finds the shortest path from a distinct node to all other nodes in a graph.
• nTree: new functions for performing operations on tree hierarchies.
• Other: triangle finding, square finding, clustering coefficient.

Text Analysis – derive patterns in textual data:
• Sentiment Analysis: classify content as positive or negative (for product reviews, customer feedback).
• Text Categorization: used to label content as spam/not spam.
• Entity Extraction/Rules Engine: identify addresses, phone numbers, and names in textual data.
• Text Processing: counts occurrences of words, identifies roots, and tracks relative positions of words and multi-word phrases.
• nGram: split an input stream of text into individual words and phrases.
• Levenshtein Distance: computes the distance between two words.

Data Transformation – transform data for more advanced analysis:
• Pivot: convert columns to rows or rows to columns.
• Log parser: generalized tool for parsing Apache logs.
• Unpack: extracts nested data for further analysis.
• Pack: compress multi-column data into a single column.
• Antiselect: returns all columns except a specified column.
• Multicase: case statement that supports row matches for multiple cases.

Figure 7. Sample Teradata Aster SQL-MapReduce® packaged analytical functions.
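To give a flavor of what one of these packaged functions computes, here is a rough Java illustration of the idea behind Sessionization: a single pass over time-ordered click events that starts a new session whenever the gap between consecutive events exceeds a timeout. This is a conceptual sketch only; the actual Aster function is invoked from SQL over partitioned, ordered rows.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.*;

    // Conceptual sketch of sessionization: assign a session id to each event
    // in a single pass, splitting on gaps larger than the timeout.
    public class Sessionize {

        record Event(String user, Instant ts) {}

        // Input must be ordered by timestamp for a single user.
        static List<Integer> assignSessions(List<Event> events, Duration timeout) {
            List<Integer> ids = new ArrayList<>();
            int session = 0;
            Instant prev = null;
            for (Event e : events) {
                if (prev != null && Duration.between(prev, e.ts()).compareTo(timeout) > 0) {
                    session++;  // gap exceeds timeout: start a new session
                }
                ids.add(session);
                prev = e.ts();
            }
            return ids;
        }

        public static void main(String[] args) {
            Instant t0 = Instant.parse("2012-02-01T10:00:00Z");
            List<Event> clicks = List.of(
                new Event("u1", t0),
                new Event("u1", t0.plusSeconds(60)),
                new Event("u1", t0.plusSeconds(4000)));  // gap > 30 minutes
            System.out.println(assignSessions(clicks, Duration.ofMinutes(30)));  // [0, 0, 1]
        }
    }

In the hybrid platform, this kind of logic runs inside the database as a SQL-invoked function, so it benefits from the same optimizer, compression and parallelism as the rest of the workload.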
Conclusion: Hadoop is good for aggregating lots of data and simple batch jobs, but Hadoop is a poor choice when the work can be done with SQL and the capabilities of a relational database. But when the challenge is daunting for a relational database – no existing schema for the data source, no existing mapping of the data source into an existing schema, unstructured or semi-structured data, and very large volumes – is Hadoop the only answer? There is an alternative: the Teradata Aster MapReduce Platform with the SQL-MapReduce® framework, which internalizes MapReduce and provides it as just another service from the database. Hadoop does not offer interactive analytics; it has poor provision for concurrent users; and its optimizers are crude. Even MapReduce itself is lean in terms of functionality, but there are libraries to do quite a few analytical tasks.

A hybrid relational database system that offers all the advantages of a relational database, but is also able to process MapReduce requests, would seem to be ideal. Not as a replacement for pure Hadoop jobs, but for those tasks that require the flexibility of MapReduce processing – delivered in the almost universally known SQL language – in a platform designed as a true MPP system with scalability, dynamic workload management, and security.

Also, the Teradata Aster solution provides more than 50 analytic “modules” pre-built for SQL-MapReduce, including some that cannot be found in other libraries. Examples include Teradata Aster’s graph analysis and path analysis (Aster nPath), and because they are integral to the database system, they are managed by the same optimizers, compression algorithms, and parallelism, unlike Hadoop.

To summarize, Figure 8 describes the roles of three platforms with respect to the four types of analytics:
Type I – Quantitative R&D:
• Hadoop: for understanding multi-structured data, creating test data, using libraries such as R for research; clearly life scientists/genomics, e.g.
• Relational: generally for extracting data for analysis and study.
• Hybrid (Teradata Aster): understanding large-scale analytics, raw multi-structured data, preparing data for testing, application of native analytical “modules,” some unique such as nPath.

Type II – Data Scientist or Quantitative Analyst:
• Hadoop: with callable libraries, can run extensive statistical routines and machine learning algorithms, with significant (manual) programming, cluster management and load balancing.
• Relational: in-database analytics with true MPP scale-up and scale-out; seamless integration between data integration, model execution and action invocation; extreme system reliability and manageability.
• Hybrid (Teradata Aster): native processing of multi-structured data; function same as relational.

Type III – Operational Analytics:
• Hadoop: use of storage and integration capabilities; extraction to databases via HiveQL or other third-party tools.
• Relational: current portfolio of data warehousing and business intelligence tools and analytical apps.
• Hybrid (Teradata Aster): same as relational, with added MapReduce functionality through SQL, specifically SQL-MapReduce.

Type IV – Business Intelligence and Discovery:
• Hadoop: as a data source only.
• Relational: data source plus existing connectors to many analytical tools.
• Hybrid (Teradata Aster): data source plus existing connectors to many analytical tools.

Figure 8. Recommended platform based on analytic types.
Conclusion
New information usually comes from unexpected places. Big leaps in understanding arise from unanticipated discoveries, but unanticipated does not imply a sloppy or accidental process. On the contrary, usable discoveries have to be verifiable, but the desire for knowledge is a drive for innovation by exploring new sources of information that can alter our perceptions and outlooks. It is easy to see that this quest is a driving force in understanding the content of information sources that we haven’t explored before. This is in stark contrast to data warehousing, where we provide the means and the data to explore and analyze data that is already known, cleaned, integrated and modeled into a stable logical schema. Unraveling the content of “big data” lacking obvious structure, or composed of things that are alien to data warehousing such as voice or images, begs for some new approaches, such as Hadoop and MapReduce processing in general.

But the two disciplines are largely complementary. Hadoop may replace some of the analytic environment, such as data integration and ETL, in some cases, but Hadoop does not replace relational databases. Careful analysis is needed to decide if the application areas by either approach are actually served or merely provisioned. Luckily for us, there are good choices from either perspective and, thanks to Teradata Aster (and others now emerging), there are good choices in the middle, too, combining excellent aspects of both technologies.

Hadoop typically enters the enterprise either in an isolated application, typically based in the cloud, or as a replacement or extension of ETL for the data warehouse. Other than the aforementioned businesses that use data as their core product, few enterprises have chosen Hadoop on an enterprise basis to support business intelligence and decision making. However, the usefulness of MapReduce programs is undeniable and will grow quickly. Application of MapReduce functionality within the confines of the familiar relational database and SQL, with full security, performance, scalability and concurrency, has considerable merit and should be an easy choice for organizations to make.

About the Author
Neil Raden, based in Santa Fe, New Mexico, is an industry analyst and active consultant, widely published author and speaker, and the founder of Hired Brains, Inc., www.hiredbrains.com. Hired Brains provides consulting, systems integration and implementation services in data warehousing, business intelligence, big data, decision automation and advanced analytics for clients worldwide. Hired Brains Research provides consulting, market research, product marketing and advisory services to the software industry.

Neil was a contributing author to one of the first (1995) books on designing data warehouses, and he is more recently the co-author of Smart (Enough) Systems: How to Deliver Competitive Advantage by Automating Hidden Decisions, Prentice-Hall, 2007. He welcomes your comments at [email protected] or at his blog at www.informationweek.com/authors/showAuthor.jhtml?authorID=656.
SQL-MapReduce, Teradata, and the Teradata logo are registered trademarks of Teradata Corporation and/or its affiliates in the U.S. and/or worldwide.
© 2012, Hired Brains, Inc. No portion of this report may be reproduced or stored without prior written permission.