Big Data Analytics Architecture: Putting All Your Eggs in Three Baskets

By Neil Raden, Hired Brains, Inc.

Table of Contents

Executive Summary
The "Big Data" Question
Types of Analytics
Positioning Hadoop
The Core of Hadoop is MapReduce
Integrated Data Warehouse
Case for Hybrid Relational + MapReduce Systems
Conclusion
About the Author

Executive Summary

The big story in analytics and information management in 2011 was big data. In 2012, the trend is accelerating. At the center of discussions about managing huge amounts of novel sources of information is Hadoop. One can perceive this effect as ousting data warehouses as the premier location for the gathering and analyzing of data, but that is only partially true. While the capabilities and applications of Hadoop have been clearly demonstrated for organizations that deal with massive amounts of raw data as their primary business, especially Web-oriented ones[1], where it fits in other kinds of organizations is not quite as clear. The core of Hadoop is a programming framework, MapReduce, which crunches data from other sources and can deliver analysis (or sometimes just aggregated data to be used in other analytical systems). Because Hadoop consumes "used" data[2], its application may tend to overlap with the way analytical processing has been done for decades: with data integration tools, data warehouses and a variety of reporting, analytical and decision support tools. It is for that reason that some Hadoop proponents see it as a replacement for the current analytic architecture, such as data warehousing.

To put this in perspective, it is important to understand what analytics means and how it is used. In this paper we present a formal definition of analytical "types" in order to exemplify the kinds of analytics and analytical processes that organizations employ. This helps to pinpoint where Hadoop is appropriate and where existing data warehousing and business intelligence environments are appropriate. There is a third option as well, with the emergence of new, hybrid systems that are relational databases with MapReduce capabilities, such as the Teradata® Aster MapReduce Platform. Others are also emerging, and competitors to Hadoop itself are already in business.
[1] Actually, data warehousing and business intelligence, though impressive, have for the most part been limited to subject areas involving finance, marketing, supply chain, customers and many other intra-enterprise domains (though there are countless exceptions, too). Big data includes totally new areas such as genomics, machine-generated data and the vast volumes of data created online, especially social networking.

[2] "Used data" is not an industry-standard term; rather, it is a construction used by this author to distinguish data stored in the primary systems that capture it from data that is re-used for other purposes. A CDR (Call Detail Record) in a telephone billing system is primary data. When it is pulled for analytical purposes, it is used data.

The "Big Data" Question

The problem of "big data" is hardly new. When we see charts like that in Figure 1 (the data is just representational), the explosive growth of data today is quite alarming.

It would appear from this graph that the amount of data in 2005 was practically insignificant compared to today. However, consider the exponential growth in data from the perspective of the year 2005, when enterprise systems and customer-level data were flooding into data warehouses, causing large-scale rethinking of methodologies. That curve looked just as daunting as it does today (see Figure 2). So as alarming as it may have seemed in 2005, it clearly was the tip of the iceberg for what was to follow. But analytic environments didn't collapse; they evolved to deal with the rising tide of data. Today, they continue to evolve to meet the challenge of big data.

Something has changed, though, and changed dramatically. When data warehouse sizes grew from 50GB to 1TB to 10TB, the data was still the same: structured data culled from internal operational systems, largely composed of historical data over time. It would be an exaggeration to say the volume, variety and velocity[3] of the data was predictable, but it was at least comprehensible. The problem facing companies was how to expand their relationally-based data warehouses to accommodate these new requirements and to provide the needed support for business intelligence and, in a larger sense, analytics.

Today, the problem is very different. Data sources are unpredictable, multi-structured (emanating from organized systems) and massive. Many are external to the enterprise. The techniques for mining data from these sources, and even the platforms most appropriate for doing so, are now somewhat in question. With the entry of Hadoop to the market, an approach that is separate from the relational data warehouse, the issue facing decision makers today is where and when to deploy these technologies for performing useful analytics. But first, a little clarity on what analytics means.

[Figure 1. 2001-2012: Exponential data warehouse data growth, in TB. Big data analytics has contributed significantly to this growth during the last three years of this period.]

[Figure 2. 2000-2005: Years of massive adoption of traditional business analytics; data warehouse data in TB.]
[3] Doug Laney, "3D Data Management: Controlling Data Volume, Velocity and Variety," META Group, 2001; though the title obviously suggests that the three Vs were meant in a different context from big data today.

Types of Analytics

The term "analytics" is not particularly precise, so before fitting analytics to the correct technology, it helps to have some precision. Hired Brains employs a four-type definition, as follows:

1. Type I Analytics: Quantitative Research and Development

The creation of theory and development of algorithms for all forms of quantitative analysis deserves the title Type I. This is the preserve of mathematicians, statisticians, and other pure quantitative scientists. These are the people who delve into the workings and manifestations of Hidden Markov Support Vector Machines, Linear Dynamical Systems, Spectral Clustering, machine learning algorithms like canopy clustering, k-means, Naïve Bayes, and a host of other exotic models. The discovery and enhancement of computer-based algorithms for these concepts is mostly the realm of academia and other research institutions. Commercial, governmental and other organizations (Wall Street, for example) may employ some staff with these very advanced skills, but in general, most organizations are able to conduct their necessary analytics without them, or by employing the results of their research. An obvious example is the FICO score, developed by Type I experts but employed widely in credit-granting institutions and even Human Resource organizations.

2. Type II Analytics: Data Scientists

More practical than theoretical, Type II is the incorporation of advanced analytical approaches derived from Type I activities.
This includes commercial software companies, vertical software implementations, and even the heavy "quants" who work in industry and apply these methods specifically to the work they do, since they operate in much the same way as commercial software companies, but for just one customer (though they often start their own software companies, too).

In fact, Type II could actually be broken down into two subtypes, Type II-A and Type II-B. While both perform roughly the same function, providing guidance and expertise in the application of quantitative analysis, they are differentiated by the sophistication of the techniques applied. Type II-A practitioners understand the mathematics behind the analytics and may apply very complex tools such as a Lucene wrapper, loopy logic, path analysis, root cause analysis, synthetic time series or Naïve Bayes derivatives that are understood by a small number of practitioners. What differentiates Type II-A from Type I is not necessarily the depth of knowledge they have about the formal methods of analytics (it is not uncommon for Type IIs to have a PhD, for example); it is that they also possess the business domain knowledge they apply, and their goal is to develop specific models for the enterprise, not for the general case, as Type Is usually do.

Type II-Bs, on the other hand, may work with more common and well-understood techniques such as logistic regression, ANOVA, and CHAID, and approach their analytical problems with more conventional best practices and/or packaged analytical solutions from third parties.

We titled this category "Data Scientist," which is a relatively new title for quantitatively adept people with accompanying business skills. The ability to formulate and apply tools to classification, prediction and even optimization, coupled with a fairly deep understanding of the business itself, is clearly in the realm of Type II efforts. However, it seems pretty likely that most so-called data scientists will lean more towards the IT aspects of quantitative and data-oriented subjects than business planning and strategy. The reason for this is that the term data scientist emerged from businesses like Google or Facebook where the data actually is the business, so understanding the data is equivalent to understanding the business. This is clearly not the case for most organizations. We see very few Type II data scientists with the in-depth knowledge of the whole business of, say, actuaries in the insurance business, whose extensive training should be a model for the newly designated data scientists (see our blog on this topic at www.informationweek.com/blog/228901009).

It is absolutely essential that someone in the organization has the role of chief communicator: someone comfortable working with quants, analysts and programmers, deconstructing their methodologies and processes, distilling them, and then rendering them in language that other stakeholders understand. Companies often fail to see that there is almost never anything to be gained by trying to put a PhD statistician into the role of managing a group of analysts and developers.

3. Type III Analytics: Operational Analytics

This is the part of analytics we are most interested in and most familiar with. For example, a Type II expert may develop a scoring model for his/her company. In Type III activity, parameters are chosen by the Type III analyst and are input into the model, and the scores calculated by the Type II models are embedded into an operational system that, say, generates offers for credit cards. Models developed by Type IIs can be applied and embedded in an almost infinite number of ways today. The application of Type II models to real work is the realm of Type III. In very complex applications, real-time data can be streamed into applications based on Type II models, with outcomes instantaneously derived through decision-making tools such as rules engines[4]. (A simplified sketch of such an embedded scoring step appears at the end of this subsection.)

Decision-making systems that rely on quantitative methods not well understood by their operators can lead to trouble. They must be carefully designed (and improved) to avoid overly burdening the recipients with useless or irrelevant information. This was a lesson learned in the early days of data mining: generating "interesting" results without understanding what was relevant usually led to flagging interest in the technology. In today's business environment, time is perhaps the scarcest commodity of all. Whether a decision-making system notifies people or machines, it must confine those messages to those that are the most relevant and useful.

False negatives are quite a bit more problematic, as they can lead to transactions passing through that should not have. Large banks have gone under by not catching trades that cost billions of dollars. Think of false negatives as being asleep at the wheel.

Packaged applications that embed quantitative methods such as predictive modeling or optimization are also Type III, in that the intricacies and the operation of the statistical or stochastic method are mostly hidden in a sort of "black box." As analytics using advanced quantitative methods becomes more acceptable to management over time, these packages become more popular.
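To make the Type III pattern concrete, here is a deliberately simplified, hypothetical sketch in Java: a scoring model of the kind a Type II analyst might hand over (fixed logistic-regression weights), embedded in an operational decision step whose cutoff parameter is chosen by the Type III analyst. All names, weights and thresholds here are invented for illustration; a production system would load model parameters from the modeling environment and route the decision through a rules engine.

```java
// Hypothetical sketch of Type III analytics: a Type II work product (a
// logistic scoring function with fixed, pre-estimated weights) embedded in
// an operational credit-card offer decision. Every name and number below is
// illustrative, not a real model.
public class CreditOfferScorer {

    // Coefficients estimated offline by the Type II modelers (invented values).
    private static final double INTERCEPT = -2.0;
    private static final double W_INCOME  = 0.00004; // per dollar of annual income
    private static final double W_DELINQ  = -0.8;    // per delinquency on file

    // Logistic response: maps the linear score onto a 0..1 probability.
    static double acceptProbability(double annualIncome, int delinquencies) {
        double z = INTERCEPT + W_INCOME * annualIncome + W_DELINQ * delinquencies;
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // The Type III decision step: the cutoff is a parameter chosen by the
    // operational analyst, not produced by the model itself.
    static boolean makeOffer(double annualIncome, int delinquencies, double cutoff) {
        return acceptProbability(annualIncome, delinquencies) >= cutoff;
    }

    public static void main(String[] args) {
        System.out.println(makeOffer(85000.0, 0, 0.5)); // true: p is about 0.80
        System.out.println(makeOffer(20000.0, 3, 0.5)); // false: p is about 0.03
    }
}
```

The division of labor is the point: the Type II work product is the scoring function; the Type III work is choosing the cutoff, wiring the score into the offer system, and evaluating the results.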
4. Type IV Analytics: Business Intelligence and Discovery

Type III analytics aren't of much value if their application in real business situations cannot be evaluated for their effectiveness. This is the analytical work we are most familiar with via reports, OLAP, dashboards and visualizations. It includes almost any activity that reviews information to understand what happened or how something performed, or to scan and free-associate what patterns appear from analysis. The mathematics involved is simple.

Figure 3. Summary of analytical types.
- Type I. Descriptive title: Quantitative R&D. Quantitative sophistication/numeracy: PhD or equivalent. Sample roles: creation of theory, development of algorithms; academic/research; often employed in business or government for very specialized roles.
- Type II. Descriptive title: Data Scientist or Quantitative Analyst. Quantitative sophistication/numeracy: advanced math/stat, not necessarily PhD. Sample roles: internal expert in statistical and mathematical modeling and development, with solid business domain knowledge.
- Type III. Descriptive title: Operational Analytics. Quantitative sophistication/numeracy: good business domain knowledge; background in statistics optional. Sample roles: running and managing analytical models; strong skills in, and/or project management of, analytical systems implementation.
- Type IV. Descriptive title: Business Intelligence/Discovery. Quantitative sophistication/numeracy: data- and numbers-oriented, but no special advanced statistical skills. Sample roles: reporting, dashboard, OLAP and visualization use, possibly design; performing posterior analysis of results driven by quantitative methods. Spreadsheets prevail; newer "business discovery tools" are gaining traction.

[4] Taylor and Raden, "Smart Enough Systems," Prentice Hall, 2007.
Positioning Hadoop

Hadoop "data warehouses" do not resemble the data warehouse/analytics environments that are common in organizations today. They exist in businesses like Google and Amazon for web log parsing, indexing, and other batch data processing, as well as for storing enormous amounts of unfiltered data. Petabyte-size data warehouses in Hadoop are not data warehouses as we know them; they are a collection of files on a distributed file system designed for parallel processing. To call these file systems "a data warehouse" is misleading, because a data warehouse exists to serve a broad swath of uses and people, particularly in business intelligence, which is both interactive and iterative. MapReduce is a programming paradigm with a single data flow type that takes the form of a directed acyclic graph of operators[5]. These platforms lack built-in support for iterative programs, quite different from the operations of a relational database. To put it in layman's terms, there are things that Hadoop is exceptionally well designed for that relational databases would struggle to do. Conversely, a relational database-based data warehouse performs a multitude of useful functions that Hadoop does not yet possess.

Hadoop is described as a solution to a myriad of applications: web log analysis, visitor behavior, image processing, search indexes, analyzing and indexing textual content, research in natural language processing and machine learning, scientific applications in physics, biology and genomics, and all forms of data mining (see Figure 4). While it is demonstrable that Hadoop has been applied to all of these domains and more, it is important to distinguish between supporting these applications and actually performing them. Hadoop comes out of the box with no facilities at all to do most of this analysis. Instead, it requires the application of libraries, available either through the open source community at forge.com or from the commercial distributions of Hadoop. In no case can these be considered a seamless bundle of software that is easy to deploy in the enterprise. A more accurate description is that Hadoop facilitates these applications by grinding through data sources that were previously too expensive to mine. In many cases, the end result of a MapReduce job is the creation of a new data set that is either loaded into a data warehouse or used directly by programs such as SAS or Tableau.

[Figure 4. Best-of-breed big data architecture. Hadoop (batch): ingest, transform, archive; fast data loading, ELT/ETL, image processing, online archival; serving engineers and data scientists, roughly 5 concurrent users. Teradata Aster (interactive): discover and explore; data discovery and investigative analytics, multi-structured data, SQL and MapReduce; serving statisticians, roughly 25 concurrent users. Teradata Active: analyze and execute; ad-hoc/OLAP, predictive analytics, spatial/temporal, active execution; serving business analysts, 100+ concurrent users.]

[5] Without getting too technical, this simply means that MapReduce works in a flow pattern that, unlike a relational database, does not lend itself to varied and dynamic workloads that are interactive and iterative.

The Core of Hadoop is MapReduce

The MapReduce architecture provides automatic parallelization and distribution, fault recovery, I/O scheduling, monitoring, and status updates. It is both a programming model and a framework for massively parallel processing of large datasets across many low-end nodes. In operation, it is analogous to "Group By" aggregation in relational databases. Its ability to spread very large jobs across a cluster of ordinary servers is perhaps its best feature. In addition, it has excellent retry/failure semantics.
MapReduce at the programming level (or embedded in SQL, as Teradata SQL and some others have implemented) is simple and easy to use. Programmers code only Map() and Reduce() functions and are not involved with how the job is distributed (a minimal word-count sketch appears at the end of this section). There is no data model, and there is no schema; the subject of a MapReduce job can be any irregular data. Because the assumption is that MapReduce clusters are composed of commodity hardware, and there are so many of them, it is normal for faults to occur during a job, and Hadoop handles a few faults automatically, shifting the work to other resources.

But there are some drawbacks. Because MapReduce is a single fixed data flow and lacks a schema, indexes and a high-level language, one could consider it a hammer, not a precision machine tool. It requires data parsing and full scans in its operation; it sacrifices disk I/O to avoid schemas, indexes, and optimizers; and intermediate results are materialized on local disks. Runtime scheduling is based on speculative execution, considerably less sophisticated than in today's relational analytical platforms.

Even though Hadoop is evolving, and the community is adding capabilities rapidly, it lacks most of the security, resource management, concurrency, reliability and interactive capabilities of a data warehouse. Hadoop's most basic components, the Hadoop Distributed File System (HDFS) and the MapReduce framework, are purpose-built for understanding and processing multi-structured data. The file system is crude in comparison to a mature relational database system, which, when compared to the universal use of SQL, is a limiting factor. However, its capabilities, which have just begun to be appreciated, override these limitations, and tremendous energy is apparent in the community that continues to enhance and expand Hadoop.

Hadoop MapReduce with HDFS is not an integrated data management system. In fact, though it processes data across multiple nodes in parallel, it is not a complete massively parallel processing (MPP) system; it lacks almost every characteristic of an MPP system, with the exception of scalability and reliability. Hadoop stores multiple copies of the data it is processing, and the failure of a node can roll over to another node with the same data, though there is also a single point of failure at the HDFS Name Node, which the Hadoop community is looking to address in the long term. Today, NetApp provides a hardware-centric fail-over solution for the Name Node. Hadoop lacks security, load balancing and an optimizer. Data warehouse operators today will find Hadoop primitive and brittle to set up and operate, and users will find its performance lacking. In fact, its interactive features are limited to a pseudo-relational database, Hive, whose performance would be unacceptable to those accustomed to today's data warehouse standards. In fairness, MapReduce was never conceived as an interactive knowledge worker tool, and the Hadoop community is making progress, but HDFS, which is the core data management feature of Hadoop, is simply not architected to provide the services that relational databases do today. And those relational database platforms for analytics are innovating just as rapidly, with:

• Hybrid row and columnar orientation.
• Temporal and spatial data types.
• Dynamic workload management.
• Large memory and solid-state drives.
• Hot/warm/cold storage.
• Almost limitless scalability.
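Returning to the programming model itself: the following is a minimal sketch of the canonical word-count job written against the Hadoop Java MapReduce API (the standard org.apache.hadoop.mapreduce classes; input and output paths are supplied on the command line). The mapper tokenizes each input line and emits a (word, 1) pair per token; the reducer sums the counts for each word, which is exactly the role that SELECT word, COUNT(*) ... GROUP BY word plays in a relational database.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map(): for each input line, emit (word, 1) for every token.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce(): the framework groups pairs by word; sum the counts per word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        // Job wiring; distribution, retries and shuffling are the framework's job.
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Note what the sketch does not contain: no schema, no indexes, no optimizer. The framework decides how map and reduce tasks are spread across the cluster and retried on failure; everything else, including any further analysis of the output, is left to the programmer.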
See Figure 5 for a sampling of tasks compatible with MapReduce processing in Hadoop.

Figure 5. Partial list of tasks compatible with MapReduce processing.
- Web crawling: crawl blog posts and later process them; process documents from a continuous web crawl, including distributed training of support vector machines; crawling, processing, and log analysis.
- Search indexes: parse and index mail logs for search; filter and index listings, removing exact duplicates and grouping similar ones.
- Visitor behavior: recommender systems for behavioral targeting; session analysis and report generation; analyzing similarities of users' behavior; filtering and indexing listings and processing log analysis for recommendation data.
- Analyze and index textual information: storage, log analysis, and pattern analysis; logs/streaming queues; store copies of internal log and dimension data sources and use them as sources for reporting/analytics and machine learning.
- Web log analysis: charts calculation and web log analysis; process clickstream and demographic data to create web analytic reports; research for ad systems and web search; analyze users' actions, click flow, and links; process data relating to people on the web; aggregate, store, and analyze data related to in-stream viewing of Internet video.
- Scientific, and research on natural language processing and machine learning: particle physics, genomics, disease research, astronomy (NASA), etc.; gathering WWW DNS data to discover content distribution networks and configuration issues.
- Image processing: image content-based advertising and auto-tagging for social media; facial similarity and recognition; image processing and conversions; image-based video copyright protection.
- Data mining: build scalable machine learning algorithms like canopy clustering, k-means, Naïve Bayes, etc.

Integrated Data Warehouse

In determining the right architecture for your analytical work, your decision should turn primarily not on technical details or functionality checklists, but on the business (or science) problem you are trying to solve. The four use cases in Figure 6 clarify the different approaches.

Figure 6. Use cases.
- Enterprise reporting of internal and external information for a broad cross-section of stakeholders, both inside and beyond the firewall, with extensive security, load balancing, dynamic workload management and scalability to hundreds of terabytes: relational database data warehouse.
- Capturing large amounts of data in native format (without schema) for storage and staging for analysis; batch processing, primarily for data transformations, as well as understanding/investigation of novel internal and external (though mostly external) data by data scientists skilled in programming, analytical methods and data management, with sufficient domain expertise to communicate findings to the organization in business terms: Hadoop.
- Deep data discovery and investigative analytics by data scientists and business users with SQL skills, integrating typical enterprise data with novel multi-structured data from web logs, sensors, social networks, etc.: hybrid system such as Teradata Aster with SQL-MapReduce®.
- Analysis of non-standard datasets, such as unstructured and semi-structured data, with or without integration with enterprise structured data, by a wide audience familiar with SQL or SQL-generating tools such as MicroStrategy: hybrid system such as Teradata Aster with SQL-MapReduce.
Case for Hybrid Relational + MapReduce Systems

Hadoop is appealing because it is open source; therefore, there is no software license fee. However, most enterprises that are considering Hadoop beyond the proof-of-concept phase are turning to vendors that offer an enterprise version (e.g., Cloudera and Hortonworks), which, though still reasonably priced, is not free. In addition, the large, sometimes very large, clusters of servers needed to process data, while providing Hadoop with scalability and redundant processors, can add up to significant cost due to the sheer number of them used, even though they are "commodity" servers. It is becoming increasingly clear that big data processing has a natural affinity for the cloud, but there are many cases where companies will choose to use and/or deploy their own clusters. Storage vendors like NetApp, a Teradata partner, among others, provide storage and server infrastructure for small or large clusters, with enterprise scale and reliability.

At the moment, configuring and managing a Hadoop environment is complicated, with many configuration settings and often missing or conflicting documentation. Clearly, writing and testing a MapReduce program takes longer than creating a query in SQL. Indeed, most SQL is not written at all; it is emitted by BI tools interacting with business users.

In addition, those features that are common in today's relational platforms for analytics are lacking in Hadoop. File compression is just one example: MapReduce cannot transparently process compressed files (there are some workarounds, but they require manual effort). Compression, security, load balancing, workload management: all of these are expected from a modern platform for data warehousing. Hadoop provides a great set of functionality and processing power, but it is far behind in operator "creature comforts."
Figure 7. Sample Teradata Aster SQL-MapReduce® packaged analytical functions.
- Path Analysis (discover patterns in rows of sequential data): nPath, complex sequential analysis for time series and behavioral patterns; nPath extensions, which count entrants, track exit paths, count children, and generate subsequences; Sessionization, which identifies sessions from time series data in a single pass.
- Graph and Relational Analysis (analyze patterns across rows of data): graph analysis, which finds the shortest path from a distinct node to all other nodes in a graph; nTree, new functions for performing operations on tree hierarchies; others, including triangle finding, square finding, and clustering coefficient.
- Text Analysis (derive patterns in textual data): Sentiment Analysis, classifying content as positive or negative (for product reviews, customer feedback); Text Categorization, used to label content as spam/not spam; Entity Extraction/Rules Engine, identifying addresses, phone numbers, and names in textual data; Text Processing, which counts occurrences of words, identifies roots, and tracks relative positions of words and multi-word phrases; nGram, which splits an input stream of text into individual words and phrases; Levenshtein Distance, which computes the distance between two words.
- Data Transformation (transform data for more advanced analysis): Pivot, converting columns to rows or rows to columns; Log Parser, a generalized tool for parsing Apache logs; Unpack, extracting nested data for further analysis; Pack, compressing multi-column data into a single column; Antiselect, returning all columns except a specified column; Multicase, a case statement that supports row matches for multiple cases.

In conclusion: Hadoop is good for aggregating lots of data and for simple batch jobs, but it is a poor choice when the work can be done with SQL and the capabilities of a relational database. But when the challenge is daunting for a relational database (no existing schema for the data source, no existing mapping of the data source into an existing schema, unstructured or semi-structured data, and very large volumes), is Hadoop the only answer? There is an alternative: the Teradata Aster MapReduce Platform with the SQL-MapReduce® framework, which internalizes MapReduce and provides it as just another service from the database. Hadoop does not offer interactive analytics; it has poor provision for concurrent users; and its optimizers are crude. Even MapReduce itself is lean in terms of functionality, though there are libraries to do quite a few analytical tasks.

A hybrid relational database system that offers all the advantages of a relational database but is also able to process MapReduce requests would seem to be ideal: not as a replacement for pure Hadoop jobs, but for those tasks that require the flexibility of MapReduce processing, delivered in the almost universally known SQL language, on a platform designed as a true MPP system with scalability, dynamic workload management, and security.

Also, the Teradata Aster solution provides more than 50 analytic "modules" pre-built for SQL-MapReduce, including some that cannot be found in other libraries. Examples include Teradata Aster's graph analysis and path analysis (Aster nPath); because they are integral to the database system, they are managed by the same optimizers, compression algorithms, and parallelism, unlike Hadoop.
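To show what "MapReduce as just another service from the database" looks like from the application side, the sketch below invokes nPath through plain JDBC. This is a hedged illustration, not verbatim product code: the connection URL, table and column names (clicks, userid, ts, pagetype) are hypothetical, and the nPath clauses are patterned on Aster's published examples; the Teradata Aster documentation is the authoritative source for the exact syntax.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Illustrative sketch: calling a SQL-MapReduce function (nPath) from ordinary
// JDBC code. All identifiers below (host, database, table, columns) are
// invented for the example, and the nPath clause structure is approximate.
public class NPathExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:ncluster://aster-host/beehive", "user", "password"); // hypothetical URL
             Statement stmt = conn.createStatement()) {

            // Per user, reconstruct the page sequence from landing on the home
            // page to reaching checkout: a sequential pattern that is natural
            // in nPath but awkward to express in plain SQL.
            String sql =
                "SELECT userid, path " +
                "FROM nPath( " +
                "  ON clicks " +
                "  PARTITION BY userid " +
                "  ORDER BY ts " +
                "  MODE (NONOVERLAPPING) " +
                "  PATTERN ('HOME.BROWSE*.CHECKOUT') " +
                "  SYMBOLS (pagetype = 'home' AS HOME, " +
                "           pagetype = 'product' AS BROWSE, " +
                "           pagetype = 'checkout' AS CHECKOUT) " +
                "  RESULT (FIRST(userid OF HOME) AS userid, " +
                "          ACCUMULATE(pagetype OF ANY(HOME, BROWSE, CHECKOUT)) AS path))";

            try (ResultSet rs = stmt.executeQuery(sql)) {
                while (rs.next()) {
                    System.out.println(rs.getString("userid") + " -> " + rs.getString("path"));
                }
            }
        }
    }
}
```

The design point is visible in the shape of the code: the sequential pattern matching executes inside the database, in parallel, under the same optimizer and workload management as any other SQL, and the application simply consumes an ordinary result set.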
To summarize, Figure 8 describes the roles of the three platforms with respect to the four types of analytics.

Figure 8. Recommended platform based on analytic types.
- Type I, Quantitative R&D. Hadoop: for understanding multi-structured data, creating test data, and using libraries such as R for research; clearly life sciences/genomics, for example. Relational: generally for extracting data for analysis and study. Hybrid (Teradata Aster): understanding large-scale analytics and raw multi-structured data, preparing data for testing, and application of native analytical "modules," some unique, such as nPath.
- Type II, Data Scientist or Quantitative Analyst. Hadoop: with callable libraries, can run extensive statistical routines and machine learning algorithms, with significant (manual) programming, cluster management and load balancing. Relational: in-database analytics with true MPP scale-up and scale-out; seamless integration between data integration, model execution and action invocation; extreme system reliability and manageability. Hybrid: native processing of multi-structured data; otherwise functions the same as relational.
- Type III, Operational Analytics. Hadoop: use of storage and integration capabilities, with extraction to databases via HiveQL or other third-party tools. Relational: current portfolio of data warehousing and business intelligence tools and analytical apps. Hybrid: same as relational, with added MapReduce functionality through SQL, specifically SQL-MapReduce.
- Type IV, Business Intelligence and Discovery. Hadoop: as a data source only. Relational: data source plus existing connectors to many analytical tools. Hybrid: data source plus existing connectors to many analytical tools.

Conclusion

New information usually comes from unexpected places. Big leaps in understanding arise from unanticipated discoveries, but unanticipated does not imply a sloppy or accidental process. On the contrary, usable discoveries have to be verifiable, but the desire for knowledge is a drive for innovation, exploring new sources of information that can alter our perceptions and outlooks. It is easy to see that this quest is a driving force in understanding the content of information sources that we haven't explored before. This is in stark contrast to data warehousing, where we provide the means and the data to explore and analyze data that is already known, cleaned, integrated and modeled into a stable logical schema. Unraveling the content of "big data" lacking obvious structure, or composed of things that are alien to data warehousing such as voice or images, begs for some new approaches, such as Hadoop and MapReduce processing in general.
But the two disciplines are largely complementary. Hadoop may replace some of the analytic environment, such as data integration and ETL, in some cases, but Hadoop does not replace relational databases. Careful analysis is needed to decide whether the application areas are actually served, or merely provisioned, by either approach. Luckily for us, there are good choices from either perspective and, thanks to Teradata Aster (and others now emerging), there are good choices in the middle too, combining excellent aspects of both technologies.

Hadoop typically enters the enterprise either in an isolated application, typically based in the cloud, or as a replacement for or extension of ETL for the data warehouse. Other than the aforementioned businesses that use data as their core product, few enterprises have chosen Hadoop on an enterprise basis to support business intelligence and decision making. However, the usefulness of MapReduce programs is undeniable and will grow quickly. Application of MapReduce functionality within the confines of the familiar relational database and SQL, with full security, performance, scalability and concurrency, has considerable merit and should be an easy choice for organizations to make.

About the Author

Neil Raden, based in Santa Fe, New Mexico, is an industry analyst and active consultant, widely published author and speaker, and the founder of Hired Brains, Inc., www.hiredbrains.com. Hired Brains provides consulting, systems integration and implementation services in data warehousing, business intelligence, big data, decision automation and advanced analytics for clients worldwide. Hired Brains Research provides consulting, market research, product marketing and advisory services to the software industry.

Neil was a contributing author to one of the first (1995) books on designing data warehouses, and he is more recently the co-author of Smart (Enough) Systems: How to Deliver Competitive Advantage by Automating Hidden Decisions, Prentice-Hall, 2007. He welcomes your comments at [email protected] or at his blog at www.informationweek.com/authors/showAuthor.jhtml?authorID=656.

SQL-MapReduce, Teradata, and the Teradata logo are registered trademarks of Teradata Corporation and/or its affiliates in the U.S. and/or worldwide.