Architecting the data layer for analytic applications
A look at solutions designed to work in today's data-intensive, high-volume, read-only environment
Spring 2011
Advisory Services

Table of contents
The heart of the matter: Strong analytical capabilities—at the core of today's business
An in-depth discussion: Exploring architectural alternatives—Finding the right fit for your unique organization
What this means for your business: Leveraging valid, timely data—Seizing a competitive advantage

The heart of the matter
Strong analytical capabilities—at the core of today's business

The large volumes of both structured and unstructured data currently bombarding global organizations are intensifying their need for stronger analytical capabilities. Analytics are a major component of any Business Intelligence (BI) application, but traditional data strategies no longer work in today's data-intensive, high-volume, read-only environment for large-scale data analysis (Big Data) and very large data warehouses. As a result, data architects are struggling to tailor a data processing and storage strategy that not only fits their organization's unique analytic needs and circumstances, but is also aligned with management's overall business goals.

This is not an easy task. While the emergence of non-traditional niche architectures has expanded the scope of possibilities from which to select, it has also added more complexity to the mix. Today's challenging environment requires efficient query throughput and bulk data loading with acceptable latency. The mingling of structured and unstructured data, along with the advent of Semantic Web technologies, requires experimentation with some of these emerging technologies as data processing and storage models.

As we see it, in today's dynamic IT environment, strong analytical capabilities are at the core of successful business initiatives. Those organizations that carefully plan their data-management strategy with their overall business goals in mind are the ones that will find themselves positioned to seize a competitive advantage in today's dynamic global marketplace. It is imperative for the business to leverage this information as a true asset. To support smart business decision-making and maintain a competitive edge, managers are now demanding access to operational data, historical data and data collected from extended enterprises and multiple channels in near real-time. It is incumbent on IT to provide faster access and better visualization tools to empower business users to effectively analyze and discover or mine this data. Business and IT strategy must be aligned, and enterprise architects must provide the bridge between the business and the technology.

So where do you begin? There is no one-size-fits-all strategy, nor is there one obvious "best-choice" technology to implement. Before setting out to design any BI application, it is critical to put in the necessary time and effort up front to work your way through the maze. Ask yourselves some key questions to determine which features are most important to your company. With those answers top of mind, it's time to compare the pros and cons of the new alternative technologies. Only then will you be ready to plan your strategy and implement the solution that is the best fit for your unique company.
To help you navigate through the maze of available alternatives, this paper examines the data, or persistence, layer in the BI stack, comparing and contrasting some of the key architectures now available to you. Among the options we address are: Parallel Database Management Systems (DBMS), Map-Reduce (MR) distributed processing systems, In-Memory Databases (IMDB), Distributed Hash Tables (DHT), the Hadoop Distributed File System (HDFS) and Column Databases.

An in-depth discussion
Exploring architectural alternatives—Finding the right fit for your unique organization

Asking—and answering—the right questions

Traditional data strategies no longer work in the current data-intensive, high-volume, read-only environment, which requires efficient query throughput and bulk data loading with acceptable latency. The large volume of data flooding organizations today has intensified their need to strengthen their BI analytical capabilities, and new alternative architectures are emerging to meet this challenge. The mingling of structured and unstructured data—coupled with the advent of Semantic Web technologies—is driving data architects to experiment with some of these new alternatives to explore their potential as effective data processing and storage models. This paper takes a look at the data layer to compare and contrast some of the key architecture choices now available to you. But to design the right solution—one that will best meet your organization's unique needs and circumstances—you must first ask yourselves some key questions.

Before designing any BI application, it is important to consider what matters most for your organization and which new technologies are available to meet those needs. Ask yourselves these and other architectural questions:

• What are some of the emerging trends and specialized data storage techniques that are being tested and used by companies to drive innovation and competitive advantage?
• When should we use clustered solutions for data stores and data services?
• What applications are appropriate for the MR framework—or would Parallel DBMS or DHTs be a better choice for our organization?
• For analytical applications such as data warehousing, would it be best to use column databases or appliances? Or would an IMDB, Object Database or NoSQL solution be the best fit?
• Will Consistency, Availability, Partitioning (CAP) characteristics be sufficient for our needs, allowing us to sacrifice Atomicity, Consistency, Isolation and Durability (ACID) characteristics?

To help you formulate answers to key architectural questions such as these, we discuss the pros and cons of the alternative approaches now available to you, presenting our point of view on their suitability under various circumstances.

Data storage—Choosing between Shared-Nothing and Shared-Disk architecture

Traditionally, clustered solutions have provided scalability and High Availability (HA), or failover, capabilities for data storage and retrieval. Clustering is typically enabled by deploying either a Shared-Nothing or a Shared-Disk architecture for data storage—the choice depends on whether scalability, consistency or failover is most important for your application.

• Shared-Nothing is a data architecture for distributed data storage in a clustered environment where data is partitioned across a set of clustered nodes and each node manages its own data (a simple partitioning sketch follows these definitions). Typical shared-nothing deployments use hundreds of nodes with private storage.
• Shared-Disk architecture is used in clustered solutions for data processing to enable HA of the services provided by the cluster. Although the services run on separate nodes, the actual data resides on a shared Storage Area Network (SAN) location. Disk-mirroring techniques further enhance service failover. HA can be implemented by using multiple nodes in Active-Active or Active-Passive configurations. HA clusters usually use a heartbeat private network connection to monitor the health and status of each node in the cluster. Most commercial databases have an HA deployment configuration.
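To make the shared-nothing idea concrete, here is a minimal sketch of hash partitioning: hash the partitioning key and route each record to the node that owns that hash bucket. The node count, key choice and hash function are assumptions for illustration, not any vendor's implementation.

```python
# Illustrative sketch of hash partitioning in a shared-nothing cluster.
# The node count, key choice and hash function are assumptions for the example.
import hashlib
from collections import defaultdict

NUM_NODES = 4  # hypothetical cluster size

def node_for_key(partition_key: str, num_nodes: int = NUM_NODES) -> int:
    """Map a partitioning key to the node that owns it."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

def partition_rows(rows, key_column):
    """Group rows by owning node; each node would load only its own slice."""
    slices = defaultdict(list)
    for row in rows:
        slices[node_for_key(str(row[key_column]))].append(row)
    return slices

if __name__ == "__main__":
    orders = [
        {"customer_id": "C001", "amount": 120.0},
        {"customer_id": "C002", "amount": 75.5},
        {"customer_id": "C003", "amount": 210.0},
    ]
    for node, rows in sorted(partition_rows(orders, "customer_id").items()):
        print(f"node {node}: {rows}")
```

The choice of partitioning key matters: queries confined to a single key can be answered by one node, while a poorly chosen key forces cross-node coordination, which is the trade-off the comparison below returns to.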
In the following comparison, we contrast these two architectures.[1]

Shared-Nothing architecture:
• Each node has its own mass storage as well as main memory. In practice, each node usually has multiple CPUs, memory and disks connected through a high-speed interconnect.
• A careful partitioning strategy across nodes is needed, and work that spans nodes requires distributed transactions and two-phase commit.
• Write performance: Data partitioning alleviates write contention, since writes can be isolated to single nodes.
• Query performance: This architecture works very well when all of the data a query needs is available on a single node.
• Shared-Nothing is a favorable deployment model when scalability is important, but it requires careful data partitioning across nodes. It is used for parallel data processing and can be chosen when data consistency can be sacrificed.

Shared-Disk architecture:
• Each node has its own main memory, but all nodes share mass storage—usually a storage area network. The single shared disk presents data-partitioning issues that must be handled during data loading.
• Write performance: Distributed locking is an issue, as write requests from multiple nodes need coordination.
• Query performance: When querying data from a Shared-Disk deployment, there is the possibility of Input/Output (I/O) disk contention from multiple nodes; this necessitates careful data partitioning during data load.
• Shared-Disk is a favorable deployment model when data consistency is important. It is used for failover of data services and is the better approach where consistency is required.

[1] The case for Shared Nothing, http://db.cs.berkeley.edu/papers/hpts85-nothing.pdf

A shared-nothing architecture requires an efficient parallel-processing application service, such as a Parallel DBMS or the MR programming model, for parallel data processing. Both of these approaches build on a Shared-Nothing architecture and hide internal storage and access methods from end users. Both achieve a high degree of parallelism for operations such as loading data, building indexes and evaluating queries by dividing the data across different nodes to facilitate parallel processing. As robust, high-performance computing platforms, both provide a high-level programming environment that is inherently parallelizable, and both use a distributed, partitioned data storage strategy.

In the comparison below, we look at these two platform approaches in terms of cost, complexity and architecture fit (latency and access patterns) in an IT organization. As you consider these alternatives, it is important to keep in mind the complexity of the transaction model as well as data-partitioning techniques, and to focus on the update and query capabilities required from the application.
A comparison of MR systems and Parallel DBMS:

• Architecture fit: Because MR programs use key-value pair formats, they do not require users to define a schema for their data; complex data structures, such as those involving compound keys, must be built into the application programs. Parallel DBMS require pre-defined schemas and data models representing the data structures that application programs can query.
• Architecture fit: Because MR systems maintain cluster information in a single name node, there is a single point of failure, and failover of the name node requires another node that is mostly idle. Database services on a Parallel DBMS are fault tolerant.
• Architecture fit: With MR, data integrity must be enforced by the MR programs. With a Parallel DBMS, data integrity constraints are abstracted away from programs, since the database engine enforces the integrity rules defined in the database catalogue and database structures (e.g., constraints, primary and foreign keys).
• Architecture fit: MR requires the setup of a distributed file system such as HDFS/GFS for data storage, compression and persistence; HDFS also provides data partitioning and data replication (for redundancy) across multiple nodes. (A DHT, discussed later, is an interesting alternative to HDFS.) A Parallel DBMS uses the file system recommended by the database vendor.
• Complexity: MR can use native interfaces in languages such as C, C++, Java, PHP and Perl. Higher-level interfaces that simplify writing MR jobs include Hive, Pig, Scope and Dryad; Hive, for example, provides a SQL interface to MR, and Hadoop Streaming allows the use of any language that can read and write standard input/output. Parallel DBMS use the ubiquitous ANSI SQL plus a procedural language provided by the database vendor, and the use of query optimizers and query parallelization is transparent to the SQL developer.
• Architecture fit: The lack of schema constraints makes MR systems very flexible, but for the same reason they are less suitable for shared-development environments. A Parallel DBMS is less flexible when data structures need to be shared, requiring consensus among developers on the design of data structures and tables.
• Complexity: MR introduces movement of partitioned data across the network, since all records for a particular key are sent to the same reducer; hence it is a poor platform choice when the job must communicate extensively among nodes. A Parallel DBMS is the more favorable approach when a job must communicate extensively among nodes, which is often necessary due to poor data partitioning.
• Complexity: Any indexes required to improve query performance must be built by the MR programmer. A Parallel DBMS leverages indexes that are shared by applications.
• Cost: MR systems are open-source projects with no software licensing fees, which can present a challenge for product support, given that one has to rely on the open-source developer community. Parallel DBMS require more expensive hardware and software, but come with vendor support.
• Cost: MR systems can be deployed on relatively inexpensive commodity hardware; the combination provides cheaper, scalable, fault-tolerant and reliable storage plus a data processing solution.
• Architecture fit: MR systems are efficient ETL systems that can be used to quickly load and process large amounts of data in an ad hoc manner; as such, MR complements rather than competes with DBMS technology.
• Architecture fit: Parallel DBMS are not ETL systems, although some vendors would like to use the database engine for transformation when using the ELT approach.
• Architecture fit: MR systems are popular for search, indexing, the creation of social graphs, recommendation engines, and the storage and processing of semi-structured data such as web logs that can easily be represented as key-value pairs. Parallel DBMS are suitable for OLTP and analytical warehouse environments, making them the better choice for query-intensive workloads but not the best choice for unstructured data.
• Architecture fit: MR systems are efficient for many data-mining, data-clustering and machine-learning applications where the program must make multiple passes over the data; the output of one MR job can be used as the input of another. (See the commentary on column databases later in this paper.)
• Architecture fit: The MR computing platform allows developers to build complex analytical and complex event-processing applications that combine structured data (in an RDBMS), such as user profiles, with unstructured data, such as log data, video/audio data or streaming data that tracks user actions. This "mashup" can provide input to machine-learning algorithms and can also provide a richer user experience and insights that neither dataset can provide on its own. Such applications cannot be structured as single SQL aggregate queries; instead, they require a complex dataflow program where the output of one part of the application is the input of another.
• Architecture fit: MR systems can be used for batch processing and non-interactive SQL, for example ad hoc queries on data. They efficiently handle streaming data, such as the analysis and sessionization of web events for click-stream analysis, and they are useful for the calculation of charts and web-log analysis.
• Architecture fit: MR systems are good for batch applications that can tolerate high latency; Parallel DBMS are good for applications requiring low latency and high query throughput.
• Complexity: MR developers must learn alternate data serialization and node-to-node communication strategies, such as protocol buffers. Parallel DBMS are less complex in this respect.
• Architecture fit: MR systems can be used when CAP properties are more important; a Parallel DBMS can be used when ACID properties are more important. (See the commentary later in this paper on ACID versus CAP.)

MR systems and Parallel DBMS—Complementary architectures

Although it might seem that MR systems and Parallel DBMS are very different, it is possible to write almost any parallel-processing task as either a set of database queries or a set of MR jobs. In both cases, one brings data close to the processing engine. So the debate centers on whether a Parallel DBMS or an MR approach is the better fit.
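To make that equivalence concrete, the sketch below expresses the same aggregation (total sales per customer) both as a SQL query, held in a string for reference, and as map and reduce functions run in-process. The table and column names are assumptions for illustration; a real MR job would distribute the map, shuffle and reduce phases across the cluster's nodes.

```python
# Illustrative sketch: the same "total sales per customer" aggregation
# written as a SQL query (for a parallel DBMS) and as map/reduce functions.
# Table and column names are assumptions for the example.
from itertools import groupby
from operator import itemgetter

SQL_EQUIVALENT = """
SELECT customer_id, SUM(amount) AS total
FROM sales
GROUP BY customer_id;
"""

def map_phase(record):
    """Emit (key, value) pairs; in Hadoop this runs on each node's data slice."""
    yield (record["customer_id"], record["amount"])

def reduce_phase(key, values):
    """Combine all values that share a key; one reducer call per key."""
    return (key, sum(values))

def run_job(records):
    # Shuffle/sort step: group the mapped pairs by key, as the framework would.
    mapped = sorted((pair for r in records for pair in map_phase(r)),
                    key=itemgetter(0))
    return [reduce_phase(k, [v for _, v in group])
            for k, group in groupby(mapped, key=itemgetter(0))]

if __name__ == "__main__":
    sales = [{"customer_id": "C001", "amount": 120.0},
             {"customer_id": "C002", "amount": 75.5},
             {"customer_id": "C001", "amount": 30.0}]
    print(run_job(sales))   # [('C001', 150.0), ('C002', 75.5)]
```

The reduce step plays the role of the GROUP BY aggregation, and the shuffle/sort between the two phases is exactly the cross-node data movement noted in the comparison above.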
Parallel DBMS improve processing and input/output speeds by using Central Processing Units (CPUs) and disks in parallel. They apply horizontal partitioning of relational tables, along with the partitioned execution of SQL queries using query optimizers. Most of these architectural differences are the result of the different focuses of the two classes of systems. While Parallel DBMS excel at efficient querying of large data sets, MR systems excel at complex analytics and ETL tasks. The two technologies are complementary, in that neither is good at what the other does well, and many complex analytical problems require the capabilities provided by both systems. This requirement drives the need for interfaces between Parallel DBMS and MR systems that allow each to do what it is good at.

From where we stand, a "hybrid" solution could be your best bet—that is, interfacing an MR framework to a DBMS so that MR handles the complex analytics and data processing typical of an ETL program, while embedded queries are handled by the DBMS. HadoopDB, Hive, Aster, Greenplum, Cloudera and Vertica all have commercially available products or prototypes in this category. Facebook has implemented a large data-warehouse system using MR technology, Hive and HiveQL. HiveQL is a SQL interface that was developed by Facebook and is now available to the open-source community.[2] It enables analysts who are familiar with SQL to access HDFS and Hive data stores; the SQL queries are translated to MR jobs in the background.

[2] Hadoop Summit 2010, hosted by Yahoo!, June 29, 2010. http://developer.yahoo.com/events.hadoopsummit2010

Defining the relevant acronyms
ACID: Atomicity, Consistency, Isolation and Durability
BI: Business Intelligence
BLOB: Binary Large Object, which can hold a variable amount of data
CAP: Consistency, Availability, Partitioning
CouchDB: Cluster Of Unreliable Commodity Hardware database
CPU: Central Processing Unit
DBMS: Database Management System
DHT: Distributed Hash Table
ETL: Extract, Transform, Load
HA: High Availability
HDFS: Hadoop Distributed File System
IMDB: In-Memory Database
I/O: Input/Output
MR: Map-Reduce
NoSQL: databases without SQL interfaces
OD: Object Database
OLTP: Online Transaction Processing
RDBMS: Relational Database Management System (SQL-compliant)
SAN: Storage Area Network
SLA: Service Level Agreement

Figure: The Hadoop Ecosystem, showing ETL tools and BI reporting layered over Pig (data flow), Hive (SQL), Sqoop, MapReduce (job scheduling/execution), HBase (column DB), HDFS (Hadoop Distributed File System), Avro (serialization) and ZooKeeper (coordination). Source: Cloudera, "Apache Hadoop: An Introduction," Todd Lipcon, Gluecon 2010, slide 40. http://www.slideshare.net/cloudera/apachehadoop-an-introduction-todd-lipcomgluecon-2010

MR—A major component of the Hadoop ecosystem

Hadoop, which grew out of the approach popularized by Google's MapReduce and Google File System papers, is now an open-source project under development by Yahoo! and the Apache Software Foundation. Hadoop consists of two major components: 1) a file depository, or filestore, called HDFS, and 2) a distributed processing system (MR). HDFS is fault-tolerant (through replication), reliable, Shared-Nothing, scalable and elastic storage that is rack-aware and built using commodity hardware. MR is a data processing system that allows work to be split across a large number of servers and carried out in parallel. MR allows one to parallelize the relational algebra operations that are also the building blocks of SQL, such as group-by, order (sort), intersect and union, aggregation and joins; each of these operations can be expressed with key-value pairs as input and output between the map and reduce steps. Hadoop Streaming allows communication between the map and reduce steps using the standard input and output techniques familiar to Unix administrators. Some of the other components shown in the figure above are abstractions and higher-level interfaces meant to assist developers with routine tasks such as workflow. Details on the inner workings of MR can be found in Hadoop: The Definitive Guide (O'Reilly, second edition).
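As the sidebar notes, Hadoop Streaming lets any language that can read standard input and write standard output act as the map or reduce step. The sketch below counts requests per URL from tab-separated web-log lines; the field layout and the role-selection flag are assumptions made for illustration rather than part of any particular deployment.

```python
#!/usr/bin/env python
# Minimal Hadoop-Streaming-style mapper and reducer (normally two separate
# scripts). Both read stdin and write tab-separated key/value lines to
# stdout; the log field positions are assumptions for the example.
import sys

def mapper(stdin=sys.stdin, stdout=sys.stdout):
    # Input: raw log lines; emit "url\t1" for each request.
    for line in stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:                 # assumed layout: user, url, ...
            stdout.write(f"{fields[1]}\t1\n")

def reducer(stdin=sys.stdin, stdout=sys.stdout):
    # Hadoop sorts mapper output by key, so equal keys arrive consecutively.
    current_key, count = None, 0
    for line in stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current_key and current_key is not None:
            stdout.write(f"{current_key}\t{count}\n")
            count = 0
        current_key = key
        count += int(value)
    if current_key is not None:
        stdout.write(f"{current_key}\t{count}\n")

if __name__ == "__main__":
    # Select the role with a command-line argument, e.g. "streaming.py map".
    (mapper if sys.argv[-1] == "map" else reducer)()
```

In practice the two roles would be shipped as separate scripts and passed to the streaming job via its -mapper and -reducer options, with the framework handling the sort and shuffle between them.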
A look at a few alternative strategies

Several web companies have opted to use key-value stores as a substitute for a relational database—that is, an associative array where the hash of a key (an array of bytes) is mapped to a value (a BLOB). Below, we provide a bird's-eye view of some of the alternatives available to you today—DHTs, Column Databases and In-Memory Databases (IMDB)—along with considerations to bear in mind when choosing between ACID and CAP solutions.

The DHT approach

More complex installations use a DHT approach, which allows a group of distributed nodes to collectively manage a mapping between keys and values, without any fixed hierarchy and with only minimal management overhead. A DHT allows one to store a key-value pair and then look up the value by its key. Values are not always persisted on disk, although one could certainly base a DHT on top of a persistent hash table, such as Berkeley DB. The main characteristic of a DHT is that storage and lookups are distributed among multiple machines. Unlike existing master/slave database replication architectures, all nodes are peers that can join and leave the network freely. DHTs have been popular for building peer-to-peer applications such as file-sharing services and web caches. Some implementations of the DHT idea include Cassandra, CouchDB and Voldemort. CouchDB, for example, is a schema-free (no tables), document-oriented database used for applications such as forums, bug tracking, wikis and tag clouds. A CouchDB document is an object that consists of named fields; field values are lists that may contain strings, numbers and dates, and a Couch database is a flat collection of these documents. The NoSQL movement has popularized some of these DHT-based databases.

DHT solutions require clients to perform complex programming in order to handle dirty reads and to rebalance the cluster and its hashing. Typically, DHT implementations do not have compression, which affects I/O. Memory-cached key-value stores are also being used by several web applications to improve performance by removing load from the databases. Note that this technology must be deployed carefully because it lacks ACID support.

There are both advantages and disadvantages to using a DHT approach.

DHT advantages:
• Decentralization: There is no central coordinator or master node, which provides independence and failure transparency. The design distributes and minimizes shared components and eliminates the single point of failure.
• Scalability: Hash tables remain efficient with a large number of nodes and provide fault tolerance even as nodes drop and join.
• Simplicity: This is a key-value store without any SQL access and with no schemas to worry about. Hash tables have a low memory footprint.
• High availability for writes: Hash tables provide the ability to write to many nodes at once. Depending on the number of replicas maintained (which is configurable), one should always be able to write somewhere, thereby avoiding write failures.
• Suitability for query applications: Hash tables are especially efficient for query applications where there is a need to return sorted data quickly using MR.
• Performance: Performance is high when using hash tables and key-value pairs.

DHT disadvantages:
• Not ACID-compliant: Since ACID compliance is a must for relational databases, the use of a DHT is limited.
• Lack of the complex data structures required for joins (obtaining data from multiple hashes) and analysis: This makes a DHT solution unsuitable for data warehouses that must support queries that aggregate and retrieve small result sets from large datasets.
• Lack of queryable schemas: This limits the usefulness of a DHT.
• Complexity for clients: A DHT solution requires clients to perform complex programming, and because the technology lacks ACID support, it must be deployed carefully.
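To illustrate the core DHT idea (keys distributed across peer nodes by hashing, so any participant can locate a key's owner without a central coordinator), here is a minimal consistent-hashing sketch. Real systems such as Cassandra or Voldemort add replication, membership protocols and failure handling; the node names, ring scheme and hash function below are assumptions for illustration.

```python
# Minimal consistent-hashing sketch of a DHT: keys and nodes are hashed onto
# the same ring, and each key is owned by the first node at or after its hash.
# Node names, ring positions and the hash function are illustrative assumptions.
import bisect
import hashlib

def ring_position(name: str) -> int:
    return int(hashlib.sha1(name.encode("utf-8")).hexdigest(), 16)

class TinyDHT:
    def __init__(self, nodes):
        # Sorted ring of (position, node) pairs; each node has a local store.
        self.ring = sorted((ring_position(n), n) for n in nodes)
        self.stores = {n: {} for n in nodes}

    def _owner(self, key: str) -> str:
        pos = ring_position(key)
        idx = bisect.bisect_left(self.ring, (pos, "")) % len(self.ring)
        return self.ring[idx][1]

    def put(self, key, value):
        self.stores[self._owner(key)][key] = value

    def get(self, key):
        return self.stores[self._owner(key)].get(key)

if __name__ == "__main__":
    dht = TinyDHT(["node-a", "node-b", "node-c"])
    dht.put("user:1001", {"name": "Ada"})
    print(dht._owner("user:1001"), dht.get("user:1001"))
```

Because only the keys adjacent to a node on the ring move when that node joins or leaves, peers can come and go without a master, which is the decentralization and scalability advantage listed above.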
ACID or CAP?

When architecting applications for large-scale data, it is necessary to consider whether ACID properties—the atomicity, consistency, isolation and durability of data and transactions—are more important for your business than CAP characteristics—the consistency, availability and partitioning of data. Relational database vendors focus on implementing ACID properties, which can be harder to achieve in a distributed environment where there is a possibility of node failure; in some respects, this may limit the scalability of ACID-compliant databases for updates and queries. Alternatively, there is a class of applications that can live with CAP characteristics alone, allowing ACID properties to be sacrificed. Since these systems do not use relational database technologies, but instead use technologies such as DHTs and HDFS, they are very scalable and continue to function even during node or network failures.

The question becomes whether you will need to manage complex transactions, or whether a simple model with no transaction support will be sufficient to meet your needs. It is also important to determine whether it is possible to deploy a data-partitioning strategy, such as is used in HDFS and DHTs, to minimize network traffic and make the nodes self-sufficient.

Column databases

For many years, Sybase® IQ was the only commercially available product in the column-oriented DBMS class. Recently, however, significant academic research has rejuvenated this technology. Database architects have been debating whether column databases or row databases offer better performance for large-scale data analysis, and vendors such as Oracle® (with its Exadata product) and Greenplum are striving to achieve convergence between column-store and row-store technologies.

Column databases[3] serialize data by column instead of by complete row, as row-oriented databases do. In other words, attribute values belonging to the same column are stored contiguously, compressed and densely packed, as opposed to traditional database systems that store entire records (rows). Another approach widely used to boost query performance is a vertical partitioning scheme in which all columns of a row store are indexed, allowing the columns to be accessed independently.

[3] C-Store: A Column-oriented DBMS, Stonebraker et al., Proceedings of the 31st VLDB Conference, Trondheim, Norway, 2005.

Column stores claim to be better suited for BI applications and data warehouses, which typically involve a smaller number of highly complex queries requiring full fact-table scans, since they read only the columns required by the query. Column-database storage is read-optimized for analytical applications and uses compression techniques.
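The sketch below illustrates why the column layout favors these scan-heavy queries: storing each attribute contiguously lets an aggregate read a single column, and repetitive values compress well. The table, column names and run-length encoding are assumptions chosen purely for illustration.

```python
# Illustrative sketch of row-oriented versus column-oriented storage.
# The table, column names and run-length encoding are assumptions made
# purely to show why analytic scans touch less data in a column store.
from itertools import groupby

rows = [  # row store: each record kept together
    {"region": "EMEA", "product": "A", "amount": 120.0},
    {"region": "EMEA", "product": "B", "amount": 75.5},
    {"region": "APAC", "product": "A", "amount": 210.0},
]

# Column store: each attribute stored contiguously.
columns = {name: [r[name] for r in rows]
           for name in ("region", "product", "amount")}

def run_length_encode(values):
    """Contiguous, repetitive column values compress well."""
    return [(v, len(list(g))) for v, g in groupby(values)]

# An aggregate such as SUM(amount) reads one column, not every row.
print("SUM(amount) =", sum(columns["amount"]))          # 405.5
print("region column RLE:", run_length_encode(columns["region"]))
# [('EMEA', 2), ('APAC', 1)]
```

A row store would have to read every complete record to answer the same aggregate; the price of the columnar layout shows up on the write path, which is where the issues discussed next arise.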
However, we have seen some issues with column stores. To cite just two examples, there can be problems with:
• Database insert/update operations, which involve separating the individual columns of a row and writing them to different locations on disk
• Row construction, e.g., combining data from multiple columns back into rows during query execution

In-memory databases (IMDB)

In this class of database technology, the entire database is stored in memory. IMDB engines typically have a small memory and CPU footprint, and the mainstream adoption of 64-bit architectures, coupled with falling memory costs, has enabled larger databases to be created in memory. Since disk I/O is eliminated, IMDB repositories provide superior query performance; most traditional databases use disk caching to reduce this I/O, but they end up using more memory and CPU. For this reason, it is important that architects responsible for BI applications do not overlook this emerging technology—which is rapidly gaining momentum for enterprise analytical applications—as a viable solution.

As with other database technologies, IMDB repositories are proprietary and have practical limitations on database size. Given database volatility, IMDB repositories have limited ACID properties, since they do not use transaction and redo logs; however, one can alleviate this durability issue by using replication and periodic writes to disk with save-point techniques. Further, these databases do not require the design of OLAP models or the constant building of indexes, and ETL jobs for loading them are generally faster.

Traditionally, in-memory databases were popular in embedded systems such as set-top boxes and routers using real-time operating systems. Now, however, financial markets, e-commerce and social networking sites are using IMDB repositories for specialized database applications and in-memory analysis that require speed. In-memory technology is increasingly being applied to BI, data warehouses and analytics. QlikView, from BI vendor QlikTech, uses in-memory database repositories. MicroStrategy 9 includes standard in-memory ROLAP, as does the Spotfire product from TIBCO. All of the mega vendors—Oracle® (TimesTen), IBM® (TM1), SAP® (NetWeaver Business Warehouse Accelerator) and Microsoft® (Project Gemini)—have in-memory database and BI offerings.
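As a minimal illustration of the idea, the sketch below uses Python's built-in sqlite3 module with a ":memory:" connection as a stand-in for a dedicated IMDB product; the table and data are invented for the example. The whole database lives in RAM, reads never touch disk, and, echoing the durability caveat above, the data disappears when the process ends unless it is replicated or periodically written to disk.

```python
# Minimal illustration of an in-memory database, using Python's built-in
# sqlite3 with a ":memory:" connection as a stand-in for a dedicated IMDB
# product; the table and data are assumptions for the example.
import sqlite3

conn = sqlite3.connect(":memory:")       # the entire database lives in RAM
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("EMEA", 120.0), ("EMEA", 75.5), ("APAC", 210.0)])

# Queries never touch disk, so there is no disk I/O on the read path.
for region, total in conn.execute(
        "SELECT region, SUM(amount) FROM sales "
        "GROUP BY region ORDER BY region"):
    print(region, total)                 # APAC 210.0, then EMEA 195.5

conn.close()                             # without replication or save-point
                                         # writes to disk, the data is gone here
```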
Long story short…

There are no shortcuts. To design the best data-management strategy, leverage the sophisticated emerging technologies and implement the right analytic solutions for your unique company, you have to do your homework. Those companies that put in significant time and effort up front will reap the results of that investment—the ability to efficiently slice, dice and analyze the flood of data that passes through the organization daily. The ability to quickly access valid information increases corporate agility, resulting in a competitive advantage in today's dynamic global marketplace.

What's new? Niche products for the BI market

Traditional approaches to data analysis have relied on the relational and parallel database technologies offered by major database vendors such as Teradata, Oracle and IBM (DB2). In recent years, however, companies such as Netezza, Greenplum, Vertica and Aster Data have come up with niche products for the BI market. Similarly, the Microsoft Research project Dryad is investigating programming models for writing parallel and distributed programs that scale from a small cluster to a large data center.

The growth in data volumes and increasingly demanding SLAs around ETL have made alternative approaches popular, especially among emerging web companies. There has been a lot of buzz in the industry around the use of distributed architectures and cluster computing to increase system throughput and improve scalability. MR is one popular approach, and Hadoop is its best-known open-source implementation; as mentioned earlier, it is now under development by Yahoo! and the Apache Software Foundation. What's more, social media companies such as Facebook have implemented a data warehouse using MR instead of a traditional RDBMS.

What this means for your business
Leveraging valid, timely data—Seizing a competitive advantage

Traditionally, organizations have been more comfortable with proven, well-supported enterprise-class software, and until recently most have been unwilling to work with the more disruptive open-source technologies. Now, however—given the deluge of data bombarding companies today—data architects are recognizing that traditional analytical solutions no longer work. In response to the challenge of analyzing very large volumes of data in a timely and meaningful manner, several cutting-edge companies have become pioneers in this space, and they have made significant headway using the new open-source technology. In our view, it is a mistake to ignore the emerging solutions that are now dotting the landscape. Disruptive though they may be, the new open-source technologies provide a very scalable approach to large-scale data analysis. It's time to explore the potential of these innovations to identify the one that would be the best fit with your organization's unique needs and circumstances. Your firm can benefit in important ways.

Reaping the benefits of a robust open-source solution

It pays to take a distributed-architecture approach to processing Big Data. Let's look at just some of the key benefits that we see accruing to early adopters of these emerging solutions.

First, distributed architecture is a very scalable approach. Unlike traditional analytic technologies that require very expensive high-end servers, some of the open-source technologies discussed in this paper are built to operate on less expensive commodity hardware. What's more, while most companies using a traditional solution are accustomed to anticipating and building their data processing capacity in advance—a practice that can be costly in terms of overestimating the need and tying up capital sooner than necessary—a distributed-architecture approach allows you to add nodes to an existing rack as your processing needs grow. The new addition simply gets folded into the cluster and begins processing.

Using a distributed-architecture approach is also less risky. Open-source technologies have become very robust and can easily handle failovers. Consider the risk involved in the traditional approach, where you might have only two high-end servers processing huge amounts of data between them: where would you be if one of those were to crash? In contrast, having numerous less-expensive servers sharing the load is much safer; were one or more of them to crash, the others could easily pick up the slack. Further, with a distributed-architecture approach, you don't have servers sitting idle, waiting for disaster events to occur.
The time to act is now

We believe that it is a mistake to ignore the new solutions specifically designed to work in today's data-intensive, high-volume, read-only environment. It's time to explore the possibilities. As you set out to design and implement the data-management solution that will most effectively strengthen your organization's analytical capabilities, take the time up front to ask yourselves the right questions. Only then can you be sure that the IT strategy you design will encompass all of your growing BI needs, and that it will be aligned with the organization's overall business goals. The bottom line: those organizations that carefully plan their IT strategy with their business goals in mind are the ones that will find themselves positioned to seize the competitive advantage in today's dynamic global marketplace.

Some non-technical thoughts on open-source Web data processing systems
Alan Morrison, PwC Center for Technology and Innovation, San Jose, CA

While it's certainly essential to take a long, hard look at the pros and cons of each technology, it's also important, during any technology due diligence effort, to weigh other factors that may not be sufficiently illuminated by a comparison of technical pros and cons. In Making Sense of Big Data, Technology Forecast 2010,[4] PwC evaluates the benefits and challenges of open-source Big Data—with a focus on Hadoop and complementary non-relational methods. The Forecast considers several case studies, as well as the changing data analysis environment, development-method strengths and weaknesses, and other factors. Following are some of the conclusions that we drew during the research effort for that publication.

Hadoop, MapReduce and related NoSQL methods have significant momentum. This momentum is due to a successful open-source effort that has emerged to address the significant growth of Web data. Many of the largest Web companies in the world—among them Yahoo, Twitter, Facebook, LinkedIn and Amazon, not to mention vendors and consultancies such as Cloudera and Concurrent—have collaborated for years inside the Apache community on developing and refining Hadoop, MapReduce, Hive, Pig, Lucene, Cascading and numerous other NoSQL tools that are part of the ecosystem. The most popular tools benefit from the efforts of some talented developers at these companies. The size and quality of this developer community matters a great deal.

A heterogeneous data environment often demands different tools for different purposes. In general, Hadoop, MapReduce and NoSQL tools were conceived primarily for Web environments. That's not to say that they can't be used elsewhere, but their heritage is rooted in the Google® cluster architecture and the tools Google developed internally to analyze the petabytes of data it collects. Much can be learned from Google and the examples of similar Web companies, but their approaches to large-scale data analysis should be considered a complement to, rather than a replacement for, relational data techniques.

Successful use of open source depends on a sound understanding of how the community operates. In any open-source community, there are deep, strong currents and shallow, weak currents. Those who succeed take the time to study those currents. They pick the tools that benefit from the highest level of participation, and they provide feedback to the community on how the software can be improved. That plays a large role in reaping the benefits.
Remember that most open-source efforts are essentially first efforts that haven't yet gained popularity or refinement. Be wary of methods that may never be widely used. Many methods, including some launched by developers with very good ideas and intentions, may never gain traction. User communities have biases that can keep well-designed software from being widely used regardless of how technically viable it is. Meanwhile, some less-worthy software can gain followers for reasons that don't have anything to do with pure technical viability.

Open source implies a different culture of development, and its users tend to be less passive, with more internal development capability than those who just subscribe to services from large commercial Software-as-a-Service (SaaS) providers or buy licenses from packaged software vendors. In the area of data analysis, open-source users will be more likely to be the ones who experiment and explore. Those who aren't accustomed to these methods should remember that it's important to encourage this exploratory mentality regardless of the source of the software. It's an essential component of any successful Big Data strategy.

[4] Making Sense of Big Data, Technology Forecast 2010, Issue 3. PricewaterhouseCoopers. http://www.pwc.com/us/en/technology-forecast/2010/issue3/index.jhtml

www.pwc.com/us/advisory

For a deeper discussion around identifying the right data-management solutions for your organization, please contact:
David Patton, 267 330 2635, [email protected]
Sanjeev Taran, 408 817 1285, [email protected]

© 2011 PwC. All rights reserved. "PwC" and "PwC US" refer to PricewaterhouseCoopers LLP, a Delaware limited liability partnership, which is a member firm of PricewaterhouseCoopers International Limited, each member firm of which is a separate legal entity. This document is for general information purposes only, and should not be used as a substitute for consultation with professional advisors. NY-11-0809