Database – Business Intelligence

BUILDING A TERABYTE DATA WAREHOUSE, USING LINUX AND RAC
George Lumpkin, Oracle Corporation

EXECUTIVE SUMMARY

Data warehouses are not only among the most strategically important IT systems, they are also among the largest and most costly. Given the costs associated with a multi-terabyte data warehouse, many customers are examining how they can reduce the costs of their current environments. Other customers, who may not previously have been able to afford a multi-terabyte warehouse system, are at the same time examining how low-cost solutions may let them have their cake and eat it too, by scaling their data warehouse to multiple terabytes while still staying within their IT budgets.

Oracle with Real Application Clusters running on clusters of Linux-based commodity servers is the solution for scalable, cost-effective data warehousing. Oracle is the leading data warehouse RDBMS on open systems today, providing industry-leading performance, manageability, security, and reliability. By running Oracle on Linux, customers can meet all of their data warehouse requirements cost-effectively.

This paper describes why Oracle on Linux clusters is well suited to data warehousing workloads, and then provides guidelines for configuring Linux-RAC data warehouses.

KEY DATABASE TECHNOLOGIES FOR SCALABLE LINUX DATA WAREHOUSES

ORACLE REAL APPLICATION CLUSTERS FOR DATA WAREHOUSING

Real Application Clusters provides three key advantages for data warehousing.

The first advantage of Real Application Clusters is, of course, scalability. Real Application Clusters provides linear scalability and availability for high-volume, mission-critical applications, giving your systems the capacity to grow as demand grows.
The cost benefits of using Linux for data warehousing accrue from using multiple less-expensive servers rather than a single large server. Real Application Clusters is the component that allows Oracle to manage a single database across a cluster, fully exploiting all processor, memory, and disk resources of the entire cluster for the data warehouse workload.

A second advantage of Real Application Clusters is availability. A cluster provides redundancy, so that if any node goes down, the other nodes continue to execute. The database remains available, and users who were connected to the failed node can be transparently migrated to a surviving node. More significantly for data warehousing, the performance impact of a failure is directly proportional to the processing power lost in the failure. That is, in an 8-node cluster, if one node fails, Oracle will continue to execute at approximately 7/8 of the performance of the full cluster. While this seems like an intuitive statement, other database systems, based on shared-nothing architectures, suffer more severe performance degradation when a node fails.

A third advantage of Real Application Clusters is flexibility. An Oracle data warehouse can leverage a cluster in two different ways. In the first, every node of the cluster is symmetrical, and all nodes share equally in all aspects of the data warehouse workload. In the second, nodes of the cluster are assigned to different tasks. For example, some nodes can be assigned to ETL processing while other nodes are dedicated to query processing. This is a mechanism for implementing workload partitioning, guaranteeing that ETL processing does not impact query processing or vice versa. Oracle has production data warehouse customers running clusters using both models.
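The proportional-degradation claim in the availability discussion above can be written out as a short calculation. The following is a minimal sketch, not a model of Oracle's behavior: it assumes identical nodes and that the surviving nodes pick up the workload evenly, and the function name `surviving_capacity` is hypothetical.

```python
from fractions import Fraction


def surviving_capacity(total_nodes: int, failed_nodes: int) -> Fraction:
    """Fraction of full-cluster processing power left after node failures.

    Assumption (not from the paper): all nodes are identical, so capacity
    is simply proportional to the number of surviving nodes.
    """
    if total_nodes <= 0:
        raise ValueError("total_nodes must be positive")
    if not 0 <= failed_nodes <= total_nodes:
        raise ValueError("failed_nodes must be between 0 and total_nodes")
    return Fraction(total_nodes - failed_nodes, total_nodes)


if __name__ == "__main__":
    # The paper's example: one node lost out of eight.
    print(surviving_capacity(8, 1))  # prints 7/8
```

Under these assumptions, losing two of eight nodes would leave 3/4 of the original capacity, and so on.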
Paper #40177

PARALLEL EXECUTION IN A REAL APPLICATION CLUSTERS ENVIRONMENT

Parallel execution is the ability of Oracle to execute a single SQL operation across multiple processors and multiple nodes, thus delivering vastly improved performance. Parallel execution is a cornerstone feature for data warehousing, since parallelism is the only practical way to manipulate the hundreds of gigabytes or terabytes of data common in data warehouse environments.

ORACLE'S PARALLEL EXECUTION ARCHITECTURE PRESERVES INTERCONNECT BANDWIDTH

When a large operation is submitted to the Oracle database, Oracle considers whether or not to execute that operation in parallel and chooses an appropriate degree of parallelism for it. Oracle then uses multiple processes (called 'parallel query slaves') to execute the database operation. Each parallel query slave executes a portion of the operation, working in conjunction with all of the other parallel query slaves.

Inter-process communication between query slaves within an SMP (implemented within Oracle via shared memory) is fast, scalable, and efficient. However, on a cluster, if two processes reside on separate nodes, then inter-process communication goes over the interconnect and is slower. If the Oracle database were to execute parallel operations on a cluster in exactly the same way that it executes parallel operations on an SMP, it might face performance challenges.

However, Oracle's parallel query architecture minimizes interconnect traffic in two ways. First, Oracle leverages data partitioning (especially hash partitioning) using well-known techniques such as collocated joins. These techniques allow multiple processes on separate nodes to execute large queries while minimizing (although not eliminating) the amount of interconnect traffic between the processes. More importantly, Oracle provides a second, more effective mechanism for minimizing interconnect traffic.
Using Real Application Clusters, Oracle determines at runtime whether it will run parallel execution server processes on only one instance or on multiple instances. In general, Oracle tries to use only one instance when sufficient resources are available. This reduces cross-instance message traffic and synchronization.

This concept is best illustrated with a simple example, using a 4-node cluster with 4 processors per node. If there were 20 concurrent queries executing on this cluster, each with 4 parallel query slaves, then one way to execute this workload would be to have one query slave on each node participate in each query. With that approach, however, every query slave would need to communicate with 3 other query slaves on 3 other nodes. A better approach to query parallelism is to place 5 queries on each node. Each query still uses 4 parallel query slaves, but now all of the query slaves for a given query reside on the same node, drastically reducing inter-node communication.

Oracle automatically balances the workload across all nodes of the cluster by allocating servers to the nodes that are running the fewest processes. When a single node does not have enough available resources to execute a database operation, Oracle uses multiple nodes.

ORACLE'S PARALLEL EXECUTION ARCHITECTURE ENABLES EASY GROWTH

A key requirement for data warehouses is growth. Over time, data warehouses add more data, more applications, and more users, and this growing demand invariably necessitates adding new hardware. In a cluster, hardware resources are added by adding one or more new nodes to the cluster. Oracle's parallel execution architecture can immediately leverage these additional nodes to improve scalability and performance.
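The 4-node, 20-query example and the fewest-processes balancing rule described above can be sketched as a toy simulation. This is an illustration only, not Oracle's actual allocation algorithm: the function names (`place_spread`, `place_least_loaded`, `cross_node_pairs`, `slaves_per_node`) are hypothetical, nodes are assumed identical, and every query's slaves are assumed to fit on a single node.

```python
from itertools import combinations


def place_spread(num_queries, slaves_per_query, num_nodes):
    """Spread each query's slaves round-robin over the nodes."""
    return [[s % num_nodes for s in range(slaves_per_query)]
            for _ in range(num_queries)]


def place_least_loaded(num_queries, slaves_per_query, num_nodes):
    """Put all slaves of a query on the node currently running the fewest."""
    load = [0] * num_nodes
    placement = []
    for _ in range(num_queries):
        node = min(range(num_nodes), key=lambda n: load[n])
        load[node] += slaves_per_query
        placement.append([node] * slaves_per_query)
    return placement


def cross_node_pairs(placement):
    """Count slave pairs of the same query that sit on different nodes."""
    return sum(1 for nodes in placement
               for a, b in combinations(nodes, 2) if a != b)


def slaves_per_node(placement, num_nodes):
    """Number of slaves running on each node."""
    counts = [0] * num_nodes
    for nodes in placement:
        for n in nodes:
            counts[n] += 1
    return counts


if __name__ == "__main__":
    spread = place_spread(20, 4, 4)
    local = place_least_loaded(20, 4, 4)
    print(cross_node_pairs(spread))   # 120 pairs talk over the interconnect
    print(cross_node_pairs(local))    # 0 pairs cross nodes
    print(slaves_per_node(local, 4))  # [20, 20, 20, 20], i.e. 5 queries per node
```

Both placements load the nodes equally; they differ only in how many slave pairs must communicate across the interconnect. Rerunning `place_least_loaded(20, 4, 5)` in this toy model gives 16 slaves on each of five nodes, which mirrors the point that an added node can take on work without any data being moved.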
Immediate use of new nodes may seem intuitive, but not every database operates this way; with most other data warehouse databases, data must be repartitioned in order to leverage new nodes. Just as Oracle's parallel query architecture is able to adapt to the unexpected loss of a node in an outage, so it is also able to rapidly absorb additional nodes.

BASIC PRINCIPLES FOR CONFIGURING A RAC LINUX CLUSTER

There are essentially three ingredients in the recipe for a high-performance data warehouse cluster: processing power, I/O bandwidth, and interconnect bandwidth.

PROCESSORS

The logical place to start configuring any data warehouse system is to ask how much horsepower will be needed to satisfy the query and load requirements of the system. In an ideal world, there would be a simple formula (N processors for every terabyte of data); unfortunately, there are no hard and fast rules for determining system size, because of the considerable variation in the types of workload supported by data warehouses and in their performance requirements. For example, very large Oracle data warehouse customers using Oracle8i or Oracle9i with between 5 and 10 TB of raw data have systems ranging from 12 processors to over 200 processors. However, most sites with previous data warehouse experience have good guidelines for determining their processor requirements; in the absence of such guidelines, the experiences of other customers, consultants, and vendors can help to determine the appropriate number of processors for a proposed workload.

Once a rough estimate of the number of processors is determined, the next configuration issue is to determine the number of nodes. Currently, most Linux servers range from 1 to 8 processors.
For an Oracle RAC DW environment, servers with 4 to 8 processors are able to handle reasonably complex operations within a single node, and thus these servers usually provide the best combination of performance, manageability, and cost savings.

I/O

I/O performance should always be a key consideration for data warehouse designers and administrators. The typical workload in a data warehouse is especially I/O intensive, with operations such as large data loads and index builds, creation of materialized views, and queries over large volumes of data. The underlying I/O system for a data warehouse should be designed to meet these heavy requirements. In fact, one of the leading causes of performance issues in a data warehouse is poor I/O configuration. Database administrators who have previously managed other systems will likely need to pay more careful attention to the I/O configuration for a data warehouse than they did in those other environments.

For any data warehouse environment, the storage configuration should be chosen based on the I/O bandwidth that it can deliver, rather than the amount of disk capacity it can provide. That is, the goal of the storage subsystem is to be able to 'feed' the processors on each server in order to keep those processors fully utilized. One basic rule of thumb is that each GHz of processor power can drive at least 100 MB/sec of I/O throughput. So, the I/O subsystem for a cluster in which each node contains four 2-GHz processors should be configured for 800 MB/sec per node (4 processors x 2 GHz x 100 MB/sec). Note that these systems may be able to achieve even higher throughput; these estimates represent a good baseline target for I/O configurations.

The I/O throughput that a system can deliver depends on much more than simply the number of disks.
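As a rough illustration of the I/O sizing arithmetic in this section, the sketch below computes the per-node throughput target from the 100 MB/sec-per-GHz rule of thumb, and then finds the component that bounds what a node's I/O path can actually deliver. The function names and the dictionary of component limits are hypothetical, and 1 GB/sec is taken as 1000 MB/sec for simplicity.

```python
def target_io_mb_per_sec(processors_per_node, ghz_per_processor,
                         mb_per_sec_per_ghz=100.0):
    """Baseline I/O target for one node, using the paper's rule of thumb
    of at least 100 MB/sec of throughput per GHz of processor power."""
    return processors_per_node * ghz_per_processor * mb_per_sec_per_ghz


def bottleneck(component_limits):
    """Return (name, MB/sec) of the slowest component in a node's I/O path.

    Assumption: the components sit in series, so the path can deliver no
    more than its slowest member.
    """
    name = min(component_limits, key=component_limits.get)
    return name, component_limits[name]


if __name__ == "__main__":
    # Four 2-GHz processors per node, as in the text.
    target = target_io_mb_per_sec(4, 2.0)
    print(target)  # 800.0

    # Illustrative path: arrays could deliver 1 GB/sec to the node,
    # but the node's adapters handle only 500 MB/sec in total.
    path = {"storage arrays": 1000.0, "host-bus adapters": 500.0}
    name, limit = bottleneck(path)
    print(name, limit)     # host-bus adapters 500.0
    print(limit >= target)  # False: this path cannot feed the processors
```

Comparing the bottleneck against the target is the useful check: a configuration meets the baseline only when every component in the path, not just the storage arrays, can sustain the per-node target.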
When choosing an I/O system, the entire configuration must be considered, from the host-bus adapters on the servers, to the Fibre Channel links and switches that connect the servers to the storage arrays, to the storage arrays themselves. The overall I/O subsystem can perform only as well as its weakest part. The fact that the storage arrays may be able to deliver the equivalent of 1 GB/sec to a given node is irrelevant if that node is configured with adapters that can handle a total of only 500 MB/sec. Therefore, not only should the storage configuration include the appropriate number of arrays, but the servers should also be selected based on both the number of processors and the number of I/O channels that they can support.

INTERCONNECT

The final consideration is the interconnect. A scalable interconnect is a necessity for data warehousing; data warehouse operations often involve large amounts of data, and although Oracle has many optimizations to minimize interconnect traffic, there are occasions when a high-bandwidth interconnect is necessary. For today's environments, most clusters (particularly clusters with 4 or more nodes) should be configured with a high-bandwidth interconnect such as 1 Gigabit Ethernet. The emergence of faster interconnect technologies, such as 10 Gigabit Ethernet and InfiniBand, will prove useful for larger clusters.

By following these simple guidelines, a data warehouse administrator can ensure that their Linux-RAC cluster will be well suited for handling scalable data warehouse workloads.

CONCLUSION

Oracle has delivered the most powerful and flexible database technology for data warehousing for many years, and thus it is no surprise that Oracle customers are increasingly evaluating Oracle for their data warehouses as they examine the benefits of Linux.
This paper describes the basic technology benefits of Oracle for Linux-based data warehouses, and furthermore provides recommendations and guidelines for customers who are implementing such systems.