Database – Business Intelligence
BUILDING A TERABYTE DATA WAREHOUSE, USING LINUX AND RAC
George Lumpkin, Oracle Corporation
EXECUTIVE SUMMARY
Data warehouses are not only among the most strategically important IT systems, they are also among the largest and
most costly. Given the costs associated with a multi-terabyte data warehouse, many customers are examining how
they may be able to reduce the costs of their current environments. Other customers, who may not have previously
been able to afford a multi-terabyte warehouse system, are at the same time examining how low-cost solutions may
enable them to have their cake and eat it too, by scaling their data warehouse to multiple terabytes while still staying
within their IT budgets.
Oracle with Real Application Clusters running on clusters of Linux-based commodity servers is the solution for
scalable, cost-effective data warehousing. Oracle is the leading data warehouse RDBMS on open systems today,
providing industry leading performance, manageability, security, and reliability. By running Oracle on Linux,
customers can meet all of their data warehouse requirements cost-effectively.
This paper describes how Oracle is well suited for Linux clusters for data-warehousing workloads, and then provides
guidelines for configuring Linux-RAC data warehouses.
KEY DATABASE TECHNOLOGIES FOR SCALABLE LINUX DATA WAREHOUSES
ORACLE REAL APPLICATION CLUSTER FOR DATA WAREHOUSING
Real Application Clusters provides three key advantages for data warehousing:

The first advantage of Real Application Clusters is of course scalability. Real Application Clusters provides linear
scalability and availability for high volume, mission-critical applications, thus providing unlimited power and
growth for your systems. The cost benefits of using Linux for data warehousing are accrued by using multiple
less-expensive hardware servers rather than a single large server, and Real Application Clusters is the component
that allows Oracle to manage a single database across a cluster, fully exploiting all processor, memory and disk
resources of the entire cluster for the data warehouse workload.

A second advantage of Real Application Clusters is availability. A cluster provides redundancy so that if any node
goes down, the other nodes will continue to execute. The database is not affected, and users who were
connected to the failed node can be transparently migrated to a surviving node. More significantly for data
warehousing, the performance impact of a failure is proportional to the processing power lost in the
failure. That is, in an 8-node cluster, if one node fails, then Oracle will continue to execute at approximately 7/8
of the performance of the full cluster. While this seems like an intuitive statement, other database systems, based on
shared-nothing architectures, suffer more severe performance degradation in the case of a node failure.

A third advantage of Real Application Clusters is flexibility. An Oracle data warehouse can leverage a cluster in
two different ways:

All nodes of the cluster are symmetrical: they share all aspects of the data warehouse workload equally.

Nodes of the cluster are assigned to different tasks. For example, some nodes can be assigned for ETL
processing while other nodes are dedicated to query processing. This is a mechanism for implementing
workload partitioning, and guaranteeing that ETL processing does not impact query processing or vice versa.
Oracle has production data warehouse customers running clusters using both models.
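The availability arithmetic above can be checked with a short sketch. It assumes identical nodes and degradation strictly proportional to the number of failed nodes, which is the idealized behavior the text describes; the function name `surviving_capacity` is hypothetical and is not part of any Oracle interface.

```python
from fractions import Fraction


def surviving_capacity(total_nodes: int, failed_nodes: int) -> Fraction:
    """Fraction of full-cluster processing power left after node failures.

    Assumes every node contributes equally and that throughput degrades
    in proportion to the nodes lost (an idealized model).
    """
    if total_nodes <= 0:
        raise ValueError("total_nodes must be positive")
    if not 0 <= failed_nodes <= total_nodes:
        raise ValueError("failed_nodes must be between 0 and total_nodes")
    return Fraction(total_nodes - failed_nodes, total_nodes)


# The 8-node example from the text: one failed node leaves 7/8 of capacity.
remaining = surviving_capacity(8, 1)
print(remaining, float(remaining))
```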
Paper #40177
PARALLEL EXECUTION IN A REAL APPLICATION CLUSTERS ENVIRONMENT
Parallel execution is the ability for Oracle to execute a single SQL operation across multiple processors and multiple
nodes, thus delivering vastly improved performance. Parallel execution is a cornerstone feature for data warehousing,
since parallelism is the only way to manipulate the hundreds of gigabytes or terabytes of data common in data
warehouse environments.
ORACLE'S PARALLEL EXECUTION ARCHITECTURE PRESERVES INTERCONNECT BANDWIDTH
When a large operation is submitted to the Oracle database, Oracle considers whether or not to execute that operation
in parallel and chooses an appropriate degree of parallelism for it. Oracle then utilizes multiple
processes (called 'parallel query slaves') to execute the database operation. Each parallel query slave executes a portion of
the database operation, working in conjunction with all of the other parallel query slaves. Inter-process
communication between the query slaves within an SMP (implemented within Oracle via shared memory) is fast,
scalable, and efficient. However, on a cluster, if two processes reside on separate nodes, then inter-process
communication travels over the interconnect and is slower.
If the Oracle database were to execute parallel operations on a cluster in exactly the same way that it executed parallel
operations on an SMP, the Oracle database might face performance challenges. However, Oracle's parallel query
architecture minimizes interconnect traffic in two ways. First, Oracle leverages data partitioning (especially hash
partitioning) using well-known techniques such as collocated joins. These techniques allow multiple processes on
separate nodes to execute large queries while minimizing (although not eliminating) the amount of interconnect traffic
between the processes.
But more importantly, Oracle provides a second more effective mechanism for minimizing interconnect traffic. Using
Real Application Clusters, Oracle determines at runtime whether it will run parallel execution server processes on only
one instance, or whether it will run processes on multiple instances. In general, Oracle tries to use only one instance
when sufficient resources are available. This reduces the cross-instance message traffic and synchronization.
This concept is best illustrated with a simple example, using a 4-node cluster with 4 processors per node. If there were 20
concurrent queries executing on this cluster, each with 4 parallel query slaves, then one way to execute this workload
is by having one query slave from each node participate in each query. However, using that approach, every query
slave will need to be communicating with 3 other query slaves on 3 other nodes.
A better approach to query parallelism is to place 5 queries on each node. Each query is still utilizing 4 parallel query
slaves, but now all of the query slaves for a given query reside on the same node, drastically reducing inter-node
communication.
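The difference between the two placements can be made concrete by counting, for each query, how many pairs of its slaves sit on different nodes. This is only an illustrative model of the example above: it treats every pair of slaves in a query as a potential communication path, and the names `cross_node_pairs`, `spread`, and `local` are hypothetical.

```python
from collections import Counter
from itertools import combinations

NODES = 4
QUERIES = 20
SLAVES_PER_QUERY = 4


def cross_node_pairs(placement):
    """Count pairs of slaves of one query that run on different nodes."""
    return sum(1 for a, b in combinations(placement, 2) if a != b)


# Placement 1: every query runs one slave on each of the 4 nodes.
spread = [list(range(NODES)) for _ in range(QUERIES)]

# Placement 2: all 4 slaves of a query run on one node, 5 queries per node.
local = [[q % NODES] * SLAVES_PER_QUERY for q in range(QUERIES)]

spread_pairs = sum(cross_node_pairs(p) for p in spread)
local_pairs = sum(cross_node_pairs(p) for p in local)

# Both placements put the same number of slaves on every node.
spread_load = Counter(node for p in spread for node in p)
local_load = Counter(node for p in local for node in p)

print("cross-node slave pairs, spread placement:", spread_pairs)
print("cross-node slave pairs, local placement:", local_pairs)
print("slaves per node:", dict(spread_load), dict(local_load))
```

In this model both placements load each node with the same number of slaves, so the only difference is how many slave pairs must communicate over the interconnect.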
Oracle automatically balances the workload across all nodes of the cluster. Oracle does this by allocating servers to
the nodes that are running the fewest number of processes. In cases where there are not enough available resources
on a single node to execute a database operation, Oracle will then utilize multiple nodes.
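A minimal sketch of this allocation policy follows. It is a simplification for illustration, not Oracle's actual algorithm: the per-node process limit `capacity`, the function name `allocate_slaves`, and the tie-breaking order are all assumptions.

```python
def allocate_slaves(load, capacity, degree):
    """Choose nodes for the slaves of one parallel operation.

    load     -- current number of processes on each node (list index = node)
    capacity -- assumed maximum number of processes per node (hypothetical)
    degree   -- number of slaves the operation needs
    Returns a dict mapping node index to the number of slaves placed there.
    """
    by_load = sorted(range(len(load)), key=lambda n: load[n])

    # Prefer a single node: the least-loaded one, if it has enough room.
    best = by_load[0]
    if capacity - load[best] >= degree:
        return {best: degree}

    # Otherwise spill over to further nodes, least-loaded first.
    allocation = {}
    remaining = degree
    for node in by_load:
        take = min(capacity - load[node], remaining)
        if take > 0:
            allocation[node] = take
            remaining -= take
        if remaining == 0:
            return allocation
    raise RuntimeError("not enough free capacity in the cluster")


# Enough room on the least-loaded node: the operation stays on one node.
print(allocate_slaves([3, 1, 2, 4], capacity=8, degree=4))
# No single node has room for 4 slaves: the operation spans two nodes.
print(allocate_slaves([6, 7, 6, 8], capacity=8, degree=4))
```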
ORACLE’S PARALLEL EXECUTION ARCHITECTURE ENABLES EASY GROWTH
A key requirement for data warehouses is growth. Over time, data warehouses add more data, more applications, and
more users, and this growing demand invariably necessitates adding new hardware. In a cluster, hardware resources
are added by adding one or more new nodes to the cluster. Oracle’s parallel execution architecture can immediately
leverage these additional nodes to improve scalability and performance. This seems intuitive, but alas not every
database operates this way; with most other data warehouse databases, data must be repartitioned in order to leverage
new nodes. Just as Oracle’s parallel query architecture is uniquely able to adapt to unexpected loss of a node in an
outage, so also is Oracle’s parallel query architecture able to rapidly subsume additional nodes.
BASIC PRINCIPLES FOR CONFIGURING A RAC LINUX CLUSTER
There are essentially three ingredients that make up the recipe for a high-performance data warehouse cluster:
processing power, IO bandwidth, and interconnect bandwidth.
PROCESSORS
The logical place to start configuring any data warehouse system is to ask how much horsepower is anticipated to
satisfy the query and load requirements of this system. In an ideal world, there would be a simple formula (N
processors for every terabyte of data); unfortunately, there are no hard and fast rules determining the system size
because of the considerable variation in the type of workloads supported by data warehouses and their performance
requirements. For example, very large Oracle data warehouse customers using Oracle8i or Oracle9i with between 5 and 10 TB of
raw data have systems ranging from 12 processors to over 200 processors. However, most sites
with previous data warehouse experience have good guidelines for determining their processor requirements,
and in the absence of that, the experiences of other customers, consultants, and vendors can help to determine the
appropriate number of processors for a proposed workload.
Once a rough estimate of the number of processors is determined, the next configuration issue is to determine the
number of nodes. Currently, most Linux servers range from 1 to 8 processors. For an Oracle RAC DW environment,
servers with 4 to 8 processors are able to handle reasonably complex operations within a single node, and thus these
servers usually provide the optimal combination of performance, manageability, and cost-savings.
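Going from a processor estimate to a node count is a simple rounding-up division. The sketch below assumes a hypothetical estimate of 32 processors, which is not a figure from this paper, and compares the two server sizes recommended above.

```python
import math


def nodes_needed(total_processors: int, processors_per_node: int) -> int:
    """Smallest number of identical nodes that supplies the estimated processors."""
    if total_processors <= 0 or processors_per_node <= 0:
        raise ValueError("both arguments must be positive")
    return math.ceil(total_processors / processors_per_node)


ESTIMATED_PROCESSORS = 32  # hypothetical estimate, for illustration only

for per_node in (4, 8):
    count = nodes_needed(ESTIMATED_PROCESSORS, per_node)
    print(f"{per_node}-processor servers: {count} nodes")
```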
I/O
I/O performance should always be a key consideration for data warehouse designers and administrators. The typical
workload in a data warehouse is especially I/O intensive, with operations such as large data loads and index builds,
creation of materialized views, and queries over large volumes of data. The underlying I/O system for a data
warehouse should be designed to meet these heavy requirements.
In fact, one of the leading causes of performance issues in a data warehouse is poor I/O configuration. Database
administrators who have previously managed other systems will likely need to pay more careful attention to the I/O
configuration for a data warehouse than they may have previously done for other environments.
For any data warehouse environment, the storage configuration should be chosen based on the I/O bandwidth that it
can deliver, rather than the amount of disk capacity it can provide. That is, the goal of the storage subsystem is to be
able to 'feed' the processors on each server in order to keep those processors fully utilized. One basic rule of thumb is
that each GHz of processor power can drive at least 100 MB/sec of I/O throughput. So, the I/O subsystem for a
cluster in which each node contains four 2-GHz processors (8 GHz in total) should be configured for 800 MB/sec per node. Note that
these systems may be able to achieve even higher throughput, and that these estimates represent a good baseline
target for I/O configurations.
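The rule of thumb reduces to a single multiplication, sketched below. The 100 MB/sec per GHz factor is the baseline stated above; the function name `io_target_mb_per_sec` and the 4-node cluster size used for the total are assumptions for illustration.

```python
def io_target_mb_per_sec(processors_per_node: int,
                         ghz_per_processor: float,
                         mb_per_sec_per_ghz: float = 100.0) -> float:
    """Baseline I/O throughput target for one node, in MB/sec."""
    return processors_per_node * ghz_per_processor * mb_per_sec_per_ghz


# The example from the text: four 2-GHz processors per node.
per_node = io_target_mb_per_sec(4, 2.0)
print("per-node target (MB/sec):", per_node)

# Hypothetical 4-node cluster: the shared storage must sustain the sum.
NODE_COUNT = 4
print("cluster-wide target (MB/sec):", per_node * NODE_COUNT)
```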
The I/O throughput that a system can deliver is based upon much more than simply the number of disks. When
choosing an I/O system, the entire configuration must be considered, ranging from the host-bus adapters on the
servers to the fibre channels and switches which connect the server to the storage array to the storage arrays
themselves.
The overall I/O subsystem can perform only as well as the weakest part of the subsystem. The fact that the storage
arrays may be able to deliver the equivalent of 1 GB/sec to a given node is irrelevant if that node is configured with
adapters that can only handle a total of 500 MB/sec. Therefore, not only should the storage configuration be able to
handle the appropriate number of arrays, but the servers should be selected based on both the number of processors
and the number of I/O channels that they can support.
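The weakest-link principle can be expressed as taking the minimum over the stages of the I/O path. In the sketch below the adapter and storage array figures come from the example above (with 1 GB/sec taken as 1000 MB/sec), while the switch figure and the stage names are hypothetical.

```python
def effective_throughput(stages):
    """Return (bottleneck stage, deliverable MB/sec) for one node's I/O path.

    stages maps each stage of the path to the throughput it can sustain;
    the path as a whole can deliver no more than its slowest stage.
    """
    if not stages:
        raise ValueError("at least one stage is required")
    bottleneck = min(stages, key=stages.get)
    return bottleneck, stages[bottleneck]


io_path = {
    "host bus adapters": 500.0,        # from the example in the text
    "fibre channel switches": 800.0,   # hypothetical figure
    "storage arrays": 1000.0,          # 1 GB/sec, from the example in the text
}

stage, mb_per_sec = effective_throughput(io_path)
print(f"bottleneck: {stage} at {mb_per_sec} MB/sec")
```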
INTERCONNECT
The final consideration is the interconnect. A scalable interconnect is a necessity for data warehousing; data
warehouse operations often involve large amounts of data, and although Oracle has many optimizations to minimize
interconnect traffic, there are occasions when a high-bandwidth interconnect is necessary. For today's environments,
most clusters (particularly clusters with 4 or more nodes) should be configured with a high-bandwidth interconnect such as
1 Gigabit Ethernet. The emergence of faster interconnect technologies, such as 10 Gigabit Ethernet and
InfiniBand, will prove useful for larger clusters.
By following these simple guidelines, a data warehouse administrator can ensure that their Linux-RAC cluster will be
well suited for handling scalable data warehouse workloads.
CONCLUSION
Oracle has delivered the most powerful and flexible database technology for data warehousing for many years, and
thus it is no surprise that Oracle customers are increasingly evaluating Oracle for their data warehouses as they
examine the benefits of Linux. This paper describes the basic technology benefits of Oracle for Linux-based data
warehouses, and furthermore provides recommendations and guidelines for customers who are implementing such
systems.