Reviewed by Oracle Certified Master Korea Community ( http://www.ocmkorea.com http://cafe.daum.net/oraclemanager )

PARALLEL EXECUTION FACILITY CONFIGURATION AND USE

Introduction

Beginning with Oracle Enterprise Server 7.1, it became possible to utilize more than a single server (shadow) process to execute a query; this capability was available as the Parallel Query Option (PQO). The goal was to return results to the user more quickly than with serial execution, at the expense of consuming additional system resources. This capability was greatly expanded with each new release of Oracle Enterprise Server. In versions 8 and 9, an impressive array of parallel operations is possible, including not only parallel query, but also parallel DML, parallel DDL, parallel direct load using SQL*Loader, and numerous maintenance tasks. This expanded capability is now collectively referred to as Parallel Execution (PX).

The purpose of this paper is to provide practical knowledge for configuring PX, monitoring performance once it has been configured, and highlighting some of the most common uses in the data warehousing environment. The scope of this paper includes the following topic areas:

• Overview of Parallel Execution architecture
• Host and database design considerations
• Configuring Parallel Execution
• Implementing parallel operations
• Common parallel data warehouse operations
• Run-time performance monitoring

Many configuration parameters and recommended values are discussed. While the Oracle documentation is general in nature and applicable to all host vendors and configurations, this paper is geared toward the most common data warehouse configurations encountered by the author:

• Sun Microsystems or Hewlett-Packard enterprise-class host with between 4 and 64 CPUs
• High-performance storage array such as EMC Symmetrix
• Single-instance environment (e.g.
Oracle Parallel Server or Real Application Clusters not implemented)
• Solaris or HP-UX operating system

Symmetrix is a registered trademark of EMC Corporation. Solaris is a registered trademark of Sun Microsystems, Inc. HP-UX is a registered trademark of Hewlett-Packard Company.

PX Facility Configuration and Use - Maresh

Overview of PX Architecture

Before learning about the configuration of the PX facility, it is necessary to achieve a good understanding of the process architecture. To begin with a simpler case, consider non-parallel execution of a SQL statement. Several steps are involved. The client process provides the interface to the end user. The client (user) process may be SQL*Plus, or a custom application written in Java, Perl, Visual Basic, or some other language. The client process may reside on the database server or on another computer such as an end user's workstation. Each client process communicates with an Oracle server process. The server process is responsible for interacting with the database on behalf of the client process. It accepts SQL statements from the client process, parses and executes them against the database, and returns the results to the client process. This is illustrated in Figure 1.

Figure 1. Serial SQL Execution (the client process sends SQL to a server process, which executes it against the database and returns results)

Figure 2 illustrates the process configuration when a parallel query is executed. First, the server process that interacts with the client process is promoted to the role of Query Coordinator (QC). The QC is primarily responsible for handling communications with the client and managing a set of PX slave processes that perform most of the work against the database. The QC is responsible for acquiring multiple PX slave processes, dividing up the workload between the PX slaves, and passing the results back to the client.
Figure 2. Parallel SQL Execution with one PX slave per execution thread (the QC passes SQL to each PX slave, which accesses the database and returns results)

When a parallel query is initiated by the client process, the QC determines how to divide the workload between the various PX slave processes. It should not be surprising that the QC rewrites the SQL statement submitted by the client process into one or more statements that are passed to the PX slaves for execution. In the figure, there are three threads of execution, each of which has one PX slave process. This is referred to as intra-operational parallelism. When a parallel operation is initiated, PX slaves are borrowed from a common slave pool that is available to all users of the database instance. Once the parallel operation has completed, the slaves are released back to the pool for other parallel operations. The maximum number of intra-operational PX slaves that will be brought to bear on a particular query can be controlled at the table or index level using the ALTER TABLE or ALTER INDEX directive respectively, or by SQL statement hints. In certain cases, twice the degree of PX slaves will be used to satisfy a query. This configuration is illustrated in Figure 3, where two PX slaves are utilized for each execution thread. It occurs when executing a merge or hash join, when sorting or other aggregation operations are present in the original query, and when a parallel DML statement is executed with a correlated parallel SELECT statement. In this case, the two sets of intra-operational PX slaves in an execution thread act in a producer-consumer relationship: the PX slaves that access the database directly produce data that are consumed by the second set of intra-operational PX slaves.
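As noted above, the Degree of Parallelism can be declared at the object level or requested per statement with a hint. The following is a minimal sketch; the SALES table, its primary key index, and the degree of 8 are hypothetical names and values, not recommendations:

```sql
-- Declare a default Degree of Parallelism of 8 on the object
ALTER TABLE sales PARALLEL (DEGREE 8);
ALTER INDEX sales_pk PARALLEL (DEGREE 8);

-- Or request parallelism for a single statement with a hint
SELECT /*+ PARALLEL(s, 8) */ COUNT(*)
FROM   sales s;

-- Revert the object to serial execution
ALTER TABLE sales NOPARALLEL;
```

The object-level setting applies to every statement that scans the object; a hint overrides it for one statement only.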
For example, in a sorting operation, the first set of PX slaves would retrieve rows from the database and apply any limiting conditions that appear in the query. The resulting rows would be fed to the second set of PX slaves for sorting. Each of the second set of PX slaves is responsible for sorting rows within a particular range. Each of the PX slaves that accessed the data directly sends its results to the appropriate slave according to the value of the sort key. When two PX slaves communicate with each other within the same thread of execution, it is referred to as inter-operational parallelism.

Figure 3. Parallel SQL Execution with two PX slaves per execution thread

The above example contains two parallel operations. The first operation retrieves rows from tables in the database while the second operation sorts or aggregates the results. When executing complex joins, multiple inter-operational parallel operations may occur. In this case, the cost-based optimizer will split the original SQL statement into the optimal number of operations, each of which is executed in parallel in succession. Consider the execution plan below involving three tables that are hash joined, followed by a sort operation.

SORT                 Step 4, PX slave set 2
  HASH JOIN          Step 3, PX slave set 1
    TABLE C          Step 3, PX slave set 1
    HASH JOIN        Step 2, PX slave set 2
      TABLE A        Step 1, PX slave set 1
      TABLE B        Step 2, PX slave set 2

The optimizer may choose to perform four parallel operations in the following order using two sets of PX slaves.
1. The first set of PX slaves produces rows from table A using intra-operational parallelism.
2. The results of step 1 are consumed by the second set of PX slaves, which also retrieve rows from table B using intra-operational parallelism and perform the first hash join. Since steps 1 and 2 occur simultaneously, they represent an inter-operational process.
3.
The results produced in step 2 are consumed by the first set of PX slaves, which have completed executing step 1. The same set of PX slaves also retrieves rows from table C using intra-operational parallelism and performs the second hash join. Since steps 2 and 3 occur simultaneously, they represent an inter-operational process.
4. The results from step 3 are consumed by the second set of PX slaves that have completed executing step 2. These PX slaves perform the sort and return the result to the Query Coordinator.

This begins to sound hopelessly complex, but there are several patterns that enable one to clearly understand what is occurring in each step. When SQL statements require more than two parallel operations to return the result, each set of PX slaves is reused as necessary. In the absence of subqueries, union, intersect, or minus operations, the maximum number of PX slaves required to satisfy the original statement is no more than twice the highest Degree of Parallelism specified on any of the tables in the query.

Host and Database Design Considerations

Nobody will argue with the fact that it is much easier to build a system correctly from the ground up than to redesign and retrofit when performance problems present themselves. It is a frustrating experience to spend a great deal of effort setting up a data warehouse with the intention of heavily utilizing PX, only to find that it is not working as expected. Many times, this occurs because the DBAs and System Administrators designing the data warehouse have most of their experience with OLTP systems, which have very different requirements than data warehouse systems. The most common performance problems encountered when implementing PX have two sources. The first is that the hardware may not be suitable or configured properly for use as a data warehouse where parallel operations are heavily utilized.
The second source of problems is a database physical design that is not properly optimized to support parallel operations. The usual culprit in both cases is resource contention. When designing and configuring a host system and database that will rely heavily on PX, a number of general design principles should be employed. Many of these design points are only applicable if a new system will be procured, but others can be employed on an existing system.

• A RAID 1 (mirroring) disk configuration offers more tuning options than RAID 0+1 (striping plus mirroring) because objects can be manually striped across multiple physical devices. Arguably, RAID 1 takes a bit more administration because of the manual striping effort, but the resulting performance is worth it.

• For a fixed quantity of disk capacity, a larger number of smaller mount points is likely to perform better than a few large ones. For example, for a 700GB disk configuration, ten 70GB disks will perform better than four 180GB disks. Having more disks reduces physical disk contention.

• Adding high-throughput disk controllers (e.g. Fibre Channel) will increase throughput between the disk subsystem and the host. PX utilizes direct path operations whenever possible. This typically occurs when performing full table and partition scans, and when specified on INSERT statements. In the absence of a buffer cache bottleneck, remarkably high I/O throughput rates can be achieved if the I/O subsystem is properly designed and configured.

• For a fixed quantity of CPU capacity, a larger number of less powerful CPUs is likely to perform better than fewer more powerful ones. For a fixed amount of aggregate performance (e.g. SpecINT92, SPECbase, SPECrate), if the choice is between a 4-CPU and a 12-CPU configuration, the 12-CPU configuration is likely to perform better by reducing CPU contention. The assumption here is that there is adequate bus bandwidth to support the higher number of CPUs.
• When constructing tablespaces that will house large tables, many smaller datafiles will perform better than a few large ones if they are striped over many mount points. This once again reduces disk contention. For example, if a 10GB tablespace is required, forty 256MB datafiles will perform better than five 2GB datafiles.

• The same paradigm holds true when constructing tablespaces to hold temporary and rollback segments. While the amount of data sorted will usually remain the same whether or not a query utilizes PX, the disk sort activity at any point in time while a parallel sort is running will be increased by a factor approximately equal to the Degree of Parallelism of the query. Rollback segments will experience a similar increase in activity when a parallel DML statement is executed.

• Continuing the theme of reducing disk contention, the online redo and archive log destinations should be used exclusively for those purposes. Although much space may remain unused, contention on the online redo and archive log disks during parallel DML operations can be the bottleneck on an otherwise well-designed system.

• When creating large tables or table partitions, many smaller extents will perform better than a few large ones. For example, if a partitioned table will be created, 50 to 100 smaller extents per partition will perform better than one or two large extents. Each new extent is allocated round-robin style from tablespace datafiles that have space available. If datafiles have been striped across many mount points, parallel full-table and partition scans will perform better because the data is spread across many physical disks. Here, Locally Managed Tablespaces should be employed.

• When creating partitioned tables, many smaller partitions will perform better than a few large ones.
For example, if a partitioned table will be created to house five years' worth of data, a table created with 260 weekly partitions or 60 monthly partitions will likely perform better than a table with 5 annual partitions. Choosing the partition granularity also requires careful analysis of the types and volumes of SQL statements that will access the table, data loading, and backup and recovery strategies.

In each of these examples, spreading the work over more objects reduces resource contention and potentially improves PX throughput. If the optimizer determines that the specified Degree of Parallelism will degrade query performance because of resource contention, or that there are an insufficient number of objects over which to spread the work, the optimizer will reduce the Degree of Parallelism behind the scenes, much to the chagrin of the user.

Configuring Parallel Execution

To this point, we have looked at an overview of the PX process architecture and its host requirements. In this section, more details about the process and memory architecture will be discussed, and the relevant configuration parameters will be explained. All of these parameters appear in the database parameter file (init.ora). Some may be changed dynamically using ALTER SYSTEM and ALTER SESSION commands. Others require the database to be restarted to take effect. The first set of parameters to be discussed are those that affect the number of slaves in the slave pool, and how they are used by statements that utilize them.

PX Slave Configuration

When the optimizer determines that an operation will run using PX, the Query Coordinator recruits slaves from the PX slave pool. It can only use slaves that are currently idle, or inactive. Once idle slaves are found, each of the slaves that will be used is marked as active, and no other query can use them until they are released when the current parallel operation has completed.
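The state of the slave pool can be observed directly. Assuming the V$ views are accessible to the session, V$PX_PROCESS lists each slave in the pool and whether it is currently idle or acquired:

```sql
-- One row per PX slave in the pool; STATUS is either
-- 'IN USE' (acquired by a Query Coordinator) or 'AVAILABLE' (idle)
SELECT server_name, status, pid, spid
FROM   v$px_process
ORDER  BY server_name;
```

Watching this view while parallel statements run makes the borrow-and-release behavior described above visible in real time.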
Since PX slaves can only be used by one Query Coordinator at a time, the number of slaves available in the slave pool is one of the primary limiting factors on the number of concurrent parallel operations. Additionally, PX slaves are borrowed from the slave pool on a first-come, first-served basis. The PX slave pool consists of a number of PX slaves that are available to all operations that utilize PX on the database instance. The PARALLEL_MIN_SERVERS parameter determines the minimum number of PX slaves in the pool and PARALLEL_MAX_SERVERS determines the maximum number. When the database instance is started, Oracle creates a PX slave pool with the minimum number of slaves. As various PX operations begin to execute, the number of idle slaves in the pool eventually becomes depleted. If there is an insufficient number of PX slaves available in the pool to support a parallel operation, the value of PARALLEL_MAX_SERVERS is inspected. If it is higher than the value of PARALLEL_MIN_SERVERS, additional PX slave processes will be dynamically created (spawned), added to the pool, and recruited by parallel operations that need them. Therefore, the maximum number of PX slaves that will ever reside in the pool is PARALLEL_MAX_SERVERS. As each Query Coordinator completes its parallel operations, the PX slaves become idle again and are available for use by other parallel operations. If there are more slaves than PARALLEL_MIN_SERVERS in the PX slave pool, the excess slaves will be terminated after a five-minute period of inactivity. For example, suppose that PARALLEL_MIN_SERVERS=16 and PARALLEL_MAX_SERVERS=64, and all 64 slaves in the PX slave pool are active because many PX queries were running concurrently. After this period of high activity, no parallel statements are run for a while. When this occurs, the number of PX slaves in the slave pool will be reduced from 64 to 16.
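The scenario above corresponds to parameter file entries such as the following sketch; the values are taken from the example, not offered as recommendations:

```
# init.ora excerpt: the pool starts with 16 slaves and may grow to 64
PARALLEL_MIN_SERVERS = 16
PARALLEL_MAX_SERVERS = 64
```

Both parameters require an instance restart to take effect.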
The value of PARALLEL_MIN_SERVERS should be set such that there are enough slaves readily available in the slave pool to cover the average PX operation load on the system. The value of PARALLEL_MAX_SERVERS should be set to the maximum number of slaves that can be created, and is usually based upon practical CPU and I/O limits. Some consideration should be given to choosing the two values. If the minimum number of servers is set too low, then some queries will always be waiting for additional PX slaves to be created, causing them to run longer. Setting the value too high simply causes more system memory to be consumed because the idle processes remain in the slave pool. If memory is not much of an issue, then it is acceptable to set the two parameters to the same value so that the maximum number of PX slaves will always be available for use.

A good reason to choose two different values is when the PX operation load varies greatly over time. Suppose that most of the data are loaded nightly during a four-hour window. Many of the operations are parallelized, including data loads, materialized view and index rebuilds, and the creation of summary tables. There are few total sessions running on the system during this period, but they are all active. This represents the peak processing load on the system. During the day, however, there are hundreds of concurrent users on the system but only a few of them utilize PX for running parallel queries. Here, it would be appropriate to choose a low value for PARALLEL_MIN_SERVERS to support the requirements during the day and a much higher value for PARALLEL_MAX_SERVERS to support nightly processing requirements. During the day, additional slaves would be dynamically added to the pool if required to support parallel queries, up to the value of PARALLEL_MAX_SERVERS. Otherwise, the memory would be available to support the higher volume of concurrent users.
During the nightly load window, a much higher number of PX slaves will remain in the pool because of their frequent use. Now, what happens as the number of slaves currently in use approaches the value of PARALLEL_MAX_SERVERS? If there is an insufficient number of PX slaves available to satisfy the requirements of a particular parallel operation, Oracle will consider using whatever slaves are available. The value of the PARALLEL_MIN_PERCENT parameter is used to decide whether or not an operation will proceed using PX when the requested number of PX slaves is not available. Consider a query where 16 slaves have been requested, but only 10 are available. If the value of PARALLEL_MIN_PERCENT is 50, meaning that at least 50% of the requested slaves must be available for the operation to proceed in parallel, the operation will proceed if there are a minimum of 8 slaves available. All 10 slaves will be used in this case. If the value of PARALLEL_MIN_PERCENT is 75, the minimum number of slaves required to proceed in parallel is 12. Now the statement will fail to execute and produce the following error message:

ORA-12827: insufficient parallel query slaves available

If PARALLEL_MIN_PERCENT is set to 0, queries will utilize whatever PX slaves are available and proceed in parallel. If no PX slaves are available, the query will execute serially with no warning or other indication, other than a longer elapsed time. When PARALLEL_MIN_PERCENT is set to 100, queries will not proceed in parallel unless all requested PX slaves are actually acquired. Setting PARALLEL_MIN_PERCENT to 100 assures that parallel operations will behave more predictably. Since all PX slaves must be acquired for the operation to proceed in parallel, queries that run successfully will do so with similar execution times.

The purpose of the PARALLEL_ADAPTIVE_MULTI_USER parameter is to help preserve overall database performance during periods of high system load.
It permits the requested Degree of Parallelism to be automatically reduced on new parallel queries during periods of high database activity. If, for example, the requested Degree of Parallelism is 16 but the system load is high when the parallel operation begins to execute, the Degree of Parallelism might be reduced to 8. This reduction will cause the query to run longer than if the originally requested Degree of Parallelism were employed. The PARALLEL_THREADS_PER_CPU parameter is used to adjust the load on each CPU when PARALLEL_ADAPTIVE_MULTI_USER is enabled. The value represents the average number of PX slaves that each CPU can process concurrently. The default value of 2 is usually adequate. If the host system has a few high-powered CPUs rather than many lower-performance CPUs, increasing the value may improve throughput. Likewise, if the host system has a slower I/O subsystem, increasing the value may improve PX throughput.

The PARALLEL_BROADCAST_ENABLED parameter can significantly improve parallel query performance at the expense of additional PGA memory. If a small row source is being joined to a large row source in either a merge or hash join, each of the PX slaves processing the large row source may require the entire small row source for the join. For example, if the DOP for a hash join is 8 and a small table must be scanned by each of the PX slaves, the small table will be scanned 8 times if PARALLEL_BROADCAST_ENABLED is set to false, the default value. If the parameter is set to true, the table will be scanned once by the Query Coordinator and the results will be delivered to each of the 8 PX slaves. Enabling this parameter results in more memory usage within the PGA of each PX slave process.

SGA Memory Configuration

The next set of parameters to be discussed affect the SGA memory structures associated with PX. Oracle uses message buffers to pass data between the various processes during inter-operational parallel operations.
These buffers reside in the shared pool inside the SGA if PARALLEL_AUTOMATIC_TUNING is set to FALSE, and in the large pool if the parameter is set to TRUE. A fixed number of buffers is required for each producer-consumer connection. The number of connections for a particular query varies with the square of the maximum Degree of Parallelism of the query, since each producer has a connection to each consumer. If the maximum DOP of the query is 4, there could be as many as 16 connections. If the maximum Degree of Parallelism were 8, there would be 64 connections. Based upon this relationship, one can see that memory requirements for message buffers increase greatly as the maximum DOP increases. The system-wide amount of memory required for the message buffers also depends upon the number of sessions concurrently executing parallel statements, and the size of the message buffer. Given these variables, computing the buffer space is quite inexact and is based upon a number of assumptions. We are simply interested in computing a ballpark number: we want to know whether the requirement is 20MB vs. 200MB, not 20.2MB vs. 20.5MB. The following formula from the Oracle documentation can be used to compute the amount of memory required for PX message buffers on SMP systems:

buffer space = 3 x size x users x groups x connections

where
size = PARALLEL_EXECUTION_MESSAGE_SIZE
users = the maximum number of sessions that will be concurrently issuing PX queries
groups = the number of PX groups used per query; 2 is a good nominal value
connections = (DOP x DOP) + (2 x DOP), where DOP is the highest Degree of Parallelism that will be used for any query

The PARALLEL_EXECUTION_MESSAGE_SIZE parameter controls the size of the messages and has a default value of 2KB if PARALLEL_AUTOMATIC_TUNING is set to FALSE, and 4KB if it is set to TRUE.
Within certain limits, increasing the size of the message buffer increases throughput between the producer and consumer PX slaves, which in turn reduces execution time. If machine memory is plentiful, increasing the value to 8KB may provide performance gains if communication between inter-operational PX slaves is a bottleneck. If the value is increased, the SGA size must be increased accordingly. If there is insufficient space in the PX message pool, parallel queries will fail with an ORA-4031 memory allocation failure message.

Here is an example computation for the PX message pool. Let's use an 8KB PARALLEL_EXECUTION_MESSAGE_SIZE, so the value for size is 8192. A total of 20 users will be issuing parallel queries concurrently, so the value of users is 20. Of these 20 users, most will be using a Degree of Parallelism of 4, but several large queries use a degree of 8, so the value of 8 should be used for DOP in the computation of connections:

connections = (8 x 8) + (2 x 8) = 80

Substituting these numbers into the above formula yields

buffer space = 3 x 8192 x 20 x 2 x 80 = 78,643,200 bytes

Dividing by 1,048,576 to get megabytes yields 75MB. This is the minimum amount of memory that should be added directly to the size of the shared pool or large pool, depending upon the setting of PARALLEL_AUTOMATIC_TUNING. The SHARED_POOL_SIZE and LARGE_POOL_SIZE parameters control the size of the shared pool and large pool respectively. It is desirable to locate the message buffers in the large pool because the shared pool is already heavily used for other purposes, and is often a point of contention even without further exacerbation. The large pool, on the other hand, has considerably less activity because it has fewer uses. It is used only for the PX message pool, DB Writer I/O slaves, and Multithreaded Server (MTS), if those features are used.
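Under the assumptions of the worked example above (8KB messages, 20 concurrent parallel users, maximum DOP of 8), the relevant parameter file entries might look like this sketch; the LARGE_POOL_SIZE figure simply adds the computed 75MB message pool to a small allowance for the pool's other uses and is illustrative only:

```
# init.ora excerpt -- illustrative values from the worked example
PARALLEL_AUTOMATIC_TUNING       = TRUE    # PX message pool placed in large pool
PARALLEL_EXECUTION_MESSAGE_SIZE = 8192
LARGE_POOL_SIZE                 = 80M     # >= 75MB message pool + other large pool uses
```

Each site must add its own MTS and I/O slave requirements, if any, on top of the message pool figure.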
Since all of the other features provided by setting PARALLEL_AUTOMATIC_TUNING to TRUE can be overridden with other parameter settings if desired, set it to TRUE so that the PX message pool will be located in the large pool. There is no other mechanism to specify that the message pool should be located in the large pool. The above computation is based upon a number of assumptions, and therefore produces a good starting point. PX operations also cause more shared pool memory to be used than their serial counterparts. PX slaves are specialized versions of the garden-variety server process, so their operational characteristics are similar. If 8 PX slaves will be used to solve a particular query, the queries running on each slave will occupy space in the shared pool. Even if the PX message pool resides in the large pool, the size of the shared pool should be increased to accommodate the additional memory requirements caused by the additional SQL statements. A good starting point is to increase the size by 25% when implementing PX on a well-tuned database, then monitor for contention to determine if further adjustments are necessary.

Other Resource Considerations

In addition to the obvious database parameters required to configure PX, there are also a number of others that must be reevaluated. When a SQL statement executes using PX, the Query Coordinator uses a two-phase commit strategy to commit transactions. Consider a parallelized UPDATE statement that utilizes 8 PX slaves to update rows in a total of 30 table partitions. This results in a total of nine transactions: one for each of the 8 PX slaves, and one for the Query Coordinator. If the SQL statement utilizes two PX slaves per thread of execution, then the number of transactions would be 17: one for each of the 16 PX slaves, and one for the Query Coordinator.
Therefore, the value of TRANSACTIONS should be increased by the value of PARALLEL_MAX_SERVERS to handle the additional transactions caused by the PX facility. The Query Coordinator will acquire DML locks on each of the partitions being processed. Each of the PX slaves will also acquire DML locks on each of the partitions it is processing. Therefore, the number of DML locks and enqueue resources must both be increased considerably to handle the additional requirements imposed by PX. If nondefault values for the PROCESSES and SESSIONS parameters are used, they must be increased by the value of PARALLEL_MAX_SERVERS to accommodate the additional PX slave processes.

Since the UPDATE statement modifies data, transaction information must be written to rollback segments in the event that the transaction must be rolled back. Each slave process will utilize its own transaction space within a rollback segment, since each is treated as a separate transaction. The total amount of space used in rollback segments is not considerably higher than that used by a serial transaction, but the overall load on the rollback segments while the statement is executing will be considerably higher. This is because all of the work has been spread across multiple PX slaves to process the statement within a shorter period of time. While parallel DML statements are running, verify that rollback segment contention is minimal. If contention exists, then the rollback segments should be tuned accordingly. Performance of the disks on which the rollback segments reside should also be monitored for contention. Likewise, when parallel DML statements are executing, redo entries are generated at a much faster rate. This will affect all of the structures associated with the log writer (LGWR), including the log buffer in the SGA, and the number and size of the online redo logs.
This facility should be monitored closely when parallel DML statements are executing to determine if tuning is necessary to improve performance.

PGA Memory Issues

The next set of parameters affect the size of the Process Global Area (PGA) of the PX slave processes. The PGA is the private memory area within each server or PX slave process. Sort and hash areas are located within this memory structure. Once this memory has been allocated in the PGA, it remains allocated for the life of the server process. Sort and hash memory areas can account for a considerable amount of overall memory usage on the database host, particularly when PX is employed. If the database has been configured so that 10MB of sort memory may be allocated, then up to 10MB will be allocated for each sort in a query that runs serially. However, the same query running in parallel with a degree of 16 may use 170MB of sort memory per sort, because each of the 16 PX slave processes and the Query Coordinator can each allocate 10MB. If the query performs multiple sorts, as in a merge join, then considerably more than 170MB of memory may be allocated. Data are sorted any time an aggregation operation such as GROUP BY or ORDER BY is performed, and when merge joins are performed.

The SORT_AREA_SIZE parameter controls the amount of memory that can be allocated within the PGA for a single sort. For small sorts, only the amount of memory required to perform the sort will be allocated. If a sort cannot be performed entirely within memory, then the additional space required is allocated from temporary segments on disk. With the high data volumes processed within a data warehouse, some sorts will be performed using temporary segments on disk. When this occurs, multiple sort runs are performed in memory while the results of each run are stored on disk.
Even though disk sorts occur, sorting time will decrease as the size of the in-memory sort area increases because fewer sort runs will be required to complete the sort. Query performance will be optimal if the sort is performed entirely in memory. The SORT_AREA_RETAINED_SIZE parameter controls the amount of memory retained in the PGA from a sort after it completes. It affects the total amount of sort memory within the PGA when multiple sorts are performed within a single query. In a merge join, for example, each of the two row sources must be sorted before they are merged. Sort memory up to the value of SORT_AREA_SIZE is allocated to perform the first sort. After the first sort has completed, all of the data that cannot fit within the value of SORT_AREA_RETAINED_SIZE are written to disk. For example, if SORT_AREA_SIZE is 1MB and SORT_AREA_RETAINED_SIZE is 512KB, after the first sort completes 512KB of the data will be written to disk and the remaining 512KB will be retained in memory. To sort the second row source, an additional 1MB of memory is allocated to perform the sort. At the end of the second sort, the total sort memory used is now 1.5MB; 512KB retained from the first sort, and 1MB from the second sort. Recursive statements may allocate additional PGA memory for sorting, so allow additional host memory of at least the sum of SORT_AREA_SIZE and SORT_AREA_RETAINED_SIZE for each of the maximum number of PX slaves that will be configured. The hash area used for hash joins has similar properties to the in-memory sort area. Whenever a query performs a hash join, memory is allocated in the PGA to hold the hash table of the first row source in the join. The value of HASH_AREA_SIZE determines the maximum amount of memory that can be allocated per session when a hash join is performed. Like SORT_AREA_SIZE, hash join memory can be allocated by each PX slave that performs the hash join.

PARAMETER                          DEFAULT                ALTER SYSTEM?        ALTER SESSION?  AUTO TUNING DEFAULT
DML_LOCKS                          4x TRANSACTIONS        No                   No              4x TRANSACTIONS
ENQUEUE_RESOURCES                  Derived from SESSIONS  No                   No              Derived from SESSIONS
HASH_AREA_SIZE                     2x SORT_AREA_SIZE      Yes                  Yes             2x SORT_AREA_SIZE
LARGE_POOL_SIZE                    0                      No in 8i, Yes in 9i  No              Derived
PARALLEL_ADAPTIVE_MULTI_USER       FALSE                  Yes                  No              TRUE
PARALLEL_AUTOMATIC_TUNING          FALSE                  No                   No              N/A
PARALLEL_BROADCAST_ENABLED (8i)    FALSE                  No                   Yes             FALSE
PARALLEL_EXECUTION_MESSAGE_SIZE    2048                   Yes                  Yes             4096
PARALLEL_MAX_SERVERS               5                      No                   No              CPU_COUNT x 10
PARALLEL_MIN_PERCENT               0                      No                   Yes             0
PARALLEL_MIN_SERVERS               0                      No                   No              Derived
PARALLEL_THREADS_PER_CPU           2                      Yes                  No              2
PROCESSES                          Derived                No                   No              Derived
SESSIONS                           1.1 x PROCESSES + 5    No                   No              1.1 x PROCESSES + 5
SHARED_POOL_SIZE                   Derived                No in 8i, Yes in 9i  No              Derived
SORT_AREA_RETAINED_SIZE            SORT_AREA_SIZE         Yes                  Yes             SORT_AREA_SIZE
SORT_AREA_SIZE                     64KB                   Yes                  Yes             64KB
TRANSACTIONS                       1.1 x SESSIONS         No                   No              1.1 x SESSIONS

Table 1. PX Configuration Parameters

The total amount of memory that will be used for sorting and performing hash joins by all PX slave processes should be accounted for when configuring PX. When configuring these parameters, choose conservative values and monitor overall system memory usage during peak loads to verify that there is sufficient memory on the host machine to comfortably support the load. If additional memory is available that could be used for sorting and performing hash joins, the parameter values can be increased if additional performance is desired. Enabling parallel broadcasting by setting the value of PARALLEL_BROADCAST_ENABLED to TRUE will also cause more PGA memory to be used. The additional memory usage will be proportional to the size of the row set being broadcast and may range from a few kilobytes up to tens or hundreds of megabytes. It is best to enable this parameter after a stable PX configuration has been established, so that it can be monitored in the absence of other changes. 
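As a sketch, the per-process sort and hash limits discussed above can be raised for a single large batch session without changing the instance-wide defaults. The values below are illustrative only, not recommendations; remember that each PX slave and the Query Coordinator can each allocate these amounts.

```sql
-- Session-level overrides for a large batch job (illustrative values).
-- With a requested degree of 8, total sort memory for one sort can reach
-- approximately (8 slaves + QC) x 10MB.
ALTER SESSION SET sort_area_size = 10485760;           -- 10MB per sort per process
ALTER SESSION SET sort_area_retained_size = 1048576;   -- 1MB retained after each sort
ALTER SESSION SET hash_area_size = 20971520;           -- 20MB, 2x SORT_AREA_SIZE
```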
Automatic PX Configuration By this time, PX may appear to be difficult and complex to configure. However, to make configuration easier, Oracle has provided the PARALLEL_AUTOMATIC_TUNING parameter. When set to a value of TRUE, all of the parameters associated with PX are configured to a set of reasonable values. Table 1 shows all of the parameters relevant to PX, including the less obvious ones, and the effects of setting PARALLEL_AUTOMATIC_TUNING to a value of TRUE. All of the parameters in the table should be considered when configuring and monitoring the PX facility, although some have only an indirect effect on performance. The table can be used effectively as a checklist. Recommended Starting Configuration The following parameters usually provide a stable starting point when setting up PX on the host configurations mentioned in the introduction of this article. This is a conservative approach that should produce good PX throughput without overburdening the host system.
PARALLEL_AUTOMATIC_TUNING – Set to a value of TRUE
PARALLEL_MAX_SERVERS – Set to a value of CPU_COUNT x 3
PARALLEL_MIN_SERVERS – Set to a value of PARALLEL_MAX_SERVERS / 4
LARGE_POOL_SIZE – Use the value computed from the above formula
SHARED_POOL_SIZE – Increase the current value by 20% to accommodate PX SQL statements
In addition to reducing the complexity of PX configuration, enabling Parallel Automatic Tuning is the only way to locate the PX message buffers in the large pool. If nondefault values for PROCESSES, SESSIONS, LARGE_POOL_SIZE, or TRANSACTIONS are already in use, refer to the Oracle documentation listed in the Bibliography to adjust the values accordingly. Leaving PARALLEL_MAX_SERVERS at the default value of CPU_COUNT x 10 may overload the host, even on a database host and database designed and configured specifically for PX. Setting PARALLEL_MIN_SERVERS to a value of PARALLEL_MAX_SERVERS / 4 leaves at least some servers in the slave pool ready for immediate use. 
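For a hypothetical 16-CPU host, the starting point above might translate into an init.ora fragment like the following. The CPU count and derived values are illustrative assumptions; LARGE_POOL_SIZE and SHARED_POOL_SIZE are omitted here because their values depend on the sizing formula and current value discussed in the paper.

```ini
# Illustrative starting configuration for a hypothetical 16-CPU host
parallel_automatic_tuning = true
parallel_max_servers      = 48    # CPU_COUNT x 3
parallel_min_servers      = 12    # PARALLEL_MAX_SERVERS / 4
```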
After monitoring the behavior over several days or weeks, begin to make incremental adjustments as necessary to tune performance, and to take advantage of additional free host system resources. Implementing Parallel Operations The overall goal in implementing PX is to improve throughput by utilizing excess resources available on the database instance. Excess resources may be available at all times, or only during certain times of the day, such as evenings when end-user loads are minimal. PX should not be implemented on instances that are already overburdened or that are near their resource limit. There are two methods of implementing Parallel Query and Parallel DML operations. The first method involves setting the Degree of Parallelism on tables and indexes as in the following example. SQL> alter table f_monthly_acctg parallel(degree 8); Table altered. The default value for degree is 1. Once the above statement is executed, any queries that access the f_monthly_acctg table may utilize PX if the optimizer determines that it is appropriate. Certainly, this is the least invasive method of implementing PX because in many cases, neither SQL statements nor applications need to be changed. The risk here is that queries that inappropriately perform full-table scans on large tables may cause significant performance degradation on the entire database if they are executed simultaneously using PX by multiple users. On partitioned tables with local indexes, PX can be used to perform index range scans. To utilize PX in this capacity, the degree of the index must be altered as shown in the following example. SQL> alter index f_monthly_acctg_ix03 parallel(degree 8); Index altered. Before implementing PX by altering the degree at the object level, it is prudent to review the SQL statements running against the candidate tables, either through the V$SQL dynamic performance view or by code review, to make sure that all queries are optimized. 
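A review of candidate statements via V$SQL might be sketched as follows. The table name is the example table used above; the filter pattern and the choice of columns are illustrative, not a prescribed method.

```sql
-- Locate cached statements that reference a candidate table,
-- worst disk readers first (columns exist in V$SQL in 8i/9i)
SELECT sql_text, executions, disk_reads, buffer_gets
FROM v$sql
WHERE UPPER(sql_text) LIKE '%F_MONTHLY_ACCTG%'
ORDER BY disk_reads DESC;
```

Statements surfacing here with high disk reads per execution are the ones to tune before any degree is raised.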
PX should not be used as a quick fix for missing or inadequate indexes. PX implemented on poorly tuned queries may result in disastrous performance. Also, determine the number of queries that will potentially use PX, and how long they will run, to get an idea of the expected increase in system load. Queries that run in the 5 second to 5 minute range are good candidates for implementing PX using this method. The overhead associated with PX means that queries that run in less than 5 seconds will usually not significantly benefit. If queries run longer than 5 minutes, then there is a risk that many users will execute them simultaneously, causing significant overall database performance degradation. Some queries may still need to be tuned to utilize PX even though the table and/or index degree has been increased from a value of one. Information about tuning queries to utilize PX is contained in the references listed in the Bibliography section. The second, and preferred, approach is to modify the SQL statements to use PX through the use of hints, as shown in the statement below. While this may be more invasive to applications than the first method, PX operations occur in a predictable and controllable manner. SELECT /*+ PARALLEL(s,8) */ quarter_num, sum(acctg_num) FROM s_daily_acctg s WHERE year_num = 2002 AND acct_num = 70035 GROUP BY quarter_num; DDL statements require special directives to utilize PX. Maintenance tasks such as gathering table statistics and rebuilding indexes are performed under the control of Database Administrators, so performance can be evaluated and controlled. Other DDL statements that create temporary objects as part of the data loading process are usually performed by experienced developers, possibly with the help of a Database Administrator. So here too, performance can be predicted and controlled. In a production data warehouse environment, many of these tasks can be performed during off-peak hours when plenty of excess database resources are available. 
Shortening the time of data loading windows is a particularly good use for PX. So what is the best method? That depends upon when and where the extra performance is required. The following implementation methodology lowers the risk of overall poor database performance when implementing PX.
1. For queries and DML statements that run during peak database load times:
a. First, verify that all of the queries accessing the tables of interest are optimized. If the queries are properly tuned and perform well, it may not be necessary to implement PX.
b. After tuning, verify that there are sufficient database resources available to implement PX. There should be a considerable amount of CPU idle time available with few disk waits.
c. If there are many different statements running against a particular table that could benefit from PX, then implement PX by altering DEGREE on the table and/or index. This is the more aggressive approach, and performance should be monitored to verify that overall database performance is maintained.
d. If there are few statements that could benefit from PX, tune and modify the queries using hints. This is a more conservative approach, but performance should still be monitored to verify that overall database performance is not impaired. These are likely to be long-running reports that run rather infrequently.
2. For all other tasks that run during off-peak hours:
a. Identify the tasks that are good candidates to utilize PX. These include data loading tasks as well as maintenance tasks that require considerable time to complete.
b. Verify that all of the SQL statements are fully optimized. If the queries are properly tuned and perform well, it may not be necessary to implement PX.
c. Tune SQL statements for PX and implement PX using hints.
d. For other maintenance tasks, implement PX using the directives specific to each statement.
e. Verify performance so that overall database performance is maintained. 
The optimizer may reduce the run-time, or actual, Degree of Parallelism because certain conditions are not met. Here are four of the most common reasons why this occurs, with an example of each and how it may be avoided.
• Parallel operations on too small a table. A degree of 10 is specified for a nonpartitioned table that has a total of 500 data blocks. With today’s high performance disk storage subsystems, scanning 500 blocks may require less than a second. The optimizer may determine that the overhead associated with parallelizing a query that scans 500 data blocks will take more time than performing the scan without PX (serially). Here, the overhead associated with parallelizing the query leads to the reduction in DOP. For this case, choose a Degree of Parallelism that is appropriate for the size of the table.
• Disk contention. A degree of 10 is specified for a nonpartitioned table that has a total of 20,000 data blocks. The tablespace on which the table resides has two large datafiles. The DBAs thought it would be easier to create and maintain a few large datafiles instead of many smaller ones. Additionally, the table has a total of only five extents. In this case, the optimizer will determine that 10 PX slaves operating on two datafiles will cause disk contention and will reduce the Degree of Parallelism. To avoid this problem, create many smaller tablespace datafiles across many disk mount points. Set the next extent size for the table to a small enough value so that extents will be allocated across many datafiles.
• CPU contention. A degree of 20 is specified for a partitioned table that has a total of 20,000 data blocks. The table has a total of 80 partitions and data is distributed fairly uniformly between all of the partitions (good job!). The database is hosted on a single-instance system that has a total of 4 CPUs. 
Here, the optimizer will determine that 20 PX slaves operating on 4 CPUs will cause CPU contention and will reduce the Degree of Parallelism accordingly. Values for the degree should be chosen that are reasonable for the number of CPUs on the host system. A good rule of thumb is to choose a value for DEGREE that is no greater than the number of CPUs per node on the system. If inter-operational parallelism occurs in the query, a total of 2 x CPU_COUNT slaves will be brought to bear on the query.
• Poorly partitioned table. A degree of 20 is specified for a partitioned table that has a total of 50,000 data blocks. The table is range partitioned by month, and has a total of 36 partitions. A query that will produce a quarterly report accesses data in a total of three partitions. Since partition level granularity will be used to divide the workload, the optimizer will reduce the Degree of Parallelism to 3, because only one PX slave can operate on each partition when partition level granularity is employed. To utilize the desired Degree of Parallelism, the table must be range partitioned in shorter intervals so that the three months’ worth of data resides in at least 20 partitions, and preferably a much higher number for PX to perform efficiently.
Common Parallel Data Warehouse Operations While there are many PX operations that can be performed, the ones discussed in this section typically account for over 90% of all PX operations on a data warehouse, and they are easy to implement. Parallel Query Parallel Query (PQ) was one of the first parallel operations introduced in Oracle version 7. PQ has a number of very good uses. On both partitioned and nonpartitioned tables, it can be used to perform full-table scans when all of the rows in the table must be processed. This occurs when building drill-across and summary tables, and materialized views. It is also useful for scanning tables when nonselective report queries are executed. 
On nonpartitioned tables, the scan is performed by splitting the table into multiple ranges of ROWIDs. Each PX slave operates on one or more ranges of ROWIDs. This is referred to as block range granularity. On partitioned tables, if one partition is scanned, the same paradigm is employed. If multiple partitions are scanned, then each PX slave scans one or more table partitions. This is referred to as partition level granularity. If multiple partitions are accessed within a partitioned table, PX can be employed to perform index range scans on local indexes. Here, each PX slave scans one or more index partitions. It soon becomes obvious that the design of partitioned tables and the associated index strategy will have a major impact on how well PX will perform on them. The example below shows a parallel query implemented using a query hint. To use the PARALLEL hint, specify the table and the desired degree. If the table is aliased, as in the example below, the alias must be used in the hint. In this example, a Degree of Parallelism of 8 is requested to process the s_daily_acctg table. SELECT /*+ PARALLEL(s,8) */ quarter_num, sum(acctg_num) FROM s_daily_acctg s WHERE year_num = 2002 AND acct_num = 70035 GROUP BY quarter_num; An optional third argument of the hint is available to specify the number of instances to use when Oracle Parallel Server (OPS) or Real Application Clusters (RAC) is implemented. In a single instance environment, specifying instances as in the following example has unintended results. SELECT /*+ PARALLEL(s,8,8) */ quarter_num, sum(acctg_num) FROM s_daily_acctg s WHERE year_num = 2002 AND acct_num = 70035 GROUP BY quarter_num; In a RAC environment, this query requests a total of 64 PX slaves, 8 slaves on each of 8 instances. In a single instance environment, it will still result in 64 PX slaves (8 x 8) being requested! The instances argument of the PARALLEL hint should not be used in single instance environments. 
Parallel DML Update and delete statements on partitioned tables can be parallelized when the operation includes multiple partitions. Each PX slave processes one or more partitions. Parallel DML must be explicitly enabled at the session level before executing parallel DML statements.
SQL> ALTER SESSION ENABLE PARALLEL DML;
Session altered.
Now, any DML statements executed within the session may utilize PX. Simple statements may be executed as in the following example.
SQL> UPDATE /*+ PARALLEL(dl,8) */ ddl_log dl
  2  SET status_txt = 'SUCCESS'
  3  WHERE status_ind = 1;
107288 rows updated.
SQL> COMMIT;
Commit complete.
Because parallel DML statements utilize a two-phase commit strategy, either a commit or rollback statement must be executed after each parallel DML statement completes. Any other statement executed after a parallel DML statement has completed will produce an error. In compound statements, PX can be employed in the select statement, the DML statement, or both. Each is configured and tuned separately, as shown in the following example that aggregates daily fact data into a monthly fact table. INSERT /*+ PARALLEL(s,8) */ INTO f_monthly_acctg_stage s SELECT /*+ PARALLEL(f,8) */ period_id, store_id, location_id, account_id, SUM(acctg_nmbr) FROM f_daily_acctg f WHERE period_id = 1774 GROUP BY period_id, store_id, location_id, account_id; First, the select statement is tuned and parallelized. Since the query includes an aggregation operation, two sets of eight PX slaves each will be employed in an inter-operational parallel step. In the first intra-operational step, eight slaves will scan multiple partitions of the f_daily_acctg table. These slaves will produce rows that will be consumed by the second set of eight PX slaves that perform the aggregation. The second step, in turn, produces rows that will be consumed by the eight slaves employed in the insert statement to populate the f_monthly_acctg_stage table. 
A total of 24 PX slaves will be used by this statement. By default, parallel insert statements are performed using direct path operations. These operations bypass the SGA buffer cache, so throughput is much higher than conventional path inserts. However, there are two artifacts associated with direct path inserts. Since entire data blocks are preformatted and inserted directly into datafiles, existing table blocks with free space will not be used. Additionally, any free space left in data blocks inserted by direct path inserts will never be eligible for subsequent conventional path inserts, because those blocks are never placed on the table or partition free lists. The most useful place for direct path inserts is for loading intermediate tables that will subsequently be loaded into production tables using conventional path. If the target table of a parallel insert statement is nonpartitioned, only direct path inserts can be performed. If the table is partitioned, both direct and conventional path inserts can be performed. To use conventional path inserts on partitioned tables, specify the NOAPPEND hint on the insert statement. If the insert statement is loading data into a table that can easily be recreated in the event that the database crashes and requires recovery, the NOLOGGING hint can be used to further increase data loading throughput. The nologging option causes the amount of redo information generated to be significantly reduced. This option is useful for loading temporary tables that may be part of the data warehouse data loading task. Create Table As Select This is another feature, available since Oracle 7, that is useful for manipulating table data during data warehouse loading tasks. It combines the functionality of a create table and an insert statement into a single step. The example below illustrates use of the statement. 
CREATE TABLE f_monthly_acctg_stage TABLESPACE stage PARALLEL 8 NOLOGGING AS SELECT /*+ PARALLEL(f,8) */ period_id, store_id, location_id, account_id, SUM(acctg_nmbr) acctg_nmbr FROM f_daily_acctg f GROUP BY period_id, store_id, location_id, account_id; Once again, each of the two operations is performed separately. The select statement will employ 16 PX slaves to perform the rollup using inter-operational parallelism. The create table statement will both create the table and insert the rows produced by the parallel query using direct path operations. Since this is an intermediate temporary table, the nologging clause has also been used to improve throughput. The Degree of Parallelism for the insert part of the statement is specified by the parallel 8 directive in the create table statement. It is syntactically correct to omit the DOP following the parallel directive. However, if a DOP does not follow the directive, then the statement will utilize all slaves available on the instance, up to the value of PARALLEL_MAX_SERVERS, as an unintended consequence. Be sure to always specify the requested DOP after the parallel directive in this statement. Create Materialized View Similar to the Create Table as Select statement, materialized view creation can also be parallelized. This is illustrated in the statement below. In this example, up to 12 PX threads will be exploited to read the source tables and populate the materialized view. In the select statement, the optimizer will use parallelism on the tables it determines can best benefit from it. In this case, it is likely to be the fact table, f_daily_store_sales, since the remaining dimension tables are small. PX will also be employed in the step that populates the materialized view, in the same way that the table was populated in the create table as select statement, using direct path operations. 
CREATE MATERIALIZED VIEW mv_daily_store_sales_state TABLESPACE materialized_view PARALLEL 12 PCTFREE 0 NOLOGGING BUILD IMMEDIATE REFRESH COMPLETE ENABLE QUERY REWRITE AS SELECT dp.production_dt, pr.product_id, pr.product_dsc, pr.sic_cd, pr.upc_cd, pr.size_cd, pr.product_category_cd, ds.store_id, ds.state, SUM(fss.total_sales_dol) sales, SUM(fss.total_return_dol) returns FROM f_daily_store_sales fss, d_production_day dp, d_product pr, d_store ds WHERE fss.production_day_id = dp.production_day_id AND fss.product_id = pr.product_id AND fss.store_id = ds.store_id GROUP BY dp.production_dt, pr.product_id, pr.product_dsc, pr.sic_cd, pr.upc_cd, pr.size_cd, pr.product_category_cd, ds.store_id, ds.state; Create Index Both local and global indexes may be efficiently created using PX on both nonpartitioned and partitioned tables. To build an index using PX, simply add the PARALLEL directive to the create index statement and specify the desired Degree of Parallelism. CREATE BITMAP INDEX f_store_status_IX02 ON f_comp_status(store_id) PARALLEL(DEGREE 4) NOLOGGING; Rebuild Index Likewise, PX can be used to rebuild both nonpartitioned and partitioned indexes. Add the PARALLEL directive to the alter index statement and specify the desired Degree of Parallelism. ALTER INDEX f_daily_store_sales REBUILD PARALLEL 12 NOLOGGING; When creating or rebuilding indexes, increasing the sort area size at the session level will improve performance, since more of the sorting process will occur in memory rather than on disk. Additionally, building indexes with the nologging feature increases throughput. But remember that with the nologging feature, the index cannot be recovered in the event that a database recovery becomes necessary. Another artifact of both the parallel index creation and rebuild operations is that the Degree of Parallelism on the index will be set to the degree used for the index operation. 
Even if the degree on the table is set to a value of 1, a statement running against the table that uses the index may use PX as an unintended consequence. To prevent this from occurring, always set the Degree of Parallelism to the desired value after the parallel index maintenance operation has completed, as shown in the following statement. ALTER INDEX f_daily_store_sales PARALLEL(DEGREE 1 INSTANCES 1); Gather Object Statistics Most readers are probably familiar with the ANALYZE statement used to collect statistics on tables and indexes. This statement cannot utilize PX. While there haven’t been any recent significant enhancements to this statement, the DBMS_STATS package continues to be enhanced. The DBMS_STATS package can be used to gather statistics at various levels, and can utilize the PX facility. As a result, object statistics can be gathered in a much shorter period of time compared with the ANALYZE statement. Many will find that running the DBMS_STATS procedures is more cumbersome than the ANALYZE statement, but the inconvenience is worth the flexibility and performance gains. There are many procedures within the DBMS_STATS package, but the following procedures are the basic ones for collecting statistics. In all cases, the Degree of Parallelism is specified as an argument in the procedure call. GATHER_INDEX_STATS gathers global and index statistics on an index or index partition. GATHER_TABLE_STATS is used to gather global and object level statistics on partitioned and nonpartitioned tables. An option is available to also gather statistics on all indexes and index partitions that exist on the specified table. GATHER_SCHEMA_STATS gathers global and object level statistics on all tables and indexes in a specified schema. Statistics should be gathered exclusively with either the ANALYZE statement or DBMS_STATS. 
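As a sketch, a parallel table-level collection might look like the following. The schema and table names are hypothetical examples; DEGREE supplies the Degree of Parallelism and CASCADE requests the optional index statistics mentioned above.

```sql
-- Gather table and index statistics using 8 PX slaves
-- (schema and table names are hypothetical)
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => 'DWH',
    tabname => 'F_DAILY_ACCTG',
    degree  => 8,        -- Degree of Parallelism for the collection
    cascade => TRUE);    -- also gather statistics on the table's indexes
END;
/
```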
When migrating from ANALYZE to the DBMS_STATS method of statistics collection, drop all of the statistics collected with the ANALYZE statement before using DBMS_STATS. Performance Monitoring The final step in achieving optimal PX performance across the database instance is to monitor performance. Oracle provides several dynamic performance views for this purpose. The older views begin with V$PQ while the newer ones begin with V$PX. The older views date back to Oracle 7, which supported few parallel features, one of which was parallel query, hence the V$PQ designation. When capabilities were added beginning with version 8 to include parallel execution of other operations, the overall facility was named PX, and the V$PX views were added. Both sets of views provide useful information. There is, however, some redundancy between several V$PQ and V$PX views. Monitoring the PX Message Pool After configuring PX for the first time on a database instance, or when configuration changes have been made, the size of the PX message pool should be monitored. The following query indicates that the size of the message pool is approximately 40MB and is located in the large pool. From the query, we can also determine that the PARALLEL_AUTOMATIC_TUNING configuration parameter is set to a value of TRUE, because the message pool is located in the large pool. If the value of PARALLEL_AUTOMATIC_TUNING were set to FALSE, then the query would have indicated that the message pool was located in the shared pool.

SQL> break on pool skip 1
SQL> col bytes format 9,999,999,999
SQL> col name format a25
SQL> SELECT pool, name, SUM(bytes) bytes
  2  FROM v$sgastat
  3  WHERE pool = 'large pool'
  4  GROUP BY ROLLUP(pool,name);

POOL        NAME                               BYTES
----------- ------------------------- --------------
large pool  PX msg pool                   41,179,660
            free memory                    1,299,988
                                          42,479,648

Message buffer usage can be monitored by querying the V$PX_PROCESS_SYSSTAT view as follows. 
SQL> SELECT statistic, value bytes
  2  FROM v$px_process_sysstat
  3  WHERE statistic LIKE 'Buffers%';

STATISTIC                               BYTES
------------------------------ --------------
Buffers Allocated                   1,246,409
Buffers Freed                       1,246,409
Buffers Current                             0
Buffers HWM                             3,141

Multiplying the value for Buffers HWM (high water mark) by the value of PARALLEL_EXECUTION_MESSAGE_SIZE will yield the highest usage of PX message buffer space. Using an 8KB message size for this example, the highest amount of memory used by message buffers was 25MB out of roughly 40MB of total space available. After periodically monitoring this view over days or weeks, the size of the message pool should be adjusted to correct any major discrepancies. It is prudent to maintain at least a 25% margin between the highest message buffer space usage and the total amount of PX message pool space allocated. Monitoring the Overall PX Facility V$PX_PROCESS_SYSSTAT shows overall statistics for the PX facility. The server statistics show the number of PX slaves that are in use and how many are available in the PX slave pool at a point in time. Here, the terms slave and server are synonymous. Slaves that are in use are currently under the command of a Query Coordinator and cannot be used for new parallel operations. Slaves that are available are idle in the slave pool and may be acquired for parallel operations. The total number of slaves in use can be as large as the parameter value of PARALLEL_MAX_SERVERS. If the sum of the values of Servers In Use and Servers Available does not equal the value of PARALLEL_MAX_SERVERS, it means that the value of PARALLEL_MIN_SERVERS is less than the value of PARALLEL_MAX_SERVERS. In such a case, additional slaves will be dynamically created when a Query Coordinator requests them and none are available in the slave pool. 
SQL> SELECT *
  2  FROM v$px_process_sysstat
  3  WHERE statistic LIKE 'Servers%';

STATISTIC                           VALUE
------------------------------ ----------
Servers In Use                         38
Servers Available                       2
Servers Started                       820
Servers Shutdown                      386
Servers Highwater                      40
Servers Cleaned Up                      0
Server Sessions                    67,372

If the value for Servers Shutdown is high, it indicates that many slaves were dynamically destroyed to reduce the total number of slaves in the slave pool to the value of PARALLEL_MIN_SERVERS. If there is sufficient memory available, consider increasing the value of PARALLEL_MIN_SERVERS to avoid the overhead associated with dynamically creating and destroying PX slaves. V$SYSSTAT contains some useful statistics about the types and numbers of statements parallelized. The total number of data flow operations parallelized is shown on the second line of the output. Typically, one inter-operational parallel operation results in one data flow operation (DFO). If the number of DFOs is substantially higher than the sum of the remaining three rows, then many queries have been parallelized that contain multiple join operations, or compound statements such as INSERT INTO … SELECT …, CREATE TABLE … AS SELECT, or CREATE MATERIALIZED VIEW.

SQL> SELECT name, value
  2  FROM v$sysstat
  3  WHERE name LIKE '%parallel%'
  4  order by name;

NAME                                VALUE
------------------------------ ----------
DDL statements parallelized           360
DFO trees parallelized               7425
DML statements parallelized            26
queries parallelized                 6998

The following query against V$SYSSTAT is useful for determining when parallel operations were unable to acquire the total number of slaves that were requested. The total number of slaves at run-time will be less than the requested Degree of Parallelism if there are an insufficient number of slaves available when PARALLEL_MIN_PERCENT is set to a nonzero value, or when PARALLEL_ADAPTIVE_MULTI_USER is set to a value of TRUE. 
Downgrading can often explain erratic performance of the same query that executes many times over the course of a day. When the query requires a longer time to execute, it has likely acquired fewer slaves than requested.

SQL> SELECT name, value
  2  FROM v$sysstat
  3  WHERE name LIKE 'Parallel%'
  4  ORDER BY name;

NAME                                               VALUE
--------------------------------------------- ----------
Parallel operations downgraded 1 to 25 pct            18
Parallel operations downgraded 25 to 50 pct            4
Parallel operations downgraded 50 to 75 pct            0
Parallel operations downgraded 75 to 99 pct            0
Parallel operations downgraded to serial             108
Parallel operations not downgraded                  7425

At the next lower level of the PX facility, V$PX_PROCESS shows the status of each PX slave. The number of slaves in the pool will range from the value of PARALLEL_MIN_SERVERS to the value of PARALLEL_MAX_SERVERS. Note that the SID and SERIAL# are assigned only when the slave is in use.

SQL> SELECT *
  2  FROM v$px_process
  3  ORDER BY server_name;

SERV STATUS           PID SPID             SID    SERIAL#
---- --------- ---------- --------- ---------- ----------
P000 IN USE            21 20559             34       9438
P001 IN USE            22 20561             50      24257
P002 IN USE            23 20563             84        540
P003 IN USE            24 20565             26      10661
P004 IN USE            25 20567             18      27455
…
P038 AVAILABLE         64 22264
P039 AVAILABLE         65 22266

Moving to the SQL statement level, V$PX_SESSION can be used to determine how many slaves have been acquired by each Query Coordinator. The results of the following query indicate that there are 9 sessions associated with the first Query Coordinator and 17 sessions associated with each of the last two. Since V$PX_SESSION also includes a row for the Query Coordinator itself, the number of PX slaves being used by each QC is one less than the number reported in the COUNT(*) column.

SQL> SELECT qcsid, COUNT(*)
  2  FROM v$px_session
  3  GROUP BY qcsid;

     QCSID   COUNT(*)
---------- ----------
        45          9
        73         17
        74         17

Summary

PX architecture is a logical extension of the core database architecture.
When a parallel SQL statement begins to execute, it borrows PX slaves from a common pool available to all database users. When the statement completes, the slaves are returned to the pool for use by other processes.

There are many considerations about the host and database that affect overall performance of the PX facility. Prior to setting up PX, the database and host configuration should be optimized to take full advantage of it. PX is one application that can take full advantage of a well-designed, high-throughput disk subsystem. Proper database physical design is another important factor that influences optimal PX performance.

Many database configuration parameters control the configuration and performance of the PX facility. Besides the obvious parameters that begin with the word PARALLEL, there are many others, often overlooked, that can lead to less than optimal performance or even statement failure.

Once the PX facility has been configured, it can generally be implemented by one of two methods. The first is to set the Degree of Parallelism on tables and indexes to a value greater than one; statements will then automatically begin using PX. The greatest risk of this approach is that some poorly performing statements may degrade overall database performance. Additionally, some statements may require tuning to take full advantage of PX. The recommended approach is to choose the statements that can benefit most from PX and tune them manually with query hints. In either case, statements running against the candidate tables should be optimized prior to using PX. You may find that performance is quite adequate without it.

The most common data warehouse operations that can take advantage of PX include parallel query, parallel DML, CREATE TABLE ... AS SELECT, and materialized view creation.
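The two implementation methods just described can be sketched as follows. This is an illustrative example only, using a hypothetical SALES table and an arbitrary degree of 8:

```sql
-- Method 1: set a default Degree of Parallelism on the object;
-- qualifying statements against it then use PX automatically
ALTER TABLE sales PARALLEL (DEGREE 8);

-- Method 2: leave the object serial and parallelize only
-- selected statements with a query hint
SELECT /*+ PARALLEL(s, 8) */ s.region, SUM(s.amount)
  FROM sales s
 GROUP BY s.region;
```

The hint-based method confines the cost of PX to the statements chosen to use it, which is why it carries less risk than a table-level default.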
For maintenance operations, index creation and rebuilds and object statistics gathering can also be parallelized.

Once the database is up and running with PX, it is important to monitor the overall health of the PX facility. The V$PX_PROCESS_SYSSTAT view provides a high-level view: it can be used to monitor the size of the PX message pool, to verify that the pool resides in its intended location, and to summarize message buffer and PX slave allocation. The V$SYSSTAT view is useful for determining the types of PX operations that have been run, as well as for identifying operations that did not acquire all of the slaves requested. The V$PX_PROCESS and V$PX_SESSION views show details about PX statements that are currently executing.

With a good understanding of the PX architecture as it relates to the database and host environments, and with the PX facility properly configured and monitored, PX can be used effectively and efficiently to scale data warehouse throughput.