Getting the Best Out of Your Data Warehouse: Performance Tips
Scott McKay
WhereScape Limited
Abstract
Performance. It could almost always be better, but is it good enough? This
can be a difficult question to answer in a data warehouse environment
where much of the load may consist of ad hoc queries. However, poor
performance can turn an otherwise successful data warehousing project
into a failure. Data volumes and the expectations of data warehouses are
ever increasing, so performance considerations can’t be ignored.
This paper focuses on the design and setup of a dimensional data
warehouse to get the best out of it. It is intended to dispel a few myths
about data warehouse environments and provide a few practical
performance tips. The tips are based on a number of years' experience as
an Oracle DBA, combined with two years specializing in data warehouse
development and performance.
Overview
Performance is often found wanting after a project gets implemented and data
volumes start growing. There can be a number of reasons for this. Whatever the
reasons, they often stem from performance not being designed in from the
beginning.
It is too easy to blame issues on the hardware, or to purchase bigger hardware
as a solution to the problem. Often this is nothing more than a temporary
band-aid, as it masks underlying issues that will get worse as data volumes grow.
Getting a data warehouse to successfully handle today's ever increasing data
volumes and complex analytical needs is often a balancing act between query
performance, load times and the ever important cost.
Common Myths
1. “A data warehouse is read only so write performance is not
important.”
There are two good reasons why write performance is critical to a data
warehouse environment:


- The data has to be loaded into the database. This may seem obvious, but
  it often seems to be forgotten when designing or purchasing the I/O
  subsystem. As data warehouses grow bigger and more complex, the time
  required for data loading grows larger. At the same time companies are
  becoming larger and more global, so the actual window available for
  loading the data is shrinking. This compounds the issue, and having the
  data available when the business would like it is often a challenge in
  mature data warehouse environments.
- Query performance. This may not be quite as obvious, until you consider
  the types of queries that a data warehouse is generally expected to
  answer. A typical data warehouse query might be: "Show me how my
  sales have been trending over the last 5 years" or "How have my
  inventory costs been tracking?". Both of these would require the database
  to sort and aggregate potentially large volumes of data. If the data volume
  is large then these operations will happen in the temporary segments on
  disk. Sorts on disk are relatively write intensive and will certainly have a
  performance impact if write performance is a bottleneck.
2. “Full table scans are bad.”
Full table scans may be bad in a particular case. In another case, a full table
scan may be the best way of achieving the desired result. Consider the
Inventory query mentioned above. This may involve reading all of the records in
a table which contains a snapshot of your entire inventory each month, for
several years. To see how your inventory cost is trending, you would probably
have to read all the records in the table and a full table scan is the best way of
doing this.
What about a scenario where you might have three years' worth of inventory
snapshots and you want to look at the past year? You need to view one third of
the table. Using an index on a date field might seem good because there are
two thirds of the table that you don't need to read. This seems logical, except
that it depends on the work required to get the table rowid from the index. For a
standard b-tree index the database may have to read an average of 2 or 3 blocks
to traverse the index to get to the required entry, and then perform another block
read for the rowid lookup. In this case a full table scan would be much faster. If
you were using a bitmap index, however, then it would quite likely be faster than
either of the other approaches.
A common data loading approach is to load a subset of data each night into load
and/or stage tables for processing before loading fact tables. In this case you
want to be using full table scans for the processing of your load and stage tables
because you are processing all of the rows in the tables.
3. “Additional Indexes Improve Performance.”
This sounds similar to myth number 2, and in some cases the reasoning is the
same, but it is worth a separate mention because it is so common. I have seen
several major performance issues fixed by actually removing indexes from
tables. In some cases they may have made an improvement on the particular
query, or data load which was being worked on at the time, but have had adverse
effects on other parts of the system. The reasons for this are generally as
discussed above.
Another potential issue with over-indexing is the adverse effect that it has on
inserting data into a table. In a data warehouse load, thousands of records are
often being inserted at once. In this case it can often be better to drop the
indexes on the table before loading and then recreate them at the end of the load
process. While index creation can be a relatively expensive operation, the time
for this is often significantly less than the negative impact of the database
maintaining these indexes while inserting the data.
Bitmap indexes are particularly expensive to maintain when inserting data, but
can be quicker to create, so it is definitely worth considering dropping any bitmap
indexes on a table before loading any data into it, and then recreating them after
the data is loaded.
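A minimal sketch of this pattern follows; the table and index names are hypothetical, and the exact steps will depend on your own load process:

    -- Drop the bitmap index before the bulk load (hypothetical names).
    DROP INDEX fact_sales_cust_bix;

    -- Direct-path insert from a stage table into the fact table.
    INSERT /*+ APPEND */ INTO fact_sales
    SELECT * FROM stage_sales;
    COMMIT;

    -- Recreate the bitmap index after the load completes.
    CREATE BITMAP INDEX fact_sales_cust_bix
        ON fact_sales (customer_key)
        NOLOGGING;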
It should be noted that adding an additional index to a table can be exactly what
is required. However, it should only be done after weighing the negative effects
it may cause against the other options.
Application Design Tips
1. Use a Dimensional Star Schema Design
This is described in many texts and is a common standard for data warehouse
design, so I won’t describe it here (Refer to The Data Warehouse Toolkit by
Ralph Kimball). It is a simple design which is easy to model and performs well for
queries.
2. Generate artificial keys for dimension table joins.
This can greatly simplify the development of queries over the star schema
because all of the table joins have been simplified to single column joins, and
nulls have been given a value. Performance can also be better because any index
on this key contains just a single column.
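A minimal sketch of generating such a key with a sequence (the table and column names here are hypothetical):

    -- Sequence used to generate the artificial (surrogate) key.
    CREATE SEQUENCE customer_key_seq;

    -- Assign a new surrogate key to each incoming dimension row.
    INSERT INTO dim_customer (customer_key, customer_code, customer_name)
    SELECT customer_key_seq.NEXTVAL, customer_code, customer_name
    FROM   stage_customer;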
3. Partition Tables that are likely to grow large.
This can have significant benefits for both query performance and load
performance.
Range partitioning by month is often the best for both scenarios in a Data
Warehouse for the following reasons:
- Queries often have date ranges or restrictions, so partitioning by a date
  field lends itself to query performance in many cases.
- As data loads are generally done over only the most recent data, or
  recently changed data, the data load can be written to only work on the
  related partitions. This can have significant benefits for very large tables
  where indexes are dropped before the load and rebuilt after. If local
  indexes are used, then the indexes on the unaffected partitions can
  remain untouched.
- Another big advantage is that often a certain "window" of data is required
  to be kept. In these cases the old partitions can be dropped instantly
  rather than requiring expensive deletes, as sketched after this list.
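A minimal sketch of a monthly range-partitioned fact table (the names and dates are hypothetical):

    CREATE TABLE fact_sales (
        sale_date     DATE,
        customer_key  NUMBER,
        product_key   NUMBER,
        sale_amount   NUMBER
    )
    PARTITION BY RANGE (sale_date) (
        PARTITION p200301 VALUES LESS THAN (TO_DATE('2003-02-01','YYYY-MM-DD')),
        PARTITION p200302 VALUES LESS THAN (TO_DATE('2003-03-01','YYYY-MM-DD')),
        PARTITION p200303 VALUES LESS THAN (TO_DATE('2003-04-01','YYYY-MM-DD'))
    );

    -- Dropping an expired month is near instantaneous compared to a delete.
    ALTER TABLE fact_sales DROP PARTITION p200301;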
4. Use Bitmap indexes on Fact tables for Dimension table joins.
Dimensions are generally orders of magnitude smaller than the fact tables that
they relate to. This means that each dimension key value is generally repeated a
large number of times in the fact table. Bitmap indexes are ideally suited to this
type of join.
In rare cases where a large dimension contains almost as many rows as the
related fact table (i.e. the dimension key values are relatively unique in the fact
table), a standard b-tree index may perform better.
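For example (hypothetical names), a bitmap index on a fact table's dimension key column might be created as:

    CREATE BITMAP INDEX fact_sales_prod_bix
        ON fact_sales (product_key);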
5. Use Materialized views to create generic aggregate tables.
Materialized views, when used in conjunction with the query rewrite feature in
Oracle, can significantly improve query performance. If the materialized view is
kept generic then it can potentially improve any number of queries.
An example of this might be a sales fact table that shows sales by customer,
product and month. A materialized view which selects all the columns from this
table except product could then be automatically used by the optimizer for any
query which does not reference product (e.g. “Who is my best customer?”).
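A minimal sketch of such a materialized view (hypothetical names; query rewrite also requires the query_rewrite_enabled initialization parameter to be set to true):

    -- Aggregate away the product dimension so the optimizer can rewrite
    -- any query that does not reference product to use this view.
    CREATE MATERIALIZED VIEW mv_sales_cust_month
        BUILD IMMEDIATE
        REFRESH COMPLETE ON DEMAND
        ENABLE QUERY REWRITE
    AS
    SELECT customer_key, month_key, SUM(sale_amount) AS sale_amount
    FROM   fact_sales
    GROUP  BY customer_key, month_key;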
Initial System Setup
Large Block size
Usually set to 32KB for Oracle. The database is generally dealing with large data
sets and large transactions so performance is improved by having a large block
size. There may be some advantage in having a tablespace with a smaller block
size for any particularly small dimensions which exist.
Large pga_aggregate_target (or sort_area_size).
Sorts on disk are expensive and a data warehouse is generally dealing with
much larger data sets than a transactional system. There may be some
advantage in setting workarea_size_policy to manual within a job, and then setting
large sort_area_size, hash_area_size and bitmap_merge_area_size parameters.
These parameters can all be modified using the “alter session” command.
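A minimal sketch of doing this inside a load job (the sizes shown are hypothetical and should be tuned to your system):

    ALTER SESSION SET workarea_size_policy = manual;
    ALTER SESSION SET sort_area_size = 104857600;          -- 100 MB
    ALTER SESSION SET hash_area_size = 104857600;          -- 100 MB
    ALTER SESSION SET bitmap_merge_area_size = 33554432;   --  32 MB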
Use a smaller number of large rollback segments rather than a lot of
little ones.
“Snapshot too old” errors are common when the initial size of the rollback
segments is too small.
Use local, temporary and read only tablespaces where applicable.
Local tablespaces in particular can improve the speed of data loads, because a lot
of extent management may be required during the dropping and creating of
indexes before and after the load. For the same reason, uniform extent sizes within
a particular tablespace can lead to a significant overall disk space saving. Having 3
different uniform extent sizes (small, medium and large) may be useful.
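A minimal sketch of a locally managed tablespace with uniform extents (the file path and sizes are hypothetical):

    CREATE TABLESPACE dw_medium
        DATAFILE '/u02/oradata/dwt/dw_medium01.dbf' SIZE 2000M
        EXTENT MANAGEMENT LOCAL UNIFORM SIZE 10M;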
Use the Keep buffer pool.
This is particularly useful for dimensions and will give significant performance
benefits over a single very large buffer pool. The reason for this becomes clear
when you consider the type of tables in a data warehouse. Dimension tables are
generally small and the same dimensions are constantly referenced by different
queries.
Fact tables, by contrast, can be very large and are often accessed in isolation. If
your fact tables share the default buffer pool with your dimension tables then
the dimension blocks may be getting constantly aged out. A keep buffer pool
large enough to hold all your dimension tables and the indexes on them is a good
idea. You can then assign your dimensions and dimension indexes to the keep
pool using the “alter table” or “alter index” commands.
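A minimal sketch of the assignment (hypothetical object names; the keep pool itself is sized with the buffer_pool_keep or db_keep_cache_size initialization parameter):

    -- Cache a small, hot dimension table and its index in the keep pool.
    ALTER TABLE dim_customer STORAGE (BUFFER_POOL KEEP);
    ALTER INDEX dim_customer_pk STORAGE (BUFFER_POOL KEEP);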
Table Statistics
Analyze database tables using the dbms_stats database package. This allows
the database to use the cost based optimizer and all of the Oracle performance
features which have been introduced since Oracle 7.
Table statistics also provide much more accurate statement cost details when
doing explain plans.
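A minimal sketch of gathering statistics on one table (the schema and table names are hypothetical):

    BEGIN
        dbms_stats.gather_table_stats(
            ownname => 'DW',
            tabname => 'FACT_SALES',
            cascade => TRUE);   -- also gather statistics on the indexes
    END;
    /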
Diagnosing Existing Performance Problems
Statspack
Oracle Statspack is a great utility for diagnosing existing performance problems
and it is shipped free with the database. It works by taking snapshots of various
performance metrics and storing them in database tables. The supplied
statspack report can then be used to give a picture of what has happened
between any two snapshots taken. For a new system or a system experiencing
performance issues then running a Statspack snapshot at 6 hour intervals can be
particularly useful.
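A minimal sketch of taking and scheduling snapshots (run as the PERFSTAT user; the 6 hour interval matches the suggestion above):

    -- Take a single snapshot immediately.
    EXECUTE statspack.snap;

    -- Schedule a snapshot every 6 hours using dbms_job.
    VARIABLE jobno NUMBER
    BEGIN
        dbms_job.submit(:jobno, 'statspack.snap;', SYSDATE, 'SYSDATE + 6/24');
        COMMIT;
    END;
    /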
The report can be done to a number of different levels of detail, but the beginning
of the report in all cases provides a useful overview of system performance
statistics, such as shared pool statistics, buffer pool statistics and IO metrics. Also
provided is a formatted summary of the System Wait Events from the
v$system_event view.
System Wait Events
These are often overlooked, but can generally provide a very quick indication of
where the bottlenecks are in a system. It is usually the best place to start.
Searching Oracle metalink for a particular wait event from statspack will generally
return a useful note which not only describes what the event is in some detail, but
also provides suggestions as to how to reduce the waits for the event. An
example of the wait events shown by Statspack is shown in Appendix 1.
Session Wait Events
If a specific process or time of day is causing performance issues then Session
Wait Events from the v$session_wait view can be particularly useful. If looking in
real time, this view can be queried for a particular SID several times, which gives
a good indication of where most of the time is going.
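A minimal sketch of such a query (re-run it several times for the session of interest):

    SELECT event, p1, p2, wait_time, seconds_in_wait
    FROM   v$session_wait
    WHERE  sid = :sid;   -- the SID of the session being investigated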
When trying to look at what the issues were for historical periods of bad
performance, the Statspack report can provide a summary of the Session
Waits occurring at the time of a snapshot.
Tuning Individual Statements
Tuning system parameters can have a dramatic effect on systems where
there is a fundamental setup issue. However, this is not usually the case. Most
performance problems can be attributed to poorly designed or poorly written SQL
statements.
Selecting from v$sqlarea at any particular time and ordering by buffer_gets or
disk_reads descending can provide an instant picture of what the most resource
intensive statements in the system are.
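A minimal sketch of finding the top statements by logical reads:

    -- Top 10 statements by buffer gets; order by disk_reads instead
    -- to focus on physical I/O.
    SELECT *
    FROM  (SELECT buffer_gets, disk_reads, executions, sql_text
           FROM   v$sqlarea
           ORDER  BY buffer_gets DESC)
    WHERE  ROWNUM <= 10;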
Tuning the poor performing statements can often have the biggest impact on
overall system performance, because it can free up a lot of resources that may
have otherwise been causing a bottleneck.
Hints
SQL hints are a great way of changing the execution plan of a statement to make
it perform more efficiently. There are a large number of different types of hints
available, so I won’t go into detail on them here.
Using an explain plan utility you can quickly and easily try different hints and see
the effect that they have on the execution path of the statement.
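For example (a hypothetical statement), a single hint can switch a plan from an index lookup to a full table scan:

    -- The FULL hint asks the optimizer to full-scan fact_sales ("s")
    -- instead of using any index on sale_date.
    SELECT /*+ FULL(s) */ SUM(s.sale_amount)
    FROM   fact_sales s
    WHERE  s.sale_date >= TO_DATE('2002-01-01', 'YYYY-MM-DD');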
Rewrite the Statement
The performance of statements can change dramatically when a statement is
rewritten to return the same result set via different methods. An example of this
may be changing a statement which uses a "where not in" clause to a "where not
exists", or to a "minus". Each of these methods could be made to return the same
result set, but may do it via quite different execution paths.
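A minimal sketch of the three formulations (hypothetical tables; note that "not in" and "not exists" only return the same rows when the compared column contains no nulls):

    -- 1. WHERE NOT IN
    SELECT customer_code FROM stage_customer
    WHERE  customer_code NOT IN (SELECT customer_code FROM dim_customer);

    -- 2. WHERE NOT EXISTS
    SELECT s.customer_code FROM stage_customer s
    WHERE  NOT EXISTS (SELECT 1 FROM dim_customer d
                       WHERE  d.customer_code = s.customer_code);

    -- 3. MINUS
    SELECT customer_code FROM stage_customer
    MINUS
    SELECT customer_code FROM dim_customer;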
Add Indexes
While over-indexing tables is usually not a good idea, adding an index to improve
query performance is still often the best way. Generally, if the column or
combination of columns used for a join or lookup is unique or very selective,
then adding an index on it should not have any adverse effects on any other
queries. If the column or columns are not very selective, then an index on them
can result in a large range scan and may often perform worse.
If there are other indexes on the table where range scans are performed, then
other queries may also change their plans to use any additional index created.
Conclusions
There are a number of common mistakes made with regards to data warehouse
system setup. Many can be attributed to myths (or misunderstandings) about the
role of the data warehouse and its requirements for performing that role.
Application design is one of the keys to avoiding performance issues later on.
Generally, for a data warehouse, a dimensional star schema provides a robust
and simple design which performs well for data warehouse type queries.
Following good practices when purchasing and setting up a data warehouse
system is also important.
If performance problems do occur then Oracle provides a large number of tools
for diagnosing and solving these problems with the database. Key among these
are Statspack, the System and Session Wait tables and the explain plan utility.
Most problems stem from poorly performing individual SQL statements, and the
biggest performance gains can usually be obtained by tuning individual
statements. Oracle’s query hints are particularly useful for this.
Performance is something that should be considered and designed in from the
beginning of any system implementation. When issues occur it is generally best
to follow a pragmatic approach to diagnosing what is causing the problems,
rather than just spending money on more powerful hardware.
References
Oracle Metalink: http://metalink.oracle.com
Oracle 9i Database Utilities, Release 2 (9.2), March 2002, Oracle Corporation
Oracle 9i Database Performance Tuning Guide and Reference, Release 2 (9.2),
March 2002, Oracle Corporation
The Data Warehouse Lifecycle Toolkit, Ralph Kimball et al, 1998, John Wiley and
Sons.
Appendix 1: Statspack System Wait Events
Wait Events for DB: DWT  Instance: DWT  Snaps: 283 - 327
-> s  - second
-> cs - centisecond -     100th of a second
-> ms - millisecond -    1000th of a second
-> us - microsecond - 1000000th of a second
-> ordered by wait time desc, waits desc (idle events last)

                                                                   Avg
                                                     Total Wait   wait    Waits
Event                               Waits   Timeouts   Time (s)   (ms)     /txn
---------------------------- ------------ ---------- ---------- ------ --------
PL/SQL lock timer                  12,604     12,537    194,811  15456      0.0
db file sequential read         9,250,844          0     67,709      7     29.8
db file parallel write            127,206          0     56,831    447      0.4
db file scattered read          4,480,652          0     54,684     12     14.4
SQL*Net message from dblink     1,926,862          0     39,846     21      6.2
direct path read                2,325,280          0     26,306     11      7.5
direct path write               2,300,831          0     22,017     10      7.4
free buffer waits                  23,111     20,728     21,509    931      0.1
SQL*Net more data from dblin    7,139,501          0     17,017      2     23.0
async disk IO                     200,718          0      7,305     36      0.6
log file parallel write         1,031,109          0      5,634      5      3.3
write complete waits                4,502      4,215      4,291    953      0.0
enqueue                             2,624      1,181      3,862   1472      0.0
log file sequential read          102,660          0      3,641     35      0.3
local write wait                    4,301      2,914      3,018    702      0.0
buffer busy waits                  58,778        607      1,943     33      0.2
log file sync                     109,580         93      1,849     17      0.4
PX Deq: Execute Reply               1,703        120        763    448      0.0
control file parallel write        34,068          0        732     21      0.1
log file switch completion          2,639        286        496    188      0.0
log buffer space                    2,957         55        445    151      0.0
latch free                        179,875      7,075        390      2      0.6
row cache lock                      5,052         69        263     52      0.0
control file sequential read       58,803          0        259      4      0.2
single-task message                   363          0        140    385      0.0
db file parallel read               4,417          0        117     26      0.0
inactive session                       60         60         59    986      0.0
library cache load lock               134          4         16    121      0.0
process startup                       240          0         12     51      0.0
SQL*Net more data to client       132,054          0         10      0      0.4
LGWR wait for redo copy            66,473         73          6      0      0.2
log file single write                 708          0          3      4      0.0
PX Deq Credit: send blkd           12,955          0          3      0      0.0
SQL*Net message to dblink       1,926,862          0          3      0      6.2
library cache pin                     461          0          3      6      0.0
PX Deq: Table Q Get Keys              190          0          1      4      0.0
SQL*Net break/reset to clien          991          0          1      1      0.0
PX Deq: Parse Reply                   224          0          0      2      0.0
PX Deq Credit: need buffer          7,416          0          0      0      0.0
PX Deq: Signal ACK                     79          2          0      2      0.0
kksfbc child completion                 9          9          0     13      0.0
PX Deq: Join ACK                      231          0          0      0      0.0
undo segment extension            119,908    119,904          0      0      0.4
wait list latch free                    3          0          0     18      0.0
SQL*Net break/reset to dblin            4          0          0      9      0.0
slave TJ process wait                   1          1          0     18      0.0
PX Deq: Table Q qref                   69          0          0      0      0.0
kkdlgon                                 1          0          0      2      0.0
SQL*Net more data to dblink            14          0          0      0      0.0