Optimizing SQL
Database Systems – SQL
SQL OPTIMIZATION
Writing efficient queries requires understanding what affects the performance of your
query.
In general, hardware can have a profound effect on performance.
There are some basic physical constraints that databases must work around, especially
with non-solid-state storage. Solid-state drives change the figures below, and their
specifications are constantly improving.
Disk Seeks – Average 10 ms, i.e., about 100 seeks per second. The only real optimization
is to distribute data across multiple disks; per-disk seek time is hard to improve.
Disk Reading & Writing – 10 to 20 MB per second. Again, the best way to optimize is to
distribute data across multiple disks.
Disk Spindles – The greater the number of spindles, the greater the opportunity for a
database to read/write in parallel.
CPU Cycles – Data must be processed. Smaller data fits in memory and is thus faster to
process.
Database Systems – SQL
SQL OPTIMIZATION
A bigger issue is your database design and how your query executes.
Let’s start by understanding how Postgres executes a query.
1. Transmitting the SQL string to the database backend.
2. Parsing the query string
3. Planning of the query to optimize the retrieval of data
4. Retrieval of data from the hardware
5. Transmission of the data to the client.
http://www.revsys.com/writings/postgresql-performance.html. Note that the source
contains additional comments on general DB tuning, which we skip because we do not
have control over our database environment.
Database Systems – SQL
SQL OPTIMIZATION
Transmitting the SQL string to the database backend.
This is not a particularly lengthy step; it simply requires the characters you typed to be
transferred to the database. If a query is exceptionally long, it may be placed in a
stored procedure to save the transmission time.
Parsing the query string
The text you typed must be broken into tokens. Again, if this takes longer than desired,
a stored procedure may be used to save the parsing time.
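For example, wrapping a long query in a function stores its text in the database, so
each call transmits and parses only a short string. A minimal sketch, assuming the
authors/books tables used later in these slides (the function name is invented for
illustration):

CREATE FUNCTION get_titles_by_author(int4) RETURNS SETOF varchar AS $$
  -- The long query text lives server-side; only the short call below
  -- is transmitted and parsed on each invocation.
  SELECT title FROM books WHERE author_id = $1;
$$ LANGUAGE SQL;

SELECT * FROM get_titles_by_author(16);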
Planning of the query to optimize the retrieval of data
If your query is already prepared, this step can be skipped, which can save a lot of
time. Otherwise the planner must determine the best way to execute the query: should
it use an index (or several), or would a hash join be more efficient?
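Planning cost can be paid once per session with a prepared statement. A minimal
sketch (the statement name by_author is arbitrary; the tables are from the EXPLAIN
example that follows):

PREPARE by_author (int4) AS
  SELECT authors.name, books.title
  FROM books, authors
  WHERE books.author_id = $1 AND authors.id = books.author_id;

-- The plan is built once at PREPARE time; each EXECUTE reuses it.
EXECUTE by_author(16);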
Database Systems – SQL
SQL OPTIMIZATION
Retrieval of data from the hardware
There isn’t much you can do here other than improving your hardware.
Transmission of the data to the client.
There isn’t much you can do here other than to minimize the number of columns
returned in the result set as well as the number of rows.
Database Systems – SQL
OPTIMIZATION
The more extensive your permissions are, the less optimized your database will be.
Table-level permissions, column-level permissions, resource counting, etc. can be
problematic if you have a large number of statements being executed. However, if you
have a few queries that each run over a large amount of data, this factor may not be
as significant.
USING EXPLAIN
Using EXPLAIN helps you understand how your query executes. It shows the order in
which tables are joined and which indexes are used to join them. If you notice joins on
unindexed fields, you can index them to improve performance.
To execute an EXPLAIN simply type:
EXPLAIN SqlQuery;
Database Systems – SQL
OPTIMIZATION
Suppose that you have the SELECT statement shown here and that you plan to
examine it using EXPLAIN ANALYZE:
CREATE TABLE authors (
    id   int4 PRIMARY KEY,
    name varchar
);
CREATE TABLE books (
    id        int4 PRIMARY KEY,
    author_id int4,
    title     varchar
);
Database Systems – SQL
OPTIMIZATION
Try analyzing the following query:
EXPLAIN ANALYZE SELECT authors.name, books.title FROM books, authors
WHERE books.author_id = 16 AND authors.id = books.author_id ORDER BY books.title;
We get:
QUERY PLAN
--------------------------------------------------------------------------------
Sort  (cost=29.71..29.73 rows=6 width=64) (actual time=0.189..16.233 rows=7 loops=1)
  Sort Key: books.title
  ->  Nested Loop  (cost=0.00..29.63 rows=6 width=64) (actual time=0.068..0.129 rows=7 loops=1)
        ->  Index Scan using authors_pkey on authors  (cost=0.00..5.82 rows=1 width=36) (actual time=0.029..0.033 rows=1 loops=1)
              Index Cond: (id = 16)
        ->  Seq Scan on books  (cost=0.00..23.75 rows=6 width=36) (actual time=0.026..0.052 rows=7 loops=1)
              Filter: (author_id = 16)
Total runtime: 16.386 ms
Database Systems – SQL
OPTIMIZATION
Read from the bottom up
We first see the complete query time:
Total runtime: 16.386 ms
Postgres then performs a sequential scan on the books table, filtering for the rows that
have an author_id of 16:
  ->  Seq Scan on books  (cost=0.00..23.75 rows=6 width=36) (actual time=0.026..0.052 rows=7 loops=1)
        Filter: (author_id = 16)
Although no explicit index was created on the authors table, an implicit one exists due
to the primary key, so Postgres uses it to select the author whose key equals 16:
  ->  Index Scan using authors_pkey on authors  (cost=0.00..5.82 rows=1 width=36) (actual time=0.029..0.033 rows=1 loops=1)
        Index Cond: (id = 16)
Database Systems – SQL
OPTIMIZATION
The final results are sorted by the books' titles:
Sort  (cost=29.71..29.73 rows=6 width=64) (actual time=0.189..16.233 rows=7 loops=1)
  Sort Key: books.title
  ->  Nested Loop  (cost=0.00..29.63 rows=6 width=64) (actual time=0.068..0.129 rows=7 loops=1)
Note that the estimated and actual times are listed in parentheses.
Let’s add an index
CREATE INDEX books_idx1 on books(author_id);
If you rerun the EXPLAIN query, would you expect the performance to improve?
It will not, until you run:
ANALYZE books;
However, you are still not assured the index will be used. If there are only a small
number of records in books, Postgres may still perform a sequential scan.
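With enough rows in books, rerunning the EXPLAIN after the ANALYZE would be
expected to replace the sequential scan with something like the following (illustrative
output, not from an actual run; costs omitted):

  ->  Index Scan using books_idx1 on books  (cost=...)
        Index Cond: (author_id = 16)

Note that the Filter line becomes an Index Cond: the rows are located through the
index rather than checked one by one.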
Database Systems – SQL
OPTIMIZATION
What are other reasons an index might not be used in Postgres?
Problem: The planner has decided it is faster to do a table scan than an index scan. This can
happen if (a) your table is relatively small, or (b) the field you are indexing has a lot of
duplicates. Boolean fields, for instance, are not terribly useful to index, since roughly 50% of
your data is one value and 50% the other. They are, however, good candidates for partial
indexes, e.g., to index only the data that is active (see the sketch at the end of this slide).
Problem: You set up an index that is incompatible with how you are actually filtering the field.
There are a couple of variants of this situation. LIKE '%me' can never use an index, but
LIKE 'me%' possibly can. Another is the upper/lower trap: you defined your index as
CREATE INDEX idx_faults_name ON faults USING btree(fault_name);
but you are running a query like
SELECT * FROM faults WHERE UPPER(fault_name) LIKE 'CAR%';
Possible fix:
CREATE INDEX idx_faults_name ON faults USING btree(upper(fault_name));
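A minimal sketch of the partial-index idea from the first problem above, assuming a
boolean active column on the faults table (the column is invented for illustration):

CREATE INDEX idx_faults_active_name ON faults (fault_name) WHERE active;

-- The index contains only the active rows, so it stays small, even though
-- indexing the boolean column by itself would be nearly useless.
SELECT * FROM faults WHERE active AND fault_name = 'CARBURETOR';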
Database Systems – SQL
OPTIMIZATION
For this example (run in MySQL), make the following assumptions:
The columns being compared have been declared as follows:

Table  Column      Data Type
tt     ActualPC    CHAR(10)
tt     AssignedPC  CHAR(10)
tt     ClientID    CHAR(10)
et     EMPLOYID    CHAR(15)
do     CUSTNMBR    CHAR(15)

The tables have the following indexes:

Table  Index
tt     ActualPC
tt     AssignedPC
tt     ClientID
et     EMPLOYID (primary key)
do     CUSTNMBR (primary key)
Database Systems – SQL
OPTIMIZATION / mySQL
The tt.ActualPC values are not evenly distributed.
Initially, before any optimizations have been performed, the EXPLAIN
statement produces the following information:
table  type  possible_keys                 key   key_len  ref   rows  Extra
et     ALL   PRIMARY                       NULL  NULL     NULL  74
do     ALL   PRIMARY                       NULL  NULL     NULL  2135
et_1   ALL   PRIMARY                       NULL  NULL     NULL  74
tt     ALL   AssignedPC,ClientID,ActualPC  NULL  NULL     NULL  3872  Range checked for each record (key map: 35)
Because type is ALL for each table, this output indicates that MySQL is
generating a Cartesian product of all the tables.
For the case at hand, this product is 74 × 2135 × 74 × 3872 =
45,268,558,720 rows.
Database Systems – SQL
OPTIMIZATION / mySQL
One problem here is that MySQL can use indexes on columns more efficiently if they
are declared as the same type and size.
In this context, VARCHAR and CHAR are considered the same if they are declared as
the same size.
tt.ActualPC is declared as CHAR(10) and et.EMPLOYID is CHAR(15), so there is a
length mismatch.
To fix this disparity between column lengths, use ALTER TABLE to lengthen ActualPC
from 10 characters to 15 characters.
ALTER TABLE tt MODIFY ActualPC VARCHAR(15);
Database Systems – SQL
OPTIMIZATION / mySQL
Now tt.ActualPC and et.EMPLOYID are both VARCHAR(15).
Executing the EXPLAIN statement again produces this result:
table  type    possible_keys                 key      key_len  ref          rows  Extra
tt     ALL     AssignedPC,ClientID,ActualPC  NULL     NULL     NULL         3872  Using where
do     ALL     PRIMARY                       NULL     NULL     NULL         2135  Range checked for each record (key map: 1)
et_1   ALL     PRIMARY                       NULL     NULL     NULL         74    Range checked for each record (key map: 1)
et     eq_ref  PRIMARY                       PRIMARY  15       tt.ActualPC  1
This is not perfect, but is much better: the product of the rows values is smaller by a
factor of 74. This version executes in a couple of seconds.
Database Systems – SQL
OPTIMIZATION / mySQL
A second alteration can be made to eliminate the column length mismatches for the
tt.AssignedPC = et_1.EMPLOYID and tt.ClientID = do.CUSTNMBR comparisons:
ALTER TABLE tt MODIFY AssignedPC VARCHAR(15),
               MODIFY ClientID   VARCHAR(15);
After that modification, EXPLAIN produces the output shown here:
table  type    possible_keys                 key       key_len  ref            rows  Extra
et     ALL     PRIMARY                       NULL      NULL     NULL           74
tt     ref     AssignedPC,ClientID,ActualPC  ActualPC  15       et.EMPLOYID    52    Using where
et_1   eq_ref  PRIMARY                       PRIMARY   15       tt.AssignedPC  1
do     eq_ref  PRIMARY                       PRIMARY   15       tt.ClientID    1
Database Systems – SQL
OPTIMIZATION / mySQL
At this point, the query is optimized almost as well as possible.
The remaining problem is that, by default, MySQL assumes that values in the
tt.ActualPC column are evenly distributed, and that is not the case for the tt table.
Fortunately, it is easy to tell MySQL to analyze the key distribution:
ANALYZE TABLE tt;
With the additional index information, the join is perfect and EXPLAIN produces this
result:
Database Systems – SQL
OPTIMIZATION / mySQL
table  type    possible_keys                 key      key_len  ref            rows  Extra
tt     ALL     AssignedPC,ClientID,ActualPC  NULL     NULL     NULL           3872  Using where
et     eq_ref  PRIMARY                       PRIMARY  15       tt.ActualPC    1
et_1   eq_ref  PRIMARY                       PRIMARY  15       tt.AssignedPC  1
do     eq_ref  PRIMARY                       PRIMARY  15       tt.ClientID    1
Note that the rows column in the output from EXPLAIN is an educated guess from
the MySQL join optimizer.
You should check whether the numbers are even close to the truth by comparing the
rows product with the actual number of rows that the query returns.
If the numbers are quite different, you might get better performance by using
STRAIGHT_JOIN in your SELECT statement and trying to list the tables in a different
order in the FROM clause.
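A minimal sketch of forcing the join order (abbreviated to two of the tables; the column
list is illustrative):

SELECT STRAIGHT_JOIN tt.ActualPC, et.EMPLOYID
FROM tt, et
WHERE tt.ActualPC = et.EMPLOYID;

-- STRAIGHT_JOIN makes MySQL join the tables strictly in the order they
-- appear in the FROM clause, overriding the optimizer's own choice.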
Database Systems – SQL
OPTIMIZATION / Postgres
These issues do not seem to exist in Postgres.
EXPLAIN SELECT tt.ActualPC, et.EmployID, tt.ClientID
FROM tt, et, et AS et_1, doo
WHERE tt.ActualPC = et.EMPLOYID
  AND tt.AssignedPC = et_1.EMPLOYID
  AND tt.ClientID = doo.CUSTNMBR;
QUERY PLAN
----------------------------------------------------------------------------
Hash Join  (cost=65.49..160.23 rows=1495 width=38)
  Hash Cond: (tt.clientid = doo.custnmbr)
  ->  Hash Join  (cost=5.29..77.60 rows=1495 width=38)
        Hash Cond: (tt.assignedpc = et_1.employid)
        ->  Hash Join  (cost=2.64..54.40 rows=1495 width=49)
              Hash Cond: (tt.actualpc = et.employid)
              ->  Seq Scan on tt  (cost=0.00..30.59 rows=1659 width=33)
              ->  Hash  (cost=1.73..1.73 rows=73 width=16)
                    ->  Seq Scan on et  (cost=0.00..1.73 rows=73 width=16)
        ->  Hash  (cost=1.73..1.73 rows=73 width=16)
              ->  Seq Scan on et et_1  (cost=0.00..1.73 rows=73 width=16)
  ->  Hash  (cost=33.98..33.98 rows=2098 width=16)
        ->  Seq Scan on doo  (cost=0.00..33.98 rows=2098 width=16)
Database Systems – SQL
OPTIMIZATION / Postgres
Why so many sequential scans?
In reality, a few thousand records isn't many. The query optimizer decided that at this
size the overhead of index scans wasn't worth it.
Database Systems – SQL
SQL OPTIMIZATION
SPEEDING UP SELECTS
When the data stored in a database changes, the statistics used to plan queries
are not updated automatically. Therefore, use the ANALYZE command on each table to
keep query plans efficient.
WHERE CLAUSE OPTIMIZATION
First note that any optimizations for the WHERE clause of a SELECT query also work
for the WHERE clauses of DELETE and UPDATE queries.
Database Systems – SQL
SQL OPTIMIZATION
Examples of queries that are very fast:
SELECT COUNT(*) FROM tbl_name;
SELECT MIN(key_part1),MAX(key_part1) FROM tbl_name;
SELECT MAX(key_part2) FROM tbl_name WHERE key_part1=constant;
SELECT ... FROM tbl_name ORDER BY key_part1,key_part2,... LIMIT 10;
SELECT ... FROM tbl_name ORDER BY key_part1 DESC, key_part2 DESC, ... LIMIT 10;
These are actually statements about MySQL, but the same should hold in most
modern relational databases.
Database Systems – SQL
SQL OPTIMIZATION
INSERT
The time required for inserting a row is determined by the following factors, where the
numbers indicate approximate proportions:
Connecting: (3)
Sending query to server: (2)
Parsing query: (2)
Inserting row: (1 × size of row)
Inserting indexes: (1 × number of indexes)
Closing: (1)
Database Systems – SQL
SQL OPTIMIZATION
INSERT
If you are inserting many rows from the same client at the same time, use INSERT
statements with multiple VALUES lists to insert several rows at a time. This is
considerably faster (many times faster in some cases) than using separate single-row
INSERT statements.
When loading a table from a text file, use COPY FROM. This is usually significantly
faster than using INSERT statements.
See http://www.postgresql.org/docs/8.1/static/sql-copy.html.
While outside of the scope of what I wish to cover, if you are loading a large amount of
data in proportion to the current size of the table, it may be more efficient to drop the
indexes, load the file, and then reinstate the indexes.
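A minimal COPY sketch (the file path is a placeholder; the server-side form requires a
file readable by the database server):

-- Server-side bulk load from a tab-delimited text file:
COPY books FROM '/tmp/books.txt';

-- Or load a client-side file through psql's \copy, which reads the
-- local file and streams it to the server:
-- \copy books FROM 'books.txt'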
Database Systems – SQL
SQL OPTIMIZATION
INSERT
To speed up INSERT operations that are performed with multiple statements, lock your
table for the duration of the transaction:
BEGIN WORK;
LOCK TABLE a IN ACCESS EXCLUSIVE MODE;
INSERT INTO a VALUES (1,23),(2,34),(4,33);
INSERT INTO a VALUES (8,26),(6,29);
COMMIT WORK;
http://www.postgresql.org/docs/8.1/static/sql-lock.html
There is no LOCK TABLE in the SQL standard, which instead uses SET
TRANSACTION to specify concurrency levels on transactions. PostgreSQL supports that
too; see SET TRANSACTION for details.
Copying a file to a table is still faster than the method demonstrated above.
Database Systems – SQL
SQL OPTIMIZATION
GENERAL TIPS
Use persistent connections to the database to avoid connection overhead.
Run ANALYZE and VACUUM to update the statistics and reclaim deleted space after a
lot of rows are removed from a table.
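For example, to do both in one pass on a single table (the table name is illustrative):

VACUUM ANALYZE books;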
One recommendation I do not agree with:
Try to keep column names simple. For example, in a table named customer, use a
column name of name instead of customer_name. I DISAGREE!
To make your names portable to other SQL servers, you should keep them shorter
than 18 characters.
Database Systems – SQL
SQL TRANSACTIONS
It is often important to ensure a series of statements occur as an atomic unit or do not
occur at all.
For example, if you wanted to transfer money from one account to another, you would
not want the removal of the funds from one account to occur without the depositing of
those funds in the second account. If something happened to prevent the depositing of
the funds, then you would want the withdrawal cancelled.
This is accomplished through the use of transactions.
In PostgreSQL, a transaction is set up by surrounding the SQL commands of the
transaction with BEGIN and COMMIT commands. The basic syntax is:
BEGIN;
Any SQL commands you wish to execute atomically
COMMIT;
http://www.postgresql.org/docs/8.3/static/tutorial-transactions.html
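So the banking transfer described above might look like this (the accounts table and
its columns are assumed for illustration):

BEGIN;
-- Withdraw the funds from one account...
UPDATE accounts SET balance = balance - 100.00 WHERE owner = 'Alice';
-- ...and deposit them in the other. If anything fails before COMMIT,
-- both updates are rolled back together.
UPDATE accounts SET balance = balance + 100.00 WHERE owner = 'Bob';
COMMIT;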
Database Systems – SQL
SQL TRANSACTIONS
Note that some statements cannot be rolled back; these are typically ones that alter
the definition of the database or table structure.
If such statements are included within a transaction and another statement in the
transaction fails, a full rollback to the beginning of the transaction cannot occur.
Transactions can be broken up so that you can roll back to a specific point within the
transaction, using the SAVEPOINT command:
SAVEPOINT identifier
ROLLBACK TO SAVEPOINT identifier
RELEASE SAVEPOINT identifier
http://www.postgresql.org/docs/8.1/static/sql-savepoint.html
http://developer.postgresql.org/pgdocs/postgres/sql-rollback-to.html
http://www.postgresql.org/docs/8.1/static/sql-release-savepoint.html
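A minimal sketch of savepoint usage (reusing the table a from the earlier locking
example):

BEGIN;
INSERT INTO a VALUES (1, 10);
SAVEPOINT sp1;
INSERT INTO a VALUES (2, 20);
-- Undo only the work done since the savepoint; the first INSERT survives.
ROLLBACK TO SAVEPOINT sp1;
COMMIT;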
Database Systems – SQL
VIEWS
A view in SQL is a great way to present information to a user in a different way from
the logical table structure.
You might do this to limit access to certain fields, e.g., Social Security Number or
date of birth.
You might wish to derive a field, e.g., age from date of birth.
You might wish to denormalize a series of tables and remove the ID fields so they do
not confuse someone generating reports from the data.
Database Systems – SQL
VIEWS
The actual syntax has more options than this, but a simple form for creating a view is
as follows:
CREATE VIEW ViewName AS SelectStatement;
So if you had the table on the left, you could create a view that shows the data on
the right:

BaseballPlayers table               BaseballPlayers2 view
LastName   FirstName  DOB           LastName   FirstName  Age
Jeter      Derek      5/6/1974      Jeter      Derek      33
Schilling  Kurt       4/4/1961      Schilling  Kurt       46
Ruth       Babe       1/1/1900      Ruth       Babe       Really Old
The BaseballPlayers2 view can be created with the following statement (using age()
to compute the age from the date of birth):
CREATE VIEW BaseballPlayers2 AS
SELECT LastName, FirstName, extract(year FROM age(DOB)) AS Age
FROM BaseballPlayers;
http://www.postgresql.org/docs/8.1/static/sql-createview.html
Database Systems – SQL
VIEWS
A view name cannot be the same as the name of a table that already exists.
Views must have unique column names.
Columns selected can be column names or expressions. If you create an expression,
name it!
A view can be created from many kinds of SELECT statements. It can refer to base
tables or other views. It can use joins, UNION, and subqueries.
Currently, views are read only: the system will not allow an insert, update, or delete
on a view. You can get the effect of an updatable view by creating rules that rewrite
inserts, etc. on the view into appropriate actions on other tables. For more information
see CREATE RULE.
http://www.postgresql.org/docs/8.1/static/sql-createview.html
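A minimal sketch of such a rule (the rule name is invented; it rewrites a DELETE on
the BaseballPlayers2 view into a DELETE on the base table):

CREATE RULE bp2_delete AS
ON DELETE TO BaseballPlayers2
DO INSTEAD
  DELETE FROM BaseballPlayers
  WHERE LastName = OLD.LastName AND FirstName = OLD.FirstName;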
Database Systems – SQL
VIEWS
Use the DROP VIEW statement to drop views.
Be careful that the names and types of the view's columns will be assigned the way
you want. For example,
CREATE VIEW vista AS SELECT 'Hello World';
is bad form in two ways: the column name defaults to ?column?, and the column data
type defaults to unknown. If you want a string literal in a view's result, use something
like
CREATE VIEW vista AS SELECT text 'Hello World' AS hello;
Database Systems – SQL
DATE/TIME
WORKING with DATETIME, DATE, and INTERVAL VALUES
Types
DATETIME or TIMESTAMP
Structured "real" date and time values, containing year, month, day, hour, minute,
second and millisecond for all useful date & time values (4713 BC to over 100,000
AD).
DATE
Simplified integer-based representation of a date defining only year, month, and day.
INTERVAL
Structured value showing a period of time, including any/all of years, months, weeks,
days, hours, minutes, seconds, and milliseconds. "1 day", "42 minutes 10 seconds",
and "2 years" are all INTERVAL values.
Database Systems – SQL
DATE/TIME
WORKING with DATETIME, DATE, and INTERVAL VALUES
Which do I want to use: DATE or TIMESTAMP? I don't need minutes or hours
in my value
That depends. DATE is easier to work with for arithmetic (e.g., something recurring
at a random interval of days), takes less storage space, and doesn't trail "00:00:00"
strings you don't need when printed. However, TIMESTAMP is far better for real
calendar calculations (e.g., something that happens on the 15th of each month or the
2nd Thursday of leap years). More below.
1. The difference between two TIMESTAMPs is always an INTERVAL
TIMESTAMP '1999-12-30' - TIMESTAMP '1999-12-11' = INTERVAL '19 days'
2. You may add or subtract an INTERVAL to a TIMESTAMP to produce another
TIMESTAMP
TIMESTAMP '1999-12-11' + INTERVAL '19 days' = TIMESTAMP '1999-12-30'
3. You may add or subtract two INTERVALS
INTERVAL '1 month' + INTERVAL '1 month 3 days' = INTERVAL '2 months 3 days'
Database Systems – SQL
DATE/TIME
4. The difference between two DATEs is always an INTEGER, representing the
number of days' difference
DATE '1999-12-30' - DATE '1999-12-11' = INTEGER 19
5. You may add or subtract an INTEGER to a DATE to produce another DATE
DATE '1999-12-11' + INTEGER 19 = DATE '1999-12-30'
Database Systems – SQL
DATE/TIME
Selected date/time functions, with examples and results:

age(timestamp, timestamp) – Subtract arguments, producing a "symbolic" result that uses years and months.
  age(timestamp '2001-04-10', timestamp '1957-06-13') = 43 years 9 mons 27 days
age(timestamp) – Subtract from current_date.
  age(timestamp '1957-06-13') = 43 years 8 mons 3 days
current_date – Today's date; see Section 9.9.4.
current_time – Time of day; see Section 9.9.4.
current_timestamp – Date and time; see Section 9.9.4.
date_part(text, timestamp) – Get subfield (equivalent to extract); see Section 9.9.1.
  date_part('hour', timestamp '2001-02-16 20:38:40') = 20
date_part(text, interval) – Get subfield (equivalent to extract); see Section 9.9.1.
  date_part('month', interval '2 years 3 months') = 3
date_trunc(text, timestamp) – Truncate to specified precision; see also Section 9.9.2.
  date_trunc('hour', timestamp '2001-02-16 20:38:40') = 2001-02-16 20:00:00
extract(field from timestamp) – Get subfield; see Section 9.9.1.
  extract(hour from timestamp '2001-02-16 20:38:40') = 20
extract(field from interval) – Get subfield; see Section 9.9.1.
  extract(month from interval '2 years 3 months') = 3
isfinite(timestamp) – Test for finite time stamp (not equal to infinity).
  isfinite(timestamp '2001-02-16 21:28:30') = true
isfinite(interval) – Test for finite interval.
  isfinite(interval '4 hours') = true
justify_hours(interval) – Adjust interval so 24-hour time periods are represented as days.
  justify_hours(interval '24 hours') = 1 day
justify_days(interval) – Adjust interval so 30-day time periods are represented as months.
  justify_days(interval '30 days') = 1 month
localtime – Time of day; see Section 9.9.4.
localtimestamp – Date and time; see Section 9.9.4.
now() – Current date and time (equivalent to current_timestamp); see Section 9.9.4.
timeofday() – Current date and time; see Section 9.9.4.