IMPROVING QUERY OPTIMIZATION OF MYSQL
Samit Wangnoo
B.S., Jammu University, India, 2006
PROJECT
Submitted in partial satisfaction of
the requirements for the degree of
MASTER OF SCIENCE
in
COMPUTER SCIENCE
at
CALIFORNIA STATE UNIVERSITY, SACRAMENTO
FALL
2009
IMPROVING QUERY OPTIMIZATION OF MYSQL
A Project
by
Samit Wangnoo
Approved by:
__________________________________, Committee Chair
Dr. William J Mitchell
___________________________________, Second Reader
Dr. Scott Gordon
______________________________
Date
Student: Samit Wangnoo
I certify that this student has met the requirements for format contained in the University
format manual, and that this Project is suitable for shelving in the Library and credit is to
be awarded for the Project.
______________________________, Graduate Coordinator
Dr. Cui Zhang
Department of Computer Science
_________________
Date
Abstract
of
IMPROVING QUERY OPTIMIZATION OF MYSQL
by
Samit Wangnoo
The intent of the project is to study MySQL's design limitations and tradeoffs and to produce a solution for how to tackle them, thereby improving the efficiency of MySQL queries and making the system faster.
The goal of the project is to show how we can reduce query execution time by using benchmarking while creating a database, and by following specific optimization rules to get more efficient output in less time. Working on these limitations and finding alternatives will help improve the efficiency of MySQL as a whole. The project is based on benchmarking, which is intended to measure the runtime performance of scalar expressions; this has significant implications for the way it is used and the way the results are interpreted. Benchmarking is done on the application and database to find out where the bottlenecks are. After fixing one bottleneck (or replacing it with a “dummy” module), we can proceed to identify the next bottleneck. Even if the overall performance of the application is currently acceptable, we need to at least make a plan for each bottleneck and decide how to solve it.
______________________________, Committee Chair
Dr. William J Mitchell
_______________________________
Date
TABLE OF CONTENTS
List of Figures
Chapter
1. INTRODUCTION
2. DATABASE SYSTEM ARCHITECTURE
2.1 MySQL Architecture
2.2 Features of MySQL
2.3 Limitations and Tradeoffs of MySQL
3. SCHEMA OPTIMIZATION
3.1 Hardware
3.2 Software
3.3 Designing Database to Improve Speed
3.4 Advantages of Using Clustering
4. OPTIMIZATION
4.1 Bottlenecks
4.2 Measuring Performance
4.3 Optimizing Hardware for MySQL
4.4 Optimizing Disks
4.5 Optimizing OS
4.6 Choosing API
4.7 Compiling and Installing MySQL
4.8 Execution Path of a Query
4.9 Query Analyzer
4.10 Optimizing Environmental Parameters
4.11 Optimizing Table Structure
4.12 Optimizing How to Load Tables
4.13 Optimizing Queries
4.14 Example
5. CONCLUSION
Bibliography
LIST OF FIGURES
Figure
1. Primary Components of MySQL Architecture
2. Architecture of MyODBC
3. Execution Path of a Query
4. Query Analyzer SQL Error and Warning Counts
5. Snapshot of the Resultant Table Before Using Hints
6. Snapshot of the Resultant Table After Using Hints
Chapter 1
INTRODUCTION
MySQL is the most popular open source relational database management system based on SQL (Structured Query Language). A relational database management system is a program that lets you create, update, and administer a relational database. It stores data in separate tables rather than putting all data in a common repository, which improves the speed and flexibility of the database. MySQL is developed by MySQL AB, a commercial company that provides services for the MySQL database [1].
MySQL is a proven and cost-effective database solution that helps reduce the cost of database software infrastructure [1]. MySQL reduces the Total Cost of Ownership (TCO) of database software by reducing database licensing costs, lowering hardware expenditure, and reducing administration, engineering, and support costs. MySQL is easy to install and deploy, reliable and available, and it can be embedded as a library [2].
MySQL has become the most popular open source database because of its consistently fast performance, high reliability, and ease of use. Many of the world's largest and fastest growing organizations use MySQL, such as Yahoo!, Alcatel-Lucent, Google, Nokia, YouTube, and Booking.com. MySQL is part of the famous open source stack LAMP (Linux, Apache, MySQL, PHP/Perl/Python). MySQL is flexible, since it can work on different platforms like Linux, Windows, Unix, BSD (Mac OS X), and handhelds (Windows CE). MySQL gives both open source freedom and 24x7 support.
The new MySQL Enterprise improves application performance and efficiency for critical deployments using the new Query Analyzer feature. The MySQL Query Analyzer helps improve application performance by monitoring query performance and accurately pinpointing the code that causes poor performance and slows down the system. This helps DBAs work on the queries that make the database slower. With the help of the MySQL Query Analyzer, a DBA can tune the code by continuously monitoring and fine-tuning the queries, thus helping achieve peak performance on MySQL.
The project shows how to reduce query execution time by using benchmarking, and it works on MySQL's limitations and finds alternatives that can help improve the efficiency of MySQL as a whole. The project is based on benchmarking. Benchmarking the application and database helps in finding the bottlenecks in the project. After fixing one bottleneck (or replacing it with a “dummy” module), we can proceed to identify the next bottleneck. Even if the overall performance of the application is currently acceptable, we need to at least make a plan for each bottleneck and decide how to solve it. Benchmarking helps in designing and selecting the proper components for a successful MySQL evaluation by working through basic functional testing and evaluating best practices. Benchmarking can help optimize the design of the database and the query processing time.
Hence, the objective of the project is to suggest rules and hints that help in optimizing queries, thereby making the system faster and more efficient. This, in effect, saves processing time, resource time, and the overall cost of the system.
Chapter 2
DATABASE SYSTEM ARCHITECTURE
According to Techotopia, “Database Management System (DBMS) is a software system that facilitates the creation, maintenance and use of an electronic database” [4]. Relational database management systems (RDBMS) implement the relational model of tables and relationships. A DBMS is classified in two ways, known as Shared-File and Client-Server. A Shared-File DBMS interacts directly with the underlying database files; it is used mostly on desktop computers and for databases that do not need much storage space. The Client-Server model is divided into two components. The server component resides on the physical computer with the database files and is responsible for all interaction with the database. The other component is the client, which sends requests to the server; the server processes the requests and sends the results back to the client [4].
The typical example of a shared-file DBMS is Microsoft Access, and of a client-server DBMS, MySQL. A client-server DBMS is more efficient than a shared-file one. Among the advantages of a client-server DBMS: the client does not have to reside on the same computer as the server. It can send requests from any computer on the network to a server that can be on a remote host. Since the server is on a remote host, the server is invisible to all its clients, making the database available to more users than a shared-file DBMS. Separating the client from the server also increases the range of client types. Clients can be written in any programming language, such as C, C++, or Java, or web-based applications can be developed using PHP and JSP.
MySQL has a unique pluggable storage engine architecture that gives the user the flexibility to choose from a variety of purpose-specific storage engines. Some of the available engines include InnoDB, MyISAM, and NDB. The most popular engine is InnoDB, which is a general purpose engine. Oracle has acquired the InnoDB engine and has continued to develop and maintain it. The other popular database engine is MyISAM; this engine is particularly popular in web-based environments and has been present from the beginning. It provides high performance, but it does not support transactions, so it is not an ideal engine to choose where transactional guarantees are needed. The other well-known database engine is NDB. It is used for MySQL Cluster only and is particularly popular in financial applications [4].
2.1 MySQL Architecture
Understanding the MySQL database architecture is important in order to efficiently manage the MySQL database. The important components of the MySQL architecture are shown in Figure 1, which is taken from the MySQL and Sun website [3].
[Figure 1 depicts the primary components of the MySQL architecture: Connectors (native C API, JDBC, ODBC, .NET, PHP, Perl, Python, Ruby, Cobol); the MySQL Server, comprising the Connection Pool (authentication, thread reuse, connection limits, memory checks, caches), Management Services and Utilities (backup and recovery, security, replication, cluster, administration, configuration, migration, and metadata), the SQL Interface (DML, DDL, stored procedures, views, triggers, etc.), the Parser (query translation, object privileges), the Optimizer (access paths, statistics), and Caches & Buffers (global and engine-specific); and the Pluggable Storage Engines (MyISAM, InnoDB, NDB, Archive, Federated, Memory, Merge).]
Figure 1: Primary Components of MySQL Architecture [3]
2.1.1 Primary Components of MySQL Architecture:
MySQL provides a lot of different storage engines, depending on the requirements. MySQL is flexible: a single database can use a mixture of storage engines and table types at the same time.
Database storage engines and table types are responsible for storing and retrieving information; the database storage engine lies at the heart of your MySQL installation. When it comes to picking a specialized storage engine or table type, MySQL offers database designers and administrators a lot of choices [3]. MySQL supports both transaction-safe and non-transaction-safe tables.
Advantages of Transaction-Safe Tables:
i. They allow execution of multiple SQL statements in a single operation.
ii. If a transaction fails, it can be restored to the previous safe state by using recovery logs and backups.
Advantages of Non-Transaction-Safe Tables:
i. They are much faster and use less disk space, as there is no transaction overhead.
ii. They use less memory to perform updates.
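To illustrate this flexibility, here is a minimal sketch (table and column names are invented for illustration) that creates a transaction-safe InnoDB table and a non-transaction-safe MyISAM table side by side in the same database:

CREATE TABLE account (
  account_id INT NOT NULL AUTO_INCREMENT,
  balance DECIMAL(10,2) NOT NULL DEFAULT '0.00',
  PRIMARY KEY (account_id)
) ENGINE=InnoDB;    -- transaction-safe: supports COMMIT/ROLLBACK

CREATE TABLE page_hits (
  hit_time DATETIME NOT NULL,
  url VARCHAR(255) NOT NULL
) ENGINE=MyISAM;    -- non-transaction-safe: faster, no transaction overhead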
2.1.2 Famous Pluggable Storage Engines in MySQL:
(a) InnoDB: InnoDB is MySQL's high performance, transaction-safe engine. It provides advanced multi-user concurrency capabilities by allowing you to lock individual rows rather than entire tables. This engine is recommended where a table needs support for transactions. It is used when we are storing very large amounts of data in a table. The only overhead of this engine is that it consumes more disk space than any other engine [6].
Properties of InnoDB:
i. InnoDB tables support transactions, ACID, foreign key referential-integrity constraints, and data checksums.
ii. Supports different isolation modes.
iii. It does not support key compression.
iv. Both data and indexes are cached in memory.
v. InnoDB tables are clustered using primary keys.
(b) MyISAM: The MyISAM table type is mature, stable, and simple to manage. MyISAM is a direct descendant of the ISAM database engine. MyISAM was the original product from MySQL, and it is the default engine on many platforms. The default MySQL engine is fast, compressible, and FULLTEXT-searchable. Its advantages include high speed operations that do not require the integrity of transactions. MyISAM boosts system response by letting administrators place their data and index files in separate directories, which can be stored on different disk drives [6].
Properties of MyISAM:
i. No data gets corrupted even if there is a power failure.
ii. It has a small disk and memory footprint.
iii. Only indexes are cached by MySQL, so it saves a lot of cache and buffer space.
iv. It allows table locks and also allows concurrent inserts [14].
(c) Memory / Heap: The Memory engine was previously known as Heap. These tables are memory-based, extremely fast, and easy to configure, letting developers leverage the benefits of in-memory processing via a standard SQL interface. HEAP tables use hashed indexes and are stored in memory. This makes them very fast, but because they are stored in memory, if your MySQL server crashes, you will lose all data stored in them. This makes them unsuitable for everyday storage. However, HEAP tables are useful for temporary tables, such as those that contain real-time statistics that are calculated anew each time the web page that displays them is loaded.
Merge: MERGE tables are two or more identical MyISAM tables joined by a UNION statement. You can only select, delete, and update from the collection of tables. If you drop the MERGE table, you are dropping only the MERGE specification. That means the MERGE table no longer exists, but the MyISAM tables it was constructed from, and the data in them, are still intact. The most common reason to use MERGE tables is to get more speed. You can split a big, read-only table into several parts, and then put the different table parts on different disks. This results in faster access times, and therefore more efficient searches. Also, if you know exactly what you are looking for within the split parts, you can search in just one of the split tables for some queries; or, if you need to search the entire table, use the MERGE table to access the parts as a whole. It is faster to repair the individual files that are mapped to a MERGE file than to try to repair one huge file [5].
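As a minimal sketch of this technique (table names are hypothetical), two identical MyISAM tables are combined under one MERGE table:

CREATE TABLE log2008 (msg VARCHAR(255)) ENGINE=MyISAM;
CREATE TABLE log2009 (msg VARCHAR(255)) ENGINE=MyISAM;

-- The MERGE table maps onto both parts; INSERT_METHOD=LAST sends
-- new rows to the last table in the UNION list.
CREATE TABLE log_all (msg VARCHAR(255))
  ENGINE=MERGE UNION=(log2008, log2009) INSERT_METHOD=LAST;

SELECT COUNT(*) FROM log_all;   -- searches both parts as a whole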
(d) Federated: MySQL version 5.0.3 introduced a new storage engine designed for distributed computing. Specifying the Federated option when creating your table tells MySQL that the table actually resides on another database server. MySQL uses a URL-like formatted string in the COMMENT portion of your CREATE TABLE statement to tell MySQL the location of the remote table. MySQL reads the SQL statements, performs some internal transformations, and then accesses the Federated table via its client API. The results are then presented to you as if the query had been executed locally.
(e) NDB: MySQL's new clustering capabilities rely on the NDB storage engine. By
spreading the data and processing load over multiple computers, clustering can
greatly improve performance and reliability.
(f) Archive: This engine is used chiefly for creating large, compressed tables that will not need to be searched via indexes. In addition, you cannot use any data alteration statements, such as UPDATE or DELETE, although you are free to use SELECT and INSERT.
(g) CSV: By creating comma-separated values files (.csv), this storage engine makes it very easy to feed other applications that consume these kinds of files with MySQL-based data.
2.1.3 Connectors/API: Connectors provide database application developers and third-party tools with packaged libraries of standards-based functions to access MySQL. These libraries range from Open Database Connectivity (ODBC) technology through Java and .NET-aware components. By using the ODBC connector to MySQL, any ODBC-aware client application (for example, Microsoft Office, report writers, Visual Basic) can connect to MySQL without knowing the vagaries of any MySQL-specific keyword restrictions, access syntax, and so on; it's the connector's job to abstract this complexity into an easily used, standardized interface.
MySQL AB and several third parties provide application programming interface (API) libraries to let developers write client applications in a wide variety of programming languages, like C (provided automatically with MySQL), C++, Eiffel, .NET, Perl, PHP, Python, Ruby, and Tcl.
2.1.4 Parser: The parser decomposes the SQL commands it receives from calling programs into a form that can be understood by the MySQL engine. The objects that will be used are identified, along with the correctness of the syntax. The parser also checks the objects being referenced to ensure that the privilege level of the calling program allows it to use them [6].
2.1.5 SQL Interface: The SQL interface provides the mechanism to receive commands
and transmit results to the user. The SQL interface is built on ANSI SQL standards.
2.1.6 Cache and Buffers: Caches and buffers help ensure that recently referenced data will be available in the cache for a quick response.
2.1.7 Connectors: MySQL connectors are drivers that provide connectivity to the MySQL server for client programs. The current MySQL connectors are:
a) Connector/ODBC: Connector/ODBC provides driver support for connecting to a MySQL server using the Open Database Connectivity (ODBC) API. Support is available for ODBC connectivity from Windows, Unix, and Mac OS X platforms.
MyODBC Architecture: The MyODBC architecture is based on five components: Application, Driver Manager, DSN Configuration, Connector/ODBC, and MySQL Server. The MyODBC architecture shown in Figure 2 is taken from the MySQL Reference Manual [7].
Figure 2: Architecture of MyODBC [7]
• Application: The Application uses the ODBC API to access the data from the MySQL server. The ODBC API communicates with the Driver Manager. The Application communicates with the Driver Manager using standard ODBC calls. The Application does not care where the data is stored, how it is stored, or even how the system is configured to access the data. It needs to know only the Data Source Name (DSN).
A number of tasks are common to all applications, no matter how they use ODBC. These tasks are:
- Selecting the MySQL server and connecting to it
- Submitting SQL statements for execution
- Retrieving results (if any)
- Processing errors
- Committing or rolling back the transaction enclosing the SQL statement
- Disconnecting from the MySQL server
Because most data access work is done with SQL, the primary tasks for applications that use ODBC are submitting SQL statements and retrieving any results generated by those statements.
• Driver Manager: The Driver Manager is a library that manages communication between the application and the driver or drivers. It performs the following tasks:
- Resolves Data Source Names (DSN). The DSN is a configuration string that identifies a given database driver, database, database host, and optionally authentication information that enables an ODBC application to connect to a database using a standardized reference. Because the database connectivity information is identified by the DSN, any ODBC-compliant application can connect to the data source using the same DSN reference. This eliminates the need to separately configure each application that needs access to a given database; instead, you instruct the application to use a pre-configured DSN.
- Loads and unloads the driver required to access a specific database as defined within the DSN. For example, if you have configured a DSN that connects to a MySQL database, then the Driver Manager will load the MyODBC driver to enable the ODBC API to communicate with the MySQL host.
- Processes ODBC function calls or passes them to the driver for processing.
b) Connector/NET: Connector/NET enables developers to create .NET applications that use data stored in a MySQL database. Connector/NET implements a fully functional ADO.NET interface and provides support for use with ADO.NET-aware tools. Applications that need to use Connector/NET can be written in any of the supported .NET languages [7].
Connector/NET enables developers to easily create .NET applications that require secure, high performance data connectivity with MySQL. It implements the required ADO.NET interfaces and integrates into ADO.NET-aware tools. Developers can build applications using their choice of .NET languages. Connector/NET is a fully managed ADO.NET driver written in 100% pure C#.
Connector/NET supports the following:
- All MySQL 5.0 features.
- Large-packet support for sending and receiving rows and BLOBs up to 2 gigabytes in size.
- Protocol compression, which allows for compressing the data stream between the client and server.
- Support for connecting using TCP/IP sockets, named pipes, or shared memory on Windows.
- Support for connecting using TCP/IP sockets or Unix sockets on Unix.
- Support for the open source Mono framework developed by Novell.
- Fully managed; does not utilize the MySQL client library.
c) Connector/J: Connector/J provides driver support for connecting to MySQL from a Java application using the standard Java Database Connectivity (JDBC) API.
MySQL provides connectivity for client applications developed in the Java programming language via a JDBC driver, which is called MySQL Connector/J. MySQL Connector/J is a JDBC-3.0 Type 4 driver, which means that it is pure Java, implements version 3.0 of the JDBC specification, and communicates directly with the MySQL server using the MySQL protocol [7].
d) Connector/MXJ: Connector/MXJ is a tool that enables easy deployment and management of a MySQL server and database through your Java application.
MySQL Connector/MXJ is a Java utility package for deploying and managing a MySQL database. It is a solution for deploying the MySQL database engine (mysqld) intelligently from within a Java package. Deploying and using MySQL can be as easy as adding an additional parameter to the JDBC connection URL, which will result in the database being started when the first connection is made. This makes it easy for Java developers to deploy applications which require a database by reducing installation barriers for their end-users [7].
MySQL Connector/MXJ makes the MySQL database appear to be a Java-based component. It does this by determining what platform the system is running on, selecting the appropriate binary, and launching the executable. It will also optionally deploy an initial database, with any specified parameters.
e) Connector/PHP: Connector/PHP is a Windows-only connector for PHP that provides the mysql and mysqli extensions for use with MySQL 5.0.18 and later. PHP is a server-side, HTML-embedded scripting language that is used to create dynamic Web pages. It is available for most operating systems and Web servers, and can access most common databases, including MySQL. PHP may be run as a separate program or compiled as a module for use with the Apache Web server [7].
f) Connector/ODBC: The MySQL Connector/ODBC drivers, also called the MyODBC drivers, provide access to a MySQL database using the industry standard Open Database Connectivity (ODBC) API. ODBC was developed according to the specifications of the SQL Access Group and defines a set of function calls, error codes, and data types that can be used to develop database-independent applications. ODBC is usually used when database independence or simultaneous access to different data sources is required [7].
2.1.8 Query Optimizer: “Query Optimizer is a part of the server that takes a parsed SQL query and produces a query execution plan.” The MySQL query optimizer restructures the query by first applying the SELECT restrictions to narrow the number of tuples, then performing the projections to reduce the number of attributes, and finally evaluating any join conditions.
The Query Optimizer streamlines the syntax for use by the Execution Component, which then prepares the most efficient plan of query execution. The Query Optimizer checks to see which index should be used to retrieve the data as quickly and efficiently as possible. It chooses one from among the several ways it has found to execute the query and then creates a plan of execution that can be understood by the Execution Component.
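The plan the optimizer produces can be inspected with the EXPLAIN statement. A brief sketch, reusing the dimcustomer table that appears in the examples of Chapter 4:

EXPLAIN SELECT firstname, lastname
FROM dimcustomer
WHERE customer_id = 42;
-- The output reports, among other things, the chosen access type
-- (e.g., const versus a full scan ALL), the candidate indexes
-- (possible_keys), the index actually used (key), and the estimated
-- number of rows to be examined.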
2.2 Features of MySQL:
a) Relational Database: A relational database is a database which stores data in different tables rather than having all data in a single table; this improves the speed and flexibility of the database. MySQL is one of the best-known relational databases.
b) Client/Server Architecture: MySQL is a client/server system. There is a database server (MySQL) and arbitrarily many clients (application programs), which communicate with the server to, for example, query data or save changes. The clients can run on the same computer or communicate over the network. Most of the big databases use client/server architecture, among them Oracle, MySQL, and Microsoft SQL Server. There are some databases that use a file-server system, such as Microsoft Access, dBase, and FoxPro. The main problem with file-server databases is that when they work over a network, their efficiency decreases as the number of users on the network increases [8].
c) SQL Compatibility: MySQL supports SQL and adheres to the current SQL standard, although with some restrictions and some extensions of its own.
d) Views: A view is a subset of a database. Views are tables designed by users, using queries, to display which columns they require, in what order, how the data is sorted, and what type of data is displayed. Unlike ordinary tables (base tables) in a relational database, a view does not form part of the physical schema: it is a dynamic, virtual table computed or collated from data in the database. Changing the data in a table alters the data shown in subsequent invocations of the view. Beginning with version 5.0.1, MySQL includes support for named views, usually referred to simply as “views” [8].
e) SubSelect: A subselect specifies a result table derived from the tables, views or
nicknames identified in the FROM clause. The derivation can be described as a
sequence of operations in which the result of each operation is input for the next.
f) Stored Procedure: Beginning with version 5.0, MySQL includes stored procedures. A stored procedure is a program that runs directly on the database server. Stored procedures are written in SQL and can be used to improve performance and ease development. A stored procedure is useful when we have a script that gets called often or that uses looped queries; otherwise we may be generating a lot more network traffic than we should. Stored procedures cut down on long queries being sent over the network by turning a potentially long query into a short alias. Using complex stored procedures can save a lot of time when executing a query. By using stored procedures you are adding an extra data layer, so to speak. This means you may be able to fix problems in your application by simply editing a stored procedure, instead of having to change your application code.
g) Triggers: Support for triggers is included beginning with MySQL 5.0.2. Triggers are stored routines, much like procedures and functions; however, they are attached to tables and are fired automatically when a set action is performed on the data in that table. Triggers are programmable events that react to queries and reside directly on the database server. Triggers can be executed before or after INSERT, UPDATE, or DELETE statements.
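A short sketch, assuming a hypothetical orders table with a created_at DATETIME column; the trigger fires automatically before every INSERT on the table:

CREATE TRIGGER orders_before_insert
BEFORE INSERT ON orders
FOR EACH ROW
  SET NEW.created_at = NOW();   -- fill the audit column automatically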
h) Unicode: MySQL has supported all conceivable character sets since version 4.1,
including Latin-1, Latin-2, and Unicode (either in the variant UTF8 or UCS2) [8].
i) Full-Text Search: MySQL has made it extremely easy to add FULLTEXT searching to tables. This is built-in functionality in MySQL that allows users to search certain tables for matches to a string. FULLTEXT indexes are created much like regular KEY and PRIMARY KEY indexes. Full-text search simplifies and accelerates the search for words that are located within a text field. If we employ MySQL for storing text, we can use full-text search to implement an efficient search function simply. There are some restrictions on full-text search. First, MySQL automatically orders the results by their relevancy rating. Second, queries that are longer than 20 characters will not be sorted. Third, and most importantly, the fields in MATCH() should be identical to the fields listed in the table's FULLTEXT definition.
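A sketch of full-text search on a hypothetical articles table; note that the columns named in MATCH() match the FULLTEXT index definition exactly:

CREATE TABLE articles (
  id INT NOT NULL AUTO_INCREMENT,
  title VARCHAR(200),
  body TEXT,
  PRIMARY KEY (id),
  FULLTEXT (title, body)      -- FULLTEXT requires MyISAM in this era
) ENGINE=MyISAM;

-- Results are automatically ordered by relevancy rating.
SELECT id, title FROM articles
WHERE MATCH (title, body) AGAINST ('query optimization');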
j) Replication: MySQL replication allows you to have an exact copy of a database from a master server on another server (slave); all updates to the database on the master server are immediately replicated to the database on the slave server so that both databases are in sync [8]. This is not a backup policy, because an accidentally issued DELETE command will also be carried out on the slave; but replication can help protect against hardware failures.
k) Transactions: MySQL 4.0 and onwards supports transactions and thus passes the ACID test. ACID stands for Atomicity, Consistency, Isolation, and Durability. Transactions help in building robust web applications and simultaneously reduce the possibility of data corruption in the database. The term "transaction" refers to a series of SQL statements which are treated as a single unit by the RDBMS. Typically, a transaction is used to group together SQL statements which are interdependent on each other; a failure in even one of them is considered a failure of the group as a whole. In the context of a database system, a transaction means the execution of several database operations as a block. Thus, a transaction is said to be successful only if all the individual statements within it are executed successfully. This holds even if in the middle of a transaction there is a power failure, the computer crashes, or some other disaster occurs. Transactions also give programmers the possibility of interrupting a series of already executed commands (a sort of revocation). In many situations this leads to a considerable simplification of the programming process.
In spite of popular opinion, MySQL has supported transactions for a long time. One should note here that MySQL can store tables in a variety of formats. The default table format is called MyISAM, and this format does not support transactions. But there are a number of additional formats that do support transactions.
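A minimal transaction sketch on a hypothetical InnoDB account table; either both updates take effect or neither does:

START TRANSACTION;
UPDATE account SET balance = balance - 100 WHERE account_id = 1;
UPDATE account SET balance = balance + 100 WHERE account_id = 2;
COMMIT;   -- or ROLLBACK; to revoke the already executed statements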
l) Foreign Key Constraints: Foreign keys were introduced in MySQL 4.0 to make MySQL a proper RDBMS. MySQL supports foreign key constraints for InnoDB tables, but it does not support foreign key constraints for MyISAM.
In order to set up a foreign key relationship between two MySQL tables, three conditions must be met: both tables must be of the InnoDB table type, the fields used in the foreign key relationship must be indexed, and the fields used in the foreign key relationship must be similar in data type.
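The following sketch satisfies all three conditions with hypothetical parent and child tables: both are InnoDB, the fields are indexed, and the data types match:

CREATE TABLE parent (
  id INT NOT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB;

CREATE TABLE child (
  id INT NOT NULL,
  parent_id INT,
  INDEX (parent_id),   -- the foreign key field must be indexed
  FOREIGN KEY (parent_id) REFERENCES parent(id)
) ENGINE=InnoDB;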
m) GIS Functions: Since version 4.1, MySQL has supported the storing and processing of two-dimensional data. Thus MySQL is well suited for GIS (geographic information systems) applications [8].
n) Programming Languages: There are quite a number of APIs (application
programming interfaces) and libraries for the development of MySQL
applications. For client programming you can use, among others, the languages C,
C++, Java, Perl, PHP, Python, and Tcl.
o) ODBC: MySQL supports the ODBC interface Connector/ODBC. This allows
MySQL to be addressed by all the usual programming languages that run under
Microsoft Windows (Delphi, Visual Basic, etc.). The ODBC interface can also be
implemented under Unix, though that is seldom necessary.
Windows programmers who have migrated to Microsoft’s new .NET platform
can, if they wish, use the ODBC provider or the .NET interface Connector/NET.
p) Platform Independence: It is not only client applications that run under a variety
of operating systems; MySQL itself (that is, the server) can be executed under a
number of operating systems.
The most important are Apple Macintosh OS X, Linux, Microsoft Windows, and the countless Unix variants, such as AIX, BSDI, FreeBSD, HP-UX, OpenBSD, NetBSD, SGI IRIX, and Sun Solaris.
q) Speed: MySQL is considered a very fast database program. This speed has been
backed up by a large number of benchmark tests (though such tests—regardless of
the source—should be considered with a good dose of skepticism) [8].
2.3 Limitations and Tradeoffs of MySQL:
1. All columns have default values.
2. If some garbage or out-of-range data is entered in a column, MySQL sets the column to some appropriate value rather than reporting an error. For numerical values, MySQL sets the column value to 0, or to the smallest or the largest allowed value, as appropriate.
3. All calculated expressions return a value instead of signaling an error condition. For example, 1/0 returns NULL [7].
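A sketch of the out-of-range behavior in point 2, assuming the server is not running in a strict SQL mode (the common default in this era); the table name is hypothetical:

CREATE TABLE t (n TINYINT);   -- signed TINYINT range is -128 to 127
INSERT INTO t VALUES (300);   -- no error, only a warning
SELECT n FROM t;              -- returns 127, the largest allowed value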
Chapter 3
SCHEMA OPTIMIZATION
Optimizing a database is a complex task; one needs to take care of the design and the system limitations to get the whole application optimized. To optimize a system, we need to know what kind of processing the system is doing and what bottlenecks it can have.
To set up an environment for optimization, we need to review the hardware, software, and connectivity we are using for the database.
3.1 Hardware: Hardware plays an important role in the performance of a database query. The following properties of the hardware affect the database:
1. Processors: The processor is very important when it comes to the performance of a query. If the processor is fast, the query will run faster, so if the production and test machines have different processor speeds, the results of the queries will be a lot different. The numbers will also differ if one system has more than one processor. So a mismatch between the test machine and the production machine becomes a problem when we are setting engine parameters.
2. Memory: Memory is also one of the key factors in the optimization of a query. If we compare two machines with different amounts of memory, the one with more memory will perform better. Before testing, we need to check that both machines have the same memory so that there is no discrepancy between the production machine and the test machine.
3. Mass Storage: Disk drive capacity and speed improve dramatically each year. This gives us more choices when picking appropriate storage for the project, but it can introduce anomalies when it comes to performance testing. For example, suppose we want to examine why bulk data operations take so long on the production server. We don't have enough disk capacity on the test platform to unload/reload a 100-GB production data set, so we connect a high capacity external drive to the test machine using USB 1.1. Unfortunately, USB 1.1 is orders of magnitude slower than either USB 2.0 or internal storage. This means that any disk-based performance numbers we receive will not be meaningful, and could lead us to make incorrect assumptions.
4. Attached Devices: Every external device connected to the system can modify the performance of the MySQL production server. The same holds true for the test environment: a test platform connected to different external devices can show different performance, and other factors, such as a firewall, matter as well. These extra responsibilities can render any performance information useless.
In an ideal environment, the test system will be a mirror image of the production server. Usually we don't have a separate test system, so what we can do is run the tests on the production platform during periods of low activity.
3.2 Software: Configuring and installing the correct software is important for any system.
1. Database: One of the most effective aspects of MySQL's open source architecture is the frequent releases and bug fixes delivered to the market. It is important to use the same version of MySQL in both the production and test environments. MySQL releases can often change the performance of applications as new features are provided and old features are removed.
2. Application: When using packaged applications, it is important to have the same version of the software on both the test and production machines. All new patches need to be installed on the machines; this improves responsiveness.
3. Operating system: MySQL's cross-platform portability is one of the benefits of its open source heritage. However, this needs to be handled properly; otherwise it can create problems in diagnosing issues. For example, suppose the production server is based on Windows Server 2003, whereas the testing server is running Linux. Given these differences, it will be very hard to transfer results from one platform to the other. There are significant architectural and configuration differences between these operating systems that can affect MySQL application performance. The same holds true even for homogeneous operating systems if they differ in installed patches or kernel modifications. To reduce the potential for confusion, it is better to have both platforms as similar as possible.
3.3 Designing Database to Improve Speed:
1. Choosing the Right Storage Engine and Table Type:
MySQL has various types of storage engines and tables that can be configured at the same time without interfering with the structure of the queries. It lets you play with the configuration of the database and tables in a way that helps you find the perfect match of storage engine and table type for the data available.
2. Optimizing Table Structure:
Optimizing the table structure helps in reclaiming unused space after deletions and cleaning up the table after structural modifications. The OPTIMIZE TABLE SQL command takes care of this.
Syntax: OPTIMIZE TABLE table_name;
3. Specifying Row Format:
When we create or modify a table, we have three options for storing rows. We can use fixed format, which converts all VARCHAR columns to CHAR; on the other hand, dynamic format converts all CHAR columns to VARCHAR. Fixed format keeps the row size small, and dynamic format makes the rows more flexible, so both have their advantages. Dynamic format uses less space, but it carries the risk of fragmentation and corruption. The third option is running the myisampack command, which instructs MySQL to compress the table into a smaller, read-only format. This translates into less work, better response time, and less likelihood of corruption. When disk space is scarce, this is the perfect option.
4. Index Key Compression:
Compressing index keys makes read operations faster; this can be done by enabling the PACK_KEYS option. Compressing index keys helps save a lot of disk space. The only problem is that it puts an overhead on write operations.
Syntax:
CREATE TABLE Dimcustomer (
DimCustomerKey smallint(5) unsigned NOT NULL auto_increment,
DimCustomerID int(11) NOT NULL default '0',
PRIMARY KEY (DimCustomerKey),
UNIQUE KEY DimCustomerID (DimCustomerID)
) ENGINE=MyISAM PACK_KEYS=1;
5. Checksum Integrity Management:
This option helps in locating corrupted tables, but it adds a slight overhead. This can be done by using the CHECKSUM TABLE command.
Syntax:
CHECKSUM TABLE Prospect;
6. Column Types and Performance:
Choosing the right type of column helps in improving the performance of the table. The main aim is to choose the right datatype for a specific column. Best practice is to choose datatypes that are smaller in size and simpler.
7. String Considerations:
MySQL offers a variety of string- and binary-based datatypes. Some of the useful datatypes available are:
- CHAR versus VARCHAR: Dynamic row format automatically converts CHAR columns to VARCHAR. On the other hand, certain fixed row format tables convert VARCHAR columns to CHAR. CHAR can use excess disk space, but since disks are inexpensive these days this matters less; still, VARCHAR is preferred over CHAR due to its flexibility, improved speed, and reduced fragmentation.
- CHAR BINARY versus VARCHAR BINARY: These columns hold binary information, but they follow the same criteria as CHAR and VARCHAR.
- BLOBs and TEXT: Binary large objects (BLOBs) typically hold images, sound files, executables, and so on. This column type is further subdivided into TINYBLOB, BLOB, MEDIUMBLOB, and LONGBLOB.
8. Numeric Considerations:
- Integers: In MySQL an integer can be defined in five ways: TINYINT, SMALLINT, MEDIUMINT, INT, and BIGINT, along with the ability to declare them SIGNED or UNSIGNED.
- Decimals: Choices available for decimal-based columns include DECIMAL/NUMERIC, FLOAT, and DOUBLE. Selecting FLOAT or DOUBLE without specifying a precision consumes four bytes and eight bytes, respectively. We can specify a precision (on both sides of the decimal place) for the DECIMAL type, which helps in determining the storage consumption [6]. For example, if we define a column of type DECIMAL(10,2), it consumes at least 10 bytes, along with anywhere between 0 and 2 additional bytes, depending on whether a fraction or sign is involved.
9. Using Views to Improve Performance:
MySQL implemented views with version 5.0. Retrieving information using SELECT * returns extra columns that are not necessary. When big queries like these run simultaneously, they negatively affect performance by increasing engine load, and they consume extra bandwidth on the server. In cases like this it is better to make a view that returns only those columns that are absolutely required for the task.
A view can also reduce the number of rows returned when performing a query: using views with a restrictive WHERE clause helps in reducing the data set.
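A sketch, reusing the dimcustomer table from the examples in Chapter 4 (the state column is hypothetical); the view returns only the required columns, and its WHERE clause restricts the rows:

CREATE VIEW v_ca_customers AS
  SELECT firstname, lastname    -- only the columns the task needs
  FROM dimcustomer
  WHERE state = 'CA';           -- restrictive clause reduces the data set

SELECT * FROM v_ca_customers;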
10. Normalization:
Normalization is the process of organizing data in a database. This includes creating tables and establishing relationships between those tables according to rules designed both to protect the data and to make the database more flexible by eliminating redundancy and inconsistent dependency. Redundant data wastes disk space and creates maintenance problems. If data that exists in more than one place must be changed, the data must be changed in exactly the same way in all locations.
Normalization is a process whereby the tables in a database are optimized to remove the potential for redundancy. It is not always advisable to normalize the database; sometimes it makes more sense not to.
Typically, it's unwise to store calculated information (for example, averages, sums, minimums, maximums) in the tables. Instead, users receive more accurate information by computing the requested results at runtime, when the query is executed. However, if we have a very large set of data that doesn't change very often, then it is worth considering performing the calculations in bulk during off-hours, and then storing the results [6].
We can use the MERGE table option to partition and compress data, but it makes more sense if we add a column to the master table; this helps in searching the data faster.
11. Using Constraints to Improve Performance:
Constraints are rules that a database designer specifies when setting up a table. MySQL enforces these rules on changes to information stored in the database. These changes usually occur via INSERT, UPDATE, or DELETE statements, although they can also be triggered by structural alterations to the tables themselves. MySQL offers the following constraints:
- UNIQUE: Guarantees there will be no duplicate values in the column.
- PRIMARY KEY: Identifies the primary unique identifier of a row.
- FOREIGN KEY: Codifies and enforces the relationships among two or more tables with regard to appropriate behavior when data changes.
- DEFAULT: Provides an automatic value for a column if a user omits entering data.
- NOT NULL: Forces users to provide information for a column when inserting or updating data.
- ENUM: Allows setting a restricted list of values for a particular column, although it is not a true constraint.
- SET: Allows storing combinations of predefined values within a string column, although it is not a true constraint.
These constraints help in reducing the chances of data integrity problems. They run on the database server, which is faster than manually coding and downloading/installing the same logic on a client. Reuse is the main foundation of good software design practice; using these constraints reduces the amount of time that developers need for these types of tasks, as well as helping cut down on potential errors [6].
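A sketch combining several of these constraints in one hypothetical table definition; all of the rules are enforced on the server, with no client-side code required:

CREATE TABLE employee (
  emp_id INT NOT NULL AUTO_INCREMENT,
  email VARCHAR(100) NOT NULL,        -- NOT NULL: a value must be supplied
  status ENUM('active','inactive')    -- ENUM: restricted list of values
         DEFAULT 'active',            -- DEFAULT: automatic value if omitted
  PRIMARY KEY (emp_id),               -- PRIMARY KEY: unique row identifier
  UNIQUE (email)                      -- UNIQUE: no duplicate values
) ENGINE=InnoDB;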
12. Using Clustering to Improve Performance:
MySQL Cluster is a high performance, scalable, clustered database. It is a real-time open source transactional database designed for fast, always-on access to data under high throughput conditions. MySQL Cluster utilizes a shared-nothing architecture which does not require any additional infrastructure investment, and it provides maximum data availability using a distributed node architecture with no single point of failure [10].
Clustering spreads data over multiple servers, resulting in a single redundant system: the data is distributed, and the system is scalable, which means we can add more machines depending on the requirements. A MySQL cluster consists of a set of computers that run MySQL servers to receive and respond to queries, storage nodes to store the data held in the cluster and to process the queries, and one or more management nodes to act as a central point from which to manage the entire cluster [12].
3.4 Advantages of Using Clustering:
a) High Availability: MySQL Cluster provides a fault tolerant architecture that ensures maximum availability. It implements automatic node recovery to ensure the switch from one node to another if there is a node failure. MySQL Cluster takes care of hardware failure as well: if all the nodes fail due to hardware failure, MySQL Cluster ensures that the entire system can be recovered to a consistent state by using a combination of checkpoints and log execution. MySQL Cluster also makes sure that systems are replicated across geographic regions so as to be available in all regions.
b) Scalability: Clustering improves the performance of the database by adding more hardware and thereby lowering the load on any single machine.
c) Easy to Use: It is easier to take a backup of a large cluster than to back up lots of separate database servers, each holding a partition of the data.
d) High Performance: MySQL Cluster provides high performance. It has a main-memory database solution which keeps the data in memory and limits I/O bottlenecks by asynchronously writing transaction logs to disk. MySQL Cluster enables servers to share processing within a cluster, taking full advantage of all hardware.
e) Extremely Fast Auto Recovery: MySQL Cluster has extremely fast auto recovery. It uses synchronous replication to transfer information to all the database nodes, so if a node fails, operation can automatically move to an adjacent node quickly. This eliminates the time needed for creating and processing log information. MySQL Cluster database nodes can recover and restart dynamically [7].
Chapter 4
OPTIMIZATION
The goal of performance tuning is to minimize the response time for each query and to maximize the throughput of the entire database server by reducing network traffic, disk I/O, and CPU time. This goal is achieved through understanding application requirements, the logical and physical structure of the data, and tradeoffs between conflicting uses of the database. Performance issues should be considered throughout the development cycle, not at the end when the system is implemented. Many performance gains that result in significant improvements are achieved by careful design from the outset. Other system-level performance factors, such as memory, hardware, and so on, can also be addressed, but the performance gain from these areas is often incremental. The things that need to be considered for optimizing the database are hardware, OS/libraries, setup and queries, API, and applications.
4.1 Bottlenecks
The most common system bottlenecks are:
- Disk seeks: Disk seeks are among the most expensive operations in the database. It takes time for the disk to find a piece of data.
Some of the reasons why disk seeks are so expensive are:
a) Randomly distributed small files all over the disk.
b) Editing different files all over the disk at the same time.
c) Files being opened and updated all over the disk at the same time, leading to more fragmentation.
Though with modern disks the mean seek time is below 10 ms, the cost accumulates over time and with the creating and updating of files.
- Disk Read/Write: Disk read/write time is the time taken to read or write data at a particular place on the disk. To read or write data, the disk head has to physically move to the designated place. The way to optimize seek time is to distribute the data onto more than one disk [18]. This is easier to optimize than seeks because you can read in parallel from multiple disks. Caching data helps in reducing disk read/write time by not reopening files repeatedly.
- CPU Cycles: The processing speed of the database can be increased by having the data in main memory. Having smaller tables helps in increasing the speed, but table size is usually not the limiting concern.
- Memory Bandwidth: When the CPU needs more data than can fit in the CPU cache, main memory bandwidth becomes a bottleneck [18].
4.2 Measuring Performance
A black-box approach can be used to measure the performance of the database. It measures transactions per second (TPS). A transaction is a unit of execution that a user invokes; examples of a transaction are a read query or a group of updates wrapped in a stored procedure [17]. The variables that can be altered to affect the performance include:
a) Altering Hardware: the speed of the CPU, bus speed, memory access time, seek time, and network and interface speed all affect the performance of the database.
b) Altering the Operating System: the performance of the native APIs, threading, locking, and memory management can impact the performance of a common benchmark.
c) Number of Clients concurrently attempting to connect to the database measures its ability to handle the burden on the database.
d) Data Schema describes the structure of the database used in the test; it allows comparing results from a single table against multiple tables.
e) Data Configuration plays an important role in the performance of the database. Modifying parameters like the maximum number of client connections, the size of the query cache, logs, index memory cache size, and the network protocols used helps in determining the performance of the system [17].
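The runtime performance of scalar expressions mentioned in the Abstract can be measured directly with MySQL's built-in BENCHMARK() function, which evaluates an expression repeatedly; for example:

SELECT BENCHMARK(1000000, ENCODE('hello', 'goodbye'));
-- The function itself returns 0; the measurement of interest is the
-- elapsed time the client prints, e.g., "1 row in set (x.xx sec)".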
4.3 Optimizing Hardware for MySQL
- If dealing with bulky tables, it is suggested to consider 64-bit hardware. Since MySQL uses a lot of 64-bit integers internally, 64-bit CPUs will give much better performance.
- For optimizing large databases, the order of optimization should be RAM, fast disks, and then CPU power.
- Increasing RAM helps in speeding up the database, since it allows more key pages to be kept in RAM.
- A UPS is an option to protect the tables in case of power failure.
- For database systems that have a dedicated server, 1G Ethernet is suggested; it improves latency and thus performance.
4.4 Optimizing Disks
- To improve the efficiency of the disks, it is suggested to have dedicated disks for the system, programs, and temporary files, and for the logs if there are many transaction changes.
- Low seek time reflects disk optimization. In most cases, blocks will be buffered together and finding a row does not exceed 1-2 seeks. A formula to estimate the number of seeks needed to find a row is: log(row_count) / log(index_block_length / 3 * 2 / (key_length + data_ptr_length)) + 1.
- For writes, however, about 4 seek requests are needed to find where to place the new key, and normally 2 seeks to update the index and write the row.
- For large databases, the seek time is dependent on the disk speed, and the total seek cost grows by N log N as the data grows.
- Splitting the database over different disks increases disk speed; this can be done by using symbolic links, as sketched at the end of this section.
- Disk striping helps in improving the performance of the disks by spreading data across several partitions on various disks. Striping disks will increase both read and write throughput.
- Disk mirroring is a process in which two duplicate disks are maintained simultaneously with the same data. This process is expensive, as it requires two disks to store the same data, but it provides maximum safety: if one disk fails, the system switches to the other disk immediately without any data loss.
- Disk mirroring and disk striping are not recommended for temporary files or for data that can be easily re-generated.
- On Linux, use hdparm -m16 -d1 on the disks at boot to enable reading/writing of multiple sectors at a time, and DMA. This may improve the response time by 5-50%.
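One way to split a table across disks, sketched below with hypothetical paths: MyISAM tables accept DATA DIRECTORY and INDEX DIRECTORY options, which MySQL implements through symbolic links (the server must have permission to write to those paths):

CREATE TABLE big_log (
  msg VARCHAR(255)
) ENGINE=MyISAM
  DATA DIRECTORY = '/disk2/mysql/data'     -- .MYD data file on one disk
  INDEX DIRECTORY = '/disk3/mysql/index';  -- .MYI index file on another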
4.5 Optimizing OS
- Configuring the system helps in reducing memory usage. Adding RAM also helps with memory problems and increases speed.
- NFS disks are not suitable for data storage, since NFS has locking problems.
- Increase the number of open files allowed for the system and for the SQL server (add ulimit -n # in the safe_mysqld script).
- Increase the number of processes and threads allowed for the system.
- If you have relatively few big tables, tell your file system not to break up the files across different cylinders (Solaris).
- Use file systems that support big files (Solaris).
- Choose which file system to use; ReiserFS on Linux is very fast for open, read, and write. File checks take just a couple of seconds.
4.6 Choosing API
- Perl
  - Portable programs between operating systems and databases
  - Good for quick prototyping
  - Uses the DBI (Database Interface)/DBD (Database Interface Driver) interface
- PHP
  - Simpler to learn than Perl
  - Standard MySQL interface
  - Faster
  - Uses fewer resources than Perl, which makes it good for embedding in Web servers
- C/C++
  - The native interface to MySQL
  - The fastest interface, and it gives the most control
  - Very simple to use
- ODBC
  - Works on Windows and Unix
  - It is slow
  - Harder to use, and the interface changes frequently, so it is harder to get used to
  - It adds overhead
- Python and others
  - We don't use it anymore
4.7 Compiling and Installing MySQL
- Choosing a compatible compiler helps in improving performance by 10-30% [16].
- Compiling MySQL with pgcc improves efficiency, since pgcc is a Pentium-optimized version of gcc [16].
- Use the recommended optimization options for your particular platform.
- Using native threads improves efficiency [16].
4.8 Execution Path of a Query:
Figure 3: Execution Path of a Query [7]
The execution path of the query in Figure 3 is taken from the MySQL Reference Manual [7]. It is
further explained in the following steps:
a) The user sends a request, in the form of a SQL statement, to the server.
b) The server checks whether the result is already in the query cache. If it is, this
is called a hit, and the server returns the stored result from the cache;
otherwise it passes the SQL statement on for processing.
c) The server parses, preprocesses, and optimizes the SQL into a query execution
plan.
d) The query execution engine executes the plan by calling the storage engine API.
e) The storage engine retrieves the data from the data repository and sends it back
to the user who requested it.
4.9 Query Analyzer:
The query analyzer can visually correlate MySQL activity, OS graphs, and query activity,
which helps in finding the most expensive queries by correlating SQL activity with usage
spikes in key system resources. It allows analyzing the execution plan of a query. It can
highlight any region of the graph and thereby filter the queries that were executing
during the selected date/time range. It also collects and reports all SQL errors and
warnings for each query. A snapshot of the query analyzer is shown in Figure 4.
Figure 4: Query Analyzer SQL Error and Warning Counts
4.10 Optimizing Environmental Parameters:
1. Low value for max_write_lock_count: MySQL executes INSERT and
UPDATE statements with higher priority than reads. INSERT and UPDATE
statements also require a table lock (for MyISAM engines), which requires even
table reads (SELECT statements) to complete before the INSERT or UPDATE is
executed. This can cause long delays: SELECT statements queue up behind an
INSERT or UPDATE that may itself take minimal time to execute but is waiting
for existing long-running SELECT statements to finish. So even a single INSERT
or UPDATE can slow down a heavily loaded database at unpredictable times.
One solution is to use INSERT DELAYED statement to cause INSERT
statements to be run at lower priority in a queue.
By starting mysqld with a low value for the max_write_lock_count system
variable, we force MySQL to temporarily elevate the priority of all SELECT
statements waiting on a table once a specific number of write locks on that
table have occurred. This allows READ locks through after a certain number of
WRITE locks.
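A minimal sketch of both remedies, assuming a MyISAM table; the value 10 and the
placeholder values are illustrative choices, not recommendations:
SET GLOBAL max_write_lock_count = 10; -- let waiting READ locks through after 10 consecutive WRITE locks
INSERT DELAYED INTO DimCustomer VALUES ('val1','val2','val3',....); -- queue the insert at lower priority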
2. FLUSH Command: The root-level user, or any user having the RELOAD
privilege, can use the FLUSH command to clear the internal caches used by
MySQL. The syntax for FLUSH is:
Syntax:
FLUSH hosts| privileges| tables| logs;
Example before using FLUSH:
SELECT dimcustomer.firstname, dimcustomer.lastname,
prospect.emailaddress, prospect.education, prospect.occupation FROM
dimcustomer,prospect WHERE dimcustomer.customer_id =
prospect.prospect_id;
2059 rows in set (07.20sec)
Example after using FLUSH:
SELECT dimcustomer.firstname, dimcustomer.lastname,
prospect.emailaddress, prospect.education, prospect.occupation FROM
dimcustomer,prospect WHERE dimcustomer.customer_id =
prospect.prospect_id;
2059 rows in set (07.80sec)
FLUSH PRIVILEGES is invoked when a new user is added to the system: it
reloads the privileges from the grant tables into the running server. The server
caches this grant information in memory, and each time it loads such
information, unnecessary entries can accumulate in the cache. The FLUSH
PRIVILEGES command frees this cache memory.
53
When we run any FLUSH command, the Query OK response assures us that
the cleaning process completed without a hitch.
The FLUSH HOSTS command flushes the host cache tables and the internal DNS
cache. It is used because the number of connection errors for a particular host can
reach its maximum, which makes the server block that host; it is also suggested
when a host changes its IP address. FLUSH HOSTS resets the cache and again
allows connections to be made. It should be used carefully, because it can lead to
reverse DNS lookups that can further limit the server's speed.
The FLUSH LOGS command closes and reopens the standard and update log
files. It is used when the existing log files become too big; the command makes
the server begin a new, empty log file. This helps in managing the log files, which
can then be examined more easily when looking for errors.
The FLUSH TABLES command closes all currently open or in-use tables and
also removes all query results from the query cache. No error occurs if a
named table does not exist.
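A minimal sketch of the four forms described above; each requires the RELOAD
privilege:
FLUSH PRIVILEGES; -- reload the grant tables
FLUSH HOSTS; -- empty the host and DNS caches, unblocking blocked hosts
FLUSH LOGS; -- close and reopen the log files
FLUSH TABLES; -- close open tables and clear the query cache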
3. Altering Index Buffer Size (key_buffer): The key_buffer variable is an
important variable when it comes to improving database performance and
optimizing the system, since it caches index blocks and therefore works best on
indexed tables. It is recommended to assign about 25% of total system memory
to it. We can change the configuration of the variable to get better results
depending upon the requirements.
Syntax to check the status of the variable:
SHOW VARIABLES like 'key_buffer_size';
Key_buffer_size = 499712;
Example before changing the variable:
SELECT dimcustomer.firstname, dimcustomer.lastname,
prospect.emailaddress, prospect.education, prospect.occupation FROM
dimcustomer,prospect WHERE dimcustomer.customer_id =
prospect.prospect_id;
2059 rows in set (07.90sec)
Syntax to change the status of the variable:
SET GLOBAL key_buffer_size= 1000000;
Example after changing the variable:
SELECT dimcustomer.firstname, dimcustomer.lastname,
prospect.emailaddress, prospect.education, prospect.occupation FROM
dimcustomer,prospect WHERE dimcustomer.customer_id =
prospect.prospect_id;
2059 rows in set (07.70sec)
4. Setting Maximum Number of Open Tables (table_cache): This variable is
similar to the max_connections variable. Max_connections tells how many
connections a MySQL server can have, and similarly table_cache tells how many
tables a MySQL server can keep open at a time. The variable can be altered to
improve performance depending upon the server size: if the server is big, we can
afford to increase its value, which lets more tables be cached in memory and thus
makes queries faster.
Syntax to check the status of the variable:
SHOW VARIABLES like 'table_cache';
Table_cache = 256;
Syntax to change the status of the variable:
SET GLOBAL table_cache=512;
5. Activating the Query Cache (query_cache_type): The MySQL query cache is
somewhat similar to a SQL statement cache: it stores the SELECT queries issued
by users, but on top of that it also stores each query's result set. This benefits the
database engine not only by avoiding the overhead of hard-parsing identical
queries but also by saving the overhead of recreating a complex result set from
disk or memory caches. It decreases both physical and logical I/O and improves
response times for business applications. The query cache variable can be set to
ON, OFF, or DEMAND. It is recommended to turn this variable on for servers
that receive a large number of similar queries repeatedly. A companion variable
controls the amount of memory allocated to the MySQL query cache; it is
suggested to increase its value for high-volume servers.
Syntax to check the status of the variable:
SHOW VARIABLES like 'query_cache_type';
query_cache_type = OFF
Example before changing the variable:
SELECT dimcustomer.firstname, dimcustomer.lastname,
prospect.emailaddress, prospect.education, prospect.occupation FROM
dimcustomer,prospect WHERE dimcustomer.customer_id =
prospect.prospect_id;
2059 rows in set (07.90sec)
Syntax to change the status of the variable:
SET SESSION query_cache_type=on;
Example after changing the variable:
SELECT dimcustomer.firstname, dimcustomer.lastname,
prospect.emailaddress, prospect.education, prospect.occupation FROM
dimcustomer,prospect WHERE dimcustomer.customer_id =
prospect.prospect_id;
2059 rows in set (07.80sec)
6. Max_join_size: The default value of max_join_size is 18446744073709551615
if no size is specified in the config file. This variable sets the maximum number
of rows a SELECT query may examine for a table join. It is a very useful
variable, because badly written queries can end up scanning millions of rows,
reducing the server's capability to process requests. The default value is very
large; since this variable affects all users and queries, it is important to set a value
that suits every kind of query and user. Choosing a smaller value improves server
performance, but it has to be chosen very carefully, depending upon the database
and the kind of requests the server receives.
Syntax to check the status of the variable:
SHOW VARIABLES like 'max_join_size';
Syntax to change the status of the variable:
SET SESSION max_join_size=1000000000;
7. Max_connections: This variable controls the number of connections a server can
have concurrently at any instant of time. The default value of this variable is 100.
It needs to be changed depending upon the expected number of connections; it is
suggested to increase its value to avoid the "Too many connections" error.
Syntax to check the status of the variable:
SHOW VARIABLES like 'max_connections';
Syntax to change the status of the variable:
SET GLOBAL max_connections=200;
8. Long_query_time: MySQL keeps a log of all slow queries, called the slow query
log: it automatically logs every query that takes longer than long_query_time
seconds to complete. This log helps in identifying the specific queries that
degrade database performance by not finishing within the time limit, and it
provides the starting point for optimizing those slow queries. It is suggested to
use a larger value of this variable for systems experiencing heavy load.
Syntax to check the status of the variable:
SHOW VARIABLES like 'long_query_time';
Syntax to change the status of the variable:
SET SESSION long_query_time=100;
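A minimal sketch of turning the slow query log on at runtime, assuming MySQL 5.1
or later where the log can be toggled dynamically; the threshold of 2 seconds is an
illustrative value:
SET GLOBAL slow_query_log = 1; -- start writing the slow query log
SET GLOBAL long_query_time = 2; -- log any query slower than 2 seconds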
9. Transaction Isolation Level (tx_isolation): This global variable controls the
extent to which concurrent transactions can overlap and the degree to which data
being updated is visible to other transactions. Never change the isolation level in
the middle of a transaction; such a change causes the DBMS to make an implicit
commit. A value can be assigned to the tx_isolation variable by using the SET
command.
Syntax to check the status of the variable:
SHOW VARIABLES like 'tx_isolation';
Syntax to change the status of the variable:
SET [GLOBAL | SESSION] tx_isolation = 'SERIALIZABLE' | 'READ-UNCOMMITTED' |
'READ-COMMITTED' | 'REPEATABLE-READ';
The GLOBAL keyword sets the default transaction level globally for all subsequent
sessions, though existing sessions are unaffected.
The SESSION keyword sets the default transaction level for all subsequent transactions
performed within the current session.
Without the GLOBAL or SESSION keyword, the statement sets the isolation level for
the next transaction only.
MySQL offers all four isolation levels specified by the SQL standard:
1. Read Uncommitted: Also called dirty read. When it is used, the
MySQL server uses no locks for SELECT queries but employs locks
for UPDATE and DELETE statements. It only ensures that
physically corrupted data will not be read.
2. Read Committed: This level uses shared locks while reading data. It
takes care of the limitation of read uncommitted: it will not read
data from another transaction until that transaction has committed.
Its limitation is that it cannot ensure that the data read will remain
unchanged by the end of the transaction.
3. Repeatable Read: It overcomes the limitations of both read
uncommitted and read committed. It puts locks on all the data used
in the query, so no other transaction can modify this data.
4. Serializable: This is the most restrictive isolation level. It never
produces phantom reads, and it prevents other users from making
any kind of update while the data is being used in the query.
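The same levels can also be set with the standard SQL statement, which is
equivalent to assigning tx_isolation directly; a minimal sketch:
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ; -- applies to subsequent transactions in this session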
10. Binary Log (log_bin): The binary log keeps track of all changes made to the
database. It can be used as a backup of all data-modifying statements performed
on the database and makes it possible to retrace them. It helps in restoring a
system to a stable snapshot and thus in recovering from any transaction
failure or interruption.
Syntax to check the status of the variable:
SHOW VARIABLES like ‘log_bin’;
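The binary log itself is enabled at server startup; a minimal sketch of the my.cnf
entry, where the base name mysql-bin is an illustrative choice:
[mysqld]
log-bin=mysql-bin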
11. Init_connect: This variable can be used to trigger a series of SQL commands or
queries whenever a non-super user connects successfully. When a connection is
established, the given set of commands is executed; this is the easiest way to run
user-specific initialization SQL when a connection is established.
Syntax to check the status of init_connect:
SHOW VARIABLES like 'init_connect';
Syntax to change the value of the variable is:
SET GLOBAL init_connect='SET autocommit=0';
12. Interactive_timeout: The default timeout value for a MySQL server connection
is 28,800 seconds. The connection is broken if the user stays idle for longer than
this period. The value of the variable can be changed using the SET command.
Syntax to check the value of the timeout is:
SHOW VARIABLES like ‘interactive_timeout’;
Syntax to change the value of the variable is:
SET interactive_timeout = 10000;
There is no timeout option on the client side; the only way to set a timeout is on
the server side, using the interactive_timeout variable. A reasonable timeout is
around 5 minutes; a larger timeout is not recommended, especially when many
users are trying to connect to the server.
13. The MySQL query cache helps with performance optimization tasks. It is a little
different from the Oracle cache: Oracle caches whole execution plans, whereas
MySQL stores the result set. The operating mode of the query cache can be
edited, which gives the administrator a lot of control. Three values can be
assigned to the mode variable: 0, 1, and 2. By default, query caching is turned off.
Query caching is turned off for mode 0 and enabled for mode 1; for mode 2, the
query cache is enabled but works only on demand (SELECT SQL_CACHE ...).
Syntax:
SET GLOBAL query_cache_type=0;
SET GLOBAL query_cache_type=1;
SET GLOBAL query_cache_type=2;
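A minimal sketch of demand mode: after setting query_cache_type=2 as above,
only queries carrying the SQL_CACHE hint are cached:
SELECT SQL_CACHE firstname, lastname FROM dimcustomer; -- this result set is cached on demand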
4.11 Optimizing Tables Structure:
1. Declaring columns NOT NULL by default speeds up the execution of a query.
Syntax:
ALTER TABLE dimcustomer CHANGE CustomerKey CustomerKey
varchar(100) NOT NULL;
Example before adding NOT NULL:
SELECT dimcustomer.firstname, dimcustomer.lastname,
prospect.emailaddress, prospect.education, prospect.occupation FROM
dimcustomer,prospect WHERE dimcustomer.customer_id =
prospect.prospect_id;
2059 rows in set (04.30sec)
Example after adding NOT NULL:
SELECT dimcustomer.firstname, dimcustomer.lastname,
prospect.emailaddress, prospect.education, prospect.occupation FROM
dimcustomer,prospect WHERE dimcustomer.customer_id =
prospect.prospect_id;
2059 rows in set (04.10sec)
2. Using appropriate attribute types and lengths saves a lot of space and query access
time. Appropriate datatypes help in reducing the size of a column: it is suggested
to use SMALLINT or MEDIUMINT instead of INT, and, when using fixed-length
attributes such as CHAR, to use a short length.
Example before changing the datatype:
SELECT dimcustomer.firstname, dimcustomer.lastname,
prospect.emailaddress, prospect.education, prospect.occupation FROM
dimcustomer,prospect WHERE dimcustomer.customer_id =
prospect.prospect_id;
2059 rows in set (04.30sec)
Syntax:
ALTER TABLE dimcustomer CHANGE CustomerKey CustomerKey smallint;
After making some more updates like deleting and inserting some rows.
Example after changing the datatype:
SELECT dimcustomer.firstname, dimcustomer.lastname,
prospect.emailaddress, prospect.education, prospect.occupation FROM
dimcustomer,prospect WHERE dimcustomer.customer_id =
prospect.prospect_id;
2059 rows in set (04.20sec)
3. It is suggested to use fixed-length attributes rather than dynamic-length attributes
like VARCHAR or BLOB where possible; a sketch follows.
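A minimal sketch of converting a dynamic-length column to a fixed-length one;
the choice of column is illustrative:
ALTER TABLE Prospect MODIFY StateProvinceCode CHAR(3); -- fixed-length CHAR instead of VARCHAR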
4. Indexes need to be designed according to need. The primary-key index should be
as small as possible, prefixes should be used wherever possible, and indexes
should be created only where required, since otherwise they act as overhead. In a
composite index, it is suggested that the most frequently used attribute be placed
leftmost, so that queries can exploit the leftmost prefix of the index.
5. Index keys are most commonly used in the WHERE clause and are used
frequently to join tables in MySQL statements. There are restrictions on how
many indexes a table can have and how long an index can be: most storage
engines support at least 16 indexes per table and a total index length of at
least 256 bytes.
Example before adding the indexes:
SELECT dimcustomer.firstname, dimcustomer.lastname,
prospect.emailaddress, prospect.education, prospect.occupation FROM
dimcustomer,prospect WHERE dimcustomer.customer_id =
prospect.prospect_id;
2059 rows in set (04.20sec)
Syntax Add Index:
CREATE INDEX index1 on DimCustomer (Customerkey);
CREATE INDEX index2 on DimCustomer (Geographykey);
CREATE INDEX index3 on DimCustomer (CustomerAlternatekey);
CREATE INDEX index4 on DimCustomer (EmailAddress);
CREATE INDEX index1 on Prospect (ProspectiveBuyerKey);
CREATE INDEX index2 on Prospect (ProspectAlternateKey);
CREATE INDEX index3 on Prospect (ProspectiveBuyerKey);
CREATE INDEX index4 on Prospect (EmailAddress);
CREATE INDEX index5 on Prospect (AddressLine1);
CREATE INDEX index6 on Prospect (Prospect_id);
Example after adding the indexes:
SELECT dimcustomer.firstname, dimcustomer.lastname,
prospect.emailaddress, prospect.education, prospect.occupation FROM
dimcustomer,prospect WHERE dimcustomer.customer_id =
prospect.prospect_id;
2059 rows in set (04.40sec)
Indexing can improve query execution time, but excessive indexing adds overhead
to the table, as the slower timing above shows.
6. It is suggested to index keys that have high selectivity. The selectivity of an index
relates to the percentage of rows having the same value for the indexed key: the
indexed key is optimal if few rows share the same value. A quick check is
sketched below.
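A minimal sketch of measuring selectivity, using the EmailAddress column from
the example tables; a ratio close to 1 indicates a good index candidate:
SELECT COUNT(DISTINCT EmailAddress) / COUNT(*) AS selectivity FROM dimcustomer;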
7. Indexing helps in improving query performance, but it is not advisable to index
every column. Indexing a column that gets updated frequently is not suitable,
because every update must also maintain the index.
8. A composite index contains more than one column and provides advantages over
single-column indexing. Composite indexing improves selectivity by combining
different columns together and helps in decreasing system I/O: if all the selected
columns are covered by the composite index, the query returns values by
accessing only the index, thus avoiding the overhead of accessing the table.
Syntax:
CREATE INDEX index1 on DimCustomer (Customerkey, Geographykey,
CustomerAlternatekey, EmailAddress, AddressLine1,Phone);
CREATE INDEX index1 on Prospect (ProspectiveBuyerKey,
ProspectAlternateKey,EmailAddress, AddressLine1);
Example before composite indexes:
SELECT dimcustomer.firstname, dimcustomer.lastname,
prospect.emailaddress, prospect.education, prospect.occupation FROM
dimcustomer,prospect WHERE dimcustomer.customer_id =
prospect.prospect_id;
2059 rows in set (04.50sec)
Example after adding composite indexes:
SELECT dimcustomer.firstname, dimcustomer.lastname,
prospect.emailaddress, prospect.education, prospect.occupation FROM
dimcustomer,prospect WHERE dimcustomer.customer_id =
prospect.prospect_id;
2059 rows in set (04.40sec)
9. To generate unique values, AUTO_INCREMENT is preferable.
Syntax:
ALTER TABLE DimCustomer ADD customer_id MEDIUMINT NOT NULL
AUTO_INCREMENT KEY;
ALTER TABLE Prospect ADD Prospect_id MEDIUMINT NOT NULL
AUTO_INCREMENT KEY;
10. Use small primary keys, and use numbers instead of strings when joining tables;
this improves the speed of query execution.
Comparing a string primary key with a numeric one:
Select * from DimCustomer; -- when the primary key is of type MEDIUMINT
18484 rows in set (211.70sec)
Select * from DimCustomer; -- when the primary key is of type string (VARCHAR)
18484 rows in set (356.60sec)
11. Horizontal Partitioning: Horizontal partitioning breaks a table into two or more
tables, splitting its contents into smaller datasets that can be addressed
individually or collectively depending upon the situation. It provides more
possibilities for indexing, lowers the number of rows to be accessed, and lowers
the number of reads from disk/memory.
Syntax:
Before Partition:
CREATE TABLE `test`.`dimcustomer` (
`customerkey` smallint(6) default NULL,
`geographykey` smallint(6) default NULL,
`customeralternatekey` varchar(100) NOT NULL,
`Title` varchar(100) default NULL,
`FirstName` varchar(100) default NULL,
`middlename` char(10) default NULL,
`LastName` varchar(100) default NULL,
`NameStyle` varchar(100) default NULL,
`BirthDate` varchar(100) default NULL,
`maritalstatus` char(10) default NULL,
`gender` char(10) default NULL,
`EmailAddress` varchar(100) default NULL,
`YearlyIncome` varchar(100) default NULL,
`totalchildren` tinyint(6) default NULL,
`numberchildrenathome` tinyint(4) default NULL,
`EnglishEducation` varchar(100) default NULL,
`SpanishEducation` varchar(100) default NULL,
`FrenchEducation` varchar(100) default NULL,
`EnglishOccupation` varchar(100) default NULL,
`SpanishOccupation` varchar(100) default NULL,
`FrenchOccupation` varchar(100) default NULL,
`numbercarsowned` tinyint(6) default NULL,
`AddressLine1` varchar(100) default NULL,
`Phone` varchar(100) default NULL,
`DateFirstPurchase` varchar(100) default NULL,
`CommuteDistance` varchar(100) default NULL,
`customer_id` mediumint(9) NOT NULL,
PRIMARY KEY (`customer_id`)
);
Example:
SELECT firstname, lastname, emailaddress, education, occupation from prospect;
2059 rows in set (04.10sec)
After Partition:
CREATE TABLE prospect1 (
ProspectiveBuyerKey varchar(100) default NULL,
FirstName varchar(50) default NULL,
MiddleName varchar(50) default NULL,
LastName varchar(50) default NULL,
Gender varchar(10) default NULL,
EmailAddress varchar(50) default NULL,
prospect_id mediumint(9) NOT NULL auto_increment,
PRIMARY KEY (prospect_id)
);
CREATE TABLE prospect2 (
ProspectAlternateKey varchar(15) default NULL,
YearlyIncome varchar(100) default NULL,
TotalChildren varchar(100) default NULL,
NumberChildrenAtHome varchar(100) default NULL,
Education varchar(40) default NULL,
Occupation varchar(100) default NULL
);
CREATE TABLE prospect3 (
NumberCarsOwned varchar(100) default NULL,
AddressLine1 varchar(120) default NULL,
City varchar(30) default NULL,
StateProvinceCode varchar(3) default NULL,
PostalCode varchar(15) default NULL,
Phone varchar(20) default NULL,
Salutation varchar(8) default NULL,
Unknown varchar(100) default NULL
);
Example after Partition:
SELECT firstname, lastname, emailaddress FROM prospect1;
2059 rows in set (02.20 sec)
4.12 Optimizing How to Load tables:
1. Using INSERT DELAYED increases speed by batching queued rows into a
single disk write. It can be used when the user is not concerned about exactly
when the data is written: the statement returns immediately, and the server
inserts the queued rows once no other clients are using the table. INSERT
DELAYED only works on MyISAM, MEMORY, and ARCHIVE tables.
Syntax:
INSERT DELAYED INTO DimCustomer VALUES ('va1l','va12','va13',....);
2. Use INSERT LOW_PRIORITY when SELECT statements should have higher
priority. This increases the speed of SELECT queries by making the insert
happen later, once the SELECT queries have released the resources.
Syntax:
INSERT low_priority INTO counties VALUES('val1','val2','val3','val4',.....);
3. There are three ways of inserting data into a table: single-row insert, multiple-row
insert, and LOAD DATA (or mysqlimport). Single-row insert is the slowest,
multiple-row insert is faster, and LOAD DATA is the fastest among them.

Single insert: it inserts one row per statement.
Syntax:
Insert into DimCustomer Values
('11000','26','AW00011000','','Jon','V','Yang','0','24205','M','M','jon24@ad
venture-works.com','90000','2','0','Bachelors','Licenciatura','Bac +
4','Professional','Profesional','Cadre','1','0','3761 N. 14th St','','1 (11) 500
555-0162','37094','1-2 Miles');
Insert into DimCustomer Values
('11001','37','AW00011001','','Eugene','L','Huang','0','23876','S','M','eugene
[email protected]','60000','3','3','Bachelors','Licenciatura','Bac +
4','Professional','Profesional','Cadre','0','1','2243 W St.','','1 (11) 500
555-0110','37090','0-1 Miles');
………………………….
18484 rows affected (552.60 sec)

Multiple-row insert: this technique improves the performance of database
processing, as MySQL considers all the insertions to be a single long insert,
which it processes much faster than the equivalent sequence of separate
INSERT statements.
Syntax:
Insert into DimCustomer Values
('11000','26','AW00011000','','Jon','V','Yang','0','24205','M','M','jon24@ad
venture-works.com','90000','2','0','Bachelors','Licenciatura','Bac +
4','Professional','Profesional','Cadre','1','0','3761 N. 14th St','','1 (11) 500
555-0162','37094','1-2 Miles'),
('11001','37','AW00011001','','Eugene','L','Huang','0','23876','S','M','eugene
[email protected]','60000','3','3','Bachelors','Licenciatura','Bac +
4','Professional','Profesional','Cadre','0','1','2243 W St.','','1 (11) 500
555-0110','37090','0-1 Miles'),
('11002','31','AW00011002','','Ruben','','Torres','0','23966','M','M','ruben35
@adventure-works.com','60000','3','3','Bachelors','Licenciatura','Bac +
4','Professional','Profesional','Cadre','1','1','5844 Linden Land','','1 (11) 500
555-0184','37082','2-5 Miles'), ……………….
18484 rows affected (61.10 sec)

Load data: to load large amounts of data, LOAD DATA INFILE is
preferable to INSERT statements.
Syntax:
LOAD DATA INFILE 'DimCustomer.sql' INTO TABLE
test.DimCustomer;
18484 rows affected (17.50 sec)
4.13 Optimizing Queries:
1. MySQL interprets a query from right to left, so all the limiters should be placed at
the right.
2. Columns used in ORDER BY should preferably be indexed.
3. Indexes are preferred when a column is searched frequently, but they slow down
insertion into the table. Unused indexes should be dropped, as sketched below.
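A minimal sketch of dropping an unused index, using one of the indexes created
earlier in this chapter:
DROP INDEX index4 ON DimCustomer; -- remove an index that EXPLAIN shows is never used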
4. For limited results from the database, LIMIT should be used in the query. It helps
in saving MySQL resources, since as soon as the requested rows are found
MySQL stops searching the rest of the table.
Example before using LIMIT:
SELECT dimcustomer.firstname, dimcustomer.lastname, prospect.emailaddress,
prospect.education, prospect.occupation FROM dimcustomer,prospect WHERE
dimcustomer.customer_id = prospect.prospect_id;
2059 rows in set (04.90 sec)
Example after using LIMIT:
SELECT dimcustomer.firstname, dimcustomer.lastname, prospect.emailaddress,
prospect.education, prospect.occupation FROM dimcustomer,prospect WHERE
dimcustomer.customer_id = prospect.prospect_id Limit 0,100;
100 rows in set (00.30 sec)
5. The OPTIMIZE TABLE command should be used regularly, since it de-fragments
the table. This is very effective when a table is updated frequently: it speeds up
table access, reduces the size of the table on disk, and reduces the SELECT query
response time. The OPTIMIZE command should be run when the DBMS is
offline for scheduled maintenance; the command is nonstandard SQL.
Example before optimizing the tables:
Select * from dimcustomer limit 0, 10000;
10000 rows in set (113.90 sec)
Syntax:
Optimize TABLE DimCustomer, Prospect;
Example after optimizing the tables:
Select * from dimcustomer limit 0, 10000;
10000 rows in set (113.20 sec)
6. When aggregate functions such as COUNT() or SUM() are frequently used, it is
advisable to maintain a statistics table that is updated with the aggregate values
every time a row changes. For example, if the statistics table maintains the count
of rows in a table, each time a row is inserted or deleted, the count in the statistics
table is updated. For large tables, using a statistics table is preferred over
aggregate functions, since it improves speed; a sketch follows.
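A minimal sketch of such a statistics table, assuming triggers are available
(MySQL 5.0 and later); the table and trigger names are illustrative:
CREATE TABLE customer_stats (row_count INT NOT NULL);
INSERT INTO customer_stats VALUES (0);
CREATE TRIGGER dimcustomer_ins AFTER INSERT ON DimCustomer
FOR EACH ROW UPDATE customer_stats SET row_count = row_count + 1;
CREATE TRIGGER dimcustomer_del AFTER DELETE ON DimCustomer
FOR EACH ROW UPDATE customer_stats SET row_count = row_count - 1;
-- A count query now reads one row instead of scanning the whole table:
SELECT row_count FROM customer_stats;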
7. SELECT HIGH_PRIORITY lets the select operation run before any pending
write operations; it bypasses them in the lock queue. Its only limitation is that it
cannot be used with UNION. It gives more power in a multiuser, multipurpose
application; however, HIGH_PRIORITY should only be used in cases where we
need instantaneous access to information, and we have to check that the other
operations waiting in the queue are not mission-critical.
Syntax:
SELECT HIGH_PRIORITY * FROM DimCustomer;
8. In-memory tables give fast access to data that either never changes or does not
need to persist after a restart. Using an in-memory table means that a query can
complete without any disk I/O at all; a sketch follows.
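A minimal sketch of building an in-memory lookup table from the example data;
the table name is illustrative:
CREATE TABLE customer_lookup ENGINE=MEMORY
SELECT customer_id, FirstName, LastName FROM DimCustomer;
-- Subsequent reads are served entirely from RAM:
SELECT * FROM customer_lookup WHERE customer_id = 1000;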
9. Columns holding identical information in different tables should have identical
data types; this improves join performance.
10. Indexes are used to improve the speed of queries, but sometimes they can slow
the process down, so it is sensible to check whether a query is actually using its
indexes. This can be checked by using EXPLAIN; if the indexes are not being
used, we can drop them to increase performance.
Syntax:
EXPLAIN SELECT dimcustomer.firstname, dimcustomer.lastname FROM
dimcustomer, prospect WHERE dimcustomer.customer_id = prospect.prospect_id;
11. Avoid using complex SELECT queries on tables that are updated frequently; this
helps in avoiding problems caused by locking the table repeatedly for read and
write updates.
12. For highly dynamic data it is better to use SQL_NO_CACHE, so the cache is not
churned by results that will be invalidated immediately; for more static data,
caching can be left on. This helps in decreasing the overhead of the database.
Syntax:
SELECT * FROM DimCustomer;
18484 rows in set (211.70sec)
Syntax:
SELECT SQL_NO_CACHE * FROM DimCustomer;
18484 rows in set (211.50sec)
13. The SQL_BUFFER_RESULT hint tells MySQL to store the result set in a
temporary table. This can be used for large result sets that require a significant
amount of time to send to the client, to avoid keeping the queried tables locked
for that period of time.
Syntax:
SELECT SQL_BUFFER_RESULT * FROM counties;
14. The SQL_BIG_RESULT hint can be used with DISTINCT and GROUP BY
SELECT statements; it tells the optimizer that the result set will be large.
Example using Select *:
SELECT * FROM DimCustomer;
13690 rows in set (157.20 sec)
Example using Select SQL_BIG_RESULT *:
SELECT SQL_BIG_RESULT * FROM DimCustomer GROUP BY
CustomerKey;
13690 rows in set (148.10 sec)
15. The SQL_SMALL_RESULT hint makes MySQL use fast temporary tables to
store the resulting table instead of sorting.
Example Select *:
SELECT * FROM Prospect GROUP BY Prospect_id;
2059 rows in set (18.90 sec)
Example Select SQL_SMALL_RESULT *:
SELECT SQL_SMALL_RESULT * FROM Prospect GROUP BY Prospect_id;
2059 rows in set (18.60 sec)
16. Benchmarking: Benchmarking is the easiest and most effective way to check the
speed of the server. BENCHMARK() is a built-in MySQL function that measures
the time taken to evaluate a given expression repeatedly; when it is executed, the
result is always 0. It is not used to retrieve data: rather, executing the same
expression n times at different periods of the day shows how much the execution
time varies, and thus gives an idea of when the server is most busy or idle during
the day.
Syntax:
SELECT BENCHMARK(1000000, TAN(0.45));
17. MySQL Query Analyzer: The new MySQL Enterprise release improves
application performance and efficiency for critical deployments with its Query
Analyzer feature. The MySQL Query Analyzer helps in improving application
performance by monitoring query performance and accurately pinpointing the
code that causes poor performance and slows down the system; this feature helps
DBAs work on the queries that make the database slower. With the help of the
MySQL Query Analyzer, the DBA can tune the code by continuously monitoring
and fine-tuning the queries, thus achieving peak performance on MySQL.
18. Normalize first and then denormalize: In the real world of high-traffic websites
and users demanding sub-second response times, we have to use denormalization.
The key to high-performance database access is sticking to single-table SELECT
queries with short indexes.
19. Show process list: The MySQL process list shows which threads are currently
running on the MySQL server. DBAs use SHOW PROCESSLIST to monitor
their MySQL servers for poorly performing queries. We can manually parse
through the aggregated data to determine which MySQL servers have queries that
may be causing a bottleneck, and those queries can then be terminated or
modified.
Syntax:
Show full processlist;
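A long-running query identified in the process list can be terminated with KILL;
the thread id 12345 below is a hypothetical value read from the Id column of the
output:
KILL 12345; -- terminate the offending thread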
20. Storing Intermediate Results: Intermediate, or staging, tables are quite common
in relational database systems, because they temporarily store intermediate
results. In many applications they are useful, but the database requires additional
resources to create them. Always consider whether the benefit they could bring is
greater than the cost of creating them, and avoid staging tables when the
information is not reused multiple times.
Some additional considerations:
a) Storing intermediate results in staging tables can improve application
performance. In general, whenever an intermediate result is usable by
multiple subsequent queries, it is worthwhile to store it in a staging table:
already at the second use of the intermediate result, the benefit of not
re-running a complex statement outweighs the cost of materializing it.
b) Long and complex queries are hard to understand and optimize. Staging
tables can break a complicated SQL statement into several smaller
statements, storing the result of each step.
c) Consider using materialized views. These are pre-computed tables
comprising aggregated or joined data from fact and possibly dimension
tables.
21. Chopping Queries: To improve processing speed, a query can be broken into
smaller chunks that each affect fewer rows. An example is purging old data: we
can use LIMIT with the DELETE statement and delete a limited number of rows
at a time. We thus divide one big process into smaller ones and put less load on
the server.
Example before using Limit:
SELECT dimcustomer.firstname, dimcustomer.lastname, prospect.emailaddress,
prospect.education, prospect.occupation FROM dimcustomer,prospect WHERE
dimcustomer.customer_id = prospect.prospect_id;
2059 rows in set (04.10 sec)
Example after using Limit:
SELECT dimcustomer.firstname, dimcustomer.lastname, prospect.emailaddress,
prospect.education, prospect.occupation FROM dimcustomer,prospect WHERE
dimcustomer.customer_id = prospect.prospect_id Limit 0,100;
100 rows in set (00.30 sec)
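For the purge scenario mentioned above, a minimal sketch; the table, predicate,
and chunk size are illustrative, and the statement is re-run until no rows remain:
DELETE FROM Prospect WHERE YearlyIncome = '0' LIMIT 1000; -- delete at most 1000 matching rows per pass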
22. Join Decomposition: A multiple-join statement can be decomposed into several
single-table queries, whose results are then combined.
Single Join Query:
SELECT * FROM Prospect LEFT JOIN Prospect1 ON Prospect.Prospect_id =
Prospect1.Prospect1_id where Prospect1.Prospect1_id=1;
Join decomposed:
SELECT * FROM Prospect where Prospect.Prospect_id=1;
SELECT * FROM Prospect1 where Prospect1.Prospect1_id=1;
This looks wasteful, but it can save a lot of resources and processing time.
Caching is more efficient for single tables than for joins: when the first SELECT
statement is processed, its data is cached and can be reused by the following
statements, lowering the overall processing time, so running several simple
queries can produce the result faster than running one multi-table join. Also,
when a table is accessed it is locked so that only up-to-date values are read; a join
therefore locks all the tables involved at once, which is not an efficient way of
locking, whereas the decomposed queries lock only one table at a time.
23. Subquery Conversion: Converting a subquery into a SELECT query with a join
statement helps in lowering the load on the processor.
Subquery:
SELECT FirstName, LastName, EmailAddress FROM DimCustomer WHERE
DimCustomer.Customer_id IN (SELECT Prospect.Prospect_id FROM Prospect)
ORDER BY DimCustomer.Customer_id;
2059 rows in set (02.90 sec)
Join:
SELECT DimCustomer.FirstName, DimCustomer.LastName,
DimCustomer.EmailAddress FROM DimCustomer JOIN Prospect ON
DimCustomer.Customer_id = Prospect.Prospect_id;
2059 rows in set (02.80 sec)
24. Query Analyzer: It gives a consolidated view of the entire MySQL environment.
It has customizable rule-based monitoring and alerts that help in identifying
problems before they occur.
25. Avoid computation in the WHERE clause.
26. Only use transactions where they are most necessary. Inserts in auto-commit
mode are the slowest insert statements, since each statement commits separately.
27. Avoid using the SELECT SQL_CACHE hint if the database is updated
frequently; cached SELECTs are faster only with static databases. A miss in the
query cache adds about 20% overhead compared to a normal query without the
query cache.
28. Use temporary views instead of derived tables.
Derived table Syntax:
SELECT * FROM (SELECT * FROM DimCustomer) D;
18484 rows in set (214.60 sec)
Temporary Views Syntax:
CREATE VIEW v AS SELECT * FROM DimCustomer;
Query OK, 0 rows affected (0.00 sec)
Select * from v;
18484 rows in set (212.20 sec)
29. Use TRUNCATE Instead of DELETE:
Every delete operation is logged in the transaction log, whereas TRUNCATE
TABLE is not. Since TRUNCATE TABLE does not get logged, it is faster than
the delete operation. One drawback of TRUNCATE TABLE is that, with no
records in the log, it is impossible to roll back the operation.
Syntax for delete:
DELETE FROM table_name;
Syntax for truncate:
TRUNCATE TABLE table_name;
30. Avoid unnecessary parentheses: by avoiding unnecessary parentheses we save the
parser the time spent processing them.
Syntax:
SELECT * FROM DimCustomer WHERE (customer_id = 1000 OR
CustomerKey = 12060);
2 rows in set (00.30 sec)
Instead use:
SELECT * FROM DimCustomer WHERE customer_id = 1000 OR CustomerKey
= 12060;
2 rows in set (00.30 sec)
4.14 Example
This example shows how using MySQL hints can drastically improve query
execution time. We create two tables, insert the data from an existing dump, and
then access some columns from the two tables.
Solution:
Steps before using optimizing rules:
1. Create two tables.
CREATE TABLE Dimcustomer (
CustomerKey varchar(100),
GeographyKey varchar(100),
CustomerAlternateKey varchar(100),
Title varchar(100),
FirstName varchar(100),
MiddleName varchar(100),
LastName varchar(100),
NameStyle varchar(100),
BirthDate varchar(100),
MaritalStatus varchar(100),
Gender varchar(100),
EmailAddress varchar(100),
YearlyIncome varchar(100),
TotalChildren varchar(100),
NumberChildrenAtHome varchar(100),
EnglishEducation varchar(100),
SpanishEducation varchar(100),
FrenchEducation varchar(100),
EnglishOccupation varchar(100),
SpanishOccupation varchar(100),
FrenchOccupation varchar(100),
HouseOwnerFlag varchar(100),
NumberCarsOwned varchar(100),
AddressLine1 varchar(120),
AddressLine2 varchar(120),
Phone varchar(100),
DateFirstPurchase varchar(100),
CommuteDistance varchar(100),
customer_id mediumint(9),
PRIMARY KEY (customer_id));
CREATE TABLE Prospect (
ProspectiveBuyerKey varchar(100),
ProspectAlternateKey varchar(15),
FirstName varchar(50),
MiddleName varchar(50),
LastName varchar(50),
BirthDate varchar(100),
MaritalStatus varchar(10),
Gender varchar(10),
EmailAddress varchar(50),
YearlyIncome varchar(100),
TotalChildren varchar(100),
NumberChildrenAtHome varchar(100),
Education varchar(40),
Occupation varchar(100),
HouseOwnerFlag varchar(1),
NumberCarsOwned varchar(100),
AddressLine1 varchar(120),
AddressLine2 varchar(120),
City varchar(30),
StateProvinceCode varchar(3),
PostalCode varchar(15),
Phone varchar(20),
Salutation varchar(8),
Unknown varchar(100),
prospect_id mediumint(9),
PRIMARY KEY (prospect_id));
2. Insert the statements in the table.
INSERT INTO DimCustomer VALUES
('11000','26','AW00011000','','Jon','V','Yang','0','24205','M','M','jon24@adve
nture-works.com','90000','2','0','Bachelors','Licenciatura','Bac +
4','Professional','Profesional','Cadre','1','0','3761 N. 14th St','','1 (11) 500
555-0162','37094','1-2 Miles');
INSERT INTO DimCustomer VALUES
('11001','37','AW00011001','','Eugene','L','Huang','0','23876','S','M','eugene1
0@adventure-works.com','60000','3','3','Bachelors','Licenciatura','Bac +
4','Professional','Profesional','Cadre','0','1','2243 W St.','','1 (11) 500
555-0110','37090','0-1 Miles');
----------------------------------------------------------------------------------
INSERT INTO DimCustomer VALUES
('29483','217','AW00029483','','Jésus','L','Navarro','0','21892','M','M','jésus9
@adventure-works.com','30000','0','0','Bachelors','Licenciatura','Bac +
4','Clerical','Administrativo','Employé','1','0','244, rue de la Centenaire','','1
(11) 500 555-0141','37693','0-1 Miles');
13689 rows affected (55.80sec)
INSERT INTO prospect VALUES
('1','21596444800','Adam','','Alexander','13333','M','M','aalexander@lucerne
publishing.com','40000','3','0','Partial
Co','Professional','1','2','566 S. Main','','Cedar City','UT','84720','516-
555-0187','Mr.','0');
INSERT INTO prospect VALUES
('2','3003','Adrienne','','Alonso','14860','M','F','[email protected]'
,'80000','4','0','Bachelors','Management','1','2','7264 St. Peter
Court','','Colma','CA','94014','607-555-0119','Ms.','4');
----------------------------------------------------------------------------------
INSERT INTO prospect VALUES
('2059','37111264500','Zoe','','Ward','27990','M','F','zward@consolidatedmes
senger','20000','0','0','High Schoo','Manual','1','1','25149
Howard Dr','','West Chicago','IL','60185','1 (11) 500 555-0190','Ms.','3');
2059 rows affected (6.80sec)
3. Access columns:
SELECT * FROM dimcustomer,prospect WHERE
dimcustomer.customer_id = prospect.prospect_id;
Result is shown in Figure 5.
Figure 5: Snapshot of the Resultant Table Before Using Hints
2059 rows in set (42.80 sec)
55 Columns
Steps after Using Optimizing Rules:
1. Create tables:
CREATE TABLE Dimcustomer (
customerkey smallint(6),
geographykey smallint(6),
customeralternatekey varchar(100),
Title varchar(100),
FirstName varchar(100),
middlename char(10),
LastName varchar(100),
NameStyle varchar(100),
BirthDate varchar(100),
maritalstatus char(10),
gender char(10),
EmailAddress varchar(100),
YearlyIncome varchar(100),
totalchildren tinyint(6),
numberchildrenathome tinyint(4),
EnglishEducation varchar(100),
SpanishEducation varchar(100),
FrenchEducation varchar(100),
EnglishOccupation varchar(100),
SpanishOccupation varchar(100),
FrenchOccupation varchar(100),
houseownerflag tinyint(6),
numbercarsowned tinyint(6),
AddressLine1 varchar(100),
AddressLine2 varchar(100),
Phone varchar(100),
DateFirstPurchase varchar(100),
CommuteDistance varchar(100),
customer_id mediumint(9),
PRIMARY KEY (customer_id));
CREATE TABLE Prospect (
prospectivebuyerkey smallint(6),
prospectalternatekey varchar(100),
FirstName varchar(100),
middlename char(10),
LastName varchar(100),
BirthDate varchar(100),
maritalstatus char(10),
gender char(10),
EmailAddress varchar(100),
YearlyIncome varchar(100),
totalchildren tinyint(4),
numberchildrenathome tinyint(4),
Education varchar(100),
Occupation varchar(100),
houseownerflag tinyint(4),
numbercarsowned tinyint(4),
AddressLine1 varchar(100),
AddressLine2 varchar(100),
City varchar(30),
stateprovincecode char(10),
PostalCode varchar(100),
Phone varchar(100),
salutation char(10),
unknown tinyint(4),
prospect_id mediumint(9),
PRIMARY KEY (prospect_id));
2. Setting Environmental Parameters
Setting Maximum number of open tables
SET GLOBAL TABLE_CACHE=512;
Altering Index Buffer Size
SET GLOBAL KEY_BUFFER_SIZE=500000;
Enable Query Caching
SET GLOBAL QUERY_CACHE_TYPE=1;
3. Load data in the database using load data infile.
LOAD DATA INFILE 'data.txt' INTO TABLE dname.tname;
13689 rows affected (17.50sec)
4. Add index to the columns that have unique values
ALTER TABLE Dimcustomer ADD INDEX index1 (customerkey,
geographykey);
2059 rows affected (0.80sec)
5. Join tables and only select columns that are required instead of accessing all
columns.
SELECT dimcustomer.firstname, dimcustomer.lastname,
prospect.emailaddress, prospect.education, prospect.occupation FROM
dimcustomer,prospect WHERE dimcustomer.customer_id =
prospect.prospect_id;
Result is shown in Figure 6.
Figure 6: Snapshot of the Resultant Table After Using Hints
2059 rows in set (3.70 sec), 5 Columns
After looking at the response times of these queries, it is confirmed that by
using proper optimization techniques we can make our queries much faster and
cleaner. All the rules and suggestions discussed in this document help in
improving the efficiency of the database. When these rules are implemented
individually we may not see significant changes, but when they are all combined
we see a significant improvement in the efficiency of the database.
Some of the major hints that have improved the performance of the
database significantly are mentioned below.
1. Optimizing the Hardware: For bulky tables, 64-bit hardware is preferred.
Increasing RAM always helps in speeding up the database, since it allows
more key pages to stay in RAM. Gigabit Ethernet is preferred, since it
improves latency.
2. MySQL Query Analyzer: With the help of the MySQL Query Analyzer we
can examine query activity and find the most expensive queries, which can
then be studied and replaced with efficient queries that perform the same
task.
3. Activating the Query Cache: With the query cache activated, the result of a
SELECT query is stored in the cache, which speeds up repeated query
execution, since the result is available in the cache for faster access.
4. Optimizing Tables: When creating a table, it is suggested that most columns
be declared NOT NULL by default. Using appropriate datatypes always
improves access time: it is suggested to use datatypes that are small in size,
such as SMALLINT, MEDIUMINT, and CHAR, and, if possible, fixed-length
attributes rather than dynamic-length attributes like VARCHAR or BLOB.
The primary key should be as small as possible. It is suggested to index keys
that have high selectivity, and composite indexing is suggested since it
decreases system I/O. Horizontal partitioning significantly improves the
performance of the table: it provides more possibilities for indexing and
lowers the number of rows to be accessed, which improves execution time
manyfold.
5. Loading Tables: The two most effective ways of loading a table are
multiple-row insert techniques and the LOAD DATA INFILE command;
LOAD DATA INFILE is used for loading large amounts of data into
tables.
6. Optimizing Queries: It is suggested to use LIMIT whenever possible; this
helps in saving MySQL resources. After making changes to a table, it is
always suggested to de-fragment it with the OPTIMIZE TABLE command,
which speeds up table access. Use the SELECT SQL_BIG_RESULT hint
instead of a plain SELECT for large result sets; it is much faster than
SELECT alone. A multiple-join statement should be decomposed into several
single-table queries that are combined later. This saves a lot of resources:
when the first SELECT statement is processed, it caches data that the
following statements can reuse, thus lowering the overall processing time.
Chapter 5
CONCLUSION
MySQL optimization is a set of best practices geared towards optimizing the
performance of the database and its queries. The goal of optimizing database is to
minimize the response time and increase the throughput of the database. This can be
achieved by understanding the application requirements as well as the logical and
physical structure of the data. To improve the performance of the database, it is important
that all the performance issues should be considered throughout the database
development cycle. To optimize the database it is important to identify the bottlenecks
and try to focus on those areas. Although system-level performance issues such as
memory, CPU, and hardware are also important for the overall tuning of the
database, the performance gained from those areas is more incremental.
Optimizing the database is achieved by analyzing the applications and queries
that interact with it. The performance of the database is usually affected by
unexpectedly long-lasting queries and by updates that affect the database schema.
Some of the main causes of long-lasting queries are slow network communications, inefficient
queries, lack of useful indexes, out-of-date statistics, inefficient environmental
parameters, and inefficient table structures.
The goal of the project is to provide rules and hints that help in optimizing
the queries and the database, therefore making the system faster and more efficient. This, in
effect, saves a lot of processing time and resources, which further improves the
performance of the system. This document has covered most of the topics that help in
optimizing the database and its queries in an efficient manner. An optimized database can
avoid unnecessary hardware and software upgrades.
In this project I have presented techniques that can help in tuning queries, tables,
and database schemas. The document suggests what values should be assigned to the
environmental variables so that they in turn provide an optimized environment to work
in. It discusses what kind of storage to use for specific kinds of data and what datatypes
to use for different columns, to improve the overall performance of the database. In
addition, it suggests different ways to load data into the database depending upon the
type and size of the data, and how to break down complex queries into simpler queries
that use minimum resources and still provide the same results. It describes how to use the
MySQL Query Analyzer to check the load on the database at different time intervals,
thereby helping to analyze queries and see how much resource a particular query
consumes. This document provides a
guideline that, if followed accordingly, helps in optimizing queries and tables and thus in
improving the overall performance of the database.
The document mentions several different rules and hints that, when used
properly, can save an enormous amount of resources and thereby improve the throughput
of the entire database. When these hints are used individually they do not show a
significant change in the processing time; however, when all the hints and rules are used
together, a significant improvement in the response time of the queries can be seen. The
practices discussed in the document not only aid in optimizing the database but also play
a key role in making the database more proficient.
The rules discussed in this document help in optimizing the database
efficiently. Although the document is not exhaustive, it covers most of the concepts that
help in query and database optimization. It is not the final word in MySQL optimization,
but it can serve as a building block for upcoming projects related to MySQL
optimization. With the above-mentioned rules and the document I have presented, I have
completed the task of addressing the topic, “Improving the MySQL Query
Optimization”.
BIBLIOGRAPHY
[1] Sun and Oracle, [Online]
Available: http://www.sun.com/software/products/mysql/features.jsp
[2] MySQL and Sun, “MySQL Server FAQ”, [Online]
Available: http://www.mysql.com/industry/faq/
[3] MySQL and Sun, “Overview of MySQL Storage Engine Architecture”, [Online]
Available: http://dev.mysql.com/doc/refman/5.1/en/pluggable-storage-overview.html
[4] Techotopia MySQL Database Architecture, [Online]
Available: http://www.techotopia.com/index.php/MySQL_Database_Architecture
[5] John W. Horn and Michael Grey, “MySQL: Essential Skills” McGraw-Hill 2004
[6] Robert D. Schneider, “MySQL® Database Design and Tuning”, 2005
[7] MySQL Reference Manual, [Online]
Available: http://dev.mysql.com/doc/
[8] Michael Kofler , “MySQL 5”, Third Edition
[9] Steven Feuerstein, Guy Harrison, “MySQL Stored Procedure Programming”, 2006
[10] A MySQL Technical White Paper, “MySQL Cluster 6.2”, 2008
[11] Technical White Paper, “MySQL Cluster Architecture Overview - A MySQL”, 2004
[12] Alex Davies and Harrison Fisk, “MySQL Clustering”, 2008
[13] Charles A Bell, “Expert MySQL”, 2007
[14] Peter Zaitsev, “White Paper Advanced-MySQL-Performance-Optimization”
[15] Baron, Peter, Jeremy and Derek, “High Performance MySQL”
[16] Optimizing MySQL, [Online]
Available: http://dev.mysql.com/tech-resources/presentations/index.html
[17] Technical White Paper, “MySQL Performance Benchmarks”, 2006
[18] MySQL Press, “MySQL Administrative Guide”
[19] Sergey Petrunya, Sun Microsystems White Paper “Understanding and Control of
MySQL Query Optimizer Traditional and Novel Tools and Techniques”, 2006
[20] Charles A. Bell, “Expert MySQL”, 2007