* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download chapter 1 - Sacramento - California State University
Survey
Document related concepts
Relational algebra wikipedia , lookup
Serializability wikipedia , lookup
Microsoft Access wikipedia , lookup
Oracle Database wikipedia , lookup
Entity–attribute–value model wikipedia , lookup
Functional Database Model wikipedia , lookup
Concurrency control wikipedia , lookup
Ingres (database) wikipedia , lookup
Microsoft SQL Server wikipedia , lookup
Microsoft Jet Database Engine wikipedia , lookup
ContactPoint wikipedia , lookup
Open Database Connectivity wikipedia , lookup
Extensible Storage Engine wikipedia , lookup
Versant Object Database wikipedia , lookup
Clusterpoint wikipedia , lookup
Transcript
IMPROVING QUERY OPTIMIZATION OF MYSQL Samit Wangnoo B.S., Jammu University, India, 2006 PROJECT Submitted in partial satisfaction of the requirements for the degree of MASTER OF SCIENCE in COMPUTER SCIENCE at CALIFORNIA STATE UNIVERSITY, SACRAMENTO FALL 2009 IMPROVING QUERY OPTIMIZATION OF MYSQL A Project by Samit Wangnoo Approved by: __________________________________, Committee Chair Dr. William J Mitchell ___________________________________, Second Reader Dr. Scott Gordon ______________________________ Date ii Student: Samit Wangnoo I certify that this student has met the requirements for format contained in the University format manual, and that this Project is suitable for shelving in the Library and credit is to be awarded for the Project. ______________________________, Graduate Coordinator Dr. Cui Zhang Department of Computer Science iii _________________ Date Abstract of IMPROVING QUERY OPTIMIZATION OF MYSQL by Samit Wangnoo The intent of the project is to study the MySQL design limitations and tradeoffs; and produce a solution of how to tackle these limitations and tradeoffs. Thereby improving the efficiency of the MySQL queries and making the system fast. The goal of the project is to show how we can reduce the query timing by using benchmarking while we are creating a database and also by following specific optimization rules to get more efficient output in less time. Trying to work on these limitations and finding an alternate will help improve the efficiency of MySQL as a whole. The project is based on benchmarking, which is intended to measure the runtime performance of scalar expressions; it has some significant implications for the way that it is used and also the way the results are interpreted. Benchmarking is done on the application and database to find out where the bottlenecks are. After fixing one bottleneck (or by replacing it with a “dummy” module), we can proceed to identifying the iv next bottleneck. Even if the overall performance of the application is currently acceptable, we need to at least make a plan for each bottleneck and decide how to solve it. ______________________________, Committee Chair Dr.William J Mitchell _______________________________ Date v TABLE OF CONTENTS Page List of Figures ………………………………………………………………………viii Chapter 1. INTRODUCTION …………………………………………………………………1 2. DATABASE SYSTEM ARCHITECTURE ……………………………………….4 2.1 MySQL Architecture ……………………………………………………..5 2.2 Features of MySQL ………………………………………………………20 2.3 Limitations and Tradeoffs of MySQL ……………………………………26 3. SCHEMA OPTIMIZATION ………………………………………………………27 3.1 Hardware ………………………………………………………………...27 3.2 Software …………………………………………………………………..29 3.3 Designing Database to Improve Speed …………………………………...30 3.4 Advantages of Using Clustering ………………………………………….38 4. OPTIMIZATION …………………………………………………………………..40 4.1 Bottlenecks ………………………………………………………………..40 4.2 Measuring Performance …………………………………………………..42 vi 4.3 Optimizing Hardware for MySQL ………………………………………43 4.4 Optimizing Disks ………………………………………………………..43 4.5 Optimizing OS …………………………………………………………..45 4.6 Choosing API ……………………………………………………………46 4.7 Compiling and Installing MySQL ………………………………………47 4.8 Execution Path of a Query ………………………………………………48 4.9 Query Analyzer ………………………………………………………….49 4.10 Optimizing Environmental Parameters ………………………………...50 4.11 Optimizing Tables Structure …………………………………………...64 4.12 Optimizing How to Load Tables ………………………………………75 4.13 Optimizing Queries ……………………………………………………78 4.14 Example ………………..……………………………………………...92 5. CONCLUSION ………………………………………………………………….109 Bibliography ………………………………………………………………………..112 vii LIST OF FIGURES Figure Page 1. Primary Components of MySQL Architecture …………………………………...6 2. Architecture of MyODBC ………………………………………………………..14 3. Execution Path of a Query ………………………………………………………..48 4. Query Analyzer SQL Error and Warning Counts ………………………………...50 5. Snapshot of the Resultant Table Before Using Hints ……………………………..99 6. Snapshot of the Resultant Table After Using Hints ………………………………105 viii 1 Chapter 1 INTRODUCTION MySQL is the most popular open source relational database management system based on SQL (Structured Query Language). Relational database management system is a program that lets you create, update, and administer a relational database. It stores data in separate tables rather than putting all data in a common repository. This helps in improving speed and flexibility to the database. MySQL is developed by MySQL AB. MySQL AB is a commercial company that provide services to the MySQL database [1]. MySQL is proven and cost-effective database solution that helps in reducing the cost of the database software infrastructure [1]. MySQL reduces the Total Cost of Ownership (TCO) of database software by reducing database licensing cost, lowering hardware expenditure, reducing administration, engineering and support cost. MySQL is easy to install and deploy, reliable and available and it has embedded library [2]. MySQL has become the most popular open source database because of its consistent fast performance, high reliability and ease of use. Many of the world’s largest and fast growing organizations use MySQL such as Yahoo!, Alcatel-Lucent, Google, Nokia, Youtube and Booking.com. MySQL is a part of the famous open source stack 2 LAMP (Linux, Apache, MySQL, PHP/ Pearl/ Python). MySQL is flexible since it can work on different platforms like Linux, Windows, Unix, BSD (Mac OS X), Handheld (Windows CE). MySQL gives both open source freedom and 24 X 7 supports The new MySQL Enterprise improved the application performance and efficiencies for critical deployments using the new Query Analyzer feature. The MySQL Query Analyzer helps in improving the application performance by monitoring query performance and accurately pinpointing the code that causes poor performance and slow down the system. This helps the DBAs to work on the queries that make the database slower. With the help of MySQL Query Analyzer, DBA can tune the code by continuously monitoring and fine tuning the queries, thus helps in achieving peak performance on MySQL. The project is to show how to reduce the query execution time by using benchmarking and to work on MySQL limitations and finding an alternate which can help in improving the efficiency of MySQL as a whole. The project is based on benchmarking. Benchmarking application and database helps in finding out the bottlenecks in the project. After fixing one bottleneck (or by replacing it with a “dummy” module), we can proceed to identifying the next bottleneck. Even if the overall performance of the application is currently acceptable, we need to at least make a plan for each bottleneck and decide how to solve it. Benchmarking helps in designing and 3 selecting the proper components for a successful MySQL evaluation by working through the basic functional testing and evaluating the best practices. The benchmarking can help optimize the design of the database and query processing time. Hence, the objective of the project is to suggest some rules and hints that will help in optimizing queries and thereby making the system faster and more efficient. This, in effect, saves a lot of processing time, resource time, and overall cost of the system; thus, it optimizes the queries and thereby makes the system faster. 4 Chapter 2 DATABASE SYSTEM ARCHITECTURE According to Techotopia, “Database Management System (DBMS) is a software system that facilitates the creation, maintenance and use of an electronic database” [4]. Relational database management systems (RDBMS) implement the relational model of tables and relationships. Database Management System (DBMS) is classified into two ways known as Shared-File and Client-Server. A Shared-File DBMS interacts directly with the underlying database files that are used mostly on desktop computers and for databases that are designed does not need too much storage space. The Client-Side is further divided into two components. The server component resides on the physical computer as database files. This is responsible for all the interaction with the database. The other component is client side which sends requests to the server who processes the requests and sends the result back to the client [4]. The typical example of the shared file DBMS is Microsoft Access and for clientserver DBMS is MySQL. Client-Server DBMS is more efficient than shared file. The advantages of Client Server DBMS are, client does not have to reside on the same computer as server. It can send request from any computer on the network to the server 5 that can be on a remote host. Since server is on a remote host, this makes server invisible to all its clients, making the database available to more users than shared- file DBMS. Separating client from server increases the range of client types. Client can be written in any programming language such as C, C++, or Java or web based applications can be developed using PHP and JSP. MySQL has a unique Pluggable storage engine architecture that gives the user more flexibility to choose from a variety of purpose-specific storage engines. Some of the available engines include InnoDB, MyISAM and NDB. The most popular engine is the InnoDB, which is a general purpose engine. Oracle has acquired the InnoDB engine and has continued to develop and maintain it. The other popular database engine is MyISAM, this engine is particularly famous in web based environment. This engine was present from the beginning. This engine provides high performance but this engine does not support transactions so it is not an ideal engine to choose in DBMS environment. The other famous database engine is NDB. It is used for MySQL cluster only. It is famous particular in financial applications [4]. 2.1 MySQL Architecture Understanding the MySQL database architecture is important in order to efficiently manage the MySQL database. The important components of the MySQL architecture are explained in the Figure 1 which is taken from MySQL and Sun Website [3]. 6 Connectors Native C API, JDBC, ODBC, .NET, PHP, Perl, Python, Ruby, Cobol MySQL Server Connection Pool Authentication, Thread Reuse, Connection Limits, Check Memory, Caches Management Services and Utilities SQL Interface Backup and Recovery, Security, Replication, Cluster, Administration, Configuration, Migration and Metadata DML, DDL, Stored Procedure, Views, Triggers, etc. Parser Optimizer Query Translation, Object Privilege Access Paths , Statistics Caches & Buffers Global and Engine Specific Caches and Buffers Pluggable Storage Engine MyISAM InnoDB NDB Archive Federated Memory Merge Figure 1: Primary Components of MySQL Architecture [3] 7 2.1.1 Primary Components of MySQL Architecture: MySQL provides lot of different storage engines depending upon the requirements. MySQL has a flexible features, it can have different mixture of storage engines and tables at the same time in a single database. Database Storage Engines and Table Types are responsible for accumulating and retrieving information, the database storage engine lies at the heart of your MySQL installation. When it comes to picking a specialized storage engine or table type, MySQL offers database designers and administrators a lot of choices. [3] MySQL supports both transaction-safe and non transaction-safe tables. Advantages of Transaction-Safe Tables: i. It allows execution of multiple SQL statements in single operation. ii. If a transaction fails, it can be restored to the previous safe state by using recovery logs and backups. Advantages of Non-Transaction-Safe Tables: i. It is much faster and it uses disk space as there is no overhead transaction. ii. It uses less memory to perform update. 2.1.2 Famous Pluggable Storage Engines in MySQL: 8 (a) InnoDB: InnoDB is MySQL’s high performance, transaction-safe engine. It provides advanced multi user concurrency capabilities by allowing you to lock individual rows rather than entire tables. This engine is recommended where table needs support for the transactions. It is used when we are storing very large amount of data in a tables. The only overhead of this engine is it consumes more disk space than any other engine [6]. Properties of InnoDB: i. InnoBB table supports transaction, ACID, foreign keys referentialintegrity constraint, data checksums. ii. Supports different isolation modes. iii. It does not support key compression iv. Both data and indexes are cached in memory. v. InnoDB tables are clustered using primary keys. (b) MyISAM: The MyISAM table type is mature, stable, and simple to manage. MyISAM is a direct descendent of ISAM database engine. MyISAM was the original product from MySQL. This is the default engine for many platforms. The default MySQL engine is fast, compressible, and FULLTEXT-searchable. Some 9 of its advantages are like high speed operations that do not require integrity of the transactions. MyISAM boosts system response by letting administrators place their data and index files in separate directories, which can be stored on different disk drives [6]. Properties of MyISAM: i. No transactions get corrupted even there is a power failure. ii. It has small disk and memory footprint. iii. Only indexes are cached in MySQL so it saves lot of cache and buffer space. iv. It allows table locks and also allows concurrent inserts [14]. (c) Memory / Heap: Memory previously known as Heap. These tables are memorybased, extremely fast and easy to configure, letting developers leverage the benefits of in-memory processing via a standard SQL interface. HEAP tables use hashed indexes and are stored in memory. This makes them very fast, but because they are stored in memory, if your MySQL server crashes, you will lose all data stored in them. This makes them unsuitable for everyday storage. However, HEAP tables are useful for temporary tables, such as those that contain real-time 10 statistics that are calculated anew each time the web page that displays them is loaded. Properties of Heap: Merge: MERGE tables are two or more identical MyISAM tables joined by a UNION statement. You can only select, delete, and update from the collection of tables. If you drop the MERGE table, you are dropping only the MERGE specification. That means the MERGE table no longer exists, but the MyISAM tables it was constructed from and the data in them are still intact. The most common reason to use MERGE tables is to get more speed. You can split a big, read-only table into several parts, and then put the different table parts on different disks. This results in faster access times, and therefore more efficient searches. Also, if you know exactly what you are looking for within the split parts, you can search in just one of the split tables for some queries, or if you need to search the entire table, use a MERGE command to access the parts as a whole. It is faster to repair the individual files that are mapped to a MERGE file than to try to repair one huge file [5]. 11 (d) Federated: MySQL version 5.0.3 introduces a new storage engine designed for distributed computing. Specifying the Federated option when creating your table tells MySQL that the table is actually resident in another database server. MySQL uses an URL-like formatted string to tell MySQL the location of the remote table as part of the COMMENT portion of your Create Table statement. MySQL reads the SQL statements, performs some internal transformations, and then accesses the Federated table via its client API. The results are then presented to you as if the query was executed locally. (e) NDB: MySQL's new clustering capabilities rely on the NDB storage engine. By spreading the data and processing load over multiple computers, clustering can greatly improve performance and reliability. (f) Archive: This engine is used chiefly for creating large, compressed tables that will not need to be searched via indexes. In addition, we cannot use any data alteration statements, such as UPDATE or DELETE, although you are free to use SELECT and INSERT. (g) CSV: By creating comma-separated files (.csv), this storage engine makes it very easy to feed other applications that consume these kinds of files with MySQLbased data. 12 2.1.3 Connectors/API: Connectors provide database application developers and thirdparty tools with packaged libraries of standards-based functions to access MySQL. These libraries range from Open Database Connectivity (ODBC) technology through Java and .NET-aware components. By using the ODBC connector to MySQL, any ODBC-aware client application (for example, Microsoft Office, report writers, Visual Basic) can connect to MySQL without knowing the vagaries of any MySQL-specific keyword restrictions, access syntax, and so on; it's the connector's job to abstract this complexity into an easily used, standardized interface. MySQL AB and several third parties provide application programming interface (API) libraries to let developers write client applications in a wide variety of programming languages, like C (provided automatically with MySQL), C++, Eiffel, .NET, Perl, PHP, Python, Ruby, Tcl. 2.1.4 Parser: Parser decomposes the SQL commands it receives from calling programs into a form that can be understood by the MySQL engine. The objects that will be used are identified, along with the correctness of the syntax. The Syntax Parser also checks the objects being referenced to ensure that the privilege level of the calling program allows it to use them. [6] 13 2.1.5 SQL Interface: The SQL interface provides the mechanism to receive commands and transmit results to the user. The SQL interface is built on ANSI SQL standards. 2.1.6 Cache and Buffers: Cache and Buffers help in ensuring that the data that is referenced recently will be in the available in the cache for quick response. 2.1.7 Connectors: MySQL connectors are drivers that provide connectivity to the MySQL server for client program. There are currently five MySQL connectors: a) Connector/ODBC: ODBC provides driver support for connecting to a MySQL server using the Open Database connectivity (ODBC) API. Support is available for ODBC connectivity from Windows, Unix and Mac OS X platforms. MyODBC Architecture: MyODBC architecture is based on five components Application, Driver Manager, DSN Configuration, Connector /ODBC and MySQL Server. MyODBC Architecture discussed in Figure 2 is taken from the MySQL Reference Manual [7]. 14 Figure 2: Architecture of MyODBC [7] Application: The Application uses the ODBC API to access the data from the MySQL server. The ODBC API communicates with the Driver Manager. The Application communicates with the Driver Manager using the standard ODBC calls. The Application does not care where the data is 15 stored, how it is stored, or even how the system is configured to access the data. It needs to know only the Data Source Name (DSN). A number of tasks are common to all applications, no matter how they use ODBC. These tasks are: Selecting the MySQL server and connecting to it Submitting SQL statements for execution Retrieving results (if any) Processing errors Committing or rolling back the transaction enclosing the SQL statement Disconnecting from the MySQL server Because most data access work is done with SQL, the primary tasks for applications that use ODBC are submitting SQL statements and retrieving any results generated by those statements. Driver Manager: The Driver Manager is a library that manages communication between application and driver or drivers. It performs the following tasks: Resolves Data Source Names (DSN): The DSN is a configuration string that identifies a given database driver, database, database host and optionally authentication information that enable an ODBC application to 16 connect to a database using a standardized reference. Because the database connectivity information is identified by the DSN, any ODBC compliant application can connect to the data source using the same DSN reference. This eliminates the need to separately configure each application that needs access to a given database; instead you instruct the application to use a pre-configured DSN. Loading and unloading of the driver required to access a specific database as defined within the DSN. For example, if you have configured a DSN that connects to a MySQL database then the driver manager will load the MyODBC driver to enable the ODBC API to communicate with the MySQL host. Processes ODBC function calls or passes them to the driver for processing. b) Connector/Net: Net enables developers to create .NET applications that use data stored in a MySQL database. Connector/NET implements a fully-functional ADO.NET interface and provides support for use with ADO.NET aware tools. Applications that need to use Connector/NET can be written in any of the supported .NET languages [7]. 17 Connector/NET enables developers to easily create .NET applications that require secure, high performance data connectivity with MySQL. It implements the required ADO.NET interfaces and integrates into ADO.NET aware tools. Developers can build applications using their choice of .NET languages. Connector/NET is a fully managed ADO.NET driver written in 100% pure C#. Connector/NET supports the following: All the MySQL 5.0 features. Large-packet support for sending and receiving rows and BLOB’s up to 2 gigabytes in size. Protocol compression which allows for compressing the data stream between the client and server. Support for connecting using TCP/IP sockets, named pipes, or shared memory on Windows. Support for connecting using TCP/IP sockets or Unix sockets on Unix. Support for the Open Source Mono framework developed by Novell. Fully managed, does not utilize the MySQL client library. c) Connector /J: J provides driver support for connecting to MySQL from a Java application using the standard Java Database Connectivity (JDBC) API. 18 MySQL provides connectivity for client applications developed in the Java programming language via a JDBC driver, which is called MySQL Connector/J. MySQL Connector/J is a JDBC-3.0 Type 4 driver, which means that is pure Java, implements version 3.0 of the JDBC specification, and communicates directly with the MySQL server using the MySQL protocol [7]. d) Connector /MXJ: MXJ is a tool that enables easy deployment and management of MySQL server and database through your Java application. MySQL Connector/MXJ is a Java Utility package for deploying and managing a MySQL database. MySQL Connector/MXJ is a solution for deploying the MySQL database engine (mysqld) intelligently from within a Java package. Deploying and using MySQL can be as easy as adding an additional parameter to the JDBC connection url, which will result in the database being started when the first connection is made. This makes it easy for Java developers to deploy applications which require a database by reducing installation barriers for their end-users [7]. MySQL Connector/MXJ makes the MySQL database appear to be a java-based component. It does this by determining what platform the system is running on, selecting the appropriate binary, and launching the executable. It will also optionally deploy an initial database, with any specified parameters. 19 e) Connector/PHP: PHP is a Windows-only connector for PHP that provides the mysql and mysqli extensions for use with MySQL 5.0.18 and later. PHP is a server-side, HTML-embedded scripting language that is used to create dynamic Web pages. It is available for most operating systems and Web servers, and can access most common databases, including MySQL. PHP may be run as a separate program or compiled as a module for use with the Apache Web server [7]. f) Connector/ODBC: The MySQL Connector/ODBC drivers also called MyODBC drivers that provide access to a MySQL database using the industry standard Open Database Connectivity (ODBC) API. It was developed according to the specifications of the SQL Access Group and defines a set of function calls, error codes, and data types that can be used to develop database-independent applications. ODBC usually is used when database independence or simultaneous access to different data sources is required [7]. 2.1.8 Query Optimizer: “Query Optimizer is a part of the server that takes a parsed SQL query and produces a query execution plan.” MySQL query optimizer restructures the query by first restricting using SELECT to narrow the number of tuples and then it performs the projections to reduce the number of attributes and finally evaluates any join condition. 20 Query Optimizer streamlines the syntax for use by the Execution Component, which then prepares the most efficient plan of query execution. The Query Optimizer checks to see which index should be used to retrieve the data as quickly and efficiently as possible. It chooses one from among the several ways it has found to execute the query and then creates a plan of execution that can be understood by the Execution Component. 2.2 Features of MySQL: a) Relational Database: A relational database is a database which stores data in different tables rather than having all data in a single table; this improves the speed and flexibility of the database. MySQL is one of the famous relational databases. b) Client/Server Architecture: MySQL is a client/server architecture system. There is a database server (MySQL) and arbitrary many clients (application programs), which communicate with the server, for example query data, save changes, etc. The clients can run on the same computer or on internet. Most of the big databases use client server architecture. The databases that use client server architecture are Oracle, MySQL, and Microsoft SQL Server. There are some databases that use file-server system; these are Microsoft Access, dBase, FoxPro. The main problem with filer-server databases are when they work on network their efficiency decreases as the number of users increase on the network [8]. 21 c) SQL Compatibility: MySQL supports SQL and adheres to the current SQL standards although with some restrictions and some other extensions. d) Views: A view is a subset of a database. Views are tables designed by users using queries to display which columns they require, what order they require, how data is sorted and what type of data are displayed. Unlike ordinary tables (base tables) in a relational database, a view does not form part of the physical schema: it is a dynamic, virtual table computed or collated from data in the database. Changing the data in a table alters the data shown in subsequent invocations of the view. Beginning with version 5.0.1, MySQL includes support for named views, usually referred to simply as “views” [8]. e) SubSelect: A subselect specifies a result table derived from the tables, views or nicknames identified in the FROM clause. The derivation can be described as a sequence of operations in which the result of each operation is input for the next. f) Stored Procedure: Beginning with version 5.0, MySQL includes stored procedure. A stored procedure is a program that runs directly on the database server. Stored procedures are written with SQL and can be used to improve performance and help with ease of development. Stored procedure is useful when we have a script that gets called often or uses any looped queries you may be generating a lot more network traffic than you should be. Stored procedures will cut down on long queries being sent over the network by turning a potentially 22 long query into a short alias. Using complex stored procedures could save you a lot of time when executing a query. By using stored procedures you're adding an extra data-layer, so to speak. This means you may be able to fix problems in your application by simply editing a stored procedure, instead of having the change your application code. g) Triggers: Support for triggers is included beginning with MySQL 5.0.2. Triggers are stored routines much like procedures and functions however they are attached to tables and are fired automatically when a set action is performed on the data in that table. Triggers are programmable events that react to queries and reside directly on the database server. Triggers can be executed before or after INSERT, UPDATE or DELETE statements. h) Unicode: MySQL has supported all conceivable character sets since version 4.1, including Latin-1, Latin-2, and Unicode (either in the variant UTF8 or UCS2) [8]. i) Full-Text Search: The MySQL has made it extremely easy to add FULLTEXT searching to the tables. This is a built in functionality in MySQL that allows users to search through certain tables for matches to a string. These are created much like regular KEY and PRIMARY KEY indices. Full-text search simplifies and accelerates the search for words that are located within a text field. If we employ MySQL for storing text, we can use full-text search to implement simply an efficient search function. There are some restrictions of full-text search. 23 First, MySQL automatically orders the results by their relevancy rating. Second, queries that are longer than 20 characters will not be sorted. Third, and most importantly, the fields in the MATCH () should be identical to the fields listed in the table's FULLTEXT definition. j) Replication: MySQL replication allows you to have an exact copy of a database from a master server on another server (slave), and all updates to the database on the master server are immediately replicated to the database on the slave server so that both databases are in sync [8]. This is not a backup policy because an accidentally issued DELETE command will also be carried out on the slave; but replication can help protect against hardware failures though. k) Transactions: MySQL 4.0 and onwards supports transactions and thus it follows the ACID test. ACID is Atomicity, Consistency, Isolation, Durability. It helps building a robust web application and simultaneously reduces the possibility of data corruption in the database. The term "transaction" refers to a series of SQL statements which are treated as a single unit by the RDBMS. Typically, a transaction is used to group together SQL statements which are interdependent on each other; a failure in even one of them is considered a failure of the group as a whole. In the context of a database system, a transaction means the execution of several database operations as a block. Thus, a transaction is said to be successful only if all the individual statements within it are executed successfully. This holds 24 even if in the middle of a transaction there is a power failure, the computer crashes, or some other disaster occurs. Transactions also give programmers the possibility of interrupting a series of already executed commands (a sort of revocation). In many situations this leads to a considerable simplification of the programming process. In spite of popular opinion, MySQL has supported transactions for a long time. One should note here that MySQL can store tables in a variety of formats. The default table format is called MyISAM, and this format does not support transactions. But there are a number of additional formats that do support transactions. l) Foreign Key Constraints: Foreign key was introduces in MySQL 4.0 to make MySQL proper RDBMS. MySQL supports foreign key constraints for InnoDB tables but it does not support foreign key constraint in MyISAM. In order to set up a foreign key relationship between two MySQL tables, three conditions must be met; both tables must be of the InnoDB table type, the fields used in the foreign key relationship must be indexed and the fields used in the foreign key relationship must be similar in data type. m) GIS Functions: Since version 4.1, MySQL has supported the storing and processing of two dimensional data. Thus MySQL is well suited for GIS (geographic information systems) applications [8]. 25 n) Programming Languages: There are quite a number of APIs (application programming interfaces) and libraries for the development of MySQL applications. For client programming you can use, among others, the languages C, C++, Java, Perl, PHP, Python, and Tcl. o) ODBC: MySQL supports the ODBC interface Connector/ODBC. This allows MySQL to be addressed by all the usual programming languages that run under Microsoft Windows (Delphi, Visual Basic, etc.). The ODBC interface can also be implemented under Unix, though that is seldom necessary. Windows programmers who have migrated to Microsoft’s new .NET platform can, if they wish, use the ODBC provider or the .NET interface Connector/NET. p) Platform Independence: It is not only client applications that run under a variety of operating systems; MySQL itself (that is, the server) can be executed under a number of operating systems. The most important are Apple Macintosh OS X, Linux, Microsoft Windows, and the countless Unix variants, such as AIX, BSDI, FreeBSD, HP-UX, OpenBSD, Net BSD, SGI Iris, and Sun Solaris. q) Speed: MySQL is considered a very fast database program. This speed has been backed up by a large number of benchmark tests (though such tests—regardless of the source—should be considered with a good dose of skepticism) [8]. 26 2.3 Limitations and Tradeoffs of MySQL: 1. All columns have default values. 2. If some garbage or out-of-range data is entered in the column, MySQL will set some appropriate value; it does not report any error. For numerical values, MySQL sets the column value as 0 or the smallest value or the largest value depending upon the requirement. 3. All calculated expressions return a value that can be used instead of signaling an error condition. For example, 1/0 returns NULL [7]. 27 Chapter 3 SCHEMA OPTIMIZATION Optimizing a database is a complex task, one need to take care of design and the system limitations to get the whole application optimized. To optimize a system we need to know what kind of processing the system is doing and then what bottlenecks it can have. For setting an environment for Optimization we need to review the Hardware, Software and Connectivity we are using for the database. 3.1 Hardware: Hardware plays an important role in the performance of a database query. Following properties of the hardware device affects the database: 1. Processors: Processor is very important when it come to the performance of the query. If the processor is fast the query will run faster, so if the production and test machines have different processor speeds the result of the queries will be lot different. The numbers will be different if we have more than one processor in the system. So if there is a mismatch between the test machine and the production machine it becomes a problem while we are setting engine parameters. 28 2. Memory: Memory is also one of the key features in the optimization of the query. If we compare two machines with different memories the one with higher memory will perform better. Prior to using the machine we need to check that both the machines have same memory so that there is no delay between the production machine and the client machine. 3. Mass Storage: Disk drive capacity and speed dramatically improve each year. This gives us more choices to pick a appropriate storage space required for the project. But this can introduce anomalies when it comes to performance testing. For example, if we want to examine why bulk data operations take so long on your production server? We don’t have enough disk capacity on our test platform to unload/reload 100-GB production data set, so what we can do is we can connect a high storage external drive. We can connect this external device to the test machine using USB 1.1. Unfortunately, USB 1.1 is orders-of-magnitude slower than either USB 2.0 or internal storage. This means that any disk-based performance numbers we receive, will not be meaningful, and could lead you to make incorrect assumptions. 4. Attached Devices: Every external device connected to the system can modify the performance of the MySQL production server. The same holds true for your test environment. The test platform can be connected to different external devices can 29 change the performance of the system; the other factors can be firewall or so on. These extra responsibilities can render any performance information useless. In ideal environment the test system will be a mirror image of the production server. Usually we don’t have different test system, so what we can do is run the tests on the production platform during periods of low activity. 3.2 Software: Configuring and installing the correct software is important for any system. 1. Database: One of the most effective aspects of MySQL open source architecture is frequent releases and fixing the bugs delivered in the market. It is important to use the same version of MySQL on both the production and test environment. MySQL releases can often change the performance of the applications as new features are provided and old features are deleted. 2. Application: While using different packaged application, it is important to have same version of software on both the test and production machine. All the new patches need to be installed on the machine, this increases the responsiveness. 3. Operating system: MySQL’s cross-platform portability is one of the benefits of its open source heritage. However, this needs to be taken care properly otherwise it can create problems in determining the issues. For example, if the production 30 server is based on Windows Server 2003, whereas testing server is running Linux. Given these differences, it will be too hard to transfer results from one platform to another platform. There are significant architectural and configuration differences between these operating systems that can affect MySQL application performance. The same holds true even for homogenous operating systems if they differ in installed patches or kernel modifications. To reduce the potential for confusion, it is better to have both the platforms as similar as possible. 3.3 Designing Database to Improve Speed: 1. Choosing the Right Storage Engine and Table Type: MySQL has various types of storage engines and tables that can be configured at the same time without interfering with the structure of the queries. It lets you play with the configuration of the database and table in a way that helps you find the perfect match of storage engine and table according to the data available. 2. Optimizing Table Structure: Optimizing table structure helps in reclaiming unused space after deletions and cleaning up the table after structural modifications. The OPTIMIZE SQL command takes care of this. 31 Syntax: OPTIMIZE TABLE table_name; 3. Specifying Row Format: When we create or modify a table we have three options to store rows. We can use fixed format that converts all varchar columns to char, on the other hand dynamic format converts all char to varchar. Fixed format make the row size small and dynamic format makes the row more flexible, so both have their advantages. Dynamic format uses less space but they have the risk of fragmentation and corruption. The third option is running myisampack command which instructs MySQL to compress the table into smaller, read only format. This translates into less work, better response time and less likely to become corrupted. For less disk space this is the perfect option. 4. Index Key Compression: Compressing index key makes read operations faster; this can be done by enabling PACK_KEYS option. Compressing index key helps in saving lot of disk space. Only problem with this is it puts an overhead on write operations. Syntax: CREATE TABLE Dimcustomer ( 32 DimCustomerKey smallint(5) unsigned NOT NULL auto_increment, DimCustomerID int(11) NOT NULL default '0', PRIMARY KEY (DimCustomerKey), UNIQUE KEY DimCustomerID (DimCustomerID), UNIQUE KEY DimCustomer (DimCustomer) ) TYPE=ISAM PACK_KEYS=1; 5. Checksum Integrity Management: This option helps in locating corrupted tables but it puts a slight overhead. This can be done by using CHECKSUM TABLE command. Syntax: CHECKSUM TABLE Prospect; 6. Column Types and Performance: Choosing right type of column helps in improving the performance of the table. The main aim is to choosing right datatype for specific column. Best practice is to choose datatypes that are smaller in size and simpler to choose. 7. String Consideration: 33 MySQL offers a variety of string and binary based information. Some of the useful datatypes available are: CHAR versus VARCHAR—Dynamic row format automatically converts CHAR columns to VARCHAR. On the other hand, certain fixed row format tables convert VARCHAR columns to CHAR. VARCHAR uses excess disk space but since these days disks are inexpensive so VARCHAR is preferred over CHAR due to its flexibility, improved speed and less fragmentation. CHAR BINARY versus VARCHAR BINARY— these columns hold binary information but they follow same criteria as CHAR and VARCHAR. BLOBs and TEXT— Binary large objects (BLOBs) typically hold images, sound files, executables, and so on. This column type is further subdivided into TINYBLOB, BLOB, MEDIUMBLOB, and LONGBLOB. 8. Numeric Considerations: Integers— In MySQL integer can be defined in five ways: TINYINT, SMALLINT, MEDIUMINT, INT, and BIGINT, along with the ability to declare them SIGNED or UNSIGNED. Decimals— Choices available for decimal-based columns include DECIMAL/NUMERIC, FLOAT, and DOUBLE. Selecting FLOAT or 34 DOUBLE without specifying a precision consumes four bytes and eight bytes, respectively. We can specify a precision (on both sides of the decimal place) for the DECIMAL type. It helps in determining the storage consumption [6]. For example, if we define a column of type DECIMAL(10,2), it consumes at least 10 bytes, along with anywhere between 0 and 2 additional bytes, depending on if there is a fraction or sign involved. 9. Using Views to Improve Performance: MySQL implemented views with version 5.0. Retrieving information using SELECT * returns extra columns that are not necessary. When big queries like these run simultaneously, it negatively affects performance by increasing engine load and it consumes extra bandwidth on the server. In cases like this it is better to make a view that returns only those columns that are absolutely required for the task. View reduces the number of potential returned rows when performing a query. Using views with restrictive WHERE clause helps in reducing the data set. 10. Normalization: 35 Normalization is the process of organizing data in a database. This includes creating tables and establishing relationships between those tables according to rules designed both to protect the data and to make the database more flexible by eliminating redundancy and inconsistent dependency. Redundant data wastes disk space and creates maintenance problems. If data that exists in more than one place must be changed, the data must be changed in exactly the same way in all locations. Normalization is a process whereby the tables in a database are optimized to remove the potential for redundancy. It is not always suggested to normalize the database; sometimes it makes more sense not to normalize a database. Typically, it’s unwise to store calculated information (for example, averages, sums, minimums, maximums) in the tables. Instead, users receive more accurate information by computing the requested results at runtime, when the query is executed. However, if we have a very large set of data that doesn’t change very often, then it is suggested to consider performing the calculations in bulk during off-hours, and then storing the results [6]. 36 We can use MERGE table option to partition and compress data but it makes more sense if we add a column to the master table, this helps in searching the data faster. 11. Using Constraints to Improve Performance: Constraints are rules that a database designer specifies when setting up a table. MySQL enforces these changes to information stored in the database. These changes usually occur via INSERT, UPDATE, or DELETE statements, although they can also be triggered by structural alterations to the tables themselves. MySQL offers the following constraints: UNIQUE: It guarantees there will be no duplicate values in the column. PRIMARY Key: It identifies the primary unique identifier of a row. FOREIGN Key: It codifies and enforces the relationships among two or more tables with regard to appropriate behavior when data changes. DEFAULT: It provides an automatic value for a column if a user omits entering data. 37 NOT NULL: It forces users to provide information for a column when inserting or updating data. ENUM: It allows to set a restricted list of values for a particular column, although it is not a true constraint. SET: It allows to store combinations of predefined values within a string column, although it is not a true constraint. These constraints help in reducing the chances of data integrity problem. These constraints run on the database server, this is faster than manually coding and downloading/installing the same logic on a client. Reuse is the main foundation of a good software design practices. Using these constraints, reduces the amount of time that developers need for these types of tasks, as well as helping cut down on potential error [6]. 12. Using Clustering to Improve the Performance: 38 MySQL clustering is a high performance, scalable, clustered database. MySQL cluster is a real time open source transactional database designed for fast, alwayson access to data under high throughput conditions. MySQL cluster utilizes a shared-nothing architecture which does not require any additional infrastructure investment. MySQL provides maximum data availability using distributed node architecture with no single point of failure [10]. Clustering spreads data over multiple servers resulting in a single redundant that is data is distributed and it is scalable that means we can add more machine depending on the requirements. MySQL cluster consists of a set of computers that run MySQL servers to receive and respond to queries, storage nodes to store the data held in a cluster and to process the queries and one and more management nodes to act as a central point to manage the entire cluster [12]. 3.4 Advantages of Using Clustering: a) High Availability: MySQL provides a fault tolerant architecture that ensures maximum availability. MySQL implements automatic node recovery to ensure the switch from one node to another if there is any node failure. MySQL cluster takes cares of hardware failure also. If all the nodes fail due to hardware failure the 39 MySQL cluster ensures that the entire system can be recovered in a consistent state by using combination of checkpoints and log execution. MySQL cluster makes sure that systems are replicated across geographically to be available to all the regions. b) Scalability: It improves performance of the database by adding more hardware and there by lowering the load on a single machine. c) Easy to Use: It is easier to take backup of a large cluster rather than taking backup of lots of separate database servers each up a partition of the data. d) High Performance: MySQL cluster provides high performance. MySQL cluster has a main memory database solution which keeps the data in memory and limits IO bottlenecks by asynchronously writing transaction logs on disk. MySQL cluster enables servers to share processing within a cluster taking full advantage of all hardware. e) Extremely Fast Auto Recovery: MySQL has extremely fast auto recovery. MySQL cluster uses synchronous replication to transfer information to all the database nodes so if a node fails it can automatically go to the adjacent node quickly. This eliminates the time of creating and processing the log information. MySQL cluster database nodes can recover and restart dynamically [7]. 40 Chapter 4 OPTIMIZATION The goal of performance tuning is to minimize the response time for each query and to maximize the throughput of the entire database server by reducing network traffic, disk I/O, and CPU time. This goal is achieved through understanding application requirements, the logical and physical structure of the data, and tradeoffs between conflicting uses of the database. Performance issues should be considered throughout the development cycle, not at the end when the system is implemented. Many performance issues that result in significant improvements are achieved by careful design from the outset. Although other system-level performance issues, such as memory, hardware, and so on, the performance gain from these areas is often incremental. Things that need to be considered for optimizing the database are hardware, OS/Libraries, setup and queries, API, and Applications. 4.1 Bottlenecks The most common system bottlenecks are: Disk seeks: Disk seeks is one of the most expensive operations in the database. It takes time for the disk to find a piece of data. 41 Some of the reasons why disk seek are one of the expensive operations are: a) Randomly distributed small files all over the disk. b) Editing different files over the disk at the same time. c) Due to files open and getting updated over the disk at the same time leads to more fragmentations. Though with modern disks, the mean disk seek is reduced to lower than 10ms, but it accumulates with time and with creating and updating of files. Disk Read/Write: Disk read/write is defined as the time taken to read or write data in a particular place on the disk. To read/write a data in the disk, disk has to physically move to the designated place.The way to optimize seek time is to distribute the data onto more than one disk [18]. This is easier to optimize than seeks because you can read in parallel from multiple disks. Caching data help in reducing the disk read/write time by not reopening the files repeatedly. CPU Cycles: The processing speed of the database can be increased by having the data in main memory. Having smaller tables helps in increasing the speed but having a smaller table is usually not a concern. Memory Bandwidth: When the CPU needs more data than can fit in the CPU cache, main memory bandwidth becomes a bottleneck [18]. 42 4.2 Measuring Performance Black box approach can be used to measure the performance of the database. It measures the transactions per second (TPS). The transaction is a unit of execution that a user invokes. The example of a transaction can be a read query or grouping the new updates in a stored procedure [17]. The variables that can be altered to affect the performance include: a) Altering Hardware impacts in the speed of the CPU, bus speed, memory access time, seek time, network and interface speed, all these items affects the performance of the database. b) Altering Operating System impacts in the performance of the native API’s, threading, locking and memory can impact the performance of a common benchmark. c) Number of Client concurrently attempting to connect to the database measures its ability to handle the burden on the database. d) Data Schema tells the structure of the database used in the test. It compares the results from single table to multiple tables. e) Data Configuration plays an important role in the performance of the database. By modifying the parameters like maximum number of client connections, size of 43 query cache, logs, index memory cache size, network protocols used, helps in determining the performance of the system [17]. 4.3 Optimizing Hardware for MySQL If dealing with bulky tables, it is suggested to consider 64 bit hardware. Since MySQL uses a lot of 64 bit integers internally, a 64 bit CPUs will give much better performance. For optimizing large databases, the order of optimization should be RAM, Fast disks, and CPU power. Increasing RAM helps in speeding up the database since it allows more key pages in the RAM. UPS is an option to save the tables in case of power failure. Database Systems that have dedicated server, 1G Ethernet is suggested. It improves Latency thus improves performance. 4.4 Optimizing Disks To improve the efficiency of the disk, it is suggested to have dedicated disks for system, programs, and temporary files and for logs if there are many transaction changes. 44 Low seek time reflects disk optimization. In most cases, blocks will be buffered together and the seek time does not exceed 1-2 seeks. A formula to calculate the seek time is: log(row_count) / log(index_block_length/3*2/(key_length + data_ptr_length))+1 seeks to find a row. Seek time for write will be 4 seek requests and seek time to put the new key and to update the index and write in the row will be 2 seeks. For large databases, the seek time of the disk is dependent on the disk speed, which increases by NlogN as data increases. Splitting database on different disks increases disk speed and this can be done by using symbolic links. Disk stripping helps in improving the performance of the disk by spreading data across several partitions on various disks. Stripping disks will increase both read and write throughput. Disk mirroring is a process in which two duplicate disks are simultaneously created with the same data. This process is expensive, as it requires two disks to store the same kind of data. This process provides maximum security as if one disk fails the system switches to the other disk immediately to another disk without any data loss. 45 Disk mirroring and disk stripping are not recommended for temporary files or for data that can be easily re-generated. On Linux use hdparm -m16 -d1 on the disks on boot to enable reading/writing of multiple sectors at a time, and DMA. This may increase the response time by 5-50 %. 4.5 Optimizing OS Configuring the system helps in reducing the memory usage. Adding RAM also helps in improving memory problems and increasing speed. NFS disks are not suitable for data usage since NFS has locking problems. Increase number of open files for system and for the SQL server. (add ulimit -n # in the safe_mysqld script). Increase the number of processes and threads for the system. If you have relatively few big tables, tell your file system to not break up the file on different cylinders (Solaris). Use file systems that support big files (Solaris). 46 Choose which file system to use; ReiserFS on Linux is very fast for open, read and write. File checks take just a couple of seconds. 4.6 Choosing API PERL - Portable programs between OS and databases - Good for quick prototyping - Use DBI (Database Interface)/DBD (Database Interface Driver) interface PHP - Simpler to learn than PERL - Standard MySQL interface - Faster - Uses fewer resources than PERL, which makes it good for embedding in Web servers. C /C++ - The native interface to MySQL. 47 - This is the fastest interface and gives more control. - Very simple to use ODBC - Works on Windows and Unix - It is slow - Harder to use, and interface is frequently changed so harder to get used to the system. - It adds overhead Python + others - We don’t use it anymore 4.7 Compiling and Installing MySQL Choosing compatible compiler helps in improving the performance by 10-30% [16]. Compiling MySQL with pgcc improves the efficiency since pgcc is a Pentium optimized version of gcc [16]. 48 Using recommended optimizing options for particular platform. Using native threads improves efficiency [16]. 4.8 Execution Path of a Query: Figure 3: Execution Path of a Query [7] 49 Execution Path of the query in Figure 3 is taken from MySQL Reference Manual [7]. It is further explained in following steps: a) User sends request in the form of a SQL statement to the server. b) Server checks if the data has been s recently accessed or not in the query cache. If it has, it is called a hit. The server returns the stored result from the cache; otherwise it passes the SQL statement for processing. c) The server parses, preprocesses, and optimizes the SQL into a query execution plan. d) The query execution engine executes the plan by calling the storage engine API. e) The storage engine grabs the data from the data repository and sends the data back to the user who requested the data. 4.9 Query Analyzer: Query analyzer can visually correlate MySQL, OS graph and query activities which help in finding the most expensive queries by visually correlating SQL activity with usage spikes in the key system resources. It allows analyzing execution plan of the query. It has a capability of highlighting any region of the graph and thus filters the queries that were 50 executing during the date/time selection. It collects and reports all SQL errors and warning for each query. A snapshot of a query analyzer is shown in Figure 4. Figure 4: Query Analyzer SQL Error and Warning Counts 4.10 Optimizing Environmental Parameters: 1. Low value for the max_write_lock_count: MySQL executes INSERT & UPDATE statements with higher priority. Also INSERT & UPDATE statements require table lock (for MYISAM engines) which requires even table reads 51 (SELECT statements) to be completed before INSERT & UPDATES are executed. This can cause long delays for SELECT statements waiting behind an INSERT or UPDATE statement, which may itself take minimal time to execute, which is waiting for existing long running SQL Select statements to be completed. So even a single INSERT or UPDATE can slow-down a heavily loaded database at unpredictable times. One solution is to use INSERT DELAYED statement to cause INSERT statements to be run at lower priority in a queue. By starting mysqld with a low value for the max_write_lock_count system variable we are forcing MySQL to temporarily elevate the priority of all SELECT statements that are waiting for a table after a specific number of inserts to the table occur. This allows READ locks after a certain number of WRITE locks. 2. FLUSH Command: The root-level user or users having reload privileges can use flush command to clear the internal caches used by MySQL. The syntax for FLUSH is: Syntax: FLUSH hosts| privileges| tables| logs; Example before using FLUSH: 52 SELECT dimcustomer.firstname, dimcustomer.lastname, prospect.emailaddress, prospect.education, prospect.occupation FROM dimcustomer,prospect WHERE dimcustomer.customer_id = prospect.prospect_id; 2059 rows in set (07.20sec) Example after using FLUSH: SELECT dimcustomer.firstname, dimcustomer.lastname, prospect.emailaddress, prospect.education, prospect.occupation FROM dimcustomer,prospect WHERE dimcustomer.customer_id = prospect.prospect_id; 2059 rows affected (07.80sec) Flush Privileges is invoked when a new user is added in the system. When a new user comes, the privileges from the grant table are reloaded in the MySQL database. Due to this server caches information in the memory. Every time server instances this kind of information, this will leads to storing unnecessary information in the memory. To free this cache memory, FLUSH privileges command is used. 53 When we run any flush command the Query OK response will assure that the cleaning process occurred without a hitch. Flush Hosts command flushes host cache tables and internal DNS cache. It is used because sometimes the number of connections used reaches to its maximum number for a particular host and that throws an error. It is also suggested to use when some of the host changes IP number or gets an error. Flush Hosts command resets and again allows connections to be made. While using Flush Hosts it is suggested to be careful because it can lead to reverse DNS lookup that can further limit the server’s speed. FLUSH Logs command closes and reopens the standard and update log files. It is used when the existing log files becomes too big. This command creates a new empty log file. It helps in managing the log files, which further can be studied easier to look for errors. FLUSH Tables command closes all the currently open or in-use tables. FLUSH Tables also removes all query results from the query cache. No error occurs if a named table does not exist. 3. Altering Index Buffer Size (key_buffer): The key_buffer variable is an important variable when it comes to improving the performance of the database 54 and optimizing the system. These work great on indexed tables. It is recommended to assign 25% of the total system memory. We can change the configuration of the variable to get better results depending upon the requirements. Syntax to check the status of the variable: SHOW VARIABLES like 'key_buffer_size'; Key_buffer_size = 499712; Example before changing the variable: SELECT dimcustomer.firstname, dimcustomer.lastname, prospect.emailaddress, prospect.education, prospect.occupation FROM dimcustomer,prospect WHERE dimcustomer.customer_id = prospect.prospect_id; 2059 rows affected (07.90sec) Syntax to change the status of the variable: SET GLOBAL key_buffer_size= 1000000; Example after changing the variable: 55 SELECT dimcustomer.firstname, dimcustomer.lastname, prospect.emailaddress, prospect.education, prospect.occupation FROM dimcustomer,prospect WHERE dimcustomer.customer_id = prospect.prospect_id; 2059 rows affected (07.70sec) 4. Setting Maximum Number of Open Tables (table_cache): This variable is similar to max_connetions variables. Max_Connections variable tells how many connections a MySQL server can have and similarly table_cache tells how many table a MySQL server can have at a time. This variable can be altered to improve performance depending upon the server size. If the server size is big we can afford to increase the value of this variable thereby lets tables cache in the memory and thus makes the queries faster. Syntax to check the status of the variable: SHOW VARIABLES like 'table_cache'; Table_cache = 256; Syntax to change the status of the variable: SET GLOBAL table_cache=512; 56 5. Activating the Query Cache (query_cache_type): MySQL query cache is some what similar to the SQL cache, it stores the SELECT queries issue by user but on top of that MySQL query cache also stores the query’s result set. This benefits database engine by not only by avoiding the overhead of hard parsing for identical queries but also saves the overhead of recreating complex result set from either disk or memory caches. It decreases both physical and logical I/O/. It improves the response times for business applications. The query cache variable can be set to On, OFF, or DEMAND. It is recommended to turn this variable on for large number of similar queries requested repeatedly on server. The query cache variable can control amount of memory allocated to the MySQL query cache. It is suggested to increase the value of the variable for high volume servers. Syntax to check the status of the variable: SHOW VARIABLES like 'query_cache_type'; query_cache_type = OFF Example before changing the variable: SELECT dimcustomer.firstname, dimcustomer.lastname, prospect.emailaddress, prospect.education, prospect.occupation FROM 57 dimcustomer,prospect WHERE dimcustomer.customer_id = prospect.prospect_id; 2059 rows affected (07.90sec) Syntax to change the status of the variable: SET SESSION query_cache_type=on; Example after changing the variable: SELECT dimcustomer.firstname, dimcustomer.lastname, prospect.emailaddress, prospect.education, prospect.occupation FROM dimcustomer,prospect WHERE dimcustomer.customer_id = prospect.prospect_id; 2059 rows affected (07.80sec) 6. Max_join_size: The default value of max_join_size is “18446744073709551615” if no size specified in the config file. This variable is used to set the maximum number of records that a SELECT query will scan for a table join. It is very useful variable because some of the queries are not written properly and that can end up scans millions of records thus decreasing the server’s capability to process requests. The default value of this variable is very large, since this variable will 58 impact different users and queries so it is important to set a value that will suit all kind of queries and users. Thus choosing a smaller value for this variable improves the performance of the server but it has to be chosen very carefully depending upon the database and the kind of requests the server receive. Syntax to check the status of the variable: SHOW VARIABLES like 'max_join_size'; Syntax to change the status of the variable: SET SESSION max_join_size=1000000000; 7. Max_connections: This variable controls the number of connections a server can concurrently have at any instant of time. The default value of this variable is 100. This need to be changed depending upon the number of connections. It is suggested to increase the value of this variable to avoid “Too many connections” error. Syntax to check the status of the variable: SHOW VARIABLES like 'max_connections'; Syntax to change the status of the variable: SET GLOBAL max_connections=100; 59 8. Long_query_time: MySQL creates a log for all the slow queries, it is called slow query log. MySQL automatically logs all the queries that take exceptionally more time to complete. This log helps in determining the specific queries that leads to downgrading the performance of the database by not finishing within the time limit. This log helps in writing an optimization algorithm depending upon the slow queries. It is suggested to have a larger value of this variable for systems experiencing heavy load. Syntax to check the status of the variable: SHOW VARIABLES like 'long_query_time'; Syntax to change the status of the variable: SET SESSION long_query_time=100; 9. Transaction Isolation Level (tx_isolation): This global variable controls, extend to which concurrent transactions can occur and the degree to which the data being updated is visible to other transactions. Never change the isolation level in the middle of the transaction, such change causes DBMS to make implicit commit. A value can be assigned to the tx_isolation variable by using SET command. Syntax to check the status of the variable: 60 SHOW VARIABLES like 'tx_isolation'; Syntax to change the status of the variable: SET GLOBAL |SESSION tx_isolation=serializable| Read Uncommitted| Read Committed | Repeatable Read; Global keyword sets the default transaction level globally for all subsequent sessions, though the existing sessions are unaffected. Session keyword sets the default transaction level for all subsequent transactions performed within the current session Without Global or Session keyboard the statement sets the isolation level to the next transaction. MySQL offers all the four isolation levels specified by the SQL Standards: 1. Read Uncommitted: It is also called dirty read. When it is used MySQL server uses no lock for SELECT query but employs locks on UPDATE and DELETES statements. It only ensures that the physically corrupted data will not be read. 2. Read Committed: This uses shared locks while reading data. It takes care of the limitation of the read uncommitted; it does not 61 read the physically corrupted data. It will not read the data from other application till it is committed by it. The limitation it has is, it can not ensure if the data read would not be changed by the end of the transaction. 3. Repeatable Read: It overcomes the limitations of both read uncommitted and committed. It puts locks on all the data that is used in the query. No other transaction can modify this data. 4. Serializable: This is the most restrictive isolation level. This never leads to phantom values. It restricts other users to make any kind of update till data is being used in the query. 10. Binary Log (log_bin): Binary log takes care of all the track changes in the database. It can be used as a backup of all the queries performed on the database. It helps in retracing the queries performed on the database. It gives a snapshot of all the queries performed on the database. It helps in restoring a system to a stable snapshot and thus helps in recovering from any transaction failure/interruption. Syntax to check the status of the variable: SHOW VARIABLES like ‘log_bin’; 62 11. Init_connect: This variable can be used to triggers a series of SQL commands or queries when a non super user connects successfully. When a connection is established a set of commands are supposed to executed, this is the easiest way to initialize user specific SQL code when a connection is established. Syntax to check the status of the init_connect: SHOW VARIABLES like ‘interactive_timeout’; Syntax to change the value of the variable is: SET GLOBAL init_connect='SET autocommit=0'; 12. Interactive_timeout: The default timeout value for a MySQL server connection is 28,800 seconds. It results in a connection break if the user is idle for a long period of time. The value of the variable can be changed using SET command. Syntax to check the value of the timeout is: SHOW VARIABLES like ‘interactive_timeout’; Syntax to change the value of the variable is: SET interactive_timeout = 10000; 63 There is no option at the client side for the timeout option; the only way to set a timeout is at the server side using the interactive_timeout variable. The reasonable timeout can be anywhere around 5 minutes, but it is not recommended to have a larger timeout especially if the number of user trying to connect to the server is lot. 13. MySQL cache helps in improving the performance optimization tasks. MySQL cache is little different from Oracle cache. In Oracle cache the whole execution plans are cached, but in MySQL it stores the result set. The operating modes of the query cache can be edited and this gives a lot of control to the administrator. The three values that can be assigned to the mode variable, the values are 0, 1 and 2. By default Query caching is turned off. Query Caching is turned off for mode 0, Query Caching is enabled for mode 1, and for mode 2 queries cache is enabled but it works only on demand (select query_cache ….). Syntax: SET GLOBAL query_cache_type=0; SET GLOBAL query_cache_type=1; SET GLOBAL query_cache_type=2; 64 4.11 Optimizing Tables Structure: 1. Using NOT NULL as default value speeds up the execution of a query. Syntax: ALTER TABLE dimcustomer CHANGE CustomerKey CustomerKey varchar(100) NOT NULL; Example before changing the datatype: SELECT dimcustomer.firstname, dimcustomer.lastname, prospect.emailaddress, prospect.education, prospect.occupation FROM dimcustomer,prospect WHERE dimcustomer.customer_id = prospect.prospect_id; 2059 rows affected (04.30sec) Example after changing the datatype: SELECT dimcustomer.firstname, dimcustomer.lastname, prospect.emailaddress, prospect.education, prospect.occupation FROM dimcustomer,prospect WHERE dimcustomer.customer_id = prospect.prospect_id; 2059 rows affected (04.10sec) 65 2. Using appropriate attribute types and length saves lot of space and query access time. Using appropriate datatypes helps in reducing the size of the column. It is suggested to use SMALLINT or MEDIUMINT instead if INT. When using fixed length attributes, such as CHAR it is suggested to use short length. Example before changing the datatype: SELECT dimcustomer.firstname, dimcustomer.lastname, prospect.emailaddress, prospect.education, prospect.occupation FROM dimcustomer,prospect WHERE dimcustomer.customer_id = prospect.prospect_id; 2059 rows affected (04.30sec) Syntax: ALTER TABLE dimcustomer CHANGE CustomerKey CustomerKey smallint; After making some more updates like deleting and inserting some rows. Example after changing the datatype: SELECT dimcustomer.firstname, dimcustomer.lastname, prospect.emailaddress, prospect.education, prospect.occupation FROM 66 dimcustomer,prospect WHERE dimcustomer.customer_id = prospect.prospect_id; 2059 rows affected (04.20sec) 3. It is suggested to use fixed length attributes rather than dynamic attributes like VARCHAR or BLOB. 4. Indexes need to be designed according to the need. Primary key index should be as small as possible, prefixes should be used where ever possible and indexes should be used only where required some time it acts as an overhead. It is suggested that the leftmost attribute has the highest number of the duplicate entries since it is frequently used. 5. Indexing keys are most commonly used in the WHERE clause. Index keys are used frequently to join tables in MySQL statements. There is a restriction of how many indexes can be used in a table and how long the index can be. Most of the storage engines support at least 16 indexes per table and a total index length of at least 256 bytes. Example before adding the indexes: SELECT dimcustomer.firstname, dimcustomer.lastname, prospect.emailaddress, prospect.education, prospect.occupation FROM 67 dimcustomer,prospect WHERE dimcustomer.customer_id = prospect.prospect_id; 2059 rows affected (04.20sec) Syntax Add Index: CREATE INDEX index1 on DimCustomer (Customerkey); CREATE INDEX index2 on DimCustomer (Geographykey); CREATE INDEX index3 on DimCustomer (CustomerAlternatekey); CREATE INDEX index4 on DimCustomer (EmailAddress); CREATE INDEX index1 on Prospect (ProspectiveBuyerKey); CREATE INDEX index2 on Prospect (ProspectAlternateKey); CREATE INDEX index3 on Prospect (ProspectiveBuyerKey); CREATE INDEX index4 on Prospect (EmailAddress); CREATE INDEX index5 on Prospect (AddressLine1); CREATE INDEX index6 on Prospect (Prospect_id); Example after adding the indexes: 68 SELECT dimcustomer.firstname, dimcustomer.lastname, prospect.emailaddress, prospect.education, prospect.occupation FROM dimcustomer,prospect WHERE dimcustomer.customer_id = prospect.prospect_id; 2059 rows affected (04.40sec) Indexing improves query execution time but access of indexing ads overhead on the table. 6. It is suggested to index keys that have high selectivity. The selectivity of an index is the percentage of rows having the same value for the indexed key. The indexed key is optimal if few rows have the same value. 7. Indexing helps in improving the performance of the query but it is not suggestible to index all the columns. It is not suitable if we use indexing a column that gets updated frequently. 8. Composite indexing contains more than one column. It provides more advantages over single-column indexing. Composite indexing improves selectivity by combining different columns together. It helps in decreasing system I/O. If all the selected columns are attached together through composite index, it returns the value by only accessing the indexes, and thus reducing the overhead of accessing the table. 69 Syntax: CREATE INDEX index1 on DimCustomer (Customerkey, Geographykey, CustomerAlternatekey, EmailAddress, AddressLine1,Phone); CREATE INDEX index1 on Prospect (ProspectiveBuyerKey, ProspectAlternateKey,EmailAddress, AddressLine1); Example before composite indexes: SELECT dimcustomer.firstname, dimcustomer.lastname, prospect.emailaddress, prospect.education, prospect.occupation FROM dimcustomer,prospect WHERE dimcustomer.customer_id = prospect.prospect_id; 2059 rows affected (04.50sec) Example after adding composite indexes: SELECT dimcustomer.firstname, dimcustomer.lastname, prospect.emailaddress, prospect.education, prospect.occupation FROM dimcustomer,prospect WHERE dimcustomer.customer_id = prospect.prospect_id; 70 2059 rows affected (04.40sec) 9. To generate unique values AUTO_INCREMENT is preferable. Syntax: ALTER TABLE DimCustomer ADD customer_id MEDIUMINT NOT NULL AUTO_INCREMENT KEY ALTER TABLE Prospect ADD Prospect_id MEDIUMINT NOT NULL AUTO_INCREMENT KEY 10. Use small primary keys. Use numbers instead of strings when joining tables this improves the speed of the query execution. Comparing String primary key with number: Select * from DimCustomer; // When primary key is of type Mediumint 18484 rows affected (211.70sec) Select * from DimCustomer; // When primary key is of type String(Varchar) 18484 rows affected (356.60sec) 11. Horizontal Partitioning: Horizontal partitioning breaks a table into two or more tables. This breaks rows into two or more tables. This allows different row-based datasets that can be addressed individually or collectively depending upon the 71 situation. This provides the possibility for indexing. It lowers the number of rows to be accessed. It will lower number of reads from the disk/memory. Syntax: Before Partition: CREATE TABLE `test`.`dimcustomer` ( `customerkey` smallint(6) default NULL, `geographykey` smallint(6) default NULL, `customeralternatekey` varchar(100) NOT NULL, `Title` varchar(100) default NULL, `FirstName` varchar(100) default NULL, `middlename` char(10) default NULL, `LastName` varchar(100) default NULL, `NameStyle` varchar(100) default NULL, `BirthDate` varchar(100) default NULL, `maritalstatus` char(10) default NULL, 72 `gender` char(10) default NULL, `EmailAddress` varchar(100) default NULL, `YearlyIncome` varchar(100) default NULL, `totalchildren` tinyint(6) default NULL, `numberchildrenathome` tinyint(4) default NULL, `EnglishEducation` varchar(100) default NULL, `SpanishEducation` varchar(100) default NULL, `FrenchEducation` varchar(100) default NULL, `EnglishOccupation` varchar(100) default NULL, `SpanishOccupation` varchar(100) default NULL, `FrenchOccupation` varchar(100) default NULL, `numbercarsowned` tinyint(6) default NULL, `AddressLine1` varchar(100) default NULL, `Phone` varchar(100) default NULL, `DateFirstPurchase` varchar(100) default NULL, 73 `CommuteDistance` varchar(100) default NULL, `customer_id` mediumint(9) NOT NULL, PRIMARY KEY (`customer_id`), ); Example: SELECT firstname, lastname, emailaddress, education, occupation from prospect; 2059 rows affected (04.10sec) After Partition: CREATE TABLE prospect1 ( ProspectiveBuyerKey varchar(100) default NULL, FirstName varchar(50) default NULL, MiddleName varchar(50) default NULL, LastName varchar(50) default NULL, Gender varchar(10) default NULL, EmailAddress varchar(50) default NULL, 74 prospect_id mediumint(9) NOT NULL auto_increment, ); CREATE TABLE prospect2 ( ProspectAlternateKey varchar(15) default NULL, YearlyIncome varchar(100) default NULL, TotalChildren varchar(100) default NULL, NumberChildrenAtHome varchar(100) default NULL, Education varchar(40) default NULL, Occupation varchar(100) default NULL, ); CREATE TABLE prospect3 ( NumberCarsOwned varchar(100) default NULL, AddressLine1 varchar(120) default NULL, City varchar(30) default NULL, StateProvinceCode varchar(3) default NULL, 75 PostalCode varchar(15) default NULL, Phone varchar(20) default NULL, Salutation varchar(8) default NULL, Unknown varchar(100) default NULL ); Example after Partition: SELECT firstname, lastname, emailaddress FROM prospect1; 2059 rows affected (02.20 sec) 4.12 Optimizing How to Load tables: 1. Using INSERT DELAYED increases the speed by using single disk write. It can be used when the user is not concerned when the data is written. INSERT DELAYED will wait until other clients have closed the table before executing the statement. INSERT DELAYED only works on MyISAM, MEMORY, and ARCHIVE tables. 76 Syntax: INSERT DELAYED INTO DimCustomer VALUES ('va1l','va12','va13',....); 2. INSERT LOW_PRIORITY when SELECT statement has higher priority. This increases the speed of the select query by making insert to happen later when the select query leaves the resources. Syntax: INSERT low_priority INTO counties VALUES('val1','val2','val3','val4',.....); 3. There are three ways of inserting data in the table, single insert, multiple insert and load data or mysqlimport. Single insert is the slowest, multiple row insert is the faster and load data is the fastest among them. Single Insert: It insert one statement at a time. Syntax: Insert into DimCustomer Values ('11000','26','AW00011000','','Jon','V','Yang','0','24205','M','M','jon24@ad venture-works.com','90000','2','0','Bachelors','Licenciatura','Bac + 4','Professional','Profesional','Cadre','1','0','3761 N. 14th St','','1 (11) 500 555-0162','37094','1-2 Miles'); 77 Insert into DimCustomer Values ('11001','37','AW00011001','','Eugene','L','Huang','0','23876','S','M','eugene [email protected]','60000','3','3','Bachelors','Licenciatura','Bac + 4','Professional','Profesional','Cadre','0','1','2243 W St.','','1 (11) 500 5550110','37090','0-1 Miles'); …………………………. 18484 rows affected (552.60 sec) Multiple-row insert technique can improve the performance of the database possessing, as MySQL will process multiple insertions faster than it will multiple INSERT statements. This improves the performance of the database processing, as MySQL considers all these insertions as a single long insert which is much faster than the multiple inserts. Syntax: Insert into DimCustomer Values ('11000','26','AW00011000','','Jon','V','Yang','0','24205','M','M','jon24@ad venture-works.com','90000','2','0','Bachelors','Licenciatura','Bac + 4','Professional','Profesional','Cadre','1','0','3761 N. 14th St','','1 (11) 500 555-0162','37094','1-2 Miles'), 78 ('11001','37','AW00011001','','Eugene','L','Huang','0','23876','S','M','eugene [email protected]','60000','3','3','Bachelors','Licenciatura','Bac + 4','Professional','Profesional','Cadre','0','1','2243 W St.','','1 (11) 500 5550110','37090','0-1 Miles'), ('11002','31','AW00011002','','Ruben','','Torres','0','23966','M','M','ruben35 @adventure-works.com','60000','3','3','Bachelors','Licenciatura','Bac + 4','Professional','Profesional','Cadre','1','1','5844 Linden Land','','1 (11) 500 555-0184','37082','2-5 Miles'), ………………. 18484 rows affected (61.10 sec) Load data: To load large amounts of data, LOAD DATA INFILE is preferable instead of using INSERT statements. Syntax: LOAD DATA INFILE 'DimCustomer.sql' INTO TABLE test.DimCustomer; 18484 rows affected (17.50 sec) 4.13 Optimizing Queries: 1. MySQL interprets query from right to left so all the limiters should be placed at the right. 79 2. Columns that need to be ORDER BY should be preferably be indexed. 3. Indexes are preferred when a column is searched a lot but it slows down the insertion in the table. Unused indexes should be dropped. 4. For limited results from database LIMIT should be used in the query. It helps in saving MySQL resources since as soon as the result is found MySQL stop searching the whole database. Syntax before Using Limit: SELECT dimcustomer.firstname, dimcustomer.lastname, prospect.emailaddress, prospect.education, prospect.occupation FROM dimcustomer,prospect WHERE dimcustomer.customer_id = prospect.prospect_id; 2059 rows affected (04.90 sec) Syntax after Using Limit: SELECT dimcustomer.firstname, dimcustomer.lastname, prospect.emailaddress, prospect.education, prospect.occupation FROM dimcustomer,prospect WHERE dimcustomer.customer_id = prospect.prospect_id Limit 0,100; 2059 rows affected (00.30 sec) 80 5. Optimize Table command should be used regularly, since it de-fragments the table. This is very effective when a table is updated frequently. It speeds up the table access, reduces the size of the table on disk and also reduces the select query response time. The OPTIMIZE command should be run when the DBMS is offline for scheduled maintenance. The command is nonstandard SQL. Syntax before Optimizing Table: Select * from dimcustomer limit 0, 10000; 10000 rows affected (113.90 sec) Syntax: Optimize TABLE DimCustomer, Prospect; Syntax before Optimizing Table: Select * from dimcustomer limit 0, 10000; 10000 rows affected (113.20 sec) 6. When a aggregate function such as COUNT ( ) or SUM( ) are frequently used, it is advisable to use statistics table that can be updated every time any row is manually updated with the aggregate values. For example, if the statistics table 81 maintains the count of rows in a table, each time the row is updated the table is updated in the statistics table. For large table using statistics table is preferred over aggregate functions since they improve the speed. 7. SELECT HIGH_PRIORITY lets the select operations run first and then any other write operations. It bypasses other table operations. The only limitation it has that it can not have Union involved. It gives more power in a multiuser, multipurpose application. However, HIGH_PRIORITY should only be used in cases where we need instantaneous access to information. We have to check of the other operations waiting in the queue are not mission critical. Syntax: SELECT HIGH_PRIORITY * FROM DimCustomer; 8. In-memory tables have fast access to data that either never changes or doesn’t need to persist after restart. Using in-memory table means that a query can complete without even waiting for disk I/O. 9. Columns having identical information in different tables should have identical data types, this improves the performance. 10. Indexes are used to improve the speed of the queries but some times they can slow down the process. So it is sensible to check in a query is using the indexes. This can be checked by using EXPLAIN. If the indexes aren’t being used then we can avoid using them to increase the performance. 82 Syntax: Explain DimCustomer; 11. Avoid using complex SELECT queries on tables that are updated frequently, this will help in avoiding problems created due to locking the table frequently for read and write updates. 12. For highly dynamic queries it is better to use SQL_NO_CACHE but otherwise we can turn it off for more static data. This helps is decreasing the overhead of the database. Syntax: SELECT * FROM DimCustomer; 18484 rows affected (211.70sec) Syntax: SELECT SQL_NO_CACHE * FROM DimCustomer; 18484 rows affected (211.50sec) 13. SQL_BUFFER_RESULT hint suggests MySQL to store the result in the result set that is stored in a temporary table. This can be used for large result sets which require significant amount of time to send the results to the client to avoid having the queried tables locked for that period of time. 83 Syntax: SELECT SQL_BUFFER_RESULT * FROM counties; 14. SQL_BIG_RESULT hint can be used with DISTINCT and GROUP BY SELECT statements. It is used for large result sets. Example using Select *: SELECT * FROM DimCustomer; 13690 rows in set (157.20 sec) Example using Select SQL_BIG_RESULT *: SELECT SQL_BIG_RESULT * FROM DimCustomer GROUP BY CustomerKey; 13690 rows in set (148.10 sec) 15. SQL_SMALL_RESULT hint uses fast temporary tables to store the resulting table instead of using sorting. Example Select *: 84 SELECT * FROM Prospect GROUP BY Prospect_id; 2059 rows in set (18.90 sec) Example Select SQL_SMALL_RESULT *: SELECT SQL_SMALL_RESULT * FROM Prospect GROUP BY Prospect_id; 2059 rows in set (18.60 sec) 16. Benchmarking: Benchmarking is the easiest and the most effective way to check the speed of the server. It is an inbuilt function in MySQL that calculates the time taken to calculate a given expression. When a benchmark function is executed the result is always 0. It is not used to retrieve the data but it is used to check if a same expression is executed n number of time the time taken to execute that query should be same all the time. It can be executed at different period of time during the day to check how much time difference comes every time the function is executed. It gives an idea when the server is most busy or idle during the day. Syntax: SELECT BENCHMARK(1000000, ‘tan(.45)’); 85 17. MySQL Query Analyzer - The new MySQL Enterprise has improved the application performance and efficiencies for the critical deployments using the new Query Analyzer feature. The MySQL Query Analyzer helps in improving the application performance by monitoring query performance and accurately pinpointing the code that causes poor performance and slows down the system; this feature helps the DBAs work on the queries that make the database slower. With the help of MySQL Query Analyzer, the DBA can tune the code by continuously monitoring and fine tuning the queries; thus, helps in achieving peak performance on MySQL. 18. Normalize first and then denormalize: In the real world of high traffic websites and users demanding sub-second response times, we have to use denormalization. The key to high performance database access is sticking to singletable SELECT queries with short indexes. 19. Show process list: The MySQL process list allows you to see what processes are currently running on your MySQL servers. SHOW PROCESSLIST to monitor their MySQL servers for poorly performing queries. We can manually parse through the aggregated data, to determine which MySQL servers have queries that may be causing a bottleneck and those queries can be terminated or can be modified. 86 Syntax: Show full processlist; 20. Storing Intermediate Results: Intermediate, or staging, tables are quite common in relational database systems, because they temporarily store some intermediate results. In many applications they are useful, but Oracle requires additional resources to create them. Always consider whether the benefit they could bring is more than the cost to create them. Avoid staging tables when the information is not reused multiple times. Some additional considerations: a) Storing intermediate results in staging tables could improve application performance. In general, whenever an intermediate result is usable by multiple following queries, it is worthwhile to store it in a staging table. The benefit of not retrieving data multiple times with a complex statement already at the second usage of the intermediate result is better than the cost to materialize it. b) Long and complex queries are hard to understand and optimize. Staging tables can break a complicated SQL statement into several smaller statements, and then store the result of each step. 87 c) Consider using materialized views. These are pre-computed tables comprising aggregated or joined data from fact and possibly dimension tables. 21. Chopping Queries: To improve query processing speed, queries can be processed into smaller chunks that affects lesser rows each time. Example of this can be purging old data. We can use LIMIT along the delete statement and can delete a limited number of rows at a time. Thus we are dividing a process in smaller processes and putting fewer loads on the processor. Example before using Limit: SELECT dimcustomer.firstname, dimcustomer.lastname, prospect.emailaddress, prospect.education, prospect.occupation FROM dimcustomer,prospect WHERE dimcustomer.customer_id = prospect.prospect_id; 2059 rows in set (04.10 sec) Example after using Limit: 88 SELECT dimcustomer.firstname, dimcustomer.lastname, prospect.emailaddress, prospect.education, prospect.occupation FROM dimcustomer,prospect WHERE dimcustomer.customer_id = prospect.prospect_id Limit 0,100; 100 rows in set (00.30 sec) 22. Join Decomposition: A multiple join statement can be decomposed into multiple single tables and then joining all those single tables later. Single Join Query: SELECT * FROM Prospect LEFT JOIN Prospect1 ON Prospect.Prospect_id = Prospect1.Prospect1_id where Prospect1.Prospect1_id=1; Join decomposed: SELECT * FROM Prospect where Prospect.Prospect_id=1; SELECT * FROM Prospect1 where Prospect1.Prospect1_id=1; This looks wasteful but it saves lot of resources and processing time. Caching is more efficient that joining multiple tables. When first select statement is processed it caches all the data that later can be used by second statement and thus lower the processing time and same logic used for the third query. This running multiple queries helps in finding the result faster than running multiple joins. Also 89 when a table is accessed that table is locked so that updated values are only accessed. So if we are using multiple tables, we are locking all the tables associated with those joins. This is not an efficient way of locking. 23. Subquery Conversion: Converting subqueries into select query with join statement helps in lowering load on the processor. Subquery: SELECT FirstName, LastName, EmailAddress From DimCustomer Where DimCustomer.Customer_id IN (Select Prospect.Prospect_id FROM Prospect) Order by Prospect.Prospect_id; 2059 rows in set (02.90 sec) Join: SELECT DimCustomer.FirstName, DimCustomer.LastName, DimCustomer.EmailAddress FROM DimCustomer JOIN Prospect ON DimCustomer.Customer_id=Prospect.Prospect_id); 2059 rows in set (02.80 sec) 90 24. Query Analyzer: It gives a consolidated view of entire MySQL environment. It has customizable rule-based monitoring and alerts that helps in identifying the problems before they occur. 25. Avoid computation in the where clause. 26. Only use transaction where it is most necessary. Inserts in auto-commit mode are the slowest insert statements. 27. Avoid using SELECT query_cache if the database is updated frequently. SELECT query_cache is faster only with static databases. A miss in the query cache add 20% overhead compared to a normal query with out query cache. 28. Use temporary views instead of derived tables. Derived table Syntax: SELECT * FROM (SELECT * FROM DimCustomer) D; 18484 rows in set (214.60 sec) Temporary Views Syntax: CREATE VIEW v AS SELECT * FROM DimCustomer; Query OK, 0 rows affected (0.00 sec) Select * from v; 91 18484 rows in set (212.20 sec) 29. Use Truncate Instead of Delete: Every delete table operation is logged in the transaction log as compared to truncate table which is not logged in the transaction log. Since truncate table doesn’t get logged in log it is faster than the delete table operation. One drawback of truncate table is that it doesn’t have any records in the log table and thus making it impossible to rollback the operation. Syntax for delete: DELETE FROM table_name; Syntax for truncate: TRUNCATE FROM tablename; 30. Avoid unnecessary parenthesis: By avoiding unnecessary parenthesis we can save processor some time for processing extra parenthesis. Syntax: SELECT * FROM DimCustomer WHERE (customer_id = 1000 OR CustomerKey = 12060); 2 rows in set (00.30 sec) 92 Instead use: SELECT * FROM DimCustomer WHERE customer_id = 1000 OR CustomerKey = 12060; 2 rows in set (00.30 sec) 4.14 Example This example shows how using MySQL hints can drastically improve the query execution time Create two tables and insert the data from an existing dump. After creating the tables Access some columns from these two tables. Solution: Steps before using optimizing rules: 1. Create two tables. CREATE TABLE Dimcustomer ( CustomerKey, varchar(100), GeographyKey, varchar(100), CustomerAlternateKey, varchar(100), 93 Title, varchar(100), FirstName, varchar(100), MiddleName, varchar(100), LastName, varchar(100), NameStyle, varchar(100), BirthDate, varchar(100), MaritalStatus, varchar(100), Gender, varchar(100), EmailAddress, varchar(100), YearlyIncome, varchar(100), TotalChildren, varchar(100), NumberChildrenAtHome, varchar(100), EnglishEducation, varchar(100), SpanishEducation, varchar(100), FrenchEducation, varchar(100), 94 EnglishOccupation, varchar(100), SpanishOccupation, varchar(100), FrenchOccupation, varchar(100), HouseOwnerFlag, varchar(100), NumberCarsOwned, varchar(100), AddressLine1, varchar(120), AddressLine2, varchar(120), Phone, varchar(100), DateFirstPurchase, varchar(100), CommuteDistance, varchar(100), customer_id, mediumint(9),PRIMARY KEY(customer_id)); CREATE TABLE Prospect ( ProspectiveBuyerKey, varchar(100), ProspectAlternateKey, varchar(15), 95 FirstName, varchar(50), MiddleName, varchar(50), LastName, varchar(50), BirthDate, varchar(100), MaritalStatus, varchar(10), Gender, varchar(10), EmailAddress, varchar(50), YearlyIncome, varchar(100), TotalChildren, varchar(100), NumberChildrenAtHome, varchar(100), Education, varchar(40), Occupation, varchar(100), HouseOwnerFlag, varchar(1), NumberCarsOwned, varchar(100), AddressLine1, varchar(120), 96 AddressLine2, varchar(120), City, varchar(30), StateProvinceCode, varchar(3), PostalCode, varchar(15), Phone, varchar(20), Salutation, varchar(8), Unknown, varchar(100), prospect_id, mediumint(9),PRIMARY KEY(prospect_id)); 2. Insert the statements in the table. INSERT INTO DimCustomer VALUES ('11000','26','AW00011000','','Jon','V','Yang','0','24205','M','M','jon24@adve nture-works.com','90000','2','0','Bachelors','Licenciatura','Bac + 4','Professional','Profesional','Cadre','1','0','3761 N. 14th St','','1 (11) 500 555-0162','37094','1-2 Miles'); 97 INSERT INTO DimCustomer VALUES ('11001','37','AW00011001','','Eugene','L','Huang','0','23876','S','M','eugene1 [email protected]','60000','3','3','Bachelors','Licenciatura','Bac + 4','Professional','Profesional','Cadre','0','1','2243 W St.','','1 (11) 500 5550110','37090','0-1 Miles'); ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------INSERT INTO DimCustomer VALUES ('29483','217','AW00029483','','Jésus','L','Navarro','0','21892','M','M','jésus9 @adventure-works.com','30000','0','0','Bachelors','Licenciatura','Bac + 4','Clerical','Administrativo','Employé','1','0','244, rue de la Centenaire','','1 (11) 500 555-0141','37693','0-1 Miles'); 13689 row affected (55.80sec) INSERT INTO prospect VALUES ('1','21596444800','Adam','','Alexander','13333','M','M','aalexander@lucerne publishing.com','40000','3','0','Partial 98 Co','Professional','1','2','566 S. Main','','Cedar City','UT','84720','516-5550187','Mr.','0'); INSERT INTO prospect VALUES ('2','3003','Adrienne','','Alonso','14860','M','F','[email protected]' ,'80000','4','0','Bachelors','Management','1','2','7264 St. Peter Court','','Colma','CA','94014','607-555-0119','Ms.','4'); ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------INSERT INTO prospect VALUES ('2059','37111264500','Zoe','','Ward','27990','M','F','zward@consolidatedmes senger','20000','0','0','High Schoo','Manual','1','1','25149 Howard Dr','','West Chicago','IL','60185','1 (11) 500 555-0190','Ms.','3'); 2059 row affected (6.80sec) 99 3. Access columns: SELECT * FROM dimcustomer,prospect WHERE dimcustomer.customer_id = prospect.prospect_id; Result is shown in Figure 5. Figure 5: Snapshot of the Resultant Table Before Using Hints 2059 rows in set (42.80 sec) 55 Columns Steps after Using Optimizing Rules: 100 1. Create tables: CREATE TABLE Dimcustomer( customerkey, smallint(6), geographykey, smallint(6), customeralternatekey, varchar(100), Title, varchar(100), FirstName, varchar(100), middlename, char(10), LastName, varchar(100), NameStyle, varchar(100), BirthDate, varchar(100), maritalstatus, char(10), gender, char(10), EmailAddress, varchar(100), YearlyIncome, varchar(100), 101 totalchildren, tinyint(6), numberchildrenathome, tinyint(4), EnglishEducation, varchar(100), SpanishEducation, varchar(100), FrenchEducation, varchar(100), EnglishOccupation, varchar(100), SpanishOccupation, varchar(100), FrenchOccupation, varchar(100), houseownerflag, tinyint(6), numbercarsowned, tinyint(6), AddressLine1, varchar(100), AddressLine2, varchar(100), Phone, varchar(100), DateFirstPurchase, varchar(100), CommuteDistance, varchar(100), 102 customer_id, mediumint(9),Primary key(customer_id)); CREATE TABLE Prospect ( prospectivebuyerkey, smallint(6), prospectalternatekey, varchar(100), FirstName, varchar(100), middlename, char(10), LastName, varchar(100), BirthDate, varchar(100), maritalstatus, char(10), gender, char(10), EmailAddress, varchar(100), YearlyIncome, varchar(100), totalchildren, tinyint(4), numberchildrenathome, tinyint(4), 103 Education, varchar(100), Occupation, varchar(100), houseownerflag, tinyint(4), numbercarsowned, tinyint(4), AddressLine1, varchar(100), AddressLine2, varchar(100), City, varchar(30), stateprovincecode, char(10), PostalCode, varchar(100), Phone, varchar(100), salutation, char(10), unknown, tinyint(4), prospect_id, mediumint(9),Primary key(prospect_id)); 2. Setting Environmental Parameters Setting Maximum number of open tables 104 SET GLOBAL TABLE_CACHE=256; Altering Index Buffer Size SET GLOBAL KEY_BUFFER_SIZE=500000; Enable Query Caching SET GLOBAL QUERY_CACHE_TYPE=1; 3. Load data in the database using load data infile. LOAD DATA INFILE 'data.txt' INTO TABLE dname.tname; 13689 rows affected (17.50sec) 4. Add index to the columns that have unique values ALTER TABLE Dimcustomer ADD INDEX index1 (customerkey, geographykey); 2059 row affected (0.80sec) 5. Join tables and only select columns that are required instead of accessing all columns. 105 SELECT dimcustomer.firstname, dimcustomer.lastname, prospect.emailaddress, prospect.education, prospect.occupation FROM dimcustomer,prospect WHERE dimcustomer.customer_id = prospect.prospect_id; Result is shown in Figure 6. Figure 6: Snapshot of the Resultant Table After Using Hints 2059 rows in set (3.70 sec), 5 Columns 106 After looking the response time of these queries it is confirmed that by using proper optimizing techniques we can make our queries much faster and cleaner. All the rules and suggestions discussed in the document help in improving the efficiency of the database. When these rules are implemented individually we can not see significant changes but when combined all together we can see a significant improvement in the efficiency of the database. Some of the major hints that have improved the performance of the database significantly are as mentioned below. 1. Optimizing the Hardware: For bulk tables 64 bit hardware is preferred. Increasing RAM always help in speeding up the database since it allows more key pages in the RAM. 1G Ethernet is preferred since they improve Latency. 2. MySQL Analyzer: With the help of MySQL analyzer we can examine query activities and find out the most expensive queries which can be further studied and can be removed with an efficient query which performs the same task but in an efficient manner. 107 3. Activating Query Cache: By activating query cache the result of the SELECT query is stored in the cache which helps in speeding up the query execution since it is available in the cache for faster access. 4. Optimizing Tables: When creating a table it is suggested that most of the columns by default should be NOT NULL. Using appropriate datatype always improves the access time. It is suggested to use datatype which are small in size like SMALLINT, CHAR, and MEDIUMCHAR. If possible use fixed length attributes rather than dynamic attributes like VARCHAR or BLOB. Primary key should as small as possible. It is suggested to index keys that have high selectivity. Composite indexing is suggested since it decreases system I/O. Horizontal partitioning significantly improves the performance of the table. Horizontal partitioning provides the possibility for indexing and it lowers the number of rows to be accessed which improves the execution time by many folds. 5. Loading Tables: The two most effective ways of loading a table is by either using multiple-row insert techniques or by using the LOAD DATA INFILE command. LOAD DATA INFILE command is used for loading large size data in the tables. 108 6. Optimizing Queries: It is suggested to use LIMIT whenever possible. This helps in saving MySQL resources. After making changes in the table it is always suggested to optimize the table using command OPTIMIZE TABLE since it de-fragments the table. It speeds us the table access. Use SELECT SQL_BIG_RESULT command instead of SELECT for large result sets. SELECT SQL_BIG_RESULT command is much faster then SELECT alone. A multiple join statement should be decomposed into multiple single tables and then joining all those single tables later. This saves lot of resources since the When first select statement is processed it caches all the data that later can be used by second statement and thus lower the processing time and same logic used for the third query. 109 Chapter 5 CONCLUSION MySQL optimization is a set of best practices geared towards optimizing the performance of the database and its queries. The goal of optimizing database is to minimize the response time and increase the throughput of the database. This can be achieved by understanding the application requirements as well as the logical and physical structure of the data. To improve the performance of the database, it is important that all the performance issues should be considered throughout the database development cycle. To optimize the database it is important to identify the bottlenecks and try to focus on those areas. Although, system-level performance issues such as memory, CPU, hardware and further more are also important for overall tuning of the database but the performance gained from these areas are more incremental. Optimizing database can be achieved by analyzing the applications and queries that interact with the database. The performance of the database is usually affected by unexpectedly long-lasting queries and updates that affect the database schema. Some of the main causes of long-lasting queries are slow network communications, inefficient 110 queries, lack of useful indexes, out-of-date statistics, inefficient environmental parameters, and inefficient table structures. The goal of the project is to provide rules and hints that will help in optimizing the queries and database; therefore, making the system faster and more efficient. This, in effect, saves a lot of processing time and resource time which further improves the performance of the system. This document has covered most of the topics that helps in optimizing the database and queries in an efficient manner. An optimized database can avoid unnecessary hardware and software upgrades. In this project I have mentioned techniques that can help in tuning queries, tables and database schemas. The document suggests what values should be assigned to the environmental variables so that they in turn provide an optimized environment to work in. This document mentions what kind of database to be used for specific kinds of data to improve the overall performance of the database. It also mentions what type of datatype to be used for different columns. In additional to that it suggests different ways to load data in the database depending upon the type and size of data. It suggests how to breakdown complex queries into simpler query that will use minimum resources and still provide the same results. It mentions how to use a MySQL query analyzer to check the load on the database at different time intervals; therefore, helps in analyzing queries and see how much resource a particular query is consuming. This document provides a 111 guideline if followed accordingly could help in optimizing queries, tables and thus improving the overall performance of the database. The document mentions several different rules along with hints when used properly could save numerous amounts of resources which then improve the throughput of the entire database. When these hints are used individually they do not show a significant change in the processing time; however, when all these hints as well as these rules are used together, a significant improvement in the response time of the queries could be seen. The practices discussed in the document not only aid in optimizing the database but also plays a key role in making the database more proficient. The previously mentioned rules discussed in this document would help in optimizing the database efficiently; although, this document is not exhaustive but it covers most of the concepts that helps in query and database optimization. This is not the final word in MySQL optimization; though this document can serve as a building block for more upcoming projects related to MySQL optimization. With the above mentioned rules along with the document that I have presented I have successfully completed the task of accomplishing the topic of, “Improving the MySQL Query Optimization”. 112 BIBLIOGRAPHY [1] Sun and Oracle, [Online] Available: http://www.sun.com/software/products/mysql/features.jsp [2] MySQL and Sun, “MySQL Server FAQ, [Online] Available: http://www.mysql.com/industry/faq/ [3] MySQL and Sun, “Overview of MySQL Storage Engine Architecture, [Online] Available: http://dev.mysql.com/doc/refman/5.1/en/pluggable-storage-overview.html [4] Techotopia MySQL Database Architecture, [Online] Available: http://www.techotopia.com/index.php/MySQL_Database_Architecture [5] John W. Horn and Michael Grey, “MySQL: Essential Skills” McGraw-Hill 2004 [6] Robert D. Schneider, “MySQL® Database Design and Tuning”, 2005 [7] MySQL Reference Manual, [Online] Available: http://dev.mysql.com/doc/ [8] Michael Kofler , “MySQL 5”, Third Edition [9] Steven Feuerstein, Guy Harrison, “MySQL Stored Procedure Programming”, 2006 113 [10] A MySQL Technical White Paper, “MySQL Cluster 6.2”, 2008 [11] Technical White Paper, “MySQL Cluster Architecture Overview - A MySQL”, 2004 [12] Alex Davies and Harrison Fisk, “MySQL Clustering”, 2008 [13] Charles A Bell, “Expert MySQL”, 2007 [14] Peter Zaitsev, “White Paper Advanced-MySQL-Performance-Optimization” [15] Baron, Peter, Jeremy and Dereck, “High Performance MySQL” [16] Optimizing MySQL, [Online] Available: http://dev.mysql.com/tech-resources/presentations/index.html [17] Technical White Paper, “MySQL Performance Benchmarks”, 2006 [18] MySQL Press, “MySQL Administrative Guide” [19] Sergey Petrunya, Sun Microsystems White Paper “Understanding and Control of MySQL Query Optimizer Traditional and Novel Tools and Techniques”, 2006 [20] Charles A. Bell, “Expert MySQL”, 2007