SQL Server DB Design Guidelines
by Stephen R. Mull
http://home.earthlink.net/~a013452/index.htm
Contents
Security
Abstraction
Physical Design
I/O & Logical Design
Locks & Deadlocks
Naming Conventions
Coding for Readability
Release & Documentation
Replication, Clustering, and Log Shipping
Appendix A, Stored Procedure Template with Code Example(s)
Appendix B & C, Temporary Tables vs. Table Valued Variables
Appendix D, Optimize Disk Configuration in SQL Server
New:
I sat through a presentation on the Fusion-io (www.fusionio.com) brand of SSD. FYI, Steve Wozniak is their
Chief Scientist. Here are some specifics:
- It exists as a PCIe card, in order to avoid bottlenecking due to SATA or SCSI type storage communication
protocols.
- Price point is $7,500 for a 320 GB card. Current max card size is 640 GB.
- The cards use up some server memory (RAM) to store a redundant set of metadata while running. They
have config software that installs with them.
- Throughput is something like 600 MB/sec, maybe 40,000 IOPs (fantastically huge compared to a
mechanical HDD).
- Reliability and longevity are comparable to or better than HDDs: an 8-10 year life at a data rate of 2 TB/day (an industry ~standard for 80 GB drives). Specifically, P(undetected bad data) ≈ 1E-20 and P(lost data) ≈ 1E-30.
Also, since the card monitors itself, it will wear out gradually and signal to you as it is doing so.
- The cards internally use something like a RAID 5 algorithm, wear-leveling algorithms, and
algorithms to proactively monitor and fix any data anomalies. They also carry roughly 20% of extra, entirely internal
storage to replace worn-out flash cells as the drive ages.
- Combining the cards in a server-level RAID configuration doesn't noticeably slow anything down.
- Adding a second card (and striping it as RAID 0) not only doubles storage size, it doubles storage
throughput. This linear scaling applies until you saturate the computer bus (resulting in a massively high
I/O rate). Servers with a few of these often cease to be disk I/O bound, and become either CPU or network
I/O bound instead.
- The card is insensitive to sequential vs. random I/O operations, since it is massively parallel. There's then
no benefit to separating data and log file operations onto separate cards, etc.
- Unlike with spindle loading, there is no need to keep extensive freeboard (free space) on the drive.
- My (strong) recommendation is to:
o buy a card and test it,
o if you are highly cautious, test it first on a different (non-production) server, by putting tempdb and the logs on it (these cards are more like other PCI cards than HDDs, failure-wise; if they fail they might bring down the whole box),
o assuming it tests well, move that card to production.
- Combined with current sufficient RAM and CPU, it should massively increase the speed of the box, and
dramatically speed current query response, particularly updates and things using tempdb for query
processing. You might even be able to put all the (mdf) data files on it too.
- Use existing storage for backups, etc.
- The hot ticket is to use these cards for all storage on 2 redundant servers, and then run SQL 2k8 EE on the (x64)
system(s), maybe using built-in compression and synchronous mirroring.
Update: I/we installed a 320 GB card on each of two 8x, x64, 64 GB, SQL 2k5 StdEd, Win 2k3 EE servers, for
one of my clients. I also re-configured the RAID array into aligned RAID 5, with caching + battery backup,
and write-through OFF. Data throughput went from ~5 MB/sec (before) to 300 MB/sec (after). The only
issue was that the Fusion-io cards and the Dell (& Broadcom) firmware needed the latest releases to all
play nicely together (due to higher I/O throughput on the bus).
1. Security

- Review the Microsoft Technet site on SQL Server security best practices:
  http://www.microsoft.com/technet/treeview/default.asp?url=/technet/prodtechnol/sql/maintain/security/sp3sec/Default.asp
- Review Microsoft’s SQL Server Security Checklist:
  http://www.microsoft.com/sql/techinfo/administration/2000/security/securingsqlserver.asp
- Using only Windows integrated security, as Microsoft suggests, is usually impractical. Therefore one needs to take care configuring SQL Server permissions & security.
- Don’t expose a SQL Server directly to the internet. Consider both firewalls and network segmentation for securing a SQL Server.
- All SQL Server production installations using SQL Server security should have a very strong “sa” password, to prevent exhaustive-search cracks.
- Don’t grant a login or service any more permissions than it needs to do its specific tasks.
- A set of non-administrative roles should be defined with respect to each production SQL Server application.
- A set of SQL Server production UserIds should be created and assigned to the created roles.
- All production processes should access production dbs either through these production UserIds or through application roles.
- Production UserIds/roles should be granted execute permission on stored procedures only, on an as-needed basis (excepting requirements due to dynamic SQL). See the sketch after this list.
- Before QA of each production release, Quality should change –all- passwords on the test server & confirm that no code is broken as a result (i.e. there is no unmanaged db access by any component, tier, or 3rd party service, especially under the “sa” SQL Server UserId).
- All sensitive information should be stored in the db in masked form. The masking function should be simple enough so that it can be executed or reversed in SQL, VB, Java, or C++. Masking functions should not be stored on the db server anywhere.
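A minimal sketch of the execute-only permission rule above; the role, user, procedure, and table names (app_writer, ProdAppUser, pOrderInsert, tOrder) are hypothetical:

-- create an application role and add the production UserId to it
EXEC sp_addrole 'app_writer'
EXEC sp_addrolemember 'app_writer', 'ProdAppUser'
GO
-- the role may execute the stored procedure ...
GRANT EXECUTE ON dbo.pOrderInsert TO app_writer
GO
-- ... but gets no direct permissions on the underlying table
DENY SELECT, INSERT, UPDATE, DELETE ON dbo.tOrder TO app_writer
GO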
2. Abstraction
Most auto-generated SQL (from many RAD products) scales very inefficiently. With these architectures,
as the user load increases, it is necessary to buy more hardware, in increasing proportion to the users added,
to maintain response time. For maintainability, and ultimately scalability, put SQL into stored procedures.
This builds modifiability into the DBMS layer, making future maintenance, troubleshooting, and modification
faster, potentially transactionally consistent, and much easier.
- Use stored procedures for all production access (updating and reading) on all new databases (see the sketch after this list).
  o On SQL Server 2000, views can be substituted for stored procedures with minimal execution-speed impact; however, this can still make SQL statement optimization more difficult. On SQL 7, views are universally slower and significantly less featured than stored procedures, and thus should not be used in lieu of stored procedures for abstraction.
- On existing databases, views are minimally acceptable as a means of after-the-fact abstraction for read-only operations, when implementation of stored procedures would require extensive modifications.
- Abstract connection information wherever feasible.
- Some even abstract stored procedure/SQL call text and parameter name(s)/type(s), especially on high-end systems (by making a metadata db). Using this method one can port a system between db platforms much more easily, while maintaining abstraction, and hence the capability to tune the specific SQL to each platform.
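A minimal sketch of the stored-procedure abstraction in the first bullet above; the table and procedure names (tOrder, pOrderGetByCustomer) follow the naming conventions in section 7 but are otherwise hypothetical:

CREATE PROCEDURE dbo.pOrderGetByCustomer (
    @CustomerId int
)
AS
begin
    -- the application calls this proc and never issues "select ... from tOrder" directly,
    -- so the underlying schema and SQL can be tuned later without touching client code
    select  OrderId,
            OrderDate,
            TotalAmount
    from    dbo.tOrder
    where   CustomerId = @CustomerId
    order by OrderDate asc

    return (@@ROWCOUNT)
end
GO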
3. Physical Design
As a database grows, either in size or load, regularly evaluate partitioning different portions of the
database onto different physical files/filegroups, storage devices, and/or servers. Depending on storage
subsystem design, this can dramatically speed up db lookup, modification, or recovery. Stay well ahead
of storage space & speed needs.
- Optimum physical design varies dramatically depending on db size, load size, and load type. For a small departmental or testing db, the default db file specifications are fine. Often these small implementations grow into a departmental server, all on RAID 5.
- Everything on RAID 5 is a convenient, low write performance configuration. On RAID 5 write operations, the RAID subsystem has to calculate and write parity information before the write operation is committed (excluding controller caching). The rule of thumb is that RAID 5 is acceptable for less than 5% writes and unacceptable for more than 10% writes. With RAID 5, it is generally not necessary to partition onto different files/filegroups for the sake of I/O speed. However, filegroup partitioning may still be convenient for back-up/recovery speed optimization (a filegroup sketch follows this list).
- Optimum read/write striping is RAID 0 + 1, which is 100% space redundant, and therefore the most expensive storage configuration. To balance between cost & performance, often data is placed on RAID 5 and the logs (& sometimes tempdb & the system dbs) are placed on RAID 0 + 1. Thus write intensive operations are placed on the fastest (& costliest) redundant storage subsystem, where they will be helped most.
- Especially consider separating log and data operations onto different physical storage subsystems and controllers. This can be accomplished regardless of whether or not one is using RAID.
- Note: Don’t use Microsoft’s OS based RAID. For small systems, buy an IDE RAID card instead.
- Since write performance is generally proportional to HDD speed, sometimes a ~fast & cheap way to debottleneck a box is simply to upgrade the HDDs to higher speed drives. This may be much faster & cheaper than a redesign & optimization of the data access layer. Generally, however, this just postpones the problem for later.
- A great write-up on physical db design by Denny Cherry is included in Appendix D, “Optimize disk configuration in SQL Server”.
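As a minimal sketch of filegroup partitioning, under assumed names (database Sales, filegroup FG_Archive, drive E:) that are purely illustrative:

ALTER DATABASE Sales ADD FILEGROUP FG_Archive
GO
ALTER DATABASE Sales ADD FILE (
    NAME = Sales_Archive1,
    FILENAME = 'E:\SQLData\Sales_Archive1.ndf',
    SIZE = 1024MB,
    FILEGROWTH = 256MB
) TO FILEGROUP FG_Archive
GO
-- large, mostly-read history data goes onto the new filegroup / spindle set
CREATE TABLE dbo.tOrderArchive (
    OrderId     int      NOT NULL,
    OrderDate   datetime NOT NULL,
    TotalAmount money    NOT NULL
) ON FG_Archive
GO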
4. I/O & Logical Design
Persistence design based on table rows as persistent instances of object properties (one row in one table
per instance of the object) can often result in degraded dbms performance. This object based design can
lead to increased round trips to the dbms, by encouraging ISAM operations, which often get executed in a
loop in the middle tier. From the whole application’s perspective, encapsulation of related db operations
into a single (T-SQL) procedure on the server side will enable enormous relational algorithmic
efficiencies, as well as eliminate the encapsulation and parsing overhead that results from multiple dbms
round trips. This architectural limitation of ISAMs applies regardless of hardware or dbms platform
selection.
- Design to:
  o Minimize round trips to the dbms (RAM cache on client(s) as much as possible)
  o Minimize the size of returned data (no lazy select *, redundant fields, or redundant selects)
  o Share/pool connections
- XML
  o Being both delimited and character based, XML is verbose. Further, it can’t be streamed into anything without first defining the relevant schema. This means that for high throughput applications, one should carefully weigh the cross-platform compatibility benefits of XML against its throughput limitations compared to TDS 8.
  o Current high throughput best practices still include separating XML services onto a separate server, and communicating with SQL Server using TDS 8. The ADO.Net System.Data.SqlClient class uses native TDS 8. Microsoft’s early rhetoric on this subject, “XML everything”, is misleading.
  o Several of Microsoft’s SQL Server/XML architectural models, similar to models implying use of Analysis Services alongside a production SQL Server, are 2-tier. These models were designed for a departmental or development server, and ignore many production considerations, such as throughput under load and maintainability.
- Use the minimum possible number of select statements to accomplish any lookup or modification. For this purpose sub-queries count as separate select statements.
- For scalable applications you have to stay stateless. Do -not- use server side cursors, or components that create server side cursors, at all. They’re too resource intensive to scale. For server side looping try “while” loops on (small) table variables w/identities instead (v2k); see the sketch after this list. For larger or more complex applications, carefully managed temp tables can also be used. Note that either creating or dropping a temp table loads the system tables and can create concurrency problems. Actually compare/test table valued variables vs. temp tables for OLTP applications, and design to avoid either one if possible. Especially avoid “select into” for OLTP applications. Consider this KB article on tempdb concurrency issues: http://support.microsoft.com/default.aspx?scid=kb;en-us;328551. Generally you still have to manage concurrency manually. Be cautious about trying to scale the new .Net disconnected recordset objects. Test first at load.
- Avoid/delay use of most Declarative Referential Integrity (DRI) in immature or fast OLTP databases – put check logic in stored procedures or the middle tier.
  o Exception – when an “in production” schema is not fully abstracted, or the new db is deployed in an environment consisting of multiple data updaters of varying skill levels.
- Where row order is constrained always use order by & asc/desc qualifiers – never rely on default sort order or unconstrained data order in a table.
- Use outer joins in lieu of correlated sub-queries (for “not exists”, etc.), unless you first closely check the query plan of the statement to make sure it’s not “acting” like a correlated sub-query.
- Avoid custom data types. They’re equivalent to DRI and slow down updates to the db.
- Consider use of the “no lock” optimizer hint for DSS, catalog, and read only table operations.
- Enable “select into/bulk copy” on all new (v7 only) dbs.
- Enable “auto create statistics”, “auto update statistics” & “auto shrink” on new small/medium sized dbs. Evaluate changing these configurations on larger/faster dbs.
- Always use a stored procedure template for header/footer code fragments.
- Always create a usage example, author, & date in the comment section of the stored procedure definition.
- Always use return codes in stored procedures such as: 1 or a relevant row count for success, 0 for failure or no records affected/found.
- Use integer type keys when designing primary keys for lookup speed.
- Use timestamps in preference to other primary key candidates when designing tables for high speed write access, since you can then guarantee uniqueness while still writing to an un-indexed heap.
- Use triggers only to auto-generate log files or timestamps – use procedures elsewhere – it keeps the db more maintainable.
- Avoid bit & smallmoney data types, use integer & money types instead (word alignment).
- Use datetime / smalldatetime & CurrentTimestamp / getdate() functions instead of the timestamp data type (DRI).
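A minimal sketch of the “while loop over a small table variable with an identity” pattern mentioned above; the table, column, and procedure names (tOrder, OrderStatus, pOrderProcess) are hypothetical:

declare @Work table (RowId int identity(1,1), OrderId int)
declare @i int, @MaxRow int, @OrderId int

-- gather the keys to work on in one set-based statement
insert into @Work (OrderId)
select OrderId from dbo.tOrder where OrderStatus = 'PENDING'

select @i = 1, @MaxRow = @@ROWCOUNT

-- loop on the small table variable instead of opening a server side cursor
while @i <= @MaxRow
begin
    select @OrderId = OrderId from @Work where RowId = @i
    exec dbo.pOrderProcess @OrderId = @OrderId   -- hypothetical per-row processing proc
    set @i = @i + 1
end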
5. Locks & Deadlocks
Locking is a normal part of db operation. While specifying “NO LOCK” optimizer hints for static reads can
sometimes be useful (and sometimes dangerous), other locking tweaks can cause more trouble than they
solve. Often locking problems are due to complex, inefficient, multi-part read/write T-SQL transactions
(which often result over time from uncoordinated, organic development), combined with a slow I/O
subsystem.
- Theoretically deadlocks can be avoided on all intra-db I/O by carefully designing all transactions compatibly, with a consistent locking sequence (a sketch follows this list). As a practical matter, that much planning is usually uneconomical. However, on well designed systems, locking problems are minimal. The typical scenario is to repair, well after rollout, SQL code that fails intermittently under increasing load, as deadlocks or timeouts. Use Profiler or sp_lock & sp_who2 to find the locks/deadlocks, and examine the sequence and structure of the associated transaction & T-SQL.
- Fixing the transaction design & T-SQL can often be cleanly accomplished if you’re abstracted with stored procedures, and very difficult if you’re not.
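A minimal sketch of the “consistent locking sequence” idea above; the tables (tAccount, tAudit) and the values are hypothetical. If every transaction that touches both tables touches tAccount first and tAudit second, two concurrent transactions cannot hold locks in opposite orders and deadlock each other:

declare @AccountId int, @Amount money
select  @AccountId = 42, @Amount = 100.00   -- illustrative values

begin transaction
    -- step 1: always touch tAccount first ...
    update dbo.tAccount
    set    Balance = Balance - @Amount
    where  AccountId = @AccountId

    -- step 2: ... then tAudit, in every transaction that uses both tables
    insert into dbo.tAudit (AccountId, Delta, AtTime)
    values (@AccountId, -@Amount, getdate())
commit transaction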
6. Temp Tables & Table Valued Variables
First, as a point of reminder - temp tables (names preceded by a # [local to the connection] or ## [global
to all connections] sign) are persistent storage - and so involve writes to tempdb when they are
created, when they are filled, and also when they are dropped. Therefore, one should try to
avoid/minimize temp table usage if possible, by modifying query logic. That said, there are often
scenarios when, given a particular pre-existing schema and required query output, temp storage of some
kind can't be avoided. In these scenarios, try to follow these (sometimes tortured) rules:
- Only use table valued variables when:
  o the total amount of data to be stored in them doesn't exceed a very few 8k data pages,
  o there is no need for transactional capability.
Otherwise, use temp tables. Table valued variables aren't optimized for data which has to be written to disk. Nor are table valued variables transactional. Consider not only one instance of your proc, but many near-simultaneous instances, and the data space in RAM they may collectively occupy.
Please see the links below for execution time comparisons as a function of cardinality/data size, temp data storage technique, and SQL Server version. The execution times vary by several fold.
- For stored procedures with branching logic that branches to different queries, put the leaves of all of the branches, each containing a unique query, into separate (sub) procs [a proc called from within another proc]. This will preserve query plans and reduce recompiles.
SQL Server caches a query plan for each proc the first time it is compiled or run. There is no
sophisticated logic to handle the possibility of a single proc needing different query plans. So, for a single
highly branched proc, there is a substantial chance that the cached query plan will be inapplicable, and
the proc will need to be re-compiled, on the fly, to generate a correct plan for the current execution
scenario. This re-compilation both slows down the proc and loads the CPU resources of the server. The
issue is not so much any single proc, but the cumulative load from many near simultaneous executions of
such procs, with different arguments and therefore different branching, that may result in an OLTP
environment.
- In any proc containing temp tables, declare all of the temp tables used in the proc at the very beginning of the proc (this takes advantage of special SQL Server optimizations designed to eliminate unnecessary recompiles due to temp tables).
Recall that temp tables don't necessarily exist at the time a proc is compiled or first run. To prevent
recompiles under these scenarios, Microsoft has written in a minimally documented optimization into
stored procedure query plan generation. First, for many reasons, don't use the "select into" construct.
Declare/create the temp table(s), and then insert into it/them. Second, put all of the temp table
declarations for the stored procedure at the very beginning of the proc, before anything else in the proc.
Then the optimizer will find the temp table definitions and interpret them as existing local tables for the
remainder of the proc.
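A minimal sketch of the pattern just described; the proc, table, and column names are hypothetical. All temp table DDL comes first, followed by an explicit insert (no "select into"):

create procedure dbo.pGetPendingOrders
as
begin
    -- all temp table declarations up front, before any other statement in the proc
    create table #Pending (OrderId int primary key, CustomerId int)

    -- then populate with an explicit insert (no "select into")
    insert into #Pending (OrderId, CustomerId)
    select OrderId, CustomerId
    from   dbo.tOrder
    where  OrderStatus = 'PENDING'

    select p.OrderId,
           c.CustomerName
    from   #Pending p
    join   dbo.tCustomer c on c.CustomerId = p.CustomerId

    return (@@ROWCOUNT)
end
go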
- Reduce to the absolute minimum any use of a temp table which is declared outside of the current proc. It necessarily generates a re-compile.
Recall from above, that, under certain conditions temp tables can be created/declared and then
considered as existing, even though they don't yet exist, at query run-time. This optimization applies to
temp tables properly created/declared within a single stored procedure. If a temp table is used to transfer
data between two stored procedures, and thus is created outside of any particular proc, then that proc
won't be able, at compile time, to identify the particular instance of the temp table in tempdb which
contains the needed data (there may be several, depending on how many instances of the proc are
running at once). This scenario necessarily generates a re-compile for the relevant proc at run-time. Thus,
it should be avoided in high throughput applications wherever possible.
One can alternately consider the "insert exec" family of statements to return data up through a multi-proc
stack. Alternately, one can alter the architecture of a family of procs so that after a branch, data is
returned to the original call, and not subsequently processed by higher level procs.
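A minimal sketch of the “insert exec” alternative mentioned above; the child proc and its column list (pOrderTotalsGet returning OrderId/Amount) are hypothetical:

-- in the calling proc: capture the child proc's result set instead of sharing a temp table across procs
create table #Result (OrderId int, Amount money)

insert into #Result (OrderId, Amount)
    exec dbo.pOrderTotalsGet @CustomerId = 42   -- child proc returns a matching result set

select OrderId, Amount from #Result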
The speed differences associated with observing these (tortured) constraints are often highly significant,
in high throughput production and on large datasets.
Here are two particularly good links which describe the issues involved in the above comments.
http://www.codeproject.com/KB/database/SQP_performance.aspx
http://blogs.msdn.com/sqlprogrammability/archive/2007/01/18/11-0-temporary-tables-table-variables-and-recompiles.aspx
7. Naming Conventions
- Prefixes, if desired:
  o “t” for tables, ex: tOrder
  o “p” for procedures, ex: pOrderGet
  o “sp_” prefix used only for system wide utility stored procedures compiled in the master db
  o Often tables carry no prefix, but other objects in the db do.
- Consider the noun-verb construction of procedure names, ex: pRptOrdersByCustomerGet; object names will then sort together, making browsing easier.
- For Wintel/SQL Server platforms, use mixed/camel case with the first letter of any variable name capitalized (excluding type prefixes), ex: pRptOrdersByCustomerGet
- For Oracle/Unix, use underscores: p_rpt_orders_by_customer_get
- All table names should take the singular form, ex: tOrder, tOrderItem. Plurality depends on context.
- Field names should be consistent throughout a db, so that when a field appears in two different tables, it can be recognized as a key candidate between the two tables.
8. Coding for Readability
- Put SQL keywords on different lines
- Put the select field list on different lines, one field per line (handy for commenting out fields during development)
- Put where clauses on different lines (same thing – so you can disable a filter easily)
- Put join clauses ahead of filter clauses in the where clauses of non SQL-92 (old syntax) select statements (improves backward compatibility and in some cases may affect join logic)
- Use tabs to indent for readability
- Comment as appropriate. Don’t over- or under-comment, and emphasize why you’re doing what you’re doing.
See Appendix A for a stored procedure style example
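In addition to the full template in Appendix A, here is a minimal formatting sketch of the rules above (the table and column names are hypothetical):

select
	o.OrderId,
	o.OrderDate,
	c.CustomerName
from
	dbo.tOrder o
	join dbo.tCustomer c on c.CustomerId = o.CustomerId
where
	o.OrderDate >= '20070101'
	-- and o.OrderStatus = 'OPEN'   -- filters on their own lines are easy to disable
order by
	o.OrderDate asc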
9. Release & Documentation
- The original developer should test every new or modified stored procedure fully before scheduling it into a release.
- All stored procedure source text should be stored in separate files in SourceSafe (or equivalent), with an “.sql” extension and a filename the same as the procedure name.
- Associate and document (link both directions) a feature id number or bug number with each new or changed stored procedure or other production schema change.
- All stored procedures & SQL scripts should be combined into an integrated release script for each release.
- Each integrated release script should be fully tested by development DBAs prior to its release to QA.
- Notify db engineers of
  o All new stored procedures,
  o All stored procedure modifications,
  o All proposed table or schema changes,
  sufficiently prior to release for meaningful review.
- Current db description documents and/or Metadata should be maintained in SourceSafe. Use the Metadata features of SQL Server very cautiously and sparingly. Their portability is problematic.
10. Replication, Clustering, and Log Shipping
Often Replication, Clustering, and Log Shipping are considered for creating warm/hot stand-by systems,
or for creating distributed databases. Use Log Shipping if you can, Clustering if you must, and try to write
your own Replication.
- Replication is highly problematic as a stand-by mechanism. It significantly loads the source server(s), often entangles itself with source & destination db schemas, and the UI in Enterprise Manager has holes. Beware: it is trivially easy (even after SQL2K/SP2) to break a replication instance, via SQL Server Enterprise Manager, such that the system can’t be repaired except via T-SQL. Don’t use any replication unless you can repair it with T-SQL, because you will need to sooner than you think.
- Replication can be used to implement distributed databases. Usually merge replication is used (although one can hack bi-directional transactional replication, which introduces other problems). Due to the way merge replication wraps itself around a source schema, it makes future schema changes more difficult. For this reason the schema should be stable before using this tool. Further, the tool will require significant ongoing maintenance. Don’t use it for hot OLTP systems, because of the load it places on the source server. Many seasoned DBAs write their own lightweight replication processes in lieu of using Microsoft’s canned stuff.
- Clustering is Microsoft’s recommended High-Availability solution. However, their own internal IT staff often uses Log Shipping instead. Cluster Services can fail over completely, automatically and ~quickly, and doesn’t need any server re-naming to be performed on fail-over. However, Clustering is distributed only with the Enterprise Edition of SQL Server, which is significantly more expensive than the Standard Edition. Microsoft Clustering is also notoriously buggy (Win2K/SP3 & SQL2K/SP2 finally addressed some of the lingering problems), particularly for ancillary SQL Server services (like the scheduler, mail, etc.), 3rd party apps on the server, or even slightly non-HCL hardware. Several 3rd party clustering solutions are very good (NSI DoubleTake).
- Log Shipping can lose some committed data on fail over (as can clustering), and requires a server re-name, or other re-direct, on fail over. Log shipping also loads the source server. However, it’s relatively simple to create and troubleshoot, and is a known and utilized architecture on many platforms, by many DBAs. SQL2K Enterprise Edition has a built-in version, and Microsoft distributes several free lighter weight versions. There are also several versions on the web. You can also write your own.
The economics of DBMS availability are tricky. Replication, Clustering, and Log Shipping are simply
mechanisms for increasing database availability. Often organizations over-spend on availability
products, under-design their internal support systems, and under-train their staff. This can result in
critical gaps, and un-economical overlap, in the DBMS's fault tolerance. Of particular economic
importance is not to spend extra money preventing two-point failure scenarios without a strong
understanding of the system fault tree and relative component failure rates.
A careful, quantitative availability analysis, using real data and valid stochastic methods, can often
yield a surprisingly different result than what is assumed.
Appendix A
Stored Procedure Template with Example
Blue text represents header & footer information
use master
go

IF EXISTS (select * from dbo.sysobjects where id = object_id('[dbo].[sp_BcpDataOutOfDB]')
           and OBJECTPROPERTY(id, 'IsProcedure') = 1)
BEGIN
    PRINT 'Dropping Procedure sp_BcpDataOutOfDB'
    DROP Procedure sp_BcpDataOutOfDB
END
GO

PRINT 'Creating Procedure sp_BcpDataOutOfDB'
GO

CREATE Procedure sp_BcpDataOutOfDB (
    @OutputPath varchar(256) = 'c:\',
    @DB_Name    varchar(256) = '',
    @ServerName varchar(64)  = '',
    @UserName   varchar(64)  = '',
    @Password   varchar(64)  = ''
)
/*
sp_BcpDataOutOfDB    sMull, 09/18/2000
This procedure copies out all of the dbo tables in a specified
db as bcp files in native or character format. TBD: add non-dbo tables &
check xp_cmdshell permissions ..
usage example(s):
    sp_BcpDataOutOfDB @OutputPath='c:\working', @DB_Name='working'
*/
AS
begin
    declare @sSQL     varchar(512),
            @iCounter int,
            @Result   int

    ----------------------- Initialization ---------------------------
    if @DB_Name = ''
        select @DB_Name = DB_NAME()
    if @ServerName = ''
        select @ServerName = @@SERVERNAME
    if @UserName = ''
        select @UserName = SYSTEM_USER
    if right(@OutputPath, 1) <> '\'
        select @OutputPath = @OutputPath + '\'

    create table #TableNames (
        theCounter int Identity (1,1),
        TableName  sysname
    )
    Create unique clustered index idx_TblName on #TableNames(theCounter)

    set  @sSQL = 'dir ' + @OutputPath
    EXEC @Result = xp_cmdshell @sSQL, NO_OUTPUT
    IF   @Result <> 0
    begin
        set  @sSQL = 'md ' + @OutputPath
        exec xp_cmdshell @sSQL
    end

    --------------- Load in table names from target db ---------------
    set @sSQL = 'insert #TableNames select TableName = so.[name] from ' + @DB_Name +
                '..sysobjects so where xtype = ' + '''' + 'U' + '''' + ' and substring(so.name,1,2) <> ' +
                '''' + 'dt' + '''' + ' and substring(so.name,1,3) <> ' + '''' + 'sys' + ''''
    exec (@sSQL)
    set @iCounter = @@ROWCOUNT

    --------------- Process tables -----------------------------------
    while @iCounter > 0
    begin
        set @sSQL = ''
        select @sSQL = 'xp_cmdshell ' + '''' + 'bcp "' + @DB_Name + '..' + tn.TableName +
                       '" out "' + @OutputPath + @DB_Name + '..' + tn.TableName + '.bcp" -n -S' +
                       @ServerName + ' -U' + @UserName + ' -P' + @Password + ''''
        from   #TableNames tn
        where  theCounter = @iCounter

        select TableName = left(tn.TableName, 32),
               DateTm    = getdate()
        from   #TableNames tn
        where  theCounter = @iCounter

        if @sSQL <> '' exec (@sSQL)
        select @iCounter = @iCounter - 1
    end

    return (1)
end
go

grant execute on sp_BcpDataOutOfDB to public
go
Fast De-normalization example code (SQL 2k5):
This process is much faster than various cursor-based processes ..
Use adventureworks;
SELECT p1.productmodelId,
( SELECT Name + '|'
FROM adventureworks.production.Product p2
WHERE p2.productmodelId = p1.productmodelId
ORDER BY Name
FOR XML PATH('') ) AS Product
FROM adventureworks.production.Product p1
GROUP BY productmodelId ;
APPENDIX B
SQL Programmability & API Development Team Blog
http://blogs.msdn.com/sqlprogrammability/archive/2007/01/18/11-0-temporary-tables-table-variables-and-recompiles.aspx
All posts are AS IS, without any further guarantees or warranties.
11.0 Temporary Tables, Table Variables and Recompiles
11.1 Temporary Tables versus Table Variables
In order to determine if table variables or temporary tables are the best fit for your application, let us first examine some
characteristics of table variables and temporary tables:
1. Table variables have a scope associated with them. If a table variable is declared in a stored procedure, it is local to that stored procedure and cannot be referenced in a nested procedure.
2. The transaction semantics for table variables is different from temporary tables. Table variables are not affected by transaction rollbacks. Every operation on a table variable is committed immediately in a separate transaction.
3. It is not possible to create indexes on table variables except by specifying them as constraints (primary key and unique key). In other words, no DDL is possible on table variables. On the other hand, it is possible to create non-key indexes on temporary tables. This is especially useful if the data set is large.
4. Temporary tables can also be referenced in nested stored procedures and may be the right fit if the object needs to be available for a longer duration (not just scoped to the batch like table variables).
5. Both temporary tables and table variables are kept in memory until their size reaches a certain threshold, after which they are pushed to disk.
Let us examine the caching impact of temporary tables and table variables. There are no statistics based recompiles
for table variables. The general rule of thumb is to use temporary tables when operating on large datasets and table
variables for small datasets with frequent updates. Consider the example below where the table test_table has 10
rows of data and 100K rows are inserted into a table variable. Then a join is performed on the table variable and
test_table.
create procedure table_variable_proc
as
begin
declare @table_variable table(col1 int, col2 varchar(128));
declare @i int;
set @i = 0;
while (@i < 100000)
begin
insert into @table_variable values(@i, convert(varchar(128), @i));
set @i = @i + 1;
end
select * from @table_variable tv join test_table on tv.col1 = test_table.col1;
end
go
exec table_variable_proc
go
Now let us rewrite the same example with temporary tables to compare and contrast the two approaches:
create procedure temp_table_proc
as
begin
create table #table_name(col1 int, col2 varchar(128));
declare @i int;
set @i = 0;
while (@i < 100000)
begin
insert into #table_name values(@i, convert(varchar(128), @i));
set @i = @i + 1;
end
select * from #table_name join test_table on #table_name.col1 = test_table.col1;
end
go
exec temp_table_proc
go
Now query the DMVs to get the query plan and average CPU time:
select total_worker_time/execution_count as avg_cpu_time,
       substring(st.text, (qs.statement_start_offset/2) + 1,
                 ((case statement_end_offset
                       when -1 then datalength(st.text)
                       else qs.statement_end_offset
                   end - qs.statement_start_offset)/2) + 1) as statement_text,
       cast(query_plan as xml)
from sys.dm_exec_query_stats qs
cross apply sys.dm_exec_sql_text(qs.sql_handle) st
cross apply sys.dm_exec_text_query_plan(qs.plan_handle, qs.statement_start_offset, qs.statement_end_offset)
order by total_worker_time/execution_count desc;
go
Avg_Cpu_Time   Statement_Text                                            Query_plan (excerpt)
208412         select * from @table_variable tv                          Nested Loops (Inner Join), EstimateRows="1",
               join test_table on tv.col1 = test_table.col1;             CachedPlanSize="8", CompileTime="2", CompileCPU="2"
51978          select * from #table_name                                 Hash Match (Inner Join), EstimateRows="10",
               join test_table on #table_name.col1 = test_table.col1;    CachedPlanSize="21", CompileTime="322", CompileCPU="203"
The temporary tables query outperforms the table variables query. Notice that the query plan for the table variables
query estimates 1 row at compile time and therefore chooses a nested loop join. In the temporary tables case,
however, the query plan chosen is a hash join which leads to better query performance. Since the query plan for table
variables always estimates the number of rows at compile time to be zero or one, table variables may be more
suitable when operating on smaller datasets.
11.2 Recompiles Based on Temporary Tables
Recompiles of queries with temporary tables occur for several reasons and can cause poor performance if temporary
tables are not used properly. Using examples we will look at the most common causes for recompiles based on
temporary tables. Consider the example below:
create procedure DemoProc1
as
begin
create table #t1(a int, b int);
insert into #t1 values(1,2);
select * from #t1;
end
go
Enable the SP:Recompile and SP:StmtRecompile events in profiler.
exec DemoProc1
go
TextData                        EventClass          EventSubClass
insert into #t1 values(1,2);    SP:Recompile        3 - Deferred compile
insert into #t1 values(1,2);    SQL:StmtRecompile   3 - Deferred compile
select * from #t1;              SP:Recompile        3 - Deferred compile
select * from #t1;              SQL:StmtRecompile   3 - Deferred compile
When the stored procedure DemoProc1 is compiled, the insert and select query are not compiled. This is because
during initial compilation, the temporary table does not exist and the compilation of this query is deferred until
execution time. A compiled plan for the stored procedure is generated, but is incomplete. At execution time, the
temporary table is created, and the select and insert statement are compiled. Since the stored procedure is already in
execution, this compilation of the select and insert query are classified as a recompilation. It is important to note that
in SQL Server 2005, only the select and insert statement in the stored procedure are recompiled. In SQL Server
2000, the entire stored procedure is recompiled. Subsequent re-executions of this stored procedure do not result in
any more recompiles since the compiled plan is cached. Notice that even though the temporary table is re-created
each time the stored procedure is executed, we do not recompile the stored procedure each time. This is because the
temporary table is referenced in the plan by name and not by ID if they are created in the same module. Since the
temporary table is re-created each time with the same name, the same compiled plan is re-used. Now consider a
case when the temporary table is referenced in a second stored procedure as below:
create procedure DemoProc1
as
begin
create table #t1(a int, b int);
insert into #t1 values(1,2);
exec DemoProc2;
end
go
create procedure DemoProc2
as
begin
select * from #t1;
end
go
Now enable the SP:Recompile and SP:StmtRecompile events in profiler.
exec DemoProc1
go
exec DemoProc1
go
TextData                        EventClass          EventSubClass
insert into #t1 values(1,2);    SP:Recompile        3 - Deferred compile
insert into #t1 values(1,2);    SQL:StmtRecompile   3 - Deferred compile
select * from #t1;              SP:Recompile        1 - Schema changed
select * from #t1;              SQL:StmtRecompile   1 - Schema changed
Each execution of DemoProc1 leads to recompiles of the select statement. This is because the temporary table is
referenced by ID in DemoProc2 since the temporary table was not created in the same stored procedure. Since the
ID changes every time the temporary table is created, the select query in DemoProc2 is recompiled.
Now let us make a slight variation to DemoProc1 as illustrated below:
create procedure DemoProc1
as
begin
create table #t1(a int, b int);
insert into #t1 values(1,2);
exec DemoProc2;
exec DemoProc2;
end
go
create procedure DemoProc2
as
begin
select * from #t1;
end
go
exec DemoProc1
go
Notice that the second execution of DemoProc2 inside DemoProc1 causes no recompiles. This is because we
already have the cached query plan for the select query on the temporary table, and it can be re-used because the
temporary table ID is the same.
It is important to group together all DDL statements (like creating indexes) for temporary tables at the start of a stored
procedure. By placing these DDL statements together unnecessary compilations due to schema change can be
avoided. Some other common reasons for recompiles relating to temporary tables include: declare cursor statements
whose select statement references a temporary table, or in an exec or sp_executesql statement.
One of the most common reasons for recompiles of queries with temporary tables is row count modification. Consider
the example below:
create procedure RowCountDemo
as
begin
create table #t1 (a int, b int);
declare @i int;
set @i = 0;
while (@i < 20)
begin
insert into #t1 values (@i, 2*@i - 50);
select a from #t1 where a < 10 or ((b > 20 or a >=100) and (a < 10000)) group by a
;
set @i = @i + 1;
end
end
go
Before executing the stored procedure enable the SP:Recompile and SP:StmtRecompile events in profiler.
exec RowCountDemo
go
The trace event data is as follows:
TextData                                                                               EventClass          EventSubClass
select a from #t1 where a < 10 or ((b > 20 or a >=100) and (a < 10000)) group by a ;   SP:Recompile        2 - Statistics changed
select a from #t1 where a < 10 or ((b > 20 or a >=100) and (a < 10000)) group by a ;   SQL:StmtRecompile   2 - Statistics changed
After 6 modifications to an empty temporary table, any stored procedure referencing that temporary table will need to
be recompiled, because the temporary table statistics need to be refreshed.
The recompilation threshold for a table partly determines the frequency with which queries that refer to the table
recompile. Recompilation threshold depends on the table type (permanent vs temporary), and the cardinality (number
of rows in the table) when a query plan is compiled. The recompilation thresholds for all of the tables referenced in a
batch are stored with the query plans of that batch.
Recompilation threshold is calculated as follows for temporary tables: n is the cardinality of the temporary table when
the query plan is compiled.
If n < 6, Recompilation threshold = 6.
If 6 <= n <= 500, Recompilation threshold = 500.
If n > 500, Recompilation threshold = 500 + 0.20 * n.
For table variables recompilation thresholds do not exist. Therefore, recompilations do not happen because of
changes in cardinalities of table variables.
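As a worked example of the thresholds above (illustrative numbers only): a temporary table holding 1,000 rows when the plan is compiled has a recompilation threshold of 500 + 0.20 * 1000 = 700, so roughly 700 further row modifications will trigger a statistics-changed recompile of the statements that reference it; a temporary table compiled while empty (n < 6) recompiles after only 6 modifications, which matches the RowCountDemo trace shown earlier.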
Published Thursday, January 18, 2007 7:00 AM by sangeethashekar
Filed under: SQL Server 2005, Procedure Cache
APPENDIX C
Temporary Tables vs. Table Variables and Their Effect on SQL Server Performance
http://www.codeproject.com/KB/database/SQP_performance.aspx
Introduction
There are three major theoretical differences between temporary tables:
CREATE table #T (…)
AND table-variables
DECLARE @T table (…)
Let's Begin
The first one is that transaction logs are not recorded for the table-variables. Hence, they are out of scope
of the transaction mechanism, as is clearly visible from this example:
CREATE table #T (s varchar(128))
DECLARE @T table (s varchar(128))
INSERT into #T select 'old value #'
INSERT into @T select 'old value @'
BEGIN transaction
UPDATE #T set s='new value #'
UPDATE @T set s='new value @'
ROLLBACK transaction
SELECT * from #T
SELECT * from @T
s
--------------old value #
s
--------------new value @
After declaring our temporary table #T and our table-variable @T, we assign each one with the same "old
value" string. Then, we begin a transaction that updates their contents. At this point, both will now
contain the same "new value" string. But when we rollback the transaction, as you can see, the table-variable @T retained its value instead of reverting back to the "old value" string. This happened because,
even though the table-variable was updated within the transaction, it is not a part of the transaction itself.
The second major difference is that any procedure with a temporary table cannot be pre-compiled, while
an execution plan of procedures with table-variables can be statically compiled in advance. Pre-compiling
a script gives a major advantage to its speed of execution. This advantage can be dramatic for long
procedures, where recompilation can be too pricy.
Finally, table-variables exist only in the same scope as variables. Contrary to the temporary tables, they
are not visible in inner stored procedures and in exec(string) statements. Also, they cannot be used in
INSERT/EXEC statements.
But let's compare both in terms of performance.
At first, we prepare a test table with 1 million records:
CREATE table NUM (n int primary key, s varchar(128))
GO
SET nocount on
DECLARE @n int
SET @n=1000000
WHILE @n>0 begin
INSERT into NUM
SELECT @n,'Value: '+convert(varchar,@n)
SET @n=@n-1
END
GO
Now we prepare our test procedure T1:
CREATE procedure T1
@total int
AS
CREATE table #T (n int, s varchar(128))
INSERT into #T select n,s from NUM
WHERE n%100>0 and n<=@total
DECLARE @res varchar(128)
SELECT @res=max(s) from NUM
WHERE n<=@total and
NOT exists(select * from #T
WHERE #T.n=NUM.n)
GO
Called with a parameter, which we will vary from 10, 100, 1000, 10'000, 100'000 up to 1'000'000, it
copies the given number of records into a temporary table (with some exceptions, it skips records where n
is divisible by 100), and then finds a max(s) of such missing records. Of course, the more records we
give, the longer the execution is.
To measure the execution time precisely, I use the code:
DECLARE @t1 datetime, @n int
SET @t1=getdate()
SET @n=100 -- (**)
WHILE @n>0 begin
EXEC T1 1000 -- (*)
SET @n=@n-1 end
SELECT datediff(ms,@t1,getdate())
GO
(*) is a parameter to our procedure; it is varied from 10 to 1'000'000. (**) If an execution time is too short,
I repeat the same loop 10 or 100 times. I run the code several times to get a result of a 'warm' execution.
The results can be found in Table 1 (see below).
Now let's try to improve our stored procedure by adding a primary key to the temporary table:
CREATE procedure T2
@total int
AS
CREATE table #T (n int primary key, s varchar(128))
INSERT into #T select n,s from NUM
WHERE n%100>0 and n<=@total
DECLARE @res varchar(128)
SELECT @res=max(s) from NUM
WHERE n<=@total and
NOT exists(select * from #T
WHERE #T.n=NUM.n)
GO
Then, let's create a third one. With a clustered index, it works much better. But let's create the index
AFTER we insert data into the temporary table – usually, it is better:
CREATE procedure T3
@total int
AS
CREATE table #T (n int, s varchar(128))
INSERT into #T select n,s from NUM
WHERE n%100>0 and n<=@total
CREATE clustered index Tind on #T (n)
DECLARE @res varchar(128)
SELECT @res=max(s) from NUM
WHERE n<=@total and
NOT exists(select * from #T
WHERE #T.n=NUM.n)
GO
Surprise! It not only takes longer for the big amounts of data; merely adding 10 records takes an additional
13 milliseconds. The problem is that 'create index' statements force SQL Server to recompile stored
procedures, which slows down the execution significantly.
Now let's try the same using table-variables:
CREATE procedure V1
@total int
AS
DECLARE @V table (n int, s varchar(128))
INSERT into @V select n,s from NUM
WHERE n%100>0 and n<=@total
DECLARE @res varchar(128)
SELECT @res=max(s) from NUM
WHERE n<=@total and
NOT exists(select * from @V V
WHERE V.n=NUM.n)
GO
To our surprise, this version is not significantly faster than the version with the temporary table. This is a
result of a special optimization SQL Server has for the create table #T statements in the very beginning of
a stored procedure. For the whole range of values, V1 works better or the same as T1.
Now let's try the same with a primary key:
CREATE procedure V2
@total int
AS
DECLARE @V table (n int primary key, s varchar(128))
INSERT into @V select n,s from NUM
WHERE n%100>0 and n<=@total
DECLARE @res varchar(128)
SELECT @res=max(s) from NUM
WHERE n<=@total and
NOT exists(select * from @V V
WHERE V.n=NUM.n)
GO
The result is much better, but T2 outruns this version.
Table 1, using SQL Server 2000, time in ms
Records     T1      T2      T3      V1      V2
10          0.7     1       13.5    0.6     0.8
100         1.2     1.7     14.2    1.2     1.3
1000        7.1     5.5     27      7       5.3
10000       72      57      82      71      48
100000      883     480     580     840     510
1000000     45056   6090    15220   20240   12010
But the real shock is when you try the same on SQL Server 2005:
Table 2
N           T1      T2      T3      V1          V2
10          0.5     0.5     5.3     0.2         0.2
100         2       1.2     6.4     61.8        2.5
1000        9.3     8.5     13.5    168         140
10000       67.4    79.2    71.3    17133       13910
100000      700     794     659     Too long!   Too long!
1000000     10556   8673    6440    Too long!   Too long!
In some cases, SQL 2005 was much faster than SQL 2000 (marked with green in the original article). But in many cases,
especially with huge amounts of data, procedures that used table variables took much longer (highlighted
with red). In four cases, I even gave up waiting.
Conclusion
1. There is no universal rule of when and where to use temporary tables or table variables. Try
them both and experiment.
2. In your tests, verify both sides of the spectrum – small amount/number of records and the
huge data sets.
3. Be careful with migrating to SQL 2005 when you use complicated logic in your stored
procedures. The same code can run 10-100 times slower on SQL Server 2005!
About the Author
Dmitry Tsuranoff
Born in 1968, Dmitry Tsuranoff is SQL professional who addresses issues
and problems from the perspective of both a database developer and a
DBA. He has worked in the United States, France and Russia. Currently, he
is employed as Systems Architect and Team Manager at Lakeside
Technologies, developer of high-performance Lakeside SQL Server Tools.
For more information, please visit http://www.lakesidesql.com.
APPENDIX D
Optimize disk configuration in SQL Server
http://searchsqlserver.techtarget.com/tip/0,289483,sid87_gci1262122,00.html
By Denny Cherry
06.25.2007
Rating: 4.50 (out of 5)
One of the easiest ways to improve the lifetime performance of a SQL Server database is proper setup of
the physical and logical drives. While it's an easy technique, proper disk subsystem configuration is often
overlooked or handled by a member of the IT staff other than a SQL Server DBA.
All too often the disk subsystem, including the disk array, is configured based only on storage capacity,
with no question of drive performance. Let's go beyond storage capacity requirements and design for
drive performance.
Before you sit down with a storage administrator or engineer to map out your plans for disk
configuration, here are basic preparation steps to take. Start by getting familiar with the terms below, and
you will be able to communicate your requirements much more easily.
 RAID – Redundant Array of Inexpensive Disks, also known as Redundant Array of Independent Disks.
 Disk subsystem – A general term that refers to the disks on the server.
 Spindle – Spindles are another way to refer to the physical disk drives that make up the RAID array.
 I/O Ops – Input/Output operations, usually measured per second.
 Queuing – Number of I/O Ops that are pending completion by the disk subsystem.
 SAN – Storage area networks are collections of storage devices and fibre switches connected together
along with the servers that access the storage on the device. SAN has also become a generic term,
which refers to the physical storage drives such as EMC, 3PAR and Hitachi.
 LUN – Logical Unit Number – This is the identification number assigned to a volume when created on a
SAN device.
 Physical drive – How Windows sees any RAID array, single drive or LUN that is attached to the
server.
 Logical drive – How Windows presents drives to the user (C:, D:, E:, etc.).
 Block size – The amount of data read from the spindles in a single read operation. This size varies per
vendor from 8 KB to 256 MB.
 Hardware array – A RAID array created using a physical RAID controller.
 Software array – A RAID array created within Windows using the computer management snap-in.
 Hot spare – A spindle that sits in the drive cage and is added to the array automatically in the event of
a drive failure. While this does not increase capacity, it does reduce the amount of time that the array is
susceptible to data loss because of a second failed drive.
 Recovery time – Amount of time needed for the RAID array to become fully redundant after a failed
drive has been replaced, either manually or automatically via a hot spare.
There are many RAID levels available on modern RAID controllers. Only a subset of these is most useful
when configuring a Microsoft SQL Server. Each RAID level has a specific use and benefit. Using the
wrong type of RAID level can not only hurt system performance, but also add more cost to your server
configuration. Experts recommend that you never use software arrays on a database server. Use of
software arrays requires additional CPU power from Windows in order to calculate which physical disk
the data is written to. In hardware arrays, this overhead is offloaded to a physical PCI, PCIe or PCIx card
within the computer (or within the SAN device), which has its own processor and software dedicated to
this task.
RAID 1 – Mirror. A RAID 1 array is most useful for high write files, such as the page file, transaction
logs and tempdb database. A RAID 1 array takes two physical disks and creates an exact duplicate of the
primary drive on the backup drive. There is no performance gain or loss when using a RAID 1 array. This
array can survive a single drive failure without incurring any data loss.
RAID 5 – Redundant Stripe Set. A RAID 5 array is most useful for high read files such as the database
files (mdf and ndf files) and file shares. It is the most cost-effective, high-speed RAID configuration.
With a RAID 5 array, there is a performance impact while writing data to the array because a parity bit
must be calculated for each write operation performed. For read performance, the basic formula is (n-1)*o
where n is the number of disks in the RAID 5 array and o is the number of I/O operations each disk can
perform. Note: While this calculation is not perfectly accurate, it is generally considered close enough for
most uses. A RAID 5 array can survive a single drive failure without incurring any data loss.
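As a worked example of that read formula (hypothetical numbers): six spindles that can each sustain about 150 I/O operations per second give roughly (6-1)*150 = 750 read IOPs for the RAID 5 array, before any controller caching is considered.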
RAID 6 – Double Redundant Stripe Set. Like a RAID 5 array, a RAID 6 array is most useful for high
read files such as the database and file shares. With RAID 6, there is also a performance impact while
writing data to the array because two parity bits must be calculated for each write operation performed.
The same basic formula is used to calculate the potential performance of the drives (n-2)*o. A RAID 6
array can survive two drive failures without incurring any data loss.
Because of the dual parity bits with RAID 6, it is more expensive to purchase than a RAID 5 array.
However, RAID 6 offers a higher level of protection than RAID 5. When choosing between RAID 5 and
RAID 6, consider the length of time to rebuild your array, potential loss of a second drive during that
rebuild time, and cost.
RAID 10 – Mirrored Stripe Sets. A RAID 10 array is most useful for high read or high write operations.
RAID 10 is extremely fast; however, it is also extremely expensive (compared to the other RAID levels
available). In basic terms, a RAID 10 array is several RAID 1 arrays striped together for performance.
As with a RAID 1 array, as data is written to the active drive in the pair, it is also written to the secondary
drive in the pair. A RAID 10 array can survive several drive failures so long as no two drives in a single
pair are lost.
RAID 50 – Striped RAID 5 Arrays. A RAID 50 array is an extremely high-performing RAID array
useful for very high-load databases. This type of array can typically only be done in a SAN environment.
Two or more RAID 5 arrays are striped together and data is then written to the various RAID
5 arrays. While there is no redundancy between RAID 5 arrays, it's unnecessary because the redundancy
is handled within the RAID 5 arrays. A RAID 50 array can survive several drive failures so long as only a
single drive per RAID 5 array fails.
 With the introduction of larger and larger hard drives, storage administrators have been presented with
a new issue: increased recovery times. In the past, 172 GB drives were the standard size deployed. Even
if you had five drives in a RAID 5 array, the total storage of the array was only ~688 Gigs. At that size,
recovery from a failed drive would take only hours. With those five drives now closing in on the 1TB mark
per drive, they now create a 4 TB array. This larger array could take up to a day to recover (depending on
system load, disk speed and so on).
Because of increased recovery time and, therefore, increased risk of data loss, it is recommended that
smaller disks be used in greater numbers. Instead of a single 2 TB array, use smaller drives and make
several 500 GB arrays. Not only will this improve your recovery time, but your system performance will
increase, too, as you now have many more disks in your arrays. Whenever possible, connect each RAID
array to its own RAID controller. That will minimize the possibility of overloading the RAID controller
with I/O operations.
When setting up your disk subsystems, there are five basic groups of files to keep in mind. These are
the data files (mdf and ndf files), log files (ldf files), tempdb database, the SQL Server binaries (files that
are the actual SQL Server software) and Windows. Windows and the SQL binaries will perform nicely on
a single RAID 1 array, which should include the Windows page file. Note: Although it can be moved to
another RAID 1 array, it's not necessary and will not provide any additional performance benefit.
Data files should be placed on one or more RAID 5 or RAID 6 arrays (based on system needs). In certain
systems, you should place the data files on RAID 1 or RAID 10 arrays, but those systems are in the
minority. Place the transaction log files on one or more RAID 1 or RAID 10 arrays (again, based on size
and performance needs). The tempdb database should be placed on its own RAID 1 or RAID 10 array.
None of these file groups (other than Windows and the SQL binaries) should share physical drives. By
separating your files into these groups, you will see a definite improvement in performance.
To determine the number of spindles, you need to first know a few things about your hardware and your
database and application. For the hardware, find out the number of IOPs each physical drive can handle.
For the application, you need to know the typical high watermark that your database will need to function.
The number of spindles you have times the number of IOPs that each drive can handle must be higher
than the high watermark number of IOPs that your database requires.
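For example (illustrative numbers only): if the application's high watermark is about 2,000 IOPs and each spindle is rated for about 150 IOPs, then at least 2,000 / 150 ≈ 14 spindles are needed to stay above the watermark, before any RAID write penalty is considered.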
Now that we have covered the basics, let's put all this information together and look at some server drive
layouts. These systems all have a RAID 1 drive holding the Operating System, page file and SQL Server
binaries:
The small database server. This would be a database system having a small number of users and a low
number of transactions per second. For a system of this type, putting the data files, transaction logs and
tempdb database on a single RAID 5 array will be fine. With a load this low, there should be minimal to
no performance loss because of having the files on the same array.
The medium database server. This would be a system that has a larger number of users and no more
than 100 transactions per second. At this system size, you will begin to separate parts of the database.
Start by separating the database files and log files onto different RAID arrays. The tempdb database can
probably remain on the same array as the database files. But, if the tempdb database is heavily used for
things like lots of temporary table usage, it should be moved to its own drive array.
The large database server. This system is for hundreds of users and hundreds or thousands of
transactions per second. When designing systems of this size, break apart not only the database files, log
files and tempdb database files, but also split the database itself into smaller data files on multiple drives. This
includes moving the indexes to separate files on RAID 1 or RAID 5 arrays, and moving blob data (data stored
in TEXT, NTEXT, IMAGE, VARCHAR(MAX) and NVARCHAR(MAX) data types) to a separate
RAID 5 array. You can also place different types of tables on different drives by assigning them to
different file groups.
It is recommended that for larger systems, each file group of the database should have between .25 to one
physical file per physical CPU core in the server. The tempdb database should have one data file per
physical CPU core in the server.
It is highly recommended that all partitions are configured by using the DISKPAR (Windows 2000) or
DISKPART (Windows 2003) commands. When using DISKPAR, adjust the alignment by 512 bytes, and
when using DISKPART, the disk should be aligned to 64. The reason for this is due to the original master
boot record design of WinTel based systems. The master boot record for all drives is 63 blocks (1 block =
512 bytes).
The physical disks want to read and write data in 64 block chunks. Because the master boot record is only
63 blocks, this puts the first block of actual data in block location 64, where it should be in block location
65. That forces the disk to read 128 blocks for each 64 blocks read to the disk, thereby increasing the
work needed to be done and decreasing performance.
It is so highly recommended that volumes be created with this 64 block offset that Microsoft is including
this procedure as the standard when creating partitions starting in Microsoft Windows 2008 Server. There
are no published figures on what sort of performance improvement will be seen by creating your disks
using this method. That's because any numbers would be relevant only to the system they were taken against,
as all databases are different. Unfortunately, once a partition has been created without the alignment
offset, there is no easy way to change the offset. The only method for doing that is to create a new volume
and partition with the offset, take down the SQL Server and manually migrate the files to the new drive in
an offline manner.
When designing larger database servers, it is important to pay careful attention to the disk configuration.
An improperly laid out disk configuration can easily have an adverse impact on your database
performance.
Even though laying out the disk subsystem of a SQL Server appears to be a simple task, take great care to
get the maximum amount of performance from your drive subsystem. A poorly configured array can and
will have a detrimental effect on the performance of your database system. When planning and
purchasing your storage hardware, think not only of the capacity, but also about the amount of I/O
operations that the disks will need to handle -- not only today, but far into the future. Proper planning will
give you a high-performing SQL Server today and years from now. When designing a system, your goal
should be that it performs well beyond the expected lifetime of the hardware it resides on.
ABOUT THE AUTHOR:
Denny Cherry is a DBA and database architect managing one of the largest
SQL Server installations in the world, supporting more than 175 million
users. Denny's primary areas of expertise are system architecture,
performance tuning, replication and troubleshooting.
Copyright 2007 TechTarget
http://searchsqlserver.techtarget.com/tip/0,289483,sid87_gci1262122,00.html#