Windows XP
Help protect your desktop environment using the enhanced security
and reliability features included in the Microsoft Windows XP
operating system. Get started now with this collection of best
practices and guidance, which includes settings and policy
configuration information, how-to articles, and more.
Guides
• Microsoft Shared Computer Toolkit for Windows XP
Handbook
• Antivirus Defense-in-Depth Guide Overview (Updated
August 25, 2004)
• Identity and Access Management Series
• Securing Wireless LANs with Certificate Services
• Security Risk Management
• Server and Domain Isolation Using IPsec and Group Policy
• The Administrator Accounts Security Guide
• The Patch Management Process
• The Services and Service Accounts Security Planning Guide
• Threats and Countermeasures
• Windows XP Security Guide v2 (Updated for Service Pack
2)
How-To Articles
• How To: Configure Memory Protection in Windows XP SP2
• How To: Configure Windows XP SP2 Network Protection
Technologies in an Active Directory Environment
• How To: Perform Patch Management Using SMS
• How To: Perform Patch Management Using SUS
• How To: Use Microsoft Baseline Security Analyzer (MBSA)
Additional TechNet Windows XP Resources
• Windows XP Professional
• Windows XP Service Pack 2
• Desktop Deployment Center
Additional Security Resources
• Global Security Centers
• Developer Security Center
• Computer Security at Home
Beginners Guides: Little Known Features
of WindowsXP
Call it... Zen and the art of WindowsXP
Maintenance if you will. - Version 1.0.0
Bookmark this PCstats guide for future reference.
There can be little doubt that Windows XP is
Microsoft's best OS yet. While it has a few
disadvantages in terms of unnecessary bloat, its
balance of performance, stability and outward user-friendliness is hard to match.
As WindowsXP is based on Microsoft's line of server
operating systems, it is undoubtedly this heritage that
provides it with a rather pleasing lack of crashes.
Compare WindowsXP to Windows 98, where the
daily reboot has pretty much been accepted as a
feature of the operating system, and you can see
why it has been embraced so well. This same server-OS origin also provides XP with a deep layer of
configurability. Not necessarily tweaks as such, but
tricks to getting a grip on what is happening behind
the scenes for those with an interest. Call it... Zen
and the art of WindowsXP Maintenance if you will.
In this PCstats Guide, we will explore some of the
little-known features and abilities of Windows XP
Home and Professional Editions, with an eye
towards providing a better understanding of the
capabilities of the operating system, and the options
available to the user.
Computer Management
Depending on the approach you took learning to use
Windows XP, the computer management screen is
either one of the first things you learned about, or
something you have never heard of. Derived directly
from one of the most useful features of Windows
2000, it offers an excellent way of managing many
of the most important elements of the operating
system from a single interface.
You can open the computer management interface
by right clicking on 'my computer' and selecting
'manage.' The computer management window is
divided into three sections. First is System Tools,
which is comprised of several tools to help you
manage and troubleshoot your computer.
The Event viewer
The event viewer is simply an easy interface to view the various logs that Windows
XP keeps by default. Any program or system errors are recorded here in the
application or system logs, and they can be an invaluable source of information if
you are having recurring problems.
The security log is inactive by default, and is only used if you decide to enable
auditing on your XP system. More on this later.
As the application and system logs record all programs and procedures started by
Windows, this is an excellent place to start if you want to know more about what is
going on behind the scenes.
Shared Folders
The shared folders heading contains a simple but useful list of all the folders that
have been enabled for sharing, or in other words folders that are available to a
remote user who may connect to your system over a network or the internet.
Another list, known as a Sessions List shows all remote users currently connected to
your computer, while the Open Files List illustrates which files are currently being
accessed by the remote users in question.
I don't have to tell you that this is an important set of screens if you are concerned
about security, as the sessions and open files lists can easily tell you if you have an
uninvited guest in your system.
Note that the list of shares contains two or more shares denoted by a $ (ADMIN$
and C$). These are the administration shares that are installed by default when you
load XP. The dollar sign indicates that they are hidden shares that will not show up in
an explorer window, but can be accessed directly (try going to 'start/run' and typing
'\\(yourcomputername)\c$').
What this means is that every file in your C: drive is shared by default, which should
illustrate the importance of using a password on ALL your user accounts. If
you do not, every file on your system is essentially wide open. Now one thing to
keep in mind as you pore through the list of shared folders is that you cannot create
or remove shares from this interface; this has to be done from the properties of the
individual folders.
You do have the option of disconnecting remote users from the sessions list,
however.
That task can be accomplished by highlighting the user, going to the 'action' menu
and selecting 'disconnect session.'
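If you prefer the command line, the same information can be pulled up with the built-in 'net share' and 'net session' commands. The little Python sketch below just shells out to them; it is only an illustration, and you can of course type the two commands straight into a command prompt instead (net session needs an elevated prompt).

import subprocess

# List all shares on this machine, including hidden ones such as ADMIN$ and C$.
subprocess.run(["net", "share"])

# List remote users currently connected to this computer (the 'Sessions' view).
subprocess.run(["net", "session"])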
Local users and groups
Provides an interface for managing users and passwords, as well as the groups (XP
professional only) that they belong to. Groups in Windows 2000/XP are simply an
easy way of assigning or restricting rights and privileges to various aspects of
Windows to multiple users.
For example, if your computer is used by many people, you could ensure that other
users do not have the ability to access most system configuration options simply by
adding their user names to the default 'users' group. This group is restricted from
installing most software and using system applications. Moving users over to the
default group also removes them from all other groups, especially the
administrators group, which has unlimited access to the system.
Custom groups can also be created with desired sets of privileges and restrictions. As
for the users window, it is simply a list of all user accounts currently created,
including ones created by the system. There will likely be some accounts here you do
not recognize, especially if you are using XP Professional.
WindowsXP Professional comes with Microsoft's web server software (IIS - Internet
Information Server), which creates the 'IUSR_(yourcomputername)' account to allow
anonymous remote users to access web pages you create.
While you are here, do yourself a favour and make sure that the 'guest' account is
disabled, as it is essentially useless and a potential security hole. Also ensure that
your 'administrator' account and all user accounts are equipped with passwords.
Make sure of this last step, as both XP Home and Professional make the users created
during the installation process members of the 'administrators' group, but do NOT
give them passwords. Yikes!
Storage Management
The storage section of computer management consists of three sections, but only
one is of any real significance. Removable storage is simply a list of your removable
storage devices like tapes, floppies and CD-ROM drives, and any media that is
currently present in them.
Disk defragmenter is the hard drive defragmentation utility, also accessible from the
'system tools' submenu under 'programs/accessories.'
For more information about disk defragmenter, see PCstats Annual PC Checkup
guide. The important entry is the 'disk management' window.
Disk Management
This important window allows you to configure all logical aspects of your hard disk(s)
from a single location. From here you can partition and format new hard disks,
assign new drive letters if necessary, and even mount new drives as directories on
your other hard disks.
Windows XP also can create striped volumes (for increased disk performance) from
here. For a comprehensive how-to on RAID, see our guide here. For the rest of the
features of XP disk management, read on.
From the main computer management screen, you can view a graphical
representation of the way that your system's hard disks are partitioned, and what file
systems they are using, as well as a quickie diagnosis of that drive's relative health.
To start with, any and all disk, volume (what Windows sees as a logical 'drive,' C:,
D:, etc.) or partition options can be accessed from this screen by right clicking on
the disks at the bottom of the screen and hitting 'properties.' Partitions are the
sections of the disk's free space that are organized for use by a file system, and if
you hadn't already guessed, the first disk is numbered "0", the second disk "1," and
so on. Right-clicking on the individual volumes (C:, D:, etc.) at the top of
the screen and selecting options from the menu will also give you some added
insight.
As a safety feature, Windows does not allow you to format (redo the file system on
the disk, erasing all information) or delete the partition containing the Windows
directory from this management screen, but you can carry out these operations on
any other partition by right clicking on it and selecting the 'format' or 'delete
partition/logical drive' options.
If you have installed an additional hard disk and wish to partition and format it, the
disk will be represented as grey 'unpartitioned space' in the graphical display at the
bottom of the screen, and can be partitioned and formatted by right clicking on the
drive and selecting the partition option to start a wizard that will guide you through
the operation.
We have covered the utilities available from the 'properties' menu of individual
drives, such as hard drive defragmentation, backups and sharing in several recent
articles, so for information on these topics try the preceding links.
Mounting drives as folders
One rather interesting option available with the disk manager is the ability to mount
individual partitions as directories in another volume. For example, if you had a
computer with a 20GB disk formatted into a single partition and volume (drive c:),
you could purchase a second drive, partition and format it from disk manager and
then instead of giving it its own drive letter, add it to your c: drive as a directory.
Any files added to that directory would of course be stored in the new HD. This can
come in extremely handy, as certain applications (databases come to mind) can
grow extremely large, but may not support storing data on a separate drive.
As far as Windows is concerned, a drive mounted as a directory is just a directory, so
no extra drive letters are involved. This can also cut down on storage confusion for
the average user, and it's easy to do, though it can only be done with NTFS
formatted partitions. Also, the boot partition cannot be used this way, though other
partitions can be added to the boot partition.
Also note that shuffling the partition around in this way has no effect on the data
stored in it. You can move an NTFS partition from directory to directory, then give it
back a drive letter if you choose, while maintaining complete access to the data
inside. No reboot is necessary.
One other note: If you have installed software on a partition you plan to mount as a
directory, it is best to uninstall and reinstall it, since the move may stop the software
from working correctly. Windows will warn you about this...
To mount a partition as a directory
Open disk manager, and right click on the partition you wish to mount as a directory in
the graphical partition window (lower pane).
Select 'change drive letter and paths…'
Remove the current option (if any), then click add. Choose the 'mount in the
following empty NTFS folder' option, browse to the desired volume and add a directory for
your drive. Click 'ok.'
That's it. If you wish to return things back to the way they were, simply repeat the
procedure, removing the directory location and choosing a drive letter instead. The
data on the drive will be unharmed.
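Incidentally, the same mount can be scripted with the built-in 'mountvol' command-line tool rather than the disk management GUI. The sketch below is purely illustrative: the volume GUID is a made-up placeholder, and running 'mountvol' with no arguments will list the real volume names on your system.

import subprocess

# Placeholder volume GUID; run "mountvol" with no arguments to see the real ones.
volume_name = "\\\\?\\Volume{01234567-89ab-cdef-0123-456789abcdef}\\"
mount_point = r"C:\NewDrive"   # must be an empty folder on an NTFS volume

# Mount the volume into the empty NTFS folder instead of assigning a drive letter.
subprocess.run(["mountvol", mount_point, volume_name])

# To undo the mount later, remove the folder association:
# subprocess.run(["mountvol", mount_point, "/D"])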
Dynamic disks and volumes (XP Professional only)
An option that was added to the Windows repertoire in Win2K, dynamic disks and
volumes are a new way of handling hard drive storage, supplemental to the standard
file system used on each disk to organize files for access by the operating system.
When one or more drives are made dynamic, a database is created by Windows and
stored in the last megabyte of space on all dynamic disks. This database, the
dynamic disk database, contains information about all of the dynamic drives on the
system. As these drives all share a copy of the database, they share the information
about the makeup of each drive in the disk group (a collection of dynamic disks
sharing a database).
This sharing of information provides any dynamic drives in the group with several
options not possible on simple (non-dynamic) drives. To start with, the area of space
on the physical disk used by a dynamic volume (a logical drive like C: contained on a
dynamic disk) no longer needs to be contiguous, and can be resized within Windows.
In other words, you can take a physical disk with a couple of partitions, convert it to
a dynamic disk, delete one volume and then resize the remaining dynamic volume to
use the entire available space, all without leaving the disk management window.
How to Disable a Service
To disable a service first open the services window at 'computer
management/services and applications/services.'
Highlight the service you wish to change, right click and select 'properties.' Hit the
'stop' button to stop the service, then set the 'startup type' dropdown box to
'disabled.'
This stops the service and ensures that it will not reload upon restarting the
computer.
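The same stop-and-disable operation can also be scripted with the 'sc' command-line tool instead of the services window. Here is a minimal sketch; 'SomeService' is a placeholder for the real (short) service name.

import subprocess

service = "SomeService"  # placeholder; use the actual service name, not its display name

# Stop the running service, then prevent it from starting at the next boot.
subprocess.run(["sc", "stop", service])
subprocess.run(["sc", "config", service, "start=", "disabled"])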
Local security policies (XP professional only)
Accessed through the 'administrative tools' menu found in the control panel, the local
security policies window controls the various XP security options like auditing
(keeping a log of which users log into the computer, and what resources they
access), password complexity requirements, which users are allowed to log into the
computer remotely, etc.
All important stuff, but generally something that is useful more at the enterprise
level than for individual home users. If your PC is used by several people, or if you
have had problems in the past with someone breaking into your PC, you might want
to consider some of these settings.
Going through the groups, 'account policies' governs password settings like the minimum
length and complexity requirements of user passwords, and whether there is a limit
to the number of times they can be tried before the account is disabled for a period
of time.
The 'local policies' section contains auditing options, which when set will add reports
to the 'security' log in the event log section of the computer management window, so
you can see who is accessing the resources you have audited.
All auditing is disabled by default, and generally speaking if you wish to enable it,
limit it to one or two options, like auditing account logon events, not the whole
bunch, or you will be overwhelmed with pointless log entries.
Also in local policies are the 'user rights assignment' and 'security' sections, both of
which contain a huge amount of user based options for securing various parts of the
operating system.
An option you may wish to consider here is to remove permission for any account to
'access this computer from the network' (in user rights assignment section),
assuming that you do not wish to access the computer remotely, or host a WWW or
FTP site.
The 'public key policies' section is most often used for enabling EFS, the Encrypting
File System, for encrypting personal documents and information. PCstats covers this
topic extensively in the Encryption and Online Privacy Guide.
'Software restriction policies' and 'IP security policies on local computer' govern
setting rules for restricting software that can be used on the computer, and securing
network traffic through the use of encryption respectively. Both are best left to
centrally managed business environments.
Accessibility options
Windows XP comes equipped with a large variety of what Microsoft calls 'accessibility
options,' tools to make Windows easier to use for people with visual difficulties or
other problems and disabilities.
These can be accessed most easily from the accessibility wizard, found at
'start/programs/accessories/accessibility/accessibility wizard.'
Through this program you can manually change the default Windows text size, scroll
bar size, icon size, choose a high contrast colour scheme and mouse cursor, activate
captions for supporting programs and visual indicators to replace sound effects for
the hard of hearing as well as activate a range of other options by indicating to the
wizard where your difficulties using the system lie. Besides the above options, the
various accessibility features you can enable are:
StickyKeys: Allows any key combination that includes CTRL, ALT or SHIFT to be
entered one key at a time instead of simultaneously.
BounceKeys: Windows will ignore held down or rapidly repeated keystrokes on the
same key.
ToggleKeys: Windows will play a sound when any of the 'lock' keys are pressed,
such as Caps Lock or Num Lock. Very useful this.
MouseKeys: The numeric keypad can be used to control the mouse pointer.
Magnifier: Opens a window at the top of the screen that displays a magnified view
of the area around the cursor.
Narrator: Narrates the contents of system Windows, including the status of things
like checkboxes and options, for the visually impaired. Rather difficult to use, and
reminiscent of Hal 9000 in voice.
On-Screen keyboard: Provides a keyboard option for users who cannot operate a
physical keyboard.
A utility manager is provided to manage settings for the last three programs (Magnifier,
Narrator and the On-Screen keyboard), controlling whether they start automatically when
Windows is loaded, for example.
Built in backup utility
Windows XP contains a built in backup program that allows you to make data
backups to a tape or hard drive. It can be accessed at
'start/programs/accessories/system tools/backup.' Users of XP Home must add the
backup utility from the CD using add/remove programs from the control
panel. PCstats has already covered this feature in our XP backup Guide here.
Files and settings transfer wizard
The files and settings wizard is a new tool added to XP to allow users to transfer their
documents, email and desktop settings from other computers or other Windows
installations automatically.
It works with any Windows operating system from Windows 95 on, and requires
some form of network connection if you wish to transfer the data between
computers. It works by transferring what it considers user specific data, such as the
contents of the desktop and the 'my documents' folder, to the new computer, along
with Windows settings for desktop themes, accessibility options, etc.
The idea is to make your new computer's
working environment identical to that of
your old one. It also will transfer the
settings of certain popular third party
programs like Photoshop, provided the
same application is installed on the new
computer.
Note, the wizard will not adjust hardware settings, so features like the desktop
resolution and refresh rate will have to be changed manually. To use the files and
settings transfer wizard, you will need to first run it from the old computer, either by
creating a floppy disk (which can be done from within the wizard), or by inserting the
Windows XP CD into the old computer and running the wizard from there.
To transfer files and settings from your old computer:
If your computer has a CD-ROM drive, the best way to start the process is to insert
your XP CD, select 'perform additional tasks' from the autorun menu, then 'transfer
files and settings' to launch the wizard.
If you do not have a CD-ROM on the old computer, you will need to create a wizard
disk by running the files and settings transfer wizard, selecting 'new computer' then
following the options to make a disk.
Once you have launched the wizard on your old computer, choose the method you
will use to transfer the information.
Network and direct cable connections can be used, as can floppy or ZIP disks, and
you can also store the information on a drive, either in the current (old) computer or
on a network drive shared out from the new one.
Now choose whether you wish the program to transfer both your files and settings,
either one, or your own custom set of files. A list of the files and settings to be
transferred is provided.
Please note that although the wizard can transfer the settings of various Microsoft
and third party software applications, you will need to have actually installed the
relevant software on your new computer before you do the transfer, as only your
settings are moved over, not the programs themselves.
Windows will create one or more compressed .dat files in the location you chose,
depending on the amount of data to be moved across. This will take a considerable
amount of time if you have large files in your 'my documents' folder or on the
desktop.
Once this process is finished, move to the new computer and start the files and
settings transfer wizard from the accessories/system tools menu. Select 'new
computer,' and indicate that you have already collected files and settings from your
old computer.
It will begin the transfer of settings, which also may take a considerable amount of
time.
If you are running XP and have not yet applied service pack 1, please do so before
you attempt to use the files and settings transfer wizard, as it contains several
relevant bug fixes.
System information
The system information window, reached from 'start/programs/accessories/system
tools/system information,' contains more information about your computer and its
current installation of Windows than you could ever possibly want to know.
If you need some specific information about hardware or software installed in your
computer for tech support, chances are it can be found here.
Hash Indices
1. A hash index organizes the search keys with their associated pointers into a hash
file structure.
2. We apply a hash function on a search key to identify a bucket, and store the key
and its associated pointers in the bucket (or in overflow buckets).
3. Strictly speaking, hash indices are only secondary index structures, since if a file
itself is organized using hashing, there is no need for a separate hash index
structure on it.
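To make points 1 and 2 concrete, here is a toy Python sketch of a secondary hash index: the hash of a search key selects a bucket, and each bucket stores (search key, record pointer) pairs. The fixed bucket count and the integer "record pointers" are simplifications invented for the example.

NUM_BUCKETS = 8  # fixed bucket count for this toy example

# Each bucket holds (search_key, record_pointer) pairs; overflow is just a longer list.
buckets = [[] for _ in range(NUM_BUCKETS)]

def insert(search_key, record_pointer):
    """Hash the search key to pick a bucket, then store the key and its pointer."""
    buckets[hash(search_key) % NUM_BUCKETS].append((search_key, record_pointer))

def lookup(search_key):
    """Return all record pointers stored under the given search key."""
    bucket = buckets[hash(search_key) % NUM_BUCKETS]
    return [ptr for key, ptr in bucket if key == search_key]

insert("Perryridge", 1001)
insert("Round Hill", 1002)
print(lookup("Perryridge"))   # -> [1001]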
Dynamic Hashing
1. As the database grows over time, we have three options:
o Choose hash function based on current file size. Get performance
degradation as file grows.
o Choose hash function based on anticipated file size. Space is wasted
initially.
o Periodically re-organize hash structure as file grows. Requires selecting
new hash function, recomputing all addresses and generating new bucket
assignments. Costly, and shuts down database.
2. Some hashing techniques allow the hash function to be modified dynamically to
accommodate the growth or shrinking of the database. These are called dynamic
hash functions.
o Extendable hashing is one form of dynamic hashing.
o Extendable hashing splits and coalesces buckets as database size changes.
o This imposes some performance overhead, but space efficiency is
maintained.
o As reorganization is on one bucket at a time, overhead is acceptably low.
3. How does it work?
Figure 1: General extendable hash structure.
o We choose a hash function that is uniform and random, and that generates
values over a relatively large range.
o Range is b-bit binary integers (typically b = 32).
o 2^32 is over 4 billion, so we don't generate that many buckets!
o Instead we create buckets on demand, and do not use all b bits of the hash
initially.
o At any point we use i bits, where 0 <= i <= b.
o The i bits are used as an offset into a table of bucket addresses.
o Value of i grows and shrinks with the database.
o Figure 11.19 shows an extendable hash structure.
o Note that the i appearing over the bucket address table tells how many bits
are required to determine the correct bucket.
o It may be the case that several entries point to the same bucket.
o All such entries will have a common hash prefix, but the length of this
prefix may be less than i.
o So we give each bucket an integer giving the length of the common hash
prefix.
o This is shown in Figure 11.9 (textbook 11.19) as i_j.
o Number of bucket address table entries pointing to bucket j is then 2^(i - i_j).
2. To find the bucket containing search key value K_l:
o Compute h(K_l).
o Take the first i high-order bits of h(K_l).
o Look at the corresponding table entry for this i-bit string.
o Follow the bucket pointer in the table entry.
3. We now look at insertions in an extendable hashing scheme.
o Follow the same procedure for lookup, ending up in some bucket j.
o If there is room in the bucket, insert information and insert record in the
file.
o If the bucket is full, we must split the bucket, and redistribute the records.
o If bucket is split we may need to increase the number of bits we use in the
hash.
4. Two cases exist:
1. If i = i_j, then only one entry in the bucket address table points to bucket j.
o Then we need to increase the size of the bucket address table so that we
can include pointers to the two buckets that result from splitting bucket j.
o We increment i by one, thus considering more of the hash, and doubling
the size of the bucket address table.
o Each entry is replaced by two entries, each containing the original pointer value.
o Now two entries in the bucket address table point to bucket j.
o We allocate a new bucket z, and set the second pointer to point to z.
o Set i_j and i_z to i.
o Rehash all records in bucket j, which are put in either j or z.
o Now insert the new record.
o It is remotely possible, but unlikely, that the new hash will still put all of
the records in one bucket.
o If so, split again and increment i again.
2. If i > i_j, then more than one entry in the bucket address table points to bucket j.
o Then we can split bucket j without increasing the size of the bucket
address table (why?).
o Note that all entries that point to bucket j correspond to hash prefixes that
have the same value on the leftmost i_j bits.
o We allocate a new bucket z, and set i_j and i_z to the original i_j plus 1.
o Now adjust the entries in the bucket address table that previously pointed to
bucket j.
o Leave the first half pointing to bucket j, and make the rest point to bucket
z.
o Rehash each record in bucket j as before.
o Reattempt the new insert.
5. Note that in both cases we only need to rehash records in bucket j.
6. Deletion of records is similar. Buckets may have to be coalesced, and the bucket
address table may have to be halved.
7. Insertion is illustrated for the example deposit file of Figure 11.20.
o 32-bit hash values on bname are shown in Figure 11.21.
o An initial empty hash structure is shown in Figure 11.22.
o We insert records one by one.
o We (unrealistically) assume that a bucket can only hold 2 records, in order
to illustrate both situations described.
o As we insert the Perryridge and Round Hill records, this first bucket
becomes full.
o When we insert the next record (Downtown), we must split the bucket.
o Since i = i_j (both are currently 0), we need to increase the number of bits we
use from the hash.
o We now use 1 bit, allowing us 2^1 = 2 buckets.
o This makes us double the size of the bucket address table to two entries.
o We split the bucket, placing the records whose search key hash begins
with 1 in the new bucket, and those with a 0 in the old bucket (Figure
11.23).
o Next we attempt to insert the Redwood record, and find it hashes to 1.
o That bucket is full, and i = i_j.
o So we must split that bucket, increasing the number of bits we must use to
2.
o This necessitates doubling the bucket address table again to four entries
(Figure 11.24).
o We rehash the entries in the old bucket.
o We continue on for the deposit records of Figure 11.20, obtaining the
extendable hash structure of Figure 11.25.
8. Advantages:
o Extendable hashing provides performance that does not degrade as the file
grows.
o Minimal space overhead - no buckets need be reserved for future use.
Bucket address table only contains one pointer for each hash value of
current prefix length.
9. Disadvantages:
o Extra level of indirection in the bucket address table
o Added complexity
10. Summary: A highly attractive technique, provided we accept added complexity.
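To tie the lookup and insertion procedures above together, here is a minimal Python sketch of an extendable hash table. It follows the scheme described in these notes (global depth i, per-bucket depth i_j, directory doubling when i = i_j), but the bucket capacity, hash width, key type and all names are simplifications chosen for illustration, not a production implementation.

BUCKET_SIZE = 2   # unrealistically small, as in the Figure 11.20 example
HASH_BITS = 32    # b: width of the hash values we draw prefixes from

class Bucket:
    def __init__(self, depth):
        self.depth = depth          # i_j: length of the common hash prefix
        self.items = {}             # search key -> record pointer

class ExtendableHash:
    def __init__(self):
        self.global_depth = 0                   # i: bits of the hash currently used
        self.directory = [Bucket(0)]            # bucket address table (2**i entries)

    def _prefix(self, key, bits):
        h = hash(key) & ((1 << HASH_BITS) - 1)  # b-bit hash value
        return h >> (HASH_BITS - bits) if bits else 0

    def lookup(self, key):
        bucket = self.directory[self._prefix(key, self.global_depth)]
        return bucket.items.get(key)

    def insert(self, key, pointer):
        bucket = self.directory[self._prefix(key, self.global_depth)]
        if key in bucket.items or len(bucket.items) < BUCKET_SIZE:
            bucket.items[key] = pointer
            return
        # Bucket is full: split it, doubling the directory first if i == i_j.
        if bucket.depth == self.global_depth:
            # Each entry is replaced by two adjacent entries with the same pointer.
            self.directory = [b for b in self.directory for _ in range(2)]
            self.global_depth += 1
        bucket.depth += 1
        sibling = Bucket(bucket.depth)
        # Re-point the upper half of the entries that referred to the old bucket.
        for idx, b in enumerate(self.directory):
            if b is bucket and (idx >> (self.global_depth - bucket.depth)) & 1:
                self.directory[idx] = sibling
        # Rehash the records of the split bucket, then retry the insert.
        old_items, bucket.items = bucket.items, {}
        for k, p in old_items.items():
            self.directory[self._prefix(k, self.global_depth)].items[k] = p
        self.insert(key, pointer)

table = ExtendableHash()
for name, ptr in [("Perryridge", 1), ("Round Hill", 2), ("Downtown", 3), ("Redwood", 4)]:
    table.insert(name, ptr)
print(table.lookup("Downtown"))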
Comparison of Indexing and Hashing
1. To make a wise choice between the methods seen, database designer must
consider the following issues:
o Is the cost of periodic re-organization of index or hash structure
acceptable?
o What is the relative frequency of insertion and deletion?
o Is it desirable to optimize average access time at the expense of increasing
worst-case access time?
o What types of queries are users likely to pose?
2. The last issue is critical to the choice between indexing and hashing. If most
queries are of the form

select A1, A2, ..., An
from r
where Ai = c

then to process this query the system will perform a lookup on an index or hash structure
for attribute Ai with value c.
3. For these sorts of queries a hashing scheme is preferable.
o Index lookup takes time proportional to the log of the number of values in R for Ai.
o Hash structure provides lookup average time that is a small constant
(independent of database size).
4. However, the worst case favors indexing:
o Hash worst case gives time proportional to the number of values in R for Ai.
o Index worst case is still the log of the number of values in R.
5. Index methods are preferable where a range of values is specified in the query,
e.g.

select A1, A2, ..., An
from r
where Ai >= c1 and Ai <= c2

This query finds records with Ai values in the range from c1 to c2.
o Using an index structure, we can find the bucket for value c1, and then
follow the pointer chain to read the next buckets in alphabetic (or numeric)
order until we find c2.
o If we have a hash structure instead of an index, we can find a bucket for c1
easily, but it is not easy to find the 'next bucket'.
o A good hash function assigns values randomly to buckets.
o Also, each bucket may be assigned many search key values, so we cannot
chain them together.
o To support range queries using a hash structure, we need a hash function
that preserves order.
o For example, if K1 and K2 are search key values and K1 < K2, then h(K1) < h(K2).
o Such a function would ensure that buckets are in key order.
o Order-preserving hash functions that also provide randomness and
uniformity are extremely difficult to find.
o Thus most systems use indexing in preference to hashing unless it is
known in advance that range queries will be infrequent.
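A quick way to see this trade-off outside of a database: a Python dict behaves like a hash structure (fast equality lookups, no useful ordering), while a sorted list searched with bisect behaves like an ordered index (equality and range lookups). This is only an analogy, with made-up branch names, not a database implementation.

import bisect

accounts = {"Perryridge": 1700, "Round Hill": 350, "Downtown": 500, "Redwood": 900}

# Hash-style structure: an equality lookup is a cheap, single-bucket probe...
print(accounts["Downtown"])            # -> 500

# ...but a range query degenerates into scanning every entry.
print(sorted(k for k in accounts if "D" <= k <= "R"))

# Index-style structure: keep keys sorted and binary-search both ends of the range.
keys = sorted(accounts)
lo, hi = bisect.bisect_left(keys, "D"), bisect.bisect_right(keys, "R")
print(keys[lo:hi])                     # keys in the range, found without a full scan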
Index Definition in SQL
1. Some SQL implementations include data definition commands to create and drop
indices. The IBM SAA-SQL commands are:
o An index is created by:

create index <index-name>
on r (<attribute-list>)

o The attribute list is the list of attributes in relation r that form the search
key for the index.
o To create an index on bname for the branch relation:

create index b-index
on branch (bname)

o If the search key is a candidate key, we add the word unique to the
definition:

create unique index b-index
on branch (bname)

o If bname is not a candidate key, an error message will appear.
o If the index creation succeeds, any attempt to insert a tuple violating this
requirement will fail.
o The unique keyword is redundant if primary keys have been defined with
integrity constraints already.
2. To remove an index, the command is:

drop index <index-name>
Query Optimization Techniques: Contrasting Various
Optimizer Implementations with Microsoft SQL Server
Microsoft Corporation
Created: February 1992
"Related Readings" revised: February 1994
Overview
As companies began to rely more heavily on computerized business data, it became increasingly clear that
the traditional file-based methods of storing and retrieving data were both inflexible and cumbersome to
maintain. Because application code for accessing the data contained hard-coded pointers to the underlying
data structures, a new report could take months to produce. Even minor changes were complicated and
expensive to implement. In many cases, there was simply no method available for producing useful
analysis of the data. These real business needs drove the relational database revolution.
The true power of a relational database resides in its ability to break the link between data access and the
underlying data itself. Using a high-level access language such as SQL (structured query language), users
can access all of their corporate data dynamically without any knowledge of how the underlying data is
actually stored. To maintain both system performance and throughput, the relational database system
must accept a diverse variety of user input queries and convert them to a format that efficiently accesses
the stored data. This is the task of the query optimizer.
This technical article presents the steps involved in the query transformation process, discusses the
various methods of query optimization currently being used, and describes the query optimization
techniques employed by the Microsoft® relational database management system, SQL Server.
Query Transformation
Whenever a data manipulation language (DML) such as SQL is used to submit a query to a relational
database management system (RDBMS), distinct process steps are invoked to transform the original
query. Each of these steps must occur before the query can be processed by the RDBMS and a result set
returned. This technical article deals solely with queries sent to RDBMS for the purpose of returning
results; however, these steps are also used to handle DML statements that modify data and data
definition language (DDL) statements that maintain objects within the RDBMS.
Although many texts on the subject of query processing disagree about how each process is differentiated,
they do agree that certain distinct process steps must occur.
The Parsing Process
The parsing process has two functions:
• It checks the incoming query for correct syntax.
• It breaks down the syntax into component parts that can be understood by the RDBMS.
These component parts are stored in an internal structure such as a graph or, more typically, a query
tree. (This technical article focuses on a query tree structure.) A query tree is an internal representation of
the component parts of the query that can be easily manipulated by the RDBMS. After this tree has been
produced, the parsing process is complete.
The Standardization Process
Unlike a strictly hierarchical system, one of the great strengths of an RDBMS is its ability to accept high-level
dynamic queries from users who have no knowledge of the underlying data structures. As a result,
as individual queries become more complex, the system must be able to accept and resolve a large
variety of combinational statements submitted for the purpose of retrieving the same data result set.
The purpose of the standardization process is to transform these queries into a useful format for
optimization. The standardization process applies a set of tree manipulation rules to the query tree
produced by the parsing process. Because these rules are independent of the underlying data values, they
are correct for all operations. During this process, the RDBMS rearranges the query tree into a more
standardized, canonical format. In many cases, it completely removes redundant syntax clauses. This
standardization of the query tree produces a structure that can be used by the RDBMS query optimizer.
The Query Optimizer
The goal of the query optimizer is to produce an efficient execution plan for processing the query
represented by a standardized, canonical query tree. Although an optimizer can theoretically find the
"optimal" execution plan for any query tree, an optimizer really produces an acceptably efficient execution
plan. This is because the possible number of table join combinations increases combinatorially as a query
becomes more complex. Without using pruning techniques or other heuristical methods to limit the
number of data combinations evaluated, the time it takes the query optimizer to arrive at the best query
execution plan for a complex query can easily be longer than the time required to use the least efficient
plan.
Various RDBMS implementations have used differing optimization techniques to obtain efficient execution
plans. This section discusses some of these techniques.
Heuristic Optimization
Heuristic optimization is a rules-based method of producing an efficient query execution plan. Because the
query output of the standardization process is represented as a canonical query tree, each node of the
tree maps directly to a relational algebraic expression. The function of a heuristic query optimizer is to
apply relational algebraic rules of equivalence to this expression tree and transform it into a more efficient
representation. Using relational algebraic equivalence rules ensures that no necessary information is lost
during the transformation of the tree.
These are the major steps involved in heuristic optimization:
1. Break conjunctive selects into cascading selects.
2. Move selects down the query tree to reduce the number of returned "tuples." ("Tuple" rhymes
with "couple." In a database table (relation), a set of related values, one for each attribute
(column). A tuple is stored as a row in a relational database management system. It is the analog of
a record in a nonrelational file. [Definition from Microsoft Press Computer Dictionary, 1991.])
3. Move projects down the query tree to eliminate the return of unnecessary attributes.
4. Combine any Cartesian product operation followed by a select operation into a single join
operation.
When these steps have been accomplished, the efficiency of a query can be further improved by
rearranging the remaining select and join operations so that they are accomplished with the least amount
of system overhead. Heuristic optimizers, however, do not attempt this further analysis of the query.
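The following sketch illustrates step 4 above on a tiny hand-rolled query tree: a select applied directly to a Cartesian product is rewritten as a single join node. The node classes and predicate strings are invented for the illustration and do not correspond to any real optimizer's data structures.

from dataclasses import dataclass

@dataclass
class Table:            # leaf node: a base relation
    name: str

@dataclass
class CartesianProduct: # r x s
    left: object
    right: object

@dataclass
class Select:           # select rows of child matching predicate
    predicate: str
    child: object

@dataclass
class Join:             # r JOIN s ON predicate
    predicate: str
    left: object
    right: object

def combine_select_product(node):
    """Heuristic rule: a select over a Cartesian product becomes one join node."""
    if isinstance(node, Select) and isinstance(node.child, CartesianProduct):
        return Join(node.predicate, node.child.left, node.child.right)
    return node

tree = Select("course.course_no = enrollment.course_no",
              CartesianProduct(Table("course"), Table("enrollment")))
print(combine_select_product(tree))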
Syntactical Optimization
Syntactical optimization relies on the user's understanding of both the underlying database schema and
the distribution of the data stored within the tables. All tables are joined in the original order specified by
the user query. The optimizer attempts to improve the efficiency of these joins by identifying indexes that
are useful for data retrieval. This type of optimization can be extremely efficient when accessing data in a
relatively static environment. Using syntactical optimization, indexes can be created and tuned to improve
the efficiency of a fixed set of queries. Problems occur with syntactical optimization whenever the
underlying data is fairly dynamic. Query access schemas can be degraded over time, and it is up to the
user to find a more efficient method of accessing the data. Another problem is that applications using
embedded SQL to query dynamically changing data often need to be recompiled to improve their data
access performance. Cost-based optimization was developed to resolve these problems.
Cost-Based Optimization
To perform cost-based optimization, an optimizer needs specific information about the stored data. This
information is extremely system-dependent and can include information such as file size, file structure
types, available primary and secondary indexes, and attribute selectivity (the percentage of tuples
expected to be retrieved for a given equality selection). Because the goal of any optimization process is to
retrieve the required information as efficiently as possible, a cost-based optimizer uses its knowledge of
the underlying data and storage structures to assign an estimated cost in terms of numbers of tuples
returned and, more importantly, physical disk I/O for each relational operation. By evaluating various
orderings of the relational operations required to produce the result set, a cost-based optimizer then
arrives at an execution plan based on a combination of operational orderings and data access methods
that have the lowest estimated cost in terms of system overhead.
As mentioned earlier, the realistic goal of a cost-based optimizer is not to produce the "optimal" execution
plan for retrieving the required data, but is to provide a reasonable execution plan. For complex queries,
the cost estimate is based on the evaluation of a subset of all possible orderings and on statistical
information that estimates the selectivity of each relational operation. These cost estimates can be only as
accurate as the available statistical data. Due to the overhead of keeping this information current for data
that can be altered dynamically, most relational database management systems maintain this information
in system tables or catalogs that must be updated manually. The database system administrator must
keep this information current so that a cost-based optimizer can accurately estimate the cost of various
operations.
Semantic Optimization
Although not yet an implemented optimization technique, semantic optimization is currently the focus of
considerable research. Semantic optimization operates on the premise that the optimizer has a basic
understanding of the actual database schema. When a query is submitted, the optimizer uses its
knowledge of system constraints to simplify or to ignore a particular query if it is guaranteed to return an
empty result set. This technique holds great promise for providing even more improvements to query
processing efficiency in future relational database systems.
The Microsoft SQL Server Query Optimizer
The Microsoft SQL Server database engine uses a cost-based query optimizer to automatically optimize
data manipulation queries that are submitted using SQL. (A data manipulation query is any query that
supports the WHERE or HAVING keywords in SQL; for example, SELECT, DELETE, and UPDATE.)
This optimization is accomplished in three phases:
• Query analysis
• Index selection
• Join selection
Query Analysis
In the query analysis phase, the SQL Server optimizer looks at each clause represented by the canonical
query tree and determines whether it can be optimized. SQL Server attempts to optimize clauses that limit
a scan; for example, search or join clauses. However, not all valid SQL syntax can be broken into
optimizable clauses, such as clauses containing the SQL relational operator <> (not equal). Because <> is
an exclusive rather than an inclusive operator, the selectivity of the clause cannot be determined before
scanning the entire underlying table. When a relational query contains non-optimizable clauses, the
execution plan accesses these portions of the query using table scans. If the query tree contains any
optimizable SQL syntax, the optimizer performs index selection for each of these clauses.
Index Selection
For each optimizable clause, the optimizer checks the database system tables to see if there is an
associated index useful for accessing the data. An index is considered useful only if a prefix of the columns
contained in the index exactly matches the columns in the clause of the query. This must be an exact
match, because an index is built based on the column order presented at creation time. For a clustered
index, the underlying data is also sorted based on this index column order. Attempting to use only a
secondary column of an index to access data would be similar to attempting to use a phone book to look
up all the entries with a particular first name: the ordering would be of little use because you would still
have to check every row to find all of the qualifying entries. If a useful index exists for a clause, the
optimizer then attempts to determine the clause's selectivity.
In the earlier discussion on cost-based optimization, it was stated that a cost-based optimizer produces
cost estimates for a clause based on statistical information. This statistical information is used to estimate
a clause's selectivity (the percentage of tuples in a table that are returned for the clause). Microsoft SQL
Server stores this statistical information in a specialized data distribution page associated with each index.
This statistical information is updated only at the following two times:
• During the initial creation of the index (if there is existing data in the table)
• Whenever the UPDATE STATISTICS command is executed for either the index or the associated
table
To provide SQL Server with accurate statistics that reflect the actual tuple distribution of a populated
table, the database system administrator must keep the statistical information for the table indexes
reasonably current. If no statistical information is available for the index, a heuristic based on the
relational operator of the clause is used to produce an estimate of selectivity.
Information about the selectivity of the clause and the type of available index is used to calculate a cost
estimate for the clause. SQL Server estimates the amount of physical disk I/O that occurs if the index is
used to retrieve the result set from the table. If this estimate is lower than the physical I/O cost of
scanning the entire table, an access plan that employs the index is created.
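As a rough illustration of the index-versus-scan decision described above (and not SQL Server's actual cost formulas), the sketch below compares an estimated page-I/O cost for an index lookup against a full table scan and picks the cheaper access plan. All numbers are invented.

def choose_access_path(table_pages, table_rows, selectivity, index_levels=2):
    """Pick 'index' or 'table scan' from a crude page-I/O estimate.

    selectivity: estimated fraction of rows the clause returns (from statistics,
    or a heuristic guess when no statistics exist).
    """
    scan_cost = table_pages                      # read every page once
    matching_rows = selectivity * table_rows
    index_cost = index_levels + matching_rows    # traverse index, then roughly one page per row
    return ("index", index_cost) if index_cost < scan_cost else ("table scan", scan_cost)

# A selective clause favours the index; a non-selective one favours the scan.
print(choose_access_path(table_pages=1_000, table_rows=50_000, selectivity=0.001))
print(choose_access_path(table_pages=1_000, table_rows=50_000, selectivity=0.30))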
Join Selection
When index selection is complete and all clauses have an associated processing cost based on their access
plan, the optimizer performs join selection. Join selection is used to find an efficient order for combining
the clause access plans. To accomplish this, the optimizer compares various orderings of the clauses and
then selects the join plan with the lowest estimated processing costs in terms of physical disk I/O.
Because the number of clause combinations can grow combinatorially as the complexity of a query
increases, the SQL Server query optimizer uses tree pruning techniques to minimize the overhead
associated with these comparisons. When this join selection phase is complete, the SQL Server query
optimizer provides a cost-based query execution plan that takes advantage of available indexes when they
are useful and accesses the underlying data in an order that minimizes system overhead and improves
performance.
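To see why join ordering matters and why pruning is needed, the sketch below simply enumerates every ordering of a few clause access plans and keeps the cheapest. The per-clause costs and the halving "filter" factor are invented for illustration; a real optimizer prunes the search space rather than evaluating all n! orderings.

from itertools import permutations

# Invented per-clause access costs (physical I/O estimates from index selection).
clause_costs = {"course": 40, "enrollment": 250, "student": 90}

def plan_cost(order):
    """Toy cost model: clauses placed later in the order see partly filtered input."""
    total, reduction = 0, 1.0
    for clause in order:
        total += clause_costs[clause] * reduction
        reduction *= 0.5        # each join halves the remaining work (pure illustration)
    return total

best = min(permutations(clause_costs), key=plan_cost)
print(best, round(plan_cost(best), 1))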
Summary
This technical article has shown you the steps required for a relational database management system to
process a high-level query. It has discussed the need for query optimization and has shown several
different methods of achieving query optimization. Finally, it has illustrated the various phases of
optimization employed by the cost-based optimizer of the Microsoft RDBMS, SQL Server. We hope this
document has helped you gain a better understanding of both the query optimization process and the
Microsoft cost-based query optimizer, one of the many features that clearly define SQL Server as the
premier database server for the PC environment.
Related Readings
Date, C. J. An Introduction to Database Systems. Volume I. Addison/Wesley, 1990, 455–473.
Elmasri, R., and S. B. Navathe. Fundamentals of Database Systems. Benjamin/Cummings, 1989, 501–
532.
Moffatt, Christopher. "Microsoft SQL Server Network Integration Architecture." MSDN Library, Technical
Articles.
"Microsoft Open Data Services: Application Sourcebook." MSDN Library, Technical Articles.
Shelly, D. B. "Understanding the Microsoft SQL Server Optimizer." Microsoft Networking Journal, Vol. 1,
No. 1, January 1991.
Yao, S. B. "Optimization of Query Evaluation Algorithms." ACM TODS, Vol. 4, No. 2, June 1979.
Additional Information
To receive more information about Microsoft SQL Server or to have other technical notes faxed to you, call
Microsoft Developer Services Fax Request at (206) 635-2222.
Distributed Query Optimization
By Craig S. Mullins
Technical Support, July 1996

Query optimization is a difficult task in a distributed client/server
environment, and data location becomes a major factor. Understanding the
issues involved enables programmers to develop efficient distributed
optimization choices.

Database queries have become increasingly complex in the age of
the distributed DBMS (DDBMS). This poses a difficulty for the
programmer but also for the DDBMS. Query optimization is a
difficult enough task in a non-distributed environment. Anyone
who has tried to study and understand a cost-based query optimizer for
a relational DBMS (such as DB2 or Sybase SQL Server) can readily attest to
this fact. When adding distributed data into the mix, query optimization
becomes even more complicated.
In order to optimize queries accurately,
sufficient information must be available
to determine which data access techniques
are most effective (for example, table
and column cardinality, organization
information, and index availability). In a
distributed, client/server environment, data
location becomes a major factor. This article
will discuss how adding location considerations
to the optimization process increases
complexity.
COMPONENTS OF DISTRIBUTED
QUERY OPTIMIZATION
There are three components of distributed
query optimization:
• Access Method — In most RDBMS
products, tables can be accessed in one
of two ways: by completely scanning
the entire table or by using an index.
The best access method to use will
always depend upon the circumstances.
For example, if 90 percent of the rows
in the table are going to be accessed,
you would not want to use an index.
Scanning all of the rows would actually
reduce I/O and overall cost. Whereas,
when scanning 10 percent of the total
rows, an index will usually provide more
efficient access. Of course, some products
provide additional access methods, such
as hashing. Table scans and indexed
access, however, can be found in all
of the "Big Six" RDBMS products
(i.e., DB2, Sybase, Oracle, Informix,
Ingres, and Microsoft).
• Join Criteria — If more than one table
is accessed, the manner in which they
are to be joined together must be determined.
Usually the DBMS will provide
several different methods of joining
tables. For example, DB2 provides three
different join methods: merge scan join,
nested loop join, and hybrid join. The
optimizer must consider factors such as
the order in which to join the tables and
the number of qualifying rows for each
join when calculating an optimal access
path. In a distributed environment, which
site to begin with in joining the tables
is also a consideration.
• Transmission Costs — If data from
multiple sites must be joined to satisfy
a single query, then the cost of transmitting
the results from intermediate steps
needs to be factored into the equation.
At times, it may be more cost effective
simply to ship entire tables across the
network to enable processing to occur
at a single site, thereby reducing overall
transmission costs. This component
of query optimization is an issue only
in a distributed environment.
SYSTEMATIC VS. PROGRAMMATIC OPTIMIZATION
There are two manners in which query optimization can occur: systematically
or programmatically. Systematic optimization occurs when
the RDBMS contains optimization algorithms that can be used internally
to optimize each query.
Although systematic optimization is desirable, the optimizer is not
always robust enough to be able to determine how best to join tables at
disparate sites. Indeed, quite often the RDBMS does not even permit a
distributed request joining multiple tables in a single SQL statement.
In the absence of systematic optimization, the programmer can optimize
each request by coding the actual algorithms for selecting and
joining between sites into each application program. This is referred to
as programmatic optimization. With systematic optimization the RDBMS
does all of the work.
Factors to consider when coding optimization logic into your
application programs include:
• the size of the tables;
• the location of the tables;
• the availability of indexes;
• the need for procedural logic to support complex requests
that can't be coded using SQL alone; and
• the availability of denormalized structures
(fragments, replicas, snapshots).
Also consider using common, reusable routines for each distinct request,
simplifying maintenance and modification.
AN OPTIMIZATION EXAMPLE
In order to understand distributed query optimization more fully,
let's take a look at an example of a query accessing tables in multiple
locations. Consider the ramifications of coding a program to simply
retrieve a list of all teachers who have taught physics to seniors.
Furthermore, assume that the COURSE table and the ENROLLMENT
table exist at Site 1; the STUDENT table exists at Site 2.
If either all of the tables existed at a single site, or the DBMS supported
distributed multi-site requests, the SQL shown in Figure 1
would satisfy the requirements. However, if the DBMS cannot perform
(or optimize) distributed multi-site requests, programmatic optimization
must be performed. There are at least six different ways to go
about optimizing this three-table join.

Figure 1: SQL to Satisfy Single Site or Multi-Site Requests

SELECT C.TEACHER
FROM COURSE C,
     ENROLLMENT E,
     STUDENT S
WHERE C.COURSE_NO = E.COURSE_NO
AND E.STUDENT_NO = S.STUDENT_NO
AND S.STUDENT_LEVEL = "SENIOR"
AND C.COURSE_TYPE = "PHYSICS"
Option 1: Start with Site 1 and join COURSE and ENROLLMENT,
selecting only physics courses. For each qualifying row, move it to Site
2 to be joined with STUDENT to see if any are seniors.
Option 2: Start with Site 1 and join COURSE and ENROLLMENT,
selecting only physics courses, and move the entire result set to Site 2
to be joined with STUDENT, checking for senior students only.
Option 3: Start with Site 2 and select only seniors from STUDENT.
For each of these examine the join of COURSE and ENROLLMENT
at Site 1 for physics classes.
Option 4: Start with Site 2 and select only seniors from STUDENT at
Site 2, and move the entire result set to Site 1 to be joined with
COURSE and ENROLLMENT, checking for physics classes only.
Option 5: Move the COURSE and ENROLLMENT tables to Site 2
and proceed with a local three-table join.
Option 6: Move the STUDENT table to Site 1 and proceed with a local
three-table join.
Which of these six options will perform the best? Unfortunately, the
only correct answer is "It depends." The optimal choice will depend upon:
• the size of the tables;
• the size of the result sets — that is, the number of qualifying rows
and their length in bytes; and
• the efficiency of the network.
Try different combinations at your site to optimize distributed
queries. But remember, network traffic is usually the cause of most
performance problems in a distributed environment. So devoting most
of your energy to options involving the least amount of network traffic is
a wise approach. In addition, bad design can also be the cause of many
distributed performance problems.
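Since network traffic is usually the dominant cost, a crude way to compare the options is to estimate the bytes each one ships between sites. The row counts and row widths below are invented purely for illustration, and only the four set-oriented options are modeled (options 1 and 3 would add per-row round trips on top of similar volumes); plug in your own figures.

# Invented example sizes: (qualifying rows, bytes per row) for each intermediate result.
physics_join  = (200, 120)     # COURSE joined to ENROLLMENT, physics courses only (Site 1)
seniors       = (5_000, 80)    # senior rows selected from STUDENT (Site 2)
student_table = (20_000, 80)   # the entire STUDENT table
course_enroll = (60_000, 150)  # the entire COURSE and ENROLLMENT tables combined

def bytes_shipped(rows, row_bytes):
    return rows * row_bytes

options = {
    "2: ship physics result set to Site 2": bytes_shipped(*physics_join),
    "4: ship seniors result set to Site 1": bytes_shipped(*seniors),
    "5: ship COURSE and ENROLLMENT to Site 2": bytes_shipped(*course_enroll),
    "6: ship STUDENT to Site 1": bytes_shipped(*student_table),
}

for option, cost in sorted(options.items(), key=lambda kv: kv[1]):
    print(f"Option {option}: ~{cost:,} bytes over the network")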
NOT QUITE SO SIMPLE
The previous example is necessarily simplistic in order to demonstrate
the inherent complexity of optimizing distributed queries. By adding
more sites and/or more tables to the mix, the difficulty of optimization
will increase because the number of options available increases.
Additionally, the specific query used is also quite simple. Instead of
a simple three table join, the query could be a combination of joins,
subqueries, and unions over more than three tables. The same number
of options is available for any combination of two tables in the query.
Indeed, there are probably more options than those covered in this
article. Consider a scenario similar to the one posed above in which we
have three tables being joined over two sites. Tables A and B exist at
Site 1 and Table C exists at Site 2. It is quite possible that it would be
more efficient to process A at Site 1 and ship the results to Site 2. At
site 2, the results would be joined to Table C. Those results would then
be shipped back to Site 1 to be joined to Table B. It is not probable that
this scenario would produce a better strategy than the six outlined
above, but in certain situations it is possible.
Furthermore, some types of processing require procedural logic
(such as looping and conditional if-then processing) to be interspersed
with multiple SQL queries to produce a result. In these cases, the procedural
logic should be factored into the optimization equation for
optimal results. However, the optimizers available in the major
RDBMS products do not do a good job of this for non-distributed
queries, so there is little hope of a distributed optimizer performing this type of
optimization any time soon.
Finally, there is a laundry list of other considerations that must be taken
into account, which I have skipped for the sake of brevity. For example:
• The security and authorization implications of who can access what information at which site need to be examined and implemented.
• In a multi-site environment, it is possible (indeed quite likely over time) that one of the sites will not be available for any number of reasons (software upgrade, power outage, hardware/software failure, etc.).
• Declarative referential integrity among multiple sites, in which the data relationships are specified in each table's DDL, is not available in any DDBMS to date. The specification of these relationships would greatly assist application development efforts, as well as distributed query optimization.
• Distributed structures can be implemented to augment performance. A multi-site, multi-table index structure could be created that would contain information on the physical location of tables, as well as the physical location of the data items within that table. This structure, however helpful from a performance perspective, would be difficult to maintain and administer due to its reliance on multiple sites.
• The optimization process will be highly dependent upon the implementation and usage of the network. The amount of network traffic can vary from day to day, and even hour to hour, thereby impacting the optimization choice. Whenever the network is modified in any way (tuned, new release, additional nodes added, etc.), the optimization choice should be re-addressed, as a new, more optimal path may now be available. This can quickly become a drain on the resources of the system (and the personnel administering the system).
SYNOPSIS
Introducing data distribution into the query
optimization process makes a complex issue
that much more complex. Until the distributed
DBMS products support the systematic optimization
of distributed multi-table SQL
requests, programmatic optimization will be a
fact of distributed life. Understanding the
issues involved will enable application programmers
to develop efficient distributed optimization
choices.
Craig S. Mullins is a senior technical advisor
and team leader of the Technical Communications
group at PLATINUM technology, inc. Craig's book,
DB2 Developers Guide, contains more than 1,200
pages of tips and guidelines for DB2 and can
be ordered directly from the publisher, SAMS
Publishing, at 1-800-428-5331. Craig can be
reached via the Internet ([email protected]),
CompuServe (70410,237), America Online
(CraMullins), or at PLATINUM technology, inc.
(800-442-6861, fax: 708-691-0709).
©1996 Technical Enterprises, Inc. Reprinted
with permission of Technical Support magazine.
For subscription information, email
[email protected] or call 414-768-8000,
Ext. 116.
Databases and Distributed Database Management Systems - DBMSs - had their origin in
large organisations' needs for centrally-controlled information management, and the
software and associated administrative procedures were developed for that environment.
Later, with the advent of small business machines and particularly PCs, single-user
DBMSs were widely adopted to provide reliable and simple information-processing
facilities for individuals or small working groups. There is now a tendency to link
machines together in networks, intended to give the advantages of local processing while
maintaining overall control over system integrity and security. Developments in database
technology obviously reflect this trend.
A DBMS like Oracle can for instance be run under Unix in CLIENT-SERVER mode. Using the TCP/IP
protocol, an Oracle application running on a workstation can communicate with Oracle on a server,
the tasks being shared between the two processors - the server handles updating and retrieval; the
client application handles screen management, user data entry, and report generation. When
implemented properly, this division of tasks should provide better performance.
Note that a DBMS which supports a fully relational interface is important for the success of this
approach, as it is for the fully distributed databases to be discussed later. Using a relational language,
interactions between the server and the client involve retrieving sets of records, which puts less
load on the network than single-record transactions. Database servers are sometimes referred to as
SQL ENGINES in that their only interaction with client machines is through basic SQL commands to
accept and produce data. Standard SQL provides a common language with which completely
different software products can communicate.
In a true distributed database, the data itself is located on more than one machine. There are
various possible approaches, depending on the needs of the application and the degree of emphasis
placed on central control versus local autonomy. In general, organisations may wish to:
• reduce data communications costs by putting data at the location where it is most often used,
• aggregate information from different sources,
• provide a more robust system (e.g. when one node goes down the others continue working),
• build in extra security by maintaining copies of the database at different sites.
Distributed systems are not always designed from scratch - they may evolve from
traditional systems as organisational needs become apparent. One possibility is that a
complete central database is maintained and updated in the normal way, but that local
copies (in whole or part) are sent periodically to remote sites, to be used for fast and
cheap retrieval. Any local updates have no effect on the central database. The implication
here is that consistency between all copies of the database at all times is not crucial - it
may for instance be enough to send new data to node sites overnight when networks are
less busy.
Alternatively, distributed database development may involve the linking together of previously
separate systems, perhaps running on different machine architectures with different software
packages. A possible scenario is that individual sites manage and update their own databases for
standard operational applications, but that information is collected and aggregated for higher-level
decision support functions. In this case there is no single location where the whole database is
stored; it is genuinely split over two or more sites. Once again, however, total consistency may not
be looked for - local databases are kept up to date and there is periodical transmission of data back
to the centre. To manage a system like this, a product such as Oracle's SQL*Net is required. This
enables data stored in, say, an Access DBMS to be interrogated or updated by an Oracle DBMS, or
vice versa, by the use of SQL drivers provided by the host RDBMS (e.g. Oracle ODBC drivers for
Access). Note that under these circumstances it is essential that the appropriate driver is capable
of generating standard SQL; SQL is the universal database language used for communicating
between different RDBMSs.
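To give a flavour of what such cross-database access looks like in practice, here is a minimal Oracle-style sketch; the link name, remote account, service alias and table names below are invented for illustration. Once SQL*Net connectivity is configured, a database link lets local SQL statements reach tables held by a remote DBMS.

-- Define a link to the remote database (credentials and alias are placeholders).
CREATE DATABASE LINK sales_link
  CONNECT TO remote_user IDENTIFIED BY remote_password
  USING 'remote_service_alias';

-- Query a remote table through the link as if it were local.
SELECT customer_id, customer_name
FROM   customers@sales_link
WHERE  region = 'NORTH';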
A third possibility is that the database is designed from the start to be distributed, and that all
nodes in the network may in principle query and update the database at any location. Codd has
specified a set of criteria to characterise a genuinely distributed system; these are not in fact
satisfied by any actual DDBMS commercially available today but, as with the 12 "commandments"
about relational systems, they provide a framework for explanation and evaluation.
8.1. Local autonomy.
8.2. No reliance on a central site.
These points concern the overall control mechanism within a DDB, and in particular the
location of data dictionary and system catalogue material. How much is it necessary for
each site to know about data held elsewhere? In a local area network it is feasible for any
node to broadcast a query to all the others, and await response from the one with the
relevant information. In wide area networks, communications costs become more
significant, and it is necessary to decide how and where to place information to determine
the routing of queries. This point will be explored further after the next three rules have
been explained.
8.3. Data fragmentation.
8.4. Transparency of location.
8.5. Replication.
The requirement is that the user of a DDB need not know how the data is partitioned,
where any part of it is stored, or how many copies exist, as the system will be intelligent
enough to present it as a seamless whole. No current general-purpose DDBMS can
achieve this, although it is always possible to write code for particular applications which
hides lower-level details from end-users. Decisions about fragmentation, location and
replication are very important for the design of a DDB, and are now discussed in more
detail.
A relational database is partitioned by first dividing it into a number of FRAGMENTS. In theory a
fragment may be a complete table, or any HORIZONTAL / VERTICAL subset of it which can be
described in terms of relational select and project - in other words groups of records or groups of
fields. Choice of fragments will be based on expectations about likely usage.
1. HORIZONTAL FRAGMENTATION might depend on geographical divisions
within an organisation so that, e.g. payroll or customer records are held in the
location where they are most likely to be created and accessed. It should partition
tables into discrete groups, based either directly on field values or indirectly on
joins with another horizontally fragmented table (derived fragmentation). It
should not result in missing records or overlaps!
2. VERTICAL fragmentation might depend on functional divisions within an
organisation, so that, e.g. the site normally dealing with taxation has the relevant
fields from all employee records. There must in this case be some overlap - at
least the primary key of vertically-fragmented tables will be repeated, and the
designer may define clusters of fields to eliminate the potential need for many
cross-site joins. (A SQL sketch of both kinds of fragment follows this list.)
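As a minimal sketch, assuming a notional global EMPLOYEE table whose columns and region values are invented here for illustration, the two kinds of fragment can be expressed with ordinary relational select and project operations:

-- Horizontal fragment: the rows for one region, to be held at that region's site.
CREATE VIEW employee_north AS
  SELECT *
  FROM   employee
  WHERE  region = 'NORTH';

-- Vertical fragment: the taxation-related columns, to be held at the taxation site.
-- The primary key (emp_no) is repeated so that fragments can later be rejoined.
CREATE VIEW employee_tax AS
  SELECT emp_no, salary, tax_code
  FROM   employee;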
The fragments are now allocated to individual sites, on the basis of where they are most
likely to be needed. Decisions are based on the COST and the BENEFIT of having a data
fragment at a particular site, where:
• The BENEFIT relates to the number of times it will be needed to answer queries;
• The COST relates to the number of times it will be changed as a result of a transaction at another site.
The site with the best COST/BENEFIT RATIO will be selected as the location for the
fragment.
The designer may choose to replicate the data, i.e. keep several copies of each fragment in different
locations. This provides extra security, and flexibility in that there is more than one way to answer
the same question. However it increases the potential update cost and in practice it has been found
that the benefits of holding more than two or three replicated copies will not generally outweigh the
cost. At this stage the question may arise as to whether total consistency between copies is always
necessary - such a requirement will place a particularly heavy load on the transaction management
software.
The final design stage involves the MAPPING of global database fragments to tables in local
databases. It is important to adopt a naming system which allows unambiguous reference to
sections of the global database, while retaining users' freedom to select their own local names.
Global names will generally incorporate site names (which must be unique), and in some systems
may have a more complex structure. In IBM's experimental DDBMS (R*) every database unit has a
name identifying:
1. CREATOR_NAME,
2. CREATOR_SITE,
3. LOCAL_NAME,
4. BIRTH_SITE,
where BIRTH_SITE is the name of the site where the data was originally created. This
name is guaranteed never to change, and will normally be mapped to local names by way
of SQL CREATE SYNONYM clauses. It provides a convenient mechanism for actually
finding database fragments, as will be described shortly.
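As an Oracle-style sketch (all of the names below are invented, and R*'s own syntax differed in detail), the mapping from a global name to a convenient local name might look like this:

-- Give the remote, globally named table a short local name.
CREATE SYNONYM employees FOR smith.employee_uk@london_site;

-- Applications can now use the local name; the synonym can be redefined
-- if the fragment later migrates, without changing application SQL.
SELECT emp_no, name FROM employees WHERE tax_code = 'A1';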
The next important design decision is about where to locate the SYSTEM CATALOGUE. Query
processing in a DDB may require access to the following information:
• GLOBAL SCHEMA
• FRAGMENTATION SCHEMA
• ALLOCATION SCHEMA
• LOCAL MAPPINGS
• AUTHORISATION RULES
• ACCESS METHODS
• DATABASE STATISTICS
Note that this information is not static - in principle changes may occur in any of the
above categories and in particular database fragments may over time migrate from one
site to another as patterns of access evolve. With a truly distribution-independent DBMS
any alterations should be invisible to existing applications.
In principle it is possible to adopt one of the following strategies for holding the system catalogue.
Each choice has advantages and disadvantages.
• Hold one copy only, in a central site. This is the simplest solution to manage,
since there is no redundancy and a single point of control. The disadvantage is
that the central site acts as a bottleneck - if the catalogue there becomes
unavailable for any reason the rest of the network is also out of action. It is the
solution adopted in practice by many current organisations but it violates Codd's
criteria for categorisation as a full DDBMS.
• Replicate copies of the complete catalogue over all sites. This allows any site to
carry out every stage of query processing, even down to generating and
optimising query plans. Total replication produces a high overhead, particularly if
changes to any part of the catalogue must be propagated throughout the network.
Some systems operate a CACHEING mechanism whereby sites hold and use
versions of the catalogue which are not guaranteed to be up-to-date, but may
allow some queries to be processed without access to the latest version.
Another compromise is to replicate only part of the catalogue; e.g. the RDBMS INGRES
arranges that all sites hold items 2, 3 and 4 from the list given above - i.e. CREATOR_SITE,
LOCAL_NAME and BIRTH_SITE. Any site knows where to
direct queries, but the task of generating query plans is delegated to the site where the
data is held. This may prove a barrier to global distributed query optimisation.
• Maintain only local catalogues. This solution does provide complete site
autonomy but may give rise to extensive network traffic, since all sites must be
interrogated for every query to see if they are holding relevant information. While
perhaps tolerable on a small system using a local area network, this solution
cannot be adopted in systems with high communication costs. However, using a
convention where local names are mapped onto global names via synonyms, it is
possible to ensure that any data element is accessible in at most two moves. For
example, the R* system mentioned above holds complete catalogues for all
database elements at both their birth-site and their current site, if these are
different. Query processing now involves the following actions:
o convert from local synonym to global name,
o identify the birth-site and interrogate it for the data,
o the birth-site will either return the data or, if it has migrated elsewhere,
will know its current location, and inform the query site accordingly,
o the query site can now interrogate the site where the data is currently
stored.
8.6. Continuous operation
It should not be necessary to halt normal use of the database while it is re-organised,
archived, etc. It may be easier to achieve this in a distributed rather than centralised
environment, since processing nodes may be able to substitute for one another where
necessary.
8.7. Distributed query processing
The DDBMS should be capable of gathering and presenting data from more than one site
to answer a single query. In theory a distributed system can handle queries more quickly
than a centralised one, by exploiting parallelism and reducing disc contention; in practice
the main delays (and costs) will be imposed by the communications network. Routing
algorithms must take many factors into account to determine the location and ordering of
operations. Communications costs for each link in the network are relevant, as also are
variable processing capabilities and loadings for different nodes, and (where data
fragments are replicated) trade-offs between cost and currency. If some nodes are
updated less frequently than others there may be a choice between querying the local out-of-date copy very cheaply and getting a more up-to-date answer by accessing a distant
location.
The ability to do query optimisation is essential in this context - the main objective being to
minimise the quantity of data to be moved around. As with single-site databases, one must consider
both generalised operations on internal query representations, and the exploitation of information
about the current state of the database. A few examples follow.
1. Operations on the query tree should have the effect of executing the less
expensive operation first, preferably at a local site to reduce the total quantity of
data to be moved. Distributed query processing often requires the use of UNION
operations to put together disjoint horizontal fragments of the same table - these
are obviously expensive, and should (like join operations) be postponed as late as
possible. A good strategy is to carry out all the REDUCER operations locally and
union together the final results. This applies not only to select and project, but
also to the aggregation functions. Note however that:
COUNT( UNION(frag1, frag2, frag3) )
must be implemented as
SUM( COUNT(frag1), COUNT(frag2), COUNT(frag3) ),
and
AVERAGE( UNION(frag1, frag2, frag3) )
must be implemented as
SUM( SUM(frag1), SUM(frag2), SUM(frag3) ) / SUM( COUNT(frag1), COUNT(frag2), COUNT(frag3) ).
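In SQL terms, a minimal sketch of the AVERAGE rule might run as follows; frag1 and frag2 stand for two horizontal fragments of the same table held at different sites, amount is an invented numeric field, and partial_results is an invented table at the query site holding the shipped (n, s) pairs.

-- Run locally at each site holding a fragment:
SELECT COUNT(*) AS n, SUM(amount) AS s FROM frag1;
SELECT COUNT(*) AS n, SUM(amount) AS s FROM frag2;

-- Combine at the query site, after shipping only the small (n, s) rows there:
SELECT SUM(n)          AS total_count,
       SUM(s) / SUM(n) AS overall_average
FROM   partial_results;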
2. Given that the original fragmentation predicates were based on expected access
requirements (e.g. records partitioned on a location field), frequently-used queries should
need to access only a subset of the fragments of the tables to which they refer. In the absence of
fragmentation transparency, such queries can of course be directed to particular sites. By
contrast, any DDBMS with full distribution independence should be able to detect
whether some branches of the query tree will by definition produce null results, and
eliminate them before execution. In the general case this will require a theorem-proving
capability in the query optimiser, i.e. the ability to detect contradictions between the
original fragmentation predicates and those specified by the current query (a small example follows).
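For example (the table, fragment and region values below are invented), suppose the catalogue records that the fragment customer_london was defined by the predicate region = 'LONDON'. A query whose own predicate contradicts that definition can be pruned from the query tree without ever contacting the London site:

-- Fragment definition recorded in the catalogue (illustrative):
--   customer_london = SELECT * FROM customer WHERE region = 'LONDON'

-- Query issued against the global CUSTOMER table:
SELECT customer_id, name
FROM   customer
WHERE  region = 'PARIS';

-- The predicate region = 'PARIS' contradicts region = 'LONDON', so the
-- customer_london branch of the query tree is guaranteed to return no rows
-- and can be eliminated before execution.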
3. A knowledge of database statistics is necessary when deciding how to move data
around the network. A cross-site join requires at least one table to be transferred to
another site, and for a complex query it may also be necessary to carry intermediate
results between sites. Relevant statistics about each table are:
• CARDINALITY (the number of records),
• the size of each record,
• the number of distinct values in join or select fields.
The last point is relevant in estimating the size of intermediate results. For example in a
table of 10,000 records, containing a field with 1000 distinct values uniformly
distributed, the number of records selected by a condition on that field will be 10 X the
number of values in the select range. It is especially useful to estimate the cardinality of
join results. In the worst case (where every record matches every other) this will be M x
N, M and N being the cardinalities of the original tables. In practice the upper bound is
generally much lower and can be estimated from the number of distinct values in the two
join fields. To take a common occurrence, if we have a primary -> foreign key join
between tables R and S, the upper bound of the result is the cardinality of table S.
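The statistics feeding these estimates can be gathered with ordinary SQL. A minimal sketch follows, reusing the Emp table and deptno column from the semi-join example below; the queries are illustrative, not a prescribed catalogue-maintenance mechanism.

-- Cardinality and join-field selectivity for one table:
SELECT COUNT(*)               AS cardinality,
       COUNT(DISTINCT deptno) AS distinct_join_values
FROM   Emp;

-- Expected rows matching a single deptno value, assuming a uniform distribution:
SELECT COUNT(*) / COUNT(DISTINCT deptno) AS est_rows_per_value
FROM   Emp;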
A common strategy for reducing the cost of cross-site joins is to introduce a preliminary SEMI-JOIN
operation. Suppose we have:
Emp(Empno,ename,manager,Deptno) at site New York
Dept(Deptno,dname,location, etc) at site London
Join condition: Dept.deptno = Emp.deptno.
We may perform the join as follows:
New York: project Emp on Emp.deptno and send the result to London.
London: join the result from New York with Dept on Dept.deptno, selecting only the matching Dept records. Send the matching records from London back to New York to participate in a full join.
This method produces economies if the join is intended to make a selection from Dept,
and not simply to link together corresponding records from both tables. However, unless
an index exists for Dept.deptno it still involves a full sort on Dept.
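A minimal SQL sketch of these steps follows; ny_deptnos and matching_depts are invented names for the intermediate result sets shipped between the sites, and the actual data movement would be handled by the program or the DDBMS.

-- Step 1, at New York: project the join column and ship the result to London.
SELECT DISTINCT deptno FROM Emp;              -- shipped as ny_deptnos

-- Step 2, at London: keep only the Dept rows that will find a match,
-- then ship those rows back to New York.
SELECT d.*
FROM   Dept d, ny_deptnos n
WHERE  d.deptno = n.deptno;                   -- shipped back as matching_depts

-- Step 3, at New York: the full join, now against a much smaller Dept subset.
SELECT e.empno, e.ename, m.dname, m.location
FROM   Emp e, matching_depts m
WHERE  e.deptno = m.deptno;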
An alternative method sends a long bit vector as a filter so that Dept need not be sorted before the
semi-join. The vector is formed by applying a hash function to every value of Emp.deptno, each
time setting the appropriate bit to 1. It is then sent to London, and the same hash function applied
to each value of Dept.deptno in turn - those which match a 1-bit are selected. The relevant records
are sent to New York and the full join carried out as before. Since hashing functions always produce
synonyms, some records will be selected unnecessarily, but experiments using IBM's R* system
showed that this method gives better overall performance than the normal semi-join.
Security threats
Virus Information
Recently Discovered Viruses
Top 10 Viruses Tracked by AVERT
Recently Updated Viruses
Virus Hoaxes
Tool Box
Virus Removal Tools
Download the latest virus removal tools from McAfee Security. These tools
automatically perform virus detection and removal tasks for specific viruses. If
your system is infected, the tools will remove the virus and repair any damage.
Virus Map
Get a real-time, bird’s-eye view of
where the latest viruses are infecting
computers worldwide.
Regional Virus Info
Find out which viruses are infecting
PCs in your neighborhood and around
the world.
Virus Calendar
Be prepared for the next scheduled virus payload
strike with the help of this comprehensive calendar.
Definitions
What is a Virus?
A virus is a man-made program or piece of code that causes an unexpected,
usually negative, event. Viruses are often disguised as games or images with
clever marketing titles such as "Me, nude."
What is a Worm?
Computer Worms are viruses that reside in the active memory of a computer and duplicate themselves. They
may send copies of themselves to other computers, such as through email or Internet Relay Chat (IRC).
What is a Trojan Horse?
A Trojan horse program is a malicious program that pretends to be a benign application; a Trojan horse program
purposefully does something the user does not expect. Trojans are not viruses since they do not replicate, but
Trojan horse programs can be just as destructive.
Many people use the term to refer only to non-replicating malicious programs, thus making a distinction between
Trojans and viruses.
Have more questions?
Look up more definitions in our Virus Glossary.
Current Threats
Virus Notice: W32/Sober.r@MM is a Low-Profiled worm.
Related Links
• Security News Network
• Online Guide for Parents
• Virus Removal Services
• Anti-Virus Tips
• eSecurity News Archives