Win XP Notes
Windows XP: Help protect your desktop environment using the enhanced security and reliability features included in the Microsoft Windows XP operating system. Get started now with this collection of best practices and guidance, which includes settings and policy configuration information, how-to articles, and more.

Guides
• Microsoft Shared Computer Toolkit for Windows XP Handbook
• Antivirus Defense-in-Depth Guide Overview (Updated August 25, 2004)
• Identity and Access Management Series
• Securing Wireless LANs with Certificate Services
• Security Risk Management
• Server and Domain Isolation Using IPsec and Group Policy
• The Administrator Accounts Security Guide
• The Patch Management Process
• The Services and Service Accounts Security Planning Guide
• Threats and Countermeasures
• Windows XP Security Guide v2 (Updated for Service Pack 2)

How-To Articles
• How To: Configure Memory Protection in Windows XP SP2
• How To: Configure Windows XP SP2 Network Protection Technologies in an Active Directory Environment
• How To: Perform Patch Management Using SMS
• How To: Perform Patch Management Using SUS
• How To: Use Microsoft Baseline Security Analyzer (MBSA)

Additional TechNet Windows XP Resources
• Windows XP Professional
• Windows XP Service Pack 2
• Desktop Deployment Center

Additional Security Resources
• Global Security Centers
• Developer Security Center
• Computer Security at Home

Beginners Guides: Little Known Features of Windows XP
There can be little doubt that Windows XP is Microsoft's best OS yet. While it carries some unnecessary bloat, its balance of performance, stability and outward user-friendliness is hard to match. Windows XP is based on Microsoft's line of server operating systems, and it is undoubtedly that heritage which provides it with a rather pleasing lack of crashes. Compare Windows XP to Windows 98, where the daily reboot had pretty much been accepted as a feature of the operating system, and you can see why it has been embraced so readily. The same server-OS origin also gives XP a deep layer of configurability: not necessarily tweaks as such, but tricks for getting a grip on what is happening behind the scenes, for those with an interest. Call it Zen and the art of Windows XP maintenance, if you will. In this PCstats Guide, we will explore some of the little-known features and abilities of Windows XP Home and Professional Editions, with an eye towards providing a better understanding of the capabilities of the operating system and the options available to the user.

Computer Management
Depending on the approach you took to learning Windows XP, the computer management screen is either one of the first things you learned about, or something you have never heard of.
Derived directly from one of the most useful features of Windows 2000, it offers an excellent way of managing many of the most important elements of the operating system from a single interface. You can open the computer management interface by right-clicking on 'My Computer' and selecting 'Manage.' The computer management window is divided into three sections. First is System Tools, which comprises several tools to help you manage and troubleshoot your computer.

The Event Viewer
The event viewer is simply an easy interface for viewing the various logs that Windows XP keeps by default. Any program or system errors are recorded here in the application or system logs, and they can be an invaluable source of information if you are having recurring problems. The security log is inactive by default, and is only used if you decide to enable auditing on your XP system (more on this later). As the application and system logs record all programs and procedures started by Windows, this is an excellent place to start if you want to know more about what is going on behind the scenes.

Shared Folders
The shared folders heading contains a simple but useful list of all the folders that have been enabled for sharing; in other words, folders that are available to a remote user who may connect to your system over a network or the Internet. Another list, the Sessions list, shows all remote users currently connected to your computer, while the Open Files list shows which files are currently being accessed by those remote users. This is an important set of screens if you are concerned about security, as the sessions and open files lists can easily tell you if you have an uninvited guest in your system. Note that the list of shares contains two or more shares denoted by a $ (ADMIN$ and C$). These are the administrative shares that are installed by default when you load XP. The dollar sign indicates that they are hidden shares that will not show up in an Explorer window, but they can be accessed directly (try going to 'start/run' and typing '\\(yourcomputername)\c$'). This means that every file on your C: drive is shared by default, which should illustrate the importance of using a password on ALL your user accounts. If you do not, every file on your system is essentially wide open. One thing to keep in mind as you pore through the list of shared folders is that you cannot create or remove shares from this interface; that has to be done from the properties of the individual folders. You do have the option of disconnecting remote users from the sessions list, however. That can be done by highlighting the user, going to the 'action' menu and selecting 'disconnect session.'
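The same shares and sessions can also be listed from a script. A minimal sketch, assuming the built-in net.exe command-line tool is available (listing sessions typically requires administrator rights); Python is used here only as a convenient wrapper:

import subprocess

def run(cmd):
    """Run a Windows command and return its text output."""
    return subprocess.run(cmd, capture_output=True, text=True).stdout

# Lists all shares, including hidden administrative shares such as C$ and ADMIN$.
print(run(["net", "share"]))

# Lists remote sessions currently connected to this machine -- roughly the same
# information as the Sessions list in computer management.
print(run(["net", "session"]))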
Local Users and Groups
This section provides an interface for managing users and passwords, as well as the groups (XP Professional only) that they belong to. Groups in Windows 2000/XP are simply an easy way of assigning or restricting rights and privileges for various aspects of Windows to multiple users. For example, if your computer is used by many people, you could ensure that other users do not have the ability to access most system configuration options simply by adding their user names to the default 'users' group. This group is restricted from installing most software and using system applications. Moving users over to the default group also removes them from all other groups, notably the administrators group, which has unlimited access to the system. Custom groups can also be created with whatever sets of privileges and restrictions you desire. The users window is simply a list of all user accounts currently created, including ones created by the system. There will likely be some accounts here you do not recognize, especially if you are using XP Professional. Windows XP Professional comes with Microsoft's web server software (IIS, Internet Information Server), which creates the 'IUSR_(yourcomputername)' account to allow anonymous remote users to access web pages you create. While you are here, do yourself a favour and ensure that the 'guest' account is disabled, as it is essentially useless and a potential security hole. Also ensure that your 'administrator' account and all user accounts are equipped with passwords. This last step matters because both XP Home and Professional make the users created during the installation process members of the 'administrators' group, but do NOT give them passwords. Yikes!

Storage Management
The storage section of computer management consists of three sections, but only one is of any real significance. Removable storage is simply a list of your removable storage devices, like tapes, floppies and CD-ROM drives, and any media currently present in them. Disk defragmenter is the hard drive defragmentation utility, also accessible from the 'system tools' submenu under 'programs/accessories.' For more information about disk defragmenter, see PCstats' Annual PC Checkup guide. The important entry is the 'disk management' window.

Disk Management
This important window allows you to configure all logical aspects of your hard disk(s) from a single location. From here you can partition and format new hard disks, assign new drive letters if necessary, and even mount new drives as directories on your other hard disks. Windows XP can also create striped volumes (for increased disk performance) from here. For a comprehensive how-to on RAID, see our guide; for the rest of the features of XP disk management, read on. From the main computer management screen, you can view a graphical representation of the way your system's hard disks are partitioned, what file systems they are using, and a quick diagnosis of each drive's relative health. Any and all disk, volume (what Windows sees as a logical 'drive': C:, D:, etc.) or partition options can be accessed from this screen by right-clicking on the disks at the bottom of the screen and hitting 'properties,' or by right-clicking on the individual volumes (C:, D:, etc.) at the top of the screen and selecting options from the menu. Partitions are the sections of the disk's free space that are organized for use by a file system, and if you hadn't already guessed, the first disk is numbered "0", the second disk "1," and so on. As a safety feature, Windows does not allow you to format (redo the file system on the disk, erasing all information) or delete the partition containing the Windows directory from this management screen, but you can carry out these operations on any other partition by right-clicking on it and selecting the 'format' or 'delete partition/logical drive' options.
If you have installed an additional hard disk and wish to partition and format it, the disk will be represented as grey 'unpartitioned space' in the graphical display at the bottom of the screen. It can be partitioned and formatted by right-clicking on the drive and selecting the partition option to start a wizard that will guide you through the operation. We have covered the utilities available from the 'properties' menu of individual drives, such as hard drive defragmentation, backups and sharing, in several recent articles, so for information on these topics try the preceding links.

Mounting drives as folders
One rather interesting option available with the disk manager is the ability to mount individual partitions as directories in another volume. For example, if you had a computer with a 20GB disk formatted into a single partition and volume (drive C:), you could purchase a second drive, partition and format it from disk manager, and then, instead of giving it its own drive letter, add it to your C: drive as a directory. Any files added to that directory would of course be stored on the new hard drive. This can come in extremely handy, as certain applications (databases come to mind) can grow extremely large but may not support storing data on a separate drive. As far as Windows is concerned, a drive mounted as a directory is just a directory, so no extra drive letters are involved. This can also cut down on storage confusion for the average user, and it's easy to do, though it can only be done with NTFS-formatted partitions. Also, the boot partition cannot be used this way, though other partitions can be added to the boot partition. Note that shuffling a partition around in this way has no effect on the data stored on it. You can move an NTFS partition from directory to directory, then give it back a drive letter if you choose, while maintaining complete access to the data inside. No reboot is necessary. One other note: if you have installed software on a partition you plan to mount as a directory, it is best to uninstall and reinstall it, since the move may stop the software from working correctly. Windows will warn you about this.

To mount a partition as a directory: open disk manager, right-click on the partition you wish to mount as a directory in the graphical partition window (lower pane), and select 'change drive letter and paths…' Remove the current option (if any), then click 'add.' Choose 'mount in the following empty NTFS folder,' browse to the desired volume and add a directory for your drive, then click 'OK.' That's it. If you wish to return things to the way they were, simply repeat the procedure, removing the directory location and choosing a drive letter instead. The data on the drive will be unharmed.

Dynamic disks and volumes (XP Professional only)
An option added to the Windows repertoire in Windows 2000, dynamic disks and volumes are a different way of handling hard drive storage, supplemental to the standard file system used on each disk to organize files for access by the operating system. When one or more drives are made dynamic, a database is created by Windows and stored in the last megabyte of space on each dynamic disk. This database, the dynamic disk database, contains information about all of the dynamic drives on the system. As these drives all share a copy of the database, they share the information about the makeup of each drive in the disk group (a collection of dynamic disks sharing a database).
This sharing of information provides any dynamic drive in the group with several options not possible on simple (non-dynamic) drives. To start with, the area of space on the physical disk used by a dynamic volume (a logical drive like C: contained on a dynamic disk) no longer needs to be contiguous, and can be resized within Windows. In other words, you can take a physical disk with a couple of partitions, convert it to a dynamic disk, delete one volume and then resize the remaining dynamic volume to use the entire available space, all without leaving the disk management window.

How to Disable a Service
To disable a service, first open the services window at 'computer management/services and applications/services.' Highlight the service you wish to change, right-click and select 'properties.' Hit the 'stop' button to stop the service, then set the 'startup type' dropdown box to 'disabled.' This stops the service and ensures that it will not reload when the computer is restarted. (A scripted equivalent is sketched at the end of this page.)

Local security policies (XP Professional only)
Accessed through the 'administrative tools' menu found in the control panel, the local security policies window controls the various XP security options, such as auditing (keeping a log of which users log into the computer and what resources they access), password complexity requirements, which users are allowed to log into the computer remotely, and so on. All important stuff, but generally more useful at the enterprise level than for individual home users. If your PC is used by several people, or if you have had problems in the past with someone breaking into your PC, you might want to consider some of these settings. Going through the groups, 'account policies' governs password settings like the minimum length and complexity requirements of user passwords, and whether there is a limit to the number of times a password can be tried before the account is disabled for a period of time. The 'local policies' section contains auditing options, which when set will add reports to the 'security' log in the event log section of the computer management window, so you can see who is accessing the resources you have audited. All auditing is disabled by default; generally speaking, if you wish to enable it, limit it to one or two options, like auditing account logon events, rather than the whole bunch, or you will be overwhelmed with pointless log entries. Also in local policies are the 'user rights assignment' and 'security' sections, both of which contain a huge number of user-based options for securing various parts of the operating system. An option you may wish to consider here is to remove permission for any account to 'access this computer from the network' (in the user rights assignment section), assuming that you do not wish to access the computer remotely or host a WWW or FTP site. The 'public key policies' section is most often used for enabling EFS, the Encrypting File System, for encrypting personal documents and information. PCstats covers this topic extensively in the Encryption and Online Privacy Guide. 'Software restriction policies' and 'IP security policies on local computer' govern setting rules for restricting the software that can be used on the computer and securing network traffic through the use of encryption, respectively. Both are best left to centrally managed business environments.
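As promised under 'How to Disable a Service' above, the same change can be scripted. A minimal sketch, assuming the built-in sc.exe service control tool; the service name is a placeholder and Python is used only as a convenient wrapper:

import subprocess

SERVICE = "ExampleService"  # placeholder; substitute the real service name

# Stop the running service (the equivalent of pressing 'stop' in the services window).
subprocess.run(["sc", "stop", SERVICE], check=False)

# Set the startup type to 'disabled' so the service will not reload on restart.
# sc.exe requires the space after "start=", hence the two separate arguments.
subprocess.run(["sc", "config", SERVICE, "start=", "disabled"], check=False)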
Accessibility options
Windows XP comes equipped with a large variety of what Microsoft calls 'accessibility options': tools to make Windows easier to use for people with visual difficulties or other problems and disabilities. These can be accessed most easily from the accessibility wizard, found at 'start/programs/accessories/accessibility/accessibility wizard.' Through this program you can manually change the default Windows text size, scroll bar size and icon size, choose a high-contrast colour scheme and mouse cursor, activate captions for supporting programs and visual indicators to replace sound effects for the hard of hearing, as well as activate a range of other options by indicating to the wizard where your difficulties using the system lie. Besides the above options, the various accessibility features you can enable are:
StickyKeys: Allows any key combination that includes CTRL, ALT or SHIFT to be entered one key at a time instead of simultaneously.
BounceKeys: Windows will ignore held-down or rapidly repeated keystrokes on the same key.
ToggleKeys: Windows will play a sound when any of the 'lock' keys are pressed, such as Caps Lock or Num Lock. Very useful, this.
MouseKeys: The numeric keypad can be used to control the mouse pointer.
Magnifier: Opens a window at the top of the screen that displays a magnified view of the area around the cursor.
Narrator: Narrates the contents of system windows, including the status of things like checkboxes and options, for the visually impaired. Rather difficult to use, and reminiscent of HAL 9000 in voice.
On-Screen Keyboard: Provides a keyboard option for users who cannot operate a physical keyboard.
A utility manager is provided to manage settings for the last three programs, controlling, for example, whether they start automatically when Windows is loaded.

Built-in backup utility
Windows XP contains a built-in backup program that allows you to make data backups to a tape or hard drive. It can be accessed at 'start/programs/accessories/system tools/backup.' Users of XP Home must add the backup utility from the CD using add/remove programs in the control panel. PCstats has already covered this feature in our XP backup guide.

Files and settings transfer wizard
The files and settings transfer wizard is a new tool added to XP to allow users to transfer their documents, email and desktop settings from other computers or other Windows installations automatically. It works with any Windows operating system from Windows 95 on, and requires some form of network connection if you wish to transfer the data between computers. It works by transferring what it considers user-specific data, such as the contents of the desktop and the 'my documents' folder, to the new computer, along with Windows settings for desktop themes, accessibility options, and so on. The idea is to make your new computer's working environment identical to that of your old one. It will also transfer the settings of certain popular third-party programs like Photoshop, provided the same application is installed on the new computer. Note that the wizard will not adjust hardware settings, so features like the desktop resolution and refresh rate will have to be changed manually. To use the files and settings transfer wizard, you will need to first run it from the old computer, either by creating a floppy disk (which can be done from within the wizard), or by inserting the Windows XP CD into the old computer and running the wizard from there.
To transfer files and settings from your old computer: if the computer has a CD-ROM drive, the best way to start the process is to insert your XP CD, select 'perform additional tasks' from the autorun menu, then 'transfer files and settings' to launch the wizard. If you do not have a CD-ROM drive on the old computer, you will need to create a wizard disk by running the files and settings transfer wizard, selecting 'new computer' and following the options to make a disk. Once you have launched the wizard on your old computer, choose the method you will use to transfer the information. Network and direct cable connections can be used, as can floppy or ZIP disks, and you can also store the information on a drive, either in the current (old) computer or on a network drive shared out from the new one. Now choose whether you wish the program to transfer both your files and settings, either one, or your own custom set of files. A list of the files and settings to be transferred is provided. Please note that although the wizard can transfer the settings of various Microsoft and third-party software applications, you will need to have actually installed the relevant software on your new computer before you do the transfer, as only your settings are moved over, not the programs themselves. Windows will create one or more compressed .dat files in the location you chose, depending on the amount of data to be moved across. This will take a considerable amount of time if you have large files in your 'my documents' folder or on the desktop. Once this process is finished, move to the new computer and start the files and settings transfer wizard from the accessories/system tools menu. Select 'new computer,' and indicate that you have already collected files and settings from your old computer. It will begin the transfer of settings, which may also take a considerable amount of time. If you are running XP and have not yet applied Service Pack 1, please do so before you attempt to use the files and settings transfer wizard, as it contains several relevant bug fixes.

System information
The system information window, reached from 'start/programs/accessories/system tools/system information,' contains more information about your computer and its current installation of Windows than you could ever possibly want to know. If you need some specific piece of information about the hardware or software installed in your computer for tech support, chances are it can be found here.

Hash Indices
1. A hash index organizes the search keys, with their associated pointers, into a hash file structure.
2. We apply a hash function to a search key to identify a bucket, and store the key and its associated pointers in the bucket (or in overflow buckets).
3. Strictly speaking, hash indices are only secondary index structures, since if a file itself is organized using hashing, there is no need for a separate hash index structure on it.
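As a concrete illustration of points 1 and 2, here is a minimal sketch (not from the notes) of a static hash index: a fixed number of buckets, each holding (search key, record pointer) pairs, with overflow handled by simply letting a bucket's list grow:

# A toy static hash index. The bucket count is fixed up front, which is exactly
# the limitation that motivates dynamic hashing below.
NUM_BUCKETS = 8

def bucket_of(key):
    return hash(key) % NUM_BUCKETS

index = [[] for _ in range(NUM_BUCKETS)]

def insert(key, record_pointer):
    index[bucket_of(key)].append((key, record_pointer))

def lookup(key):
    """Return all record pointers stored under this search key."""
    return [ptr for k, ptr in index[bucket_of(key)] if k == key]

insert("Perryridge", 101)
insert("Round Hill", 102)
insert("Perryridge", 103)
print(lookup("Perryridge"))   # -> [101, 103]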
Dynamic Hashing
1. As the database grows over time, we have three options:
o Choose a hash function based on the current file size: performance degrades as the file grows.
o Choose a hash function based on the anticipated file size: space is wasted initially.
o Periodically re-organize the hash structure as the file grows: this requires selecting a new hash function, recomputing all addresses and generating new bucket assignments. It is costly, and it shuts down the database.
2. Some hashing techniques allow the hash function to be modified dynamically to accommodate the growth or shrinking of the database. These are called dynamic hash functions.
o Extendable hashing is one form of dynamic hashing.
o Extendable hashing splits and coalesces buckets as the database size changes.
o This imposes some performance overhead, but space efficiency is maintained.
o As reorganization works on one bucket at a time, the overhead is acceptably low.
3. How does it work? (Figure 11.19 in the textbook shows the general extendable hash structure.)
o We choose a hash function h that is uniform and random and that generates values over a relatively large range. The range is the set of b-bit binary integers (typically b = 32); 2^32 is over 4 billion, so we do not generate that many buckets. Instead we create buckets on demand, and do not use all b bits of the hash initially.
o At any point we use i bits, where 0 <= i <= b. These i bits are used as an offset into a table of bucket addresses, and the value of i grows and shrinks with the database. The i appearing over the bucket address table tells how many bits are currently required to determine the correct bucket.
o Several table entries may point to the same bucket. All such entries will have a common hash prefix, but the length of this prefix may be less than i. We therefore give each bucket j an integer i_j giving the length of its common hash prefix; the number of bucket address table entries pointing to bucket j is then 2^(i - i_j).
4. To find the bucket containing search key value K:
o Compute h(K).
o Take the first i high-order bits of h(K).
o Look at the corresponding table entry for this i-bit string and follow the bucket pointer in that entry.
5. Insertion in an extendable hashing scheme:
o Follow the same procedure as for lookup, ending up in some bucket j.
o If there is room in the bucket, insert the key and pointer there and insert the record in the file.
o If the bucket is full, we must split it and redistribute its records; if the bucket is split we may need to increase the number of bits we use from the hash. Two cases exist:
1. If i = i_j, then only one entry in the bucket address table points to bucket j. We need to increase the size of the bucket address table so that we can include pointers to the two buckets that result from splitting bucket j. We increment i by one, thus considering one more bit of the hash and doubling the size of the bucket address table; each old entry is replaced by two entries, each containing the original pointer, so two entries now point to bucket j. We allocate a new bucket z and set the second of those pointers to point to z, setting both i_j and i_z to i. We then rehash all records in bucket j, placing each in either j or z, and insert the new record. It is remotely possible, but unlikely, that the new hash will still put all of the records in one bucket; if so, we split again and increment i again.
2. If i > i_j, then more than one entry in the bucket address table points to bucket j, and we can split bucket j without increasing the size of the bucket address table (several table entries already distinguish the two halves). Note that all entries pointing to bucket j correspond to hash prefixes that agree on the leftmost i_j bits. We allocate a new bucket z and set both i_j and i_z to the original i_j plus 1. We then adjust the entries in the bucket address table that previously pointed to bucket j: the first half are left pointing to bucket j, and the rest are made to point to bucket z. Each record in bucket j is rehashed as before, and the insert is reattempted.
In both cases we only need to rehash the records in bucket j.
6. Deletion of records is similar: buckets may have to be coalesced, and the bucket address table may have to be halved.
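The following is a compact sketch of extendable hashing, not taken from the notes: the global depth i indexes the bucket address table with the high-order bits of the hash, each bucket keeps its own local depth i_j, and the bucket capacity is held at two records (as in the worked example) so that splits are easy to trigger:

HASH_BITS = 32          # "b" in the notes; real systems typically use 32
BUCKET_CAPACITY = 2     # unrealistically small, as in the textbook example

def h(key):
    """Hash a search key to a b-bit integer."""
    return hash(key) & ((1 << HASH_BITS) - 1)

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth      # i_j
        self.records = []                   # (key, record_pointer) pairs

class ExtendableHash:
    def __init__(self):
        self.global_depth = 0               # i
        self.table = [Bucket(0)]            # bucket address table of size 2**i

    def _index(self, key):
        # first i high-order bits of the hash value
        return h(key) >> (HASH_BITS - self.global_depth)

    def lookup(self, key):
        return [p for k, p in self.table[self._index(key)].records if k == key]

    def insert(self, key, pointer):
        bucket = self.table[self._index(key)]
        if len(bucket.records) < BUCKET_CAPACITY:
            bucket.records.append((key, pointer))
            return
        # Bucket full. Case i == i_j: double the bucket address table first,
        # duplicating each entry so the existing pointers are preserved.
        if bucket.local_depth == self.global_depth:
            self.table = [b for b in self.table for _ in (0, 1)]
            self.global_depth += 1
        # Split bucket j: allocate z, bump both local depths, and redirect the
        # half of the table entries whose next prefix bit is 1 to the new bucket.
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth)
        for idx in range(len(self.table)):
            if self.table[idx] is bucket:
                if (idx >> (self.global_depth - bucket.local_depth)) & 1:
                    self.table[idx] = new_bucket
        # Rehash only the records that were in bucket j, then retry the insert.
        old_records, bucket.records = bucket.records, []
        for k, p in old_records:
            self.table[self._index(k)].records.append((k, p))
        self.insert(key, pointer)

eh = ExtendableHash()
for ptr, bname in enumerate(["Perryridge", "Round Hill", "Downtown", "Redwood", "Brighton"]):
    eh.insert(bname, ptr)
print(eh.global_depth, eh.lookup("Downtown"))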
7. Insertion is illustrated for the example deposit file of Figure 11.20.
o 32-bit hash values on bname are shown in Figure 11.21, and an initial empty hash structure is shown in Figure 11.22.
o We insert the records one by one, and we (unrealistically) assume that a bucket can hold only 2 records, in order to illustrate both situations described above.
o As we insert the Perryridge and Round Hill records, this first bucket becomes full. When we insert the next record (Downtown), we must split the bucket. Since i = i_j (both are still 0), we need to increase the number of bits we use from the hash. We now use 1 bit, allowing us 2^1 = 2 buckets, which doubles the size of the bucket address table to two entries. We split the bucket, placing the records whose search key hash begins with 1 in the new bucket, and those beginning with 0 in the old bucket (Figure 11.23).
o Next we attempt to insert the Redwood record, and find it hashes to a prefix of 1. That bucket is full, and i = i_j, so we must split that bucket, increasing the number of bits we use to 2. This necessitates doubling the bucket address table again, to four entries (Figure 11.24). We rehash the entries in the old bucket.
o We continue on with the deposit records of Figure 11.20, obtaining the extendable hash structure of Figure 11.25.
8. Advantages:
o Extendable hashing provides performance that does not degrade as the file grows.
o Minimal space overhead: no buckets need be reserved for future use, and the bucket address table contains only one pointer for each hash value of the current prefix length.
9. Disadvantages:
o An extra level of indirection through the bucket address table.
o Added complexity.
10. Summary: a highly attractive technique, provided we accept the added complexity.

Comparison of Indexing and Hashing
1. To make a wise choice between the methods seen, the database designer must consider the following issues:
o Is the cost of periodic re-organization of the index or hash structure acceptable?
o What is the relative frequency of insertion and deletion?
o Is it desirable to optimize average access time at the expense of increasing worst-case access time?
o What types of queries are users likely to pose?
2. The last issue is critical to the choice between indexing and hashing. If most queries are of the form

select A1, A2, ..., An
from r
where Ai = c

then to process this query the system will perform a lookup on an index or hash structure for attribute Ai with value c.
3. For these sorts of queries a hashing scheme is preferable: an index lookup takes time proportional to the log of the number of values in r for Ai, while a hash structure provides an average lookup time that is a small constant, independent of database size.
4. However, the worst case favors indexing: the hash worst case gives time proportional to the number of values in r for Ai, while the index worst case is still the log of the number of values in r for Ai.
5. Index methods are preferable where a range of values is specified in the query, e.g.

select A1, A2, ..., An
from r
where Ai >= c1 and Ai <= c2

This query finds records with Ai values in the range from c1 to c2.
o Using an index structure, we can find the bucket for value c1 and then follow the pointer chain to read the next buckets in alphabetic (or numeric) order until we find c2.
o If we have a hash structure instead of an index, we can find the bucket for c1 easily, but it is not easy to find the "next bucket": a good hash function assigns values randomly to buckets. Also, each bucket may be assigned many search key values, so we cannot chain them together.
o To support range queries using a hash structure, we would need a hash function that preserves order: if K1 and K2 are search key values and K1 < K2, then h(K1) < h(K2). Such a function would ensure that buckets are in key order. Order-preserving hash functions that also provide randomness and uniformity are extremely difficult to find, so most systems use indexing in preference to hashing unless it is known in advance that range queries will be infrequent.
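The equality-versus-range trade-off above can be seen in miniature with ordinary in-memory structures; the toy data below is invented. A dict plays the role of the hash structure (constant-time equality lookup, no useful key order), while a sorted key list plays the role of an ordered index (binary search plus a range scan):

import bisect

balances = {"Brighton": 750, "Downtown": 500, "Mianus": 700,
            "Perryridge": 400, "Redwood": 700}

# Hash-style access: ideal for "where bname = c".
print(balances["Perryridge"])                  # one bucket probe on average

# Ordered-index-style access: needed for "where bname >= c1 and bname <= c2".
ordered_keys = sorted(balances)                # stands in for the index's key order
lo = bisect.bisect_left(ordered_keys, "D")
hi = bisect.bisect_right(ordered_keys, "N")
print(ordered_keys[lo:hi])                     # ['Downtown', 'Mianus'] -- a range scan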
Index Definition in SQL
1. Some SQL implementations include data definition commands to create and drop indices. The IBM SAA-SQL commands are as follows.
o An index is created by

create index <index-name>
on r (<attribute-list>)

The attribute list is the list of attributes in relation r that form the search key for the index.
o To create an index on bname for the branch relation:

create index b-index
on branch (bname)

o If the search key is a candidate key, we add the word unique to the definition:

create unique index b-index
on branch (bname)

If bname is not a candidate key, an error message will appear. If the index creation succeeds, any attempt to insert a tuple violating this requirement will fail. The unique keyword is redundant if primary keys have already been defined with integrity constraints.
2. To remove an index, the command is

drop index <index-name>

Query Optimization Techniques: Contrasting Various Optimizer Implementations with Microsoft SQL Server
Microsoft Corporation. Created: February 1992; "Related Readings" revised: February 1994.

Overview
As companies began to rely more heavily on computerized business data, it became increasingly clear that the traditional file-based methods of storing and retrieving data were both inflexible and cumbersome to maintain. Because application code for accessing the data contained hard-coded pointers to the underlying data structures, a new report could take months to produce. Even minor changes were complicated and expensive to implement. In many cases, there was simply no method available for producing useful analysis of the data. These real business needs drove the relational database revolution. The true power of a relational database resides in its ability to break the link between data access and the underlying data itself. Using a high-level access language such as SQL (structured query language), users can access all of their corporate data dynamically without any knowledge of how the underlying data is actually stored. To maintain both system performance and throughput, the relational database system must accept a diverse variety of user input queries and convert them to a format that efficiently accesses the stored data. This is the task of the query optimizer. This technical article presents the steps involved in the query transformation process, discusses the various methods of query optimization currently being used, and describes the query optimization techniques employed by the Microsoft® relational database management system, SQL Server.

Query Transformation
Whenever a data manipulation language (DML) such as SQL is used to submit a query to a relational database management system (RDBMS), distinct process steps are invoked to transform the original query. Each of these steps must occur before the query can be processed by the RDBMS and a result set returned.
This technical article deals solely with queries sent to an RDBMS for the purpose of returning results; however, these steps are also used to handle DML statements that modify data and data definition language (DDL) statements that maintain objects within the RDBMS. Although many texts on the subject of query processing disagree about how each process is differentiated, they do agree that certain distinct process steps must occur.

The Parsing Process
The parsing process has two functions: it checks the incoming query for correct syntax, and it breaks the syntax down into component parts that can be understood by the RDBMS. These component parts are stored in an internal structure such as a graph or, more typically, a query tree. (This technical article focuses on a query tree structure.) A query tree is an internal representation of the component parts of the query that can be easily manipulated by the RDBMS. After this tree has been produced, the parsing process is complete.

The Standardization Process
Unlike a strictly hierarchical system, one of the great strengths of an RDBMS is its ability to accept high-level dynamic queries from users who have no knowledge of the underlying data structures. As a result, as individual queries become more complex, the system must be able to accept and resolve a large variety of combinations of statements submitted for the purpose of retrieving the same data result set. The purpose of the standardization process is to transform these queries into a useful format for optimization. The standardization process applies a set of tree manipulation rules to the query tree produced by the parsing process. Because these rules are independent of the underlying data values, they are correct for all operations. During this process, the RDBMS rearranges the query tree into a more standardized, canonical format, and in many cases completely removes redundant syntax clauses. This standardization of the query tree produces a structure that can be used by the RDBMS query optimizer.

The Query Optimizer
The goal of the query optimizer is to produce an efficient execution plan for processing the query represented by a standardized, canonical query tree. Although an optimizer can theoretically find the "optimal" execution plan for any query tree, in practice an optimizer produces an acceptably efficient execution plan. This is because the possible number of table join combinations increases combinatorially as a query becomes more complex. Without using pruning techniques or other heuristic methods to limit the number of data combinations evaluated, the time it takes the query optimizer to arrive at the best query execution plan for a complex query can easily be longer than the time required to use the least efficient plan. Various RDBMS implementations have used differing optimization techniques to obtain efficient execution plans. This section discusses some of these techniques.
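To make the combinatorial growth concrete: even counting only left-deep join orders, n tables can be joined in n! different orders, which is why real optimizers prune rather than cost every plan. A small illustration (not from the article):

from math import factorial

# Left-deep join orderings for n tables; bushy plan shapes make the count larger still.
for n in (3, 5, 8, 12):
    print(n, "tables ->", factorial(n), "join orders")
# 12 tables already allow 479,001,600 orderings, so exhaustive costing quickly
# becomes more expensive than simply running a mediocre plan.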
Heuristic Optimization
Heuristic optimization is a rules-based method of producing an efficient query execution plan. Because the query output of the standardization process is represented as a canonical query tree, each node of the tree maps directly to a relational algebraic expression. The function of a heuristic query optimizer is to apply relational algebraic rules of equivalence to this expression tree and transform it into a more efficient representation. Using relational algebraic equivalence rules ensures that no necessary information is lost during the transformation of the tree. These are the major steps involved in heuristic optimization:
1. Break conjunctive selects into cascading selects.
2. Move selects down the query tree to reduce the number of returned "tuples." ("Tuple" rhymes with "couple." In a database table (relation), a tuple is a set of related values, one for each attribute (column). A tuple is stored as a row in a relational database management system; it is the analog of a record in a nonrelational file. [Definition from Microsoft Press Computer Dictionary, 1991.])
3. Move projects down the query tree to eliminate the return of unnecessary attributes.
4. Combine any Cartesian product operation followed by a select operation into a single join operation.
When these steps have been accomplished, the efficiency of a query can be further improved by rearranging the remaining select and join operations so that they are accomplished with the least amount of system overhead. Heuristic optimizers, however, do not attempt this further analysis of the query.
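Step 2 above (pushing selects down the tree) can be sketched on a toy query tree. The node classes and relation names below are invented for the illustration and are not SQL Server internals; the rewrite moves a selection below a join whenever its predicate touches only one input's attributes:

class Relation:
    def __init__(self, name, attrs):
        self.name, self.attrs = name, set(attrs)
    def __repr__(self):
        return self.name

class Join:
    def __init__(self, left, right):
        self.left, self.right = left, right
        self.attrs = left.attrs | right.attrs
    def __repr__(self):
        return f"({self.left} JOIN {self.right})"

class Select:
    def __init__(self, pred_attrs, pred_text, child):
        self.pred_attrs, self.pred_text, self.child = set(pred_attrs), pred_text, child
        self.attrs = child.attrs
    def __repr__(self):
        return f"SELECT[{self.pred_text}]({self.child})"

def push_selects_down(node):
    """Rewrite rule: move a Select below a Join when its predicate uses only one side."""
    if isinstance(node, Select):
        child = push_selects_down(node.child)
        if isinstance(child, Join) and node.pred_attrs <= child.left.attrs:
            return Join(push_selects_down(Select(node.pred_attrs, node.pred_text, child.left)),
                        child.right)
        if isinstance(child, Join) and node.pred_attrs <= child.right.attrs:
            return Join(child.left,
                        push_selects_down(Select(node.pred_attrs, node.pred_text, child.right)))
        return Select(node.pred_attrs, node.pred_text, child)
    if isinstance(node, Join):
        return Join(push_selects_down(node.left), push_selects_down(node.right))
    return node

branch = Relation("BRANCH", ["bname", "bcity"])
deposit = Relation("DEPOSIT", ["bname", "balance"])
tree = Select({"bcity"}, "bcity='Brooklyn'", Join(branch, deposit))
print(tree)                     # SELECT[bcity='Brooklyn']((BRANCH JOIN DEPOSIT))
print(push_selects_down(tree))  # (SELECT[bcity='Brooklyn'](BRANCH) JOIN DEPOSIT)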
Syntactical Optimization
Syntactical optimization relies on the user's understanding of both the underlying database schema and the distribution of the data stored within the tables. All tables are joined in the original order specified by the user query. The optimizer attempts to improve the efficiency of these joins by identifying indexes that are useful for data retrieval. This type of optimization can be extremely efficient when accessing data in a relatively static environment. Using syntactical optimization, indexes can be created and tuned to improve the efficiency of a fixed set of queries. Problems occur with syntactical optimization whenever the underlying data is fairly dynamic. Query access schemas can degrade over time, and it is up to the user to find a more efficient method of accessing the data. Another problem is that applications using embedded SQL to query dynamically changing data often need to be recompiled to improve their data access performance. Cost-based optimization was developed to resolve these problems.

Cost-Based Optimization
To perform cost-based optimization, an optimizer needs specific information about the stored data. This information is extremely system-dependent and can include file size, file structure types, available primary and secondary indexes, and attribute selectivity (the percentage of tuples expected to be retrieved for a given equality selection). Because the goal of any optimization process is to retrieve the required information as efficiently as possible, a cost-based optimizer uses its knowledge of the underlying data and storage structures to assign an estimated cost, in terms of the number of tuples returned and, more importantly, physical disk I/O, to each relational operation. By evaluating various orderings of the relational operations required to produce the result set, a cost-based optimizer then arrives at an execution plan based on a combination of operational orderings and data access methods that has the lowest estimated cost in terms of system overhead. As mentioned earlier, the realistic goal of a cost-based optimizer is not to produce the "optimal" execution plan for retrieving the required data, but to provide a reasonable execution plan. For complex queries, the cost estimate is based on the evaluation of a subset of all possible orderings and on statistical information that estimates the selectivity of each relational operation. These cost estimates can be only as accurate as the available statistical data. Due to the overhead of keeping this information current for data that can be altered dynamically, most relational database management systems maintain this information in system tables or catalogs that must be updated manually. The database system administrator must keep this information current so that a cost-based optimizer can accurately estimate the cost of various operations.

Semantic Optimization
Although not yet an implemented optimization technique, semantic optimization is currently the focus of considerable research. Semantic optimization operates on the premise that the optimizer has a basic understanding of the actual database schema. When a query is submitted, the optimizer uses its knowledge of system constraints to simplify or to ignore a particular query if it is guaranteed to return an empty result set. This technique holds great promise for providing even more improvements to query processing efficiency in future relational database systems.

The Microsoft SQL Server Query Optimizer
The Microsoft SQL Server database engine uses a cost-based query optimizer to automatically optimize data manipulation queries that are submitted using SQL. (A data manipulation query is any query that supports the WHERE or HAVING keywords in SQL; for example, SELECT, DELETE, and UPDATE.) This optimization is accomplished in three phases: query analysis, index selection, and join selection.

Query Analysis
In the query analysis phase, the SQL Server optimizer looks at each clause represented by the canonical query tree and determines whether it can be optimized. SQL Server attempts to optimize clauses that limit a scan; for example, search or join clauses. However, not all valid SQL syntax can be broken into optimizable clauses, such as clauses containing the SQL relational operator <> (not equal). Because <> is an exclusive rather than an inclusive operator, the selectivity of the clause cannot be determined before scanning the entire underlying table. When a relational query contains non-optimizable clauses, the execution plan accesses these portions of the query using table scans. If the query tree contains any optimizable SQL syntax, the optimizer performs index selection for each of these clauses.

Index Selection
For each optimizable clause, the optimizer checks the database system tables to see if there is an associated index useful for accessing the data. An index is considered useful only if a prefix of the columns contained in the index exactly matches the columns in the clause of the query. This must be an exact match, because an index is built based on the column order presented at creation time. For a clustered index, the underlying data is also sorted based on this index column order. Attempting to use only a secondary column of an index to access data would be similar to attempting to use a phone book to look up all the entries with a particular first name: the ordering would be of little use, because you would still have to check every row to find all of the qualifying entries. If a useful index exists for a clause, the optimizer then attempts to determine the clause's selectivity. In the earlier discussion of cost-based optimization, it was stated that a cost-based optimizer produces cost estimates for a clause based on statistical information. This statistical information is used to estimate a clause's selectivity (the percentage of tuples in a table that are returned for the clause).
Microsoft SQL Server stores this statistical information in a specialized data distribution page associated with each index. This statistical information is updated only at two times: during the initial creation of the index (if there is existing data in the table), and whenever the UPDATE STATISTICS command is executed for either the index or the associated table. To provide SQL Server with accurate statistics that reflect the actual tuple distribution of a populated table, the database system administrator must keep the statistical information for the table indexes reasonably current. If no statistical information is available for the index, a heuristic based on the relational operator of the clause is used to produce an estimate of selectivity. Information about the selectivity of the clause and the type of available index is used to calculate a cost estimate for the clause. SQL Server estimates the amount of physical disk I/O that would occur if the index were used to retrieve the result set from the table. If this estimate is lower than the physical I/O cost of scanning the entire table, an access plan that employs the index is created.

Join Selection
When index selection is complete and all clauses have an associated processing cost based on their access plan, the optimizer performs join selection. Join selection is used to find an efficient order for combining the clause access plans. To accomplish this, the optimizer compares various orderings of the clauses and then selects the join plan with the lowest estimated processing costs in terms of physical disk I/O. Because the number of clause combinations can grow combinatorially as the complexity of a query increases, the SQL Server query optimizer uses tree pruning techniques to minimize the overhead associated with these comparisons. When this join selection phase is complete, the SQL Server query optimizer provides a cost-based query execution plan that takes advantage of available indexes when they are useful and accesses the underlying data in an order that minimizes system overhead and improves performance.

Summary
This technical article has shown the steps required for a relational database management system to process a high-level query. It has discussed the need for query optimization and has shown several different methods of achieving query optimization. Finally, it has illustrated the various phases of optimization employed by the cost-based optimizer of the Microsoft RDBMS, SQL Server. We hope this document has helped you gain a better understanding of both the query optimization process and the Microsoft cost-based query optimizer, one of the many features that clearly define SQL Server as the premier database server for the PC environment.

Related Readings
Date, C. J. An Introduction to Database Systems, Volume I. Addison-Wesley, 1990, 455–473.
Elmasri, R., and S. B. Navathe. Fundamentals of Database Systems. Benjamin/Cummings, 1989, 501–532.
Moffatt, Christopher. "Microsoft SQL Server Network Integration Architecture." MSDN Library, Technical Articles.
"Microsoft Open Data Services: Application Sourcebook." MSDN Library, Technical Articles.
Shelly, D. B. "Understanding the Microsoft SQL Server Optimizer." Microsoft Networking Journal, Vol. 1, No. 1, January 1991.
Yao, S. B. "Optimization of Query Evaluation Algorithms." ACM TODS, Vol. 4, No. 2, June 1979.
Distributed Query Optimization
By Craig S. Mullins. Technical Support, System Strategies, July 1996.
Query optimization is a difficult task in a distributed client/server environment, and data location becomes a major factor. Understanding the issues involved enables programmers to develop efficient distributed optimization choices.

Database queries have become increasingly complex in the age of the distributed DBMS (DDBMS). This poses a difficulty for the programmer, but also for the DDBMS. Query optimization is a difficult enough task in a non-distributed environment; anyone who has tried to study and understand a cost-based query optimizer for a relational DBMS (such as DB2 or Sybase SQL Server) can readily attest to this fact. When distributed data is added to the mix, query optimization becomes even more complicated. In order to optimize queries accurately, sufficient information must be available to determine which data access techniques are most effective (for example, table and column cardinality, organization information, and index availability). In a distributed, client/server environment, data location becomes a major factor. This article discusses how adding location considerations to the optimization process increases complexity.

COMPONENTS OF DISTRIBUTED QUERY OPTIMIZATION
There are three components of distributed query optimization:
Access Method — In most RDBMS products, tables can be accessed in one of two ways: by completely scanning the entire table or by using an index. The best access method to use will always depend upon the circumstances. For example, if 90 percent of the rows in the table are going to be accessed, you would not want to use an index; scanning all of the rows would actually reduce I/O and overall cost. When scanning only 10 percent of the total rows, however, an index will usually provide more efficient access. Some products provide additional access methods, such as hashing, but table scans and indexed access can be found in all of the "Big Six" RDBMS products (i.e., DB2, Sybase, Oracle, Informix, Ingres, and Microsoft).
Join Criteria — If more than one table is accessed, the manner in which they are to be joined together must be determined. Usually the DBMS will provide several different methods of joining tables; for example, DB2 provides three different join methods: merge scan join, nested loop join, and hybrid join. The optimizer must consider factors such as the order in which to join the tables and the number of qualifying rows for each join when calculating an optimal access path. In a distributed environment, which site to begin with in joining the tables is also a consideration.
Transmission Costs — If data from multiple sites must be joined to satisfy a single query, then the cost of transmitting the results from intermediate steps needs to be factored into the equation. At times, it may be more cost-effective simply to ship entire tables across the network so that processing can occur at a single site, thereby reducing overall transmission costs. This component of query optimization is an issue only in a distributed environment.
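The access-method decision described above can be caricatured with a crude cost model: compare the pages read by a full table scan against an index lookup that pays roughly one page read per qualifying row (an unclustered-index assumption). The numbers and the formula are invented for illustration; real optimizers use much finer models:

def choose_access_method(table_pages, table_rows, selectivity, index_available):
    scan_cost = table_pages                        # a scan reads every page once
    if not index_available:
        return "table scan", scan_cost
    index_cost = 3 + selectivity * table_rows      # a few index pages + one page per row
    return ("index", index_cost) if index_cost < scan_cost else ("table scan", scan_cost)

# A 10,000-row table stored on 500 pages:
print(choose_access_method(500, 10_000, 0.90, True))   # ('table scan', 500)
print(choose_access_method(500, 10_000, 0.01, True))   # ('index', 103.0)
# The crossover point shifts a great deal with clustering and rows-per-page, which
# is exactly why the "right" access method always depends upon the circumstances.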
SYSTEMATIC VS. PROGRAMMATIC OPTIMIZATION
There are two manners in which query optimization can occur: systematically or programmatically. Systematic optimization occurs when the RDBMS contains optimization algorithms that can be used internally to optimize each query. Although systematic optimization is desirable, the optimizer is not always robust enough to determine how best to join tables at disparate sites. Indeed, quite often the RDBMS does not even permit a distributed request joining multiple tables in a single SQL statement. In the absence of systematic optimization, the programmer can optimize each request by coding the actual algorithms for selecting and joining between sites into each application program. This is referred to as programmatic optimization. With systematic optimization the RDBMS does all of the work. Factors to consider when coding optimization logic into your application programs include:
• the size of the tables;
• the location of the tables;
• the availability of indexes;
• the need for procedural logic to support complex requests that can't be coded using SQL alone;
• the availability of denormalized structures (fragments, replicas, snapshots); and
• the use of common, reusable routines for each distinct request, simplifying maintenance and modification.

AN OPTIMIZATION EXAMPLE
In order to understand distributed query optimization more fully, let's look at an example of a query accessing tables in multiple locations. Consider the ramifications of coding a program to simply retrieve a list of all teachers who have taught physics to seniors. Furthermore, assume that the COURSE table and the ENROLLMENT table exist at Site 1, and the STUDENT table exists at Site 2. If either all of the tables existed at a single site, or the DBMS supported distributed multi-site requests, the SQL shown in Figure 1 would satisfy the requirements.

Figure 1: SQL to Satisfy Single-Site or Multi-Site Requests
SELECT C.TEACHER
FROM COURSE C, ENROLLMENT E, STUDENT S
WHERE C.COURSE_NO = E.COURSE_NO
AND E.STUDENT_NO = S.STUDENT_NO
AND S.STUDENT_LEVEL = "SENIOR"
AND C.COURSE_TYPE = "PHYSICS"

However, if the DBMS cannot perform (or optimize) distributed multi-site requests, programmatic optimization must be performed. There are at least six different ways to go about optimizing this three-table join.
Option 1: Start with Site 1 and join COURSE and ENROLLMENT, selecting only physics courses. For each qualifying row, move it to Site 2 to be joined with STUDENT to see if any are seniors.
Option 2: Start with Site 1 and join COURSE and ENROLLMENT, selecting only physics courses, and move the entire result set to Site 2 to be joined with STUDENT, checking for senior students only.
Option 3: Start with Site 2 and select only seniors from STUDENT. For each of these, examine the join of COURSE and ENROLLMENT at Site 1 for physics classes.
Option 4: Start with Site 2 and select only seniors from STUDENT, and move the entire result set to Site 1 to be joined with COURSE and ENROLLMENT, checking for physics classes only.
Option 5: Move the COURSE and ENROLLMENT tables to Site 2 and proceed with a local three-table join.
Option 6: Move the STUDENT table to Site 1 and proceed with a local three-table join.
Which of these six options will perform the best? Unfortunately, the only correct answer is "It depends." The optimal choice will depend upon:
• the size of the tables;
• the size of the result sets — that is, the number of qualifying rows and their length in bytes; and
• the efficiency of the network.
Try different combinations at your site to optimize distributed queries. But remember, network traffic is usually the cause of most performance problems in a distributed environment, so devoting most of your energy to options involving the least amount of network traffic is a wise approach.
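A back-of-the-envelope comparison of three of the options above, using invented row counts and row widths; the "cost" here is simply bytes shipped between sites, ignoring local I/O:

rows = {"COURSE": 2_000, "ENROLLMENT": 500_000, "STUDENT": 20_000}
row_bytes = {"COURSE": 100, "ENROLLMENT": 20, "STUDENT": 200}

physics_fraction = 0.02    # assumed share of enrollments that are physics courses
senior_fraction = 0.25     # assumed share of students who are seniors

# Option 2: join COURSE and ENROLLMENT at Site 1, ship the qualifying rows to Site 2.
option2 = physics_fraction * rows["ENROLLMENT"] * (row_bytes["ENROLLMENT"] + row_bytes["COURSE"])

# Option 4: select seniors at Site 2, ship them to Site 1 and join there.
option4 = senior_fraction * rows["STUDENT"] * row_bytes["STUDENT"]

# Option 6: ship the whole STUDENT table to Site 1 and join locally.
option6 = rows["STUDENT"] * row_bytes["STUDENT"]

for name, cost in [("option 2", option2), ("option 4", option4), ("option 6", option6)]:
    print(f"{name}: ~{cost / 1_000_000:.1f} MB shipped")
# With these made-up numbers option 4 ships the least data (1.0 MB versus 1.2 MB and
# 4.0 MB), but changing a selectivity or a row width easily changes the winner.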
In addition, bad design can also be the cause of many distributed performance problems.

NOT QUITE SO SIMPLE
The previous example is necessarily simplistic, in order to demonstrate the inherent complexity of optimizing distributed queries. By adding more sites and/or more tables to the mix, the difficulty of optimization increases because the number of available options increases. Additionally, the specific query used is also quite simple: instead of a simple three-table join, the query could be a combination of joins, subqueries, and unions over more than three tables. The same number of options is available for any combination of two tables in the query. Indeed, there are probably more options than those covered in this article. Consider a scenario similar to the one posed above, in which we have three tables being joined over two sites: Tables A and B exist at Site 1 and Table C exists at Site 2. It is quite possible that it would be more efficient to process A at Site 1 and ship the results to Site 2, join them to Table C there, and then ship those results back to Site 1 to be joined to Table B. It is not probable that this scenario would produce a more optimal strategy than the six outlined above, but in certain situations it is possible. Furthermore, some types of processing require procedural logic (such as looping and conditional if-then processing) to be interspersed with multiple SQL queries to produce a result. In these cases, the procedural logic should be factored into the optimization equation for optimal results. However, the optimizers available in the major RDBMS products don't do a good job of this even for non-distributed queries, so the hope of a distributed optimizer performing this type of optimization any time soon is not good. Finally, there is a laundry list of other considerations that must be taken into account that I have skipped for the sake of brevity. For example:
• The security and authorization implications of who can access what information at which site need to be examined and implemented.
• In a multi-site environment, it is possible (indeed quite likely over time) that one of the sites will not be available for any number of reasons (software upgrade, power outage, hardware/software failure, etc.).
• Declarative referential integrity among multiple sites, in which the data relationships are specified in each table's DDL, is not available in any DDBMS to date. The specification of these relationships would greatly assist application development efforts, as well as distributed query optimization.
• Distributed structures can be implemented to augment performance. A multi-site, multi-table index structure could be created that would contain information on the physical location of tables, as well as the physical location of the data items within each table. This structure, however helpful from a performance perspective, would be difficult to maintain and administer due to its reliance on multiple sites.
• The optimization process will be highly dependent upon the implementation and usage of the network. The amount of network traffic can vary from day to day, and even hour to hour, thereby impacting the optimization choice.
SYNOPSIS

Introducing data distribution into the query optimization process makes a complex issue that much more complex. Until the distributed DBMS products support the systematic optimization of distributed multi-table SQL requests, programmatic optimization will be a fact of distributed life. Understanding the issues involved will enable application programmers to develop efficient distributed optimization choices.

Craig S. Mullins is a senior technical advisor and team leader of the Technical Communications group at PLATINUM technology, inc. Craig's book, DB2 Developer's Guide, contains more than 1,200 pages of tips and guidelines for DB2 and can be ordered directly from the publisher, SAMS Publishing, at 1-800-428-5331. Craig can be reached via the Internet ([email protected]), CompuServe (70410,237), America Online (CraMullins), or at PLATINUM technology, inc. (800-442-6861, fax: 708-691-0709).

©1996 Technical Enterprises, Inc. Reprinted with permission of Technical Support magazine. For subscription information, email [email protected] or call 414-768-8000, Ext. 116.

Semi join operation

Databases and Distributed Database Management Systems

DBMSs had their origin in large organisations' needs for centrally controlled information management, and the software and associated administrative procedures were developed for that environment. Later, with the advent of small business machines and particularly PCs, single-user DBMSs were widely adopted to provide reliable and simple information-processing facilities for individuals or small working groups. There is now a tendency to link machines together in networks, intended to give the advantages of local processing while maintaining overall control over system integrity and security. Developments in database technology reflect this trend.

A DBMS like Oracle can, for instance, be run under Unix in CLIENT-SERVER mode. Using the TCP/IP protocol, an Oracle application running on a workstation can communicate with Oracle on a server, the tasks being shared between the two processors: the server handles updating and retrieval, while the client application handles screen management, user data entry and report generation. Properly implemented, this should provide better performance. Note that a DBMS which supports a fully relational interface is important for the success of this approach, as it is for the fully distributed databases discussed later. Using a relational language, interactions between the server and the client involve retrieving sets of records, which puts less load on the network than single-record transactions. Database servers are sometimes referred to as SQL ENGINES, in that their only interaction with client machines is through SQL commands to accept and produce data. Standard SQL provides a common language with which completely different software products can communicate.
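To make the point about set-oriented retrieval concrete, the following sketch contrasts one set-at-a-time query with record-at-a-time access over the same table; the connection factory, table and column names are invented for illustration and are not part of the notes.

    # Hypothetical illustration of why set-oriented (relational) retrieval
    # loads the network less than record-at-a-time access. `connect` stands
    # for any DB-API style connection factory to the database server.

    def fetch_overdue_set(connect):
        # One statement, one round trip: the server does the selection and
        # returns only the qualifying set of records.
        with connect() as conn:
            cur = conn.cursor()
            cur.execute("SELECT invoice_no, amount FROM invoices WHERE status = 'OVERDUE'")
            return cur.fetchall()

    def fetch_overdue_record_at_a_time(connect, all_invoice_nos):
        # One round trip per candidate record: every row crosses the network
        # individually and the filtering happens on the client.
        results = []
        with connect() as conn:
            cur = conn.cursor()
            for no in all_invoice_nos:
                cur.execute(
                    "SELECT invoice_no, amount, status FROM invoices WHERE invoice_no = ?",
                    (no,),
                )
                row = cur.fetchone()
                if row and row[2] == 'OVERDUE':
                    results.append((row[0], row[1]))
        return results

The first function costs a single exchange with the server regardless of table size; the second costs one exchange per invoice number, which is exactly the traffic pattern a relational client-server split avoids.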
In a true distributed database, the data itself is located on more than one machine. There are various possible approaches, depending on the needs of the application and the degree of emphasis placed on central control versus local autonomy. In general, organisations may wish to:

• reduce data communications costs by putting data at the location where it is most often used;
• aggregate information from different sources;
• provide a more robust system (e.g. when one node goes down, the others continue working); and
• build in extra security by maintaining copies of the database at different sites.

Distributed systems are not always designed from scratch; they may evolve from traditional systems as organisational needs become apparent. One possibility is that a complete central database is maintained and updated in the normal way, but local copies (in whole or in part) are sent periodically to remote sites, to be used for fast and cheap retrieval. Any local updates have no effect on the central database. The implication here is that consistency between all copies of the database at all times is not crucial; it may, for instance, be enough to send new data to node sites overnight when networks are less busy.

Alternatively, distributed database development may involve linking together previously separate systems, perhaps running on different machine architectures with different software packages. A possible scenario is that individual sites manage and update their own databases for standard operational applications, but information is collected and aggregated for higher-level decision support functions. In this case there is no single location where the whole database is stored; it is genuinely split over two or more sites. Once again, however, total consistency may not be sought: local databases are kept up to date and data is transmitted back to the centre periodically. To manage a system like this, a product such as Oracle's SQL*Net is required. By using SQL drivers provided for the host RDBMS (e.g. ODBC drivers for Access), it enables data stored in, say, an Access DBMS to be interrogated and updated by an Oracle DBMS, or vice versa. Note that under these circumstances it is essential that the appropriate driver is capable of generating standard SQL; SQL is the universal database language used for communicating between different RDBMSs.

A third possibility is that the database is designed from the start to be distributed, and that all nodes in the network may in principle query and update the database at any location. Date has specified a set of criteria to characterise a genuinely distributed system; these are not in fact satisfied by any DDBMS commercially available today but, as with Codd's twelve "commandments" for relational systems, they provide a framework for explanation and evaluation.

8.1. Local autonomy
8.2. No reliance on a central site

These points concern the overall control mechanism within a DDB, and in particular the location of data dictionary and system catalogue material. How much is it necessary for each site to know about data held elsewhere? In a local area network it is feasible for any node to broadcast a query to all the others and await a response from the one holding the relevant information. In wide area networks, communications costs become more significant, and it is necessary to decide how and where to place the information that determines the routing of queries. This point will be explored further after the next three rules have been explained.

8.3. Data fragmentation
8.4. Transparency of location
8.5. Replication
The requirement is that the user of a DDB need not know how the data is partitioned, where any part of it is stored, or how many copies exist: the system should be intelligent enough to present it as a seamless whole. No current general-purpose DDBMS can achieve this, although it is always possible to write code for particular applications which hides lower-level details from end users. Decisions about fragmentation, location and replication are very important for the design of a DDB, and are now discussed in more detail.

A relational database is partitioned by first dividing it into a number of FRAGMENTS. In theory a fragment may be a complete table, or any HORIZONTAL or VERTICAL subset of it which can be described in terms of relational select and project: in other words, groups of records or groups of fields. The choice of fragments will be based on expectations about likely usage.

1. HORIZONTAL fragmentation might depend on geographical divisions within an organisation so that, for example, payroll or customer records are held at the location where they are most likely to be created and accessed. It should partition tables into discrete groups, based either directly on field values or indirectly on joins with another horizontally fragmented table (derived fragmentation). It should not result in missing records or overlaps!

2. VERTICAL fragmentation might depend on functional divisions within an organisation, so that, for example, the site normally dealing with taxation has the relevant fields from all employee records. There must in this case be some overlap: at least the primary key of a vertically fragmented table will be repeated, and the designer may define clusters of fields to eliminate the potential need for many cross-site joins.

The fragments are then allocated to individual sites, on the basis of where they are most likely to be needed. Decisions are based on the COST and the BENEFIT of having a data fragment at a particular site, where:

• the BENEFIT relates to the number of times the fragment will be needed to answer queries at that site; and
• the COST relates to the number of times it will be changed as a result of transactions at other sites.

The site with the best cost/benefit ratio is selected as the location for the fragment. The designer may also choose to replicate the data, i.e. keep several copies of each fragment in different locations. This provides extra security, and flexibility in that there is more than one way to answer the same question. However, it increases the potential update cost, and in practice it has been found that the benefits of holding more than two or three replicated copies do not generally outweigh the cost. At this stage the question may arise as to whether total consistency between copies is always necessary; such a requirement places a particularly heavy load on the transaction management software.
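As a toy illustration of the allocation rule just described, the sketch below ranks candidate sites for a fragment by their benefit-to-cost ratio and keeps the best one, or the best few if replication is wanted. The site names and figures are invented; a real design exercise would derive them from predicted transaction volumes.

    # Toy sketch of fragment allocation by cost/benefit ratio, as described above.
    # benefit[site] = expected number of local retrievals answered by the fragment
    # cost[site]    = expected number of remote updates that must be propagated
    # The figures below are invented purely for illustration.

    def allocate(benefit, cost, copies=1):
        """Return the `copies` best sites for a fragment, ranked by benefit/cost."""
        ratio = {site: benefit[site] / max(cost[site], 1) for site in benefit}
        ranked = sorted(ratio, key=ratio.get, reverse=True)
        return ranked[:copies]

    benefit = {"London": 900, "New York": 350, "Tokyo": 120}
    cost    = {"London": 40,  "New York": 60,  "Tokyo": 25}

    print(allocate(benefit, cost))            # primary location: ['London']
    print(allocate(benefit, cost, copies=2))  # with one replica: ['London', 'New York']

Adding a replica simply means keeping the next-best site as well, which is consistent with the observation above that the value of extra copies falls off quickly as the update cost mounts.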
The final design stage involves the MAPPING of global database fragments to tables in local databases. It is important to adopt a naming system which allows unambiguous reference to sections of the global database, while retaining users' freedom to select their own local names. Global names will generally incorporate site names (which must be unique), and in some systems may have a more complex structure. In IBM's experimental DDBMS, R*, every database unit has a name identifying:

1. CREATOR_NAME,
2. CREATOR_SITE,
3. LOCAL_NAME, and
4. BIRTH_SITE,

where BIRTH_SITE is the name of the site at which the data was originally created. This name is guaranteed never to change, and will normally be mapped to local names by way of SQL CREATE SYNONYM clauses. It provides a convenient mechanism for actually finding database fragments, as will be described shortly.

The next important design decision is where to locate the SYSTEM CATALOGUE. Query processing in a DDB may require access to the following information:

(a) the GLOBAL SCHEMA;
(b) the FRAGMENTATION SCHEMA;
(c) the ALLOCATION SCHEMA;
(d) LOCAL MAPPINGS;
(e) AUTHORISATION RULES;
(f) ACCESS METHODS; and
(g) DATABASE STATISTICS.

Note that this information is not static: in principle, changes may occur in any of the above categories, and in particular database fragments may migrate over time from one site to another as patterns of access evolve. With a truly distribution-independent DBMS any such alterations should be invisible to existing applications. In principle it is possible to adopt one of the following strategies for holding the system catalogue; each choice has advantages and disadvantages.

1. Hold one copy only, at a central site. This is the simplest solution to manage, since there is no redundancy and a single point of control. The disadvantage is that the central site acts as a bottleneck: if the catalogue there becomes unavailable for any reason, the rest of the network is also out of action. It is the solution adopted in practice by many organisations, but it violates the criteria above for categorisation as a full DDBMS.

2. Replicate copies of the complete catalogue over all sites. This allows any site to carry out every stage of query processing, even down to generating and optimising query plans. Total replication produces a high overhead, particularly if changes to any part of the catalogue must be propagated throughout the network. Some systems operate a CACHING mechanism whereby sites hold and use versions of the catalogue which are not guaranteed to be up to date, but which may allow some queries to be processed without access to the latest version. Another compromise is to replicate only part of the catalogue: the RDBMS INGRES, for example, arranges that all sites hold items (a), (b) and (c) from the list above. Any site then knows where to direct queries, but the task of generating query plans is delegated to the site where the data is held. This may prove a barrier to global distributed query optimisation.

3. Maintain only local catalogues. This solution does provide complete site autonomy, but it may give rise to extensive network traffic, since all sites must be interrogated for every query to see whether they hold relevant information. While perhaps tolerable in a small system using a local area network, this solution cannot be adopted in systems with high communication costs. However, using a convention where local names are mapped onto global names via synonyms, it is possible to ensure that any data element is accessible in at most two moves. For example, the R* system mentioned above holds complete catalogue entries for all database elements at both their birth site and their current site, if these are different. Query processing then involves the following actions (sketched in code below):

• convert from the local synonym to the global name;
• identify the birth site and interrogate it for the data;
• the birth site will either return the data or, if the data has migrated elsewhere, will know its current location and inform the query site accordingly;
• the query site can now interrogate the site where the data is currently stored.
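The following is a minimal sketch of that two-move resolution. The Site class, the tuple form of the global name and the catalogue dictionaries are stand-ins invented for illustration; they are not actual R* structures.

    # Hypothetical sketch of R*-style name resolution: consult the birth site
    # first; it either holds the data or knows the current site.

    class Site:
        def __init__(self, name):
            self.name = name
            self.synonyms = {}    # local synonym -> global name
            self.catalogue = {}   # global name -> name of site currently holding the data
            self.data = {}        # global name -> stored fragment

    def resolve(query_site, sites, local_name):
        # Convert the local synonym to a global name of the form
        # (creator, creator_site, local_name, birth_site).
        global_name = query_site.synonyms[local_name]
        birth_site = sites[global_name[3]]

        # Move 1: interrogate the birth site.
        holder = sites[birth_site.catalogue[global_name]]
        if holder is birth_site:
            return birth_site.data[global_name]

        # Move 2: the data has migrated; ask the site that now holds it.
        return holder.data[global_name]

    # Tiny demonstration: the fragment was created at London but now lives at York.
    london, york, paris = Site("London"), Site("York"), Site("Paris")
    sites = {"London": london, "York": york, "Paris": paris}
    gname = ("alice", "London", "EMP", "London")
    london.catalogue[gname] = "York"        # the birth site knows the current holder
    york.catalogue[gname] = "York"
    york.data[gname] = ["...employee fragment..."]
    paris.synonyms["EMP"] = gname           # Paris refers to it by a local synonym
    print(resolve(paris, sites, "EMP"))     # two moves: London, then York

Whatever the real catalogue structures look like, the guarantee is the same: at most two sites other than the query site are ever contacted.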
8.6. Continuous operation

It should not be necessary to halt normal use of the database while it is re-organised, archived, and so on. This may be easier to achieve in a distributed than in a centralised environment, since processing nodes may be able to substitute for one another where necessary.

8.7. Distributed query processing

The DDBMS should be capable of gathering and presenting data from more than one site to answer a single query. In theory a distributed system can handle queries more quickly than a centralised one, by exploiting parallelism and reducing disc contention; in practice the main delays (and costs) will be imposed by the communications network. Routing algorithms must take many factors into account to determine the location and ordering of operations. Communications costs for each link in the network are relevant, as are the varying processing capabilities and loadings of different nodes and, where data fragments are replicated, trade-offs between cost and currency. If some nodes are updated less frequently than others, there may be a choice between querying the local out-of-date copy very cheaply and getting a more up-to-date answer by accessing a distant location.

The ability to do query optimisation is essential in this context, the main objective being to minimise the quantity of data to be moved around. As with single-site databases, one must consider both generalised operations on internal query representations and the exploitation of information about the current state of the database. A few examples follow.

Operations on the query tree should have the effect of executing the less expensive operations first, preferably at a local site, to reduce the total quantity of data to be moved. Distributed query processing often requires the use of UNION operations to put together disjoint horizontal fragments of the same table; these are obviously expensive and should, like join operations, be postponed as late as possible. A good strategy is to carry out all the REDUCER operations locally and union together the final results. This applies not only to select and project, but also to the aggregation functions. Note, however, that

    COUNT(UNION(frag1, frag2, frag3))

must be implemented as

    SUM(COUNT(frag1), COUNT(frag2), COUNT(frag3))

and

    AVERAGE(UNION(frag1, frag2, frag3))

must be implemented as

    SUM(SUM(frag1), SUM(frag2), SUM(frag3)) / SUM(COUNT(frag1), COUNT(frag2), COUNT(frag3))
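A small sketch of these aggregation rules, with invented fragment contents, shows why AVERAGE has to be rebuilt from per-fragment SUMs and COUNTs rather than by averaging the per-fragment averages. In a real system each partial result would be computed by the site holding the fragment and only the partial results would cross the network.

    # Computing global aggregates from disjoint horizontal fragments.
    # Each fragment would normally live at a different site; here they are
    # plain lists, and the values are invented for illustration.

    frag1 = [70, 80, 90]        # e.g. marks held at site 1
    frag2 = [60, 75]            # site 2
    frag3 = [85, 95, 65, 55]    # site 3
    fragments = [frag1, frag2, frag3]

    # Each site computes its local reducer results...
    partial_counts = [len(f) for f in fragments]
    partial_sums = [sum(f) for f in fragments]

    # ...and only these small partial results cross the network.
    global_count = sum(partial_counts)                       # SUM of the COUNTs
    global_average = sum(partial_sums) / sum(partial_counts) # SUM of SUMs / SUM of COUNTs

    # Averaging the per-fragment averages is wrong when fragment sizes differ:
    naive_average = sum(sum(f) / len(f) for f in fragments) / len(fragments)
    print(global_count, global_average, naive_average)       # 9 75.0 74.17

The same pattern applies directly to SUM, MIN and MAX; AVERAGE is the one that must be reconstructed from SUM and COUNT rather than combined naively.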
Given that the original fragmentation predicates were based on expected access requirements (e.g. records partitioned on a location field), frequently used queries should need to access only a subset of the fragments of the tables to which they refer. In the absence of fragmentation transparency, such queries can of course be directed to particular sites. By contrast, any DDBMS with full distribution independence should be able to detect whether some branches of the query tree will by definition produce null results, and eliminate them before execution. In the general case this requires a theorem-proving capability in the query optimiser, i.e. the ability to detect contradictions between the original fragmentation predicates and those specified by the current query.

A knowledge of database statistics is also necessary when deciding how to move data around the network. A cross-site join requires at least one of the tables to be transferred to another site, and for a complex query it may also be necessary to carry intermediate results between sites. The relevant statistics about each table are its CARDINALITY (number of records), the size of each record, and the number of distinct values in join or select fields. The last point is relevant in estimating the size of intermediate results. For example, in a table of 10,000 records containing a field with 1,000 distinct values uniformly distributed, the number of records selected by a condition on that field will be roughly 10 times the number of values in the select range. It is especially useful to estimate the cardinality of join results. In the worst case (where every record matches every other) this will be M x N, where M and N are the cardinalities of the original tables. In practice the upper bound is generally much lower, and can be estimated from the number of distinct values in the two join fields. To take a common occurrence, if we have a primary-key-to-foreign-key join between tables R and S, the upper bound on the result is the cardinality of table S.

A common strategy for reducing the cost of cross-site joins is to introduce a preliminary SEMI-JOIN operation. Suppose we have:

    Emp(Empno, ename, manager, Deptno) at site New York
    Dept(Deptno, dname, location, etc.) at site London
    Join condition: Dept.Deptno = Emp.Deptno

We may perform the join as follows:

• New York: project Emp on Deptno and send the result to London.
• London: join the projected Deptno values from New York with Dept, selecting only the matching Dept records.
• Send the matching Dept records from London back to New York to participate in the full join.

This method produces economies if the join is intended to make a selection from Dept, and not simply to link together corresponding records from both tables. However, unless an index exists on Dept.Deptno, it still involves a full sort of Dept. An alternative method sends a long bit vector as a filter, so that Dept need not be sorted before the semi-join. The vector is formed by applying a hash function to every value of Emp.Deptno, each time setting the appropriate bit to 1. It is then sent to London, and the same hash function is applied to each value of Dept.Deptno in turn; those values which match a 1-bit are selected. The relevant records are sent to New York and the full join is carried out as before. Since hash functions always produce synonyms (collisions), some records will be selected unnecessarily, but experiments using IBM's R* system showed that this method gives better overall performance than the normal semi-join.
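Here is a rough sketch of that bit-vector variant. The employee and department rows, the single hash function and the vector length are invented for illustration, and a real system would of course ship SQL result sets rather than Python lists.

    # Sketch of the hashed (bit-vector) semi-join described above.
    # Emp lives at New York, Dept at London; only the bit vector and the
    # matching Dept rows cross the network. Data and sizes are invented.

    BITS = 1024  # length of the filter; larger means fewer false positives

    def bit_for(value):
        return hash(value) % BITS

    # At New York:
    emp = [(1, "Adams", 10), (2, "Baker", 20), (3, "Clark", 10)]   # (Empno, ename, Deptno)
    bit_vector = [0] * BITS
    for _, _, deptno in emp:
        bit_vector[bit_for(deptno)] = 1       # hash every Emp.Deptno into the filter

    # At London (receives only the bit vector):
    dept = [(10, "Sales", "London"), (20, "HR", "Leeds"), (30, "IT", "York")]  # (Deptno, dname, location)
    matching_dept = [d for d in dept if bit_vector[bit_for(d[0])] == 1]
    # Dept 30 is filtered out unless a hash collision makes it a false positive.

    # Back at New York (receives the matching Dept rows):
    dept_by_no = {d[0]: d for d in matching_dept}
    result = [(e[0], e[1], *dept_by_no[e[2]][1:]) for e in emp if e[2] in dept_by_no]
    print(result)   # full join of Emp with the qualifying Dept rows

Because a collision can set a bit that a non-matching Dept.Deptno value also hashes to, a few unnecessary Dept rows may travel across the network; they are simply discarded by the final join, so the result remains correct.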
Security Threats: Virus Definitions

What is a Virus? A virus is a manmade program or piece of code that causes an unexpected, usually negative, event. Viruses are often disguised as games or images with clever marketing titles such as "Me, nude."

What is a Worm? Computer worms are viruses that reside in the active memory of a computer and duplicate themselves. They may send copies of themselves to other computers, for example through email or Internet Relay Chat (IRC).

What is a Trojan Horse? A Trojan horse program is a malicious program that pretends to be a benign application; it purposefully does something the user does not expect. Trojans are not viruses, since they do not replicate, but they can be just as destructive. Many people use the term to refer only to non-replicating malicious programs, thus drawing a distinction between Trojans and viruses.