Chapter 2
Designing a Database

Getting Started

The most important step in building a database is seeking input from users. If your users get an efficient system that meets their needs, you have built a successful database. Therefore, discovering what the users need will lay the foundation for your system.

Before you start building, make sure that you adequately

* Gather information from users
* Define information requirements
* Develop an implementation strategy

Gather Information from Users

Check with the users to make sure you've accurately clarified the scope of the database. How you collect user input (e.g., formal or informal interviews, questionnaires, brainstorming sessions) depends on the size and composition of your user groups and your personal preference. Whatever techniques you decide to use, keep in mind these important points:

* Input should be secured from a truly representative sample of users
* Users should be made to feel that their input is important and that their needs are being seriously considered in the design of the database

This second point is particularly important. Users are likely to have a more positive attitude toward a database they have helped to design.

Define Information Requirements

An analysis of typical database usage (interactive inquiries and reports) will provide a comprehensive basis for determining which categories of information (records and fields) should be included in your database. When you determine these fields, write out sample data the way you think it should be displayed. Also, make certain that data for each field you intend to include in the database is, in fact, readily available. If you can't collect certain information in a timely manner or if it requires a great deal of preparation time for data entry, you should reassess the need to include it in the database.

Although you can add records and fields at any point in the life of a database, it is best to minimize the need for such changes.
A thorough analysis of user needs in the initial planning phase is well worth the effort.

Here are some examples of questions you can ask your users that will help you come up with a workable design:

* What are the system outputs? Avoid describing the system in terms of inputs. The purpose of the system is to produce outputs, and you need to know what the system will help users do in broad terms.
* What are the major pieces of information and how are they related? This question should be very easy for your users. They know their information.
* What are the major business processes and how are they related? Users will probably feel comfortable describing a process in chronological order (e.g., for a mail order company, the employees would take the order, see if the products are in stock, and then ship the products). This description is fine, but make sure that you also have a clear idea of the process in reverse order. For example, the last thing the order department does is print out the shipping labels. Think of what processes must happen before shipping (the invoice must be matched with current stock, current stock lists must be updated daily, complete shipping information must be taken on the order, etc.). By looking at the final process first, you can assess what pieces of information the final process needs to obtain from the initial process.
* What information supports each process? You'll want to know where the information in each of these processes comes from (e.g., information taken over the phone for orders, clippings collected by the PR department, forms filled out by scientists).

Keep in mind that database design is a direct derivative of the system requirements. If nobody knows what the requirements for the system are, any database design will do. If you get the impression that a user just wants to store data, ask him or her, “Why? What reports do you want? What are some typical queries? What does the data do for you?” There is no such thing as an input-only system.

Develop an Implementation Strategy

You can ensure a successful transition from your existing information retrieval system to your new system by planning carefully. That way you'll help minimize confusion and avoid disruptive misunderstandings during implementation. Here are some planning guidelines for implementation:

1. Establish procedures and schedules and assign responsibilities for the transition period (i.e., from the existing system to the new BASIS system).
2. Clarify ongoing responsibilities and schedules (e.g., data entry, updating) so that people will have a clear idea of what their responsibilities will be once the new database is implemented.
3. Consider establishing formal procedures for:
   * Orienting and training users
   * Monitoring database use
   * Receiving and acting on user recommendations
   * Keeping users up-to-date on changes made in the database and related issues
4. Prepare a detailed plan describing what you intend to do to encourage user involvement.

Understanding Relationships in a Database

If you understand how record types can be linked together in relational DBMSs, you can build good design into your database.

Mappings

In a BASIS database, every field can be associated with other fields. These associations, which are called mappings, are used to connect various fields to form the logical model or structure of the database. There are three categories of mappings.

A one-to-one mapping (1:1) exists when each value of field X is related to exactly one value of Y. It is said that X identifies Y. For example, in a data file of state names and their abbreviations, the CA value of field X is related to the CALIFORNIA value of field Y, and the OH value of field X is related to the OHIO value of field Y. If X identifies Y and Y identifies X, a one-to-one mapping exists between X and Y. A one-to-one relationship is commutative.
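
As a rough illustration of what "X identifies Y" means, the following Python sketch checks whether each X value in a list of (X, Y) pairs is related to exactly one Y value, and then tests both directions to detect a one-to-one mapping. The function names `identifies` and `is_one_to_one` are hypothetical, and the `codes` data is invented for the example; none of this is a BASIS feature.

```python
def identifies(pairs):
    """Return True if every X value is paired with exactly one Y value."""
    seen = {}
    for x, y in pairs:
        # setdefault stores the first Y seen for X; a different later Y breaks the rule
        if seen.setdefault(x, y) != y:
            return False
    return True


def is_one_to_one(pairs):
    """True if X identifies Y and Y identifies X."""
    reversed_pairs = [(y, x) for x, y in pairs]
    return identifies(pairs) and identifies(reversed_pairs)


# State abbreviations and names, as in the example in the text.
states = [("CA", "CALIFORNIA"), ("OH", "OHIO")]

# Invented sample data: two different codes that share one name.
codes = [("A1", "WIDGET"), ("A2", "WIDGET")]

print(is_one_to_one(states))  # True
print(identifies(codes))      # True: each code has exactly one name
print(is_one_to_one(codes))   # False: the name does not identify the code
```

A check like this can only report on the values actually present; whether a mapping is one-to-one by design is a decision about the data, not something a sample can prove.
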
A one-to-many mapping (1:M) exists when each value of field X is related to zero or more values of Y. The SQUARE ROOTS relation is one-to-many (actually 1:2) since each positive input value for X produces two square roots for Y. For example, a value of 9 for field X yields -3 and +3 values for field Y. Both 1:1 and 1:M mappings are considered simple mappings.

A many-to-many mapping (M:M) is a complex mapping. In this case, each value of X is associated with many values of Y and each value of Y is associated with many values of X.

Simple Mappings     One-to-One      1:1
                    One-to-Many     1:M
Complex Mapping     Many-to-Many    M:M

Figure 2-1: Mapping notation

A database containing employee information could make use of all three kinds of relationships. If there were no employees with the same name, a one-to-one or identity relationship would exist between employee name and employee number. However, if the company employed two or more people with the same name, the employee number would identify the employee name but the relationship would not be commutative because employee name would not identify employee number.

Employee Number    Employee Name
1001               Joe Davis
1823               John Smith
3724               Mary Jackson
4421               John Smith

Figure 2-2: A one-way identity relationship (not one-to-one)

Each project has one manager, but a project manager may be in charge of several projects. Therefore, the relationship between a project manager and his or her projects is one-to-many (1:M).

Project Managers: Linda Lemon, John Miller, Peter Smith
Projects: Data Entry System, Inventory Control, Monthly Reports, System Maintenance, System Performance

Figure 2-3: A one-to-many relationship

A project will often have more than one employee working on it. Each employee could easily be involved in several projects at the same time. So, the relationship between projects and employees is many-to-many (M:M).
An employee would have a list of many projects with which he or she is associated, and for each project we could list the employees who are working on it.

Employee Name: Henry Bush, Tim Jones, Paula Fox, Joan Dark, Carol White, Susan Smith, Shirley Lewis, Pat Cook
Projects: Data Entry System, Inventory Control, System Maintenance, System Performance

Figure 2-4: A many-to-many relationship

We recommend that you index fields frequently used to join different record definitions. The following chart shows how the three types of relationships can be represented with indexed fields.

Table 2-1: Relationships and their index types

Mapping    Field X         Field Y
1:1        Unique Index    Unique Index
1:M        Unique Index    Exact Index
M:M        Exact Index     Exact Index

These mappings can be found in inter-record relationships and are easily represented by using indexes, but they also appear in intra-record relationships (i.e., associations among fields of the same record). For example, the employee number:employee name relationship is an intra-record relationship. The following chart shows how intra-record relationships can be modeled using simple and compound fields.

Table 2-2: Intra-record relationships expressed by field types

Mapping    Field X           Field Y
1:1        Simple Field      Simple Field
1:M        Simple Field      Compound Field
M:M        Compound Field    Compound Field

Understanding Database Structures

Entities and relationships form a logical database structure. These structures are classified as follows:

* Simple
* Tree (hierarchical)
* Plex (network)

Simple Structure

A simple structure has only one record definition, so there are no inter-record relationships. This type of database describes only one kind of entity. For example, a bibliographic database will often use a simple structure.

Tree Structure

A tree structure is built from several record definitions in which all of the inter-record relationships are one-to-one or one-to-many.
Many management structures and classification systems use these tree or hierarchical structures to arrange information.

Figure 2-5: A tree structure

An inverted tree structure is traversed by starting at the root node (or entity) and working down to the terminating leaf nodes. At each node there is only one way to go up, but there can be several ways to go down. In a database that uses a hierarchical structure there will be many instances of tree structures.

Figure 2-6: An inverted tree structure

Plex Structure

A plex structure is built from several record definitions in which the inter-record relationships are 1:1, 1:M, and M:M. It is the most general type of structure and can be used to model any situation. Simple and tree structures are actually special cases of plex structures. A plex is often called a “network.”

Figure 2-7: Plex structures

One interesting kind of plex is called a “Bill-of-Materials” structure. In this plex the parts that are used to form an object are listed for each object. Each part may have subparts and can be used in more than one object. This leads to M:M relationships in some cases, but when we are trying to determine what a particular object is made from, it is easy to view the structure as a tree.

Products: Product A, Product B
Subassemblies: Subassembly 1, Subassembly 2
Parts: Part W, Part X, Part Y, Part Z

Figure 2-8: Example of a bill-of-materials plex

Understanding Relational Databases

The relational approach to database management is based on mathematical concepts involving sets and relations. You can use the BASIS system to create and manage any relational database.

Note: Relational database theory and normalization, introduced in the rest of this chapter, are complex topics. A comprehensive treatment of them is beyond the scope of this manual. For more information about relational database theory and normalization, consult the books listed at the end of the Preface of this manual.
Relational database theory uses terminology taken from the mathematical theory of relations. A relational database is said to consist of a number of relations. Relations are usually represented as tables in which the rows are called tuples, and each column is referred to as a domain. Each column can also be thought of as an attribute, and its set of values forms one of the domains of the relation. The degree of the relation is the number of columns in the relation, and the cardinality of a relation is the number of rows it has.

A relation can be represented by a table. In the following table the column headings EMP_NAME, ENO, and SALARY represent attributes. Because the table has 3 columns, its degree is 3. Because the table has 5 rows or tuples, its cardinality is 5.

Table 2-3: A table is often used to represent a relation

EMP_NAME         ENO     SALARY
Davis, Joe       1001    8000
Smith, John      1823    2000
Martin, Peter    2458    1050
Jackson, Mary    3724    2700
Adams, Paula     5734    1100

In BASIS, each relation corresponds to a record definition and each tuple is a record. The attributes are fields.

Tables that represent relations have the following properties:

1. Each row represents a complete piece of information.
2. No two rows are identical.
3. The order of the rows is unimportant.
4. The order of the columns is essential; we can interchange complete columns only.
5. Only certain values are allowed in each column.
6. Adding or removing rows does not change the meaning of the relation.
7. Adding or removing columns changes the meaning of the relation.
8. Certain rows may not be allowed if they violate an integrity constraint.

Some column or combination of columns will uniquely identify every row. This is called a candidate key. One candidate key will be used as the primary key. In Table 2-3 the primary key is the ENO column (field).

Set operations manipulate relations. The two principal set operations are projections and joins.
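
To make the terms degree, cardinality, and key concrete, here is a small Python sketch that holds Table 2-3 as a list of tuples and computes each one. The helper `is_unique` is a hypothetical name introduced for this example; it only tests whether a set of columns is unique in the rows shown, which is a necessary condition for a candidate key rather than proof of one.

```python
# Table 2-3 as a list of tuples; the attribute names are the column headings.
attributes = ("EMP_NAME", "ENO", "SALARY")
rows = [
    ("Davis, Joe", 1001, 8000),
    ("Smith, John", 1823, 2000),
    ("Martin, Peter", 2458, 1050),
    ("Jackson, Mary", 3724, 2700),
    ("Adams, Paula", 5734, 1100),
]

degree = len(attributes)  # number of columns
cardinality = len(rows)   # number of rows (tuples)


def is_unique(column_names):
    """True if the named columns hold a distinct combination of values in every row."""
    positions = [attributes.index(name) for name in column_names]
    keys = [tuple(row[p] for p in positions) for row in rows]
    return len(set(keys)) == len(keys)


print(degree, cardinality)  # 3 5
print(is_unique(["ENO"]))   # True
```

In this sample EMP_NAME and SALARY also happen to be unique, which shows why a key has to be chosen from what the data means (an employee number is assigned to one person) and not only from the rows currently stored.
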

Projection

A projection selects certain columns from a relation and puts them in a specific order. Projections form new relations.

Table 2-4: Two projections of the employee relation

EMPLOYEE

ENO     NAME             JOB          SALARY    MGR-NO    DNO
1001    Davis, Joe       President    8000      1000      100
1823    Smith, John      Salesman     2000      3724      201
2458    Martin, Peter    Clerk        1050      2843      301
3724    Jackson, Mary    Manager      2700      1001      201
5734    Adams, Paula     Clerk        1101      3724      201

PROJECTION-A

NAME             ENO     SALARY
Davis, Joe       1001    8000
Smith, John      1823    2000
Martin, Peter    2458    1050
Jackson, Mary    3724    2700
Adams, Paula     5734    1101

PROJECTION-B

MGR-NO    DNO
1000      100
3724      201
2843      301
1001      201

Join

A join forms a relation from one or more other relations that share a common attribute (field). There are several types of joins: the outer join (directed join), the theta-join, and the equi-join. The equi-join, described below, is the most commonly used. Two relations are equi-joined using a common attribute, and those tuples that share the same value for the attribute are used to create the resulting relation. Given relations X, Y, and Z below:

Table 2-5: Examples of equi-joins

X

a    b    c
U    1    Red
V    2    Blue
W    3    Green
X    5    Yellow
Z    7    Orange

Y

b    d
1    Cherry
3    Grape
6    Lemon

Z

a    b
L    1
A    4
U    6
B    8
W    9

We can make the following equi-joins:

X(b,c) join Y(b,d) on b

b    c        d
1    Red      Cherry
3    Green    Grape

X(a,b,c) join Z(a) on a

a    b    c
U    1    Red
W    3    Green

Y(b,d) join Z(b,a) on b

b    d         a
1    Cherry    L
6    Lemon     U

X(b,c) join Y(d) join Z(a) on b

b    c      d         a
1    Red    Cherry    L

Normalization

The process of forming records and determining inter-record relationships can be complex. One method for designing relational databases is called normalization. This method can easily be applied to any database structure.
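
The projection and equi-join operations shown in Tables 2-4 and 2-5 can be sketched in a few lines of Python. This is only an illustration of the set operations themselves, not of how BASIS performs them; the function names `project` and `equi_join` and the dictionary-per-row representation are assumptions made for the example.

```python
def project(relation, columns):
    """Keep only the named columns, in the given order, and drop duplicate rows."""
    result = []
    for row in relation:
        projected = {c: row[c] for c in columns}
        if projected not in result:
            result.append(projected)
    return result


def equi_join(left, right, attribute):
    """Combine rows from both relations whose values for the shared attribute are equal."""
    joined = []
    for left_row in left:
        for right_row in right:
            if left_row[attribute] == right_row[attribute]:
                joined.append({**left_row, **right_row})
    return joined


# Relations X, Y, and Z from Table 2-5.
X = [{"a": "U", "b": 1, "c": "Red"}, {"a": "V", "b": 2, "c": "Blue"},
     {"a": "W", "b": 3, "c": "Green"}, {"a": "X", "b": 5, "c": "Yellow"},
     {"a": "Z", "b": 7, "c": "Orange"}]
Y = [{"b": 1, "d": "Cherry"}, {"b": 3, "d": "Grape"}, {"b": 6, "d": "Lemon"}]
Z = [{"a": "L", "b": 1}, {"a": "A", "b": 4}, {"a": "U", "b": 6},
     {"a": "B", "b": 8}, {"a": "W", "b": 9}]

# MGR-NO and DNO columns of the EMPLOYEE relation in Table 2-4.
employee = [{"MGR-NO": 1000, "DNO": 100}, {"MGR-NO": 3724, "DNO": 201},
            {"MGR-NO": 2843, "DNO": 301}, {"MGR-NO": 1001, "DNO": 201},
            {"MGR-NO": 3724, "DNO": 201}]

xy = equi_join(project(X, ["b", "c"]), Y, "b")   # X(b,c) join Y(b,d) on b
yz = equi_join(Y, Z, "b")                        # Y(b,d) join Z(b,a) on b
projection_b = project(employee, ["MGR-NO", "DNO"])

print(xy)
print(yz)
print(len(projection_b))  # 4: the repeated (3724, 201) pair appears once
```

Because a relation has no identical rows, `project` removes duplicates, which is why PROJECTION-B has four rows although EMPLOYEE has five.
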
Normalization is a technique you can use to eliminate insertion, deletion, and update anomalies that could exist in a database because of various functional dependencies among attributes of entities.

Insertion, deletion, and update anomalies can occur when a relation is used to represent more than one fact. Let's examine a relation called ORDER that contains the following attributes:

ORDER_NUMBER
ORDER_DATE
PRODUCT_NAME
PRODUCT_PRICE
SUPPLIER_NAME
SUPPLIER_ADDRESS

The ORDER relation describes a purchase, but since product and supplier information is also present, it also tracks suppliers of products. In this case, the relation represents more than one fact, so various anomalies could occur.

An insertion anomaly occurs whenever we add new supplier information without filling in the order data. What does a null value for ORDER_NUMBER or ORDER_DATE signify? In fact, the ORDER_NUMBER would probably be required. So how do we add a new product supplier without an order?

A deletion anomaly occurs if the deletion of an order destroys the only information available about a supplier of a product. This would happen if there were no other orders for the same product.

Update anomalies occur when one attribute must be changed in many places. If the price of a product needs to be corrected, changing the price may require updates to several orders. This may not be easy if a number of orders change while we are trying to fix the price.

If you are designing a complex database, you must define each relation so it represents only one fact. To do this, study the dependencies that exist among the attributes of a relation. Determine which attributes are dependent on other attributes. For example, attribute B of a relation R is functionally dependent on attribute A of R if, at every instant of time, each value in A has no more than one value in B associated with it in relation R. This means that if B is functionally dependent on A, then A identifies B.
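
The definition of functional dependence above can be tested against sample rows with a short Python sketch. The function name `functionally_dependent` and the order rows are invented for illustration, and the sketch assumes that each supplier has a single address.

```python
def functionally_dependent(rows, a, b):
    """True if, in these rows, each value of attribute a has no more than one value of b."""
    seen = {}
    for row in rows:
        # setdefault stores the first b seen for this a; a different later b breaks the rule
        if seen.setdefault(row[a], row[b]) != row[b]:
            return False
    return True


# Invented ORDER rows; only three of the attributes are shown.
orders = [
    {"ORDER_NUMBER": 1, "SUPPLIER_NAME": "Acme", "SUPPLIER_ADDRESS": "12 Elm St"},
    {"ORDER_NUMBER": 2, "SUPPLIER_NAME": "Acme", "SUPPLIER_ADDRESS": "12 Elm St"},
    {"ORDER_NUMBER": 3, "SUPPLIER_NAME": "Bolt", "SUPPLIER_ADDRESS": "9 Oak Ave"},
]

print(functionally_dependent(orders, "SUPPLIER_NAME", "SUPPLIER_ADDRESS"))  # True
print(functionally_dependent(orders, "SUPPLIER_NAME", "ORDER_NUMBER"))      # False

# An update anomaly: the address is corrected in one order but not the other.
orders[0]["SUPPLIER_ADDRESS"] = "40 Pine Rd"
print(functionally_dependent(orders, "SUPPLIER_NAME", "SUPPLIER_ADDRESS"))  # False
```

A sample can show that a dependency is violated but cannot prove that it holds "at every instant of time"; that part has to come from what you know about the data.
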
There is a simple correspondence between A and B (A -> B).

The attributes of set B (which could contain one attribute) of a relation R are fully functionally dependent on set A of attributes in R if B is functionally dependent on the whole of A but not on any proper subset of A. We cannot uniquely identify B without using all of the attributes of set A.

When you examine a relation, find one or more groups of attributes on which all of the other attributes depend. Such a group (possibly one attribute) is called a candidate key of the relation. Any attribute that is used to form a candidate key is called a prime attribute. The value of the key uniquely identifies each tuple in the relation. Every attribute in the group that defines the candidate key must be required. If any attribute were eliminated from the group, those remaining could not form a candidate key that uniquely identifies each tuple. If more than one candidate key exists for a relation, one of the candidate keys is designated as the primary key. Each record type must have a primary key field.

For example, to determine the functional dependencies that exist in two relations, look at the ORDER relation used earlier and a subset of it called the PRODUCT relation.

ORDER:               PRODUCT:
ORDER_NUMBER         PRODUCT_NAME
ORDER_DATE           SUPPLIER_NAME
PRODUCT_NAME         SUPPLIER_ADDRESS
PRODUCT_PRICE        PRODUCT_PRICE
SUPPLIER_NAME
SUPPLIER_ADDRESS

We must study each attribute of a relation and decide which of the other attributes in the relation uniquely identify it. This diagram shows the dependencies.

ORDER_NUMBER *
ORDER_DATE
PRODUCT_NAME
PRODUCT_PRICE
SUPPLIER_NAME
SUPPLIER_ADDRESS

Figure 2-9: Dependencies in ORDER relation

The * indicates which attributes are prime attributes and the arrows are used to show which attributes can be used to identify other attributes. All of the attributes are identified by the ORDER_NUMBER.
The SUPPLIER_ADDRESS can be determined from knowing only the SUPPLIER_NAME, and the PRODUCT_PRICE is identified by using both the PRODUCT_NAME and the SUPPLIER_NAME together.

The dependencies of the PRODUCT relation are

PRODUCT_NAME *
SUPPLIER_NAME *
SUPPLIER_ADDRESS
PRODUCT_PRICE

Figure 2-10: Dependencies in PRODUCT relation

Notice that the SUPPLIER_ADDRESS is identified by the SUPPLIER_NAME but that both the SUPPLIER_NAME and the PRODUCT_NAME are required to determine the PRODUCT_PRICE.

These diagrams show the type of functional dependencies that we want to eliminate. Since the PRODUCT_NAME with the SUPPLIER_NAME is the only group of attributes that will uniquely identify the other attributes in the PRODUCT relation, they form the only candidate key. (By default, they also form the primary key.) But the SUPPLIER_ADDRESS is not fully functionally dependent on this key. It is identified by the SUPPLIER_NAME alone. We would like to remove this nonfull dependence.

In the ORDER relation, it is easy to see that the ORDER_NUMBER is the only good key. Problems appear in this relation. Notice that we can identify the SUPPLIER_ADDRESS by knowing the SUPPLIER_NAME, which is not a candidate key. This could cause the address to occur in many tuples and lead to anomalies. This is called a transitive dependence and must be removed.

The normalization process is used to remove nonfull and transitive dependencies. This eliminates many possible anomalies. Various degrees of normalization are derived by looking for dependencies and converting a database into normal form. Many databases are normalized to third normal form. Higher degrees of normalization have been discovered. They involve multi-valued dependencies, but they will not be explained here.

First Normal Form

Getting a database into first normal form is fairly easy. We must eliminate any repeating groups and use simple fields for each attribute of every relation.
We must be able to represent the relations as tables where only one value is present in each column of every row. This sometimes requires us to invent new fields that are needed to help uniquely identify each value that was a member of a repeating group and now resides in its own record.

Starting with the relation EMPLOYEE where DEPENDENT and PROJECT_NO are repeating groups:

EMPLOYEE:    ENO *         Occurs (1) Time
             NAME          Occurs (1) Time
             SALARY        Occurs (1) Time
             DEPENDENT     Occurs (0:20) Times
             PROJECT_NO    Occurs (0:15) Times

To “flatten” the original EMPLOYEE relation, we need to create three relations:

EMPLOYEE:      ENO *
               NAME
               SALARY

DEPENDENT:     DEPENDENT_NAME
               DEPENDENT_NO *
               ENO *

ASSIGNMENT:    PROJECT_NO *
               ASSIGNMENT_NO *
               ENO

(The prime attributes are denoted by a *.)

The name of each dependent is put in a separate record and so is each project number.

Second Normal Form

The goal of normalization is to get the database into third normal form. We use tables that are in first normal form to determine the existing functional dependencies. According to theory, we should first eliminate all of the nonfull dependencies, which leaves us in second normal form. Next we should eliminate the transitive dependencies, which converts the database to third normal form. In actual practice we will simultaneously eliminate both kinds of dependencies and go directly from first to third normal form.

Third Normal Form

We know the database is in third normal form if every determinant for the relation is a candidate key. (A determinant is a group of attributes that identifies another attribute.)

To convert the ORDER relation to third normal form, we need to have three relations.

ORDER:       ORDER_NUMBER *
             ORDER_DATE
             PRODUCT_NAME
             SUPPLIER_NAME

PRODUCT:     PRODUCT_NAME *
             SUPPLIER_NAME *
             PRODUCT_PRICE

SUPPLIER:    SUPPLIER_NAME *
             SUPPLIER_ADDRESS

Summary

The steps in the normalization process are

1. Represent every relation as a table where only one value is present in each column of every row.
2. Remove nonfull dependencies. Where D depends on B alone, change (A*, B*, C, D) into (A*, B*, C) and (B*, D).
3. Remove transitive dependencies. Where C depends on B, change (A*, B, C) into (A*, B) and (B*, C).

Diagramming Record Relationships

Once you have ideas from your users about what records and fields you need, you can organize record types into a logically designed database. You'll probably find it helpful to draw a record relationship diagram, a simple way to illustrate a complex system. The ovals in Figure 2-11 represent the records, and the lines between them show how the records are related.

Records: DEPENDENT, DEPARTMENT, EMPLOYEE, CLIENT, PLACE, SCHED

Relationships:
* employee has dependents
* employee manages departments
* employee manages employees
* department has employees
* employee works with clients
* employee updates places
* client goes to places
* client has travel schedules

Figure 2-11: Record relationship diagram for TOUR database

Note: The Windows operating system does not include the PLACE record because it is a sectioned record, and Windows does not support sectioned records.

Planning Inter-Record Relationships

Think about how you want to store and retrieve records. When you retrieve your records, you may want to join some records together. Take it a step further and think of which fields you will use to join the records. At this point in the design, you can identify similar fields in different records. If you define them similarly, you can ensure that their values are stored the same way and can be used to join records.

Joining Records

Joins enable users to display related information in two or more separate records simultaneously. For example, in the TOUR database the DEPENDENT and EMPLOYEE record types can be joined, based on a common set of values for the ENO field.
Domains, ranges of values a field or logically similar fields could have, help ensure that a value stored in one record will be stored similarly in another. For example, if department number were stored as “100” in a DEPARTMENT record and “one-hundred” in an EMPLOYEE record, these records could not successfully be joined. The values are equivalent but not exactly equal because one record uses numeric data and the other uses character data.

You can define domains to describe value-specific parameters like data type, size, and legal value list for different fields having similar values. Use domains whenever you have similar fields in separate records that will be used for joins. You are not obligated to use domains, but they are strongly recommended. You should define a domain for each group of joins you have identified.

You can also use domains when you have two or more similar fields that will never be joined. Suppose you have a field for office that contains three-digit numbers between 100 and 200 and you also have a three-digit field of numbers 100 to 200 for parking lot slots. The two fields are totally unrelated and it would never make sense to join them. However, you could still use a domain to define the value-specific similarities. That way, you only need to define the parameters once in the domain instead of in each field definition. The field definition would simply refer to the domain.

Plan for any domains you may have for the joinable fields. Give each domain a meaningful name that indicates the type of data it describes.

Sometimes it doesn't make sense for a record type to exist unless its corresponding owner or parent record type exists. For example, it doesn't make sense for a company's personnel database to have a DEPENDENT record type if there is no corresponding EMPLOYEE record type. Record types that depend on other record types are sometimes called member or child record types.
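
The domain idea described in this section (one shared definition of data type, size, and legal values that several fields refer to) can be sketched in Python as follows. The `Domain` class, its attribute names, and the field names OFFICE and PARKING_SLOT are hypothetical; they only mimic the office and parking slot example and do not show how domains are actually defined in BASIS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Domain:
    """Value-specific parameters shared by similar fields (hypothetical structure)."""
    name: str
    data_type: type
    size: int   # maximum number of characters when the value is written out
    low: int    # smallest legal value
    high: int   # largest legal value

    def accepts(self, value):
        """True if the value has the right type, fits the size, and is in range."""
        return (
            isinstance(value, self.data_type)
            and len(str(value)) <= self.size
            and self.low <= value <= self.high
        )


# One definition, referenced by two unrelated fields.
three_digit = Domain("THREE_DIGIT_100_200", int, 3, 100, 200)
field_domains = {"OFFICE": three_digit, "PARKING_SLOT": three_digit}

print(field_domains["OFFICE"].accepts(150))            # True
print(field_domains["PARKING_SLOT"].accepts(250))      # False: out of range
print(field_domains["OFFICE"].accepts("one-hundred"))  # False: character data
```

The last call mirrors the “100” versus “one-hundred” problem: a shared domain rejects the mismatched representation when the value is stored, before it can cause a join to fail.
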
Dependencies among record types are sometimes called owner/member or parent/child relationships. Referential assertions, rules that govern relationships between record types, can control owner/member dependencies so that, for example, a DEPENDENT record cannot be added unless an EMPLOYEE record already exists and an EMPLOYEE record cannot be deleted until all of the associated DEPENDENT records for the employee have been deleted.

Concept Mining and Clustering

As the amount of information in document collections continues to expand, it becomes increasingly important to provide searchers with features beyond the standard ranked list of documents returned for a search query. Such features can help users find more relevant information and better understand the information and its context. To meet this need, BASIS has developed the Concept Mining and Clustering features. While interrelated, the Concept Mining and Clustering features may be used independently.

Concept Mining

Concept mining refers to the automatic extraction of concepts (keywords, phrases, personal names, company names, etc.) during the import of a document. Information is gleaned from the data as it is read in.

Concept mining gives users a context for their search results, giving them an overview of the companies, people, and other concepts that are significant in a set of documents without having to read any of the documents themselves. Users can also be alerted to different facets of the retrieved document set by examining its concept lists. For example, a search involving the word "diet" might retrieve some documents about weight-loss programs as well as some documents about nutrition. A concept list based on all of these documents might include different entries associated with these different topics, thus giving the user a quick indication that the search may need to be refined.
The types of concepts most commonly included in concept mining are:

* Keywords (based on occurrence frequencies)
* Phrases (statistically significant two-word combinations)
* Personal Names
* Company Names

The concept mining technology automatically extracts keywords, phrases, and company and personal names from text-image fields based on configurable parameter files. The extracted terms are stored in record fields which can be indexed and used for retrieval.

The DBA defines what types of concepts are to be extracted. The concept types that will be extracted from the data are defined in the concept mining initialization file (concept.ini), which you can customize to suit your individual or company's needs.

Company names are identified by an algorithm which recognizes names based on corporate suffixes, such as "Inc", "Corp", etc. Suffixes used in many different countries are recognized. A prepared list of company names can also be supplied in a configuration file to augment the algorithmic name recognition.

Personal names are identified using configuration files containing first and last names. The concept mining feature includes configuration files containing over 7,500 first names and over 18,000 last names. These files can be customized to add or delete any names to suit your needs.

In addition to extracting concepts, the concept mining feature assigns a numerical weight value to each extracted concept. This weight signifies the relevance or importance of that concept in the analyzed text. For example, suppose you used the Concept Mining feature on a document about Albert Einstein and one of the concept types you are using is “personal names”. Assuming his name occurs more often than any other name in the document, “Albert Einstein” would receive the highest weight of any personal name. A name that only occurs once in the document would receive a very low weight.
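
The suffix-based recognition of company names and the frequency-based weighting described above can be imitated with a short Python sketch. This is not the BASIS algorithm: the suffix list, the regular expression, the function name `company_weights`, and the rule of scaling each count by the highest count are all assumptions chosen to keep the example small.

```python
import re
from collections import Counter

# Assumed suffix list for illustration; the product reads its own configuration files.
SUFFIXES = ("Inc", "Corp")

# One or more capitalized words followed by a corporate suffix.
COMPANY = re.compile(r"\b((?:[A-Z][\w&]*\s+)+(?:" + "|".join(SUFFIXES) + r"))\b")


def company_weights(text):
    """Count each company name and scale the counts so the most frequent name gets 1.0."""
    counts = Counter(COMPANY.findall(text))
    if not counts:
        return {}
    top = max(counts.values())
    return {name: count / top for name, count in counts.items()}


sample = (
    "Acme Corp announced a merger. Analysts said Acme Corp would gain. "
    "The rival firm Widget Works Inc declined to comment."
)
weights = company_weights(sample)
print(weights)  # {'Acme Corp': 1.0, 'Widget Works Inc': 0.5}
```

A pattern this simple would also absorb a capitalized word that happens to precede a company name at the start of a sentence, which is one reason a prepared list of company names is useful alongside algorithmic recognition.
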
The weight values are stored in record fields used by the document Clustering feature to organize result sets based on frequency of concepts.

Clustering

When a user performs a search, the Clustering feature groups similar documents together into clusters. It does this by analyzing each document and building a representation of its topic, and then comparing the documents' representations with each other to determine which ones are most similar. For example, a general search on “sports” may be clustered into topics like “Football”, “Baseball”, “Basketball”, and so forth.

A typical use of clustering is when organizations want to improve their “Frequently Asked Questions” pages or automate responses to common customer email inquiries. They can use clustering to analyze historical customer emails and determine categories of frequently asked questions.

Clustering can give users an overview of the range of topics discussed in a set of documents and a list of subtopics. By browsing a list of subtopics found in the result set, users can quickly get an overview of the kinds of information contained in a result set. They don't need to read through a large number of documents to try to synthesize an overview.

Clustering can be useful for:

* Understanding better what is in the result set. For example, a search for information about retirement plans might result in a cluster focused on traditional pensions, one on IRAs, and one on 401(k) plans.
* Helping users to better break down a search into subclusters, allowing them to more effectively “drill down” into a topic.
* Helping DBAs to develop ideas about how to organize their document collections by topic.
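
The general idea of clustering (represent each document by its terms, compare the representations, and group the most similar documents) can be shown with a deliberately simple Python sketch. The manual does not describe the algorithm BASIS uses, so everything here is an assumption: documents are reduced to sets of terms, similarity is the Jaccard measure, grouping is a single greedy pass with an arbitrary threshold, and the documents are invented.

```python
def jaccard(a, b):
    """Similarity of two term sets: size of the overlap divided by size of the union."""
    return len(a & b) / len(a | b) if a | b else 0.0


def cluster(documents, threshold=0.3):
    """Greedy grouping: put each document in the first cluster whose seed is similar enough."""
    clusters = []  # each cluster is a list of titles; the first title is the seed
    for title, terms in documents.items():
        for group in clusters:
            if jaccard(terms, documents[group[0]]) >= threshold:
                group.append(title)
                break
        else:
            # no existing cluster was similar enough, so this document starts a new one
            clusters.append([title])
    return clusters


# Invented documents, each represented by a set of extracted terms.
docs = {
    "Pension basics": {"pension", "employer", "retirement"},
    "IRA rules": {"ira", "contribution", "retirement", "tax"},
    "Pension funding": {"pension", "employer", "funding"},
    "IRA limits": {"ira", "contribution", "limit"},
}
result = cluster(docs)
print(result)  # [['Pension basics', 'Pension funding'], ['IRA rules', 'IRA limits']]
```

Even this toy version reproduces the retirement plan example: the two pension documents and the two IRA documents end up in separate clusters although all four came from one search.
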
The result of a clustering operation is a list of clusters, where each cluster includes the following information:
- A cluster quality score indicating the cluster's cohesiveness
- A ranked list of the documents that were placed in the cluster, along with document quality scores indicating each document's similarity to the cluster
- The title of the most representative (highest-scoring) document in the cluster
- A list of key terms (with weights) describing the cluster

Evaluating the Database Design

After seeking information from users and organizing your records and fields, evaluate your final design with your users. How does the design stand up? Check these general characteristics to ensure that you have a good design. Keep in mind that these are only guidelines and that some applications may not be well suited to them.

Characteristics of good designs are:
- Single-purpose records. Make sure that you're not creating records with many purposes. An example of a multi-purpose record would be the combination of EMPLOYEE and DEPENDENT records into a single record type.
- Well-controlled use of data redundancy. Redundant data is helpful only when the redundancy is used to link two or more records.
- Proper use of optional fields. Avoid defining fields that are optional or required based on the data value of some other field. It's best to have a field that is always optional, rather than sometimes optional.
- One unique field per record. Single-purpose records will commonly have a unique field that can serve as the key.

Here are some characteristics that you might want to avoid:
- Multi-purpose records. If you have multi-purpose records, try to get a clearer understanding of the entities described and define additional record types rather than overburden a record type with fields that describe attributes of more than one entity.
- Multiple unique fields within a record. If you have accurately described an entity, it usually has one unique attribute.
- Too many records. Check to be sure you haven't unnecessarily split an entity between record types. If you do, you will constantly have to join the records together to get complete data.
- Excessive data redundancy. You probably want some data redundancy, especially for joining record types, but review each instance and make sure that it's really necessary.
- Unique fields that aren't unique. Review the unique fields and make sure the data will be unique.
- Required fields that should be optional. Make sure that data will always be available for each required field when you add record occurrences. Also, reassess the need for a required field and your reasoning for making it required.
- Optional fields that should be required. Take another look at your data. If you leave out a non-required field, will the rest of the record make sense?
- Too many optional fields. If you have too many optional fields in a record, you should consider using additional record types. Too many optional fields often indicate an entity in a state of flux or a multi-purpose record. Try to define the requirements more precisely, if possible.

Sharing the Design with the Users

Data Dictionary

After you've come up with a reasonable design, it's a good idea to let users know what you have in mind. A good vehicle for this is a “data dictionary.” The dictionary should explain your record relationships and your field specifications in language that your users can understand. You can derive information for a data dictionary from the record types you define in the Actual Data Model for your database.

Keep the data dictionary simple and general. Here are some guidelines for preparing a dictionary:
- Support your record relationship diagram and field layouts with a good description of the arrangement.
- Give only one definition per domain. Don't bother to explain what each and every field means when they all fall under the same domain. For example, explain what an employee number is, not DEPARTMENT employee number, EMPLOYEE employee number, and PLACES employee number. One explanation of the domain will cover each member field.
- Explain what validation rules are needed (pattern, legal, word list, code list, etc.). For more information, see Database Definition and Development, “Field Validation.”
- Identify the data types (numeric, character, text, etc.).
- Explain the purpose of each record.
- Describe any record-to-record dependencies.
- Describe any field-to-field dependencies.
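The "one definition per domain" guideline can be illustrated with a small Python sketch in which a domain is described once and several fields refer to it. This is a sketch under stated assumptions, not the product's data dictionary format: the field names (EMPNO, MANAGER_EMPNO), the six-digit pattern, and the Domain and Field structures are hypothetical and exist only for the example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Domain:
    """One shared definition covering every field that belongs to the domain."""
    name: str
    data_type: str
    description: str
    pattern: str  # validation rule, expressed here as a regular expression


@dataclass(frozen=True)
class Field:
    record: str
    name: str
    domain: Domain
    required: bool = True


# Hypothetical domain: the six-digit format is an assumption for illustration.
EMPLOYEE_NUMBER = Domain(
    name="employee number",
    data_type="numeric",
    description="Number that uniquely identifies an employee.",
    pattern=r"\d{6}",
)

# Member fields in three record types share the single domain definition.
# The field names are hypothetical.
FIELDS = [
    Field("EMPLOYEE", "EMPNO", EMPLOYEE_NUMBER),
    Field("DEPARTMENT", "MANAGER_EMPNO", EMPLOYEE_NUMBER),
    Field("PLACES", "EMPNO", EMPLOYEE_NUMBER, required=False),
]


def is_valid(field, value):
    """Check a value against the pattern rule of the field's domain."""
    return re.fullmatch(field.domain.pattern, value) is not None


def dictionary_entries(fields):
    """Return one plain-language entry per domain, listing its member fields."""
    members = {}
    for f in fields:
        members.setdefault(f.domain, []).append(f"{f.record}.{f.name}")
    return [
        f"{d.name} ({d.data_type}): {d.description} Used by: {', '.join(names)}"
        for d, names in members.items()
    ]
```

Calling dictionary_entries(FIELDS) produces a single entry for the employee number domain that lists all three member fields, so users read one explanation instead of three near-identical ones.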