Chapter 2
Designing a Database
Getting Started
The most important step in building a database is seeking input from users. If your users
get an efficient system that meets their needs, you have built a successful database.
Therefore, discovering what the users need will lay the foundation for your system.
Before you start building, make sure that you adequately:
* Gather information from users
* Define information requirements
* Develop an implementation strategy
Gather Information from Users
Check with the users to make sure you've accurately clarified the scope of the database.
How you collect user input (e.g., formal or informal interviews, questionnaires,
brainstorming sessions) depends on the size and composition of your user groups and
your personal preference. Whatever techniques you decide to use, keep in mind these
important points:

* Input should be secured from a truly representative sample of users
* Users should be made to feel that their input is important and that their needs are
  being seriously considered in the design of the database
This second point is particularly important. Users are likely to have a more positive
attitude toward a database they have helped to design.
Define Information Requirements
An analysis of typical database usage (interactive inquiries and reports) will provide a
comprehensive basis for determining which categories of information (records and fields)
should be included in your database. When you determine these fields, write out sample
data the way you think it should be displayed.
Also, make certain that data for each field you intend to include in the database is, in fact,
readily available. If you can't collect certain information in a timely manner or if it
requires a great deal of preparation time for data entry, you should reassess the need to
include it in the database.
Although you can add records and fields at any point in the life of a database, it is best to
minimize the need for such changes. A thorough analysis of user needs in the initial
planning phase is well worth the effort. Here are some examples of questions you can ask
your users that will help you come up with a workable design:

* What are the system outputs?
  Avoid describing the system in terms of inputs. The purpose of the system is to
  produce outputs, and you need to know what the system will help users do in broad
  terms.
* What are the major pieces of information and how are they related?
  This question should be very easy for your users. They know their information.
* What are the major business processes and how are they related?
  Users will probably feel comfortable describing this process in chronological order
  (e.g., for a mail order company, the employees would take the order, see if the
  products are in stock, and then ship the products). This description is fine, but make
  sure that you have a clear idea of the process in reverse order. For example, the last
  thing the order department does is print out the shipping labels. Think of what
  processes must happen before shipping (invoice must be matched with current stock,
  current stock lists must be updated daily, complete shipping information must be
  taken on the order, etc.). By looking at the final process first, you can assess what
  pieces of information the final process needs to obtain from the initial process.
* What information supports each process?
  You'll want to know where the information in each of these processes comes from
  (e.g., information taken over the phone for orders, clippings collected by the PR
  department, forms filled out by scientists).
Keep in mind that database design is a direct derivative of the system requirements. If
nobody knows what the requirements for the system are, any database design will do. If
you get the impression that a user just wants to store data, ask him or her, “Why? What
reports do you want? What are some typical queries? What does the data do for you?”
There is no such thing as an input-only system.
Develop an Implementation Strategy
You can ensure a successful transition from your existing information retrieval system to
your new system by carefully planning. That way you'll help minimize confusion and
avoid disruptive misunderstandings during implementation.
Here are some planning guidelines for implementation:
1. Establish procedures and schedules and assign responsibilities for the transition
   period (i.e., from the existing system to the new BASIS system).
2. Clarify ongoing responsibilities and schedules (e.g., data entry, updating, etc.) so that
   people will have a clear idea of what their responsibilities will be once the new
   database is implemented.
3. Consider establishing formal procedures for:
   * Orienting and training users
   * Monitoring database use
   * Receiving and acting on user recommendations
   * Keeping users up-to-date on changes made in the database and related issues
4. Prepare a detailed plan describing what you intend to do to encourage user
   involvement.
Understanding Relationships in a Database
If you understand how record types can be linked together in relational DBMSs, you can
build good design into your database.
Mappings
In a BASIS database, every field can be associated with other fields. These associations,
which are called mappings, are used to connect various fields to form the logical model or
structure of the database. There are three categories of mappings.
* A one-to-one mapping (1:1) exists when each value of field X is related to exactly
  one value of Y. It is said that X identifies Y. For example, in a data file of state
  names and their abbreviations, the CA value of field X is related to the CALIFORNIA
  value of field Y, and the OH value of field X is related to the OHIO value of field Y.
  If X identifies Y and Y identifies X, a one-to-one mapping exists between X and Y.
  A one-to-one relationship is commutative.
* A one-to-many mapping (1:M) exists when each value of field X is related to zero or
  more values of Y. The SQUARE ROOTS relation is one-to-many (actually 1:2)
  since each positive input value for X produces two square roots for Y. For example,
  a value of 9 for field X yields -3 and +3 values for field Y. Both 1:1 and 1:M
  mappings are considered simple mappings.
* A many-to-many mapping (M:M) is a complex mapping. In this case, each value of
  X is associated with many values of Y and each value of Y is associated with many
  values of X.
Figure 2-1: Mapping notation (simple mappings: one-to-one, 1:1, and one-to-many, 1:M;
complex mapping: many-to-many, M:M)
A database containing employee information could make use of all 3 kinds of
relationships. If there were no employees with the same name, a one-to-one or identity
relationship would exist between employee name and employee number. However, if the
company employed two or more people with the same name, the employee number would
identify the employee name but the relationship would not be commutative because
employee name would not identify employee number.
Employee Number    Employee Name
1001               Joe Davis
1823               John Smith
3724               Mary Jackson
4421               John Smith
Figure 2-2: A one-way identity relationship (not one-to-one)
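The difference between a one-way identity relationship and a true one-to-one mapping can be checked mechanically. The following minimal Python sketch is an illustration only: it is not BASIS syntax, and the function name identifies is an assumption made for this example. It uses the values from Figure 2-2.

```python
from collections import defaultdict

def identifies(pairs):
    """Return True if every left-hand value is paired with exactly one right-hand value."""
    seen = defaultdict(set)
    for left, right in pairs:
        seen[left].add(right)
    return all(len(rights) == 1 for rights in seen.values())

# (employee number, employee name) pairs from Figure 2-2
employees = [
    (1001, "Joe Davis"),
    (1823, "John Smith"),
    (3724, "Mary Jackson"),
    (4421, "John Smith"),
]

# Employee number identifies employee name ...
number_identifies_name = identifies(employees)
# ... but employee name does not identify employee number (two John Smiths).
name_identifies_number = identifies([(name, number) for number, name in employees])

# The mapping is one-to-one only if identification holds in both directions.
is_one_to_one = number_identifies_name and name_identifies_number
```

With these values the first check succeeds and the second fails, so the relationship is not commutative and therefore not one-to-one.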
Each project has one manager, but a project manager may be in charge of several
projects. Therefore, the relationship between a project manager and his or her projects is
one-to-many (1:M).
Project Managers: Linda Lemon, John Miller, Peter Smith
Projects: Data Entry System, Inventory Control, Monthly Reports, System Maintenance,
System Performance
Figure 2-3: A one-to-many relationship
A project will often have more than one employee working on it. Each employee could
easily be involved in several projects at the same time. So, the relationship between
projects and employees is many-to-many (M:M). An employee would have a list of many
projects with which he or she is associated, and for each project we could list the
employees who are working on it.
Employee Name: Henry Bush, Tim Jones, Paula Fox, Joan Dark, Carol White, Susan Smith,
Shirley Lewis, Pat Cook
Projects: Data Entry System, Inventory Control, System Maintenance, System Performance
Figure 2-4: A many-to-many relationship
We recommend that you index fields frequently used to join different record definitions.
The following chart shows how the three types of relationships can be represented with
indexed fields.
Table 2-1: Relationships and their index types
Mapping    Field X         Field Y
1:1        Unique Index    Unique Index
1:M        Unique Index    Exact Index
M:M        Exact Index     Exact Index
These mappings can be found in inter-record relationships and are easily represented by
using indexes, but they also appear in intra-record relationships (i.e., associations among
fields of the same record). For example, the employee number:employee name
relationship is an intra-record relationship. The following chart shows how intra-record
relationships can be modeled using simple and compound fields.
Table 2-2: Intra-record relationships expressed by field types
Mapping    Field X           Field Y
1:1        Simple Field      Simple Field
1:M        Simple Field      Compound Field
M:M        Compound Field    Compound Field
Understanding Database Structures
Entities and relationships form a logical database structure. These structures are
classified as follows:
SIMPLE
TREE (HIERARCHICAL)
PLEX (NETWORK)
Simple Structure
A simple structure has only one record definition, so there are no inter-record
relationships. This type of database describes only one kind of entity. For example, a
bibliographic database will often use a simple structure.
Tree Structure
A tree structure is built from several record definitions in which all of the inter-record
relationships are one-to-one or one-to-many. Many management structures and
classification systems use these tree or hierarchical structures to arrange information.
Figure 2-5: A tree structure
An inverted tree structure is traversed by starting at the root node (or entity) and working
down to the terminating leaf nodes. At each node there is only one way to go up, but there
can be several ways to go down. In a database that uses a hierarchical structure there will
be many instances of tree structures.
Figure 2-6: An inverted tree structure
Plex Structure
A plex structure is built from several record definitions in which the inter-record
relationships are 1:1, 1:M, and M:M. It is the most general type of structure and can be
used to model any situation. Simple and tree structures are actually special cases of plex
structures. A plex is often called a “network.”
Figure 2-7: Plex structures
One interesting kind of plex is called a “Bill-of-Materials” structure. In this plex the parts
that are used to form an object are listed for each object. Each part may have subparts
and can be used in more than one object. This leads to M:M relationships in some cases,
but when we are trying to determine what a particular object is made from, it is easy to
view the structure as a tree.
Products: Product A, Product B
Subassemblies: Subassembly 1, Subassembly 2
Parts: Part W, Part X, Part Y, Part Z
Figure 2-8: Example of a bill-of-materials plex
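The idea that a bill-of-materials plex can be viewed as a tree for any one object can be sketched in a few lines of Python. This is an illustration only, not BASIS syntax. The item names come from Figure 2-8, but the links between them are hypothetical, because the connecting lines of the figure are not reproduced here. The names bom, explode, and used_in are assumptions made for this example.

```python
# Hypothetical links: which components go directly into each item.
bom = {
    "Product A": ["Subassembly 1", "Part W"],
    "Product B": ["Subassembly 1", "Subassembly 2"],
    "Subassembly 1": ["Part X", "Part Y"],
    "Subassembly 2": ["Part Y", "Part Z"],
}

def explode(item, depth=0):
    """Return the tree view of one item as a list of (depth, name) pairs."""
    rows = [(depth, item)]
    for component in bom.get(item, []):
        rows.extend(explode(component, depth + 1))
    return rows

def used_in(component):
    """Return the items that directly use a component."""
    return sorted(parent for parent, parts in bom.items() if component in parts)

# Looking down from one product, the structure reads as a tree.
tree_for_product_a = explode("Product A")

# Looking up from a shared part, there is more than one parent,
# which is what makes the overall structure a plex rather than a tree.
parents_of_part_y = used_in("Part Y")
```

Under these assumed links, Part Y is used by both subassemblies and Subassembly 1 is used by both products, which gives the M:M relationships described above.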
Understanding Relational Databases
The relational approach to database management is based on mathematical concepts
involving sets and relations. You can use the BASIS system to create and manage any
relational database.
Note: Relational database theory and normalization, introduced in the rest of this
chapter, are complex topics. Attempts to provide a comprehensive understanding of them
are beyond the scope of this manual. For more information about relational database
theory and normalization, consult the books listed at the end of the Preface of this manual.
Relational database theory uses terminology taken from the mathematical theory of
relations. A relational database is said to consist of a number of relations. Relations are
usually represented as tables in which the rows are called tuples and the columns are
called attributes. The set of values that an attribute can take is called its domain.
The degree of the relation is the number of columns in the relation, and the cardinality
of a relation is the number of rows it has.
A relation can be represented by a table. In the following table the column headings
EMP_NAME, ENO, and SALARY represent attributes. Because the table has 3 columns,
its degree is 3. Because the table has 5 rows or tuples, its cardinality is 5.
Table 2-3: A table is often used to represent a relation
EMP_NAME         ENO     SALARY
Davis, Joe       1001    8000
Smith, John      1823    2000
Martin, Peter    2458    1050
Jackson, Mary    3724    2700
Adams, Paula     5734    1100
In BASIS, each relation corresponds to a record definition and each tuple is a record.
The attributes are fields.
Tables that represent relations have the following properties:
1. Each row represents a complete piece of information.
2. No two rows are identical.
3. The order of the rows is unimportant.
4. The order of the columns is not essential, provided that only complete columns are
   interchanged.
5. Only certain values are allowed in each column.
6. Adding or removing rows does not change the meaning of the relation.
7. Adding or removing columns changes the meaning of the relation.
8. Certain rows may not be allowed if they violate an integrity constraint.
Some column or combination of columns will uniquely identify every row. This is called
a candidate key. One candidate key will be used as the primary key. In Table 2-3 the
primary key is the ENO column (field).
Set operations manipulate relations. The two principal set operations are projections and
joins.
Projection
A projection selects certain columns from a relation and puts them in a specific order.
Projections form new relations.
Table 2-4: Two projections of the employee relation
EMPLOYEE
ENO     NAME             JOB          SALARY    MGR-NO    DNO
1001    Davis, Joe       President    8000      1000      100
1823    Smith, John      Salesman     2000      3724      201
2458    Martin, Peter    Clerk        1050      2843      301
3724    Jackson, Mary    Manager      2700      1001      201
5734    Adams, Paula     Clerk        1101      3724      201
PROJECTION-A
NAME             ENO     SALARY
Davis, Joe       1001    8000
Smith, John      1823    2000
Martin, Peter    2458    1050
Jackson, Mary    3724    2700
Adams, Paula     5734    1101
PROJECTION-B
MGR-NO    DNO
1000      100
3724      201
2843      301
1001      201
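The two projections in Table 2-4 can be reproduced with a short Python sketch. It is an illustration of the set operation only, not BASIS syntax, and the function name project is an assumption made for this example. Because a projection forms a new relation, and no two rows of a relation are identical, duplicate rows are dropped.

```python
def project(rows, columns):
    """Select the given columns, in the given order, and drop duplicate rows."""
    seen = set()
    result = []
    for row in rows:
        values = tuple(row[name] for name in columns)
        if values not in seen:
            seen.add(values)
            result.append(dict(zip(columns, values)))
    return result

# The EMPLOYEE relation from Table 2-4
FIELDS = ["ENO", "NAME", "JOB", "SALARY", "MGR-NO", "DNO"]
employee = [dict(zip(FIELDS, values)) for values in [
    (1001, "Davis, Joe", "President", 8000, 1000, 100),
    (1823, "Smith, John", "Salesman", 2000, 3724, 201),
    (2458, "Martin, Peter", "Clerk", 1050, 2843, 301),
    (3724, "Jackson, Mary", "Manager", 2700, 1001, 201),
    (5734, "Adams, Paula", "Clerk", 1101, 3724, 201),
]]

projection_a = project(employee, ["NAME", "ENO", "SALARY"])
projection_b = project(employee, ["MGR-NO", "DNO"])
```

PROJECTION-B has four rows rather than five because two employees share the same manager number and department number.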
Join
A join forms a relation from two or more other relations that share a common attribute
(field). There are several types of joins: the outer join (directed join), the theta-join, and
the equi-join. The equi-join, described below, is the most commonly used. Two relations
are equi-joined using a common attribute, and those tuples that share the same value for
the attribute are used to create the resulting relation.
Given relations X, Y, and Z below:
Table 2-5: Examples of equi-joins
X
a    b    c
U    1    Red
V    2    Blue
W    3    Green
X    5    Yellow
Z    7    Orange
Y
b    d
1    Cherry
3    Grape
6    Lemon
Z
a    b
L    1
A    4
U    6
B    8
W    9
We can make the following equi-joins:
X(b,c) join Y(b,d) on b
b    c        d
1    Red      Cherry
3    Green    Grape
X(a,b,c) join Z(a) on a
a    b    c
U    1    Red
W    3    Green
Y(b,d) join Z(b,a) on b
b    d         a
1    Cherry    L
6    Lemon     U
X(b,c) join Y(d) join Z(a) on b
b    c      d         a
1    Red    Cherry    L
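The equi-joins above can be checked with a minimal Python sketch. It illustrates the operation only and is not BASIS syntax. The helper names pick and equi_join are assumptions made for this example, and the nested loop is chosen for clarity rather than speed.

```python
def pick(rows, columns):
    """Keep only the named columns of each row."""
    return [{name: row[name] for name in columns} for row in rows]

def equi_join(left, right, on):
    """Combine every pair of rows that share the same value for the attribute `on`."""
    return [
        {**left_row, **right_row}
        for left_row in left
        for right_row in right
        if left_row[on] == right_row[on]
    ]

# Relations X, Y, and Z from Table 2-5
X = [dict(zip("abc", values)) for values in [
    ("U", 1, "Red"), ("V", 2, "Blue"), ("W", 3, "Green"),
    ("X", 5, "Yellow"), ("Z", 7, "Orange"),
]]
Y = [dict(zip("bd", values)) for values in [(1, "Cherry"), (3, "Grape"), (6, "Lemon")]]
Z = [dict(zip("ab", values)) for values in [
    ("L", 1), ("A", 4), ("U", 6), ("B", 8), ("W", 9),
]]

x_join_y = equi_join(pick(X, ["b", "c"]), Y, "b")      # X(b,c) join Y(b,d) on b
x_join_z = equi_join(X, pick(Z, ["a"]), "a")           # X(a,b,c) join Z(a) on a
y_join_z = equi_join(Y, Z, "b")                        # Y(b,d) join Z(b,a) on b
x_join_y_join_z = equi_join(x_join_y, Z, "b")          # X(b,c) join Y(d) join Z(a) on b
```

Only tuples whose join value appears in both relations survive, which is why the three-way join on b leaves a single row.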
Normalization
The process of forming records and determining inter-record relationships can be
complex. One method for designing relational databases is called normalization. This
method can easily be applied to any database structure.
Normalization is a technique you can use to eliminate insertion, deletion, and update
anomalies that could exist in a database because of various functional dependencies
among attributes of entities.
Insertion, deletion, and update anomalies can occur when a relation is used to represent
more than one fact. Let's examine a relation called ORDER that contains the following
attributes:
ORDER_NUMBER
ORDER_DATE
PRODUCT_NAME
PRODUCT_PRICE
SUPPLIER_NAME
SUPPLIER_ADDRESS
The ORDER relation describes a purchase, but since product and supplier information is
also present, it tracks suppliers of products. In this case, the relation represents more than
one fact, so various anomalies could occur.
An insertion anomaly occurs whenever we add new supplier information without filling in
the order data. What does a null value for ORDER_NUMBER or ORDER_DATE
signify? In fact, the ORDER_NUMBER would probably be required. So how do we add
a new product supplier without an order?
A deletion anomaly occurs if the deletion of an order destroys the only information
available about a supplier of a product. This would happen if there were no other orders
for the same product.
Update anomalies occur when one attribute must be changed in many places. If the price
of a product needs to be corrected, changing the price may require updates to several
orders. This may not be easy if a number of orders change while we are trying to fix the
price.
If you are designing a complex database, you must define each relation so it represents
only one fact. To do this, study the dependencies that exist among the attributes of a
relation. Determine which attributes are dependent on other attributes. For example,
attribute B of a relation R is functionally dependent on attribute A of R if, at every instant
of time, each value in A has no more than one value in B associated with it in relation R.
This means that if B is functionally dependent on A, then A identifies B. There is a
simple correspondence between A and B (A -> B).
The attributes of set B (which could contain one attribute) of a relation R are fully
functionally dependent on set A of attributes in R if B is functionally dependent on the
whole of A but not on any subset of A. We cannot uniquely identify B without using all
of the attributes of set A.
When you examine a relation, find one or more groups of attributes on which all of the
other attributes depend. Such a group (possibly one attribute) is called a candidate key of
the relation. Any attribute that is used to form a candidate key is called a prime attribute.
The value of the key uniquely identifies each tuple in the relation. Every attribute in the
group that defines the candidate key must be required. If any attribute were eliminated
from the group, those remaining could not form a candidate key that uniquely identifies
each tuple. If more than one candidate key exists for a relation, one of the candidate keys
is designated as the primary key. Each record type must have a primary key field.
For example, to determine the functional dependencies that exist in two relations, look at
the ORDER relation used earlier and a subset of it called the PRODUCT relation.
ORDER:
  ORDER_NUMBER
  ORDER_DATE
  PRODUCT_NAME
  PRODUCT_PRICE
  SUPPLIER_NAME
  SUPPLIER_ADDRESS
PRODUCT:
  PRODUCT_NAME
  SUPPLIER_NAME
  SUPPLIER_ADDRESS
  PRODUCT_PRICE
We must study each attribute of a relation and decide which of the other attributes in the
relation uniquely identify it. This diagram shows the dependencies.
ORDER_NUMBER* -> ORDER_DATE, PRODUCT_NAME, PRODUCT_PRICE, SUPPLIER_NAME,
                 SUPPLIER_ADDRESS
SUPPLIER_NAME -> SUPPLIER_ADDRESS
(PRODUCT_NAME, SUPPLIER_NAME) -> PRODUCT_PRICE
Figure 2-9: Dependencies in ORDER relation
The * indicates which attributes are prime attributes and the arrows are used to show
which attributes can be used to identify other attributes. All of the attributes are identified
by the ORDER_NUMBER. The SUPPLIER_ADDRESS can be determined from
knowing only the SUPPLIER_NAME, and the PRODUCT_PRICE is identified by using
both the PRODUCT_NAME and the SUPPLIER_NAME together.
The dependencies of the PRODUCT relation are:
(PRODUCT_NAME*, SUPPLIER_NAME*) -> PRODUCT_PRICE
SUPPLIER_NAME* -> SUPPLIER_ADDRESS
Figure 2-10: Dependencies in PRODUCT relation
Notice that the SUPPLIER_ADDRESS is identified by the SUPPLIER_NAME but that
both the SUPPLIER_NAME and the PRODUCT_NAME are required to determine the
PRODUCT_PRICE.
These diagrams show the type of functional dependencies that we want to eliminate.
Since the PRODUCT_NAME with the SUPPLIER_NAME is the only group of attributes
that will uniquely identify the other attributes in the PRODUCT relation, they form the
only candidate key. (By default, they also form the primary key.) But the
SUPPLIER_ADDRESS is not fully functionally dependent on this key. It is identified by
the SUPPLIER_NAME alone. We would like to remove this nonfull dependence.
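The difference between a full and a nonfull dependence in the PRODUCT relation can be tested against sample data. The Python sketch below is an illustration only, not BASIS syntax. The rows are hypothetical values invented for this example, and the function name functionally_dependent is an assumption. Sample data can disprove a dependency but cannot prove that it holds at every instant of time.

```python
def functionally_dependent(rows, determinant, dependent):
    """Return True if each determinant value is associated with only one dependent value."""
    seen = {}
    for row in rows:
        key = tuple(row[name] for name in determinant)
        value = tuple(row[name] for name in dependent)
        # Remember the first value seen for this key; any different value breaks the dependency.
        if seen.setdefault(key, value) != value:
            return False
    return True

# Hypothetical rows of the PRODUCT relation
product = [
    {"PRODUCT_NAME": "Widget", "SUPPLIER_NAME": "Acme",
     "SUPPLIER_ADDRESS": "1 Main St", "PRODUCT_PRICE": 10},
    {"PRODUCT_NAME": "Widget", "SUPPLIER_NAME": "Bolt",
     "SUPPLIER_ADDRESS": "9 Oak Ave", "PRODUCT_PRICE": 12},
    {"PRODUCT_NAME": "Gadget", "SUPPLIER_NAME": "Acme",
     "SUPPLIER_ADDRESS": "1 Main St", "PRODUCT_PRICE": 25},
]

KEY = ["PRODUCT_NAME", "SUPPLIER_NAME"]

# PRODUCT_PRICE needs the whole key: neither part of the key identifies it alone.
price_by_key = functionally_dependent(product, KEY, ["PRODUCT_PRICE"])
price_by_product = functionally_dependent(product, ["PRODUCT_NAME"], ["PRODUCT_PRICE"])
price_by_supplier = functionally_dependent(product, ["SUPPLIER_NAME"], ["PRODUCT_PRICE"])

# SUPPLIER_ADDRESS is identified by part of the key, which is the nonfull dependence.
address_by_supplier = functionally_dependent(product, ["SUPPLIER_NAME"], ["SUPPLIER_ADDRESS"])
```

In this sample the price depends on the whole key, while the address depends on SUPPLIER_NAME alone.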
In the ORDER relation, it is easy to see that the ORDER_NUMBER is the only good key.
Problems appear in this relation. Notice that we can identify the SUPPLIER_ADDRESS
by knowing the SUPPLIER_NAME, which is not a candidate key. This could cause the
address to occur in many tuples and lead to anomalies. This is called a transitive
dependence and must be removed.
The normalization process is used to remove nonfull and transitive dependencies. This
eliminates many possible anomalies. Various degrees of normalization are derived by
looking for dependencies and converting a database into normal form. Many databases
are normalized to third normal form. Higher degrees of normalization have been
discovered. They involve multi-valued dependencies, but they will not be explained here.
First Normal Form
Getting a database into first normal form is fairly easy. We must eliminate any repeating
groups and use simple fields for each attribute of every relation. We must be able to
represent the relations as tables where only one value is present in each column of every
row. This sometimes requires us to invent new fields that are needed to help uniquely
identify each value that was a member of a repeating group and now resides in its own
record.
Starting with the relation EMPLOYEE where DEPENDENT and PROJECT_NO are
repeating groups:
EMPLOYEE:
ENO *         Occurs (1) Time
NAME          Occurs (1) Time
SALARY        Occurs (1) Time
DEPENDENT     Occurs (0:20) Times
PROJECT_NO    Occurs (0:15) Times
To “flatten” the original EMPLOYEE relation, we need to create three dependent
relations:
EMPLOYEE:
  ENO *
  NAME
  SALARY
DEPENDENT:
  DEPENDENT_NAME
  DEPENDENT_NO *
  ENO *
ASSIGNMENT:
  PROJECT_NO *
  ASSIGNMENT_NO *
  ENO
(The prime attributes are denoted by a *.) The name of each dependent is put in a
separate record and so is each project number.
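The flattening step can be sketched in Python as follows. This is an illustration only, not BASIS syntax. The sample values are hypothetical, and the sketch assumes that the invented fields DEPENDENT_NO and ASSIGNMENT_NO are simple sequence numbers within each employee, which the text does not specify.

```python
def flatten(record):
    """Split one EMPLOYEE record with repeating groups into first normal form rows."""
    employee = {name: record[name] for name in ("ENO", "NAME", "SALARY")}
    # Each member of a repeating group moves to its own record, carrying the
    # owner's ENO plus an invented sequence number (an assumption of this sketch).
    dependents = [
        {"ENO": record["ENO"], "DEPENDENT_NO": number, "DEPENDENT_NAME": name}
        for number, name in enumerate(record["DEPENDENT"], start=1)
    ]
    assignments = [
        {"ENO": record["ENO"], "ASSIGNMENT_NO": number, "PROJECT_NO": project}
        for number, project in enumerate(record["PROJECT_NO"], start=1)
    ]
    return employee, dependents, assignments

# Hypothetical unnormalized record: DEPENDENT and PROJECT_NO are repeating groups.
unnormalized = {
    "ENO": 1001,
    "NAME": "Davis, Joe",
    "SALARY": 8000,
    "DEPENDENT": ["Ann", "Bob"],
    "PROJECT_NO": [17, 42],
}

employee, dependents, assignments = flatten(unnormalized)
```

After flattening, every field in every row holds exactly one value, and each DEPENDENT or ASSIGNMENT row can still be tied back to its employee through ENO.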
Second Normal Form
The goal of normalization is to get the database into third normal form. We use tables
that are in first normal form to determine the existing functional dependencies.
According to theory, we should first eliminate all of the nonfull dependencies, which
leaves us in second normal form. Next we should eliminate the transitive dependencies,
which converts the database to third normal form. In actual practice we will
simultaneously eliminate both kinds of dependencies and go directly from first to third
normal form.
Third Normal Form
We know the database is in third normal form if every determinant for the relation is a
candidate key. (A determinant is a group of attributes that identifies another attribute.)
To convert the ORDER relation to third normal form, we need to have three relations.
ORDER:
  ORDER_NUMBER *
  ORDER_DATE
  PRODUCT_NAME
  SUPPLIER_NAME
PRODUCT:
  PRODUCT_NAME *
  SUPPLIER_NAME *
  PRODUCT_PRICE
SUPPLIER:
  SUPPLIER_NAME *
  SUPPLIER_ADDRESS
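The effect of this decomposition can be seen by splitting a few sample rows. The Python sketch below is an illustration only, not BASIS syntax. The order rows are hypothetical values invented for this example, and the helper name project is an assumption.

```python
def project(rows, columns):
    """Select the given columns and drop duplicate rows."""
    seen = set()
    result = []
    for row in rows:
        values = tuple(row[name] for name in columns)
        if values not in seen:
            seen.add(values)
            result.append(dict(zip(columns, values)))
    return result

FIELDS = ["ORDER_NUMBER", "ORDER_DATE", "PRODUCT_NAME",
          "PRODUCT_PRICE", "SUPPLIER_NAME", "SUPPLIER_ADDRESS"]

# Hypothetical rows of the original ORDER relation: price and address repeat.
orders = [dict(zip(FIELDS, values)) for values in [
    (1, "2001-03-05", "Widget", 10, "Acme", "1 Main St"),
    (2, "2001-03-06", "Widget", 10, "Acme", "1 Main St"),
    (3, "2001-03-06", "Gadget", 25, "Acme", "1 Main St"),
]]

# The three relations in third normal form
order_relation = project(
    orders, ["ORDER_NUMBER", "ORDER_DATE", "PRODUCT_NAME", "SUPPLIER_NAME"])
product_relation = project(
    orders, ["PRODUCT_NAME", "SUPPLIER_NAME", "PRODUCT_PRICE"])
supplier_relation = project(
    orders, ["SUPPLIER_NAME", "SUPPLIER_ADDRESS"])
```

In this sample the supplier address is stored once instead of three times, and each product price is stored once per product and supplier, so correcting either value touches a single row.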
Summary
The steps in the normalization process are:
1. Represent every relation as a table where only one value is present in each column of
   every row.
2. Remove nonfull dependencies.
   Change (A*, B*, C, D), where D depends on B alone, into (A*, B*, C) and (B*, D).
3. Remove transitive dependencies.
   Change (A*, B, C), where C depends on B, into (A*, B) and (B*, C).
Diagramming Record Relationships
Once you have ideas from your users about what records and fields you need, you can
organize record types into a logically designed database. You'll probably find it helpful
to draw a record relationship diagram, a simple way to illustrate a complex system. The
ovals in Figure 2-11 represent the records, and the lines between them show how the
records are related.
Records: DEPENDENT, DEPARTMENT, EMPLOYEE, CLIENT, PLACE, SCHED
Relationships:
* employee has dependents
* employee manages departments
* department has employees
* employee manages employees
* employee works with clients
* employee updates places
* client goes to places
* client has travel schedules
Figure 2-11: Record relationship diagram for TOUR database
Note: On the Windows operating system, the TOUR database does not include the PLACE
record because it is a sectioned record, and Windows does not support sectioned records.
Planning Inter-Record Relationships
Think about how you want to store and retrieve records. When you retrieve your records,
you may want to join some records together. Take it a step further and think of which
fields you will use to join the records. At this point in the design, you can identify similar
fields in different records. If you define them similarly, you can ensure that their values
are stored the same way and can be used to join records.
Joining Records
Joins enable users to display related information in two or more separate records
simultaneously. For example, in the TOUR database the DEPENDENT and
EMPLOYEE record types can be joined, based on a common set of values for the ENO
field.
Domains, which are the ranges of values that a field or a group of logically similar
fields can have, help ensure that a value stored in one record will be stored the same
way in another. For example, if department number were stored as “100” in a
DEPARTMENT record and “one-hundred” in an EMPLOYEE record, these records could not
successfully be joined. The values are equivalent but not exactly equal because one
record uses numeric data and the other uses character data. You can define domains to
describe value-specific parameters like data type, size, and legal value list for
different fields having similar values.
Use domains whenever you have similar fields in separate records that will be used for
joins. You are not obligated to use domains, but they are strongly recommended. You
should define a domain for each group of joins you have identified.
You can also use domains when you have two or more similar fields that will never be
joined. Suppose you have a field for office numbers that contains three-digit numbers
between 100 and 200, and you also have a three-digit field of numbers from 100 to 200
for parking lot slots. The two fields are unrelated, and it would never make sense to
join them. However, you could still use a domain to define the value-specific
similarities. That way, you only need to define the parameters once in the domain
instead of in each field definition. The field definition would simply refer to the domain.
Plan for any domains you may have for the joinable fields. Give each domain a
meaningful name that indicates the type of data it describes.
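The idea of defining value-specific parameters once and referring to them from several fields can be sketched in Python. This is an illustration only and does not show how domains are defined in BASIS. The class Domain, the domain name THREE_DIGIT_NUMBER, and the field names OFFICE and PARKING_SLOT are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Domain:
    """Value-specific parameters shared by similar fields (hypothetical model)."""
    name: str
    data_type: type
    size: int      # maximum number of characters when displayed
    low: int       # smallest legal value
    high: int      # largest legal value

    def accepts(self, value):
        """Return True if the value has the right type, size, and range."""
        return (
            isinstance(value, self.data_type)
            and len(str(value)) <= self.size
            and self.low <= value <= self.high
        )

# The parameters are defined once in the domain ...
THREE_DIGIT_NUMBER = Domain("THREE_DIGIT_NUMBER", int, size=3, low=100, high=200)

# ... and each field definition simply refers to the domain.
field_domains = {
    "OFFICE": THREE_DIGIT_NUMBER,
    "PARKING_SLOT": THREE_DIGIT_NUMBER,
}

def is_legal(field, value):
    """Check a value against the domain of the named field."""
    return field_domains[field].accepts(value)
```

Under this sketch, is_legal("OFFICE", 150) is accepted, while is_legal("OFFICE", "one-hundred") and is_legal("PARKING_SLOT", 250) are rejected, so character data and out-of-range numbers never reach either field.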
Sometimes it doesn't make sense for a record type to exist unless its corresponding owner
or parent record type exists. For example, it doesn't make sense for a company's
personnel database to have a DEPENDENT record type if there is no corresponding
EMPLOYEE record type. Record types that depend on other record types are sometimes
called member or child record types. Dependencies among record types are sometimes
called owner/member or parent/child relationships.
Referential assertions, rules that govern relationships between record types, can control
owner/member dependencies so that, for example, a DEPENDENT record cannot be
added unless an EMPLOYEE record already exists and an EMPLOYEE record cannot be
deleted until all of the associated DEPENDENT records for the employee have been
deleted.
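The two rules in this example can be sketched in Python. This is an illustration of the behavior only and does not show how referential assertions are defined in BASIS. The in-memory stores, the function names, and the exception ReferentialError are hypothetical.

```python
class ReferentialError(Exception):
    """Raised when an operation would break an owner/member dependency."""

employees = {}    # ENO -> employee name (owner records)
dependents = []   # (ENO, dependent name) pairs (member records)

def add_employee(eno, name):
    employees[eno] = name

def add_dependent(eno, name):
    # A DEPENDENT record cannot be added unless its EMPLOYEE record already exists.
    if eno not in employees:
        raise ReferentialError(f"no EMPLOYEE record with ENO {eno}")
    dependents.append((eno, name))

def delete_dependent(eno, name):
    dependents.remove((eno, name))

def delete_employee(eno):
    # An EMPLOYEE record cannot be deleted while DEPENDENT records still refer to it.
    if any(owner == eno for owner, _ in dependents):
        raise ReferentialError(f"EMPLOYEE {eno} still has DEPENDENT records")
    del employees[eno]
```

With these rules in place, the only way to remove an employee who has dependents is to delete the DEPENDENT records first, which keeps member records from being orphaned.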
Concept Mining and Clustering
As the amount of information in document collections continues to expand, it becomes
increasingly important to provide searchers with additional features beyond the standard
ranked list of documents in response to a search query—features which can help users
find more relevant information and help them understand better the information and its
context. To meet this need, BASIS has developed the Concept Mining and Clustering
features. While interrelated, the Concept Mining and Clustering features may be used
independently.
Concept Mining
Concept mining refers to the automatic extraction of concepts (keywords, phrases,
personal names, company names, etc.) during the import of a document. Information is
gleaned from the data as it is read in. Concept mining gives users a context for their
search results, giving them an overview of the companies, people, and other concepts that
are significant in a set of documents without having to read any of the documents
themselves. Users can also be alerted to different facets of the retrieved document set by
examining its concept lists. For example, a search involving the word "diet" might
retrieve some documents about weight-loss programs as well as some documents about
nutrition. A concept list based on all of these documents might include different entries
associated with these different topics, thus giving the user a quick indication that the
search may need to be refined.
The types of concepts most commonly included in concept mining are:

* Keywords (based on occurrence frequencies)
* Phrases (statistically significant two-word combinations)
* Personal Names
* Company Names
The concept mining technology automatically extracts keywords, phrases, and company
and personal names from text-image fields based on configurable parameter files. The
extracted terms are stored in record fields which can be indexed and used for retrieval.
The DBA defines what types of concepts are to be extracted. The concept types that will
be extracted from the data are defined in the concept mining initialization file
(concept.ini), which you can customize to suit your individual or company's needs.
Company names are identified by an algorithm which recognizes names based on
corporate suffixes, such as "Inc", "Corp", etc. Suffixes used in many different countries
are recognized. A prepared list of company names can also be supplied in a configuration
file to augment the algorithmic name recognition.
Personal names are identified using configuration files containing first and last names.
The concept mining feature includes configuration files containing over 7500 first names
and over 18,000 last names. These files can be customized to add or delete any names to
suit your needs.
In addition to extracting concepts, the concept mining feature assigns a numerical weight
value to each extracted concept. This weight signifies the relevance or importance of that
concept in the analyzed text. For example, suppose you used the Concept Mining feature
on a document about Albert Einstein and one of the concept types you are using is
“personal names”. Assuming his name occurs more often in the document than any other
name, “Albert Einstein” would receive the highest weight of any personal name. A
name that occurs only once in the document would receive a very low weight. The weight
values are stored in record fields used by the document Clustering feature to organize
result sets based on frequency of concepts.
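The relationship between how often a name occurs and the weight it receives can be illustrated with a small Python sketch. The actual weighting formula used by the Concept Mining feature is not described here, so the sketch simply assumes that a weight is a name's share of all name occurrences. The function name name_weights, the sample text, and the name list are hypothetical.

```python
def name_weights(text, known_names):
    """Weight each known name by its share of all name occurrences (assumed formula)."""
    counts = {name: text.count(name) for name in known_names}
    total = sum(counts.values())
    if total == 0:
        return {}
    # Names that never occur are left out rather than given a zero weight.
    return {name: count / total for name, count in counts.items() if count}

# Hypothetical document text and list of known personal names
document = (
    "Albert Einstein was born in Ulm. Albert Einstein later met Niels Bohr. "
    "Albert Einstein received the Nobel Prize in Physics."
)
known_names = ["Albert Einstein", "Niels Bohr", "Marie Curie"]

weights = name_weights(document, known_names)
top_name = max(weights, key=weights.get)
```

In this sample the name that occurs three times receives a higher weight than the name that occurs once, and a name that never occurs is not extracted at all.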
Clustering
When a user performs a search, the Clustering feature groups similar documents
into clusters. It does this by analyzing each document and building a representation of
its topic, and then comparing the documents' representations with each other to determine
which ones are most similar. For example, the results of a general search on “sports” may
be clustered into topics like “Football”, “Baseball”, “Basketball”, and so forth.
A typical use of clustering is when organizations want to improve their “Frequently Asked
Questions” pages or automate responses to common customer email inquiries. They can
use clustering to analyze historical customer emails and determine categories of
frequently asked questions.
Clustering can give users an overview of the range of topics discussed in a set of
documents, along with a list of subtopics. By browsing the subtopics found in the result
set, users can quickly see what kinds of information it contains without reading through
a large number of documents. Clustering can be useful for:

Understanding better what is in the result set. For example, a search for
information about retirement plans might result in a cluster focused on
traditional pensions, one on IRAs, and one on 401(k) plans.

Helping users to break down a search into subclusters, allowing them to
more effectively “drill down” into a topic.

Helping DBAs to develop ideas about how to organize their document
collections by topic.
The result of a clustering operation is a list of clusters, where each cluster includes the
following information:

A cluster quality score indicating the cluster's cohesiveness

A ranked list of the documents that were placed in the cluster, along with
document quality scores indicating each document's similarity to the cluster

The title of the most representative (highest-scoring) document in the cluster

A list of key terms (with weights) describing the cluster
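The four pieces of information listed above could be modeled with a structure like the one below. The class and attribute names are hypothetical and do not come from the product. The sketch only shows how the pieces relate, with the representative title derived from the highest-scoring document.

```python
from dataclasses import dataclass, field


@dataclass
class ScoredDocument:
    title: str
    score: float  # document quality score: similarity to the cluster


@dataclass
class Cluster:
    quality: float    # cluster quality score: cohesiveness
    documents: list   # ScoredDocument items
    key_terms: dict = field(default_factory=dict)  # term -> weight

    def __post_init__(self):
        # Keep the documents ranked, best score first.
        self.documents = sorted(
            self.documents, key=lambda d: d.score, reverse=True
        )

    @property
    def representative_title(self):
        """Title of the highest-scoring document, or None if empty."""
        return self.documents[0].title if self.documents else None


pensions = Cluster(
    quality=0.82,
    documents=[
        ScoredDocument("Rolling over a pension", 0.61),
        ScoredDocument("Traditional pension basics", 0.93),
    ],
    key_terms={"pension": 0.9, "annuity": 0.4},
)
print(pensions.representative_title)
```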
Evaluating the Database Design
After seeking information from users and organizing your records and fields, evaluate
your final design with your users. How does the design stand up?
Check these general characteristics to ensure that you have a good design. Keep in mind
that these are only guidelines and that some applications may not be well suited to them.
Characteristics of good designs are:

Single purpose records. Make sure that you're not creating records with many
purposes. An example of this would be the combination of EMPLOYEE and
DEPENDENT records into a single record type.

Well controlled use of data redundancy. Redundant data is helpful only when the
redundancy is used to link two or more records.

Proper use of optional fields. Avoid defining fields that are optional or required
based on the data value of some other field. It's best to have a field that is always
optional, rather than sometimes optional.

One unique field per record. Single-purpose records will commonly have a unique
field that can serve as the key.
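To make the first two guidelines concrete, the sketch below keeps EMPLOYEE and DEPENDENT as separate single-purpose records and uses one redundant field, the employee number, only to link them. The field names and the helper dependents_of are illustrative assumptions, not part of any product schema.

```python
from dataclasses import dataclass


@dataclass
class Employee:
    employee_number: int  # unique field that serves as the key
    name: str


@dataclass
class Dependent:
    employee_number: int  # redundant on purpose: links to Employee
    name: str
    relationship: str


def dependents_of(employee, dependents):
    """Follow the shared employee number from one record type to the other."""
    return [d for d in dependents
            if d.employee_number == employee.employee_number]


lee = Employee(1001, "Pat Lee")
all_dependents = [
    Dependent(1001, "Sam Lee", "child"),
    Dependent(1002, "Kim Ortiz", "spouse"),
]
print([d.name for d in dependents_of(lee, all_dependents)])
```

Combining both record types into one would force every employee record to carry dependent fields, most of which would be optional or repeated.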
Here are some characteristics that you might want to avoid:

Multi-purpose records. If you have multi-purpose records, try to get a clearer
understanding of the entities described and define additional record types rather than
overburden a record type with fields that describe attributes of more than one entity.

Multiple unique fields within a record. If you have accurately described an entity,
it usually has one unique attribute.

Too many records. Check to be sure you haven't unnecessarily split an entity
between record types. If you do this, you will have to constantly join the records
together to get complete data.

Excessive data redundancy. You probably want some data redundancy, especially for
joining record types, but review each instance and make sure that it's really
necessary.

Unique fields that aren't unique. Review the unique fields and make sure the data
will be unique.

Required fields that should be optional. Make sure that data will always be
available for each required field when you add record occurrences. Also, reassess the
need for a required field and your reasoning for making it required.

Optional fields that should be required. Take another look at your data. If you
leave out a non-required field, will the rest of the record make sense?

Too many optional fields. If you have too many optional fields in a record, you
should consider using additional record types. Having too many optional fields often
reflects an entity in a state of flux or a multi-purpose record. Try to define the
requirements more precisely, if possible.
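Two of the checks above, unique fields that aren't unique and required fields that lack data, can be run against sample data before the design is final. The following is a minimal sketch that assumes records are plain dictionaries. The function names and sample fields are hypothetical.

```python
from collections import Counter


def duplicate_values(records, unique_field):
    """Return values of a supposedly unique field that occur more than once."""
    counts = Counter(record.get(unique_field) for record in records)
    return [value for value, n in counts.items() if n > 1]


def missing_required(records, required_fields):
    """Return (record index, field name) wherever a required field has no data."""
    return [
        (index, name)
        for index, record in enumerate(records)
        for name in required_fields
        if record.get(name) in (None, "")
    ]


sample = [
    {"employee_number": 1001, "name": "Pat Lee", "phone": "555-0100"},
    {"employee_number": 1002, "name": "Kim Ortiz", "phone": ""},
    {"employee_number": 1001, "name": "Sam Roy"},
]
print(duplicate_values(sample, "employee_number"))
print(missing_required(sample, ["name", "phone"]))
```

If a required field such as phone is often missing in realistic sample data, that is a sign it should probably be optional.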
Sharing the Design with the Users
Data Dictionary
After you've come up with a reasonable design, it's a good idea to let users know what you
have in mind. A good vehicle for this is a “data dictionary.” The dictionary should
explain your record relationships and your field specifications in language that your users
can understand. You can derive information for a data dictionary from the record types
you define in the Actual Data Model for your database. Keep the data dictionary simple
and general. Here are some guidelines for preparing a dictionary:

Support your record relationship diagram and field layouts with a good description of
the arrangement.

Give only one definition per domain. Don't bother to explain what each and every
field means when the fields all fall under the same domain. For example,
explain what an employee number is, not DEPARTMENT employee number,
EMPLOYEE employee number, and PLACES employee number. One explanation
of the domain will cover each member field.

Explain what validation rules are needed (pattern, legal, word list, code list, etc.).
For more information, see Database Definition and Development, “Field
Validation.”

Identify the data types (numeric, character, text, etc.).

Explain the purpose of each record.

Describe any record-to-record dependencies.

Describe any field-to-field dependencies.
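The guidelines above, especially giving one definition per domain, can be illustrated with a small data dictionary held in code. This is a sketch of one possible layout, not a format the product defines. The domain text, data type, validation rule, and the describe function are all hypothetical.

```python
# One entry per domain: the definition is written once.
DOMAINS = {
    "employee_number": {
        "definition": "Number that identifies one employee.",
        "data_type": "numeric",
        "validation": "pattern",
    },
}

# Each (record type, field) pair points at its domain instead of
# repeating the explanation.
FIELDS = {
    ("EMPLOYEE", "employee number"): "employee_number",
    ("DEPARTMENT", "employee number"): "employee_number",
    ("PLACES", "employee number"): "employee_number",
}


def describe(record_type, field_name):
    """Build a user-readable dictionary entry for one field."""
    domain = DOMAINS[FIELDS[(record_type, field_name)]]
    return (
        f"{record_type} {field_name}: {domain['definition']} "
        f"Type: {domain['data_type']}. Validation: {domain['validation']}."
    )


print(describe("DEPARTMENT", "employee number"))
```

Because all three fields share one domain entry, a change to the definition is made in one place and shows up for every member field.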