Introduction to Relational Databases for Clinical Research
Michael A. Kohn, MD, MPP
Copyright 2007 Michael A. Kohn
Table of Contents
Introduction
Relational Databases, Keys, and Table Normalization
    Table of Study Subjects:
    Table of Measurements (One-to-Many Relationship):
    Table of Examiners (Many-to-Many Relationship):
    One-to-One Relationships:
    Referential Integrity in a Normalized, Relational Database:
    Undesirability of Storing Calculated Values:
Data Dictionaries, Data Types, and Domains
    Object Data Type
Extracting Data from the Database (Queries)
    Action Queries
Guidelines for Data Management in Clinical Research
Introduction
A clinical research study requires definition of the study population, the predictor
variables, and the outcome variables. The researcher must determine how to measure the
variables and anticipate problems with the measurements. Inevitably, baseline data on
the individuals in the study population and measurements of the predictor and outcome
variables will reside in a computer database. The software that runs this computer
database is the database management system (DBMS). Often the amount of actual
study information is small compared to the amount of administrative information, such as
patient contact information, exam schedules, reimbursement records, etc. The DBMS
may also store this administrative information, and it is used to update, check, and correct
all the data. It will also be used either to analyze the study data or to extract and format
the data for export to a statistical analysis package.
Since the original papers outlining the relational database model were published by E.F.
Codd in 1969 and 1970,(1, 2) an entire theory of relational database management has
evolved.(3-5) This theory is based on mathematical set theory, and has its own specific
terminology. The clinical researcher need not be familiar with this theory, nor its
terminology, but he or she should understand the concept of a relational database made
up of multiple tables in which the rows correspond to entities and the columns
correspond to attributes. The clinical researcher should also understand the definition of
primary key and foreign key, and the principle of table normalization. We will
develop the concept of a relational database, the definition of primary and foreign keys,
and the principle of normalization using as an example the Infant Jaundice Study, a
fictional cohort study to determine whether neonatal jaundice affects neuropsychological
scores at five years of age.
We assume that the reader has some experience with collecting and storing clinical
research data using spreadsheet or statistical analysis software. Therefore, the reader
should be familiar with storing data in a table with rows as records and columns as fields.
The reader should also be familiar with basic data types, such as the text, integer, real,
and date types. Because we assume this familiarity, we can focus initially on the
definition of primary and foreign keys and on the principle of normalization, which is the
process of breaking a single, complex table with many columns into two or more related
tables with fewer columns but more rows. If the general discussion of this process is
confusing, we encourage you to focus on the example, particularly Figures 1 through 7.
Relational Databases, Keys, and Table Normalization
A relational database is a collection of spreadsheet-like, two-dimensional tables in
which the rows correspond to individual records or entities and the columns correspond
to the different characteristics or attributes of these entities. In each table there is a
single column or combination of columns that uniquely identifies a row. This column or
combination of columns is the table’s primary key. If the table also includes a column
or combination of columns that is the primary key in another table, this column or group
of columns is called a foreign key. Including a foreign key creates a relationship
between the current table and the table for which the foreign key is primary. Tables are
related in one of three ways: one-to-many, many-to-many, and one-to-one. Strictly
speaking, the term “relational” has little to do with these between-table relationships. In
fact, “relation” is the formal term for a table with a primary key. However, the concept
of a relational database as a collection of related tables is a useful heuristic. Most clinical
research studies will have a table of study subjects, a table of measurements on those
subjects, and a table of examiners who make the measurements. The need for a multitable relational database often first arises when measurements are repeated on individual
subjects.
Table of Study Subjects:
All clinical research databases have a table in which each row corresponds to a study
participant. In this table of subjects, the columns correspond to participant-specific
attributes such as name, birth date, and sex. Each row must have a column value or
combination of column values that distinguishes it from the other rows. This column or
combination of columns is the primary key. It is highly desirable to create a single
subject identification number that functions as the primary key.
Figure 1 shows a table of 13 study subjects for the fictional Infant Jaundice Study that we
are using as an example. The Infant Jaundice Study is a cohort study to compare the five-year neuropsychological scores (IQs) of infants with neonatal jaundice to the scores of
normal infants from the same birth cohort. Of the 13 subjects listed in the table, 6 had
neonatal jaundice and 7 did not. Neither the “DOB” field nor the “FName” field is a
candidate primary key, because neither uniquely identifies its row; Helen and Robert
have the same birth date, and there are two Amys. The combination of “FName” and
“DOB” uniquely identifies a row in this table and could be used as a composite primary
key. However, as more children are entered into the study, inevitably two children will
share first name and birth date. Instead, a unique identification number (“SubjectID”) is
assigned to each study participant and functions as the primary key (Figure 2). Using a
unique subject identifier that has no meaning external to the study database also
simplifies the process of “de-linking” study data from personal identifiers for purposes of
maintaining subject confidentiality.
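The point about candidate keys can be sketched in a few lines. This is only an illustration: it assumes SQLite (through Python's standard sqlite3 module) rather than the DBMS used for the figures, and the birth dates and SubjectID values are invented; only the column names and first names come from the text.

```python
import sqlite3

# Illustrative rows only; the names echo the text, the dates and IDs are invented.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Baby (
        SubjectID INTEGER PRIMARY KEY,  -- study identifier with no external meaning
        FName     TEXT NOT NULL,
        DOB       TEXT NOT NULL         -- ISO date string, e.g. '2005-01-06'
    )
""")
conn.executemany(
    "INSERT INTO Baby (SubjectID, FName, DOB) VALUES (?, ?, ?)",
    [
        (2001, "Helen", "2005-01-06"),
        (2002, "Robert", "2005-01-06"),  # same birth date as Helen
        (2003, "Amy", "2005-01-12"),
        (2004, "Amy", "2005-02-03"),     # same first name as another subject
    ],
)

# Neither FName nor DOB alone identifies a row.
amys = conn.execute("SELECT COUNT(*) FROM Baby WHERE FName = 'Amy'").fetchone()[0]
same_dob = conn.execute(
    "SELECT COUNT(*) FROM Baby WHERE DOB = '2005-01-06'"
).fetchone()[0]

# The primary key rejects a second row with an existing SubjectID.
try:
    conn.execute("INSERT INTO Baby VALUES (2001, 'Ryan', '2005-03-01')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

Declaring "SubjectID" as the primary key lets the DBMS, rather than the data-entry staff, guarantee that no two subjects share an identifier.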
Predictor and outcome variables can be included in the table of study subjects if each
subject can only have one measurement of the variable, just as each subject can only have
one birth date and one sex. Often, predictor and outcome variables really are measured
only once per subject, and all the important study data fit reasonably well into a single,
two-dimensional table. When this is the case, the researcher may prefer to store the data
using a spreadsheet program or a statistical analysis package. But even when the
dynamic data that are added and modified during the course of the study fit into a single,
two-dimensional table, a relational DBMS may still be needed to handle the study’s
administrative data such as subject contact information, exam schedules, and
reimbursement records. The database management software will also be useful for
maintaining lookup tables and for its data-entry, data-formatting, and data-validation
features--all to be discussed later.
In the Infant Jaundice Study the table of subjects (Figure 2) has rows representing
individual infants and columns corresponding to subject identification number, name,
birth date, sex, and whether the infant had neonatal jaundice. If each subject in the study
receives only one neuropsychological exam at age five, the exam results can also be
included as a set of columns in the table of study subjects (Figure 3). If the dynamic
study data are limited to this one table, they are easily exported to a spreadsheet or
statistical package for analysis without any need for rearrangement (footnote 1). Some of us have
come to refer to a database consisting of a single, two-dimensional table, such as the one
depicted in Figure 3, as a “flat file”. However, the original meaning of the term “flat
file” was a file consisting of a string of characters that could only be evaluated
sequentially (such as a tab-delimited text file). Many statistical packages have added
features to accommodate more than one table, but at their core, most remain single-table
or “flat file” databases.
Footnote 1: In the table shown in Figure 3, the mean (± standard deviation) neuropsychological score for the 5 of 6
neonatal jaundice patients who had the outcome measured is 112.6 (± 19.1). For the 4 of 7 controls with
measurements, the mean is 101.3 (± 24.2). T-test comparison of these means yields a p-value of 0.46.
Table of Measurements (One-to-Many Relationship):
The need to include more than one table in a study database often arises first when
measurements are repeated on individual subjects. If the same study variable is measured
on multiple occasions, then a separate table is required for measurements. The rows in
this separate table correspond to individual examinations and include the examination
date, the results of the exam, and most importantly, the subject identification number of
the examinee (which functions as the foreign key). The relationship between the table of
subjects and the table of examinations is one-to-many.
To enable assessment of the inter-rater reliability of the neuropsychological score in our
Infant Jaundice Study example, some of the subjects received the neuropsychological
exam multiple times from different examiners. If we attempt to include the results of
multiple examinations in the subject table, we end up with the situation depicted in
Figure 4.
The table has to have enough columns to accommodate the participant with the most
examinations, even if that participant has 10 more examinations than any other
participant. Most of the examination fields will be null, and querying the table to find the
number of exams done in a particular time interval will require searching the many
different exam date columns (footnote 2). The number of columns grows in proportion to the
maximum number of examinations per subject, so the table could get extremely wide.
This is probably the most common and most fundamental mistake that clinical
researchers make in setting up their study databases. Whenever a table has repeating
columns, like “ExDate1”, “ExDate2”, etc., and whenever a table gets beyond about 30
columns wide, it is time to re-evaluate the structure of the table.
On the other hand, if we create a table in which each row corresponds to an examination,
we end up with the situation depicted in Figure 5.
Footnote 2: Repeating columns such as those shown in Figure 4 violate the First Normal Form (1NF), which requires
that column values be “atomic” and there be no repeating groups.
The subject-specific data are repeated multiple times, once for each examination of the
same study subject. The subject identification number (“SubjectID”) can no longer
function as a primary key. Instead, the primary key for this new table would have to be a
combination of “SubjectID” and “ExDate”. Correcting a birth date on a single participant
might require changing multiple rows. If you correct the birth date on only some of these
rows, the patient may end up with two different birth dates in the database. (See Helen’s
records for an example.) Also, querying the table for unique participants born on a
particular date is problematic. Most importantly, the table has no place to store
individuals without exams. Comparison of the table in Figure 5 with the previous tables
will show that subject-specific data on the 4 subjects without exams (Alejandro, Ryan,
Zachary, and Jackson) have been lost (footnote 3).
Footnote 3: Redundancy in column values from row to row violates the Second Normal Form (2NF), which requires
that non-key column values depend on the entire primary key.
The solution to these problems is normalization, which refers to the decomposition of
one wide table into two or more narrower tables without losing any data. In this case, we
normalize the single table in Figure 4 into the two tables in Figure 6: a table of subjects
(“Baby”) and a table of examinations (“Exam”). In the table of examinations, the
columns represent examination date, examination results, and most importantly, the
identification number of the examinee (the foreign key). The primary key in the table of
examinations can be the combination of the subject identification number and the exam
date, as long as no study subject will be examined twice on the same day. However,
creating a single, unique exam identifier (“ExamID”) to function as the primary key in
the table of examinations will simplify matters later. The relationship of the subject table
to the examination table is one-to-many (Figure 7). Now, querying the examination
table for all exams performed within a particular time period requires searching a single
exam date column; querying the subject table for unique patients born on a particular date
is simple. A change to a subject-specific field like birth date is made in one place, and
consistency is preserved. The database can still accommodate subjects, such as
Alejandro, Ryan, Zachary, and Jackson, who have no exams.
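The benefits listed above can be checked with a small sketch of the two normalized tables. It assumes SQLite via Python's sqlite3 module, not the DBMS shown in the figures, and every data value is invented; the table and column names follow the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Baby (
        SubjectID INTEGER PRIMARY KEY,
        FName     TEXT,
        DOB       TEXT
    );
    CREATE TABLE Exam (
        ExamID    INTEGER PRIMARY KEY,
        SubjectID INTEGER NOT NULL REFERENCES Baby (SubjectID),  -- foreign key
        ExDate    TEXT,
        ExNPScor  INTEGER
    );
    -- Invented values for illustration only.
    INSERT INTO Baby VALUES (2001, 'Helen', '2005-01-06'),
                            (2002, 'Robert', '2005-01-06'),
                            (2003, 'Alejandro', '2005-01-20');
    INSERT INTO Exam VALUES (1, 2001, '2010-01-15', 104),
                            (2, 2001, '2010-02-10', 99),
                            (3, 2002, '2010-03-02', 118);
""")

# Exams in a time window: there is a single ExDate column to search.
exams_in_window = conn.execute("""
    SELECT ExamID FROM Exam
    WHERE ExDate BETWEEN '2010-01-01' AND '2010-02-28'
    ORDER BY ExamID
""").fetchall()

# A birth date is corrected in exactly one row, however many exams exist.
rows_changed = conn.execute(
    "UPDATE Baby SET DOB = '2005-01-07' WHERE SubjectID = 2001"
).rowcount

# Subjects with no exams are still present in the database.
no_exams = conn.execute("""
    SELECT Baby.FName
    FROM Baby LEFT JOIN Exam ON Baby.SubjectID = Exam.SubjectID
    WHERE Exam.ExamID IS NULL
""").fetchall()
```

Here the subject with two exams still has only one "Baby" row, so the birth-date correction touches one row, and the subject without exams is found with an outer join.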
Table of Examiners (Many-to-Many Relationship):
In addition to a table of subjects and a table of measurements (with multiple
measurements per subject), many studies need to maintain a table of examiners to store
the characteristics of those making the measurements. In the Infant Jaundice Study, one
of a group of physicians performs each neuropsychological assessment. So, the study’s
database includes a table in which each row corresponds to a different physician (Figure
8).
Since each physician performs many examinations, incorporating the physician-specific
information into the table of examinations results in repetition and allows for
inconsistencies (Figure 9).
This is the same problem we encountered when we tried to put subject-specific
information in the exam table (Figure 5). Again, if we delete a physician’s last remaining
examination, we also delete the physician from the database. As above, the solution is to
maintain the table of examiners as a separate table and create a one-to-many relationship
with the table of examinations (Figures 10 and 11).
Note that a second foreign key (“DocID”) has been added to the table of examinations.
The relationships diagram in Figure 11 shows that both the “Baby” and the “Doctor”
tables have one-to-many relationships with the “Exam” table. Each subject can see
several different doctors, and each doctor can examine many different subjects. The
relationship between the “Baby” table and the “Doctor” table is many-to-many. The
creation of such a many-to-many relationship requires a linkage table like the “Exam”
table in this example. Commonly the combination of foreign keys in the linkage table is
unique. (No doctor examines the same child twice.)
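A linkage table of this kind might be sketched as follows. The sketch assumes SQLite via Python's sqlite3 module; the "LName" column of the "Doctor" table and all data values are hypothetical, since the figures are not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Baby   (SubjectID INTEGER PRIMARY KEY, FName TEXT);
    CREATE TABLE Doctor (DocID INTEGER PRIMARY KEY, LName TEXT);  -- LName is hypothetical
    CREATE TABLE Exam (
        ExamID    INTEGER PRIMARY KEY,
        SubjectID INTEGER NOT NULL REFERENCES Baby (SubjectID),
        DocID     INTEGER NOT NULL REFERENCES Doctor (DocID),
        ExDate    TEXT,
        UNIQUE (SubjectID, DocID)  -- no doctor examines the same child twice
    );
    -- Invented values for illustration only.
    INSERT INTO Baby VALUES (2001, 'Helen'), (2002, 'Robert');
    INSERT INTO Doctor VALUES (1, 'Doctor A'), (2, 'Doctor B');
    INSERT INTO Exam VALUES (1, 2001, 1, '2010-01-15'),
                            (2, 2001, 2, '2010-02-10'),
                            (3, 2002, 1, '2010-02-11');
""")

# One subject, several doctors.
doctors_for_2001 = conn.execute("""
    SELECT Doctor.LName
    FROM Exam INNER JOIN Doctor ON Exam.DocID = Doctor.DocID
    WHERE Exam.SubjectID = 2001
    ORDER BY Doctor.DocID
""").fetchall()

# One doctor, several subjects.
babies_for_doc1 = conn.execute("""
    SELECT Baby.FName
    FROM Exam INNER JOIN Baby ON Exam.SubjectID = Baby.SubjectID
    WHERE Exam.DocID = 1
    ORDER BY Baby.SubjectID
""").fetchall()
```

The many-to-many relationship never appears as a direct link between "Baby" and "Doctor"; it emerges from the two one-to-many relationships that meet in "Exam".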
One-to-One Relationships:
A one-to-one correspondence between rows in two separate tables can always be
converted to a single table combining the columns from both separate tables. However, if
the information contained in one of the two tables applies to a small proportion of the
records in the other, separating the tables and establishing a one-to-one relationship
eliminates the large number of empty cells that would occur in the combined table.
In the Infant Jaundice Study, a small number of subjects died prior to age five. The
circumstances and date of death are important data to capture in the study’s database.
However, if we include columns related to the circumstances of death in the table of
study subjects, these columns will be empty in the vast majority of rows, since most of
the subjects lived to age five (Figure 12; footnote 4). Creating a separate table in which each row
corresponds to the death of a study subject solves the problem of empty cells (Figure 13).
The relationship of this table of deaths to the table of study subjects is one-to-one (Figure
14). A table involved in a one-to-one relationship does not require a separate column for
the foreign key from the related table, since the primary key is also the foreign key.
Footnote 4: The table depicted in Figure 12 does not violate any normal form and is perfectly
acceptable.
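As a rough sketch of the one-to-one arrangement described in this section, the table of deaths can use "SubjectID" as both its primary key and its foreign key. The sketch assumes SQLite via Python's sqlite3 module; the names "Death", "DeathDate", and "Circumstances" and all data values are hypothetical, not taken from Figure 13.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE Baby (SubjectID INTEGER PRIMARY KEY, FName TEXT);
    -- "Death", "DeathDate", and "Circumstances" are hypothetical names.
    CREATE TABLE Death (
        SubjectID     INTEGER PRIMARY KEY REFERENCES Baby (SubjectID),
        DeathDate     TEXT,
        Circumstances TEXT
    );
    -- Invented values for illustration only.
    INSERT INTO Baby VALUES (2001, 'Subject A'), (2002, 'Subject B'), (2003, 'Subject C');
    INSERT INTO Death VALUES (2003, '2007-05-01', 'illustrative text only');
""")

# The combined view has empty cells only where no death row exists.
combined = conn.execute("""
    SELECT Baby.SubjectID, Death.DeathDate
    FROM Baby LEFT JOIN Death ON Baby.SubjectID = Death.SubjectID
    ORDER BY Baby.SubjectID
""").fetchall()

# Because SubjectID is the primary key of Death, a subject can have at most one row.
try:
    conn.execute("INSERT INTO Death VALUES (2003, '2007-06-01', 'second row')")
    second_row_rejected = False
except sqlite3.IntegrityError:
    second_row_rejected = True
```

The empty cells of the combined table reappear only in the query result, not in storage.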
Referential Integrity in a Normalized, Relational Database:
Figure 14 is the relationships diagram for a simple, normalized, relational database
consisting of only four tables. By structuring the database this way, instead of as a very
wide and complex single table, we have eliminated redundant storage and the opportunity
for inconsistencies. Each piece of information, such as a subject’s birth date, or an
examiner’s specialty, is stored in only one place. The DBMS software will maintain
referential integrity, meaning that it will not allow creation of an exam record for a
subject who does not already exist in the “Baby” table, and it will not allow assigning an
exam to a doctor who does not already exist in the “Doctor” table. Similarly, a subject
may not be deleted unless and until all that subject’s examinations have also been
deleted. Some refer to the record on the “one” side of a one-to-many relationship as the
“parent”, while the records on the “many” side of the relationship are the “children”.
Using this terminology, referential integrity forbids the creation of “orphans”.
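A minimal sketch of referential integrity in action follows. It assumes SQLite via Python's sqlite3 module, where foreign-key enforcement must be switched on explicitly; other DBMSs behave similarly but differ in detail. The data values and the helper name "is_rejected" are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
    CREATE TABLE Baby (SubjectID INTEGER PRIMARY KEY, FName TEXT);
    CREATE TABLE Exam (
        ExamID    INTEGER PRIMARY KEY,
        SubjectID INTEGER NOT NULL REFERENCES Baby (SubjectID),
        ExDate    TEXT
    );
    -- Invented values for illustration only.
    INSERT INTO Baby VALUES (2001, 'Helen');
    INSERT INTO Exam VALUES (1, 2001, '2010-01-15');
""")


def is_rejected(sql):
    """Return True if the DBMS refuses the statement on integrity grounds."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


# An exam for a subject who is not in "Baby" would be an orphan.
orphan_rejected = is_rejected("INSERT INTO Exam VALUES (2, 9999, '2010-01-20')")

# A subject who still has exams cannot be deleted.
parent_delete_rejected = is_rejected("DELETE FROM Baby WHERE SubjectID = 2001")

# Once the subject's exams are deleted, the subject can be deleted too.
conn.execute("DELETE FROM Exam WHERE SubjectID = 2001")
parent_delete_rejected_after = is_rejected("DELETE FROM Baby WHERE SubjectID = 2001")
```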
Even in this simple example, structuring the database as a four-table relational database
has many advantages. The databases for most clinical research studies will include more
than four tables. Trying to build such databases using statistical or spreadsheet software
is a mistake.
Undesirability of Storing Calculated Values:
Creating a field (column) to store a value that is calculated from other fields is
problematic, because updating any of the other fields requires updating the calculated
field as well. Inconsistencies result if one of the “raw-data” fields is updated without
updating the calculated field. We will discuss later the alternative to storing calculated
fields—recalculating the value in a query. Clinical researchers most often make the error
of storing calculated fields by creating an “age” field in addition to the birth date and
measurement date fields. If either the birth date or measurement date field is changed,
the “age” field becomes inaccurate.
Returning to the Infant Jaundice Study example, Figure 15 shows the subject’s age in
months at exam. If a subject’s birth date is corrected in the “Baby” table, the ages in the
“Exam” table will be inaccurate. Storing a patient’s birth date and the date of each exam
allows calculation of his or her exact age at the time of the exam. We can always
calculate age from birth date and exam date, but we cannot work the other way and
calculate exam date from birth date and age, or birth date from exam date and age. In
general, one should always store the endpoints of an interval rather than the interval.
Similarly, storing the mean score for a series of measurements means updating that field
every time a score is changed (footnote 5). Occasionally, it is expedient to store the results of
extremely complex calculations. In these situations, procedures and checks are required
to ensure recalculation and update whenever one of the “raw-data” fields is updated.
Footnote 5: Storing calculated values such as “AgeInMonths” in Figure 15 represents a violation of
the Third Normal Form (3NF), which requires that all non-key columns be mutually
independent. Since tables in 3NF must also be in 2NF, we can say that all non-key
attributes should depend on “the key, the whole key, and nothing but the key”.
Data Dictionaries, Data Types, and Domains
In focusing on the critical concept of normalization, we skipped over the more mundane
concepts of data dictionaries, data types and domains. So far we have seen tables only in
the datasheet view. Each column or field has a name and, implicitly, a data type and a
definition. In the “Baby” table (Figure 2), “FName” is a text field that contains the
subject’s first name; “DOB” is a date field that contains the subject’s birth date; and
“Jaundice” is a yes/no field that indicates whether the study subject had neonatal
jaundice. In the “Exam” table (Figures 6 and 10), “ExWght” is a real-number weight in
kilograms and “ExNPScor” is an integer IQ score. The data dictionary makes these
column definitions explicit. Figure 16 shows the “Baby” and “Exam” tables in table
design (or “data dictionary”) view. Note that the data dictionary is itself a table with
rows representing fields and columns for field name, field type, and field description.
Since the data dictionary is a table of information about the database itself, it is referred
to as “metadata” (footnote 6).
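The idea that the data dictionary is itself a table can be seen by asking the DBMS for its metadata. The sketch below assumes SQLite via Python's sqlite3 module, whose PRAGMA table_info statement returns one row per field; the table design view in Figure 16 presents the same kind of information in a different layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Baby (
        SubjectID INTEGER PRIMARY KEY,
        FName     TEXT,
        DOB       TEXT,
        Sex       TEXT,
        Jaundice  INTEGER   -- yes/no stored as 1/0
    )
""")

# PRAGMA table_info returns one row per field:
# (cid, name, type, notnull, dflt_value, pk)
data_dictionary = [
    (name, col_type, bool(pk))
    for _cid, name, col_type, _notnull, _default, pk
    in conn.execute("PRAGMA table_info(Baby)")
]
field_names = [name for name, _type, _is_key in data_dictionary]
```

SQLite does not store field descriptions, so a complete data dictionary would still need a description column maintained by the study team.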
Each field also has a domain or range of allowed values. For example, the allowed
values for the “Sex” field are “M” and “F”. The DBMS will not allow entry of any other
value in this field. Similarly, the “ExNPScor” field allows only integers between 40 and 200.
Creating these validation rules affords some protection against data entry errors. Some of
the field types come with automatic validation rules. For example, the DBMS will
always reject a date of April 31.
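Validation rules like these can be written as constraints on the columns. The sketch below assumes SQLite via Python's sqlite3 module and uses CHECK constraints; it checks the range of "ExNPScor" but not that the value is an integer. Because SQLite stores dates as text, the April 31 check is shown with Python's own date type instead. The helper name "accepted" and the test values are invented.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Baby (
        SubjectID INTEGER PRIMARY KEY,
        Sex       TEXT CHECK (Sex IN ('M', 'F'))
    );
    CREATE TABLE Exam (
        ExamID   INTEGER PRIMARY KEY,
        ExNPScor INTEGER CHECK (ExNPScor BETWEEN 40 AND 200)
    );
""")


def accepted(sql, params):
    """Return True if the DBMS accepts the row, False if a rule rejects it."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


results = {
    "sex M": accepted("INSERT INTO Baby VALUES (?, ?)", (2001, "M")),
    "sex X": accepted("INSERT INTO Baby VALUES (?, ?)", (2002, "X")),
    "score 104": accepted("INSERT INTO Exam VALUES (?, ?)", (1, 104)),
    "score 400": accepted("INSERT INTO Exam VALUES (?, ?)", (2, 400)),
}

# A true date type rejects impossible dates without any extra rule.
try:
    date(2010, 4, 31)
    april_31_rejected = False
except ValueError:
    april_31_rejected = True
```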
Footnote 6: Although Figure 16 displays two data dictionaries, one for the “Baby” table and one for the “Exam”
table, the entire database can be viewed as having a single data dictionary rather than one dictionary for
each table. For each field in the database, the single data dictionary requires specification of the field’s
table name in addition to the field name, field type, field description, and range of allowed values.
Object Data Type
In addition to the text, number, date, and yes/no data types, there is also an “object” data
type. Sometimes called a BLOB (Binary Large Object), an object is a file associated
with an application (computer program) that can interpret it. A common example would
be an image file in JPEG format. If the Infant Jaundice Study required a photograph of
each subject at the time of his or her exam, we could have included an object field in the
“Exam” table to store the photograph as a JPEG file. A word-processed document or a
spreadsheet could be stored in an object field. Generally, one cannot sort or search on an
object field.
Extracting Data from the Database (Queries)
Once the database has been created and populated with some data, the user will want to
organize, sort, filter, and view or “query” the data, as well as add, modify, or delete
records (rows). The standard language for manipulating data in a relational database is
called Structured Query Language or SQL (pronounced “sequel”; footnote 7). All relational
database software systems use one or another variant of SQL, but they also provide a
graphical interface for building queries that makes it unnecessary for the clinical
researcher to learn SQL. In the examples that follow, we will use the graphical query
interface provided by Microsoft Access.
A query can join data from two or more tables, display only selected fields, and filter for
records that meet certain criteria. Suppose that, in the Infant Jaundice Study, we are
interested in age at exam of subjects who were examined in January and February of
2010. In order to calculate age at exam, we will need both the date of exam from the
“Exam” table and the date of birth from the “Baby” table. Figure 17 shows the structure
of a query that joins the “Exam” and “Baby” tables on the “SubjectID” field and displays
the “SubjectID”, “DOB”, and “ExDate” fields – only for exams that were performed in
January or February, 2010.
Footnote 7: SQL has 3 sublanguages: DDL (Data Definition Language), DML (Data Manipulation Language), and
DCL (Data Control Language). Strictly speaking, DML is the SQL sublanguage used to view, organize,
and extract data, as well as insert, update, and delete records.
The SQL associated with this query follows:
SELECT Baby.SubjectID, Baby.DOB, Exam.ExDate
FROM Baby INNER JOIN Exam ON Baby.SubjectID = Exam.SubjectID
WHERE Exam.ExDate Between #1/1/2010# And #2/28/2010#
ORDER BY Exam.ExDate;
Again, although all relational database applications use SQL, the clinical researcher need
not learn it, because he or she can use a graphical query designer such as the one shown
in Figure 17.
Executing the query yields the results shown in Figure 18. There were 14 examinations
performed in January and February of 2010.
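For readers who want to try the query outside Microsoft Access, here is a sketch of the same join in SQLite, run through Python's sqlite3 module. It assumes dates are stored as ISO text strings, so the Access date literals become quoted strings. The three exam rows are invented and will not reproduce the 14 examinations in Figure 18.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Baby (SubjectID INTEGER PRIMARY KEY, DOB TEXT);
    CREATE TABLE Exam (ExamID INTEGER PRIMARY KEY, SubjectID INTEGER, ExDate TEXT);
    -- Invented values for illustration only.
    INSERT INTO Baby VALUES (2001, '2005-01-06'), (2002, '2005-01-12');
    INSERT INTO Exam VALUES (1, 2001, '2010-02-10'),
                            (2, 2002, '2010-01-15'),
                            (3, 2002, '2010-03-02');   -- outside the window
""")

# Same structure as the Access query: join on SubjectID, filter and sort on ExDate.
rows = conn.execute("""
    SELECT Baby.SubjectID, Baby.DOB, Exam.ExDate
    FROM Baby INNER JOIN Exam ON Baby.SubjectID = Exam.SubjectID
    WHERE Exam.ExDate BETWEEN '2010-01-01' AND '2010-02-28'
    ORDER BY Exam.ExDate
""").fetchall()
```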
Note that the result of a query that joins two tables, displays only certain fields, and
selects rows based on special criteria, still looks like a table in datasheet view. One of the
tenets of the relational database model is that operations on tables produce table-like
results. It is important to remember, however, that the query result displays the data as
they reside in the tables. The query is a window through which to view the data in the
tables. In fact, queries are sometimes called “views”. Changing a field value in a query
actually changes the value in the underlying table; deleting a row from a query deletes the
record from the underlying table.
Queries can also display results of calculations based on raw data fields from the tables.
As discussed above, storing only the raw data fields and recalculating derived fields has
numerous advantages. In this example, we want to calculate the age in months of each
child at the time of the exam. Figure 19 shows the calculated field added to the query
design, and Figure 20 shows the results of the query in Figure 19.
Because the “AgeInMonths” value is calculated rather than stored, it will automatically
reflect any changes to the raw data fields (“DOB” and “ExDate”) from which it is
calculated.
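A calculated field of this kind might look as follows in SQLite, run through Python's sqlite3 module. The formula is an assumption: it counts completed calendar months between "DOB" and "ExDate", which may differ from the expression used in Figure 19. The dates are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Baby (SubjectID INTEGER PRIMARY KEY, DOB TEXT);
    CREATE TABLE Exam (ExamID INTEGER PRIMARY KEY, SubjectID INTEGER, ExDate TEXT);
    -- Invented values for illustration only.
    INSERT INTO Baby VALUES (2001, '2005-01-06');
    INSERT INTO Exam VALUES (1, 2001, '2010-02-10');
""")

# Completed months = 12 * year difference + month difference,
# minus 1 if the exam falls before the birth day-of-month.
AGE_QUERY = """
    SELECT (CAST(strftime('%Y', Exam.ExDate) AS INTEGER)
              - CAST(strftime('%Y', Baby.DOB) AS INTEGER)) * 12
         + (CAST(strftime('%m', Exam.ExDate) AS INTEGER)
              - CAST(strftime('%m', Baby.DOB) AS INTEGER))
         - (strftime('%d', Exam.ExDate) < strftime('%d', Baby.DOB)) AS AgeInMonths
    FROM Baby INNER JOIN Exam ON Baby.SubjectID = Exam.SubjectID
    WHERE Exam.ExamID = 1
"""

age_before = conn.execute(AGE_QUERY).fetchone()[0]

# Correct the birth date in the one place it is stored, then rerun the query.
conn.execute("UPDATE Baby SET DOB = '2005-03-06' WHERE SubjectID = 2001")
age_after = conn.execute(AGE_QUERY).fetchone()[0]
```

No "age" column exists anywhere in the tables, so there is nothing to fall out of date when the birth date is corrected.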
Perhaps the purpose of determining age in months at the exam date is to compare another
calculated value, Body Mass Index (BMI), to norms for age and sex. Figure 21 shows a
query that calculates BMI from weight (“ExWght”) and height (“ExHght”), in addition to
age in months, and displays the subject’s sex as well. Figure 22 shows the results of this
query, which could be used to compare each individual subject’s BMI to norms based on
sex and age in months.
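The BMI calculation can be sketched the same way, again in SQLite via Python's sqlite3 module. The text gives “ExWght” in kilograms but does not state the units of “ExHght”; the sketch assumes centimeters, so BMI is weight divided by the square of height in meters. The measurements are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Baby (SubjectID INTEGER PRIMARY KEY, Sex TEXT);
    CREATE TABLE Exam (
        ExamID    INTEGER PRIMARY KEY,
        SubjectID INTEGER,
        ExWght    REAL,   -- kilograms (stated in the text)
        ExHght    REAL    -- centimeters (an assumption of this sketch)
    );
    -- Invented values for illustration only.
    INSERT INTO Baby VALUES (2001, 'F');
    INSERT INTO Exam VALUES (1, 2001, 20.0, 110.0);
""")

# BMI = weight in kg / (height in m)^2, computed at query time rather than stored.
rows = conn.execute("""
    SELECT Exam.ExamID, Baby.Sex,
           Exam.ExWght / ((Exam.ExHght / 100.0) * (Exam.ExHght / 100.0)) AS BMI
    FROM Baby INNER JOIN Exam ON Baby.SubjectID = Exam.SubjectID
""").fetchall()
exam_id, sex, bmi = rows[0]
```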
Action Queries
The queries demonstrated so far are “select” queries – so called because they are based
on the SQL “select” command. They filter, sort, restrict, and display the data stored in
the database tables, and as mentioned above, are sometimes called views. Another
category of query is the “action” query that actually changes the data in the tables. The
three types of action queries are 1) the update query that changes the values of specific
fields in existing records, 2) the append or insert query that adds new records (rows) to a
table, and 3) the delete query that deletes records from a table.
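The three types of action query correspond to the SQL UPDATE, INSERT, and DELETE commands. The sketch below assumes SQLite via Python's sqlite3 module and uses invented exam rows; the Access query designer generates equivalent statements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Exam (
        ExamID    INTEGER PRIMARY KEY,
        SubjectID INTEGER,
        ExDate    TEXT,
        ExNPScor  INTEGER
    );
    -- Invented values for illustration only.
    INSERT INTO Exam VALUES (1, 2001, '2010-01-15', 104),
                            (2, 2002, '2010-02-10', 99);
""")

# 1) Update query: change a field value in an existing record.
conn.execute("UPDATE Exam SET ExNPScor = 101 WHERE ExamID = 2")

# 2) Append (insert) query: add a new record.
conn.execute(
    "INSERT INTO Exam (ExamID, SubjectID, ExDate, ExNPScor) VALUES (?, ?, ?, ?)",
    (3, 2001, "2010-03-02", 110),
)

# 3) Delete query: remove a record.
conn.execute("DELETE FROM Exam WHERE ExamID = 1")
conn.commit()

remaining = conn.execute(
    "SELECT ExamID, ExNPScor FROM Exam ORDER BY ExamID"
).fetchall()
```

Unlike select queries, these statements change the stored data, so it is prudent to run the matching select query first to see which rows will be affected.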
Update Queries
Append (Insert) Queries
Delete Queries
Guidelines for Database Management for Clinical Research
1. Establish the database tables, their rows and columns, and their relationships correctly
at the outset.
A poorly organized database makes data maintenance and retrieval nearly impossible.
Make sure the data are normalized. Avoid data structures that require duplicate data entry
or redundant storage. Sometimes it helps to start with the data collection forms, but you
do NOT need one table per data collection form. One form can combine data from
several tables, and data from one table can appear on several forms. Whether you start
with data collection forms or data tables is irrelevant, as long as the process is iterative.
You can start with the tables and then develop the forms, test the forms, find problems,
and update the tables, or you can start with a word-processed form, create the tables, test,
and update.
2. Establish and follow naming conventions for columns and tables.
Short field names without spaces or underscores are convenient for programming,
querying, and other manipulations. Instead of spaces or underscores, use “IntraCaps”
(upper case letters within the variable name) to distinguish words, e.g. “StudyID”,
“FName”, or “ExamDate”. Table names should be singular, e.g. “Baby” instead of
“Babies”, “Exam” instead of “Exams”.
3. Obtain baseline demographic and clinical information about members of the study
population from existing computer databases.
Avoid re-entering data which are already available (in digital format) from other sources.
In the Infant Jaundice Study, the patient demographic data and contact information are
obtained from the hospital database. Computer systems can almost always produce
character-delimited or fixed-column-width text files that the database management
system can import.
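Importing a character-delimited file might be sketched as below, using Python's csv and sqlite3 modules. The export text stands in for a file produced by a hospital system; its column layout and contents are invented for the illustration.

```python
import csv
import io
import sqlite3

# Stand-in for a comma-delimited export from a hospital system (invented rows).
export_text = (
    "SubjectID,FName,DOB,Sex\n"
    "2001,Helen,2005-01-06,F\n"
    "2002,Robert,2005-01-06,M\n"
)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Baby (SubjectID INTEGER PRIMARY KEY, FName TEXT, DOB TEXT, Sex TEXT)"
)

# Read each delimited line as a dictionary keyed by the header row,
# converting the identifier from text to an integer before loading.
reader = csv.DictReader(io.StringIO(export_text))
records = [
    (int(row["SubjectID"]), row["FName"], row["DOB"], row["Sex"])
    for row in reader
]
conn.executemany(
    "INSERT INTO Baby (SubjectID, FName, DOB, Sex) VALUES (?, ?, ?, ?)", records
)
conn.commit()

imported = conn.execute("SELECT COUNT(*) FROM Baby").fetchone()[0]
```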
4. Minimize the extent to which study measurements are recorded on paper forms.
Enter data directly into the computer database or move data from paper forms into the
computer database as close to the data collection time as possible. When you define a
variable in a computer database, you specify both its format and its domain or range of
allowed values. Using these format and domain specifications, computer data entry
forms give immediate feedback about improper formats and values that are out of range.
The best time to receive this feedback is when the study subject is still on site. If having
a paper copy of the data is important, you can always print out a record immediately after
collecting it. (This is equivalent to getting an ATM receipt at the end of the transaction.)
5. Follow standard data entry conventions.
Several conventions for data entry and display have developed over time. Although most
users of screen forms are not aware of these conventions, they have come to expect them
subconsciously. For example, a series of mutually exclusive, collectively exhaustive
choices is usually displayed as an “option group” consisting of several different “radio
buttons”, whereas choices which are not mutually exclusive are displayed as check boxes.
6. Back up the database regularly and check the adequacy of the backup procedure by
periodically restoring a file from the backup medium.
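A backup-and-verify step could be sketched as follows, assuming SQLite and the backup method of Python's sqlite3 connections. Both databases are held in memory here to keep the example self-contained; in practice the copy would be a file on a separate medium. The rows are invented.

```python
import sqlite3

source = sqlite3.connect(":memory:")
source.executescript("""
    CREATE TABLE Baby (SubjectID INTEGER PRIMARY KEY, FName TEXT);
    -- Invented values for illustration only.
    INSERT INTO Baby VALUES (2001, 'Helen'), (2002, 'Robert');
""")

# Copy the whole database into a second connection.
backup_copy = sqlite3.connect(":memory:")
source.backup(backup_copy)

# Check the backup by reading the data back and comparing it with the original.
original_rows = source.execute("SELECT * FROM Baby ORDER BY SubjectID").fetchall()
restored_rows = backup_copy.execute("SELECT * FROM Baby ORDER BY SubjectID").fetchall()
backup_is_adequate = restored_rows == original_rows
```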
Guidelines for Data Management in Clinical Research
1. Establish the database tables, their rows and columns, and their relationships correctly
at the outset.
2. Establish and follow naming conventions for columns and tables.
3. Obtain baseline demographic and clinical information about members of the study
population from existing computer databases.
4. Minimize the extent to which study measurements are recorded on paper forms.
5. Follow standard data entry conventions.
6. Back up the database regularly.
1. Codd EF. Derivability, Redundancy, and Consistency of Relations Stored in Large Data Banks. IBM Research Report 1969;RJ599.
2. Codd EF. A Relational Model of Data for Large Shared Data Banks. Communications of the ACM 1970;13(6):377-387.
3. Date CJ. The database relational model: a retrospective review and analysis: a historical account and assessment of E.F. Codd's contribution to the field of database technology. Reading, MA: Addison-Wesley; 2001.
4. Date CJ. An introduction to database systems. 7th ed. Reading, MA: Addison-Wesley; 2000.
5. Mata-Toledo RA, Cushman PK. Schaum's outline of fundamentals of relational databases. New York: McGraw-Hill; 2000.