Introduction to Relational Databases for Clinical Research

Michael A. Kohn, MD, MPP
[email protected]
Copyright 2007 Michael A. Kohn

Table of Contents

Introduction
Relational Databases, Keys, and Table Normalization
    Table of Study Subjects
    Table of Measurements (One-to-Many Relationship)
    Table of Examiners (Many-to-Many Relationship)
    One-to-One Relationships
    Referential Integrity in a Normalized, Relational Database
    Undesirability of Storing Calculated Values
Data Dictionaries, Data Types, and Domains
    Object Data Type
Extracting Data from the Database (Queries)
    Action Queries
Guidelines for Data Management in Clinical Research

Introduction

A clinical research
study requires definition of the study population, the predictor variables, and the outcome variables. The researcher must determine how to measure the variables and anticipate problems with the measurements. Inevitably, baseline data on the individuals in the study population and measurements of the predictor and outcome variables will reside in a computer database. The software that runs this computer database is the database management system (DBMS). Often the amount of actual study information is small compared to the amount of administrative information, such as patient contact information, exam schedules, and reimbursement records. The DBMS may also store this administrative information, and it is used to update, check, and correct all the data. It will also be used either to analyze the study data or to extract and format the data for export to a statistical analysis package.

Since the original papers outlining the relational database model were published by E.F. Codd in 1969 and 1970,(1, 2) an entire theory of relational database management has evolved.(3-5) This theory is based on mathematical set theory and has its own specific terminology. The clinical researcher need not be familiar with this theory or its terminology, but he or she should understand the concept of a relational database made up of multiple tables in which the rows correspond to entities and the columns correspond to attributes. The clinical researcher should also understand the definitions of primary key and foreign key, and the principle of table normalization. We will develop the concept of a relational database, the definitions of primary and foreign keys, and the principle of normalization using as an example the Infant Jaundice Study, a fictional cohort study to determine whether neonatal jaundice affects neuropsychological scores at five years of age.
We assume that the reader has some experience with collecting and storing clinical research data using spreadsheet or statistical analysis software. Therefore, the reader should be familiar with storing data in a table with rows as records and columns as fields. The reader should also be familiar with basic data types, such as text, integer, real number, and date. Because we assume this familiarity, we can focus initially on the definition of primary and foreign keys and on the principle of normalization, which is the process of breaking a single, complex table with many columns into two or more related tables with fewer columns but more rows. If the general discussion of this process is confusing, we encourage you to focus on the example, particularly Figures 1 through 7.

Relational Databases, Keys, and Table Normalization

A relational database is a collection of spreadsheet-like, two-dimensional tables in which the rows correspond to individual records or entities and the columns correspond to the different characteristics or attributes of these entities. In each table there is a single column or combination of columns that uniquely identifies a row. This column or combination of columns is the table’s primary key. If the table also includes a column or combination of columns that is the primary key in another table, this column or group of columns is called a foreign key. Including a foreign key creates a relationship between the current table and the table for which the foreign key is primary. Tables are related in one of three ways: one-to-many, many-to-many, and one-to-one. Strictly speaking, the term “relational” has little to do with these between-table relationships. In fact, “relation” is the formal term for a table with a primary key. However, the concept of a relational database as a collection of related tables is a useful heuristic.
Most clinical research studies will have a table of study subjects, a table of measurements on those subjects, and a table of examiners who make the measurements. The need for a multitable relational database often first arises when measurements are repeated on individual subjects.

Table of Study Subjects:

All clinical research databases have a table in which each row corresponds to a study participant. In this table of subjects, the columns correspond to participant-specific attributes such as name, birth date, and sex. Each row must have a column value or combination of column values that distinguishes it from the other rows. This column or combination of columns is the primary key. It is highly desirable to create a single subject identification number that functions as the primary key.

Figure 1 shows a table of 13 study subjects for the fictional Infant Jaundice Study that we are using as an example. The Infant Jaundice Study is a cohort study to compare the five-year neuropsychological scores (IQs) of infants with neonatal jaundice to the scores of normal infants from the same birth cohort. Of the 13 subjects listed in the table, 6 had neonatal jaundice and 7 did not. Neither the “DOB” field nor the “FName” field is a candidate primary key, because neither uniquely identifies its row; Helen and Robert have the same birth date, and there are two Amys. The combination of “FName” and “DOB” uniquely identifies a row in this table and could be used as a composite primary key. However, as more children are entered into the study, inevitably two children will share a first name and birth date. Instead, a unique identification number (“SubjectID”) is assigned to each study participant and functions as the primary key (Figure 2). Using a unique subject identifier that has no meaning external to the study database also simplifies the process of “de-linking” study data from personal identifiers for purposes of maintaining subject confidentiality.
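The surrogate-key idea can be sketched directly in SQL. The following is a minimal illustration using Python’s built-in sqlite3 module rather than Microsoft Access (so the SQL dialect, and the sample rows, are assumptions for illustration); the table and column names follow Figure 2, and a CHECK constraint previews the domain rules discussed later.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Surrogate primary key: a study-assigned SubjectID with no external meaning.
cur.execute("""
    CREATE TABLE Baby (
        SubjectID INTEGER PRIMARY KEY,
        FName     TEXT NOT NULL,
        DOB       TEXT NOT NULL,          -- ISO date, e.g. '2005-03-14'
        Sex       TEXT CHECK (Sex IN ('M', 'F')),
        Jaundice  INTEGER CHECK (Jaundice IN (0, 1))
    )
""")

# Two different Amys can share a first name (and even a birth date)
# without ambiguity, because SubjectID distinguishes the rows.
cur.executemany(
    "INSERT INTO Baby (SubjectID, FName, DOB, Sex, Jaundice) VALUES (?, ?, ?, ?, ?)",
    [(1, "Amy", "2005-03-14", "F", 1),
     (2, "Amy", "2005-03-14", "F", 0)],
)

# The CHECK constraint rejects out-of-domain values such as Sex = 'X'.
try:
    cur.execute("INSERT INTO Baby VALUES (3, 'Robert', '2005-07-01', 'X', 0)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)

print(cur.execute("SELECT COUNT(*) FROM Baby").fetchone()[0])
```

Note that nothing about the children themselves makes the two Amy rows distinct; only the surrogate key does, which is exactly why it is the safe choice for a primary key.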
Predictor and outcome variables can be included in the table of study subjects if each subject can only have one measurement of the variable, just as each subject can only have one birth date and one sex. Often, predictor and outcome variables really are measured only once per subject, and all the important study data fit reasonably well into a single, two-dimensional table. When this is the case, the researcher may prefer to store the data using a spreadsheet program or a statistical analysis package. But even when the dynamic data that are added and modified during the course of the study fit into a single, two-dimensional table, a relational DBMS may still be needed to handle the study’s administrative data, such as subject contact information, exam schedules, and reimbursement records. The database management software will also be useful for maintaining lookup tables and for its data-entry, data-formatting, and data-validation features, all to be discussed later.

In the Infant Jaundice Study, the table of subjects (Figure 2) has rows representing individual infants and columns corresponding to subject identification number, name, birth date, sex, and whether the infant had neonatal jaundice. If each subject in the study receives only one neuropsychological exam at age five, the exam results can also be included as a set of columns in the table of study subjects (Figure 3). If the dynamic study data are limited to this one table, they are easily exported to a spreadsheet or statistical package for analysis without any need for rearrangement.1

Some of us have come to refer to a database consisting of a single, two-dimensional table, such as the one depicted in Figure 3, as a “flat file”. However, the original meaning of the term “flat file” was a file consisting of a string of characters that could only be evaluated sequentially (such as a tab-delimited text file).
Many statistical packages have added features to accommodate more than one table, but at their core, most remain single-table or “flat file” databases.

Table of Measurements (One-to-Many Relationship):

The need to include more than one table in a study database often arises first when measurements are repeated on individual subjects. If the same study variable is measured on multiple occasions, then a separate table is required for measurements. The rows in this separate table correspond to individual examinations and include the examination date, the results of the exam, and most importantly, the subject identification number of the examinee (which functions as the foreign key). The relationship between the table of subjects and the table of examinations is one-to-many.

To enable assessment of the inter-rater reliability of the neuropsychological score in our Infant Jaundice Study example, some of the subjects received the neuropsychological exam multiple times from different examiners. If we attempt to include the results of multiple examinations in the subject table, we end up with the situation depicted in Figure 4. The table has to have enough columns to accommodate the participant with the most examinations, even if that participant has 10 more examinations than any other participant. Most of the examination fields will be null, and querying the table to find the number of exams done in a particular time interval will require searching the many different exam date columns.2 The number of columns grows with the maximum number of examinations per subject, so the table could get extremely wide.

1 In the table shown in Figure 3, the mean (± standard deviation) neuropsychological score for the 5 of 6 neonatal jaundice patients who had the outcome measured is 112.6 (± 19.1). For the 4 of 7 controls with measurements, the mean is 101.3 (± 24.2). T-test comparison of these means yields a p-value of 0.46.
This is probably the most common and most fundamental mistake that clinical researchers make in setting up their study databases. Whenever a table has repeating columns, like “ExDate1”, “ExDate2”, etc., and whenever a table gets beyond about 30 columns wide, it is time to re-evaluate the structure of the table. On the other hand, if we create a table in which each row corresponds to an examination, we end up with the situation depicted in Figure 5.

The subject-specific data are repeated multiple times, once for each examination of the same study subject. The subject identification number (“SubjectID”) can no longer function as a primary key. Instead, the primary key for this new table would have to be a combination of “SubjectID” and “ExDate”. Correcting a birth date on a single participant might require changing multiple rows. If you correct the birth date on only some of these rows, the patient may end up with two different birth dates in the database. (See Helen’s records for an example.) Also, querying the table for unique participants born on a particular date is problematic. Most importantly, the table has no place to store individuals without exams. Comparison of the table in Figure 5 with the previous tables will show that subject-specific data on the 4 subjects without exams (Alejandro, Ryan, Zachary, and Jackson) have been lost.3

2 Repeating columns such as those shown in Figure 4 violate the First Normal Form (1NF), which requires that column values be “atomic” and that there be no repeating groups.

3 Redundancy in column values from row to row violates the Second Normal Form (2NF), which requires that non-key column values depend on the entire primary key.

The solution to these problems is normalization, which refers to the decomposition of one wide table into two or more narrower tables without losing any data.
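The decomposition into a subject table and an examination table, with a foreign key tying each exam back to its subject, can be sketched as follows. This is an illustration in SQLite via Python, not the Access database the chapter describes; the names Baby, Exam, SubjectID, and ExamID follow Figures 6 and 7, while the sample values are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked

con.execute("CREATE TABLE Baby (SubjectID INTEGER PRIMARY KEY, FName TEXT, DOB TEXT)")
con.execute("""
    CREATE TABLE Exam (
        ExamID    INTEGER PRIMARY KEY,
        SubjectID INTEGER NOT NULL REFERENCES Baby(SubjectID),
        ExDate    TEXT,
        ExNPScor  INTEGER
    )
""")

con.execute("INSERT INTO Baby VALUES (1, 'Helen', '2005-02-11')")
# One-to-many: several Exam rows point back at the same Baby row.
con.execute("INSERT INTO Exam VALUES (10, 1, '2010-01-15', 105)")
con.execute("INSERT INTO Exam VALUES (11, 1, '2010-02-20', 110)")

# Correcting Helen's birth date now touches exactly one row, so the
# two-birth-dates inconsistency of the wide table cannot arise.
con.execute("UPDATE Baby SET DOB = '2005-02-12' WHERE SubjectID = 1")

# Referential integrity: no exam may be created for a nonexistent subject.
try:
    con.execute("INSERT INTO Exam VALUES (12, 99, '2010-03-01', 98)")
except sqlite3.IntegrityError:
    print("orphan exam rejected")
```

The rejected insert at the end previews the referential-integrity behavior discussed below: the DBMS refuses to create an exam “orphan” whose parent subject does not exist.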
In this case, we normalize the single table in Figure 4 into the two tables in Figure 6: a table of subjects (“Baby”) and a table of examinations (“Exam”). In the table of examinations, the columns represent examination date, examination results, and most importantly, the identification number of the examinee (the foreign key). The primary key in the table of examinations can be the combination of the subject identification number and the exam date, as long as no study subject will be examined twice on the same day. However, creating a single, unique exam identifier (“ExamID”) to function as the primary key in the table of examinations will simplify matters later. The relationship of the subject table to the examination table is one-to-many (Figure 7). Now, querying the examination table for all exams performed within a particular time period requires searching a single exam date column; querying the subject table for unique patients born on a particular date is simple. A change to a subject-specific field like birth date is made in one place, and consistency is preserved. The database can still accommodate subjects, such as Alejandro, Ryan, Zachary, and Jackson, who have no exams.

Table of Examiners (Many-to-Many Relationship):

In addition to a table of subjects and a table of measurements (with multiple measurements per subject), many studies need to maintain a table of examiners to store the characteristics of those making the measurements. In the Infant Jaundice Study, one of a group of physicians performs each neuropsychological assessment. So, the study’s database includes a table in which each row corresponds to a different physician (Figure 8). Since each physician performs many examinations, incorporating the physician-specific information into the table of examinations results in repetition and allows for inconsistencies (Figure 9). This is the same problem we encountered when we tried to put subject-specific information in the exam table (Figure 5).
Again, if we delete a physician’s last remaining examination, we also delete the physician from the database. As above, the solution is to maintain the table of examiners as a separate table and create a one-to-many relationship with the table of examinations (Figures 10 and 11). Note that a second foreign key (“DocID”) has been added to the table of examinations. The relationships diagram in Figure 11 shows that both the “Baby” and the “Doctor” tables have one-to-many relationships with the “Exam” table. Each subject can see several different doctors, and each doctor can examine many different subjects. The relationship between the “Baby” table and the “Doctor” table is many-to-many. The creation of such a many-to-many relationship requires a linkage table like the “Exam” table in this example. Commonly the combination of foreign keys in the linkage table is unique. (No doctor examines the same child twice.)

One-to-One Relationships:

A one-to-one correspondence between rows in two separate tables can always be converted to a single table combining the columns from both separate tables. However, if the information contained in one of the two tables applies to a small proportion of the records in the other, separating the tables and establishing a one-to-one relationship eliminates the large number of empty cells that would occur in the combined table. In the Infant Jaundice Study, a small number of subjects died prior to age five. The circumstances and date of death are important data to capture in the study’s database. However, if we include columns related to the circumstances of death in the table of study subjects, these columns will be empty in the vast majority of rows, since most of the subjects lived to age five (Figure 12).4 Creating a separate table in which each row corresponds to the death of a study subject solves the problem of empty cells (Figure 13).
The relationship of this table of deaths to the table of study subjects is one-to-one (Figure 14). A table involved in a one-to-one relationship does not require a separate column for the foreign key from the related table, since the primary key is also the foreign key.

4 The table depicted in Figure 12 does not violate any normal form and is perfectly acceptable.

Referential Integrity in a Normalized, Relational Database:

Figure 14 is the relationships diagram for a simple, normalized, relational database consisting of only four tables. By structuring the database this way, instead of as a very wide and complex single table, we have eliminated redundant storage and the opportunity for inconsistencies. Each piece of information, such as a subject’s birth date or an examiner’s specialty, is stored in only one place. The DBMS software will maintain referential integrity, meaning that it will not allow creation of an exam record for a subject who does not already exist in the “Baby” table, and it will not allow assigning an exam to a doctor who does not already exist in the “Doctor” table. Similarly, a subject may not be deleted unless and until all that subject’s examinations have also been deleted. Some refer to the record on the “one” side of a one-to-many relationship as the “parent”, while the records on the “many” side of the relationship are the “children”. Using this terminology, referential integrity forbids the creation of “orphans”. Even in this simple example, structuring the database as a four-table relational database has many advantages. The databases for most clinical research studies will include more than four tables. Trying to build such databases using statistical or spreadsheet software is a mistake.

Undesirability of Storing Calculated Values:

Creating a field (column) to store a value that is calculated from other fields is problematic, because updating any of the other fields requires updating the calculated field as well.
Inconsistencies result if one of the “raw-data” fields is updated without updating the calculated field. We will discuss later the alternative to storing calculated fields: recalculating the value in a query. Clinical researchers most often make the error of storing calculated fields by creating an “age” field in addition to the birth date and measurement date fields. If either the birth date or measurement date field is changed, the “age” field becomes inaccurate.

Returning to the Infant Jaundice Study example, Figure 15 shows the subject’s age in months at exam. If a subject’s birth date is corrected in the “Baby” table, the ages in the “Exam” table will be inaccurate. Storing a patient’s birth date and the date of each exam allows calculation of his or her exact age at the time of the exam. We can always calculate age from birth date and exam date, but we cannot work the other way and calculate exam date from birth date and age, or birth date from exam date and age. In general, one should always store the endpoints of an interval rather than the interval itself. Similarly, storing the mean score for a series of measurements means updating that field every time a score is changed.5 Occasionally, it is expedient to store the results of extremely complex calculations. In these situations, procedures and checks are required to ensure recalculation and update whenever one of the “raw-data” fields is updated.

Data Dictionaries, Data Types, and Domains

In focusing on the critical concept of normalization, we skipped over the more mundane concepts of data dictionaries, data types, and domains. So far we have seen tables only in the datasheet view. Each column or field has a name and, implicitly, a data type and a definition. In the “Baby” table (Figure 2), “FName” is a text field that contains the subject’s first name; “DOB” is a date field that contains the subject’s birth date; and “Jaundice” is a yes/no field that indicates whether the study subject had neonatal jaundice.
In the “Exam” table (Figures 6 and 10), “ExWght” is a real-number weight in kilograms and “ExNPScor” is an integer IQ score. The data dictionary makes these column definitions explicit. Figure 16 shows the “Baby” and “Exam” tables in table design (or “data dictionary”) view. Note that the data dictionary is itself a table with rows representing fields and columns for field name, field type, and field description. Since the data dictionary is a table of information about the database itself, it is referred to as “metadata”.6

Each field also has a domain or range of allowed values. For example, the allowed values for the “Sex” field are “M” and “F”. The DBMS will not allow entry of any other value in this field. Similarly, the “ExNPScor” field allows only integers between 40 and 200. Creating these validation rules affords some protection against data entry errors. Some of the field types come with automatic validation rules. For example, the DBMS will always reject a date of April 31.

5 Storing calculated values such as “AgeInMonths” in Figure 15 represents a violation of the Third Normal Form (3NF), which requires that all non-key columns be mutually independent. Since tables in 3NF must also be in 2NF, we can say that all non-key attributes should depend on “the key, the whole key, and nothing but the key”.

6 Although Figure 16 displays two data dictionaries, one for the “Baby” table and one for the “Exam” table, the entire database can be viewed as having a single data dictionary rather than one dictionary for each table. For each field in the database, the single data dictionary requires specification of the field’s table name in addition to the field name, field type, field description, and range of allowed values.

Object Data Type

In addition to the text, number, date, and yes/no data types, there is also an “object” data type. Sometimes called a BLOB (Binary Large Object), an object is a file associated
with an application (computer program) that can interpret it. A common example would be an image file in JPEG format. If the Infant Jaundice Study required a photograph of each subject at the time of his or her exam, we could have included an object field in the “Exam” table to store the photograph as a JPEG file. A word-processed document or a spreadsheet could also be stored in an object field. Generally, one cannot sort or search on an object field.

Extracting Data from the Database (Queries)

Once the database has been created and populated with some data, the user will want to organize, sort, filter, and view or “query” the data, as well as add, modify, or delete records (rows). The standard language for manipulating data in a relational database is called Structured Query Language or SQL (pronounced “sequel”).7 All relational database software systems use one or another variant of SQL, but they also provide a graphical interface for building queries that makes it unnecessary for the clinical researcher to learn SQL. In the examples that follow, we will use the graphical query interface provided by Microsoft Access.

A query can join data from two or more tables, display only selected fields, and filter for records that meet certain criteria. Suppose that, in the Infant Jaundice Study, we are interested in the age at exam of subjects who were examined in January and February of 2010. In order to calculate age at exam, we will need both the date of exam from the “Exam” table and the date of birth from the “Baby” table. Figure 17 shows the structure of a query that joins the “Exam” and “Baby” tables on the “SubjectID” field and displays the “SubjectID”, “DOB”, and “ExDate” fields, but only for exams that were performed in January or February of 2010.

7 SQL has three sublanguages: DDL (Data Definition Language), DML (Data Manipulation Language), and DCL (Data Control Language).
Strictly speaking, DML is the SQL sublanguage used to view, organize, and extract data, as well as to insert, update, and delete records.

The SQL associated with this query follows:

SELECT Baby.SubjectID, Baby.DOB, Exam.ExDate
FROM Baby INNER JOIN Exam ON Baby.SubjectID = Exam.SubjectID
WHERE Exam.ExDate Between #1/1/2010# And #2/28/2010#
ORDER BY Exam.ExDate;

Again, although all relational database applications use SQL, the clinical researcher need not learn it, because he or she can use a graphical query designer such as the one shown in Figure 17. Executing the query yields the results shown in Figure 18. There were 14 examinations performed in January and February of 2010.

Note that the result of a query that joins two tables, displays only certain fields, and selects rows based on special criteria still looks like a table in datasheet view. One of the tenets of the relational database model is that operations on tables produce table-like results. It is important to remember, however, that the query result displays the data as they reside in the tables. The query is a window through which to view the data in the tables. In fact, queries are sometimes called “views”. Changing a field value in a query actually changes the value in the underlying table; deleting a row from a query deletes the record from the underlying table.

Queries can also display the results of calculations based on raw-data fields from the tables. As discussed above, storing only the raw-data fields and recalculating derived fields has numerous advantages. In this example, we want to calculate the age in months of each child at the time of the exam. Figure 19 shows the calculated field added to the query design, and Figure 20 shows the results of the query in Figure 19. Because the “AgeInMonths” value is calculated rather than stored, it will automatically reflect any changes to the raw-data fields (“DOB” and “ExDate”) from which it is calculated.
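For readers who want to see this behavior end to end, here is a sketch of the same join-plus-calculated-field query run in SQLite via Python. The date-arithmetic expression Access would use is replaced here by SQLite’s julianday() function, and the sample rows are invented, so the numbers differ from Figures 18 and 20.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Baby (SubjectID INTEGER PRIMARY KEY, DOB TEXT)")
con.execute("CREATE TABLE Exam (ExamID INTEGER PRIMARY KEY, SubjectID INTEGER, ExDate TEXT)")
con.execute("INSERT INTO Baby VALUES (1, '2005-03-14')")
con.execute("INSERT INTO Exam VALUES (10, 1, '2010-01-15')")
con.execute("INSERT INTO Exam VALUES (11, 1, '2010-06-01')")  # outside the date window

# Join the tables, filter on the exam-date window, and derive age at exam
# in the query itself instead of storing it in a table.  julianday() gives
# date differences in days; 30.44 approximates the mean month length.
rows = con.execute("""
    SELECT Baby.SubjectID, Baby.DOB, Exam.ExDate,
           ROUND((julianday(Exam.ExDate) - julianday(Baby.DOB)) / 30.44, 1)
               AS AgeInMonths
    FROM Baby INNER JOIN Exam ON Baby.SubjectID = Exam.SubjectID
    WHERE Exam.ExDate BETWEEN '2010-01-01' AND '2010-02-28'
    ORDER BY Exam.ExDate
""").fetchall()
print(rows)  # one qualifying exam; AgeInMonths is roughly 58 months
```

If the birth date in Baby is corrected and the query is re-run, AgeInMonths changes automatically, which is the whole argument for recalculating rather than storing it.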
Perhaps the purpose of determining age in months at the exam date is to compare another calculated value, Body Mass Index (BMI), to norms for age and sex. Figure 21 shows a query that calculates BMI from weight (“ExWght”) and height (“ExHght”), in addition to age in months, and displays the subject’s sex as well. Figure 22 shows the results of this query, which could be used to compare each individual subject’s BMI to norms based on sex and age in months.

Action Queries

The queries demonstrated so far are “select” queries, so called because they are based on the SQL “select” command. They filter, sort, restrict, and display the data stored in the database tables and, as mentioned above, are sometimes called views. Another category of query is the “action” query, which actually changes the data in the tables. The three types of action queries are 1) the update query, which changes the values of specific fields in existing records; 2) the append or insert query, which adds new records (rows) to a table; and 3) the delete query, which deletes records from a table.

Update Queries

Append (Insert) Queries

Delete Queries

Guidelines for Data Management in Clinical Research

1. Establish the database tables, their rows and columns, and their relationships correctly at the outset. A poorly organized database makes data maintenance and retrieval nearly impossible. Make sure the data are normalized. Avoid data structures that require duplicate data entry or redundant storage. Sometimes it helps to start with the data collection forms, but you do NOT need one table per data collection form. One form can combine data from several tables, and data from one table can appear on several forms. Whether you start with data collection forms or data tables is irrelevant, as long as the process is iterative.
You can start with the tables and then develop the forms, test the forms, find problems, and update the tables, or you can start with a word-processed form, create the tables, test, and update.

2. Establish and follow naming conventions for columns and tables. Short field names without spaces or underscores are convenient for programming, querying, and other manipulations. Instead of spaces or underscores, use “IntraCaps” (upper-case letters within the variable name) to distinguish words, e.g., “StudyID”, “FName”, or “ExamDate”. Table names should be singular, e.g., “Baby” instead of “Babies”, “Exam” instead of “Exams”.

3. Obtain baseline demographic and clinical information about members of the study population from existing computer databases. Avoid re-entering data that are already available (in digital format) from other sources. In the Infant Jaundice Study, the patient demographic data and contact information are obtained from the hospital database. Computer systems can almost always produce character-delimited or fixed-column-width text files that the database management system can import.

4. Minimize the extent to which study measurements are recorded on paper forms. Enter data directly into the computer database, or move data from paper forms into the computer database as close to the data collection time as possible. When you define a variable in a computer database, you specify both its format and its domain or range of allowed values. Using these format and domain specifications, computer data entry forms give immediate feedback about improper formats and values that are out of range. The best time to receive this feedback is when the study subject is still on site. If having a paper copy of the data is important, you can always print out a record immediately after collecting it. (This is equivalent to getting an ATM receipt at the end of the transaction.)

5. Follow standard data entry conventions.
Several conventions for data entry and display have developed over time. Although most users of screen forms are not aware of these conventions, they have come to expect them subconsciously. For example, a series of mutually exclusive, collectively exhaustive choices is usually displayed as an “option group” consisting of several different “radio buttons”, whereas choices that are not mutually exclusive are displayed as check boxes.

6. Back up the database regularly, and check the adequacy of the backup procedure by periodically restoring a file from the backup medium.

Guidelines for Data Management in Clinical Research

1. Establish the database tables, their rows and columns, and their relationships correctly at the outset.
2. Establish and follow naming conventions for columns and tables.
3. Obtain baseline demographic and clinical information about members of the study population from existing computer databases.
4. Minimize the extent to which study measurements are recorded on paper forms.
5. Follow standard data entry conventions.
6. Back up the database regularly.

References

1. Codd EF. Derivability, Redundancy, and Consistency of Relations Stored in Large Data Banks. IBM Research Report 1969;RJ599.
2. Codd EF. A Relational Model of Data for Large Shared Data Banks. Communications of the ACM 1970;13(6):377-387.
3. Date CJ. The Database Relational Model: A Retrospective Review and Analysis. Reading, MA: Addison-Wesley; 2001.
4. Date CJ. An Introduction to Database Systems. 7th ed. Reading, MA: Addison-Wesley; 2000.
5. Mata-Toledo RA, Cushman PK. Schaum’s Outline of Fundamentals of Relational Databases. New York: McGraw-Hill; 2000.