Download document 8109697

Relational Databases and Statistical Processing Andrew Westlake Survey & Statistical Computing +44 20 8374 4723 [email protected] www.SaSC.co.uk Introduction • Good Statistical Analysis needs Good Data – Many things contribute to data quality • Quality of structures and processes for data collection related to quality of data • Good Ideas from Computer Science – Process design methodologies – Database methodologies – Data and process modelling systems • Statisticians can benefit from CS concepts – Particularly for design of complex structures and continuous processes 15-Mar-03 Statistical Databases and RDBMS © S&SC 2 Data Structure • Statistics generally based on a single rectangular dataset – Much statistical data is conceived and stored in this form • Database professionals often use Relational Database systems – Formal model, based on a set of related rectangular datasets – Allows the representation of more complex structures, and easier maintenance and manipulation of the data. • RDB approach valuable for some statistical situations – Review RDB concepts and functionality, plus SQL – Statistical Examples • Pakistan Fertility Survey • Processing and analysis system for HIV and AIDS reporting for England and Wales 15-Mar-03 Statistical Databases and RDBMS © S&SC 3 The Relational Model • A logical specification of the content and behaviour of a database management system, including • The types of structure that can be present in a database • The properties of elements that can be stored in these structures • The operations that can be performed on these structures and their behaviour • Facilities that must be present in the database management system • The general nature of the interactions between the database and its users and administrators – Codd’s Rules specify properties that a Relational DataBase Management System (RDBMS) must possess • SQL – A standard language with which to interact with a RDBMS. 15-Mar-03 Statistical Databases and RDBMS © S&SC 4 Relational Database System Structure DataBase Management System User Processes P Schema Design P P Data Entry Database D Data Dictionary D Data D Rules DBMS (SQL) Insertion Deletion Updates Consistency Retrieval Selection Transformation Sorting Grouping Aggregation P Transaction Processing P MIS Reports User Processes Data Modelling SQL Relational Model Storage & Access Methods Applications & Tools, include functionality, semantics appropriate to application Expressing the real task in terms of the conceptual model Standard interface to an RDBMS, syntax and embedding Specification of functionality, behaviour and scope Implementation issue, affects performance 15-Mar-03 Statistical Databases and RDBMS © S&SC 5 RDBMS Strengths • Relational Model – Precise, formal mathematical specification of structure and behaviour – Appropriate for information about Entities, and Relationships between them • Data Modelling – Useful tools for understanding data structures and flows • SQL – International Standard (SQL2), widely implemented • Current Implementations – Widely available, well supported, good implementations, integration with other products, add-on market for tools 15-Mar-03 Statistical Databases and RDBMS © S&SC 6 Components of the Relational Model • Domain • a set of possible (atomic) values • Attribute • defined over a domain, has a name, cf. variable • Tuple • a set of values, one associated with each attribute, cf. cases or records. NULL values supported • Relation • defined over a set of attributes, has a name, consists of a set of tuples • Relational DataBase • a set of relations 15-Mar-03 Statistical Databases and RDBMS © S&SC 7 Relations Relation Cancer_ Registration Domain Registration ID ID Attribute Tuples 4951 4952 4953 4954 4955 4956 4957 4958 4959 4960 4961 Gender Age in years Sex Age_at_ Registration 2 1 2 2 1 1 1 2 1 2 1 58 68 73 52 49 68 87 74 69 92 45 Relation Lung_ Diseases Domain Attribute ICD Diagnosis String Disease_Name Tuples 162 501 503 Cancer of the Lung Asbestosis Mesothelioma 15-Mar-03 Weight in Kg, 1dp Weight SEG Age in ICD years SEG Age_at_ Diagnosis Diagnosis 54.5 75.8 62.3 48.7 95.2 112.7 84.2 69.4 53.0 83.0 67.4 3 2 5 2 4 3 4 4 1 5 5 53 60 70 45 47 61 85 70 69 92 41 194 285 162 501 162 196 203 162 186 162 503 • Rectangular Data tables, in which columns are variables, rows are cases. • Tables can be linked by the values of common variables. Statistical Databases and RDBMS © S&SC 8 SQL – International Standard, actively revised – Widely available in good RDBMS software – Text language, used by programs and people • Stored or constructed at run-time – Designed to support tools which are independent of the DBMS • cf. Client-Server architecture – User and Programmer skills portable across products and sites – Has sections for defining database structure (DDL), Manipulating database content (DML) Ensuring database integrity, and Managing database security 15-Mar-03 Statistical Databases and RDBMS © S&SC 9 SQL Data Manipulation • All manipulation (retrieval) of data uses the single Select statement, which has various components corresponding to different relational operations. • The result of a SELECT statement is a relational table, which is displayed (by default) or can be stored or processed in another statement. • Retrieval is usually done through a Query interface, which generates the SQL. 15-Mar-03 SELECT {DISTINCT} <expression list> FROM <table list> WHERE <condition> GROUP BY <column list> HAVING <group condition> ORDER BY <column list> Statistical Databases and RDBMS © S&SC 10 Data Manipulation - Join • The Join operation selects the combination of rows from two or more tables that meet a matching condition SELECT R1.*, R2.* FROM R1, R2 WHERE A = X • All matching combinations are retained – Eg can relate household information to all members via household number Join A=X A B C R1.D X R2.D E 1 b1 c1 d1 1 d3 e1 2 b2 c2 d2 2 d5 e2 3 b3 c3 d3 3 d3 e3 15-Mar-03 Statistical Databases and RDBMS © S&SC 11 Data Manipulation - Aggregation • Aggregation functions can be used in the variable list SELECT COUNT(DIAGNOSIS), COUNT(DISTINCT DIAGNOSIS), AVE(WEIGHT) FROM CANCER_REGISTRATION – This produces one record for the whole table containing three values, the number of records with non-null values for Diagnosis, the number of distinct Diagnosis values, and the average Weight. – The available functions are COUNT, SUM, AVE, MIN, MAX. – Aggregation functions can be applied to expressions. 15-Mar-03 Statistical Databases and RDBMS © S&SC 12 Data Manipulation - Grouping • The GROUP BY clause is used to perform aggregation within subgroups, producing a record for each group. SELECT DIAGNOSIS, COUNT(DIAGNOSIS), AVE(WEIGHT) FROM CANCER_REGISTRATION GROUP BY DIAGNOSIS ORDER BY DIAGNOSIS – This produces a record for each diagnosis containing the number of registrations and their average weight as well as the diagnosis code. – It is sorted into diagnosis code order, using the ORDER BY clause. – The Group By clause can contain multiple fields (for crosstabulation). – The expression list can contain variables used in the Group By clause, and aggregate functions for any variable. – A WHERE clause selects the records to be aggregated. – A HAVING clause selects the groups to be returned. 15-Mar-03 Statistical Databases and RDBMS © S&SC 13 Views • Stored query definition – Can be evaluated on demand – Used to present data in the form needed by the user (a person or a process) • Dynamic evaluation – Ensures that the viewed information is up to date – May be inefficient if the information does not change 15-Mar-03 Statistical Databases and RDBMS © S&SC 14 Database Design • Choose physical tables and fields – Capture essential characteristics of real-world system – Support intended uses of the information (directly or through views) – Operate efficiently • Identify important links between tables – Provide appropriate Keys – Choose Indexes and other implementation details • Retain flexibility for future needs • Various methodologies for design process 15-Mar-03 Statistical Databases and RDBMS © S&SC 15 RDB Summary • Relational databases are ubiquitous, and are useful for largescale data collections • Some manipulation and aggregation operations can be done more easily than in statistical packages • Relational model is a useful way of thinking about data structures – May suggest different data structure • • • • No replacement for statistical packages for statistical analysis Flexible manipulation of more complex data structures Useful development tools for processes Interface to Views for Statistical Analysis – ODBC or similar – Extract micro- or macro-data for statistical analysis 15-Mar-03 Statistical Databases and RDBMS © S&SC 16 Databases for Statistics • Simple situations – Use statistical or survey packages • Studies with Complex Structure – Use RDB for flexible manipulation – May store static result tables • Ongoing Data Capture Processes – Use RDB for dynamic data entry, updating, matching and checking – Choose data structure to optimise data processes and quality, not ease of statistical processing – May take static snapshots for regular statistical analysis • Statistical Dissemination – another presentation • Statistical Metadata – yet another presentation 15-Mar-03 Statistical Databases and RDBMS © S&SC 17 PFFPSWID PK,FK1 PK,FK1 PK,FK1 PFFPS – Main structure PFFPSHID PK,FK1,I1 PK Weights PK Cluster Weight 15-Mar-03 I10 I1 I9 I3 I4 I5 I6 I11 CLUSTER H_NO RECORD_TYP PROVINCE AREA DISTRICT VISIT F_DAY F_MON F_YEA INTERVIEW SUPERVISOR DEO RESULT Q16T Q16M Q16F Q17 Q18 Q19 Q20 Q21 Q22 Q23_1 Q23_2 Q23_3 Q23_4 Q23_5 Q23_6 Q23_7 Q23_8 Q24 Q25_1 Q25_2 Q25_3 Q25_4 Q26 Q27 Statistical Databases and RDBMS © S&SC PFFPSHHM PK,FK1,I1 PK,FK1,I1 PK CLUST_HHM H_NO_HHM Q01 RECORD_TYP PROV_HHM AREA_HHM Q03 Q04 Q05 Q06 Q07 Q08 Q09 Q10 Q11 Q12 Q13 Q14 Q15 CLUST_HHM H_NO_HHM Q01 RECORD_TYP PROV_WID AREA_WID LINE_WID DIST_WID F_DAY_WID F_MON_WID F_YEA_WID VISIT_WID WOM_RES Q101H Q101M Q102 Q103 Q104 Q105M Q105Y Q106 Q107 Q108L Q108C Q109 Q110 Q111 Q112 Q113 Q114 Q115 Q201 Q202 Q203S Q203D Q204 Q205S Q205D Q206 Q207S Q207D Q208 Q210 Q218 Q219 Q220 Q221 Q222 Q223 Q224 Q225_1 Q225_2 Q225_3 Q225_4 Q225_5 Q225_6 Q225_7 Q226 Q227 Interview CMC Birth CMC 18 Data Tables for PFFPS Weights PK PK,FK1 PK,FK1 PK,FK1 PK Cluster PFFPSHID PK,FK1,I1 PK PFFPSWBH CLUSTER H_NO PFFPSWID PK,FK1 PK,FK1 PK,FK1 CLUST_WID H_NO_WID L_NO_WID CLUST_WBH H_NO_WBH L_NO_WBH LINE_WBH PFFPSWPB PK,FK1 PK,FK1 PK,FK1 PK,FK1 CLUST_WPB H_NO_WPB L_NO_WPB LINE_WPB PFFPSWC1 PK,FK1 PK,FK1 PK,FK1 PK CLUST_WC1 H_NO_WC1 L_NO_WC1 Q301 PFFPSHHM PK,FK1,I1 PK,FK1,I1 PK CLUST_HHM H_NO_HHM Q01 15-Mar-03 PFFPSWC2 PK,FK1 PK,FK1 PK,FK1 CLUST_WC2 H_NO_WC2 L_NO_WC2 Statistical Databases and RDBMS © S&SC PFFPSWMF PK,FK1 PK,FK1 PK,FK1 CLUST_WMF H_NO_WMF L_NO_WMF PFFPSWSE PK,FK1 PK,FK1 PK,FK1 19 CLUST_WSE H_NO_WSE L_NO_WSE PFFPS Meta Data Structure Record PK Category Record PK,FK1 PK,I1 Field Code Dictionary 15-Mar-03 Field FK1,I1 Record Range Skip PK,FK1 PK,I1 PK Field Code PK,FK1 PK Field Lower Statistical Databases and RDBMS © S&SC 20 HIV and AIDS reporting section, PHLS • Redesign of processing and reporting system • Old system (1994) based on files processed with Epi-Info and Aspect database • Requirements – – – – 15-Mar-03 Simplify management of data Improve consistency Simplify regular reporting Enable ad hoc reporting Statistical Databases and RDBMS © S&SC 21 Old System • Input Notifications • Data Storage – HIV Tests (standard form) – AIDS Diagnosis (standard) – ONS Death Reports • Some notifications on disc, most on paper • Processes • Reporting – Quarterly standard reports – Ad Hoc inquiries (e.g. PQs) – Occasional in-depth analysis • Problems – Continuous Data Entry from paper – Regular Linking for reporting extractions – Occasional Matching within and between files 15-Mar-03 – Separate HIV and AIDS files – Deaths Added – – – – Statistical Databases and RDBMS © S&SC Duplicates and omissions Name matching Repeated linking Inflexible for reporting (eg progression from HIV to AIDS) 22 Old Structure HIV Forms HIV Notification File Quarterly Extracts AIDS Forms Death Reports on disc 15-Mar-03 AIDS Notification File Statistical Databases and RDBMS © S&SC 23 Initial Thoughts • Notification forms appropriate for reporting, but not for primary information store • Information is about Patients – Notifications are sources of information – Record information about events and states • Do matching once, at initial entry • Unified database structure • Additional components for metadata, auditing, contact information • Extracts to separate statistical reporting system 15-Mar-03 Statistical Databases and RDBMS © S&SC 24 Full system 15-Mar-03 Overall Structure Notification Processing Metadata System Patient Data External Organisation Reporting Facilities Auditing System Statistical Databases and RDBMS © S&SC 25 Main Patient Tables D AIDS Indicator PK,FK1,I1 Patient ID PK Indicator Type Code PK Indicator Code D Patient PK Patient ID I11 I5 I2 I10 I7 I9 I8 I1 I6 I3 I4 FK1,U2,U1 Soundex Init DOB Sex Occ PC Out PC In DHA LA Ethnic Gp Exp Audit ID 15-Mar-03 D HIV Infection PK,FK2 FK2,U2,U1 Audit ID Patient ID U3 HivNo FK3 Hospital ID FK1,U2,U1 Audit ID D AIDS Diagnosis PK,FK2 Patient ID U1 AidsNo FK3 Hospital ID FK1,U3,U2 Audit ID Statistical Databases and RDBMS © S&SC D Death Record PK,FK2 Patient ID FK3,U1 FK4,U5 U6 U4 I1 FK1,U3,U2 AidsNo HivNo ONS ID Death ID Date Death Audit ID 26 Patients and Notifications Patient Repeating Components Optional Components HIV Infection Alternative IDs (Soundex, DoB) AIDS Diagnosis External ID Source Notifications HIV Notification (green form) (Context, Institution) AIDS Indicator Action (Type, Date, Outcome) AIDS Notification Previous Status (blue form) (Date) ONS Death Notification Death Record Blood Donation Residence (PostCode, Date) Blood Recipient External Files Follow Up Study (Vicky) 15-Mar-03 Statistical Databases and RDBMS © S&SC 27 D Alternative ID PK,FK2,I1 PK,I3 PK,I2 D User Patient ID Identifier Type Code Identifier Value PK Full HAP Data Structure User Code PK FK1,I1 Session ID D HIV Infection PK,FK2 D Session FK1,I1 User Code FK1,U2,U1 Audit ID Patient ID D Audit Block U3 HivNo FK3 Hospital ID FK1,U2,U1 Audit ID PK D Death Record Patient ID FK3,U1 FK4,U5 U6 U4 FK1,U3,U2 AidsNo HivNo ONS ID Death ID Audit ID PK Notification ID FK3 FK2,I2,I1 U3 FK4,I6 FK1,U1,U2 Patient ID Batch ID Notification No Organisation ID Audit ID Audit ID FK1,I1 Session ID FK2,I2 Creation Session ID D Notification PK,FK2 Session ID D Note PK Note ID FK2,I2,I1 Patient ID FK1,U2,U1 Audit ID D Patient PK Patient ID D Action FK1,U2,U1 Audit ID PK Action ID FK2,I4,I2 Patient ID FK1,U1,U2 Audit ID D External Patient ID D AIDS Indicator PK,FK1,I1 PK PK Patient ID Indicator Type Code Indicator Code FK2,U2,U1 Audit ID PK,FK2 Patient ID Person ID FK1,U2,U1 Audit ID 15-Mar-03 FK1,U1 U1 AidsNo FK3 Hospital ID FK1,U3,U2 Audit ID PK Organisation ID FK2,I2 Director FK3,I3 Primary Contact FK1,U2,U1 Audit ID Statistical Databases and RDBMS © S&SC / Person Audit ID D Contact D Organisation D Person PK PK,FK3,I3 Patient ID PK,FK2,I1 Organisation ID PK,I2 ID in Institution D AIDS Diagnosis / works at PK,FK3,I2 PK,FK2,I1 Organisation ID Person ID FK1,U2,U1 Audit ID 28 New Notification Data Entry • Get Patient details from source • Look for matching existing patients – May need expert to decide – Create new Patient if needed • Open appropriate form for source to complete transfer of Patient and details • Forms know how to handle duplicates (generally take earliest) Enter common Patient identification from a New Notification External Data Sources View Matches with existing Patients Initiate entry of a new Patient or updating an existing one New Patient Create a new Patient record Transfer Patient Transfer information from New Notification HIV Notification AIDS Notification Information about the Patient and HIV Infection taken from a green HIV form Information about the Patient and AIDS Diagnosis taken from a blue AIDS form Death Record Add or Update Death Information for a Patient HIV Infection 15-Mar-03 Notification Statistical Databases and RDBMS © S&SC Death Record AIDS Diagnosis 29 Extraction • Separate Extraction Database template – Linked to main database – Set of associated Excel spreadsheet templates • New copy each quarter – Populate by Extraction of information through Views – Gives a Snapshot of Patient status • Set of aggregation queries – Import results into spreadsheets as source • Reorganise with Pivot tables – Create publication versions 15-Mar-03 Statistical Databases and RDBMS © S&SC 31 Conclusions • Good analysis needs good data – Need right structure for efficient processing • RDB provides flexible structure and manipulation – Useful for complex situations – Can produce the views needed for statistical analysis – Quick to implement at a basic structural level • Good framework for implementing production processes – Important to get the process right for good quality data – Not cheap to implement production processes • Can easily include associated information – Metadata, other supporting information 15-Mar-03 Statistical Databases and RDBMS © S&SC 32 The End • Extended versions of presentations available from – www.sasc.co.uk – Follow links to MSc in Statistical Computing 15-Mar-03 Statistical Databases and RDBMS © S&SC 33

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download document 8109697