Data Warehousing for Scientific Behavioral Data

Baiju H. Devani

A thesis submitted to the School of Computing in conformity with the requirements for the degree of Master of Science

Queen's University
Kingston, Ontario, Canada
June 2004

Copyright (c) Baiju H. Devani, 2004

Abstract

Building a data management model for scientific applications poses challenges not normally encountered in commercial database development. Complex relationships and data types, evolving schemas, and large data volumes (in the terabyte range) are some commonly cited challenges with scientific data. In this thesis, we propose a data warehouse model to manage and analyze scientific behavioral data. Data warehousing is popular in customer-centered environments and encapsulates the process of transforming and aggregating operational data and bringing it to a platform optimized for Online Analytical Processing (OLAP). A database schema ubiquitous in data warehousing is the dimensional, or star, schema. In this thesis, we develop a proof-of-concept data warehouse system for a scientific laboratory at Queen's University that is conducting behavioral studies in the area of limb kinematics. The system is based on three primary technologies: a Perl-based parsing grammar for transforming and cleaning source data, an object-relational data management system based on IBM's Universal DB2 system, and a Java-based front-end interface that is accessible through MathWorks Inc.'s Matlab system.

Acknowledgments

I would like to thank Queen's University for giving me the opportunity to pursue an MSc. degree. I would also like to thank my supervisors, Dr. Glasgow, Dr. Martin, and Dr. Scott, for their academic guidance and support throughout this period. I appreciate their patience, and their confidence in me. I would also like to thank my family, especially my parents, for their steadfast moral and financial support. I could not have come this far without their blessings.
Finally, I would like to thank all my friends for making my stay in Kingston memorable. I have had some good times and I take good memories with me. A special thanks to Noorin for always being there to provide support during stressful times, and also, always being there to celebrate my successes.

Contents

Abstract
Acknowledgments
Contents
List of Tables
List of Figures

1 Introduction
  1.1 Motivation
  1.2 Thesis Organization

2 Scientific Problem Description
  2.1 Introduction
  2.2 Limb Kinematics And Primary Motor Cortex
  2.3 Information/Data Management Problems Posed By KINARM
    2.3.1 Management
    2.3.2 Complexity
    2.3.3 Relevance
    2.3.4 Analysis
  2.4 Summary

3 Background
  3.1 Introduction
  3.2 Conceptual Framework
    3.2.1 Relational Database Model
    3.2.2 Object-Oriented Model
    3.2.3 Object-Relational Model
    3.2.4 Analytical Versus Transaction Processing Systems
  3.3 Data Warehouse
  3.4 Related Work
  3.5 Research Methodology
  3.6 Summary
4 System Overview
  4.1 Introduction
  4.2 System Requirements
  4.3 Existing Data Organization
  4.4 System Architecture
    4.4.1 Parse Grammar
    4.4.2 Data Warehouse
    4.4.3 Matlab Interface
  4.5 Summary

5 Analysis
  5.1 Introduction
  5.2 Query Support
    5.2.1 Metadata Query
    5.2.2 Trial Data Query
  5.3 Data Management
  5.4 Data Analysis
  5.5 Operational Aspects
  5.6 Emergent Issues
    5.6.1 Schema Evolution
    5.6.2 Scalability
  5.7 Summary

6 Conclusion And Future Works
  6.1 Thesis Summary
  6.2 Key Limitations And Possible Solutions
    6.2.1 Arrays To Store Signals
    6.2.2 Source Data Upload
  6.3 Future Work
  6.4 Summary
Bibliography

Appendices
A Matlab Scripts
  A.1 Metadata query 1
  A.2 Metadata query 2
  A.3 Trial data query
B Statistical Formulae
C A sample .pro file
D Regular Expressions For Parsing Grammar
Glossary

List of Tables

3.1 Research Process
5.1 File-based System Versus The Data Warehouse System
6.1 Research Evaluation Criteria

List of Figures

2.1 The KINARM Device
3.1 A Simple Relational DBMS Model
3.2 A Simple Object-Oriented Model
3.3 Data Warehouse Architecture
3.4 Dimensional Schema
3.5 IS Research Steps
4.1 Data Flow In The File-based System
4.2 .pro Data Structure
4.3 Data Warehouse System Overview
4.4 Parsing Grammar Rules
4.5 Grammar Tree
4.6 System Schema
4.7 User-defined Objects And Typed Tables In DB2
4.8 Data Hierarchy In Typed Fact Table
5.1 Metadata Query 1
5.2 Metadata Query 1 - Results
5.3 Metadata Query 2
5.4 Metadata Query 2 - Results
5.5 Trial Data Query
5.6 Trial Data Query - Results
5.7 Trial Data Query - Results 2
5.8 Results From Java-based Matlab Interface
6.1 Array Structure In A Fact Table
6.2 Sample Query With Array Structures

Chapter 1

Introduction

1.1 Motivation

Data management systems were popularized in the 1970s by the introduction of the relational data model. Since then, these systems have evolved from being used primarily for transactional processing workloads to systems that integrate and store large amounts of data primarily for analytical purposes. These analytical systems, commonly referred to as data warehouses, support complex analysis of data and decision making in organizations. Data warehousing facilitates the use of technologies such as on-line analytical processing (OLAP), decision support systems (DSS), and data mining software, all of which try to make sense of the large amounts of data generated by organizations [19, 31, 22].

The success of data warehouses in the business world motivates us to examine their use in a scientific environment. To scientists, the tasks of collecting, storing, and analyzing data are part of their core activity. Scientists are in the business of generating knowledge from data, yet database systems in general, and data warehouses in particular, are not as popular in the scientific community as they are in the business community.
One reason for this is that traditional database systems assume that the data and the processes generating the data are well defined. Scientific data and processes, on the other hand, inherently shift with domain knowledge. A data model for such an environment needs to be flexible and should easily allow data/schema evolution without rendering historical data useless. This is especially hard to implement in traditional database systems due to the structural rigidities imposed on data types and on the relationships among the data. Furthermore, scientific research generates large amounts of data with complex relationships. For example, scientific laboratories conducting behavioral experiments may collect data with high dimensionality and millions of data points per experiment. The challenge from a data modelling and analysis point of view is to develop a model that captures the complexity and richness of the data while creating an efficient framework for storing, querying, and analyzing scientific data.

Research Statement: The purpose of this thesis is to propose a data warehousing model, based on an object-relational database management system, to address the problems of managing and analyzing large amounts of data generated from scientific behavioral experiments. In order to accomplish this goal, the following objectives are identified:

1. Identify applicable models and technologies and develop a data warehouse for a specific scientific problem to act as a proof-of-concept system.

2. Demonstrate the effectiveness and efficiency of the system by:

   (a) Developing tools or interfaces that allow researchers to query and analyze data.

   (b) Illustrating the value added by the data warehouse system in terms of facilitating more efficient management and analysis of behavioral data.

3. Finally, through system implementation, identify lessons learnt and solutions that could be applied to future data warehouse projects for behavioral data.
As a proof-of-concept system, we have developed a data warehouse model for behavioral research done by Dr. Stephen Scott, professor of Anatomy and Cell Biology at Queen's University, and his group. Dr. Scott's research investigates limb motor coordination and the role of different regions of the brain in such movements [42, 43]. This research generates large amounts of complex data and gives us an ideal opportunity to develop a data warehouse system for a practical problem. Dr. Scott's research and data are described in detail in Chapter 2.

1.2 Thesis Organization

The thesis is organized as follows. Chapter 2 gives a background on Dr. Scott's research and the data management and analysis problems it poses. Chapter 3 outlines the background on core data management systems, discusses related work in this area, and describes a research methodology for this thesis. Chapter 4 describes the data warehouse that is developed as a proof-of-concept system, and discusses key design decisions taken prior to, and during, implementation. Chapter 5 evaluates the data warehouse system and tests it against the current file-based approach. Finally, Chapter 6 summarizes this thesis and describes future work in this area.

Chapter 2

Scientific Problem Description

2.1 Introduction

As a proof-of-concept system for this thesis, we have developed a data warehouse for Dr. Stephen Scott's research laboratory in the Department of Anatomy and Cell Biology, Queen's University. Dr. Scott's research group is studying the role of the primary motor cortex (a region of the brain) in controlling limb movements. The data generated by the lab has all the qualities of scientific data that make it hard to model, namely: a) large volume, b) evolving/changing structure, and c) complex relationships among the data.
In this chapter, we explore these issues to gain a better understanding of both the data that is to be managed and the underlying scientific process generating the data. The chapter begins by outlining the research paradigm used by Dr. Scott's lab and the data it generates. We then describe the key characteristics of the data and the challenges posed from a data management and analysis point of view.

2.2 Limb Kinematics And Primary Motor Cortex

While it is generally accepted that the primary motor cortex plays a significant role in multi-joint limb movements, the exact nature of that role is not yet clear [42, 43]. One research paradigm for studying this problem, pioneered by Dr. Stephen Scott, uses KINARM (Kinesiological Instrument for Normal and Altered Reaching Movement). The KINARM device (shown in Figure 2.1) is an exoskeleton(1) that can sense and perturb planar limb movements. This allows researchers to record brain activity while measuring and manipulating the physics of the limb. Furthermore, KINARM behaviors are visually guided, thus allowing researchers to understand how sensory information guides motor action.

Using the paradigm above, researchers study a number of motor behaviors or tasks. For example, a simple task involves the subject moving the limb to a target projected on a planar surface. The movement is constrained by requirements such as moving to the target in a certain amount of time and following a straight hand path to the target. During task execution, KINARM measures variables of interest related to limb movement. In this way, a number of complex behavioral experiments can be designed. These experiments vary in:

• Spatial positions of the targets (direction of movement).

• Mechanics of the movements. For example, loads that aid or resist the movement are added such that the subject has to overcome the load to reach the target, or has to resist the load to avoid overshooting the target.
• The sequence of the movement (the order in which subjects move to the targets).

(1) Exoskeleton here refers to a mechanical structure on the outside of the body.

Figure 2.1: The KINARM device is the primary device used in Dr. Scott's lab for studying multi-joint limb movement [42, 43]. (a) shows the limb placed in the exoskeleton, which is attached to motor linkages that can independently manipulate the elbow and shoulder joints during a task. The red dot shows a target light projected on the horizontal movement plane. An experimental task consists of movements to different spatial targets under varying load conditions. (b) Electrodes passed transdurally record neural activity in the cortical region during various tasks.

The goal of these task experiments is to dissociate the limb motion from the underlying muscular/neural forces used to generate it, and thereby gain insight into how such precise movements are coordinated and generated by the brain.

2.3 Information/Data Management Problems Posed By KINARM

The KINARM research paradigm described above generates large amounts of behavioral data. For example, the neural data, measured at a frequency of 4000Hz, can result in thousands of data points per movement (even when re-sampled at a lower frequency). At present, data is stored in files saved on standard 700MB disks. There are currently about 150 disks, making a total database of roughly 120GB. Furthermore, with new equipment being installed, such as the Plexon data acquisition system [37], the rate of data acquisition is going to increase and a terabyte database is conceivable in the near future. In addition to large data volumes, the above paradigm generates a complex data-set in terms of the types of data collected (electromyogram (EMG), neural, and kinesiological), the relationships between data entities, and the temporal nature of the data.
The resulting data management and analysis problems are described below, grouped into four categories: management, complexity, relevance, and analysis.

2.3.1 Management

As described above, significant data volumes are collected during behavioral experiments. With the current file-based set-up, such volumes present the following problems:

1. Lack of query tools: This makes data management a daunting task. For example, a simple question such as "Do we have enough cells for analysis xyz?" requires a researcher to manually sift through written logs of experiments and identify cells of interest.

2. Uncontrolled data redundancy: Since there is no centrally accessible and shared data repository, individual researchers copy and store data relevant to their analyses on local hard drives. Such uncontrolled redundancy wastes hardware resources and makes it hard to maintain data consistency. For example, corrections of corrupt data or new experimental data need to be communicated to every potential user.

2.3.2 Complexity

Not only are large data volumes collected from behavioral experiments, but the data collected is also complex. Data complexity arises from the following factors:

1. Complex relationships between the data types. For example, each experiment or task is a set of movements to different spatial targets. Data for each movement towards a target is stored in a separate file. Each file has metadata describing global aspects of the movement, as well as trial-specific metadata. A more detailed discussion of the data organization is given in Chapter 4; at present it is sufficient to take note of this complexity.

2. Another source of complexity is the evolving/shifting nature of the data model and the underlying scientific process. Since a data model is an abstraction of a real-world concept, the model has to change with changes in domain knowledge.
For example, as knowledge is gained from behavioral experiments, new tasks or behaviors might be defined, or new signals might be introduced. Also, as new knowledge is gained, data needs to be re-analyzed. Thus, capabilities such as ad-hoc querying and analysis become very important.

3. Finally, an additional source of complexity is the temporal nature of behavioral data. A data signal in a behavioral experiment is recorded over the period of movement of a limb towards a target. Such time-series data adds complexity to the analysis process because, in most cases, a researcher needs to analyze different subsets of this series. For instance, a researcher could ask for the cell(2) discharge rate between the time the target light was projected and the time the movement started. Another source of complexity is the inherent temporal shift between the different signals. For example, there is a lag between when a neuron discharges and when that discharge translates into an observable limb behavior. The data model should take into account the need to extract and analyze data based on temporal queries.

(2) The terms cell and neuron are used interchangeably throughout the thesis and refer to a biological cell which conducts electric neural impulses from one part of the body to another.

2.3.3 Relevance

Another challenge posed by KINARM is that biological data is inherently noisy. Furthermore, noise is also introduced by the device measuring the biological signals. This means that raw data cannot be analyzed without filtering/processing it. However, this processing involves a loss of information which might be required in the future, and thus raw data cannot be discarded. For example, a researcher might switch between analyzing raw data and filtered data depending on the signal of interest.
The data model should be able to preserve both views of the data, encapsulate the process of transforming raw data into processed data, and provide the option of accessing/querying either source (raw or processed).

2.3.4 Analysis

From a data analysis point of view, Dr. Scott's research faces the following challenges in the current environment:

1. As mentioned previously, the lack of query capabilities makes it hard to identify data of interest. For example, at present a researcher cannot ask the following without writing a small program: "Retrieve data where task=a and subject=b and date > 01/01/2001". Additionally, there is no mechanism to extract only the signals relevant to a particular analysis. For example, in a typical experiment, as many as 32 signals might be recorded. Of these, a researcher might only need 2 signals for a particular analysis. However, this is not possible in the current file-based system. With the large data volumes at hand, a significant amount of time goes into disk I/O, with most analyses requiring large amounts of RAM (Random Access Memory). Also, significant effort and
programming skill is required simply to access relevant data and bring it to the analysis platform. This creates a steep learning curve for researchers new to the lab, most of whom are from a life-sciences background and are attached to the lab for relatively short periods of time.

2. The current file-based environment does not provide adequate support for implementing data mining algorithms. With large data volumes and the nature of the research at hand, data mining is a logical next step in terms of automating analysis and knowledge generation. Evidence suggests that data mining efforts can be significantly reduced by a data structure that can be queried. Hirji [24] notes in his study that 30% of the total effort in implementing data mining projects is spent on data preparation. He further cites studies by Cabena et al. [9] that suggest data preparation could take as much as 70% of the total effort. A well-structured data source with fast query capabilities can potentially aid data mining. Thus one can argue that a database system is the logical predecessor of any data mining effort.

2.4 Summary

This chapter has briefly outlined the behavioral research conducted in Dr. Scott's lab. We have also identified practical data management and analysis problems faced by researchers in his lab. We proceed by giving a background on data management systems and data warehousing in the next chapter, and then outline a warehouse system for Dr. Scott's lab in Chapter 4.

Chapter 3

Background

3.1 Introduction

The previous chapter discussed Dr. Scott's research, and the data analysis and management problems faced by researchers in his lab. This chapter has the following three goals:

1. Give a background on data warehousing and the core data management technologies on which it is based.

2. Outline related work in the area of management and analysis systems for scientific data.

3. Outline a research methodology for this thesis.

3.2 Conceptual Framework

Relational and object-oriented models are currently the most widely used database technologies. The growth of relational database systems was driven by the need for fast transaction-processing systems. Object-oriented database concepts were driven by the need for better modelling and storage of complex data such as that found in scientific applications. Furthermore, object-oriented programming languages such as Java and C++ integrate well with object-oriented databases. More recently, relational database technology has been augmented with object-oriented features; such systems are described as object-relational databases. These core technologies are described in detail below.
3.2.1 Relational Database Model

Relational Database Management Systems (RDBMSs) were first introduced by Codd in his seminal 1970 paper "A relational model of data for large shared data banks" [12]. Since then, the relational model has been one of the most widely implemented and studied database models. In this model, a database is described in terms of relations, attributes, and tuples. Plainly speaking, this translates to tables (relations), columns (attributes), and rows (tuples). The value that a datum can take is constrained by its domain. For example, the column "Name" could have a domain of ten characters. Thus a table can be thought of as a collection of related data values [19]. Figure 3.1 illustrates a sample relational structure.

Each row in a table is normally identified by a unique primary key (or a set of keys that are collectively unique) and may be related to tuples in other tables through foreign keys. For instance, consider Figure 3.1. The Student Information table is linked to the Parent Information table via the StudentId field. This field acts as the primary key (PK) in the student table, and as a foreign key (FK) in the parent table. In this way, relationships amongst tables can be defined. Furthermore, we can identify the numeric relationship between the tuples in each table: each student can have one or two parents, and each parent can have one or many children (students). This is referred to as the cardinality of the relationship. Some popular examples of RDBMSs are IBM's Universal DB2 system [25], Microsoft SQL Server [14], and the Oracle Data Management System [16].
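The primary-key/foreign-key mechanism described above can be sketched with a few lines of Java. This is an illustrative sketch only, not from the thesis: the two tables of Figure 3.1 are held as in-memory rows, and a naive nested-loop join resolves the relationship on StudentId the way a relational engine would (a real system would express this in SQL and use optimized join algorithms; all class and method names here are ours).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: the Student/Parent tables of Figure 3.1 joined on the StudentId key.
public class JoinSketch {

    static class StudentRow {
        final int studentId; final String firstName; // StudentId is the PK
        StudentRow(int id, String first) { studentId = id; firstName = first; }
    }

    static class ParentRow {
        final int studentId; final String firstName; // StudentId is the FK
        ParentRow(int id, String first) { studentId = id; firstName = first; }
    }

    // Naive nested-loop join on the StudentId key.
    static List<String> joinOnStudentId(List<StudentRow> students,
                                        List<ParentRow> parents) {
        List<String> result = new ArrayList<>();
        for (StudentRow s : students)
            for (ParentRow p : parents)
                if (p.studentId == s.studentId)
                    result.add(s.firstName + " <- " + p.firstName);
        return result;
    }

    public static void main(String[] args) {
        List<StudentRow> students = Arrays.asList(
                new StudentRow(123456, "John"),
                new StudentRow(654321, "Karen"));
        List<ParentRow> parents = Arrays.asList(
                new ParentRow(123456, "Senior"),
                new ParentRow(654321, "Catherine"),
                new ParentRow(654321, "Tom"));
        // Cardinality: John joins to one parent tuple, Karen to two.
        System.out.println(joinOnStudentId(students, parents));
    }
}
```

The join produces one row per matching (student, parent) pair, making the one-to-many cardinality of Figure 3.1 visible: Karen appears twice because two parent tuples carry her StudentId as a foreign key.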
Student Information Table

StudentId | Last Name | First Name | Gender | DOB
123456    | Doe       | John       | M      | 01/01/1980
654321    | Smith     | Karen      | F      | 30/03/1981

Parent Information Table

StudentId | Last Name | First Name
123456    | Doe       | Senior
654321    | Smith     | Catherine
654321    | Smith     | Tom

ER View: the Student Information entity (PK: StudentId; Last Name, First Name, Gender, DateOfBirth) relates to the Parent Information entity (FK1: StudentId; Last Name, First Name) with cardinality 1..2.

Figure 3.1: A simple relational model showing a one-to-many relationship between student and parent information tables. The bottom portion represents the schema in Entity-Relationship (ER) notation. The top portion gives a physical view of the tables by populating them with sample data points.

The Structured Query Language (SQL) [29] serves as a standardized data definition, query, and update language for all relational database systems. SQL provides a simple and efficient interface for describing and querying relational databases. With three decades of experimentation, RDBMSs have evolved into a robust database technology with many strengths, including well-developed concurrency controls, backup and recovery functions, optimized query engines, and efficient indexing schemes.

However, relational database systems have limited data modelling capabilities. The only data structure available is the row-column structure. Furthermore, it is not well suited to the storage of complex data types such as multimedia objects and text. The system is also rigid in terms of schema evolution. For example, dropping an attribute or a column from a table would require the entire table to be recreated. These limitations gave rise to a new database model: the object-oriented model.

3.2.2 Object-Oriented Model

Object-Oriented Database Management Systems (OODBMSs) are closely related to object-oriented languages. Data is stored as objects that are described in terms of their attributes and the functions that work on them [19].
Objects are abstract entities that have attributes, and methods/functions to manipulate or extract those attributes. For example, consider the simple relational model outlined in Figure 3.1. The same example is illustrated as an object-oriented model in Figure 3.2, which shows a Student class defined in terms of the studentId, lastName, firstName, gender, dateOfBirth, and parents attributes, with methods such as getAge. In this case, the parents attribute is itself an array of objects of the Parent class. In this way, we can capture complex relationships between real-world entities.

Figure 3.2 contents:

(1) UML view: the Student class (+studentId : int, -lastName : string, -firstName : string, -gender : char, -dateOfBirth : string, -parents : Parent; methods +getName() : string, +getParent() : string, +getAge() : int, +store()) is associated with the Parent class (-lastName : string, -firstName : string; method +getName() : string) with cardinality 1 to 1..2.

(2) Java pseudocode for the Student class:

    class Student {
        // member attributes
        private int studentId;
        private String lastName;
        private String firstName;
        private char gender;
        private String dob;
        // array of Parents
        private Parent[] parents;

        // constructor
        public Student(int id, String lname, String fname, char gen,
                       String dateOfBirth, Parent[] p) {
            studentId = id;
            lastName = lname;
            firstName = fname;
            gender = gen;
            dob = dateOfBirth;
            parents = p;
        }

        // get student name
        public String getName() { ... }

        // get parent name
        public String getParent() { ... }

        // get age
        public int getAge() { ... }

        // persistency method
        public void store() { ... }
    }

(3) Instantiation of the objects with sample data:

    public class Studentdb {
        public static void main(String[] args) {
            // instantiate parent objects
            Parent dad = new Parent("Smith", "Tom");
            Parent mom = new Parent("Smith", "Catherine");
            Parent[] p = new Parent[] { dad, mom };

            // create student object
            Student s = new Student(654321, "Smith", "Karen", 'F',
                                    "30/03/1981", p);
            s.store();
        }
    }

Figure 3.2: A simple object-oriented model for the example outlined in Figure 3.1. (1) shows the model in Unified Modelling Language (UML) notation.
The Student object is composed of the lastName, firstName, gender, dateOfBirth, and parents attributes. The parents attribute is itself an object of the Parent class. An OODBMS supports the storage and management of such objects, thereby making them persistent. (2) shows Java pseudocode for the class corresponding to the Student object. (3) shows an instantiation of the objects with sample data.

Once this general class is defined, individual objects with unique identities can be instantiated. However, objects in a programming language are transient and do not exist outside the program. An OODBMS facilitates the storage, indexing, and retrieval of these objects, thereby giving them persistency and allowing objects to be exchanged between applications. Support for concepts like inheritance allows new data classes to be described in terms of existing classes. Furthermore, unlike the relational model, the object-oriented model tightly couples data and application programs. This means that both the data and the programs that manipulate the data can be stored and managed on the same platform [4]. For instance, in the Student class above, student information and the getAge method are stored together.

The strength of this model is the flexibility it gives in storing abstract/complex data types. This is particularly useful for scientific applications, as experimental data can be stored in its natural form (without being decomposed into rows and columns). Furthermore, data evolution is graceful in an object-oriented model. For example, consider the problem of defining a new student class for part-time students. This can be accommodated easily through the use of inheritance: the new class inherits all attributes of the existing Student class and adds an attribute to indicate part-time status.

The weakness of this model is the lack of a standardized data model.(1)
This means that, unlike relational systems with SQL, object-oriented database systems do not have a standardized access or query language. This makes object-oriented systems vendor specific, and thus hard to migrate to a different system/vendor. The lack of standardization also means that efforts at query language optimization are fragmented and differ from system to system.¹ Furthermore, although traversing among related (linked) objects is fast, attribute selection and comparisons are not as optimized as they are in relational systems [34]. For example, a query such as “select all students where date of birth is greater than xyz” will execute faster on a relational system, because operations such as join and select are highly optimized in relational systems. Relational databases are a mature technology and have been fine-tuned for optimal performance (at the cost of expressiveness). This is not yet the case for object-oriented data management systems.

Despite these weaknesses, the popularity of the object-oriented approach to modelling scientific data is apparent from the excerpt below, taken from a joint EU-US workshop on large scientific databases [47]:

“The object-oriented languages and object persistency is becoming ubiquitous in scientific data processing: these technologies allow us to define and store complex science objects and inter-relationships that we deal with ... We recommend the exploration of information models that have object-oriented characteristics of extensibility, so that the model is a serialization of the object itself.” (pg. 15)

¹ The vendor-initiated ODMG standard (Object Database Management Group) [10] was completed in 2001 (http://www.odmg.org/). However, it has yet to be widely accepted. OQL is the query language based on this standard.

3.2.3 Object-Relational Model

Object-relational database systems (ORDBMs) were developed to combine the robustness of relational systems with the expressiveness of object-oriented models.
A number of database systems now offer the ability to develop, maintain, and manipulate objects within a relational framework [19]. This approach provides the familiar structures and capabilities of RDBMs, and additionally provides key object-oriented functionality such as user-defined types, objects, and functions. For example, abstract objects based on primitive data types (integers, characters, etc.) can be defined and stored in relational tables. The strengths of this model are obvious: robustness and expressiveness. If object-relational technologies provide the same level of flexibility and extensibility as object-oriented systems, then one can potentially gain both the robustness of relational database systems and the expressiveness of object-oriented systems.

3.2.4 Analytical Versus Transaction Processing Systems

Having identified the core DBMs technologies, we now focus on two distinct types of workloads for which a data management system can be built: Online Transaction Processing (OLTP) workloads and Online Analytical Processing (OLAP) workloads. Workload refers to the types of queries that the data management system is expected to perform most frequently. The database systems designed for each of these workloads differ in the way data is organized and stored.

OLTP workloads are characterized by large numbers of data transactions (inserts, updates, and retrievals) in short periods of time [18]. The systems designed to cater for such workloads are referred to as OLTP systems. For example, an airline reservation system performs thousands of small data inserts and updates (submitted by numerous users), along with fixed queries such as reservation lookups and flight availability checks. In order to optimize for such workloads, OLTP systems are generally designed as highly normalized relational databases. Again, it was Codd [12] who pioneered the idea of data normalization in relational systems.
Data normalization is the process of distributing data across multiple tables in order to reduce redundancy, and thus minimize insert/update anomalies [12, 19]. For instance, consider the example in Figure 3.1: an insert in the attendance table does not require repeated inserts of the student's first name and last name values. OLAP workloads, on the other hand, are characterized by ad-hoc queries (on large amounts of data) and infrequent updates. The systems designed to cater for such workloads are referred to as OLAP systems [19]. OLAP systems are designed specifically for analytical purposes. These systems are popular in customer-centered environments, and are commonly referred to as Decision Support Systems (DSS) [44], because they pool low-level data and deliver it in a form that is understandable to novice end-users responsible for high-level data analysis. However, there are two pre-requisites for an effective OLAP system:

1. Data has to be in a consistent state (free of anomalies such as missing values, noisy data, etc.). This means data has to be integrated from the operational system(s) onto a platform dedicated to data analysis.

2. Data should be stored in a schema that is optimized for OLAP-type workloads.

In our context, the data management and analysis requirements indicate a need for an OLAP-type system. This is best realized through a data warehouse.

3.3 Data Warehouse

A data warehouse (see Figure 3.3) is best described by Inmon [28] as: “a subject oriented, integrated, non-volatile and time-variant collection of data in support of management’s decisions.”

Figure 3.3: A high-level architecture of a data warehouse system [36]. Operational data is extracted, transformed, and cleaned into the data warehouse, which is then queried through analysis tools.

In our case, the data warehouse supports scientific research by making OLAP-like query capabilities available to researchers.
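The insert-anomaly point from the normalization discussion above can be made concrete with a toy in-memory sketch in Java; the class and method names are illustrative and not part of the lab system:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration of normalization (cf. Figure 3.1): the attendance
// "table" stores only a foreign key, so student names are never
// repeated on insert.
class AttendanceDb {
    // "student" table: studentId -> name
    private final Map<Integer, String> students = new HashMap<>();
    // "attendance" table: rows of {studentId, dayOfYear}
    private final List<int[]> attendance = new ArrayList<>();

    void addStudent(int id, String name) { students.put(id, name); }

    // Inserting attendance repeats no student attributes.
    void markPresent(int studentId, int dayOfYear) {
        attendance.add(new int[]{studentId, dayOfYear});
    }

    // A "join": resolve names through the foreign key.
    List<String> presentOn(int dayOfYear) {
        List<String> names = new ArrayList<>();
        for (int[] row : attendance)
            if (row[1] == dayOfYear) names.add(students.get(row[0]));
        return names;
    }
}
```

Here markPresent writes only a key, mirroring how a normalized attendance table avoids repeating student names on every insert.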
The defining characteristics of a data warehouse system (in the present context) are explained below:

Subject-oriented: Data is conceptually organized by experiment metadata such as subject, task, direction, etc. The goal is to create an efficient data structure that can support the retrieval of experimental data based on metadata criteria; for example, "select discharge frequency for task 'a' and subject 'y'".

Integrated: In a warehouse system, data is generally integrated from multiple operational systems. Relevant data from these systems is extracted, cleaned, parsed, and aggregated for upload to a DBMs. In our case, we have only one operational system (the current file-based system); however, we have a large variation, from experiment to experiment, in the types of data collected (for example, EMG, cell, and kinesiological data).

Non-volatile: Data is stored for a long time and generally never deleted.

Time-variant: Data in a warehouse system is temporal, thus making it possible to analyze it for trends over time. In our case, we have temporal data in the sense of experiments conducted on a certain date and at a certain time. More importantly, however, the data is temporal in the sense that the actual experimental data is collected over the period of a movement. For example, a cell firing is sampled over the movement period of a subject reaching towards a target.

Dimensional/Star Schema

As defined above, data warehousing involves cleaning, aggregating, and transforming source data and storing it on a platform optimized for OLAP-type workloads. Our proposed schema for the data warehouse is a dimensional or star schema. The dimensional schema (see Figure 3.4) is a simplified relational schema that minimizes the number of table joins.
Krippendorf and Song [32] describe it as: “a central fact table or tables containing quantitative measures of a unitary or transactional nature (such as sales, shipments, holdings, or treatments) that is/are related to multiple dimensional tables which contain information used to group and constrain the facts of the fact table(s) in the course of a query” (pg. 4). The two key types of tables in a dimensional schema are described below:

Fact table: Kimball [31] describes the “facts” in a fact table as numerical measurements of a business taken at an intersection of all dimensions. Facts are generally numeric, continuously valued (not discrete), and additive. In our context, we have measurable scientific facts. The granularity of the fact table is determined by the unit of measurement of the facts. For instance, in our case, we can define a trial-level granularity (that is, the basic unit of access would be an entire trial from an experiment), or a fine granularity whereby each instant in time of a trial is individually accessible through SQL.

Dimension table: A dimension table gives identity to the facts in a fact table. As seen in Figure 3.4, each data point in a fact table is identified by keys derived from the dimension tables. Dimension table attributes are generally textual and discrete [31]. For example, in Figure 3.4, a store dimension attribute such as location is textual (city names) and discrete (a finite set of cities). The dimensional model advocates de-normalized dimension tables (dimensional data is not necessarily distributed across different tables to minimize redundancies). The reason is that dimension tables are small relative to the fact table, so the cost of introducing redundancy is correspondingly small.
By avoiding normalization on the dimension tables, we reduce the number of relational joins (tables joined using primary and foreign keys) in the schema, thereby improving performance for large select queries. For instance, in most cases a large query on the fact table will involve at most one dimension, and thus one join. The dimensional schema has been widely used in data warehouse projects and is popular for business applications [21]. By minimizing the number of joins, a dimensional schema ensures optimum query performance for OLAP-type workloads. Furthermore, fewer joins ensure that SQL queries are simple and do not require a deep understanding of the data model, enabling novice users to easily submit ad-hoc queries to the warehouse system. This is particularly valuable in our case, since the end-users have little or no programming/SQL background and are in the lab for relatively short periods of time.

Figure 3.4: A sample dimensional or star schema. A Sales fact table (foreign keys time_key, product_key, and store_key, plus the sales facts) is joined to three dimension tables: Time_dimension (time_key, month, quarter, year), Product_dimension (product_key, product attributes), and Store_dimension (store_key, store attributes). Each dimension describes a different aspect of the fact; for example, one could aggregate sales facts based on a product or a store.

3.4 Related Work

Although the data and application requirements of this project are quite unique, important design decisions and functionalities can be inferred from work in the area of scientific data management and analysis [20, 1, 45, 33, 2, 23, 39, 6, 11]. In this section, some of this work is briefly described to gain a better understanding of the opportunities and challenges in modelling scientific data.
The Human Brain project implements an object-oriented system (based on O2 database technology [5]) that stores structural images of the brain together with associated functional metadata [20]. The user-defined metadata makes it possible for scientists to easily share their research. For example, images² of neural activation areas can be stored with metadata describing the experiment, statistical techniques used, methodology, etc. The architecture separates raster data (in this case 3D images) from metadata storage and management. The RaSDaMan system (http://www.rasdaman.com/) manages the raster data and provides a powerful query language for it, while the O2 system gives persistency to the metadata objects. The architecture and design of the system is geared towards efficient storage, querying, and exchange of brain images.

Another system that uses an object-oriented approach is the LOGOS system [45]. This system is more task oriented and has a library of functions that manipulate neuroscience data³. For example, raw data signals can be processed via built-in object functions before statistical analysis. The architecture also integrates external software tools, such as simulators and statistical packages, with the database module. The data is organized and stored as objects and classes of objects, with persistency provided by the ObjectStore system (an object-oriented DBMs).

The Earth System Model Data Information System (ESMDIS) [11] uses an object-relational model for managing data related to ocean-atmosphere dynamics. The ESMDIS design separates the metadata from the actual data. The metadata is stored on the Informix object-relational DBMs system, while the data is stored in Network Common Data Format (netCDF).

² Functional magnetic resonance images (fMRI).
³ Neuroscience data in this case refers to both 2D and 3D images, and physiological time series data such as nerve cell discharges.
netCDF is a portable, self-describing, array-based data storage format [40]. Thus, although the data is stored outside the DBMs environment, the netCDF format allows standardized access to it, and the data is referenced by metadata that can be queried.

CenSSIS (the Center for Subsurface Sensing and Imaging Systems) provides a web-enabled database system for storing scientific data, primarily images [48]. CenSSIS stores the actual data (images) on a file server, with links to the metadata, which is stored on a relational DBMs (Oracle). The relational system storing the metadata is designed to ensure flexibility and extensibility. For instance, the metadata tables are organized in a hierarchy where specialized metadata tables are linked to a base metadata table through unique identifiers. Thus, as new types of images are incorporated into the system, the metadata table hierarchy can be easily extended.

The brief survey above identifies the following key themes that serve as useful guides for this project:

1. The popularity of object-oriented features in these systems. This hinges on two key requirements of scientific data: (1) a need for modelling complex data and relationships, and (2) a need for a flexible schema. If object-relational DBMs can deliver these functionalities, then there is a potential for combining the expressiveness of object-oriented systems with the robustness of relational systems.

2. A clear separation of metadata and data. In the cases above, we see that the data is stored outside the DBMs environment, but indexed by metadata that resides in a DBMs framework. As we will outline in Chapter 4, our system also separates the metadata from the data (in this case, trial data). However, we store both in the DBMs environment and at a fine level of granularity (the trial data is not stored as character or binary objects).

For this thesis, we have chosen an object-relational database system for the warehouse implementation.
As mentioned previously, object-relational systems provide the flexibility of an object-oriented system together with the robustness of a relational system.

3.5 Research Methodology

Having outlined the fundamentals of DBMs and some related work in the area of managing scientific data, we now identify a methodology for this research. As outlined previously, the goal of the research is to develop a data management and analysis system that efficiently and effectively stores, retrieves, and analyzes large volumes of scientific behavioral data. Specifically, we propose a data warehouse system based on an object-relational data management platform. The difficulties in describing and defending a system development research methodology have been the focus of a number of studies [35, 8, 7]. Furthermore, the challenge of evaluating such research is articulated well by Weber [46]:

“The conundrum posed by design research for progress in a discipline emerges clearly when a paper describing such research must be evaluated for publication in a learned journal. What are the quality standards the reviewer must apply to decide upon its acceptability? Typically the paper contains no theory, no hypothesis, no experimental design, and no data analysis. Traditional evaluation criteria cannot be used. The paper’s contribution requires an inherently subjective evaluation” (p. 9)

To guide this research, we have adopted the methodology suggested by Nunamaker, Chen, and Purdin [35] and the evaluation criteria proposed by Burstein and Gregor [8]. Nunamaker, Chen, and Purdin broadly describe system development research as a concept-development-impact cycle in which the proof of the proposed concepts/theory and the impact of the theory are evaluated via system development. They suggest five steps that we follow in this research.
These steps, shown in Figure 3.5, begin by constructing a conceptual framework, which evolves into a set of system requirements, and finally into a prototype system. The methodology emphasizes the cyclic nature of system development research, in which knowledge about the system is gained through incremental prototype development. Table 3.1 summarizes these steps and maps them to the current research. Burstein and Gregor expand on Nunamaker, Chen, and Purdin’s framework by proposing five criteria for evaluating system development research, addressing significance, internal validity, external validity, objectivity, and reliability of the system. In Chapter 6, we discuss these criteria in detail and use them as internal benchmarks for evaluating our work.

Figure 3.5: IS research steps from Nunamaker, et al. (1991). Each step in the system development research process is paired with its research issues:

Construct a conceptual framework: state a meaningful research question; investigate the system functionalities and requirements; understand the system building processes/procedures; study relevant disciplines for new approaches and ideas.

Develop a system architecture: develop a unique architecture design for extensibility, modularity, etc.; define the functionalities of system components and the interrelationships among them.

Analyze and design the system: design the database/knowledge base schema and processes to carry out the system functions; develop alternative solutions and choose one solution.

Build the prototype system: learn about the concepts, framework, and design through the system building process; gain insight into the problems and the complexity of the system.

Observe and evaluate the system: observe the use of the system in case studies and field studies; evaluate the system through laboratory or field experiments; develop new theories/models based on observation of and experimentation with the system’s usage; consolidate the experiences learned.

Table 3.1: A system development research methodology. The left-hand column shows the research steps as outlined by Nunamaker, et al. [35]; the right-hand column translates these steps to the current research.

Construct conceptual framework (1. state meaningful research question; 2. investigate system functionalities and requirements): 1. Research goal: develop a data management and analysis system that efficiently and effectively stores, retrieves, and analyzes large volumes of scientific behavioral data; specifically, we propose a data warehouse system based on object-relational DBMs technology. 2. System requirements and functionalities are outlined and discussed in Chapter 4.2: data and metadata query support, particularly temporal and signal-based slicing of data; a scalable and flexible schema; and an appropriate front-end analysis tool.

Develop system architecture (1. specify system components and interactions; 2. specify measurable requirements): 1. The key components of the system are outlined in Chapter 4: a parsing grammar, a data management system, and front-end analysis tools. 2. Some of the requirements identified in Chapter 4 are measurable; others, such as the need for a flexible schema, are inherently subjective.

Analyze and design the system (design to be based on theory and abstraction): A data warehouse system using a dimensional model based on object-relational technology is proposed and designed. This design is based on the conceptual foundation outlined in this chapter.

Build the system: A functioning data warehouse system is developed and currently contains 45GB of experimental data.

Experiment, observe, and evaluate the system: The system is tested against the existing file-based system. The testing has focused on measurable aspects such as query support and the analysis interface. We also use the evaluation criteria suggested by Burstein and Gregor as internal benchmarks for the system.

3.6 Summary

This chapter has outlined the data warehouse process and the foundational technologies on which it can be implemented. We have also looked at related work in the area of scientific data management, and we have identified and outlined a research methodology for this work. Having laid down this research foundation, the next chapter focuses on the actual data warehouse implementation.

Chapter 4

System Overview

4.1 Introduction

There are two primary goals for this chapter: first, to refine the problems identified in Chapter 2 into a set of requirements for the data warehouse system; second, to describe how data is currently organized in Dr. Scott’s lab, and then to outline the data management system developed for storing, retrieving, and analyzing the data. The chapter also discusses key design decisions taken prior to, and during, implementation.

4.2 System Requirements

From the data management and analysis problems identified in Chapter 2, and through consultations with end-users, we identify the following key requirements of the warehouse system:

1.
Query support: From the discussion in Chapter 2, we can identify two types of queries that the data management system must support:

(a) Metadata queries: high-level queries that allow researchers to query the metadata related to each experiment, answering questions such as “Do I have enough cells for analysis xyz?”. These queries should be quick and should not require a scan of the actual trial data; the data model should thus capture metadata for each experiment. Metadata here refers to (mostly textual) data that describes the experiment, such as subject information, experimental events, etc.

(b) Trial data queries: queries that scan the actual trial data based on metadata criteria. For example, a researcher should be able to retrieve individual data signals across different trials and different tasks based on criteria such as subject, task, and cell. Furthermore, because we have event-based time series data, slicing the data by time is also a key requirement. If the data is visualized as an n*m matrix, where each of the n rows represents a point in time and each of the m columns a signal at that point, then slicing can be thought of as selecting horizontal (temporal) bands of the data. For example, a researcher might ask for neural data in the first 20 milliseconds after the target light is projected (reaction start time), and kinesiological data 60 milliseconds after target illumination. Since data is logically organized into task experiments, such operations should be possible across different trials and tasks.

2. Scalability: As mentioned earlier, data volumes are going to increase significantly as new recording equipment is introduced in the lab. The data warehouse should therefore be scalable, both in terms of query time and data upload time.

3.
Schema evolution: The scientific process generating the data constantly shifts, and the data model should be able to evolve with these shifts. Furthermore, the programs that convert source data to match the database schema should also be flexible enough to adapt to such changes. We can anticipate the following kinds of schema evolution:

(a) New signals being measured, or two experiments of the same type recording different signals. For example, during a simple reaching task, one experiment might collect only cell data, only EMG data, or both.

(b) Additional data types being recorded. For example, video or audio recordings of the experiments could be collected in the future.

(c) Signals being dropped. Some signals could be considered redundant and be replaced by other signals. In a pure relational database, this would involve dropping the entire table and copying its contents to a new table.

4. Analysis interface: Due to the nature of the analysis, SQL by itself is not sufficient for complex data analysis and visualization. The system therefore has to interface with statistical tools, specifically MathWorks Inc’s Matlab software¹ [27]. The statistical front-end interface should be able to query and retrieve the data in times comparable to the current file-based approach.

Having identified the key requirements of the data warehouse system, we now outline the existing data organization, before presenting the details of the warehouse system.

¹ This is the main statistical software currently used in Dr. Scott’s lab.

4.3 Existing Data Organization

At present, real-time data from individual channels is collected by National Instruments Corporation’s LabVIEW software [15]. Two categories of data are collected: analog data (neural and EMG recordings) and motor data such as hand and joint position, velocity, torque, and acceleration. This data is sampled at rates of anywhere from 1000 Hz to 4000 Hz.
The analog data and motor data are stored in separate files and processed by the Brainstorm software written in Matlab [41]. Figure 4.1 shows the current data/process flow. The analog and motor data are first re-sampled at a lower frequency (200 Hz) and interpolated into a single file (a .sam file). The .sam file is processed by the Brainstorm software, which applies data filters and adds additional header information for each trial to produce the .pro files. The final step applies aggregation functions to the signal data and stores the result as .avg files. Since .pro data is the most widely used in the lab for analysis, we describe it in detail here.

Figure 4.1: Data flow in the file-based environment [41]. Analog (.ANA) and motor (.MOT) data collected from KINARM by the LabVIEW software are re-sampled and filtered for noise by the Brainstorm software into a .SAM file, which is successively processed into .PRO and .AVG files; all are stored as ASCII files.

Figure 4.2 shows the data organization in a .pro file (refer to Appendix C for a snapshot of a sample .pro file). Each .pro file is composed of three file-level headers that contain metadata pertaining to all the data in the file. Each file contains data from multiple trials of a movement in one direction. Furthermore, each trial has three headers that contain metadata specific to that trial. Please refer to the lab technical document for details on the data contained in all the headers [41]. A task or experiment generates multiple .pro files, since it is composed of movements to a set of different targets/directions. The following variances can be found among .pro files:

1. A different number of trials might be recorded.

2. A different number and different types of signals could be recorded. For example, a .pro file could have EMG, cell, or kinesiological data, and the number of channels for each type of signal (such as the number of EMG channels) might differ from one .pro file to another.

These variances are recognized by the parsing grammar described in Section 4.4.1.

Figure 4.2: A class diagram showing the structure of a .pro data file. Each .pro file is composed of three file-level headers (Experiment: task, subject, cell, ...; StateCondition: state, lights on, ...; ChannelConfig: channel name, min value, max value) and data from one or more trials. Each trial in turn has three trial-specific headers (TrialHeader/TargetInfo: target number, start x position, ...; StateTransitions: trans1, trans2, ...; TrialFeatures: method, value) and its signal data (an array per signal). The types of signals in the trial data vary from one .pro file to another.

4.4 System Architecture

In this section, we outline the details of the system developed for Dr. Scott’s lab and describe the key design decisions taken prior to, and during, implementation. System implementation was iterative, with changes made as familiarity with both the DB2 system and the scientific data increased. Features were also added as usability issues became apparent from end-user feedback during testing. The system, to some extent, had to reflect the manner in which data is currently retrieved and analyzed by researchers. We begin by outlining the final system and then describe the rationale behind the design.

The .pro data file was selected as the starting point because it is the most commonly used data file for analysis. The raw analog and motor data files are rarely used in day-to-day analysis because they are noisy and sampled at a high frequency. The Brainstorm software starts the process of cleaning the data, and thus we could choose data from either .sam, .pro, or .avg files for transfer to the warehouse system.
By starting with .pro data, we make the system immediately available to end-users. However, since the conversion from raw data to .pro data involves information loss, future work will need to integrate the .ana and .mot data into the warehouse system.

Figure 4.3: An overview of the data warehouse system. (1) Input data is parsed using a Perl-based grammar. (2) Data import scripts use the bulk-loading utilities provided by the database system to upload the parsed files to the DB2 data warehouse server. (3) Clients query the database using DB2 tools (a command line processor and a Java-based GUI) or through a Matlab interface: a custom-made Java class queries the database using a JDBC driver and stores the data in Java objects, which are served to the Matlab environment as a ProInfo struct.
Figure 4.4: Grammar rules that generate the parser. The left-hand side of a statement gives the rule name, and the right-hand side gives the regular expression for the data sequence to be parsed.

    START RULE:  EXPERIMENT_HEADER VERSION STATE CHANNEL FILE_INFO TRIAL_DATA(S)
                 { do something if parsed correctly }
    TRIAL_DATA:  TRIAL_HEADER STATE_TRANS TRIAL_FEATURES SIGNAL_DATA
    SIGNAL_DATA: DATA1 | DATA2 | DATA3 | DATA4 | DATA5
    EXPERIMENT_HEADER: regular expression to recognize the experiment header
    VERSION:     regular expression to recognize the version
    STATE:       regular expression to recognize state conditions
    CHANNEL:     regular expression to recognize channels
    FILE_INFO:   regular expression to recognize the raw data source
    TRIAL_HEADER: regular expression to recognize a trial header
    STATE_TRANS: regular expression to recognize state transitions
    TRIAL_FEATURES: regular expression to recognize trial features
    DATA1: regular expression to recognize cell(1) + kinesiological data
    DATA2: regular expression to recognize cell(1,2) + kinesiological data
    DATA3: regular expression to recognize cell(1) + kinesiological + EMG data
    DATA4: regular expression to recognize cell(1,2) + kinesiological + EMG data
    DATA5: regular expression to recognize variable EMG + cell channel data

Figure 4.3 shows a high-level view of the key components of the warehouse system. The system can be divided into three key components: a parser for source data transformation, a database for storing and querying the data, and a Matlab interface to bring the data to the analysis platform. Each of these components is described below.

4.4.1 Parse Grammar

The starting point for the system is a Perl-based parser that processes the .pro data files and extracts the relevant data for upload to the data warehouse. The parser is based
on a Perl programming language module, Parse::RecDescent [13], which generates the parsing code based on a user-defined grammar. In essence, the grammar encodes knowledge specific to a .pro data file as a set of rules and re-organizes the source data to fit the database schema. Figure 4.4 shows the grammar, with the actual regular expressions given in Appendix D. Each statement gives a rule name followed by the action to be performed if the rule is satisfied. The start rule describes the overall structure of the .pro file. This structure is better illustrated in Figure 4.5.

[Figure 4.5 shows the grammar as a tree: the start rule branches into header information (version, state, channels, file info) and trial data; trial data branches into trial header, state transitions, trial features, and signal data; and signal data branches into Data1 through Data5.]

Figure 4.5: A tree diagram illustrating the structure of the grammar shown in Figure 4.4.

There are several advantages to using this rule-based parsing approach. First, due to the numerous variations in the input files, line-by-line parsing would be programmatically cumbersome. A grammar is much more elegant, modular, and extensible. This ties in with the overall goal of making a system that can adapt to changes in the scientific process generating the data. For example, the grammar already distinguishes between five types of .pro data formats (see rule SIGNAL_DATA in Figure 4.4). When a new signal is recorded, this rule can be extended by adding the regular expression for recognizing the new data format. Secondly, a grammar makes it easier to code extensions, such as combining the data parsing and upload steps into a single program. This could be considered a mediator-based approach to source data transformation. In such an approach, source data is communicated to a mediator using data wrappers. The mediator then resolves semantic and syntactic differences between the source and the warehouse schema using transformation rules [17].
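To illustrate the rule-based style (the real parser uses Parse::RecDescent in Perl; this Python analogue and its rule patterns are invented for illustration), each rule pairs a name with a regular expression, and supporting a new data format amounts to appending one rule:

```python
import re

# Simplified analogue of the grammar-driven parser: each rule is a
# name plus a regular expression. The patterns below are invented
# for illustration and do not reproduce the real .pro grammar.
RULES = [
    ("EXPERIMENT_HEADER", re.compile(r"^HEADER\s+(\S+)")),
    ("VERSION",           re.compile(r"^VERSION\s+(\d+\.\d+)")),
    ("TRIAL_HEADER",      re.compile(r"^TRIAL\s+(\d+)")),
    ("SIGNAL_DATA",       re.compile(r"^DATA\s+([-\d.\s]+)$")),
]

def parse_lines(lines):
    """Match each input line against the rule set, in order."""
    tokens = []
    for line in lines:
        for name, pattern in RULES:
            m = pattern.match(line)
            if m:
                tokens.append((name, m.group(1)))
                break
        else:
            raise ValueError("unrecognized line: %r" % line)
    return tokens
```

In this style, a new signal format is supported by appending one (name, pattern) pair, which mirrors how the SIGNAL_DATA rule of the real grammar is extended.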
The choice of Perl for this task was based on its easy-to-use data extraction features, such as regular expressions, and on the availability of a simple recursive-descent grammar module.

4.4.2 Data Warehouse

The data warehouse module of the system shown in Figure 4.3 uses IBM's DB2 Universal Database system [25]. DB2 is a leading relational database system. It also supports object-oriented features such as user-defined objects, and thus provides an excellent opportunity to leverage the power of a mature relational database platform while benefiting from the flexibility of object-oriented functionality. Furthermore, DB2 is freely available for research purposes and is used by the Database Research lab at Queen's University.

The schema shown in Figure 4.6 is based on the dimensional model discussed in Chapter 3. The fact table, labelled TRIAL_DATA in the diagram, contains signal data from every trial and for every experiment. Data in this table is qualified by the following dimension tables: TRIAL_HEADER, EXPERIMENT_HEADER, and

[Figure 4.6 shows the warehouse schema: the TRIAL_DATA fact table, whose columns hold the time stamp, the kinesiological signals (hand position and acceleration, shoulder and elbow angles, velocities, and torques, motor torques, tangential velocity and acceleration), and EMG channels 1-10, with foreign keys TrialNum and Filenum; the TRIAL_HEADER, STATE_TRANSITIONS, and EXPERIMENT dimension tables, keyed on (TrialNum, Filenum) or Filekey; and the STATE_CONDITIONS and TRIAL_FEATURES sub-dimension tables hanging off the experiment dimension.]

Figure 4.6: The data warehouse schema.
This is a star schema in which all trial data is stored in a large table referred to as the fact table. Data in the fact table is qualified by the foreign keys linking it to the dimension tables; these are smaller tables that identify each fact, or row, in the fact table.

STATE_TRANSITIONS. Furthermore, the experiment dimension contains two sub-dimensions: STATE_CONDITIONS and TRIAL_FEATURES. Sub-dimensioning is referred to as snowflaking in the data warehousing literature [31]. A direct relation to the fact table would necessitate an extra primary key in the fact table, since every trial has multiple features and state conditions (that is, a many-to-many relationship with the fact table). A possible alternative would be to separate each condition and feature into its own relation (table); however, this increases the processing in the data parsing step. To keep the parsing step and the warehouse design relatively intuitive, the sub-dimension tables are used.

All dimension tables are based on structured user-defined objects. In DB2 terminology, they are referred to as typed tables [3]. Typed tables allow user-defined objects and object hierarchies to be stored in DB2 tables as either rows or object columns (see Figure 4.7). For example, we can define an object "Subject" with the following attributes: name, weight, and arm length. This object can now be stored in a table with each attribute translated to a column.

[Figure 4.7 shows (1) a user-defined object stored in a typed table, with its attributes (Oid, Name, Weight, Arm length) mapped to table columns, and (2) the same user-defined object stored as a single object column alongside regular columns.]

Figure 4.7: User-defined objects and typed tables in DB2. (1) shows the object stored in a typed table where the attributes are mapped to table columns; (2) shows the user-defined object stored in a regular table column.

The advantage of using typed tables is the added flexibility they provide in terms of adding and dropping attributes.
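The "Subject" example can be sketched as follows. This is an illustrative Python/SQLite sketch of the attribute-to-column mapping only; it does not use DB2's actual typed-table syntax (CREATE TYPE followed by CREATE TABLE ... OF), and the helper names and sample values are hypothetical.

```python
import sqlite3
from dataclasses import dataclass, fields

# Illustrative sketch of the attribute-to-column mapping described
# above, with SQLite as a stand-in. DB2 typed tables are declared
# via CREATE TYPE / CREATE TABLE ... OF instead; the helpers and
# sample values here are hypothetical.

@dataclass
class Subject:
    name: str
    weight: float
    arm_length: float

def create_table_for(conn, cls):
    # Each object attribute becomes a table column.
    cols = ", ".join(f.name for f in fields(cls))
    conn.execute("CREATE TABLE %s (%s)" % (cls.__name__.lower(), cols))

def insert(conn, obj):
    vals = [getattr(obj, f.name) for f in fields(obj)]
    marks = ", ".join("?" for _ in vals)
    conn.execute("INSERT INTO %s VALUES (%s)"
                 % (type(obj).__name__.lower(), marks), vals)

conn = sqlite3.connect(":memory:")
create_table_for(conn, Subject)
insert(conn, Subject("a", 6.5, 18.0))
# The attributes are now ordinary columns that can be queried.
row = conn.execute("SELECT name, weight, arm_length FROM subject").fetchone()
```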
Furthermore, typed tables allow the attributes to be treated like columns in a table, so they can be indexed and queried. The fact table, however, is designed as a regular table for reasons outlined below.

Key Design Decisions

As discussed previously, a number of design decisions led to the final database schema shown in Figure 4.6. Below are some of the key issues addressed during implementation:

1. Data granularity: One possible design for this database would be to store the data for each trial as a Binary Large Object (BLOB) or a Character Large Object (CLOB). Although this coarse-grained approach would make schema evolution much simpler, we decided on a fine-grained approach (that is, each data point is explicitly stored as a row) for the following reasons:

(a) CLOB and BLOB data cannot be queried by the database, and thus horizontal or vertical slicing of data (time epochs within a trial, or signal filtering) is not possible using SQL queries. Storing signals as BLOBs or CLOBs would essentially create an index for each signal or set of signals, with data selection being done outside the data management system.

[Figure 4.8 shows the BASE_DATA hierarchy: a base_data typed table holding the common attributes (OID, TIME, HANDXPOS, HANDYPOS, SHOANG, ELBANG, SHOVEL, ELBVEL, HANDXACC, HANDYACC, SHOTOR, ELBTOR, SHOACC, ELBACC, MOT1TOR, MOT2TOR, TANACC, TANVEL, TRIALNUM, FILENUM), with sub-tables data_cell1_noemg (adding cell1), data_cell12_noemg (adding cell2), data_cell1_emg, and data_cell12_emg (adding EMG channels 1-10) derived by inheritance.]

Figure 4.8: The object-oriented implementation for the fact table. The fact table starts with the base attributes, from which sub-tables are created using inheritance. This design was dropped because of the disadvantages identified in Section 4.4.2.
(b) The approach would necessitate an interface or application capable of making sense of the binary or character data (i.e., parsing and tokenizing the output data). For example, in the case of BLOB data, an application would have to query the database and parse the binary output into Matlab-readable data.

(c) Finally, as discussed in Chapter 2, evidence from data mining research indicates that considerable effort goes into preparing input data for mining algorithms. With all data points explicitly stored in rows and columns, data extraction can be done relatively easily using SQL. This design therefore facilitates future data mining projects.

2. Typed versus regular fact table: Initially, the fact table was designed as a typed table. Figure 4.8 shows the data hierarchy in the original typed fact table. The hierarchy starts with a set of kinesiological attributes/signals that are recorded in every experiment. From this base type, additional objects are defined that inherit the base attributes and add to them. For example, the data_cell1_noemg object inherits attributes from base_data and additionally has a cell discharge attribute. This is just one instance of the possible object hierarchies for the data at hand. Another object-oriented design would be to identify three data objects (kinesiological, neural, and EMG data objects) and then make a complex object by combining the three. Although such an approach gives more flexibility in terms of schema evolution, there are a number of disadvantages:

(a) Typed tables require each object (or row) to have a unique identifier. In this case, because we have time series data sampled at a very high frequency, we would have millions and possibly billions of unique object identifiers. Furthermore, the physical implementation of the typed table contains a system-generated type id.
This makes typed tables for trial data expensive, both in terms of storage and insert-time performance. With the warehousing schema identified in Figure 4.6, trial data is identified by its relationship to the dimension tables and thus does not need a unique identifier over and above the foreign keys derived from the dimension tables.

(b) Another issue is that the DB2 bulk loading command, LOAD, is not supported on typed tables [3]. This further degrades insert performance, and also adds the overhead of writing a highly optimized loading script. Data uploads in this case are significant: a single recording session could collect anywhere from 1-2 GB of raw data and 100-200 MB of re-sampled data.

In terms of schema evolution, regular tables allow attributes or columns to be added as required. To drop attributes, the entire table would need to be dropped and recreated; however, dropping attributes at present seems like an unlikely scenario.

4.4.3 Matlab Interface

As mentioned earlier, because of the complexity of the analyses, SQL by itself is not sufficient for data analysis. Since the Matlab software is the primary analysis tool in the lab, a custom interface was developed for it to communicate with the database. The interface communicates with the DB2 system via a Java programming language class. The Matlab scripting environment allows Java objects to be instantiated and gives access to object methods and attributes. We therefore developed an application that uses Java Database Connectivity (JDBC) to communicate SQL statements to the database and organizes the retrieved data in a Java object hierarchy. The Java implementation makes the data much more portable and independent of either Matlab or the specific platform, which is important in situations where data needs to be shared with other researchers or there is a migration to other statistical software.
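The interface's flow (SQL in through a database driver, results out as objects for the analysis environment) can be sketched as follows; sqlite3 stands in for the DB2/JDBC connection, and the `ProInfo` class and column names here are simplified, hypothetical stand-ins for the lab's actual structures.

```python
import sqlite3

# Sketch of the interface's query-and-organize flow, with sqlite3
# standing in for the DB2/JDBC connection. The ProInfo class and
# the reduced column set are hypothetical simplifications.

class ProInfo:
    def __init__(self, filenum, trials):
        self.filenum = filenum
        self.trials = trials  # trialnum -> list of (time, signal) rows

def fetch_proinfo(conn, filenum):
    """Query trial data for one file and organize it as an object."""
    cur = conn.execute(
        "SELECT trialnum, time, handxpos FROM trial_data "
        "WHERE filenum = ? ORDER BY trialnum, time", (filenum,))
    trials = {}
    for trialnum, t, x in cur:
        trials.setdefault(trialnum, []).append((t, x))
    return ProInfo(filenum, trials)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trial_data (filenum, trialnum, time, handxpos)")
conn.executemany("INSERT INTO trial_data VALUES (?,?,?,?)",
                 [("f1", 1, 0.000, 0.10), ("f1", 1, 0.005, 0.12),
                  ("f1", 2, 0.000, 0.09)])
info = fetch_proinfo(conn, "f1")
```

The real interface does the same two steps over JDBC (execute SQL, then pack the result set into a Java object hierarchy that Matlab can traverse).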
Prior to the data warehouse implementation, the starting point for data analysis was a Matlab script that organized .pro data in a localized structure (the proinfo struct) [41]. Since this structure has been used for a number of years, a large library of custom-built functions and scripts crucial for data analysis rely on data being in this structure. To ensure immediate usability of the data warehouse, the interface returns a structure identical to the proinfo struct. The proinfo struct is a legacy structure that imports all data related to a profile and does not truly use the power of SQL to slice the data. In the long run, functions that use the proinfo struct can be phased out and replaced by more efficient ones based on SQL queries.

4.5 Summary

This chapter outlined the data management system implemented as a proof of concept for this thesis and explained the rationale behind key design decisions. As noted, each decision has potential benefits and drawbacks. The database design is constrained by two factors: 1) the functionality provided by the database system, and 2) the need to incorporate a large library of functions and data developed over a number of years. The implications of both of these factors are discussed in Chapter 6.

Chapter 5

Analysis

5.1 Introduction

This chapter has two primary goals:

1. To demonstrate that the data warehouse system meets the key requirements outlined in Chapter 4.

2. To identify the new functionality added by the data warehouse system and compare it against the file-based approach. The idea is to illustrate how the new capabilities add value for the end-users (researchers).

The chapter is organized as follows: Sections 5.2, 5.3, and 5.4 show the added capabilities of the data warehouse system and compare them to the file-based approach. Section 5.5 deals with operational aspects of the warehouse system, specifically looking at data parsing and uploading to the data management system.
Section 5.6 deals with issues of schema evolution and scalability in the warehouse-managed environment. These issues arise from the need to introduce a data model that is both efficient and flexible. Finally, Section 5.7 summarizes the discussion in this chapter.

All testing outlined in this chapter is done on a machine with the following specifications:

1. Operating System: Windows XP Professional
2. CPU: 2 x Xeon 2.4 GHz processors
3. RAM: 2 GB
4. Software: DB2 data management system V8.2, service pack 4
5. Java version 1.4.0
6. Matlab version 6.5.1

Currently, the database contains 45 gigabytes of data. It comprises 20,721 distinct cell-data-task-subject combinations and 95,475,388 records in the fact table.

5.2 Query Support

The main benefit of the data warehouse is its formal query support. SQL provides easy and efficient access to the data warehouse. Researchers can query data using different criteria, such as experiments, tasks, subjects, etc. Also, because of the fine-grained design of the data warehouse (data points are explicitly stored as rows in the fact table), researchers can slice the data horizontally (by time epoch) or vertically (by signal). Together with ease of use, there is added efficiency in terms of the system resources needed to run large analytical tasks and the number of disk I/Os.

From an end-user point of view, there are two major classes of queries that are executed frequently: metadata queries and trial data queries (or fact table queries). To compare this new capability against existing file-based access scripts, we have tested the running time of sample queries in each category against Matlab scripts for the same task. The Matlab scripts for testing purposes were written by an experienced Matlab programmer in Dr. Scott's lab who is not involved in this research.
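The two slicing operations can be illustrated on a toy fact table; this sketch uses sqlite3 as a stand-in for DB2, with a reduced, hypothetical set of columns loosely following the schema in Figure 4.6 and times kept in milliseconds (5 ms sampling).

```python
import sqlite3

# Toy illustration of horizontal slicing (rows in a time epoch) and
# vertical slicing (a single signal column). sqlite3 stands in for
# the DB2 warehouse; times are integer milliseconds (5 ms sampling),
# and the column set is a reduced, hypothetical subset.

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trial_data (filenum, trialnum, time, handxpos, elbvel)")
rows = [("f1", 1, t * 5, t * 0.1, t * 0.2) for t in range(10)]
conn.executemany("INSERT INTO trial_data VALUES (?,?,?,?,?)", rows)

# Horizontal slice: only the rows inside a given time epoch.
epoch = conn.execute(
    "SELECT time, handxpos, elbvel FROM trial_data "
    "WHERE trialnum = 1 AND time BETWEEN 10 AND 20 ORDER BY time").fetchall()

# Vertical slice: only the signal of interest, for the whole trial.
signal = conn.execute(
    "SELECT elbvel FROM trial_data WHERE trialnum = 1 ORDER BY time").fetchall()
```

Because each sample is its own row, both slices are ordinary WHERE clauses and column lists; with BLOB-stored signals, neither would be expressible in SQL.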
5.2.1 Metadata Query

To test metadata query capability, we implemented two queries and compared their performance against Matlab scripts written for the same task. Note that, in practice, the warehouse system can be expected to perform even better relative to the file-based approach than reported here, because:

1. The Matlab script randomly picks files from a local folder and does not select data based on user criteria as the SQL query does. Thus the cost of searching for relevant data in the file-based approach is not captured in this analysis.

2. Before each query, the DB2 buffer pool was reset. This minimizes the impact of data buffering and results in the maximum number of physical reads of data.

Metadata Query 1 - Movement Onset Time

This is a simple query that retrieves the movement onset (MTstart) time (an event time stamp) from the dimension table containing trial features. The query (shown in Figure 5.1) first selects file keys from the experiment header dimension based on user criteria, and then retrieves the corresponding movement onset time value for each trial in the experiment.

SELECT value AS MTStart, filenum, trialnum
FROM experiment_header AS ex, feature_dimension AS featr
WHERE subject='a' AND task='a' AND cellnum < 50
  AND featr.feature='MTstart' AND featr.method='State'
  AND ex.filenum=featr.filenum
ORDER BY filenum, trialnum

Figure 5.1: Sample metadata query that retrieves movement onset time (MTstart) from the trial features dimension table.

The same value was also retrieved from .pro files using a Matlab script (code shown in Appendix A.1). Figure 5.2 illustrates the results of running the SQL query and the corresponding Matlab script. The queries were run multiple times; the results indicate the mean run time, and the variance across runs is not significant. As can be seen, the SQL queries are at least 30 times faster than the file-based access script written in Matlab.
Also, the running time for the SQL query is essentially flat (a slope of 0.007 seconds per file), compared to the Matlab script, which has a slope of 0.3 seconds per file. This is because of the per-file cost, in the file-based system, of opening and reading each entire file into the Matlab environment. In Figure 5.2, "number of files" simply refers to the number of records fetched from either the warehouse system or the .pro data files. For example, "50 files" means 50 MTstart values were retrieved from the metadata tables (in the case of SQL access) or from the .pro data files (in the case of file-based access). Thus, the slight increase in SQL query time (0.007 seconds for every additional file) can be explained partly by this increase in the result set, which has

[Figure 5.2 plots running time (seconds) against number of files:

Number of files:       50    100   150   200
SQL - local machine:   0.9   1.1   1.5   1.9
SQL - client machine:  1.0   1.1   1.5   1.8
File-based access:     18    33.2  52    65.5]

Figure 5.2: Results of running metadata query 1 shown in Figure 5.1. On average, the SQL queries are about 30 times faster than the file-based access script. Furthermore, the SQL query time is essentially constant as the number of files increases, whereas the Matlab script's running time grows as a function of the number of files in the input set. The results also show that running the query from a client machine over a local area network does not degrade performance.
SELECT DISTINCT ex.cellnum, ex.task, ex.date, ex.time
FROM experiment_header AS ex,
  (SELECT filekey, cellnum, date, time FROM experiment_header
   WHERE task='c' AND subject='c' AND cellnum < 40) AS taska,
  (SELECT filekey, cellnum, date, time FROM experiment_header
   WHERE task='b' AND subject='c' AND cellnum < 40) AS taskb
WHERE ((taska.date < taskb.date)
    OR ((taska.date = taskb.date) AND (taska.time < taskb.time)))
  AND (taska.cellnum = taskb.cellnum)
  AND ((ex.filekey = taska.filekey) OR (ex.filekey = taskb.filekey))
ORDER BY cellnum, date, time

Figure 5.3: Metadata query 2. Retrieves all cells that were recorded for both task 'a' and task 'b', where task 'a' was recorded prior to task 'b'.

to be written to an output file, and partly because the result criterion is wider (more records match the criterion "cellnum < x"). Furthermore, since most end-users are going to query the database over a local area network, the query was also executed from a client (remote) machine on the same network as the database server. In this case, since the result set is much smaller than the data set that is queried, running the query from a client machine does not degrade performance.

Metadata Query 2 - Cell Selection Query

In this query, we retrieve the cells that were recorded for both experimental task 'a' and task 'b', where task 'a' was recorded prior to task 'b' (see the query in Figure 5.3). This is a common query that helps researchers identify whether there are enough cells for

[Figure 5.4 plots running time (seconds) against number of files:

Number of files:       50    100   150    200
SQL - local machine:   0.36  0.36  0.34   0.36
SQL - client machine:  0.31  0.39  0.36   0.39
File-based access:     5.6   6.25  12.15  13.9]

Figure 5.4: Results comparing data warehouse performance against file-based access for metadata query 2. Despite a better relative performance compared to query 1, file-based access is still slower than the SQL query for the same task.
Again, running the query over a network client has no effect on performance.

data analysis across tasks and/or subjects. This query is different from metadata query 1 in that the corresponding Matlab script for the task is more efficient: it uses filenames to identify those that match the task criteria, and thus does not have to open and read all the files in the input set.¹ Despite this, as the results in Figure 5.4 indicate, the Matlab script for the task (see Appendix A.2 for code) is still about 26 times slower than the SQL query. Also, the SQL query time is essentially flat as the number of files increases, compared to the Matlab script. A run-time profile of the Matlab script shows that its execution time is dominated by calls to the read_pro.m script (the script that reads the file into memory). So although there are fewer calls to read_pro.m (compared to metadata query 1), it is still the primary performance bottleneck.

5.2.2 Trial Data Query

To compare the data warehouse against the file-based approach for trial data retrieval, we ran a query that calculates cell discharge rate (spikes per second) between two events: reaction start time (RTstart) and movement onset time (MTstart).² Again, the same task was performed using a Matlab script to retrieve the relevant data from .pro files. The SQL query is shown in Figure 5.5, and the Matlab code for the task is included in Appendix A.3. This task was selected for testing because it is a common procedure executed by researchers. Furthermore, the query captures the key strength

¹ All .pro files are named using a standard naming convention that encodes some experimental metadata, such as task, subject, repetition, direction, target, and cell number. For example, consider a sample filename 'aa1x0001'. The first three characters identify the subject ('a'), task ('a'), and experiment repetition ('1'). The last four digits identify the target ('0') and cell number ('001').
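The naming convention in footnote 1 can be decoded mechanically. The sketch below infers the field positions from the 'aa1x0001' example; treating the fourth character as the direction flag is an assumption, since the footnote lists direction among the encoded fields without decoding it explicitly.

```python
# Decode the .pro filename convention described in footnote 1.
# Field positions follow the 'aa1x0001' example; interpreting the
# fourth character as the direction flag is an assumption.

def decode_pro_name(name):
    if len(name) != 8:
        raise ValueError("expected an 8-character .pro file name")
    return {
        "subject":    name[0],
        "task":       name[1],
        "repetition": name[2],
        "direction":  name[3],   # assumed: direction flag ('x' in the example)
        "target":     name[4],
        "cellnum":    name[5:8],
    }
```

This positional decoding is what allows the Matlab script for metadata query 2 to filter by task without opening each file.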
² Trial data refers to the data collected during an experiment. For instance, in this case, signals such as cell discharge, elbow velocity, elbow torque, etc., are collected during a trial for a particular task/movement.

SELECT (SUM(cellsignal1)/(COUNT(time)*0.005)), data.filenum, data.trialnum
FROM trial_data AS data,
  (SELECT value AS rtstart, filenum, trialnum FROM trial_features
   WHERE method='State' AND feature='RTstart'
     AND filenum IN (SELECT filekey FROM experiment_header
                     WHERE cellnum < 47 AND subject='c' AND target=0
                       AND filekey NOT LIKE '%%%x%e%%')) AS event1,
  (SELECT value AS mtstart, filenum, trialnum FROM trial_features
   WHERE method='State' AND feature='MTstart'
     AND filenum IN (SELECT filekey FROM experiment_header
                     WHERE cellnum < 47 AND subject='c' AND target=0
                       AND filekey NOT LIKE '%%%x%e%%')) AS event2
WHERE data.time BETWEEN event1.rtstart AND event2.mtstart
  AND event1.filenum = data.filenum AND event2.filenum = data.filenum
  AND data.trialnum = event1.trialnum AND data.trialnum = event2.trialnum
GROUP BY data.filenum, data.trialnum
ORDER BY data.filenum, data.trialnum

Figure 5.5: An SQL query that calculates the cell discharge frequency (spikes/second) between reaction start time and movement onset time.

[Figure 5.6, "Cell Discharge Freq - SQL vs File-based Access", plots running time (seconds) against number of files:

Number of files:       50   100  150  200
SQL - local machine:   9    16   25   32
SQL - client machine:  9    15   25   31
File-based access:     18   33   51   64]

Figure 5.6: Results of running the trial data query in Figure 5.5 against a file-based access script. The SQL query time (in seconds) is about half the time taken by the Matlab script. Also, running the query from a client machine on the same network does not degrade performance. The result set for this task is relatively small, since each file has anywhere from 5-7 trials. The largest result set is 1260 rows.
However, a large number of rows must be processed (see Figure 5.7) to get these results.

[Figure 5.7 compares the number of data rows selected by SQL with the total number of rows in the experimental data:

Number of files:                      50       100      150        200
Rows selected by SQL:                 28,698   57,723   101,386    126,012
Total rows in the experimental data:  319,258  602,145  1,017,906  1,254,818]

Figure 5.7: This diagram illustrates the differences between the number of rows selected by the trial data query (Figure 5.5) and the total number of rows in the .pro data files. As can be seen, the Matlab script has to import significantly more data rows than SQL.

of our warehouse model - the ability to slice data horizontally (based on time stamps) and vertically (based on the signal of interest). The results of this test (see Figure 5.6) show that the SQL query is still about twice as fast as the Matlab script. Assuming Matlab and SQL are both equally efficient in the computational task, the speedup can be attributed to SQL's ability to retrieve only the relevant data into memory, as opposed to all data related to the experiment. Figure 5.7 compares the number of rows processed by SQL against the total number of rows that the Matlab script has to bring into memory. In terms of signals, SQL selects 2 signals, as opposed to the approximately 32 signals that the Matlab script brings into memory. In terms of data rows, for this task, SQL retrieves and processes approximately 10% of the total. Another observation is that the running time of the Matlab script for this task is more or less equal to its running time for the task outlined in metadata query 1 (Figure 5.2).
Since the Matlab script corresponding to metadata query 1 simply opens a file and looks up a value, a comparable execution time indicates that for simple analyses (such as the trial data query outlined above), the biggest cost for file-based access scripts is opening and reading the relevant data into memory. A run-time profile of the Matlab script confirms this observation and indicates that, on average, 87% of the time is spent reading .pro files.³

³ The Matlab scripts in all of the above analyses are interpreted, not compiled. Although compiled programs give better performance, compilation would have very little impact in this case, since the dominant performance factor is opening/reading the file (I/O).

5.3 Data Management

The new warehouse-managed environment clearly adds value in terms of data management tasks, such as backups and ensuring that valid data is made available to end users. Previously, all such operations were manually driven and enforced by the lab administrator. Furthermore, the availability of metadata queries allows researchers to better plan new experiments. For instance, metadata query 2 (shown in Figure 5.3) allows researchers to make quick decisions about whether more data is needed for a particular analysis. With the file-based approach, such queries are still possible if all files are stored on a single machine; however, the running time and the effort required to generate such queries are prohibitive. For example, a look at the code in Appendix A.2 shows a line count of approximately 60 lines (without comments), and it took an experienced Matlab programmer approximately 2 hours to write and debug. Finally, the data warehouse environment reduces data redundancy by creating a centrally accessible and shared data repository. The results of running metadata and trial data queries over the local area network indicate that the running time is within an acceptable range.
Also, for computations on extremely large data sets, stored procedures (discussed below) can be run on the server machine with only the results transported to the client, thereby avoiding large data transfers over the network. Compare this to the current environment, where all data files are transferred to a client machine for every analysis.

5.4 Data Analysis

The data warehouse system adds value to the data analysis process by enabling faster and easier access to research data. Furthermore, the data warehouse environment makes three additional capabilities available to researchers:

1. Built-in functions within SQL allow researchers to perform quick analyses on the entire data set with only the results served to the client machine. For example, common statistical functions such as correlations, linear regressions, etc., are built into the SQL function library [29]. For instance, the trial data query outlined in Figure 5.5, which sums a fact table column, could just as easily apply statistical functions to it.

2. The ability to define functions and stored procedures [3] enables complex data analysis procedures to be coded, stored, and executed on the data warehouse system as user-defined functions or procedures. These procedures support better utilization of resources and avoid the need to query large data sets over the network.

3. The availability of user-defined procedures and database triggers allows common analysis and data cleaning tasks to be automated. Database triggers are actions that are fired by events such as data updates, deletions, and inserts. For example, upon data insert, a trigger could run stored procedures that extract key information such as discharge frequency, movement start time, etc.

Furthermore, these capabilities, combined with a standardized data structure and access language, are expected to aid data mining processes. For instance, data
mining algorithms that analyze data filtered by SQL queries can be coded as stored procedures. Finally, as described earlier, we have also implemented a Java-based Matlab interface that queries the data warehouse and returns the results in a structure similar to the one currently used in Dr. Scott's lab (the proinfo structure). This is a necessary step in the short run, due to the large number of analysis scripts dependent on the proinfo structure. The testing done with this interface is described below.

Java-based Matlab Interface

As described earlier, the Java-based Matlab interface retrieves data into a structure that is similar to the one currently used in Dr. Scott's lab (the proinfo structure). Since data from all 5 dimension tables is loaded into the structure, this is a very inefficient use of the data warehouse. Furthermore, the dimension table and fact table queries (although quite fast) have to be executed for every file matching the user criteria. So if 200 files match a user's criteria, then 200 x 5 dimension table queries and 200 trial data queries are executed.⁴ Also, the cost of selecting relevant data is not captured when running file-based access scripts such as read_pro.m.⁵ For instance, in this case, the required files were simply copied from CDs to the local hard drive. A simple test that copied 1000 .pro files to the local hard drive showed that, on average, 0.2 seconds are added per .pro file for retrieving the relevant data to the hard drive. This is just the time to transfer data to the local machine, and it still does not capture the data selection cost entirely. Thus,

⁴ One way to avoid this is to add additional logic in the application to query all the dimension tables and the fact table once, and then create the structure using trial numbers and file keys; however, this would require these values to be brought into memory for every data point.

⁵ This is an existing Matlab script used for reading .pro file data [41].
ANALYSIS 250 200 150 Time (secs) 100 File-based - Without copy time File-based - With Copy Time Java-based Matlab Interface On local Machine Java-based Matlab Interface On client machine 50 0 50 100 150 200 File-based Without copy time 18 37 53 70 File-based - With Copy Time 28 57 83 110 Java-based Matlab Interface On local Machine 43 86 115 163 Java-based Matlab Interface On client machine 47 108 163 215 Number of files Figure 5.8: The results of comparing the Java-based Matlab interface against filebased access scripts. This is a worst case scenario, where data from all dimension tables and the fact table are queried multiple times to create a proinfo structure. On average, executing the query through the interface adds an additional 0.4 seconds compared to the running time for the filebased access script. Also, running the interface over a network client adds approximately 0.2 seconds to the running time (the same as copying the files to the local hard drive from CD’s). CHAPTER 5. ANALYSIS 67 the results below capture the worst case scenario for the data warehouse usage. The results of testing this interface, shown in Figure 5.8, indicates that on average the Java-based Matlab interface is twice as slow as the file-based access script, in this worst case scenario. However, in absolute terms, it is adding only an additional 0.4 seconds per file. For most retrievals (100 to 200 files), this amounts to only 40 to 80 additional seconds. Furthermore, in the long run, the interface will be used in an efficient manner without the need to construct a proinfo like structure. For instance, the trial data query, outlined in Figure 5.5, is an example of an efficient use of the data warehouse. So the Java-based interface can be extended to include methods that can execute SQL queries based on user parameters and return a result matrix without constructing a proinfo like structure. 
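As a sketch of that extension, the snippet below builds such a parameterized trial-data query as a string. The class, table, and column names (TrialQueryBuilder, fact_table, subject_dim, and so on) are illustrative only and are not taken from the lab's actual schema; a production version would also bind the parameters through JDBC PreparedStatement placeholders rather than string concatenation.

```java
// Hypothetical sketch: build a parameterized trial-data query so that only
// the aggregated result, not the full proinfo structure, crosses the network.
public class TrialQueryBuilder {

    // Returns a SQL string that aggregates one fact-table signal column
    // for all trials matching the given subject and task. All identifiers
    // here are illustrative, not the thesis schema.
    public static String trialDataQuery(String signalColumn,
                                        String subject, String task) {
        return "SELECT trial, SUM(" + signalColumn + ") AS total"
             + " FROM fact_table f, subject_dim s, task_dim t"
             + " WHERE f.subject_key = s.subject_key"
             + " AND f.task_key = t.task_key"
             + " AND s.subject = '" + subject + "'"
             + " AND t.task = '" + task + "'"
             + " GROUP BY trial";
    }

    public static void main(String[] args) {
        // One query per user request, instead of 5 dimension-table queries
        // and one fact-table query per matching file.
        System.out.println(trialDataQuery("signal1", "A", "a"));
    }
}
```

A result-matrix method on the interface could then execute this string and copy the result set into a numeric array for Matlab, with no proinfo-like structure in between.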
5.5 Operational Aspects

This section focuses on two key operational aspects of a data warehouse system: parsing (preparing data for upload) and updating or inserting data. In our case, the operational aspects are simplified because the warehouse is frequently read but infrequently updated. The parse-and-upload step is therefore a periodic batch process, run once every few days, during which the system is placed in an off-line mode. This is a standard procedure for a warehouse system.

To estimate the mean parse time, we used a sample of 864 .pro files randomly chosen across all subjects (Appendix B outlines the process by which this sample size was selected). The time taken to parse the 864 files was 442,531 milliseconds, giving a mean parse time of 512 milliseconds (about 0.5 seconds) per .pro file. The same 864 files were uploaded to the database in 2,853 seconds (3.3 seconds per .pro file). Thus the entire parse-and-upload operation takes approximately 4 seconds per data file. From an operational point of view this is acceptable, given that daily recordings generate at most a few hundred data files. Daily updates can therefore be completed within 1-2 hours and can be automated.

5.6 Emergent Issues

As the analysis above shows, the data warehouse adds efficiency to the scientific process by introducing query support and automating data management tasks. However, the file-based approach has the advantages of flexibility and scalability. The following subsections discuss how the warehouse system measures up with respect to schema evolution and scalability.

5.6.1 Schema Evolution

The hybrid approach outlined in Chapter 4 (where the facts are stored in a regular table and the dimensions are stored in object tables) offers extensibility in two ways. First, because dimensional data is stored in object tables, it is possible to add and drop attributes without dropping the tables.
Secondly, the fact table can also be expanded by adding new attributes. However, dropping columns requires the table to be re-created.

The file-based approach offers maximum flexibility for changing or reorganizing data formats, but there is a significant cost in propagating such changes to the scripts that load and analyze the data. For instance, a version change in the .pro data file format requires changes in the readpro.m script and in all data analysis scripts that use it.

In terms of flexibility, as noted in Chapters 2 and 4, the raw data format changes from experiment to experiment. For example, two experiments for the same task can collect different signals on different days. It is important for such changes to be absorbed easily, without major modifications to the data cleaning scripts or the warehouse schema. Again, the file-based approach gives maximum flexibility, but at the cost of creating inefficiencies at the analysis level. In the new environment, the grammar-based parsing script absorbs all such variances and outputs files that can be uploaded to the data warehouse. Furthermore, the parser is easily extended by adding parsing rules for new input formats. Although at present the actual data upload is not incorporated into the parser, this could be done in the future by expanding the rule set so that the parser not only cleans and re-arranges the raw data but also uploads it to the warehouse. This would allow the parser to reconcile differences between the raw data and the warehouse schema.

5.6.2 Scalability

In the file-based approach, scalability is not an issue, and data retrieval is limited only by the amount of RAM available. Relational data management systems are scalable [19]. Furthermore, indexing frequently retrieved fields, such as the foreign keys in the dimension tables, ensures that query time will scale as the data set increases.
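The scalability claim rests on tree-structured indexes. As a loose, self-contained illustration (not the thesis implementation), a Java TreeMap, which, like the B+ trees used by relational systems, keeps its keys ordered in a balanced tree, turns a lookup over 100,000 entries into an O(log n) descent rather than a full scan:

```java
import java.util.TreeMap;

// Minimal sketch: a TreeMap stands in for a database index. All names
// (buildIndex, "trial-" labels) are illustrative, not from the thesis code.
public class IndexSketch {

    // Build an in-memory "index" mapping an integer key to a row label.
    public static TreeMap<Integer, String> buildIndex(int rows) {
        TreeMap<Integer, String> index = new TreeMap<>();
        for (int key = 0; key < rows; key++) {
            index.put(key, "trial-" + key);
        }
        return index;
    }

    public static void main(String[] args) {
        TreeMap<Integer, String> subjectIndex = buildIndex(100_000);
        // An indexed lookup inspects O(log n) tree nodes rather than
        // scanning all 100,000 entries, so lookup cost grows slowly
        // as the data set grows.
        System.out.println(subjectIndex.get(73_421)); // trial-73421
    }
}
```

The same logarithmic behaviour is what keeps query times from deteriorating as the fact table grows, provided the frequently filtered key columns are indexed.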
However, there are two issues with maintaining elaborate indexes:

1. Storage cost: With a large data set, the amount of storage consumed by indexes can be significant. Because of the fact table's relative size, its indexes are significantly larger than the dimension table indexes. However, with the present data set, the fact table indexes amount to 4% (approximately 965 MB) of the data set size. Even if the data set doubled in size, the indexes would need at most 1.9 GB of storage.

2. Update cost: Index maintenance becomes an issue with large uploads of data. However, the testing reported in the previous section shows that insert time is acceptable, so index maintenance is not an issue. Furthermore, mature relational database systems, such as DB2, have efficient index storage and maintenance mechanisms, such as storing indexes in B+ trees [19].

Thus, the storage and maintenance cost of indexes will not be an issue.

5.7 Summary

Table 5.1 summarizes this chapter in terms of the key requirements identified in Chapter 4. From the discussion above, we can assert that the data warehouse not only offers a viable option for managing scientific behavioral data, but also adds value to the scientific research process. The added value is derived from:

1. Enabling faster and easier access to research data.

2. Automating data management and analysis tasks.

3. Providing shared and concurrent access to research data.

4. Potentially aiding future data mining efforts.

5. Allowing researchers to spend more time on their core scientific activities, due to the efficiencies above, as opposed to writing code.

Requirement: Formal query support
File-based: No support, either in terms of metadata or trial data queries.
Database managed: Full support for both data management queries and trial data extraction. This includes the ability to filter data based on signals or time epochs (horizontal and vertical slicing of data).

Requirement: Data management
File-based: No automated way of executing data management tasks such as backups, recovery, and ensuring consistency. Also, the data source does not allow concurrent access.
Database managed: Data management tasks are now automated and handled by the database system. Furthermore, the data warehouse allows concurrent access to data and thus eliminates redundancy.

Requirement: Analysis
File-based: Since data is stored in ASCII format, it is easily readable by analysis software such as Matlab. However, the lack of query tools makes it hard to identify data of interest and also makes data analysis resource intensive.
Database managed: SQL queries allow easy access to data. Furthermore, built-in functions within SQL and the ability to define functions and procedures add efficiency to the process. Finally, the custom Matlab interface allows data to be queried from within the analysis platform.

Requirement: Warehouse operations
File-based: Data is stored on CDs with a redundant copy stored outside the lab. These CDs were managed by the lab administrator. This is a simple storage scheme; however, data management is cumbersome.
Database managed: Experimental data is parsed by a Perl-based grammar and uploaded to the database by built-in DB2 functions. Parsing takes about 0.5 seconds per data file (approximately 4 seconds including upload). Data is stored on the database server and is backed up to a tape device on a regular basis. This process is automated and managed by the DBMS.

Requirement: Schema evolution
File-based: Maximum flexibility in terms of changing the data structure; however, it is hard to propagate changes to data extraction and analysis scripts.
Database managed: The warehouse implementation allows maximum flexibility in the dimension tables in terms of adding and dropping attributes. In the fact table, adding signals is possible, but dropping columns requires the entire table to be recreated.

Requirement: Scalability
File-based: The only limiting factor in terms of scalability is the amount of random access memory available for program execution.
Database managed: Since key attributes in the fact table and dimension tables are indexed, query performance should not deteriorate with increased data volumes.

Table 5.1: A summary of the discussion analyzing the new database-managed environment against the file-managed environment.

Chapter 6

Conclusion and Future Work

This chapter concludes the discussion presented so far. Its goal is to summarize our discussion, draw out generalizable lessons and solutions from the implementation, and outline future research in the area.

6.1 Thesis Summary

As outlined in Chapter 1, the goal of this thesis is to develop an effective and efficient data management and analysis system for scientific behavioral data. Specifically, we propose a data warehousing model for this task. To accomplish this goal, we developed a proof-of-concept system for Dr. Scott's research lab, which conducts behavioral studies on limb motor control. As mentioned in Chapter 3, we use the evaluation criteria outlined by Burstein and Gregor as internal benchmarks for evaluating our work. Table 6.1 describes these criteria in detail and maps them to the current research.

Burstein's criteria: Significance
• Is the study significant theoretically?
• Is the study significant practically?
Current proposal: The significance of the study is more practical than theoretical. It will contribute towards establishing the viability of using a data warehouse system based on an object-relational platform in managing scientific behavioral data. It also contributes to the actual scientific research by delivering a more efficient data storage, retrieval, and analysis tool.

Burstein's criteria: Internal validity (refers to the credibility of the arguments made)
• Does the system work? Does it meet its stated objectives and requirements?
• Were predictions made in the study about the system?
• Have rival systems been considered?
Current proposal: Although some requirements are inherently subjective, overall the system does meet the requirements outlined in Chapter 4. Different systems (object-oriented, relational, and object-relational) have been considered. Also, rival implementation designs have been considered and evaluated throughout the development process.

Burstein's criteria: External validity
• Are the findings congruent with, connected to, or confirmatory of prior theory?
• Is the system generic enough to be applied to other settings?
• Is the transferable theory from the study made explicit?
Current proposal: While there is no one-size-fits-all theory for managing scientific data, it is generally accepted that an object-oriented model is more appropriate due to the complexity of the data and the need for a flexible schema. This study demonstrates the viability of using a warehouse system, based on an object-relational platform, for managing scientific data. The generality of this study, in terms of directly mapping our system to another problem, is limited by the fact that scientific applications have unique data sets and requirements. However, the study is generalizable in terms of the design principles that can be applied to other problems in the area. These are discussed in this chapter.

Burstein's criteria: Objectivity/confirmability
• Are the study's method and procedure described explicitly and in detail?
• Can we follow the procedure of how data was collected?
Current proposal: An appropriate research methodology has been carefully selected and is outlined in this thesis. The actual system and the experimentation on it are described in Chapters 4 and 5, respectively.

Burstein's criteria: Reliability/dependability/auditability
• Are the research questions clear?
• Are the basic constructs clearly specified?
Current proposal: The research goal, as stated previously, is to develop a data management and analysis model that can store, query, and analyze scientific data efficiently. Specifically, we propose a data warehouse system based on object-relational DBMS technology. The constructs in this case are the theoretical data models on which this system is based. These have been described and contrasted in Chapter 3.

Table 6.1: Research evaluation criteria. The left column shows the evaluation criteria suggested by Burstein et al. [8]. The right column shows how the current research measures up to these criteria.

In summary, this research has made the following contributions:

1. Through a proof-of-concept system, we demonstrated the viability of using a data warehouse system, based on an object-relational database platform, to manage and analyze scientific behavioral data. In doing so, we identified and articulated key data management and analysis problems faced by researchers using the KINARM paradigm. These challenges are translated into system requirements that could be generalized for other behavioral labs. Furthermore, we deliver a working data management system to behavioral scientists, and show the added value of using this system.

2. We also identify key limitations (outlined in the section below) of the warehouse system and propose solutions. These limitations serve as generalizable lessons and solutions for future and/or further development of a warehouse system for behavioral data.

6.2 Key Limitations and Possible Solutions

6.2.1 Arrays to Store Signals

One key limitation of the current data warehouse implementation is that we distribute temporal data over different rows in the fact table. A better design for storing the signal data would be to use an array within a column or an object. Figure 6.1 demonstrates how an array structure could be used in the fact table. The use of array structures within the relational framework would simplify both the fact table design and the end-user queries. Furthermore, it would allow us to define a coarser granularity at the relational level (each row contains the data for an entire trial), while retaining the ability to slice within the trial for specific subsets of data.
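Slicing by array index corresponds to slicing the trial by time through the sampling rate, as described in this section. A minimal sketch of that conversion, with hypothetical method names not taken from the thesis code:

```java
// Hedged sketch of the index/time correspondence: with a known sampling
// rate, an array index maps to a trial time instant and back.
public class SignalSlicing {

    // Trial time (seconds) of the sample at a given array index,
    // using: time = index * (1 / samplingRateHz)
    public static double indexToTime(int index, double samplingRateHz) {
        return index / samplingRateHz;
    }

    // Array index holding the sample recorded at a given trial time.
    public static int timeToIndex(double timeSeconds, double samplingRateHz) {
        return (int) Math.round(timeSeconds * samplingRateHz);
    }

    public static void main(String[] args) {
        // At 200 Hz, array index 40 corresponds to 0.2 s into the trial.
        System.out.println(indexToTime(40, 200.0));
        System.out.println(timeToIndex(0.2, 200.0));
    }
}
```

An end-user tool could use this mapping to translate a requested time epoch into the array slice bounds of an array-based query.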
For example, Figure 6.2 gives a sample query for the schema shown in Figure 6.1.

Fact table - current design:
Trial  Time  Signal1  Signal2
T1     5     0        1.5
T1     10    0        5
T1     15    1        3
...
T2     5     1        3
T2     10    1        4
T2     15    0        3

Fact table - with array type in columns:
Trial  Signal1    Signal2
T1     [0,0,1]    [1.5,5,3]
T2     [1,1,0]    [3,4,3]

Figure 6.1: Illustrates how an array-based structure can be used to store behavioral data signals.

The slicing within the trial is based on the signal array index, not the actual time instant of the movement. However, the conversion between an array index and the corresponding trial time instant is trivial given the sampling rate of the trial (array index * 1/sampling rate (Hz) = trial time instant). Although the array data type is part of the SQL3 specification,1 the basic DB2 implementation does not support it. Based on our experience, we recommend using arrays to encapsulate the data for an entire signal. The open-source PostgreSQL data management system supports the array type [38]. Also, extensions for the DB2 system that support such structures are commercially available [26]. Either one of these technologies should be considered for future implementations.

1 The extension of the original SQL specification to incorporate object-oriented features in relational systems [19].

SELECT Signal1[1:20], Signal2[21:40] FROM fact_table WHERE trial = 'T1' AND ...

Figure 6.2: Sample query illustrating how an array-based fact table (see Figure 6.1) could be queried. In this query, the first 20 data points from Signal1 and the next 20 data points from Signal2 are retrieved.

6.2.2 Source Data Upload

A key implementation challenge in this work was transforming file-based source data and uploading it to the data warehouse system. As outlined in Chapters 2 and 4, there are many variations in the source data from one experiment to another.
Furthermore, the data files were not tagged with a metadata language such as XML. Combined with the large data volumes, this makes the task of uploading historical data to the warehouse system formidable. The same is likely in other behavioral laboratories. Although the grammar-based parsing approach partly addresses this problem, to be truly effective it has to be integrated with the step that uploads data to the warehouse system. For example, at present, each variation in the source file results in a separate DB2 load/import script for the upload. Thus, future development needs to integrate the data parsing and upload steps. In fact, the parser could be extended to propagate changes in the source data to the warehouse schema automatically. This would be a critical step in extending this model to other behavioral labs.

6.3 Future Work

The section above not only identifies key limitations of the current implementation, but also points to potential future work. In addition, we identify the following key areas for future development, specifically for the warehouse system developed for Dr. Scott's lab:

1. Developing a library of functions and procedures for common analytical tasks. Ideally, these functions should be coded and stored within the database system. As mentioned in Chapter 5, there are numerous benefits to such database-managed procedures, including reducing data transfers across the local area network and moving computationally intensive tasks to a powerful server machine.

2. Developing front-end data analysis and visualization tools. The current implementation provides a basic front-end tool; however, it can be enhanced further for novice end-users. This involves developing a graphical user interface that enables users not only to query the data, but also to execute analytical tasks based on the pre-coded stored procedures and functions discussed above.
For example, a researcher could choose appropriate data selection criteria such as task, cell, subject, etc., and then ask a question such as "what is the mean discharge rate, and the preferred direction, for the cell?" This would be translated into a parameterized user-defined procedure that queries the database and performs the required analysis.

3. As outlined in previous chapters, a data warehouse system could speed up data mining processes by providing a structured data source that can be queried efficiently. This is especially true for data such as Dr. Scott's, which is voluminous and complex and requires temporal analysis. Having created a well-structured data source, future development should look into building a data mining module as part of the warehouse system. Again, functionalities such as database-managed procedures and functions are useful in developing such tools.

4. The data warehouse could also be extended to incorporate the raw experimental data, so that a researcher can move between a raw data signal and a filtered/processed data signal. In our implementation, we have incorporated the processed data; however, the raw data is not warehouse-managed.

5. Finally, the data warehouse system could be extended so that experimental data could be cross-linked to relevant publications, student analyses, and other documentation. This would extend the functionality of the system so that it serves as both a data source and a knowledge base for the lab.

6.4 Summary

In this thesis, we have demonstrated that data warehousing is a viable model for efficient storage and analysis of scientific behavioral data. We have also demonstrated how object-relational systems can be used to manage complex scientific data. Furthermore, we have shown the added value of such an approach to the scientific research process, in terms of the efficiencies introduced through the warehouse system.
Finally, through the system development process, we identify limitations of the current design and generalizable solutions for future development.

Bibliography

[1] K. Aberer. The use of object-oriented data models in biomolecular databases. In Conf. on Object-Oriented Computing in the Natural Sciences, Heidelberg, Germany, 1994.

[2] M. G. Axel and I. Song. Data warehouse design for pharmaceutical drug discovery research. In 8th International Conference and Workshop on Database and Expert Systems Application (DEXA) Workshop, pages 644–650, 1997.

[3] G. Baklarz and B. Wong. DB2 Universal Database v7.1, Database Administration Certification Guide, chapter 7, page 363. Prentice Hall PTR, 4th edition, 2001.

[4] F. Bancilhon. Object-oriented database systems. In Proceedings of the Seventh ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 152–162. ACM Press, 1988.

[5] F. Bancilhon. The O2 object-oriented database system. SIGMOD Rec., 21(2):7, 1992.

[6] A. Baruffolo and L. Benacchio. Object-relational DBMSs for large astronomical catalogue management. In Proc. Astronomical Data Analysis Software and Systems Conference Series, volume 145, pages 382–385, 1998.

[7] V. R. Basili, R. W. Selby, and D. H. Hutchens. Experimentation in software engineering. IEEE Transactions on Software Engineering, 12(7):733–743, July 1986.

[8] F. Burstein and S. Gregor. The system development or engineering approach to research in information systems: An action research perspective. In Proc. 10th Australasian Conference on Information Systems, 1999.

[9] P. Cabena, P. Hadjinian, R. Stadler, J. Verhees, and A. Zanasi. Discovering Data Mining: From Concept to Implementation. Prentice-Hall, Inc., 1998.

[10] R. Cattell. Experience with the ODMG standard. StandardView, 3(3):90–95, 1995.

[11] Y. Chi, C. R. Mechoso, M. Stonebraker, K. Sklower, R. Troy, R. R. Muntz, and E. Mesrobian. ESMDIS: Earth System Model Data Information System. In Yannis E.
Ioannidis and David M. Hansen, editors, Ninth International Conference on Scientific and Statistical Database Management, Proceedings, August 11-13, 1997, Olympia, Washington, USA, pages 116–118. IEEE Computer Society, 1997.

[12] E. F. Codd. A relational model of data for large shared data banks. Communications of the ACM, 13(6):377–387, 1970.

[13] D. Conway. Parse::RecDescent - generate recursive-descent parsers. Web, April 2003. http://search.cpan.org/∼dconway/. Current as of April 14, 2004.

[14] Microsoft Corporation. Microsoft SQL Server. http://www.microsoft.com/sql/default.asp. Current as of April 23, 2004.

[15] National Instruments Corporation. LabVIEW homepage. Web. http://www.ni.com/labview/. Current as of April 14, 2004.

[16] Oracle Corporation. Oracle Database. http://www.oracle.com/database/. Current as of April 23, 2004.

[17] T. Critchlow, G. Madhavan, and R. Musick. Automatic generation of warehouse mediators using an ontology engine. In Proceedings of the 5th Knowledge Representation meets Databases (KRDB) Workshop, pages 8.1–8.8, May 1998.

[18] W. Dubitzky, O. Krebs, and R. Eils. Minding, OLAPing, and mining biological data: Towards a data warehousing concept in biology. In Proc. Network Tools and Applications in Biology (NETTAB), CORBA and XML: Towards a Bioinformatics Integrated Network Environment, pages 77–82, Genoa, Italy, 2001.

[19] R. Elmasri and S. B. Navathe. Fundamentals of Database Systems. Addison-Wesley, 3rd edition, 2000.

[20] J. Fredriksson, P. Roland, and P. Svensson. Rationale and design of the European computerized human brain database system. In Proc. Eleventh International Conference on Scientific and Statistical Database Management, pages 148–157, August 1999.

[21] P. Gray and C. Israel. The data warehouse industry. Web, February 1999. http://www.crito.uci.edu/itr/publications/pdf/data warehouse.pdf. Current as of April 14, 2004.

[22] P. Gray and H. J. Watson.
Present and future directions in data warehousing. SIGMIS Database, 29(3):83–90, 1998.

[23] R. Grossman, X. Qin, D. Valsamis, and W. Xu. Analyzing high energy physics data using databases: A case study. In Proc. Seventh International Conference on Scientific and Statistical Database Management, pages 283–286, 1994.

[24] K. K. Hirji. Exploring data mining implementation. Communications of the ACM, 44(7):87–93, 2001.

[25] International Business Machines (IBM). DB2 product family. http://www-306.ibm.com/software/data/db2/. Current as of April 23, 2004.

[26] International Business Machines (IBM). Informix TimeSeries DataBlade module. http://www-306.ibm.com/software/data/informix/blades/timeseries/. Current as of April 23, 2004.

[27] MathWorks Inc. MATLAB, The Language of Technical Computing. MathWorks Inc., 3 Apple Hill Drive, Natick, MA 01760-2098, 6th edition, August 2002.

[28] W. H. Inmon. Building the Data Warehouse. Wiley Computer Publishing, 2nd edition, 1996.

[29] International Business Machines (IBM) Corporation. SQL Reference Guide. Web. http://webdocs.caspur.it/ibm/web/udb-6.1/db2s0/index.htm. Current as of April 14, 2004.

[30] R. R. Johnson. Elementary Statistics. PWS-KENT Publishing Company, 6th edition, 1992.

[31] R. Kimball, L. Reeves, M. Ross, and W. Thornthwaite. The Data Warehouse Lifecycle Toolkit. John Wiley & Sons, Inc., 1998.

[32] M. Krippendorf and I. Song. The translation of star schema into entity-relationship diagrams. In 8th International Conference and Workshop on Database and Expert Systems Application (DEXA) Workshop, pages 390–395, 1997.

[33] D. Maier and D. M. Hansen. Bambi meets Godzilla: Object databases for scientific computing. In Proc. Seventh International Conference on Scientific and Statistical Database Management, pages 176–184, 1994.

[34] S. McClure. Object database vs. object-relational databases. Web, August 1997. http://www.ca.com/products/jasmine/analyst/idc/14821E.htm#BKMTOC22. Current as of April 14, 2004.
[35] J. F. Nunamaker, M. Chen, and T. D. M. Purdin. Systems development in information systems research. Journal of Management Information Systems, 7(3):89–106, Winter 1990-1991.

[36] T. Pedersen. Aspects of Data Modelling and Query Processing For Complex Multidimensional Data. PhD thesis, Faculty of Engineering and Science, Aalborg University, Denmark, 2000.

[37] Plexon Inc. User's Guide, Version 2.0, Data Recording Software, Plexon Recorder. Web, June 2003. http://www.plexoninc.com/pdf/RecorderV2Manual.pdf. Current as of April 14, 2004.

[38] The PostgreSQL Global Development Group. PostgreSQL 7.4.2 Documentation. http://www.postgresql.org/docs/7.4/static/index.html. Current as of April 23, 2004.

[39] A. Rauf and S. M. Shah-Nawaz. An integrated database system at the national level for water resource engineers and planners of Bangladesh. In Proc. 12th International Conference on Scientific and Statistical Database Management, pages 247–249, 1997.

[40] R. Rew, G. Davis, S. Emmerson, and H. Davies. NetCDF User's Guide for C. Web, June 1997. http://www.unidata.ucar.edu/packages/netcdf/cguide.pdf. Current as of April 14, 2004.

[41] S. H. Scott, P. Cisek, S. Dorrepaal, J. Swaine, and S. Kong. Brainstorm - Technical Document Version 1. Laboratory of Dr. Steven Scott, Queen's University, Dept. of Anatomy and Cell Biology, Botterell Hall, Rm. 459.

[42] S. H. Scott. Role of motor cortex in coordinating multi-joint movements: Is it time for a new paradigm? Canadian Journal of Physiology and Pharmacology, 78:923–933, 2000.

[43] S. H. Scott. Neural activity in Primary Motor Cortex Related to Mechanical Loads Applied to the Shoulder and Elbow During a Postural Task. The American Physiological Society, June 2001.

[44] J. P. Shim, M. Warkentin, J. F. Courtney, D. J. Power, R. Sharda, and C. Carlsson.

[45] M. Stiber, G. A. Jacobs, and D. Swanberg. LOGOS: A computational framework for neuroinformatics research. In Proc.
Ninth International Conference on Scientific and Statistical Database Management, pages 212–222, 1997.

[46] R. Weber. Toward a theory of artifacts: A paradigmatic base for information systems research. Journal of Information Systems, 1(2):3–17, Spring 1987.

[47] R. Williams, P. Messina, F. Gagliardi, J. Darlington, and G. Aloisio. European Union - United States joint workshop on large scientific databases. Web, 1999. www.cacr.caltech.edu/euus. Current as of April 14, 2004.

[48] H. Wu, B. Norum, J. Newmark, B. Salzberg, C. M. Warner, C. DiMarzio, and D. Kaeli. The CenSSIS image database. In 15th International Conference on Scientific and Statistical Database Management, Proceedings, 9-11 July, 2003, pages 117–126. IEEE Computer Society, 2003.

Appendix A

Matlab Scripts

A.1 Metadata query 1

function MTstart = extract_MTstart(limit, options)
% EXTRACT_MTSTART
% This function will collect data from 'limit' number of files (where
% limit is an integer parameter inputted by the user), and extract the
% MTstart values for all trials in one direction for all files. The user
% can also specify the task and the monkeys to be used.
%
% Parameters:
%   1. limit   - The number of files to get.
%   2. options - A cell array of options.
%      a) 'task'    - A character representing the task to be examined
%                     (optional - 'a' is default).
%      b) 'monkeys' - A cell string array representing the monkeys to be
%                     considered (optional - 'use all' is default).
%
% Author: Jon Swaine
% Title:  Computer Programmer
% Dept:   Department of Cell Biology and Anatomy
% Date:   March 15, 2004
%
% Written for use in Stephen Scott's Data Analysis laboratory in
% Botterell Hall, Queen's University, Kingston, Ontario, Canada.

% Initialize variables
pro_files = [];
file_counter = 0;

% Default task is 'a', unloaded reaching
task = 'a';

% Default is to run all of the monkeys.
monkeys = {'A' 'B' 'C' 'D'};

% Check the options parameter to see what parameters have been included
if nargin == 2
    for x = 1:length(options)
        % If the user does not specify the task, then the default is
        % unloaded reaching (the 'a' task).
        if strcmp(options{x}{1}, 'task')
            task = options{x}{2};
        % If the user doesn't specify the subject(s), analyze all of them
        % ('A', 'B', 'C', and 'D')
        elseif strcmp(options{x}{1}, 'monkeys')
            monkeys = options{x}{2};
        end
    end
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% GET THE FILES NEEDED FOR COMPARISON
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
for x = 1:length(monkeys)
    % The directory containing the folders with cell data.
    input_dir = ['c:\data\PVector_Data\Monkey', monkeys{x}, '\'];
    % Enter directory where folders are located
    eval(['cd ', input_dir]);
    % Collect all files in temporary variable
    temp = dir;
    % Collect only the names of the files (folders) in a cell array
    folders = {temp.name};
    clear temp;
    % Go through the names, looking for folders that have the word
    % 'cell' in them
    for y = 1:length(folders)
        % If the name has 'cell' in it, assume it's a folder containing
        % cell data
        if ~isempty(findstr('cell', folders{y}))
            % Enter the folder
            eval(['cd ', input_dir, folders{y}]);
            % Collect all of the files and then extract the names to a
            % variable called 'files'
            temp2 = dir;
            files = {temp2.name};
            clear temp2;
            % Go through the file names and replace the 5th character
            % with a '1', indicating that we only want data for target 1.
            for z = 1:length(files)
                files{z}(5) = '1';
            end
            % Get rid of any duplicate file names.
            unique_files = unique(files);
            clear files;
            % Go through each unique file name and ...
            for z = 1:length(unique_files)
                % ... if the file name has a '.pro' extension and if the
                % task matches the one we're looking for, add the
                % filename to the list of files to be analyzed.
                if unique_files{z}(2) == task & strcmp(unique_files{z}(9:12), '.pro')
                    file_counter = file_counter + 1;
                    pro_files{file_counter} = unique_files{z};
                    break;
                end

                % If we have collected 'limit' number of files, then
                % stop collecting files
                if file_counter == limit
                    break;
                end
            end
            clear unique_files;
        end

        % If we have collected 'limit' number of files, then stop
        % collecting files
        if file_counter == limit
            break;
        end
    end
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% END - GET THE FILES NEEDED FOR COMPARISON
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% Go through the pro files to be analyzed, read the pro file into a
% struct and get the MTstart value.
for x = 1:length(pro_files)
    % Enter proper directory
    input_dir = ['c:\data\PVector_Data\Monkey', upper(pro_files{x}(1)), '\'];
    eval(['cd ', input_dir, 'cell', pro_files{x}(6:8)]);

    % Read the pro file into a structure
    pro = read_pro(pro_files{x});

    % Look in the features field and extract the MTstart value
    % (at index 3) for each trial.
    for y = 1:size(pro.features, 2)
        MTstart(x, y) = str2num(pro.features(3, y).value);

        % Report error if 0 value found
        if ~MTstart(x, y)
            disp('Zero value detected.');
        end
    end
end

A.2 Metadata query 2

function a_pre_b_list = find_files_with_tasks_a_and_b(limit, options)
% FIND_FILES_WITH_TASKS_A_AND_B
% This function will find 'limit' number of cells (where limit is an
% integer parameter inputted by the user), where there is data for both
% the 'a' and the 'b' task.
%
% Parameters:
%   1. limit   - The number of files to get.
%   2. options - A cell array of options.
%      a) 'task'    - A character representing the task to be examined
%                     (optional - 'a' is default).
%      b) 'monkeys' - A cell string array representing the monkeys to
%                     be considered (optional - 'use all' is default).
% Author: Jon Swaine
% Title:  Computer Programmer
% Dept:   Department of Cell Biology and Anatomy
% Date:   March 15, 2004
%
% Written for use in Stephen Scott's Data Analysis laboratory in
% Botterell Hall, Queen's University, Kingston, Ontario, Canada.

b_pro_files = [];

% Default value for tasks is 'a' and 'b'
tasks = {'a', 'b'};

% Default value for monkey is C
monkeys = {'C'};

% Check the options parameter to see what parameters have been included
if nargin == 2
    for x = 1:length(options)
        % If the user doesn't specify the subject(s), analyze monkey C
        if strcmp(options{x}{1}, 'monkeys')
            monkeys = options{x}{2};
        end
    end
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% GET THE FILES NEEDED FOR COMPARISON
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
file_counter = 0;
for x = 1:length(monkeys)
    % Enter the proper directory
    input_dir = ['c:\data\PVector_Data\Monkey', upper(monkeys{x}), '\'];
    eval(['cd ', input_dir]);

    % Save all files within the folder (should be more folders) in a
    % temporary variable
    temp = dir;
    folders = {temp.name};
    clear temp;

    % Go through the folders and when you find that the folder contains
    % cell data, see what's in it
    for y = 1:length(folders)
        % Is it cell?
        if ~isempty(findstr('cell', folders{y}))
            % Enter directory
            eval(['cd ', input_dir, folders{y}]);
            temp = dir;

            % Collect names
            files = {temp.name};
            clear temp;

            % Go through the files (starting at 3 to skip the '.' and
            % '..' entries) and if you find any with a 'b' for task and
            % a '.pro' extension, add the filename to a list.
            for z = 3:length(files)
                if files{z}(2) == 'b' & files{z}(9:12) == '.pro'
                    file_counter = file_counter + 1;
                    b_pro_files{file_counter} = files{z};

                    % Once enough files have been collected, exit the
                    % 'for z' loop
                    if file_counter == limit
                        break;
                    end
                end
            end
            clear files;
        end

        % Once enough files have been collected, exit the 'for y' loop
        if file_counter == limit
            break;
        end
    end

    % Once enough files have been collected, exit the 'for x' loop
    if file_counter == limit
        break;
    end
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% END - GET THE FILES NEEDED FOR COMPARISON
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
counter = 1;

% Go through and read the b and a data. Then compare the timestamps on
% them to see if 'a' was collected before 'b'.
for x = 1:length(b_pro_files)
    input_dir = ['c:\data\PVector_Data\Monkey', upper(b_pro_files{x}(1)), '\'];
    eval(['cd ', input_dir, 'cell', b_pro_files{x}(6:8)]);

    % Read the 'b' file
    b_pro = read_pro(b_pro_files{x});

    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    % Read the corresponding 'a' file

    % Construct the corresponding 'a' file name.
    a_file = [b_pro_files{x}(1), 'a', b_pro_files{x}(3:12)];

    % If the set number for the 'b' file is 2 and there is no 'a2' file,
    % check to see if there is an 'a1' file instead.
    if isempty(dir(a_file)) & b_pro_files{x}(3) == '2'
        a_file = [b_pro_files{x}(1), 'a1', b_pro_files{x}(4:12)];

    % If the set number for the 'b' file is 1 and there is no 'a1' file,
    % check to see if there is an 'a2' file instead.
    elseif isempty(dir(a_file)) & b_pro_files{x}(3) == '1'
        a_file = [b_pro_files{x}(1), 'a2', b_pro_files{x}(4:12)];

    % If there is still no 'a' file, continue with the for loop
    elseif isempty(dir(a_file))
        disp('No ''a'' file available.');
        continue;
    end

    % Read the 'a' file into a_pro
    a_pro = read_pro(a_file);
    % End reading of a_file
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

    % If 'a' was recorded before 'b', add the filename to the list.
    % Check dates first, and if they are the same, then check the times.

    % If the 'a' date is found to be earlier than the 'b' date
    if datenum(a_pro.date) < datenum(b_pro.date)
        % Add to list
        a_pre_b_list{counter} = b_pro_files{x};
        % Increase counter
        counter = counter + 1;

    % If the dates are the same ...
    elseif datenum(a_pro.date) == datenum(b_pro.date)
        % ... see if the 'a' time is prior to the 'b' time
        if datenum(a_pro.time) < datenum(b_pro.time)
            % Add to list
            a_pre_b_list{counter} = b_pro_files{x};
            % Increase counter
            counter = counter + 1;
        end
    end
end

% Display some results
disp(['Number of files collected with a and b tasks - ', ...
    num2str(length(b_pro_files))]);
disp(['Number of files where a was collected prior to b - ', ...
    num2str(length(a_pre_b_list))]);

A.3 Trial data query

function cell_spike_rate = find_cell_spike_rate_between_RT_MT(limit, options)
% FIND_CELL_SPIKE_RATE_BETWEEN_RT_MT
% This function will collect data from 'limit' number of files (where
% limit is an integer parameter inputted by the user), and extract the
% MTstart and RTstart values. It will then find the total number of
% cell spikes between those two times and calculate the cell spike
% firing rate.
%
% Parameters:
%   1. limit   - The number of files to get.
%   2. options - A cell array of options.
%      a) 'task'    - A character representing the task to be examined
%                     (optional - 'a' is default).
%      b) 'monkeys' - A cell string array representing the monkeys to
%                     be considered (optional - 'use all' is default).
%
% Author: Jon Swaine
% Title:  Computer Programmer
% Dept:   Department of Cell Biology and Anatomy
% Date:   March 15, 2004
%
% Written for use in Stephen Scott's Data Analysis laboratory in
% Botterell Hall, Queen's University, Kingston, Ontario, Canada.

% Initialize variables
pro_files = [];
file_counter = 0;
task = 'a';
monkeys = {'A' 'B' 'C' 'D'};
sampling_rate = 200;

% Check the options parameter to see what parameters have been included
if nargin == 2
    for x = 1:length(options)
        % If the user does not specify the task, then the default is
        % unloaded reaching (the 'a' task).
        if strcmp(options{x}{1}, 'task')
            task = options{x}{2};
        % If the user doesn't specify the subject(s), analyze all of
        % them ('A', 'B', 'C', and 'D')
        elseif strcmp(options{x}{1}, 'monkeys')
            monkeys = options{x}{2};
        end
    end
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% GET THE FILES NEEDED FOR COMPARISON
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
for x = 1:length(monkeys)
    % The directory containing the folders with cell data.
    input_dir = ['c:\data\PVector_Data\Monkey', monkeys{x}, '\'];

    % Enter directory where folders are located
    eval(['cd ', input_dir]);

    % Collect all files in temporary variable
    temp = dir;

    % Collect only the names of the files (folders) in a cell array
    folders = {temp.name};
    clear temp;

    % Go through the names, looking for folders that have the word
    % 'cell' in them
    for y = 1:length(folders)
        % If the name has 'cell' in it, assume it's a folder containing
        % cell data
        if ~isempty(findstr('cell', folders{y}))
            % Enter the folder
            eval(['cd ', input_dir, folders{y}]);

            % Collect all of the files and then extract the names to a
            % variable called 'files'
            temp2 = dir;
            files = {temp2.name};
            clear temp2;

            % Go through the file names and replace the 5th character
            % with a '1', indicating that we only want data for target 1.
            for z = 1:length(files)
                files{z}(5) = '1';
            end

            % Get rid of any duplicate file names.
            unique_files = unique(files);
            clear files;

            % Go through each unique file name and ...
            for z = 1:length(unique_files)
                % ... if the file name has a '.pro' extension and if the
                % task matches the one we're looking for, add the
                % filename to the list of files to be analyzed.
                if unique_files{z}(2) == task & strcmp(unique_files{z}(9:12), '.pro')
                    file_counter = file_counter + 1;
                    pro_files{file_counter} = unique_files{z};
                    break;
                end

                % If we have collected 'limit' number of files, then
                % stop collecting files
                if file_counter == limit
                    break;
                end
            end
            clear unique_files;
        end

        % If we have collected 'limit' number of files, then stop
        % collecting files
        if file_counter == limit
            break;
        end
    end
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% END - GET THE FILES NEEDED FOR COMPARISON
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% Steps:
% For all selected files
%   1. Generate the pro structure from the pro file
%   2. Find RT and MT start.
%   3. Then find the number of spikes between the times at which RT
%      and MT start occur.
%   4. Calculate the rate of cell spike firing.
for x = 1:length(pro_files)
    % Go into proper directory
    input_dir = ['c:\data\PVector_Data\Monkey', upper(pro_files{x}(1)), '\'];
    eval(['cd ', input_dir, 'cell', pro_files{x}(6:8)]);

    % Read pro file
    pro = read_pro(pro_files{x});

    for y = 1:size(pro.features, 2)
        % Get RTstart (index 1) and MTstart (index 3)
        RTstart = str2num(pro.features(1, y).value);
        MTstart = str2num(pro.features(3, y).value);

        % Get RT and MT indices from the time column (column 1)
        RTind = min(find(RTstart <= pro.data{y}(:, 1)));
        MTind = min(find(MTstart <= pro.data{y}(:, 1)));

        % Get cell spikes between those indices (column 8 is the cell
        % spike column)
        cellspikes = sum(pro.data{y}(RTind:MTind, 8));

        % Calculate the rate
        cell_spike_rate(x, y) = cellspikes / ((MTind - RTind) * (1 / sampling_rate));
    end
end

Appendix B

Statistical Formulae

In Chapter 5, we used a sample of 864 .pro files to determine the mean data parse and upload time. The central limit theorem equation outlined below (Equation B.1) was used to determine this sample size [30]. The equation allows us to make a statistical inference on the sample size necessary to determine the mean parse time and data upload time for .pro files. We assume that the size of a .pro file is proportional to the number of data points in the file, which in turn is proportional to the time it takes to parse and upload the file. Thus we estimate the sample size necessary to determine the mean size of a .pro file at 95% confidence ((1 - α) in the equation) and with an error of estimate of 10KB (E in the equation). We then use this sample size to determine the mean parse/upload time.
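As a worked check of this calculation, the sample-size formula can be evaluated directly. The following Python sketch is illustrative only (it is not part of the thesis tooling); the constants are the ones used in this appendix (SD = 150KB, E = 10KB, z(α/2) = 1.96 for 95% confidence).

```python
# Illustrative sketch (not thesis code): sample size from the
# normal-approximation formula n = (z(alpha/2) * SD / E)^2.

def sample_size(z: float, sd: float, e: float) -> float:
    """Sample size needed to estimate a population mean to within
    error e at the confidence level implied by z, given an estimated
    population standard deviation sd (same units as e)."""
    return (z * sd / e) ** 2

# Values used in this appendix: SD = 150KB, E = 10KB,
# z(alpha/2) = 1.96 for a 95% confidence level.
n = sample_size(1.96, 150.0, 10.0)
print(round(n))  # 864, the sample size used in Chapter 5
```

Rounding (1.96 × 150 / 10)² = 29.4² = 864.36 gives the 864-file sample reported in Chapter 5.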
One problem with this analysis is estimating the population standard deviation in file sizes. In this case, the population consists of over 20,000 files scattered across 200 CDs. This makes it hard to compute the population statistics directly, so a large random sample is used to estimate the population standard deviation.

n = ( z(α/2) · SD / E )^2                                (B.1)

SD = Standard Deviation = 150KB. The standard deviation in .pro file sizes was determined from a sample of 4750 files.
E = Maximum error of estimate = 10KB.
α = 0.05. The confidence level is 95%, thus (1 - α) = 0.95.
z(α/2) = z(0.025) = 1.96. This value was determined from a statistical table showing z values for areas of the standard normal distribution.

Appendix C

A sample .pro file

Monkey     Arm    Hemisphere  Mass  ArmLen  ForearmLen  Chamber  PenNum  PenX  PenY  Rate  Mot1  Mot2  Date       Time      Proto
sthapatya  RIGHT  LEFT        9.2   154     230         1        26      -5    -2    200   1     1     8/29/2003  10:32 AM  b1
ver1.0

STATE CONDITIONS
State  LightOn  LightOff  TarPos  Motor  PosLim  TimeLim  TimeVar
1      0        0         0       0      7       2000     0
2      0        1         0       0      8       1250     250
2      1        0         0       0      8       150      0
3      1        0         0       0      8       750      0
4      1        0         1       0      13      200      0
1      1        0         1       0      13      200      0
2      1        1         1       0      13      1250     250
0      -1       0         1       0      0       0        0

CHANNEL CONFIG
Channel:  Time  HandXPos  HandYPos  ShoAng  ElbAng  ShoVel  ElbVel  Cell1  HandXAcc  HandYAcc  ShoTor  ElbTor  ShoAcc  ElbAcc  Mot1Tor  Mot2Tor  TanAcc  TanVel
Min:      4.5  11.1533  279.69  0.518  1.282  -0.544778  -0.759  0  -0.136418  -1.25166  0.033736  0.129944  -8.07119  -16.7932  -0.015945  -0.008763  0.000734  8.38E-06
Max:      4729.25  75.877  303.756  0.626  1.542  0.325  0.318  2  0.178256  -1.03677  0.079059  0.17446  12.4825  12.8872  0.012368  0.024688  1.94156  0.25464
Filter:   (none)  (none)  (none)  (none)  (none)  butter6-0.05  butter6-0.05  (none)  butter6-0.05  butter6-0.05  butter6-0.05  butter6-0.05  (none)  (none)  (none)  (none)  (none)  (none)

FILE PRODUCTION INFO
Source files: db1x0157.sam

TRIAL HEADER
TargetNum  StartXPos  StartYPos  TarXPos  TarYPos  Scans
0          18         286        73       309      712

STATE TRANSITIONS
4.75  259  1761.25  1912.5  2234.25  2435.75
2567.75  3818.25

TRIAL FEATURES
Feature   Method  Value
RTstart   State   1761.25
MTstart   TanAcc  2044.75
MTstart   State   2234.25
THTstart  State   2567.75
RTdur     TanAcc  283.5
MTdur     TanAcc  523

Time    HandXPos  HandYPos  ShoAng  ElbAng   ShoVel    ElbVel     Cell1
259.75  11.1533   286.279   0.604   1.49367  0.217548  -0.436975  0
264.75  11.244    286.477   0.6044  1.4922   0.204091  -0.408788  0
269.75  11.381    286.774   0.606   1.489    0.190539  -0.38052   0
274.75  11.466    286.96    0.607   1.488    0.176833  -0.352126  0
279.75  11.501    287.116   0.607   1.48633  0.162951  -0.323621  0
284.75  11.603    287.31    0.608   1.48522  0.148916  -0.295101  0

HandXAcc  HandYAcc  ShoTor     ElbTor    ShoAcc   ElbAcc    Mot1Tor    Mot2Tor   TanAcc    TanVel
0.558528  0.266653  -0.013822  0.010019  3.62699  -2.29854  -0.011313  0.010065  0.618916  0.058386
0.517531  0.272408  -0.014224  0.011433  3.36072  -2.00993  -0.010703  0.009173  0.584846  0.054516
0.478301  0.276816  -0.014991  0.012712  3.1059   -1.74274  -0.010152  0.008327  0.55263   0.050604
0.441849  0.279512  -0.015978  0.013886  2.86918  -1.496    -0.009694  0.007545  0.522836  0.046749
0.40893   0.280295  -0.017068  0.014764  2.6554   -1.28245  -0.009358  0.00684   0.495771  0.042882
0.379982  0.279113  -0.018101  0.015273  2.46742  -1.10269  -0.009164  0.006219  0.471477  0.039046

Appendix D

Regular Expressions For Parsing Grammar

# GRAMMAR
# @Author Baiju Devani
# Dec 9th 2003
# Comments: To add a rule to the grammar, one needs to do the following:
#   1) Name the rule with a token. For example ruleNew
#   2) Add the rule in the startrule sequence. Or it can be embedded
#      within one of the subrules
#   3) Add the rule token definition in terms of the regex and the action
#      to perform when the rule is found (if any). For example:
#      ruleNew: /regex_for_things_to_find/ { do something; item[1] }
# Parse::RecDescent is used to parse the grammar. This is a powerful
# tool and can do much more. Full documentation can be found on CPAN.

startrule: Mdef Ver(?)
    State <commit> Channel FileInfo Repeat(s)
    {
        @::Monkeydef = split /\n/, $item{Mdef};
        #$item{Ver} =~ /^ver(\d\.\d)/;
        # Since this grammar is for monkey data, version is hardcoded
        $::version = "1.0";
        main::write_to_File("mdef", \@::Monkeydef, 1);

        @::StateCond = split /\n/, $item{State};
        main::write_to_File("StateCond", \@::StateCond, 2);
        1;
    }

Repeat: Trial StateTrans TrialFeatures Data
    {
        # Increase trial num for every trial encountered
        $::trial_num++;

        @::TrialHeader = split /\n/, $item{Trial};
        main::write_to_File("TrialHeaders", \@::TrialHeader, 2);

        @::StateTrans = split /\n/, $item{StateTrans};
        main::write_to_File("StateTrans", \@::StateTrans, 1);

        @::TrialFeatures = split /\n/, $item{TrialFeatures};
        main::write_to_File("TrialFeatures", \@::TrialFeatures, 2);

        @::Data = split /\n/, $item{Data};
        main::write_to_File("Data", \@::Data, 1);
    }

Data: Data1 | Data2 | Data3 | Data4 | DataEmg

Mdef: /^Monkey\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(Proto)\n((.+)\n)/m

Ver: /^ver\d\.\d/

State: /STATE CONDITIONS\nState(.+)TimeVar\n((.+)\n)*/m

Channel: /CHANNEL CONFIG\nChannel.+\n((.+)\n)*/m

FileInfo: /FILE PRODUCTION INFO\n((.+)\n)*/m

Trial: /TRIAL HEADER\n((.+)\n)*/m

StateTrans: /STATE TRANSITIONS\n((.+)\n)*/m

TrialFeatures: /TRIAL FEATURES\n((.+)\n)*/m

Data1: /(Time)\t(HandXPos)\t(HandYPos)\t(ShoAng)\t(ElbAng)\t(ShoVel)\t(ElbVel)\t(Cell1)\t(HandXAcc)\t(HandYAcc)\t(ShoTor)\t(ElbTor)\t(ShoAcc)\t(ElbAcc)\t(Mot1Tor)\t(Mot2Tor)\t(TanAcc)\t(TanVel)\t\n((.+)\n)*/m
    {
        $::grammar_type = 1;
        # Need to do this so that the outer rule of the grammar can
        # refer to item{Data}
        $item[1];
    }

Data2: /(Time)\t(HandXPos)\t(HandYPos)\t(ShoAng)\t(ElbAng)\t(ShoVel)\t(ElbVel)\t(Cell1)\t(HandXAcc)\t(HandYAcc)\t(ShoTor)\t(ElbTor)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(ShoAcc)\t(ElbAcc)\t(Mot1Tor)\t(Mot2Tor)\t(TanAcc)\t(TanVel)\t\n((.+)\n)*/m
    {
        $::grammar_type = 2;
        $item[1];
    }

Data3: /(Time)\t(HandXPos)\t(HandYPos)\t(ShoAng)\t(ElbAng)\t(ShoVel)\t(ElbVel)\t(Cell1)\t(Cell2)\t(HandXAcc)\t(HandYAcc)\t(ShoTor)\t(ElbTor)\t(ShoAcc)\t(ElbAcc)\t(Mot1Tor)\t(Mot2Tor)\t(TanAcc)\t(TanVel)\t\n((.+)\n)*/m
    {
        $::grammar_type = 3;
        $item[1];
    }

Data4: /(Time)\t(HandXPos)\t(HandYPos)\t(ShoAng)\t(ElbAng)\t(ShoVel)\t(ElbVel)\t(Cell1)\t(Cell2)\t(HandXAcc)\t(HandYAcc)\t(ShoTor)\t(ElbTor)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(ShoAcc)\t(ElbAcc)\t(Mot1Tor)\t(Mot2Tor)\t(TanAcc)\t(TanVel)\t\n((.+)\n)*/m
    {
        $::grammar_type = 4;
        $item[1];
    }

DataEmg: /(Time)\t(HandXPos)\t(HandYPos)\t(ShoAng)\t(ElbAng)\t(ShoVel)\t(ElbVel)\t(HandXAcc)\t(HandYAcc)\t(ShoTor)\t(ElbTor)\t((EMG.+)\t)+(ShoAcc)\t(ElbAcc)\t(Mot1Tor)\t(Mot2Tor)\t(TanAcc)\t(TanVel)\t\n((.+)\n)*/m
    {
        $::grammar_type = 5;
        $item[1];
    }

DataEmg: /(Time)\t(HandXPos)\t(HandYPos)\t(ShoAng)\t(ElbAng)\t(ShoVel)\t(ElbVel)\t(Cell1)\t(HandXAcc)\t(HandYAcc)\t(ShoTor)\t(ElbTor)\t((EMG.+)\t)+(ShoAcc)\t(ElbAcc)\t(Mot1Tor)\t(Mot2Tor)\t(TanAcc)\t(TanVel)\t\n((.+)\n)*/m
    {
        $::grammar_type = 6;
        $item[1];
    }

Glossary

KINARM  Kinesiological Instrument for Normal and Altered Reaching Movements. A device/paradigm used for behavioral studies on upper limb movement and coordination. Page 6.

OLAP  Online Analytical Processing. A database workload characterized by ad-hoc queries (on large amounts of data) and infrequent updates. Page 21.

OLTP  Online Transaction Processing. A database workload characterized by a large number of data transactions (inserts, updates, retrievals) in short periods of time. Usually such systems are used concurrently for inserts and updates by a large number of users. Page 20.

OODBMs  Object-Oriented Database Management System. A data management system based on object-oriented constructs. Data is defined in terms of objects, which have attributes as well as methods and functions to manipulate the data. Page 16.
ORDBMs  Object-Relational Database Management System. A relational data management system that allows the use of object-oriented features within the relational database model (see RDBMs and OODBMs). Page 19.

RDBMs  Relational Database Management System. A data management system in which the primary constructs are tables (relations), columns (attributes), and rows (tuples). Relationships between tables are established by keys that are common across the tables. Page 14.

SQL  Structured Query Language. A standardized data definition, query, and update language for relational database management systems. Page 16.