Neuroinformatics Providing the Tools for a
Systems Biology Approach to Neuroscience
Sogol Azarmgin, M.A.Sc. Candidate, University of Toronto, IBBME

Abstract—The new and emerging field of systems biology
enables researchers and scientists to enhance their understanding
of biological systems by examining the overall structure and
dynamics of a system. Applying systems biology to neuroscience
for better understanding of brain functions and mechanisms
requires integration of information from the level of gene to the
level of behavior. The enormous amount of information at each
of the many levels involved, along with the complexity of these
data due to their diverse and distributed nature, necessitates
the development of new databases and their associated
informatics tools. This document presents an overview of the
challenges and the issues facing the development, implementation,
and maintenance of new neuroinformatics tools.
Index Terms—neuroinformatics, systems biology, databases,
data heterogeneity, integrative neuroscience.
I. INTRODUCTION
Systems biology, a new and emerging field, aims to explore
the behavior and relationships of all of the elements in a
particular biological system while it is functioning [1]. This
approach would inherently require collective efforts from
multiple research areas such as genetics and molecular
biology, high precision measurement, computer science and
other scientific and engineering fields [2]. As a result, data
sharing and development of data repositories with their
associated tools are of crucial importance to make the best use
of the massive data sets available.
Prior to the prevalence of electronic data sharing, the traditional
mode of experimental science involved little data sharing
where each laboratory generated its data to solve a problem,
published a fraction of it, and moved on. The published data
remained in the hard copy volume or reprint, where it was
inaccessible by electronic means; the unpublished data were
lost forever. This classical approach was altered during the
1980’s by the gene-sequencing community. Genetic and
protein sequences were submitted, in parallel with journal
publications, to central databases where they could be archived
and made accessible to others for further analyses such as
determining homologies across different species [3]. Such
databases and their associated tools have formed the
foundations for the development of bioinformatics, a key
discipline, as essential to molecular biology as are its
experimental methods.
In direct analogy to the indispensable role of bioinformatics
to molecular biology, neuroinformatics is about to gain
considerable importance in helping to overcome the daunting
challenge of understanding the human brain. To ensure the
greatest possible benefit from the pool of data collected over
the past decades, the neuroscientific community has started
moving in a similar direction and has begun its efforts (e.g.
the Human Brain Project) to develop new information
management systems in the form of interoperable databases
with associated data management tools. These tools would
include algorithms and software packages to perform querying,
data mining, image processing, modeling, simulation and
electronic collaboration.
Neuroinformatics offers a promising future to the field of
neuroscience, but there are numerous challenges that need to
be resolved in order to make advancements. These challenges
include issues pertaining to development, implementation and
maintenance of databases in addition to technical,
standardization, legal and ethical concerns. This paper focuses
its attention on the challenges faced by the neuroscientific
community in the three phases of development,
implementation, and maintenance of databases and their
associated tools. It also provides a perspective into the future
of neuroinformatics.
II. DEVELOPMENT OF DATABASES & ASSOCIATED TOOLS
Unlike the linear genetic and protein sequences, the
neuroscientific data are very complex and heterogeneous in
nature spanning multiple scales and dimensions of analysis.
The complexity of dealing with these multiple dimensions and
scales is further complicated by the specific connectivity of
neuronal pathways and the dynamic temporal dimensions of
the brain [4]. Fig. 1 is an illustration of the complexity
involved in neuroscience research. As depicted in this figure,
the multiple dimensions of analysis include, but are not
limited to, imaging, neurophysiology, neurochemistry, cellular
signaling, morphology, molecular biology and genetics. The
multiple scales of analysis include intracellular, extracellular,
single neurons, networks of neurons and systems of neurons
that extend to the whole brain studies. This overwhelming
complexity requires innovative ways for development of
databases and their associated tools so that they can serve as
effective channels of communication and collaboration
between neuroscientists worldwide and provide them with the
capability to analyze the brain's mechanisms and functional
interactions in greater depth. Achieving this goal would in turn
require a clear understanding and agreement on the scope of
these databases and the techniques for data acquisition and
representation.
Fig. 1: The complexity of neuroscience research is due to
the multiple scales (top) and dimensions (bottom) of
analysis across the lifetime and in different situations [4].
A. Scope of Databases and Associated Software Tools
According to Kötter [5], the current scope of neuroscience
databases and their associated software tools ranges from data
inventories for personal use, to specialized data collections by
a group or community of collaborating neuroscientists, to large
multi-scale projects of general interest. Individual and
specialized data collections and their associated software tools
usually arise when individual research groups pursue their own
specific requirements. Many of these efforts are not published,
and at times similar work may be recreated in different places.
Nonetheless,
some of these efforts have grown and have made considerable
impact on modeling and analysis capabilities. For example, the
GENeral NEural SImulation System (GENESIS) is a model-based database system that holds data ranging from the
characteristics of ion channels in cell membranes to the details
of the connections between neurons. But these data only
become usefully organized when a user creates a model of a
neuron or networks of neurons [6]. Other examples of isolated
efforts by different groups include the BrainMap Project and
NEURON [5].
Large-scale projects, typically funded and carried out by a
consortium of research groups, are aimed at supporting a variety
of research approaches to develop databases and tools that can
efficiently deal with the complexity of brain data. The
International Consortium for Brain Mapping (ICBM) and the
American Human Brain Project were both initiated in 1993 to
support research into databases and related tools for
neuroscientists [7].
Much like the preceding Human Genome Project, which
completely changed the way research was carried out in molecular
biology, the Human Brain Project has thus far shown that the
scope of informatics approaches is of international scale and
significance and requires a highly organized collaboration
between neuroscientists worldwide.
B. Data Acquisition
Acquiring data is perhaps the first prerequisite for any
database development effort. In other words, without
availability of data, there is not much incentive to initiate
database development efforts; as has been the case in most
situations, it is the growing amount of available data
that calls for more efficient ways of handling them.
Neuroscience is no exception as there has been an enormous
explosion of experimental data over the past few decades
ranging from multielectrode recordings in behaving animals
to functional imaging experiments in humans. While the
availability of data poses no problems, it is the lack of
compatibility between different data formats that can hinder
data exchange and integration. The incompatibility and lack
of standardization is the direct result of various independent
development efforts in the past.
A second problem in data acquisition is incompleteness
of datasets and their accompanying information. Typically
what is published is a small subset of data produced in the
course of an experiment. The pressure for publication space
and the bias for publishing only the positive findings limit
data presentations to the most significant data whereas
seemingly minor, poorly understood or merely confirmatory
data are suppressed. Furthermore, the accompanying
descriptions of experimental details are often insufficient to
repeat or evaluate all aspects of the experiment without
additional information. As a result, some believe that the
scope of databases should be extended to include raw and
even unpublished data, which itself has given rise to
controversy among neuroscientists [5], [3].
The success of genomic and proteomic databases is partly
due to the active participation of data producers that have an
interest in seeing their data presented in a database, often as
a requirement for having their work published. In
neuroscience the situation has been different thus far. Many
researchers are reluctant to take the time and trouble to
share their data. They argue that the complexity of primary
data makes understanding them too difficult for anyone
other than the original producers. Others express their
concern over copyright issues and the loss of control over
their data. Gordon Shepherd at Yale has taken the initiative
to address this issue. The olfactory receptor database within
Shepherd’s SenseLab allows users to enter unpublished
sequences that are kept hidden from others. Instead, when a
search finds matches with unpublished data, the database
provides the searcher with the contact information for the
researcher who submitted the data. Researchers can thus
avoid revealing their primary data and yet benefit from
identifying potential collaborators [8], [9]. There are also
other arguments against the sharing of primary data,
which Koslow [7] addresses by stating that the scientific goal
of discovery outweighs the arguments for not sharing data.
He argues that the primary data will gain in value if it is put
into the public domain once it has been analyzed and
published. The combination of this new data with other data
and further analysis will lead to increased value, new
knowledge and understanding (Fig. 2).
Fig. 2: The value of data increases if shared in public
domain, yielding new knowledge and understanding [4].
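The SenseLab-style matching mechanism described above can be sketched in a few lines. The record structure and function names below are hypothetical, invented purely to illustrate the idea of finding matches without disclosing unpublished primary data.

```python
# Hypothetical sketch of SenseLab-style matching: unpublished entries are
# searchable, but a query against them returns only the submitter's
# contact details, never the hidden sequence itself.

RECORDS = [
    {"sequence": "MAYDRYVAIC", "published": True,
     "contact": "lab-a@example.edu"},
    {"sequence": "MAYDRYVAIC", "published": False,
     "contact": "lab-b@example.edu"},
]

def search(query):
    """Return published matches in full; for unpublished matches,
    reveal only the contact information of the submitting researcher."""
    results = []
    for rec in RECORDS:
        if query in rec["sequence"]:
            if rec["published"]:
                results.append({"sequence": rec["sequence"],
                                "contact": rec["contact"]})
            else:
                # Primary data stays hidden; a collaboration lead is returned.
                results.append({"sequence": None, "contact": rec["contact"]})
    return results
```

A searcher who hits the unpublished entry learns that a match exists and whom to contact, which preserves the data producer's control while still identifying potential collaborators.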
The controversial questions of what level of data sharing is
desirable and what types of data belong in these cohesive,
interoperable databases are also reflected in two opposing
views in relation to development
efforts. Some scientists believe it is more efficient to fund and
develop a wide variety of small databasing projects and then
build a link between those that prove to be most successful.
Others, on the other hand, believe that the problem of how to
form database federations has to be tackled by experts
working in dedicated informatics centers [6]. There are also
scientists and database architects who propose designing filters
to let users decide how to weigh the data. NeuroScholar, a
database on neural connectivity in the rat brain is an example
of this case. The users of this system can search for studies
based on attributes such as the journal in which they appeared or the
techniques the investigators used. They can then combine the
weighted data from several studies to assess the strength of
hypotheses they might wish to test [10].
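The weighting idea behind such filters can be made concrete with a small sketch. This is not NeuroScholar's actual interface; the attribute names and scoring scheme are illustrative assumptions only.

```python
# Illustrative sketch (not NeuroScholar's real API): a user assigns trust
# weights to study attributes, and weighted evidence from several studies
# is combined into one support score for a hypothesis.

def support_score(studies, weights):
    """Weighted average of each study's reported finding (+1 supports the
    hypothesis, -1 contradicts it), weighted by user-assigned trust."""
    total, norm = 0.0, 0.0
    for s in studies:
        # Multiply the weights of all attributes the user cares about;
        # unlisted attributes default to a neutral weight of 1.0.
        w = weights.get(s["technique"], 1.0) * weights.get(s["journal"], 1.0)
        total += w * s["finding"]
        norm += w
    return total / norm if norm else 0.0

studies = [
    {"journal": "J. Comp. Neurol.", "technique": "tract-tracing", "finding": +1},
    {"journal": "Brain Res.", "technique": "lesion", "finding": -1},
]
# This user trusts tract-tracing evidence three times as much as the default.
weights = {"tract-tracing": 3.0}
score = support_score(studies, weights)  # (3*1 + 1*(-1)) / 4 = 0.5
```

The same two studies yield a different score under a different weighting, which is exactly the point: the database stores the evidence, and each user decides how to weigh it.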
C. Data Representation
Data representation poses a central problem to most
databasing approaches. In contrast to gene and protein
sequences that can be represented linearly, the complexity of
neuroscientific data prevents the possibility of codifying all
data into one standard format. There are no universal and
simple codes for fundamental objects of neuroscience, such as
neuronal cell type and their activity patterns, cortical columns
and layers. As a result, the complexity of data representation in
neuroscience can range from simple viewing of text or tables
to dynamically-created graphical interfaces. Therefore,
analysis and processing steps are required to make data of
variable sources comparable. The lack of standardization is
particularly noticeable in neuroanatomy where the
nomenclature can easily create confusion [11]. For example,
the same brain region can go by the name of caudate nucleus
or nucleus caudatus, or be referred to as part of a larger
structure called the basal ganglia. This confusion leaves only the experts to
sort the data. Therefore, ontological systems that specify the
relationship between words and the operationally defined
concepts they represent are essential for databases. The US
National Library of Medicine's Unified Medical Language
System (UMLS) includes neuroanatomical nomenclature systems
designed to address this issue [11], [12]. The lack of
standardization in data representation, and thus incompatible
data formats, are roadblocks when transforming from one data
format to another, which imposes the requirement to develop
algorithms that can perform this transformation. The Objective
Relational Transformation (ORT) algorithm, as an example, is
used to transform neuroanatomical data from one mapping
scheme to another [13].
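The role such ontological systems play can be sketched minimally: synonymous terms resolve to one canonical concept, and part-of relations let queries climb to enclosing structures. The term lists below are illustrative assumptions, not drawn from UMLS itself.

```python
# Minimal sketch of ontology-based name normalization. The synonym and
# part-of tables here are toy examples, not actual UMLS content.

SYNONYMS = {
    "caudate nucleus": "caudate_nucleus",
    "nucleus caudatus": "caudate_nucleus",
}
PART_OF = {
    "caudate_nucleus": "basal_ganglia",
}

def canonical(term):
    """Resolve a free-text anatomical term to a canonical concept id."""
    return SYNONYMS.get(term.strip().lower())

def enclosing_structure(concept):
    """Return the larger structure a concept is part of, if recorded."""
    return PART_OF.get(concept)
```

With such a mapping in place, a query for "nucleus caudatus" and a query for "caudate nucleus" retrieve the same records, and a query scoped to the basal ganglia can be expanded to include its parts.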
D. Tools for Data Analysis
Having the capability to merely store and retrieve data may
be of little use if the user does not have access to analytical
tools to allow for interactive and comparative analyses
between the data. Therefore, it is essential to develop
sophisticated tools that extend and refine sorting, analysis, and
data integration. The heterogeneous nature of neuroscientific
data would require software packages for image processing,
compression, graphical interfaces, visualization, geometric
warping (spatial transformation so that subjects are
geometrically comparable), modeling, simulation, computation,
querying, and data mining, to name a few.
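The geometric warping mentioned above can be illustrated in its simplest form: a 2-D affine transformation that brings landmark coordinates from one subject into another subject's frame. Real warping tools use nonlinear 3-D deformations; this sketch only conveys the idea.

```python
import math

# Toy sketch of geometric warping: rotate, scale, and translate 2-D
# landmark coordinates so subjects become geometrically comparable.
# Real neuroimaging warps are nonlinear and 3-D.

def affine_warp(points, angle_deg, scale, tx, ty):
    """Rotate each (x, y) by angle_deg, scale uniformly, translate by (tx, ty)."""
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    warped = []
    for x, y in points:
        xr = scale * (cos_a * x - sin_a * y) + tx
        yr = scale * (sin_a * x + cos_a * y) + ty
        warped.append((xr, yr))
    return warped

landmarks = [(1.0, 0.0), (0.0, 1.0)]
# A 90-degree rotation, no scaling, and a shift of 2 units along x.
aligned = affine_warp(landmarks, 90.0, 1.0, 2.0, 0.0)
```

Once landmarks from different subjects sit in a common coordinate frame, cross-subject comparison and averaging become meaningful.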
III. SYSTEM IMPLEMENTATION
Notwithstanding the complexity and rapid development of
neuroscientific data, the existing database technologies can
still be used to carry out the goals of neuroinformatics.
Examples of standard database management products available
include Microsoft SQL Server, MySQL, and PostgreSQL on the
UNIX platform, as well as commercial systems that can be
licensed from Oracle, Sybase, Informix, IBM, and others. In
spite of the relative ease in obtaining the required technology,
several issues need to be addressed prior to implementation of
database projects in neuroscience. A potential limiting factor
in some database projects comes from the requirement for
efficient storage and processing of large datasets (e.g. image
files), flexible representation of heterogeneous data types,
and object-oriented rather than relational data representation [5].
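One common relational workaround for flexibly representing heterogeneous records is an entity-attribute-value layout. The schema below is a hypothetical sketch, using Python's built-in sqlite3 module, of how records as different as an imaging scan and an electrode recording can share one set of tables.

```python
import sqlite3

# Hypothetical entity-attribute-value sketch: rigid per-type columns are
# traded for a generic attribute table, so heterogeneous neuroscience
# records fit one relational schema. Table and field names are invented.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE record (id INTEGER PRIMARY KEY, kind TEXT)")
conn.execute("""CREATE TABLE attribute (
    record_id INTEGER REFERENCES record(id),
    name TEXT, value TEXT)""")

# An imaging record and an electrophysiology record share the schema.
conn.execute("INSERT INTO record VALUES (1, 'fMRI')")
conn.execute("INSERT INTO attribute VALUES (1, 'voxel_size_mm', '3.0')")
conn.execute("INSERT INTO record VALUES (2, 'multielectrode')")
conn.execute("INSERT INTO attribute VALUES (2, 'sampling_rate_hz', '20000')")

rows = conn.execute("""SELECT r.kind, a.name, a.value
                       FROM record r JOIN attribute a ON a.record_id = r.id
                       ORDER BY r.id""").fetchall()
```

The trade-off is real: queries become joins over a generic table, which is part of why some in the field argue for object-oriented rather than purely relational representations.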
The exchange of data across the Internet would be facilitated if
the data were already in an Internet-suitable format (e.g. the
XML standard). There also needs to be a standardized approach to
facilitate data conversions and communications between
databases. Some scientists recommend that a universal markup
language for neuroscience could efficiently address this
problem [14]. Other issues to be addressed include performance
and data processing. Due to bandwidth limitations, an increase
in the size of data and the number of database users can
dramatically affect performance. Therefore, strategies for
efficient data transfer and presentation are crucial. There are
also the response time and the server-client interaction that
may affect the general acceptance of databases. The tools
associated with databases may be required to perform
complicated analyses and intensive fail-safe computations;
thus, efficient data representation and optimization techniques
are a necessity [5].
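The idea of an Internet-suitable exchange format can be sketched with Python's standard xml.etree.ElementTree module. The element and attribute names below are invented for illustration; a shared neuroscience markup language of the kind proposed in [14] would standardize them.

```python
import xml.etree.ElementTree as ET

# Sketch of XML-based data exchange: a recording session serialized to
# text that any receiving database can parse back. All element names
# here are hypothetical, not part of any real standard.

session = ET.Element("session", species="rat", technique="multielectrode")
trial = ET.SubElement(session, "trial", id="1")
ET.SubElement(trial, "spike_count").text = "42"

xml_text = ET.tostring(session, encoding="unicode")

# The receiving side reconstructs the same structure from the text.
parsed = ET.fromstring(xml_text)
count = int(parsed.find("trial/spike_count").text)
```

Because the document is self-describing text, two databases that agree only on the markup vocabulary can exchange records without sharing an internal storage format.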
IV. MAINTENANCE & QUALITY CONTROL
As a database project grows, it involves more and more
people, leading to differentiation between database developers,
administrators, data collectors, and general users. It therefore
becomes necessary to keep track of data and to verify the
accuracy and completeness of entries and the appropriateness
of data representations. Data maintenance is especially
important to avoid the potentially costly reclassification of
data, particularly when it is desired to keep old entries from
becoming obsolete. Proper maintenance and suitable quality
control measures may also help in addressing some of the legal
and ethical issues involved in reposting published data. Much
of quality control involves active communication between
database administrators and the authors of a study. Common
practice involves authors entering most of the information
concerning their own study themselves, thereby reducing the
possibility of misinterpretation [15]. There are also other
processes, such as peer review, that may be helpful in quality
control if tailored to the special needs of databases. Some
argue that an independent review system needs to be
implemented if databases are to include preliminary and
unpublished data.
V. FUTURE DIRECTIONS
Despite the challenges ahead, neuroinformatics offers a
promising future for neuroscience. The future efforts of
experts in neuroinformatics and other collaborating fields will
likely incorporate ways to:
--Enhance the awareness of neuroscientists, computer
scientists, IT specialists, and other collaborating researchers
of the need to integrate the enormous amount of data into
cohesive, interoperable databases.
--Develop strategies and systematic ways of data
collection, representation, and analysis.
--Establish effective and suitable measures for
maintenance and quality control.
--Build a variety of integrated tools to complement the
databases.
VI. CONCLUSION
Neuroscience research is generating increasing amounts
of data that go far beyond the traditional means of analysis.
The complexity and variety of neuroscientific data, ranging
from physiological recordings to cellular and molecular
interactions to brain images, have driven scientists and
researchers toward developing and adopting new electronic
methods that can make more efficient use of the kinds of data
available. These methods take the form of neuroinformatics
tools. Once developed and proven effective, these tools will
become as essential to neuroscience research as they are to
genomic and proteomic research. There remain, however,
significant problems, both technical and sociological. The
challenge for the neuroscientific community is to remain
cooperative and to link its parallel efforts such that they work
in concert, each complementing the others.
REFERENCES
[1] T. Ideker, T. Galitski, L. Hood, "A New Approach to Decoding Life: Systems Biology", Annual Review of Genomics and Human Genetics, vol. 2, pp. 342-372, Sept. 2001.
[2] H. Kitano, "Looking beyond the details: a rise in system-oriented approaches in genetics and molecular biology", Current Genetics, vol. 41, no. 1, pp. 1-10, Apr. 2002.
[3] G.M. Shepherd, et al., "The Human Brain Project: neuroinformatics tools for integrating, searching and modeling multidisciplinary neuroscience data", Trends in Neurosciences, vol. 21, no. 11, pp. 460-468, Nov. 1998.
[4] S. Koslow, "Sharing primary data: a threat or asset to discovery?", Nature Reviews Neuroscience, vol. 3, pp. 311-313, Apr. 2002.
[5] R. Kötter, "Neuroscience databases: tools for exploring brain structure-function relationships", Philosophical Transactions: Biological Sciences, vol. 356, no. 1412, pp. 1111-1120, Aug. 2001.
[6] M. Chicurel, "Databasing the brain", Nature, vol. 406, pp. 822-825, Aug. 2000.
[7] S. Koslow, "Should the neuroscience community make a paradigm shift to sharing primary data?", Nature Neuroscience, vol. 3, no. 9, pp. 863-865, Sept. 2000.
[8] P.L. Miller, et al., "Integration of Multidisciplinary Sensory Data: A Pilot Model of the Human Brain Project Approach", Journal of the American Medical Informatics Association, vol. 8, no. 1, Feb. 2001.
[9] M. Hines, "ModelDB: A Database to Support Computational Neuroscience", Journal of Computational Neuroscience, vol. 17, pp. 7-11, Aug. 2004.
[10] G.A. Burns, et al., "Tools and approaches for the construction of knowledge models from the neuroscientific literature", Neuroinformatics, vol. 1, no. 1, pp. 81-110, Spring 2003.
[11] A.W. Toga, "Neuroimage databases: the good, the bad and the ugly", Nature Reviews Neuroscience, vol. 3, pp. 302-309, Apr. 2002.
[12] B.L. Humphreys, et al., "The Unified Medical Language System: An Informatics Research Collaboration", Journal of the American Medical Informatics Association, vol. 5, no. 1, pp. 1-11, Feb. 1998.
[13] K.E. Stephan, K. Zilles, R. Kötter, "Coordinate-independent mapping of structural and functional data by objective relational transformation (ORT)", Philosophical Transactions: Biological Sciences, vol. 355, pp. 37-54, Jan. 2000.
[14] M. Martone, A. Gupta, M.H. Ellisman, "e-Neuroscience: Challenges and triumphs in integrating distributed data from molecules to brains", Nature Neuroscience, vol. 7, no. 5, pp. 467-472, May 2004.
[15] J.D. Van Horn, M.S. Gazzaniga, "Databasing fMRI studies - towards a 'discovery science' of brain function", Nature Reviews Neuroscience, vol. 3, pp. 314-318, Apr. 2002.