Scientific Database Approaches
John H. Porter, University of Virginia & Kristin Vanderbilt, University of New Mexico

Road Map
• Why have Scientific Databases?
• Challenges for Scientific Databases
• Approaches to Scientific Databases
• Strategies for Initiating Ecological Databases

WHY have Scientific Databases?
Improvement of data quality
• multiple users provide multiple opportunities for detecting and correcting problems in data
Cost
• data costs less to save than to collect again
• with environmental data, often data cannot be collected again at any cost

WHY have Scientific Databases?
Environmental Policy and Management
• environmental policy decisions require data that are regional or national, but most ecological data is collected at smaller scales
• National Policies
• International Policies

WHY have Scientific Databases?
New Science
• Long Term – long-term studies depend on databases to retain project history
• Synthesis – use of data for a purpose other than that for which it was collected
• Integrated, multidisciplinary projects – depend on databases to facilitate sharing of data

Evolution of Data Sharing - Traditional Model
[Diagram labels: Data Collection; Use Data; Lose or Discard Data; Publications]

Evolution of Data Sharing - New Model
[Diagram labels: Data Collection; Use Data; Data and Metadata; Publications]
• Regional Analyses
• Global Change
• Long-term Studies
• Synthesis

Challenges for Scientific Databases
Long-term perspective
• without databases, most data do not outlive the project that collected them
The 20-year rule
• GOAL: data that is accessible and interpretable 20 years in the future

Meeting Long Term Needs
• TECHNOLOGICAL – media & formats that do not become obsolete
• CONTEXTUAL – need to capture context of data collection
• SEMANTIC – terms need to be well-defined

Challenges for Scientific Databases
Deal with Diversity
• science means asking NEW questions – new kinds of queries
• scientific data is heterogeneous and diverse
• scientific users have different backgrounds and goals
• the user community
for a given database will be dynamic

Characteristics of Ecological Data
[Figure: kinds of data plotted by Data Volume (per dataset), Low to High, against Complexity/Metadata Requirements, Low to High. Labels: Satellite Images; Weather Stations; Business Data; Most Software; Gene Sequences; GIS; Most Ecological Data; Primary Productivity; Biodiversity Surveys; Population Data; Soil Cores]

Comparison to Business Databases
Business-oriented databases have been very different from scientific databases
• Relatively small number of well-defined data elements
  – E.g., Part number, count, price
• Repeatable reports (e.g., sales report)
• Rules for integrating data well understood
• Intolerant of different values associated with an element
  – E.g., hourly rate of pay

Ecoinformatics Development: Alignment with IT community
[Diagram labels: Information Technology; Ecoinformatics]
Reason: IT focused on proprietary business applications
(modified from James Brunt)

Changing Times
New emphases on “data mining” are forcing business databases to become more like scientific databases
• Example: data on customer demographics are linked to regional store inventories
• Integration of data resources not designed with integration in mind

Ecoinformatics Development: Alignment with IT community
[Diagram labels: XML, Web Services, Semantic Mediation; IT; Ecoinformatics]
Reason: IT now focuses on domain-neutral access to distributed data products.
(Modified from James Brunt)

The Ecoinformatics Challenge:
Can we make information available to ecologists:
• In ways they can locate the information they need?
• With information in forms they can readily use?
How can we assure that the information is current and accurate?

Not all Scientific Databases are Alike!
Scientific data are available at a number of different “levels”
• LOW: individual investigator posts data on web page for students to retrieve
• MEDIUM: Online databases for supporting a project
• HIGH: system automatically integrates data from a large number of sources

Different types of Scientific Databases
[Diagram labels: “Portal”, “Value-Added” or “Integrated” Infobases; Researchers; International/National/Regional Systems; Project or Site-Based Systems; Individual datasets]

Tools for Creating Scientific Databases
Web Server – HTML, XML
• IIS
• Apache – open source
Database Management Systems (DBMS)
• Input, query, update, sort, output
Statistical Packages
• Aggregate, graph
Programming Languages
• C++, JAVA, PERL, Python, Visual Basic, PHP
• Create Custom code

Tools for Scientific Database Development
Relational Database Management Systems – RDBMS in common use
• Access/Microsoft SQL Server
• Oracle
• MySQL – open source
Statistical Packages
• SAS
• SPSS
• R – open source

Spreadsheets
Spreadsheets are fantastic tools – but not for scientific databases!
• Encourage “bad practice” – irregular data structures that can’t be parsed easily
• Lack “auditability” – difficult or impossible to back-track calculations
• Proprietary formats become obsolete
• Lack export capabilities for other than values or graphs (no formulae)

Not Every Scientific DB needs or uses the same tools
Example 1 – Basic Data Access
• Post comma-delimited files on web server
• Metadata files – XML text files (structured) or unstructured
Example 2 – Add Products
• Use SAS to conduct error-checking and generate graphics from data
• Use scripts/programs to automate production process

Possible Systems
Example 3 – Manage Metadata in DBMS
• Metadata in Access Database
• Provide comma-delimited data files
Example 4 – Manage Metadata on Web
• Link web forms to backend DBMS
Example 5 – Full DBMS system
• Metadata in DBMS
• Data dynamically queried from DBMS using web interface

Level of Structure
Unstructured Data/Metadata
• Easy to produce
• Hard to use
Structured Data/Metadata
• Harder to produce
• Easy to export, alter, update
• The specific tool used to structure data (e.g., XML, DBMS) is increasingly less critical than the structure itself

Evolving a Database
Development of a database is an evolutionary process
Implement system based on current priorities - but think ahead!
Seek scalable solutions
• avoid bottlenecks
• adding the 1000th piece of data should be as easy as adding the first (or easier)

Developing a Database
Questions to Ask
• Why is this database NEEDED?
• Who will be the USERS of the database?
• What types of QUESTIONS should the database be able to answer?
• What INCENTIVES will be available for data providers?
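
Illustration: metadata in a DBMS, data as comma-delimited text
The "Manage Metadata in DBMS" example and the point that structured metadata is easy to export can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: sqlite3 stands in for Access or any other DBMS, and the table name, column names, station code, readings, and XML element names are all hypothetical (they are not Ecological Metadata Language).

```python
import csv
import io
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical example data: two readings as comma-delimited text.
CSV_TEXT = "station,date,air_temp_c\nSTN1,2004-06-01,21.5\nSTN1,2004-06-02,19.0\n"


def build_metadata_db():
    """Create an in-memory metadata table (sqlite3 stands in for any DBMS)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE attribute (dataset TEXT, name TEXT, unit TEXT, definition TEXT)"
    )
    rows = [
        ("weather", "station", "", "Station code"),
        ("weather", "date", "YYYY-MM-DD", "Date of observation"),
        ("weather", "air_temp_c", "degrees Celsius", "Daily mean air temperature"),
    ]
    conn.executemany("INSERT INTO attribute VALUES (?, ?, ?, ?)", rows)
    return conn


def metadata_to_xml(conn, dataset):
    """Export one dataset's attribute descriptions as a small XML document."""
    root = ET.Element("dataset", name=dataset)
    query = (
        "SELECT name, unit, definition FROM attribute "
        "WHERE dataset = ? ORDER BY rowid"
    )
    for name, unit, definition in conn.execute(query, (dataset,)):
        attr = ET.SubElement(root, "attribute", name=name, unit=unit)
        attr.text = definition
    return ET.tostring(root, encoding="unicode")


def undocumented_columns(conn, dataset, csv_text):
    """Return the CSV column names that have no metadata entry."""
    header = next(csv.reader(io.StringIO(csv_text)))
    known = {
        row[0]
        for row in conn.execute(
            "SELECT name FROM attribute WHERE dataset = ?", (dataset,)
        )
    }
    return [col for col in header if col not in known]


if __name__ == "__main__":
    conn = build_metadata_db()
    print(metadata_to_xml(conn, "weather"))
    print(undocumented_columns(conn, "weather", CSV_TEXT))
```

Because the metadata is structured, the same table can feed an XML export and a check that every column in the data file is documented; adding the 1000th attribute is the same INSERT as adding the first.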

Meeting the Challenges
Prioritize
• focus on developing the most critical data resources
• most commonly, critical data refer to the research site as a whole
  – Meteorology & Climatology
  – Bibliography of past research at the station
  – GIS data layers for the station research area

Meeting the Challenges
Get additional resources
• NSF Grants
• Upcoming NSF initiatives:
  – SEI+II – interdisciplinary research
  – National Ecological Observatory Network (NEON)
• Institutional Support

Meeting the Challenges
Work with researchers and enlist their help in developing ecological databases
• Develop policies for data collection and sharing that dictate the responsibilities of:
  – The data provider/producer
  – The data system
  – Users of the data

Use Standard Methods when Possible
Advantages of using standard methods
• Increases intercomparability (and hence, value) of data, facilitating cross-site comparisons
• Reduces cost of methods development

Standards
Costs of using standards
• Standard methods may be poorly suited to local conditions
• Developing standards is time consuming and difficult
For some types of monitoring, standards may not exist, or may do a poor job characterizing desired parameters

Standards
“The wonderful thing about standards is that there are so many of them to choose from”
Sources of Standards
• Published literature
• Government Agencies (e.g., USGS, EPA)
• Project standards (e.g., LTER Climate Stations)
• Resource Discovery Initiative for Field Stations (RDIFS) directory (under development)

Information Systems
Developing an information system is a critical component of research
• You can’t exploit data you no longer have!
Creating good “metadata” (data about data) is crucial to maintaining data usability over time

Exploit Partnerships & Existing Resources
OBFS Resource Discovery Initiative for Field Stations (RDIFS)
• Ecoinformatics Training
• Publications Database
• Registry for field station data (free advertising!)
• Database of standards
• Keyword Thesaurus

Ecoinformatics.org/ Knowledge Network for Biocomplexity Project
• Ecological Metadata Language
• Tools

Ecological Metadata Language (EML)

Other Possible Collaborations
ORNL Mercury System
• Cataloging and metadata tools with the data and metadata left on your system
Global Change Master Directory
• online system for metadata with searching capabilities
OPeNDAP.org
• Online tools for oceanographic data

Exploiting External Resources
Ecological Society of America journal Ecological Archives
• accepts “data papers” for major and important data sets

Concluding Thoughts
Developing ecological information systems seems a daunting task
• Every system starts somewhere. Even oaks start with acorns!
• Once started, you can build on successes, a little at a time
• Remember, the compound interest on zero is zero!

Next Step
Experience is a good guide to helping build the sort of database your users will want to use
It’s good to try out the existing systems to see what works (and what doesn’t) as a user