A SCRIPT FOR ARCHIVING DIGITAL RESEARCH DATA: IMPROVING ACCURACY AND EFFICIENCY IN THE DATAVERSE NETWORK

Rachel Carriere, Thu-Mai Christian, Erin Crane, & Cheryl Thompson | School of Information & Library Science | University of North Carolina at Chapel Hill

ABSTRACT

The H. W. Odum Institute for Research in Social Science at the University of North Carolina at Chapel Hill collects and preserves digital social science research data and makes it publicly available online for discovery and secondary analysis via the Dataverse Network (DVN). The current data ingest workflow requires many tasks across several software programs to correct truncated data variable labels, which result from the 255-character limit in statistical software packages. Because of this inherent limitation in the statistical software, ingesting data into the archive is often a complex process, and it introduces a single point of failure that can result in data corruption. An examination of the data ingest workflow presents an opportunity to eliminate this single point of failure by introducing a newly developed script that automates the correction of truncated data variable labels, thus preserving the complete archival record.

RESULTS

To avoid the risks and the single point of failure in the data ingest workflow, a Python runtime script was developed to eliminate direct user interaction with the DVN database. Instead, the script runs background processes that locate the appropriate record, read the TXT file containing the complete data variable labels, and communicate with the DVN database to correct any truncated labels. The burden on the archivist is reduced, and records in the DVN are accurate and complete.

METHOD

This poster reports findings from an analysis of the DVN data ingest workflow and presents one solution for improving the efficiency and accuracy of data ingest. Several observation and interview sessions were conducted to study the tasks and tools involved in the current workflow. Models illustrating the workflow and tools were developed to help identify points of failure and opportunities for improvement. The model below highlights deficiencies in the current data ingest workflow.

GOALS
TRIGGER: RECEIVE DATA SUBMISSION PACKAGE
Transform submission package into archival package
Store data submission package
Edit the data file for completeness and accuracy
Convert submission files to archival preservation formats (e.g., .pdfa, .por)
Create catalog record and upload files into Dataverse Network (DVN)
Decide whether data file requires editing
Create text file for data edits
Create SQL code for data edits
Apply edits to data file in DVN
Verify edits were performed
Make the archival package available
Publicly release archival package in DVN

Labels from the workflow model: REPLACED BY SCRIPT; possible single point of failure if edits are applied to the wrong data in DVN.

The increasing use of and dependency on digital research data have prompted funding agencies to issue mandates requiring researchers to develop a data management plan that includes details about data access, distribution, and archiving. Like other research universities, the University of North Carolina, through its Provost, has assembled a task force to develop recommendations on the stewardship of digital research data. As a result, the digital data archive has attracted considerable interest. The Dataverse Network platform offers a solution to social science data management and preservation needs; however, a script that addresses an inherent challenge confronting archivists is necessary to increase the functionality of the DVN and the usefulness of its records.

THE SCRIPT

1. The archivist uploads the data files to the Dataverse Network (DVN) and notes the automatically generated Universal Numerical Fingerprint (UNF).
2. The archivist starts the Python runtime script, which prompts the archivist to enter the UNF and the file path to the TXT data variable label file.
3. The Python script communicates with the DVN PostgreSQL database engine to identify the appropriate record and overwrite truncated data variable label strings with the correct strings.
4. For quality control and documentation, the script displays to the archivist the data variables that were modified.
5. The data variable labels are complete in the DVN, which enables discovery and proper analysis of the data.

SUMMARY

Scripting offers the power to customize archival platforms and technologies to meet the needs of today’s digital archival collections, archivists, and the research professionals who depend on them.

NEXT STEPS

• Convert the Python script to a Java GUI application to improve ease of use
• Integrate the Java (JSP) application into the DVN web interface for data submissions
• Test the script with researchers and data producers to understand how the data ingest process could be integrated into the research life cycle

Acknowledgements | Thanks to Jonathan Crabtree, Assistant Director of Archive and Information Technology, Odum Institute; Dr. Stephanie Haas, Systems Analysis Professor; & Freeman Lo, Applications Analyst
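The script's numbered steps (locate the record by UNF, read the TXT label file, overwrite truncated labels, report what changed) can be illustrated with a minimal sketch. This is not the authors' script. It assumes a hypothetical schema (tables datatable and datavariable with the columns shown), a hypothetical label file layout of one "name<TAB>label" pair per line, and it uses Python's built-in sqlite3 module as a stand-in for the DVN PostgreSQL database so the example runs without a server. The rule of replacing a label only when the stored text is a prefix of the full label is also an assumption, added here as a guard against editing the wrong record.

```python
import sqlite3
from pathlib import Path


def read_labels(label_path):
    """Parse 'name<TAB>label' lines into a dict (assumed file layout)."""
    labels = {}
    for line in Path(label_path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, _, label = line.partition("\t")
        labels[name.strip()] = label.strip()
    return labels


def apply_full_labels(conn, unf, label_path):
    """Overwrite truncated labels for the data table identified by `unf`.

    Table and column names are hypothetical, not the real DVN schema.
    Returns a list of (name, old_label, new_label) for each change.
    """
    row = conn.execute(
        "SELECT id FROM datatable WHERE unf = ?", (unf,)
    ).fetchone()
    if row is None:
        raise LookupError(f"no data table with UNF {unf!r}")
    table_id = row[0]

    changed = []
    with conn:  # commits on success, rolls back if an exception is raised
        for name, full in read_labels(label_path).items():
            found = conn.execute(
                "SELECT id, label FROM datavariable "
                "WHERE datatable_id = ? AND name = ?",
                (table_id, name),
            ).fetchone()
            if found is None:
                continue  # variable not present in this data table
            var_id, old = found
            # Assumed safety rule: only replace a stored label that is a
            # strict prefix of the full label, i.e. it looks truncated.
            if old is not None and old != full and full.startswith(old):
                conn.execute(
                    "UPDATE datavariable SET label = ? WHERE id = ?",
                    (full, var_id),
                )
                changed.append((name, old, full))
    return changed
```

The returned list corresponds to step 4: it can be printed for the archivist as a record of which variables were modified. Raising an error when no record matches the UNF, rather than proceeding, reflects the poster's concern about applying edits to the wrong data.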
