DDI Across the Life Cycle: One Data Model, Many Products
Inter-university Consortium for Political and Social Research (ICPSR) and Survey Research Operations (SRO)
IASSIST Meeting, Tampere, Finland, May 29, 2009

Presenters
• Mary Vardigan, Assistant Director, ICPSR
• Sanda Ionescu, Documentation Specialist, ICPSR
• Sue Ellen Hansen, Director, SRO Technical Systems Group
• Felicia LeClere, Associate Research Scientist, ICPSR
• Peter Granda, Archivist, ICPSR

The Collaborators
• Both are units of the Institute for Social Research, University of Michigan
– ICPSR is a large social science data archive
– SRO is a data collection center

Past Collaborations
• Working together on the National Survey of Family Growth, sponsored by NCHS, to create data and an interactive codebook
• Partnered on the Collaborative Psychiatric Epidemiology Surveys, sponsored by NIMH
– This involved a harmonization of three datasets and interactive documentation featuring question comparison and five languages
– www.icpsr.umich.edu/CPES

Rationale for Collaboration
• We share a need for rich, high-quality metadata
• We want to comply with metadata standards, in particular the Data Documentation Initiative (DDI)
• DDI 3 enables life cycle perspective
• We need to pass data easily from SRO to ICPSR without information loss

SRO-ICPSR Joint Project
• Shared DDI-compliant data model and database design for survey metadata
• Challenges:
– Different computing platforms
– Different end products
– Different staff orientations

[Diagram: shared system architecture, with components labeled by Tasks A through D. Layers: <Metadata & Data>, <Transformations>, <Data Storage>, <Application Logic>. Inputs: Blaise Datamodel (BMI), Blaise Database (BDB), other file types (e.g., SAS, SPSS), DDI 2 or 3 file; the user specifies files (location, file type, etc.) using an application. Tools: SRO Blaise Parsing Tool, ICPSR Import Tool, other importing tool. Storage: client relational databases (offline SQL Server Express), SRO relational database (online/networked SQL Server), ICPSR relational database (online/networked Oracle). Applications: offline/local stand-alone client application, client application with data sync (online or offline), web server (<XML/WSDL>), SRO/ICPSR/other web client. Functions: edit/review metadata; display metadata; export codebook, questionnaire, and data.]
ICPSR web client:
• Variable Search
• Internal Variable Browser
• NSFG Data Management

Products and Benefits
SRO
• Tools to enhance MQDS, which produces XML documentation from Blaise instruments
• Tool to permit external users to add metadata for NSFG
ICPSR
• Variable-level database that permits users to search across the ICPSR collection; compare variables; create new datasets and questionnaires
• Internal variable search for harmonization

Data Life Cycle Coverage

Michigan Questionnaire Documentation System (MQDS)
Sue Ellen Hansen
Nicole Kirgis

What Does MQDS Do?
• Facilitates automated documentation and harmonization of Blaise survey instruments and datasets
– Extracts survey question metadata
– Standardized format

Survey Question Metadata
• Question universe
• Variable name and label
• Question text
• Question variable text (fills)
• Data type
• Code values and code text
• Skip instructions
• etc.
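
The metadata items listed above can be pictured as one record per survey question. The following is a minimal Python sketch under stated assumptions: the class name, the field names, and the example question are all hypothetical and are not taken from MQDS, Blaise, or the DDI schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class QuestionMetadata:
    """One survey question's metadata. Field names are illustrative only."""
    variable_name: str
    variable_label: str
    question_text: str
    data_type: str
    universe: Optional[str] = None                       # who is asked the question
    fills: List[str] = field(default_factory=list)       # question variable text
    codes: Dict[str, str] = field(default_factory=dict)  # code value -> code text
    skip_instructions: Optional[str] = None


def render_question(q: QuestionMetadata) -> str:
    """Format one record as a plain-text codebook entry."""
    lines = [f"{q.variable_name}: {q.variable_label}", f"  Q: {q.question_text}"]
    if q.universe:
        lines.append(f"  Universe: {q.universe}")
    for value, text in q.codes.items():
        lines.append(f"  {value} = {text}")
    if q.skip_instructions:
        lines.append(f"  Skip: {q.skip_instructions}")
    return "\n".join(lines)


# Hypothetical example question, invented for illustration.
example = QuestionMetadata(
    variable_name="EVERMARR",
    variable_label="Ever married",
    question_text="Have you ever been married?",
    data_type="enumerated",
    universe="Respondents aged 18 or older",
    codes={"1": "Yes", "5": "No"},
    skip_instructions="If 5, go to next section",
)
print(render_question(example))
```

Holding every item in one structured record is what allows the same metadata to be rendered as a codebook entry, a questionnaire page, or an XML element without re-extracting it from the instrument.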
Data Documentation Initiative (DDI)
• Standard specification for technical documentation of social science data
• eXtensible Markup Language (XML)
– Widely used
– Facilitates sharing of data
• Initial focus on standard dataset codebook
• Ongoing development
http://www.ddialliance.org/

MQDS Version 1
• Extracted metadata from Blaise data model as XML tagged data
• Provided user interface for selection of
– Blaise files
– Instrument questions and sections
– Types of metadata to extract
– Languages to display
– Style sheet for generation of instrument documentation or codebook

Using MQDS V1 XML: Codebook in Five Languages
National Latino and Asian American Study
www.icpsr.umich.edu/CPES

MQDS Version 1
• Limitations
– XML not DDI-compliant
  • DDI Version 2 did not have XML tags for all metadata provided by Blaise
  • Did not provide easy means of adding XML tags without becoming noncompliant
– XML files for complex surveys can be very large (text files)
  • Entire files had to be processed in computer memory
  • Limited ability to fully automate documentation

DDI Version 3
• Released April 2008
• Focus on complete data lifecycle, going beyond the codebook

DDI Version 3
• Included extensions proposed by DDI working group on instrument design

Persistent Content of Question / Use of Question in Instrument
Question text
• Static
• Dynamic or variable
Order and routing
• Sequence / skip patterns
• Loops
Multiple-part question
Universe
Response domain
• Open
• Set categories
• Special types (date, time, etc.)
Analysis unit
Definitional text
Instructions

MQDS Version 3
• Joint SRC and ICPSR venture
• Goals:
– Address version 2 limitations
  • Process Blaise instrument of any size
– Exploit new elements and validate to the recently released DDI version 3 standard
– Move from processing XML metadata in memory to streaming metadata to a relational database

MQDS Version 3 Relational Database: Import, Export, Transform
[Diagram: relational database (SQL Server / SQL Server Express) with three numbered steps. 1. Import: user specifies input files (location, file type, etc.). 2. Export: user specifies output files (location, language/locale, XML output options, etc.). 3. Transform: user specifies stylesheet selection criteria, type of output desired (html, rtf, pdf), etc. Other labels: Blaise Datamodel (BMI); Blaise Database (BDB); other file types (e.g., SAS, SPSS); database connection settings; DDI 3 elements not in *.bmi; XML (DDI 3); Questionnaire; Codebook.]

MQDS Version 3
• Relational database
– DDI compliant standardized tables
– Flexibility for SRC and ICPSR to add extensions that meet their specific organizational needs
– Allows
  • Automated documentation of any Blaise survey instrument
  • Importing and documenting data produced by other software
  • Lower cost development of other tools that facilitate editing and disseminating data

MQDS V3 Prototype: Exporting Language XML

MQDS Development
• Expect to release Summer 2009
• Working out a distribution plan for Blaise users

Data Life Cycle Coverage

Applications: Customized Editing Tool
Peter Granda, ICPSR

MQDS Version 3
• Relational database
– DDI compliant standardized tables
– Flexibility for SRC and ICPSR to add extensions that meet their specific organizational needs
– Allows: Development of new tools to deal with the practical problems involved in transforming data and documentation derived from Blaise instruments into public-use products

Features of the Tool
• Loads MQDS output into database tables
• Web interface to permit quick viewing
• Application that permits both internal and
external clients to access and edit variable-level information
• Ability to include disposition codes to designate which variables to include in public-use files
• Maintain permanent record of decisions made throughout the editing process

Select variable to edit from database populated with metadata from MQDS, with possible revisions from subsequent data processing steps
[Screenshot fields: Variable Name; Variable Label; Value Labels; Question Text; Universe Statements; List of Standard Formats]

Variable disposition:
• Place in public-use file
• Place in restricted-use file
• Leave in original file created by the data producer

Data Life Cycle Coverage

Social Science Variables Database: The Public Search
Sanda Ionescu, ICPSR

SSVD – The Public Search
• ICPSR variables search
– Internal (staff, other authorized users)
– External (public)

SSVD – The Public Search
• Enables ICPSR users to search variables across datasets
• Assists in data discovery, comparison, harvesting, and analysis
• Useful in question mining for designing new research

SSVD – The Public Search
• Concept first tested in a pilot project completed in 2005
– Good functionality
– Demonstrated benefits of using DDI markup: easy import; complex, granular searches; user-friendly display
– Limited number of data sets (69 ICPSR studies included)

SSVD – The Public Search
• Expand the project to ultimately include most of ICPSR’s holdings
– Generate DDI documentation for most ICPSR studies
  • Need for automated production
– Build a solid, state-of-the-art, DDI compliant database
  • Handle large number of files
  • Support multiple applications

SSVD – The Public Search
• The Hermes batch processing system*
[Diagram labels: ASCII data file; SPSS system / portable file (mandatory); statistical setups: SPSS, SAS, Stata; ready-to-go data files: SAS transport, SPSS portable, Stata system; question text file in fixed format (optional); DDI 2.1 variable-level documentation with frequencies [and question text (optional)]; (part of) PDF codebook]
*This is a simplified
diagram

SSVD – The Public Search
• Hermes:
– Consistent, reliable source of variables descriptions in DDI
– DDI documentation limited to content of input files
  • Labels may be truncated or may contain abbreviations
  • Question text may be missing although available in original documentation

SSVD – The Public Search
• Additional quality standards necessary for DDI documentation, to maximize effectiveness of Public Search:
– Presence of question text, whenever available
– Increased readability of variable/value labels, especially if question text is not present

SSVD – The Public Search
• Not all ICPSR studies qualify for variable-level searches
• Criteria for selecting studies; not included:
– Aggregate/statistical data (e.g., Census data, Data Books, Roll Call records, etc.)
– Poor documentation
– Some restricted data

SSVD – The Public Search
• Pre-SSVD upload:
– Review of DDI output from Hermes to apply content quality standards and study selection criteria
– Additional work to upgrade DDI where necessary (and feasible)
  • Add question text
  • Complete truncated text
  • Improve readability of labels
  • Add frequencies

SSVD – The Public Search
• Preparing studies for SSVD:
– Started end of 2006
– Included DDI produced for previous projects
– Reviewed all variable-level DDI created at ICPSR, November 2006 to present (new releases and updates)

SSVD – The Public Search
• New database finalized Fall 2008
• Built to match DDI 3.0 data model
• Both DDI 2.x and DDI 3.0 compliant
– Designed to accept both DDI 2.x and 3.0 input and produce output in both versions
• ICPSR version currently uploads DDI 2.1 and generates DDI 3.0 individual variables descriptions.

SSVD – The Public Search
• First batch of variable-level description files uploaded into SSVD:
– Approx. 3,500 DDI files (one file per dataset), representing
  • Approx. 1,300 ICPSR studies (approx. 18.5 percent of total ICPSR holdings, excluding US Census; approx.
30 percent of holdings with data and setups)
– Over 1,000,000 individual variable descriptions; 23,000,000 categories

SSVD – The Public Search
• Currently in beta-testing phase.
– Email bugs to [email protected]
• Uses Oracle Text.
http://www.icpsr.umich.edu/ICPSR/ssvd/index.html

SSVD – The Public Search
Moving forward…
• Fall 2009: switch to Solr searches (based on Lucene)
– Faster
– More sophisticated: results filtered by multiple relevant parameters
• Enable side-by-side/same page display of selected variables for comparison
• Enable variable search from individual study page (search within study)

SSVD – The Public Search
Moving forward…
• Adding content:
– Second batch of DDI files ready to upload:
  • 900 DDI files, representing 500-600 studies (will bring total close to 45 percent of ICPSR studies with data and setups)
– Initiate retrofit project to examine older studies that were not covered in the first conversion phase

SSVD – The Public Search
Moving forward…
• Transition to automated DDI upload
– DDI uploaded at the time of study publication
– First quality check performed by study processing staff
– Acceptable DDI immediately released for public view
– Problematic DDI suppressed from public view for further review, and upgrade as appropriate

Data Life Cycle Coverage

Applications: Internal Variable Search and Documentation
Felicia LeClere, ICPSR

The Integrated Fertility Survey Series
• 5-year grant from NICHD to harmonize data from 10 large surveys of marriage, fertility, and child-bearing in the United States
• 10 surveys, beginning in 1955 and running through 2002

Problem of Harmonization
• In order to make decisions about harmonizing across all files, we need:
– Question text
– Value labels and categories
• Be able to find and export metadata from all 10 files at the variable level
• Be able to document each variable, recode, and variable choice

Tools from Variables Database
• Need to be able to do nested searches that are documented
• Need to be able to search all fields
individually and in sequence
• Need to be able to download results and document what search terms were used

ICPSR SSVD Internal Search
• All 10 data sets were loaded in ICPSR’s version of the shared database
• Designed to capture all of the relevant fields that were marked up in DDI

[Screenshots: Entry screen for internal search; Search results screen; Excel download from search; Can also download value labels and codes]

Search Utilities
• Downloaded search fields serve to:
– 1. Identify variables to be harmonized
– 2. Provide metadata for “translation tables” which are used to harmonize files

Harmonization steps
• Use search results to populate two intermediate steps toward re-forming the data set
• Exploratory comparative tables
» Use this comparative table to make decisions about harmonization by examining universes, question texts, and response categories
• Translation tables
» These tables are designed to provide instructions on recoding the underlying items from the 10 surveys to a single harmonized item. The table provides instructions to an automated SAS program that recodes items from 10 surveys.

[Screenshots: Comparative table – date of birth; Translation Table for place of birth]

Harmonization steps
• After the translation table is complete, the recode instructions for all 10 files are built into the SAS file and a new data file is created.
• The underlying metadata provided by the database allow us to (1) search all 10 files, (2) explore comparability, and (3) recode to new variables
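
The translation-table step can be illustrated with a small sketch. The project described here drives the recode from a SAS program; the Python below is only a stand-in that shows the idea, and every survey name, variable name, code value, and function name in it is hypothetical. It assumes one translation row per (survey, source variable, source code) and at most one source variable per survey for a given harmonized item.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical translation table for one harmonized item.
# Each row says: in this survey, this source variable's code maps to this harmonized code.
TRANSLATION_ROWS: List[Dict[str, str]] = [
    {"survey": "SURVEY_A", "source_var": "BORNUS", "source_code": "1", "target_code": "1"},
    {"survey": "SURVEY_A", "source_var": "BORNUS", "source_code": "2", "target_code": "2"},
    {"survey": "SURVEY_B", "source_var": "BPLACE", "source_code": "0", "target_code": "2"},
    {"survey": "SURVEY_B", "source_var": "BPLACE", "source_code": "1", "target_code": "1"},
]

Lookup = Dict[Tuple[str, str], Dict[str, str]]


def build_lookup(rows: List[Dict[str, str]]) -> Lookup:
    """Index rows by (survey, source variable) so each record needs one dictionary hit."""
    lookup: Lookup = {}
    for row in rows:
        key = (row["survey"], row["source_var"])
        lookup.setdefault(key, {})[row["source_code"]] = row["target_code"]
    return lookup


def harmonize(survey: str, record: Dict[str, str], lookup: Lookup,
              target_var: str) -> Dict[str, Optional[str]]:
    """Recode one respondent record to the harmonized variable.

    Codes missing from the table come back as None so they can be reviewed
    rather than silently passed through.
    """
    for (tbl_survey, source_var), code_map in lookup.items():
        if tbl_survey == survey and source_var in record:
            return {target_var: code_map.get(record[source_var])}
    return {target_var: None}


lookup = build_lookup(TRANSLATION_ROWS)
print(harmonize("SURVEY_A", {"BORNUS": "2"}, lookup, "H_BORNUS"))
print(harmonize("SURVEY_B", {"BPLACE": "0"}, lookup, "H_BORNUS"))
```

Keeping the mapping in a table rather than in hand-written recode statements means the same table doubles as documentation of every recode decision, which is the role the slides assign to it.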