Integrating Data Mining and Data Management Technologies for Scholarly Inquiry
Ray R. Larson, University of California, Berkeley · Paul Watry, University of Liverpool · Richard Marciano, University of Maryland

Thank you! Special thanks to John Harrison & Jerome Fuselier (Liverpool), Chien-Yi Hou (UNC), Shreyas & Luis Aguilar (UCB).
Cheshire3 available via https://github.com/cheshire3 · iRODS available via https://www.irods.org · Project web site: http://diggingintodata.web.unc.edu

Overview
• Integrating Data Mining and Data Management Technologies for Scholarly Inquiry
• Goals:
– Text mining and NLP techniques to extract content (named Persons, Places, Time Periods/Events) and associate context
• Data:
– Internet Archive Books Collection (with associated MARC records where available), ~7.2 TB
– JSTOR, ~1 TB
– Context sources: SNAC archival and library authority records
• Tools:
– Cheshire3 – fast open-source XML search engine for storing and indexing XML books; used for extracting geolocations and persons from the XML books indexed in Cheshire3
– iRODS – policy-driven distributed data storage
– Amazon S3 storage and EC2 computing

Current Version
• iRODS and Cheshire3 on Amazon EC2 and S3
• [Architecture diagram: data ingestion passes through the iRODS rule engine into Amazon S3 (Bucket 1, Bucket 2); Cheshire3 indexing and the iCAT catalog support retrieval and data presentation via an Amazon EC2 cache resource]

Summary
• Indexing and IR work very well in the Grid / Cloud environment, with the expected scaling behavior for multiple processes
• Still in progress:
– We are still collecting and processing the books collection from the Internet Archive
– We are still extracting place names, personal names, and corporate names and linking them with reference sources (such as GeoNames, VIAF, and SNAC); illustrative sketches of the ingestion, extraction, and linking steps follow below
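The ingestion path sketched above (books deposited into iRODS and mirrored to Amazon S3 buckets for indexing on EC2) could look roughly like the following, using the python-irodsclient and boto3 libraries. This is a minimal sketch, not the project's actual rule-engine configuration; all host names, zone names, credentials, bucket names, and paths are placeholders.

```python
# Sketch: deposit a book file into iRODS and mirror it to an S3 bucket.
# Hosts, credentials, zone, collection, and bucket names are hypothetical.
import boto3
from irods.session import iRODSSession

LOCAL_FILE = "book_0001.xml"
IRODS_PATH = "/exampleZone/home/ingest/books/book_0001.xml"
S3_BUCKET = "example-books-bucket"

# Deposit into iRODS; the iCAT catalog records the new data object.
with iRODSSession(host="irods.example.org", port=1247,
                  user="ingest", password="secret",
                  zone="exampleZone") as irods:
    irods.data_objects.put(LOCAL_FILE, IRODS_PATH)
    obj = irods.data_objects.get(IRODS_PATH)
    # Attach descriptive metadata (AVU pairs) for later discovery.
    obj.metadata.add("source", "Internet Archive")
    obj.metadata.add("format", "application/xml")

# Mirror the same file to Amazon S3 for downstream indexing on EC2.
s3 = boto3.client("s3")
s3.upload_file(LOCAL_FILE, S3_BUCKET, "books/book_0001.xml")
```

In the actual system this step would be driven by iRODS rules rather than an ad hoc script, so that the policy (replicate, catalog, hand off to indexing) is enforced by the rule engine.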
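The extraction step (pulling candidate person and place names out of the XML book text) is handled in the project by Cheshire3's own indexing and extraction components. As a rough illustration of the general technique only, the sketch below parses a book's XML with lxml and runs an off-the-shelf named-entity recognizer; spaCy is used here purely as a stand-in, not the project's tool, and the file name and element handling are assumptions.

```python
# Sketch: extract candidate person / place / corporate names from an
# XML book file. spaCy is a generic NER stand-in; the project itself
# relies on Cheshire3's extraction pipeline.
from lxml import etree
import spacy

nlp = spacy.load("en_core_web_sm")  # small English model (assumption)

def extract_entities(xml_path):
    """Return sets of person, place, and corporate names found in the text."""
    tree = etree.parse(xml_path)
    # Concatenate all text nodes; a real pipeline would walk the page or
    # paragraph elements of the specific book schema instead.
    text = " ".join(tree.getroot().itertext())
    persons, places, corps = set(), set(), set()
    for ent in nlp(text[:100000]).ents:  # truncated for the sketch
        if ent.label_ == "PERSON":
            persons.add(ent.text)
        elif ent.label_ in ("GPE", "LOC"):
            places.add(ent.text)
        elif ent.label_ == "ORG":
            corps.add(ent.text)
    return persons, places, corps

if __name__ == "__main__":
    people, locations, organizations = extract_entities("book_0001.xml")
    print(len(people), "persons,", len(locations), "places,",
          len(organizations), "corporate names")
```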
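Linking the extracted names to reference sources such as GeoNames and VIAF can be done through their public web services. The sketch below uses the GeoNames search endpoint and the VIAF AutoSuggest endpoint; the demo username, response fields, and matching strategy are assumptions and should be checked against the current service documentation before use.

```python
# Sketch: look up an extracted place name in GeoNames and a personal name
# in VIAF. Endpoint parameters and response fields are assumptions.
import requests

GEONAMES_USER = "demo_user"  # placeholder; GeoNames requires a registered username

def geonames_lookup(place_name):
    """Return the top GeoNames match (name, geonameId, lat, lng) or None."""
    resp = requests.get(
        "http://api.geonames.org/searchJSON",
        params={"q": place_name, "maxRows": 1, "username": GEONAMES_USER},
        timeout=10,
    )
    hits = resp.json().get("geonames", [])
    return hits[0] if hits else None

def viaf_lookup(personal_name):
    """Return the top VIAF AutoSuggest match (a dict with a 'viafid') or None."""
    resp = requests.get(
        "https://viaf.org/viaf/AutoSuggest",
        params={"query": personal_name},
        timeout=10,
    )
    results = resp.json().get("result") or []
    return results[0] if results else None

if __name__ == "__main__":
    print(geonames_lookup("Liverpool"))
    print(viaf_lookup("Mark Twain"))
```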
Current Digital Curation Projects – Creating a Data Observatory
• CI-BER (CyberInfrastructure for Billions of Electronic Records):
– Funded by NSF / NARA (2010–2013): ~$1M
– See: http://ci-ber.blogspot.com/
– Big data management project based on the integration of heterogeneous datasets:
a. Testbed collection of 100 million files and 50 terabytes of data, with content from over 100 federal agencies
b. Open-source collaborative geo-analytics prototype
c. "Citizen-led crowdsourcing" prototype
• NSF "Brown Dog", a $10.5M NSF DIBBs award (2013–2018) – the "super mutt" of software:
– http://go.illinois.edu/BrownDog
– NCSA (Kenton McHenry) + CI-BER (Richard Marciano)
• Creating a data observatory to:
– Provide access to big data training sets
– Benefits: accelerate the development of digital curation algorithms and services
• What if:
– Students could be embedded in a major NSF partnership?
– We used this implementation project as an opportunity to teach students practical digital curation skills?

Digital Curation Lab
• Mission statement: The Digital Curation Lab (DCL) aims to be a leader in the digital curation educational field, providing a model for other universities and laboratories around the world.
• Vision: The DCL will provide a real-world digital curation experience for students and professionals to experiment and innovate. The DCL will also be a key enabler of digital curation projects within the Washington, DC metro area, and a source of cutting-edge research that will transform digital curation technologies, practices, and institutions.
• Values: In order to achieve its vision and mission, the DCL values:
– Innovative, applied research in the digital curation field.
– Practical and theory-based education.
– Enduring partnerships with on- and off-campus organizations, including alumni organizations.
– Transformational impact on technologies, institutions, and practice.
• [Diagram: DCL components – 1. Data Shop; 2. Outside Partners; 3. Alumni Home; 4. Lesson Plan & Software Repository; 5. Interdisciplinary College Resource; 6. Quality Testing Shop; 7. Innovation Hub; with sub-labels "stretching", "aging", "rejuvenation"]