Integrating Data Mining and Data
Management Technologies for
Scholarly Inquiry
Ray R. Larson
University of California, Berkeley
Paul Watry
University of Liverpool
Richard Marciano
University of Maryland
Thank you!
Special thanks to
John Harrison & Jerome Fuselier (Liverpool),
Chien-Yi Hou (UNC),
Shreyas & Luis Aguilar (UCB)
Available via https://github.com/cheshire3
iRODS available via https://www.irods.org
Project web site http://diggingintodata.web.unc.edu
• Integrating Data Mining and Data
Management Technologies for Scholarly
Inquiry
• Goals:
– Text mining and NLP techniques to extract content
(named Persons, Places, Time Periods/Events) and
associate context
• Data:
– Internet Archive Books Collection (with associated
MARC where available) ~7.2 TB
– JSTOR ~1 TB
– Context sources: SNAC Archival and Library
Authority records.
• Tools
– Cheshire3 – Fast open source XML search engine
for storing & indexing XML books. Used for
extracting GeoLocations and Persons from XML
books indexed in Cheshire3.
– iRODS – Policy-driven distributed data storage
– Amazon S3 storage and EC2 computing
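The extraction goal listed above can be illustrated with a minimal sketch. It does not use Cheshire3's API and is not the project's actual pipeline: it assumes a hypothetical book layout (`<book>` with `<page n="...">` elements) and a tiny hand-made gazetteer, and relies only on the Python standard library to find place-name mentions per page.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample; real Internet Archive book XML is structured differently.
SAMPLE_BOOK = """<book>
  <page n="1">He sailed from Liverpool to New York in 1847.</page>
  <page n="2">The archive in Berkeley holds the letters.</page>
</book>"""

# Hypothetical mini-gazetteer standing in for a reference source such as GeoNames.
GAZETTEER = {"Liverpool", "New York", "Berkeley"}


def extract_places(xml_text, gazetteer):
    """Return (page number, place name) pairs for gazetteer names found on each page."""
    # Try longer names first so multi-word names are preferred over shorter ones.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    root = ET.fromstring(xml_text)
    hits = []
    for page in root.iter("page"):
        text = "".join(page.itertext())
        for match in pattern.finditer(text):
            hits.append((page.get("n"), match.group(1)))
    return hits


if __name__ == "__main__":
    for page_no, place in extract_places(SAMPLE_BOOK, GAZETTEER):
        print(page_no, place)
```

Exact gazetteer matching is the simplest possible stand-in for the NLP techniques the slide mentions; it misses spelling variants and cannot tell apart places that share a name, which is why context sources matter.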
Current Version
• iRODS and C3 on Amazon EC2 and S3
[Diagram components: Data Ingestion; iRODS Rule Engine; Amazon S3 (Bucket 1, Bucket 2); Data Presentation; Cheshire3 Indexing and Retrieval; iCAT; Amazon EC2; Cache Resource]
Summary
• Indexing and IR work very well in the Grid
/ Cloud environment, with the expected
scaling behavior for multiple processes
• Still in progress:
– We are still collecting and processing the books
collection from the Internet Archive
– We are still extracting place names, personal
names, corporate names and linking with
reference sources (such as GeoNames, VIAF,
and SNAC)
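The linking step mentioned above can be sketched in a few lines. This is an illustration under stated assumptions, not the project's method: the authority records and their identifiers are made up, no VIAF or SNAC service is called, and matching is done by simple name normalization (case, accents, punctuation, and word order).

```python
import unicodedata

# Hypothetical authority records; the identifiers are invented for illustration.
AUTHORITY = [
    {"id": "auth-001", "name": "Dickens, Charles", "variants": ["Charles Dickens"]},
    {"id": "auth-002", "name": "Brontë, Charlotte",
     "variants": ["Charlotte Bronte", "Currer Bell"]},
]


def normalize(name):
    """Lowercase, strip accents and punctuation, and sort tokens so word order is ignored."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " "
                      for ch in no_accents.lower())
    return " ".join(sorted(cleaned.split()))


def build_lookup(records):
    """Map every normalized name form (preferred and variant) to its record id."""
    lookup = {}
    for rec in records:
        for form in [rec["name"], *rec["variants"]]:
            lookup[normalize(form)] = rec["id"]
    return lookup


def link(name, lookup):
    """Return the matching authority id, or None when no form matches."""
    return lookup.get(normalize(name))


if __name__ == "__main__":
    table = build_lookup(AUTHORITY)
    for extracted in ["Charles Dickens", "Charlotte Brontë", "Jane Austen"]:
        print(extracted, "->", link(extracted, table))
```

Sorting tokens treats "Dickens, Charles" and "Charles Dickens" as the same key, but it would also conflate two different people whose names contain the same words, so a real linker would need dates or other context to disambiguate.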
Current Digital Curation Projects
Creating a data observatory
• CI-BER (CyberInfrastructure for Billions of
Electronic Records):
– Funded by NSF/NARA (2010-2013): ~$1M
– See: http://ci-ber.blogspot.com/
– Big data management project based on
the integration of heterogeneous datasets:
a. Testbed collection of 100 million files and
50 terabytes of data with content from over
100 federal agencies
b. Open source collaborative geo-analytics
prototype
c. “Citizen-led crowdsourcing” prototype
• NSF: “Brown Dog”, a $10.5M
NSF/DIBBs award (2013-2018) -- the
“super mutt” of software:
– http://go.illinois.edu/BrownDog
– NCSA (Kenton McHenry) + CI-BER
(Richard Marciano)
Creating a Data Observatory to:
• Provide access to big data training sets
• Benefits: Accelerate the development of
digital curation algorithms and services
• What if:
– Students could be embedded in a major
NSF partnership?
– We used this implementation project as an
opportunity to teach students practical
digital curation skills?
Digital Curation Lab
Mission Statement:
The Digital Curation Lab (DCL) aims to be a leader in the Digital Curation
educational field, providing a model for other universities and laboratories around the
world.
Vision:
The Digital Curation Lab (DCL) will provide a real-world digital curation experience
for students and professionals to experiment and innovate. The DCL will also be a
key enabler of digital curation projects within the Washington, DC Metro area, and
a source of cutting-edge research that will transform digital curation technologies,
practices, and institutions.
Values:
In order to achieve its vision and mission, the DCL needs to value:
• Innovative, applied research in the digital curation field.
• Practical and theory-based education.
• Enduring partnerships with on and off-campus organizations including alumni
organizations.
• Transformational impact on technologies, institutions, and practice.
1. DATA SHOP
2. OUTSIDE PARTNERS
3. ALUMNI HOME
4. LESSON PLAN & SOFTWARE REPOSITORY
- stretching
- aging
- rejuvenation
5. INTERDISCIPLINARY COLLEGE RESOURCE
6. QUALITY TESTING SHOP
7. INNOVATION HUB