HATHITRUST
A Shared Digital Repository
HathiTrust Digital Library Interface and
Services
Angelina Zaytsev
Collection Services Librarian
[email protected]
Agenda
• Collection overview
• Interface overview
• Other services
If time permits:
• Programs
• Working groups & committees
• Governance & partnership
10/25/2016
Collection Overview
HathiTrust Collections: Oct 2016
• 14.7 million total items
– 7.4 million book titles
– 405,000 serial titles
– 767,000 US federal government documents
– 5.7 million items open (public domain & CC licenses)
For more information...
You can click through to see the results for all
of these categories!
https://www.hathitrust.org/statistics_visualizations
What kind of content formats?
• Scanned from book-like materials
• Image formats: TIFFs and JPEG2000s
• Plain-text OCR
• PDFs are generated on-the-fly and delivered to users (NOT stored in the repository)
Some:
• Born-digital PDFs (and maybe EPUBs soon!)
• Photos
Where does content come from? - Digitization source
Type and characteristics:
Google
● 94.8% of the collection
● Download restrictions
● Primarily scanned in black and white with some color pages
● Large-scale mass digitization = quality can vary
Internet Archive
● 3.7% of the collection
● No download restrictions
● Scanned in full color (as a result, file sizes are 2.5 times larger than Google content)
● Large-scale mass digitization = quality can vary
Locally digitized & vendor services
● 1.48% of the collection
● Various restrictions may apply
● Typically small scale, “boutique” digitization = high quality (with some exceptions)
Where does content come from? - Top 10 contributing libraries
Institution: Volumes
• University of Michigan: 4,714,231
• University of California: 3,835,563
• Harvard University: 841,969
• Cornell University: 585,190
• University of Wisconsin - Madison: 561,945
• Indiana University: 530,763
• University of Illinois at Urbana-Champaign: 528,545
• University of Minnesota: 503,057
• The University of Texas: 460,139
• Pennsylvania State University: 390,345
Special Collections
• Universidad Complutense de Madrid: Latin, Spanish and French
documents from the 1500-1800s
• Keio University: 92,000+ Japanese and some Chinese language
materials
• Islamic Manuscripts from University of Michigan: 8th-20th century
CE mss., 1,795 titles in Arabic, Persian, and Ottoman Turkish
languages, collaborative cataloging project
• Benson Latin American Collection, University of Texas at Austin:
460,000 vols related to Latin American culture and history
• Minnesota Digital Library & Minnesota Historical Society: 60,000
photos related to Minnesota history
• US Fed Gov Docs: 766,000+ documents and growing!
Access is determined by several factors:
• Copyright status of the item, derived from:
– Bibliographic metadata (including for US fed gov docs)
– Manual copyright review
– Permissions agreements
• Geographic location of the user: in the United States vs. outside the United States
• Member affiliation: yes/no?
• Digitization source and/or contributing institution: any restrictions imposed by these entities?
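Access by type of work and use:
Public domain worldwide
– Search (bibliographic and full-text): Worldwide
– Text and data mining: Worldwide
– Viewable*: Worldwide
– Full-PDF download: Partners only if 3rd-party restrictions. If not, worldwide.
– Print disabilities*: Worldwide
– Preservation uses (Section 108)*: N/A
Public domain (US): non-US works published between 1873 and 1923
– Search (bibliographic and full-text): Worldwide
– Text and data mining: Available within the United States
– Viewable*: When accessed from within the United States
– Full-PDF download: Partners in the US if 3rd-party restrictions. If not, anyone in the US.
– Print disabilities*: Partners in the US; partners worldwide where laws permit
– Preservation uses (Section 108)*: N/A
Works that rights holders have opened access to in HathiTrust
– Search (bibliographic and full-text): Worldwide
– Text and data mining: Worldwide unless license forbids it
– Viewable*: Worldwide
– Full-PDF download: Worldwide (if digitized by Google, full PDF only available if opened with CC license)
– Print disabilities*: Worldwide
– Preservation uses (Section 108)*: N/A
Works that are in-copyright or of undetermined status
– Search (bibliographic and full-text): Worldwide
– Text and data mining: Forthcoming
– Viewable*: Not available
– Full-PDF download: Not available
– Print disabilities*: Partners in the US; partners worldwide where laws permit
– Preservation uses (Section 108)*: Partners in the US; partners worldwide where laws permit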
* Note: Access to in-copyright works is subject to conditions listed in HathiTrust’s policies on Access and
Use.
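The access table above can be treated as a simple lookup from (type of work, use) to an access rule. The sketch below is illustrative only: the dictionary keys and the `access_for` helper are hypothetical names, and the cell values assume the reading of the flattened slide table given here, so they should be checked against HathiTrust's own access policies before being relied on.

```python
# Hypothetical encoding of the access table from the slide.
# Key names and the helper are illustrative, not part of any HathiTrust API.
USES = (
    "search",
    "text_and_data_mining",
    "viewable",
    "full_pdf_download",
    "print_disabilities",
    "preservation_uses",
)

PARTNERS = "Partners in the US; partners worldwide where laws permit"

# One tuple per type of work, in the same order as USES.
_ROWS = {
    "public_domain_worldwide": (
        "Worldwide",
        "Worldwide",
        "Worldwide",
        "Partners only if 3rd-party restrictions. If not, worldwide.",
        "Worldwide",
        "N/A",
    ),
    "public_domain_us": (
        "Worldwide",
        "Available within the United States",
        "When accessed from within the United States",
        "Partners in the US if 3rd-party restrictions. If not, anyone in the US.",
        PARTNERS,
        "N/A",
    ),
    "opened_by_rights_holder": (
        "Worldwide",
        "Worldwide unless license forbids it",
        "Worldwide",
        "Worldwide (if digitized by Google, full PDF only available "
        "if opened with CC license)",
        "Worldwide",
        "N/A",
    ),
    "in_copyright_or_undetermined": (
        "Worldwide",
        "Forthcoming",
        "Not available",
        "Not available",
        PARTNERS,
        PARTNERS,
    ),
}

ACCESS = {work: dict(zip(USES, cells)) for work, cells in _ROWS.items()}


def access_for(work_type: str, use: str) -> str:
    """Return the access rule text for a type of work and a kind of use."""
    return ACCESS[work_type][use]


if __name__ == "__main__":
    print(access_for("in_copyright_or_undetermined", "viewable"))
```

The rules are kept as plain strings because several cells are conditional (3rd-party restrictions, local law, license terms) and do not reduce cleanly to yes/no.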
Interface Overview
• Full-text search
• Catalog search
• Pageturner
• Collection Builder
Other services
• Get bib records in bulk:
https://www.hathitrust.org/data
• Get datasets:
https://www.hathitrust.org/datasets
• Get high resolution image files:
https://www.hathitrust.org/data_api
• Get bib records for known identifiers:
https://www.hathitrust.org/bib_api
• Get some data about all HT content:
https://www.hathitrust.org/hathifiles
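As a rough illustration of "bib records for known identifiers", the sketch below only builds a request URL and makes no network call. The base URL, the `brief`/`full` detail levels, and the identifier type names are assumptions about the Bib API's URL pattern, and `bib_api_url` is a hypothetical helper; confirm the details against the bib_api page listed above.

```python
from urllib.parse import quote

# Assumed base URL and URL pattern for the Bib API; verify against
# https://www.hathitrust.org/bib_api before use.
BIB_API_BASE = "https://catalog.hathitrust.org/api/volumes"
DETAIL_LEVELS = {"brief", "full"}            # assumed detail levels
ID_TYPES = {"oclc", "lccn", "issn", "isbn"}  # assumed subset of identifier types


def bib_api_url(id_type: str, value, detail: str = "brief") -> str:
    """Build a (hypothetical) Bib API URL for a single known identifier."""
    if detail not in DETAIL_LEVELS:
        raise ValueError(f"unknown detail level: {detail!r}")
    if id_type not in ID_TYPES:
        raise ValueError(f"unknown identifier type: {id_type!r}")
    # Percent-encode the identifier so it is safe as one path segment.
    return f"{BIB_API_BASE}/{detail}/{id_type}/{quote(str(value), safe='')}.json"


if __name__ == "__main__":
    print(bib_api_url("oclc", 424023))
```

The resulting URL can be passed to any HTTP client; for large batches, the bulk options in the list above (bib records in bulk, hathifiles) are likely the better fit.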
For help
• See https://www.hathitrust.org/help
• Contact [email protected]
Questions?
BUT...
HathiTrust is more than just a
library!
HathiTrust Research Center
• Goal: to build a secure environment and provide
services to support data mining and text analysis
• Portal: https://analytics.hathitrust.org/
– Soon: analysis against the copyrighted
content in HT
• Advanced Collaborative Support (ACS): mini-grants where awardees get staff time, not $
• HathiTrust+Bookworm: visualize word trends
• Extracted Features dataset: bits of data about
the content
US Federal Documents Program
• Build a Registry of all known US fed gov docs
• Collect a complete corpus of all known US fed
gov docs
https://www.hathitrust.org/usgovdocs
Shared Print Program
• Build a shared print monograph program
across the membership in order to reduce
collective costs of maintaining print
collections
• Goal: secure retention commitments for all
monographs in HathiTrust
https://www.hathitrust.org/shared_print_program
Copyright Review Management System
• Volunteers from HT member libraries undertake
manual copyright review of certain categories of
materials
• To date, has focused on the following categories:
– Monographs published in Australia, the United
Kingdom and Canada from 1876-1945
– Monographs published in the United States from
1923-1977
https://www.hathitrust.org/copyright-review
Members participate in other groups and
committees
• User Support Working Group
https://www.hathitrust.org/wg_user-support_charge
• Collections Committee
https://www.hathitrust.org/collections-committee-charge
• Metadata Policy, Strategy, Use and Sharing Advisory Group
(MUSAG) https://www.hathitrust.org/wg_musag_charge
• HathiTrust Quality Assurance and Standards Working Group
https://www.hathitrust.org/qaswg_charge