HATHITRUST: A Shared Digital Repository

HathiTrust Digital Library Interface and Services
Angelina Zaytsev, Collection Services Librarian
[email protected]

Agenda
• Collection overview
• Interface overview
• Other services
If time permits:
• Programs
• Working groups & committees
• Governance & partnership

10/25/2016

Collection Overview

HathiTrust Collections: Oct 2016
• 14.7 million total items
  – 7.4 million book titles
  – 405,000 serial titles
  – 767,000 US federal government documents
  – 5.7 million items open (public domain & CC licenses)
6 April 2016

For more information...
You can click through to see the results for all of these categories!
https://www.hathitrust.org/statistics_visualizations

What kind of content formats?
• Scanned from book-like materials
• Image formats: TIFFs and JPEG2000s
• Plain-text OCR
• PDFs are generated on-the-fly and delivered to users (NOT stored in the repository)
Some:
• Born-digital PDFs (and maybe EPUBs soon!)
• Photos

Where does content come from?
- Digitization source

Type: Google
● 94.8% of the collection
● Download restrictions
● Primarily scanned in black and white with some color pages
● Large-scale mass digitization = quality can vary

Type: Internet Archive
● 3.7% of the collection
● No download restrictions
● Scanned in full color (as a result, file sizes are 2.5 times larger than Google content)
● Large-scale mass digitization = quality can vary

Type: Locally digitized & vendor services
● 1.48% of the collection
● Various restrictions may apply
● Typically small scale, “boutique” digitization = high quality (with some exceptions)

Where does content come from?
- Top 10 contributing libraries

Institution                                   Volumes
University of Michigan                        4,714,231
University of California                      3,835,563
Harvard University                            841,969
Cornell University                            585,190
University of Wisconsin - Madison             561,945
Indiana University                            530,763
University of Illinois at Urbana-Champaign    528,545
University of Minnesota                       503,057
The University of Texas                       460,139
Pennsylvania State University                 390,345

Special Collections
• Universidad Complutense de Madrid: Latin, Spanish and French documents from the 1500s-1800s
• Keio University: 92,000+ Japanese and some Chinese language materials
• Islamic Manuscripts from University of Michigan: 8th-20th century CE mss., 1,795 titles in Arabic, Persian, and Ottoman Turkish languages, collaborative cataloging project
• Benson Latin American Collection, University of Texas at Austin: 460,000 vols related to Latin American culture and history
• Minnesota Digital Library & Minnesota Historical Society: 60,000 photos related to Minnesota history
• US Fed Gov Docs: 766,000+ documents and growing!

Access is determined by several factors:
• Copyright status of the item. Derived from:
  ● Bibliographic metadata (inc. for US fed gov docs)
  ● Manual copyright review
  ● Permissions agreements
• Geographic location of the user: in the United States vs. outside the United States
• Member affiliation: yes/no?
• Digitization source and/or contributing institution: any restrictions imposed by these entities?

Access by type of work

Type of work: Public domain worldwide
  – Search (bibliographic and full-text): Worldwide
  – Text and Data Mining: Worldwide
  – Viewable*: Worldwide
  – Full-PDF download: Partners only if 3rd-party restrictions. If not, worldwide.
  – Print disabilities*: Worldwide
  – Preservation uses (Section 108)*: N/A

Type of work: Public domain (US) – Non-US works published between 1873 and 1923
  – Search (bibliographic and full-text): Worldwide
  – Text and Data Mining: Available within the United States
  – Viewable*: When accessed from within the United States
  – Full-PDF download: Partners in the US if 3rd-party restrictions. If not, anyone in the US.
  – Print disabilities*: Partners in the US; partners worldwide where laws permit
  – Preservation uses (Section 108)*: N/A

Type of work: Works that rights holders have opened access to in HathiTrust
  – Search (bibliographic and full-text): Worldwide
  – Text and Data Mining: Worldwide unless license forbids it
  – Viewable*: Worldwide
  – Full-PDF download: Worldwide (if digitized by Google, full-PDF only available if opened with CC license)
  – Print disabilities*: Worldwide
  – Preservation uses (Section 108)*: N/A

Type of work: Works that are in copyright or of undetermined status
  – Search (bibliographic and full-text): Worldwide
  – Text and Data Mining: Forthcoming
  – Viewable*: Not available
  – Full-PDF download: Not available
  – Print disabilities*: Partners in the US; partners worldwide where laws permit
  – Preservation uses (Section 108)*: Partners in the US; partners worldwide where laws permit

* Note: Access to in-copyright works is subject to conditions listed in HathiTrust’s policies on Access and Use.

Interface Overview
• Full-text search
• Catalog search
• Pageturner
• Collection Builder

Other services
• Get bib records in bulk: https://www.hathitrust.org/data
• Get datasets: https://www.hathitrust.org/datasets
• Get high resolution image files: https://www.hathitrust.org/data_api
• Get bib records for known identifiers: https://www.hathitrust.org/bib_api
• Get some data about all HT content: https://www.hathitrust.org/hathifiles

For help
• See https://www.hathitrust.org/help
• Contact [email protected]

Questions?

BUT... HathiTrust is more than just a library!
HathiTrust Research Center
• Goal: to build a secure environment and provide services to support data mining and text analysis
• Portal: https://analytics.hathitrust.org/
  – Soon: analysis against the copyrighted content in HT
• Advanced Collaborative Support (ACS): minigrants where awardees get staff time, not $
• HathiTrust+Bookworm: visualize word trends
• Extracted Features dataset: bits of data about the content

US Federal Documents Program
• Build a Registry of all known US fed gov docs
• Collect a complete corpus of all known US fed gov docs
https://www.hathitrust.org/usgovdocs

Shared Print Program
• Build a shared print monograph program across the membership in order to reduce collective costs of maintaining print collections
• Goal: secure retention commitments for all monographs in HathiTrust
https://www.hathitrust.org/shared_print_program

Copyright Review Management System
• Volunteers from HT member libraries undertake manual copyright review of certain categories of materials
• To date, has focused on the following categories:
  – Monographs published in Australia, the United Kingdom and Canada from 1876-1945
  – Monographs published in the United States from 1923-1977
https://www.hathitrust.org/copyright-review

Members participate in other groups and committees
• User Support Working Group
  https://www.hathitrust.org/wg_user-support_charge
• Collections Committee
  https://www.hathitrust.org/collections-committee-charge
• Metadata Policy, Strategy, Use and Sharing Advisory Group (MUSAG)
  https://www.hathitrust.org/wg_musag_charge
• HathiTrust Quality Assurance and Standards Working Group
  https://www.hathitrust.org/qaswg_charge
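
Appendix: sketch of a Bib API request
The "Get bib records for known identifiers" service listed under Other services is the kind of lookup that can be scripted. The sketch below only builds a request URL and does not contact any server. The base URL, the "brief"/"full" levels, and the identifier types are assumptions based on the commonly documented URL pattern, not something stated in these slides, and the function name bib_api_url is hypothetical. Check https://www.hathitrust.org/bib_api before relying on any of it.

```python
from urllib.parse import quote

# ASSUMPTION: endpoint pattern for the HathiTrust Bib API. It is not given
# in the slides; confirm it at https://www.hathitrust.org/bib_api.
BIB_API_BASE = "https://catalog.hathitrust.org/api/volumes"

# ASSUMPTION: identifier types and detail levels the API accepts.
ID_TYPES = {"oclc", "lccn", "issn", "isbn", "htid", "recordnumber"}
LEVELS = {"brief", "full"}


def bib_api_url(id_type: str, identifier: str, level: str = "brief") -> str:
    """Build (but do not send) a single-identifier Bib API request URL."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level!r}")
    if id_type not in ID_TYPES:
        raise ValueError(f"unknown identifier type: {id_type!r}")
    # Defensive choice in this sketch: percent-encode the identifier so that
    # reserved characters cannot change the path structure.
    return f"{BIB_API_BASE}/{level}/{id_type}/{quote(identifier, safe='')}.json"


if __name__ == "__main__":
    # Illustrative identifier only; any OCLC number has the same URL shape.
    print(bib_api_url("oclc", "424023"))
```

The resulting URL could be passed to any HTTP client. Keeping URL construction separate from the network call makes the identifier validation easy to check on its own.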