ASIS&T 2008 Annual Meeting
Columbus, OH
28 October 2008

Beyond Data Mining: Delivering the Next Generation of Services from Library Data

Lynn Silipigni Connaway, Ph.D.
Senior Research Scientist
OCLC

Timothy J. Dickey, Ph.D.
Post-Doctoral Researcher
OCLC

WorldCat as an “Aggregate Collection”
Data Mining and Analysis of WorldCat:
“…affords high-level perspective on historical patterns, suggests future trends, and supplies useful intelligence with which to inform decision making.”
Lavoie, B. F., Connaway, L. S., & O’Neill, E. T. (2007). Mapping WorldCat’s digital landscape. Library Resources & Technical Services, 51, 106-115, at 107.

WorldCat: July 2008
• Manifestations (records): 108,828,533
• Total holdings: 1,292,763,300
• Digital Items: 3,182,550
• Works: 84,096,107
• Institutions: 69,000
• Physical Items: ~1.2 billion

Global Origins of WorldCat Materials
• Rest of World: 27%
• Germany: 10%
• Unknown: 17%
• France: 4%
• Canada: 3%
• UK: 8%
• US: 28%

Global Origins of WorldCat Materials
Content Languages: 478
• 49% of WorldCat non-English
• Top 5 non-English:
  • German: 12 million
  • French: 6.1 million
  • Spanish: 3.5 million
  • Dutch: 2.6 million
  • Japanese: 2.4 million
Materials w/non-US origins: 57.9 million (55%)
• Top 5:
  • Germany: 10.0 million
  • UK: 8.8 million
  • France: 4.2 million
  • Netherlands: 2.9 million
  • Canada: 2.9 million
Non-English Metadata Language: 28 million (66 languages)
• Top 5:
  • German: 11 million
  • French: 1.8 million
  • Dutch: 5.0 million
  • Finnish: 0.7 million
  • Swedish: 1.9 million

WorldCat as a Decision-Making Resource
Collection management
• Cooperative collection development
• Comparative collection analysis
• Collection assessment
• Mass digitization
• Off-site storage
• Preservation

WorldCat as a Decision-Making Resource
Services
• Virtual reference
• Recommender services
• Social networking
Systems
• Precision

WorldCat as a Decision-Making Resource
Three Areas of Data Mining Research:
• OCLC WorldMap
• Audience Level
• Publisher Name Server

OCLC WorldMap

OCLC WorldMap™: Objectives
Geographically represent
WorldCat data
• Titles published in each country
• Holdings for titles published in each country
• Languages represented for titles published in each country

OCLC WorldMap™: Objectives
Geographically represent data from UNESCO, ARL, and NCES for each country
• Number of
  • Libraries
  • Library volumes
  • Certified/degreed librarians
  • Registered library users
  • Library expenditures
  • Cultural heritage institutions (museums and archives)
  • Publishers

OCLC WorldMap™: Objectives
Research prototype
• Support OCLC data mining research
• Visually display data for review and analysis
• Internal use
• Sales and marketing
• External use
• Library collection assessment and comparison
• Data may be processed AT A GLANCE
• Complement the AAU/ARL Global Resources Network project
• Project of the Council on Library and Information Resources (CLIR)
http://pubserv.oclc.org:12223/WorldMap/

OCLC Audience Level

Audience Level: Rationale and Objectives
Holdings represent selection decisions by librarians … implies there are more than 1 billion individual selection decisions in the WorldCat holdings file
Selections serve the interests of a library’s target community …
• Associate community (audience level) to library
profiles, e.g., ARL, non-ARL academic, public, K12 school …
Thus we can infer materials’ audience level from holdings patterns, which in turn can support:
• Collection management
• Readers’ advisory services
• Reference services
• Information retrieval

Example Computation: Build Community
Library symbol  Library name                      Library type  Weight
OHI             State Library of Ohio             Other         x
OCO             Columbus Metropolitan Library     Public        0.33
CDC             Cedarville University             Academic      0.67
LIM             Lima Public Library               Public        0.33
OUN             Ohio University                   Research      1.00
OSD             SEO Automation Consortium         Other         x
BGU             Bowling Green State University    Academic      0.67
MIA             Miami University                  Academic      0.67
AKR             University of Akron               Academic      0.67
BGF             Firelands College                 Academic      0.67
CIN             University of Cincinnati          Research      1.00
TOL             University of Toledo              Academic      0.67
KSU             Kent State University             Research      1.00
HIR             Hiram College                     Academic      0.67
YNG             Youngstown State University       Academic      0.67

“FRBRizing” Audience Level Results
• Calculate Audience Level for each Manifestation
• Aggregate weighted holdings for Work
OCLC Number  Total Holdings  Usable Holdings  Manifestation Audience Level
15504400     147             114              0.783825
29613712     172             117              0.769453
40393191     207             136              0.789426
62762763     190             124              0.758274
81016224     1               0                x

Evaluating the OCLC Audience Level
• Random sample of 30 Zoology books, all audience levels
• Human subjects
• Ranked books “in increasing order of difficulty”
• Strong statistical correlation between human subjects’ ranking and programmatic ranking

Evaluating the OCLC Audience Level
[Figure: subjects’ rankings (vertical axis, 0 to 30) plotted against Audience Level ranking (horizontal axis, 1 to 30)]
http://audiencelevel.oclc.org/

OCLC Publisher Name Server

Publisher Name Server: Research Objectives
Resolve for data mining and quality of WorldCat
• ISBN prefixes to publisher name
• Variant publisher names to a preferred form
Complement Collection Analysis Service
• Librarians
• Publishers
Capture and profile attributes of
individual publishers
• Location(s)
• Language(s) of materials published
• Genre(s)/format(s)
• Dominant subject domain(s)
• Parent company and subsidiaries

Publisher Name Server: Methodology
Programmatically cluster publishers’ records using ISBN prefixes
• Data clustering (The Free Dictionary)
• "The science of extracting useful information from large data sets or databases"
• Classification of similar objects into different groups
• Partitioning of a data set into subsets (clusters)
• Data in each subset (ideally) share some common trait
Hand parse the entities and resolve ISBN prefixes

Publisher Name Server: Database
1750 publishing entities
Relational database, preserving hierarchical relationships
Begins with high-occurrence entities:
• “Top 10” lists (USA, UK, Canada, Australia, Germany, France, Netherlands, Japan, Italy, China, Russia, Spain, Finland, Australia, Taiwan, New Zealand)
• Top 10 university presses
• Mergers and acquisitions, last 8 years

Publisher Name Server: Data Captured
Database Fields:
• Publisher Name, Preferred Form
• Source of Preferred Form
• Former Names
• Variant Forms
• ISBN Prefixes
• HQ City
• HQ Country
• Other Cities
• URL
• Languages
• Formats
• Conspectus Subjects
Data Sources:
• U.S. Library of Congress, Name Authority File, 110 (Corporate Name) field
• Books In Print Online (R.R. Bowker)
• The International ISBN Registry (K.G. Saur)
• Publishers Weekly Online
• Hoover’s Handbook Online
• Standard and Poor’s Corporate Descriptions
• The Directory of Corporate Affiliations (DIALOG)
• Company websites
• Data mining

Publisher Name Server: Database
More than 56,000 separate strings mapped to 1750 entities
• 8.5 million OCLC records
• 22% of these are Library of Congress records
• ~490 million holdings
Hierarchical relationships maintained

Entity-Parsing in a World of Mergers and Acquisitions
Pearson PLC
• Penguin Books
• Allen Lane
• Puffin Books
• Ladybird Books
• Pearson Canada
• Copp Clark
• Riverhead Books
• Pearson Technology Group
• Adobe Press
• Cisco Press
• Putnam Books
• Berkley Publishing Group
• Pearson Education, Inc.
• Avery
• Addison-Wesley Publishing Company
• Benjamin/Cummings Publishing Company
• Allyn and Bacon
• Scott, Foresman and Company
• Prentice-Hall, Inc.
• HarperCollins Educational Publishers
• Dominie Press
• Longmans, Green, and Co.

Publisher Profiles
Oxford University Press
• 119,237 records with ISBNs mapped to 210,095 records (0.19% of WorldCat)
Pearson PLC
• Includes 14 subsidiaries and acquisitions
• Aggregate: 291,433 records (0.27% of WorldCat)

Publisher Profiles – Top Languages
Oxford Univ. Press                  Pearson PLC
English                    96.74%   English                  95.27%
Latin                      0.51%    Spanish                  1.43%
German                     0.39%    German                   1.33%
Chinese                    0.39%    French                   0.60%
French                     0.37%    Dutch                    0.55%
Spanish                    0.28%    Latin                    0.26%
Afrikaans                  0.14%    Malay                    0.06%
Middle English             0.13%    Ancient Greek            0.05%
Malay                      0.09%    Portuguese               0.05%
Swahili                    0.09%    Italian                  0.04%

Publisher Profiles – Conspectus Divisions
Oxford Univ. Press                  Pearson PLC
Language/Literature        27.12%   Language/Literature      18.67%
History                    11.92%   Business/Economics       13.30%
Music                      9.78%    Computer Science         9.42%
Philosophy/Religion        9.55%    Engineering              8.04%
Business/Economics         6.15%    History                  7.59%
Medicine                   4.36%    Mathematics              6.04%
Law                        3.85%    Education                5.64%
Sociology                  3.75%    Sociology                4.18%
Political Science          3.58%    Philosophy/Religion      3.81%
Biology                    2.60%    Physical Sciences        2.75%

Publisher Profiles – Conspectus Categories
Oxford Univ. Press                  Pearson PLC
English literature         10.66%   English language         7.74%
English language           5.86%    Business admin.          4.62%
Instrumental music         3.48%    English literature       3.63%
Vocal music                3.09%    Economics                2.94%
Literature on music        2.26%    Comp. programming        2.39%
History – Britain          1.82%    Electrical engineering   2.24%
Economic history           1.38%    Early childhood ed.      2.05%
American lit.              1.35%    Computer software        1.88%
History – S. Asia          1.30%    U.S. federal law         1.80%
General history            1.29%    Computer Science         1.54%

Publisher Profiles – Conspectus Subjects
Oxford Univ. Press                  Pearson PLC
English – modern           5.57%    English – modern         7.68%
English lit – prose        2.51%    Management               2.53%
English lit – 19th c.      2.23%    Programming              1.74%
Juvenile lit.              1.06%    Arithmetic               1.09%
English lit – poetry       1.03%    Economic theory          1.06%
English lit – collections  0.80%    Marketing                1.06%
Biographies                0.76%    General algebra          1.04%
English lit – 1900-1960    0.74%    Accounting               0.97%
Shakespeare                0.68%    Juvenile lit.            0.93%
Sacred choruses            0.66%    English lit – 19th c.    0.89%

Projected MARC coding of Authorized Forms
710 Added Entry – Corporate Name
• Add $4 for publisher name
• Add $2 NAF where preferred form matches existing authority record (44% of current PNAF)
752 Added Entry – Hierarchical Place Name
• Add $2 FAST where place of publication matches FAST geographical subject headings

Future Research
• Further data mining
• Profile aspects of publication output
• Deeper scaling into WorldCat (beyond ISBN)
• Plan for long-term maintenance
• ISBN-13 compliance
• File expansion of ongoing mergers/acquisition activities

Thank You!
Questions and Discussion
Lynn Silipigni Connaway
Timothy J. Dickey
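
Supplementary sketch: Audience Level computation
The weights on the “Example Computation: Build Community” slide (Public 0.33, Academic 0.67, Research 1.00) suggest a weighted average over holdings. The sketch below is one possible reading, not the documented OCLC method. It assumes that libraries of type Other (marked x) are excluded as unusable, that a manifestation's level is the mean weight of its usable holdings, and that a Work's level is the mean of its manifestations' levels weighted by usable holdings. The function names are hypothetical.

```python
from __future__ import annotations

from typing import Optional

# Weights by library type, copied from the "Build Community" slide.
# Treating "Other" (marked "x" on the slide) as unusable is an assumption.
TYPE_WEIGHTS = {"Public": 0.33, "Academic": 0.67, "Research": 1.00}


def manifestation_level(holding_types: list[str]) -> tuple[int, Optional[float]]:
    """Return (usable_holdings, level) for one manifestation.

    Assumed rule: level = mean weight of the holding libraries whose
    type has a weight; None when no holding is usable.
    """
    weights = [TYPE_WEIGHTS[t] for t in holding_types if t in TYPE_WEIGHTS]
    if not weights:
        return 0, None
    return len(weights), sum(weights) / len(weights)


def work_level(manifestations: list[tuple[int, Optional[float]]]) -> Optional[float]:
    """Aggregate (usable_holdings, level) pairs into a Work-level value.

    Assumed rule: mean of manifestation levels weighted by usable holdings.
    """
    usable = [(n, lvl) for n, lvl in manifestations if n > 0 and lvl is not None]
    total = sum(n for n, _ in usable)
    if total == 0:
        return None
    return sum(n * lvl for n, lvl in usable) / total


# Library types of the 15 Ohio holding libraries, in slide order (OHI ... YNG).
OHIO_TYPES = [
    "Other", "Public", "Academic", "Public", "Research",
    "Other", "Academic", "Academic", "Academic", "Academic",
    "Research", "Academic", "Research", "Academic", "Academic",
]
usable, level = manifestation_level(OHIO_TYPES)

# (usable holdings, manifestation level) pairs from the "FRBRizing" slide;
# the last manifestation has no usable holdings.
SLIDE_MANIFESTATIONS = [
    (114, 0.783825), (117, 0.769453), (136, 0.789426), (124, 0.758274), (0, None),
]
work = work_level(SLIDE_MANIFESTATIONS)

print(usable, level, work)
```

Under these assumptions, a manifestation held only by libraries of type Other contributes nothing to the Work-level value, which matches the x shown for OCLC number 81016224 on the slide.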