Charleston Conference
7 November 2008

Data Mining, Advanced Collection Analysis, and Publisher Profiles: An Update on the OCLC Publisher Name Authority File

Lynn Silipigni Connaway, Ph.D.
Senior Research Scientist, OCLC Research

Timothy J. Dickey, Ph.D.
Post-Doctoral Researcher, OCLC Research

Overall Research Goals

To build a database that will:

Identify
• Authoritative strings for publisher names
• Common variants for names and locations
• Hierarchical references indicating relationships and nesting of subsidiaries
• Definitions of publishing entities

Produce
• Profiles, including data-mined information regarding formats, languages, subjects, etc. for publishers

Conform
• to international authority and standards practice, and
• inter-operate with other OCLC products

Issues & Challenges

Database Quality:

Historical Practices
• “…the shortest form in which it can be understood.” [AACR2 2004]
• Different versions of cataloging rules
• Abbreviations

Errors and misspellings

Local Practices

Method: Data Mining in an “Aggregate Collection”

Data Mining and Analysis of WorldCat:
“…affords high-level perspective on historical patterns, suggests future trends, and supplies useful intelligence with which to inform decision making.”

Lavoie, B. F., Connaway, L. S., & O’Neill, E. T. (2007). Mapping WorldCat’s digital landscape. Library Resources & Technical Services, 51, 106-115 at 107.

WorldCat: July 2008
• Manifestations (records): 108,828,533
• Works: 84,096,107
• Total holdings: 1,292,763,300
• Physical items: ~1.2 billion
• Digital items: 3,182,550
• Institutions: 69,000

Global Origins of WorldCat Materials

US              28%
Rest of World   27%
Unknown         17%
Germany         10%
UK               8%
France           4%
Canada           3%

Materials with non-US origins: 49% of WorldCat
Top 5:
• Germany: 10.0 million
• UK: 8.8 million
• France: 4.2 million
• Netherlands: 2.9 million
• Canada: 2.9 million

Content languages: 478
Non-English: 57.9 million (55%)
Top 5 non-English:
• German: 12 million
• French: 6.1 million
• Spanish: 3.5 million
• Dutch: 2.6 million
• Japanese: 2.4 million

Non-English metadata language: 28 million (66 languages)
Top 5:
• German: 11 million
• Dutch: 5.0 million
• Swedish: 1.9 million
• French: 1.8 million
• Finnish: 0.7 million

OCLC Publisher Name Server

Publisher Name Server: Objectives

Resolve, for data mining and quality of WorldCat:
• ISBN prefixes to publisher name
• Variant publisher names to a preferred form

Complement the Collection Analysis Service
• Librarians & Publishers

Capture and profile attributes of individual publishers:
• Location(s)
• Language(s) of materials published
• Genre(s)/format(s)
• Dominant subject domain(s)
• Parent company and subsidiaries

Publisher Name Server: Methodology

Programmatically cluster publishers’ records using ISBN prefixes
• Data clustering
• Classification of similar objects into different groups
• Partitioning of a data set into subsets (clusters)

Hand parse the entities and resolve ISBN prefixes

Publisher Name Server: Database

1,750 publishing entities
Relational database, preserving hierarchical relationships
Begins with high-occurrence entities:
• “Top 10” lists
• Top 10 university presses
• Mergers and acquisitions, last 8 years

Example: Top U.S. Publishing Entities by ISBN

ISBN Prefix   WorldCat Records   Publishing Entity
0-13          50,298             Prentice-Hall, Inc.
0-07          44,545             McGraw Hill, Inc.
0-06          44,362             HarperCollins (Firm)
0-16          40,451             United States G.P.O.
0-471         37,710             John Wiley & Sons
0-312         33,318             St. Martin's Press
0-671         31,765             Simon & Schuster, Inc.
0-02          27,602             MacMillan Publishers
0-15          18,420             Harcourt Brace & Company
0-394         18,043             Random House (Firm)
0-590         17,290             Scholastic Inc.
0-385         16,768             Doubleday and Company, Inc.
0-395         16,699             Houghton Mifflin Company
0-19          15,724             Oxford University Press
0-03          15,417             Holt, Rinehart, and Winston

Publisher Name Server: Data Captured

Data:
• Publisher Name, Preferred Form
• Source of Preferred Form
• Former Names
• Variant Forms
• ISBN Prefixes
• HQ City
• HQ Country
• Other Cities
• URL

Data obtained by data mining:
• Languages
• Formats
• Conspectus Subjects

Sources:
• U.S. Library of Congress, National Authority File, 110 (Corporate Name) field
• Books In Print Online (R.R. Bowker)
• The International ISBN Registry (K.G. Saur)
• Publishers’ Weekly Online
• Hoover’s Handbook Online
• Standard and Poor’s Corporate Descriptions
• The Directory of Corporate Affiliations (DIALOG)
• Company websites

Publisher Name Server: Current Scope

More than 56,000 separate strings mapped to 1,750 entities
• 8.5 million OCLC records
• 22% of these are Library of Congress records
• ~490 million holdings

Hierarchical relationships maintained

Entity-Parsing in a World of Mergers and Acquisitions

Pearson PLC: Penguin Books, Allen Lane, Puffin Books, Ladybird Books, Pearson Canada, Copp Clark, Riverhead Books, Pearson Technology Group, Adobe Press, Cisco Press, Putnam Books, Berkley Publishing Group, Pearson Education, Inc., Avery, Addison-Wesley Publishing Company, Benjamin/Cummings Publishing Company, Allyn and Bacon, Scott, Foresman and Company, Prentice-Hall, Inc., HarperCollins Educational Publishers, Dominie Press, Longmans, Green, and Co.
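
The two resolution steps described above, ISBN prefix to publisher name and subsidiary to parent entity, can be illustrated with a minimal Python sketch. It is not the Publisher Name Server's implementation. The prefix table copies a few rows from the ISBN example above. The parent links are an assumption made for illustration, since the slide names these entities under Pearson PLC without stating the exact nesting. The function names `resolve_publisher` and `top_parent` are hypothetical.

```python
from typing import Optional

# ISBN prefix (hyphens removed) -> preferred publisher name.
# Rows copied from the "Top U.S. Publishing Entities by ISBN" example.
PREFIX_TO_PUBLISHER = {
    "013": "Prentice-Hall, Inc.",
    "007": "McGraw Hill, Inc.",
    "006": "HarperCollins (Firm)",
    "0471": "John Wiley & Sons",
    "019": "Oxford University Press",
}

# Child -> parent links. ASSUMPTION: the nesting shown here is illustrative;
# the slide lists these names under Pearson PLC without explicit levels.
PARENT = {
    "Prentice-Hall, Inc.": "Pearson Education, Inc.",
    "Pearson Education, Inc.": "Pearson PLC",
}


def resolve_publisher(isbn: str) -> Optional[str]:
    """Return the preferred publisher name for an ISBN, or None if unknown."""
    digits = isbn.replace("-", "").replace(" ", "")
    # Try the longest candidate prefix first so "0471" wins over any shorter key.
    for length in range(len(digits), 0, -1):
        name = PREFIX_TO_PUBLISHER.get(digits[:length])
        if name is not None:
            return name
    return None


def top_parent(name: str) -> str:
    """Follow parent links upward until an entity with no parent is reached."""
    seen = set()
    while name in PARENT and name not in seen:  # 'seen' guards against cycles
        seen.add(name)
        name = PARENT[name]
    return name


if __name__ == "__main__":
    publisher = resolve_publisher("0-13-110362-8")
    print(publisher)
    print(top_parent(publisher))
```

Rolling records up to the top-level parent in this way is what allows an aggregate count for a group such as Pearson PLC to include its subsidiaries.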

Publisher Profiles within WorldCat

Oxford University Press
• 119,237 records with ISBNs mapped to 210,095 records (0.19% of WorldCat)

Pearson PLC
• Includes 14 subsidiaries and acquisitions
• Aggregate: 291,433 records (0.27% of WorldCat)

Springer (Firm)
• 197,263 records (0.18% of WorldCat)

Reed Elsevier PLC
• Includes dozens of subsidiaries
• Aggregate: 370,029 records (0.34% of WorldCat)

WorldCat Publisher Profiles – Top Languages

Oxford Univ. Press                Pearson PLC
English             96.74%        English             95.27%
Latin                0.51%        Spanish              1.43%
German               0.39%        German               1.33%
Chinese              0.39%        French               0.60%
French               0.37%        Dutch                0.55%
Spanish              0.28%        Latin                0.26%
Afrikaans            0.14%        Malay                0.06%
Middle English       0.13%        Ancient Greek        0.05%
Malay                0.09%        Portuguese           0.05%
Swahili              0.09%        Italian              0.04%

Springer (Firm)                   Reed Elsevier PLC
English             61.25%        English             83.64%
German              37.10%        French               9.34%
French               1.02%        Dutch                2.32%
Italian              0.29%        Spanish              0.95%
Polish               0.13%        Italian              0.60%
Czech                0.04%        Latin                0.27%
Spanish              0.04%        Afrikaans            0.16%
Hungarian            0.03%        Ancient Greek        0.12%
Dutch                0.02%        Portuguese           0.09%
Danish               0.02%        Polish               0.06%

WorldCat Publisher Profiles – Formats

Oxford University Press           Pearson PLC
Printed Material    89.57%        Printed Material    92.98%
Computer File        8.23%        Microform            2.82%
Microform            1.39%        Computer File        2.15%
Sound Recording      0.50%        Video Recording      0.70%
Video Recording      0.16%        Sound Recording      0.67%

Springer (Firm)                   Reed Elsevier PLC
Printed Material    81.69%        Printed Material    92.31%
Computer File       17.51%        Computer File        5.46%
Microform            0.71%        Microform            1.85%
Video Recording      0.05%        Video Recording      0.14%

WorldCat Publisher Profiles – Conspectus Divisions

Oxford Univ. Press                  Pearson PLC
Language/Literature   27.12%        Language/Literature   18.67%
History               11.92%        Business/Economics    13.30%
Music                  9.78%        Computer Science       9.42%
Philosophy/Religion    9.55%        Engineering            8.04%
Business/Economics     6.15%        History                7.59%
Medicine               4.36%        Mathematics            6.04%
Law                    3.85%        Education              5.64%
Sociology              3.75%        Sociology              4.18%
Political Science      3.58%        Philosophy/Religion    3.81%
Biology                2.60%        Physical Sciences      2.75%

WorldCat Publisher Profiles – Conspectus Categories

Oxford Univ. Press                     Pearson PLC
English literature       10.66%        English language          7.74%
English language          5.86%        Business admin.           4.62%
Instrumental music        3.48%        English literature        3.63%
Vocal music               3.09%        Economics                 2.94%
Literature on music       2.26%        Comp. programming         2.39%
History – Britain         1.82%        Electrical engineering    2.24%
Economic history          1.38%        Early childhood ed.       2.05%
American lit.             1.35%        Computer software         1.88%
History – S. Asia         1.30%        U.S. federal law          1.80%
General history           1.29%        Computer science          1.54%

WorldCat Publisher Profiles – Conspectus Subjects

Oxford Univ. Press                         Pearson PLC
English – modern              5.57%        English – modern              7.68%
English lit. – prose          2.51%        Management                    2.53%
English lit. – 19th c.        2.23%        Programming                   1.74%
Juvenile lit.                 1.06%        Arithmetic                    1.09%
English lit. – poetry         1.03%        Economic theory               1.06%
English lit. – collections    0.80%        Marketing                     1.06%
Biographies                   0.76%        General algebra               1.04%
English lit. – 1900-1960      0.74%        Accounting                    0.97%
Shakespeare                   0.68%        Juvenile lit.                 0.93%
Sacred choruses               0.66%        English lit. – 19th c.        0.89%

WorldCat Publisher Profiles – Conspectus Divisions

Springer (Firm)                     Reed Elsevier PLC
Computer Science      16.83%        Language/Literature   14.18%
Engineering           15.12%        Law                   11.78%
Mathematics           12.96%        Engineering           11.73%
Medicine               9.93%        Business/Economics     6.82%
Physical Sciences      9.83%        Medicine               6.50%
Biology                5.22%        Physical Sciences      5.01%
Business/Economics     5.13%        History                4.57%
Health Professions     4.48%        Biology                4.32%
Chemistry              3.14%        Health Professions     3.70%
Geography              2.58%        Chemistry              3.51%

WorldCat Publisher Profiles – Conspectus Categories

Springer (Firm)                        Reed Elsevier PLC
Computer science          5.23%        English literature        5.84%
General math              4.48%        Health professions        3.40%
Health professions        4.03%        English language          2.79%
Electrical engineering    3.73%        U.S. federal law          2.32%
General engineering       3.25%        General engineering       2.26%
Mathematical analysis     3.06%        Electrical engineering    2.10%
Computer software         2.37%        General law               1.70%
Comp. programming         2.34%        Industrial economics      1.65%
Probability/Statistics    2.20%        Business admin.           1.53%
Mech. engineering         2.17%        U.S. state law            1.46%

WorldCat Publisher Profiles – Conspectus Subjects

Springer (Firm)                            Reed Elsevier PLC
Health professions            3.56%        English – modern              2.68%
Math collections              2.76%        English – prose               2.06%
Computer science              1.84%        Health professions            1.92%
Programming                   1.46%        U.S. state law                1.37%
Access/security               1.10%        Industrial management         1.22%
Artificial intelligence       1.03%        Legal periodicals             1.16%
Mathematical stats            1.03%        English lit. – 1900-1960      1.15%
Analytical physics            1.02%        Engineering materials         0.86%
Industrial management         0.99%        English fiction               0.83%
Engineering materials         0.90%        Nuclear physics               0.68%

Projected MARC coding of Authorized Forms

710 Added Entry – Corporate Name
• Add $4 for publisher name
• Add $2 NAF where preferred form matches existing authority record (44% of current PNAF)

752 Added Entry – Hierarchical Place Name
• Add $2 FAST where place of publication matches FAST geographical subject headings

Ongoing Research

Further data mining
• Profile other aspects of publication output
• Profile other publishers
• Trends over time
• Author clusters
• Geographic holdings patterns
• Collection Analysis

Plan for long-term maintenance
• ISBN-13 compliance
• File expansion of ongoing mergers/acquisition activities
• Deeper scaling into WorldCat (beyond ISBN)

OCLC Publisher Name Server
Project page: http://www.oclc.org/research/projects/publisherns/

Thank You!
Questions and Discussion

Lynn Silipigni Connaway
Timothy J. Dickey
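
As a rough illustration of the projected MARC coding of authorized forms described above, the following Python sketch assembles 710 and 752 fields as display strings. Several details are assumptions not stated on the slides: the relator code `pbl` (the MARC relator code for publisher) in $4, the lowercase source codes `naf` and `fast` in $2, the indicator values, the use of $a and $d in field 752, and the sample place "England" / "Oxford". The helper names are hypothetical, and no MARC library is used.

```python
from typing import List, Tuple

Subfield = Tuple[str, str]  # (subfield code, value)


def format_field(tag: str, indicators: str, subfields: List[Subfield]) -> str:
    """Render a field as 'TAG II $c value ...' for display purposes only."""
    body = " ".join(f"${code} {value}" for code, value in subfields)
    return f"{tag} {indicators} {body}"


def publisher_710(preferred_name: str, in_naf: bool) -> str:
    """710 Added Entry - Corporate Name for a publisher's preferred form."""
    # $4 carries the relator code; "pbl" is assumed here to mark a publisher.
    subfields = [("a", preferred_name), ("4", "pbl")]
    if in_naf:
        # $2 is added only when the preferred form matches an authority record.
        subfields.append(("2", "naf"))
    # ASSUMPTION: first indicator 2 (name in direct order), second blank (#).
    return format_field("710", "2#", subfields)


def place_752(country: str, city: str, in_fast: bool) -> str:
    """752 Added Entry - Hierarchical Place Name for a place of publication."""
    # ASSUMPTION: $a holds the country or larger entity and $d the city.
    subfields = [("a", country), ("d", city)]
    if in_fast:
        # $2 is added only when the place matches a FAST geographic heading.
        subfields.append(("2", "fast"))
    return format_field("752", "##", subfields)


if __name__ == "__main__":
    print(publisher_710("Oxford University Press", in_naf=True))
    print(place_752("England", "Oxford", in_fast=True))
```

Keeping $2 conditional mirrors the slide's wording: the source code is added only where a matching authority record or FAST heading exists.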