A Study of Applying Text Mining for Big Data in Digital Humanities
Min Song, Associate Professor
Department of Library and Information Science, Yonsei University

Outline
• Definition and History of Digital Humanities (DH)
• Examples of DH Projects
• Why Text Mining?
• Big Data in Digital Humanities
• Solution for Big Data
• Case Study: Indications of Emotional Connection: Epistolary Text Mining for Intimate Language

About Me
• Senior Software Engineer at Thomson Reuters in Philadelphia (1999-2006)
• Associate Professor, Department of Information Systems, New Jersey Institute of Technology (2006-2011)
• B.A. in Library Science, Yonsei University
• Master's degree in Library and Information Science, Indiana University
• Ph.D. from the School of Information Science and Technology, Drexel University
• Area of specialization: text mining

Definition and History of Digital Humanities
What is digital humanities?
◦ The digital humanities is an area of research, teaching, and creation concerned with the intersection of computing and the disciplines of the humanities (Svensson, 2010).
◦ Largely a focus on methods in humanities research
◦ Challenge: using the vast amount of digital resources being created in a meaningful, scholarly way

History of Digital Humanities: 1949-2011
• 1949: Father Roberto Busa begins his index of every word in the works of St. Thomas Aquinas (11M words); he visits Thomas Watson and enlists IBM
• 1966: Computers and the Humanities founded
• 1978: Association for Computers and the Humanities founded
• 1985: Perseus Project begun at Harvard
• 1996: First draft of the XML spec released
• 2006: NEH Office of Digital Humanities
• 2011: DH 2011, Stanford University

Example Projects from the Last 20 Years
1. LORE (Australia)
2. Old Bailey (UK)
3. TAPoR (Canada)
4. NINES (USA)
5. MONK Project (USA)

1: Aus-e-Lit (LORE tool)
LORE: Literature Object Re-use and Exchange
“LORE is an extension for the Firefox Web Browser which aims to support the digital practice of Literature scholars and teachers by enabling them to:
Author Compound-objects:
◦ Gather web resources and organise and tag them, describe links between them and make notes about them via the graphical editor.
◦ Discover compound objects that others have created about related topics and web resources.
◦ Publish the information that scholars have assembled as compound objects so that others can find them and reuse the information.
◦ Communicate the information using automatically generated documents, slideshows and diagrams.”
(Screenshot: Aus-e-Lit LORE tool)
http://www.versi.edu.au/

2: Old Bailey
“The Old Bailey Proceedings Online makes available a fully searchable, digitised collection of all surviving editions of the Old Bailey Proceedings from 1674 to 1913. It allows access to over 197,000 trials and biographical details of approximately 2,500 men and women executed at Tyburn, free of charge for non-commercial use. In addition to the text, accessible through both keyword and structured searching, this website provides digital images of all 190,000 original pages of the Proceedings, 4,000 pages of Ordinary's Accounts, advice on methods of searching this resource, information on the historical and legal background to the Old Bailey court and its Proceedings, and descriptions of published and manuscript materials relating to the trials covered. Contemporary maps, and images have also been provided”
See: http://www.oldbaileyonline.org/static/Project.jsp
(Screenshot: Old Bailey Proceedings Online)

3: TAPoR
Launched in 2003, TAPoR is a portal for:
• Collecting texts: TAPoR lets you keep a library of references with links to original documents on the web or elsewhere
• Analysing texts: you can pass these texts to tools that analyse them and then store the results
• Tools on TAPoR are web services, not hosted on the TAPoR server itself
(Screenshot: TAPoR portal)

4: NINES
Networked Infrastructure for Nineteenth-Century Electronic Scholarship
Goals:
• To serve as a peer-reviewing body for digital work in the long 19th century (1770-1920), British and American
• To peer-review and legitimate digital literary editing, supporting scholars’ priorities and best practices in the creation of digital research materials
• To develop software tools for new and traditional forms of research and critical analysis
It provides scholars with access to a federated digital environment and a suite of computerised analytic and interpretive tools. NINES is currently aggregating 670,373 peer-reviewed digital objects from 96 federated sites.

5: MONK
• Focus: apply text-mining tools to digital libraries in the humanities; facilitate “reading at library scale.”
• Funded for two years (2007-2009) by the Andrew W. Mellon Foundation
• Involved faculty and staff at Illinois (GSLIS and NCSA), Northwestern, Nebraska, Maryland, Alberta, McMaster
• Content (150M words of literary text) contributed by Virginia, Indiana, UNC, ProQuest, Cengage
• Coverage: literature of many genres, in English, from 1600 to the 1920s

Text Mining
“The objective of Text Mining is to exploit information contained in textual documents in various ways, including …discovery of patterns and trends in data, associations among entities, predictive rules, etc.” (Grobelnik et al., 2001)

Why Do We Need Text Mining for Digital Humanities?
Text mining is now an essential part of digital humanities. It can:
◦ Reveal hidden patterns
◦ Generate new research questions
◦ Confirm or counter intuition
◦ Help understand historical sources

Why Is Text Mining Hard? Context in Computing
• Numeric computing is easy: computers are designed for it, and math uses a formal, unambiguous language
• Text computing is HARD: meaning is all contextual
  ◦ Many different languages
  ◦ Ambiguous meaning
  ◦ Big data

Different Languages
• All computers “speak” binary and use formal, unambiguous expressions.
• There are currently 6,900 known active human languages.
• Social media uses slang, abbreviations, and invented words.
• Text must be translated into a common normalized language, or cross-lingual tools must be used.
• Most tools are available only for English and the major European and Asian languages.
• Machine translation is very expensive, and linguistic errors tend to confuse NLP tools.
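
To make the normalization point above concrete, here is a minimal Python sketch of the simplest kind of pattern discovery: counting term frequencies after normalizing social-media-style text. It is an illustration only, not the method used in any of the projects described. The `SLANG` table, the function names, and the sample sentences are all hypothetical, and a real system would need far richer language handling.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical slang/abbreviation table, for illustration only.
SLANG = {"u": "you", "gr8": "great", "thx": "thanks"}


def normalize(text):
    """Unicode-normalize, lowercase, tokenize, and expand known slang."""
    text = unicodedata.normalize("NFKC", text).lower()
    # Runs of letters/digits in any script (underscore excluded).
    tokens = re.findall(r"[^\W_]+", text)
    return [SLANG.get(tok, tok) for tok in tokens]


def term_frequencies(docs):
    """Count normalized terms across a collection of documents."""
    counts = Counter()
    for doc in docs:
        counts.update(normalize(doc))
    return counts


if __name__ == "__main__":
    docs = ["Thx, u are GR8!", "You are great. 감사합니다"]
    for term, n in term_frequencies(docs).most_common():
        print(term, n)
```

Without the slang table, "u" and "you" would be counted as different terms, which is one small way informal language hides patterns from frequency-based tools.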
“Really Big Data” in Digital Humanities
• Traditional sciences are “small data” compared with the information world of news and social media
• 200 million new tweets a day
• 1 billion new Facebook items a day: the average person adds 3 items to Facebook every single day

“Really Big Becomes REALLY Big”
• Social media in particular is vastly outpacing traditional information sources
• Entire New York Times 1945-2005 = 18M articles = 2.9 billion words
• 5 billion words added to Twitter each day (almost twice the total volume of the Times over those 60 years)

And Even Bigger
• HathiTrust includes Google Books and contains 4% of all books ever printed = 9.4 million digitized works = 3.3 billion pages = 2 trillion words
• Estimated 49.5 trillion words ever printed in books over the last 600 years
• At 5 billion words a day, Twitter alone will reach that size in just 27 years. With its current rate of tripling post volume each year, it will take just three years

What Is the Solution for Big Data? Cloud Computing
◦ Cloud computing is a general term for a new class of network-based computing that takes place over the Internet.
◦ In other words, it is a collection of integrated and networked hardware, software, and Internet infrastructure (called a platform).
Source: http://www.free-pictures-photos.com/

Cloud Architecture
Key Technology: Virtualization (hardware side)
• Traditional stack: apps run on a single operating system, which runs directly on the hardware
• Virtualized stack: apps run on several OS instances, which share the hardware through a hypervisor

MapReduce (software side)
• A model for processing large data sets
• Consists of Map and Reduce functions
• Runs on a large cluster of machines
• Many MapReduce programs are executed on Google’s cluster every day

Case Study: Indications of Emotional Connection: Epistolary Text Mining for Intimate Language
• Research conducted from 2009 to 2010
• Song, M., Ruecker, S., and Youngman, P. (2009) Indications of Emotional Connection: Epistolary Text Mining for Intimate Language, BookOnline 09, October 2, Corfu, Greece

Research Problem
The personal letter is a complex genre:
◦ A wide range of individual variation is involved in the personal communications between two people.
◦ Various techniques are used, such as self-revelation, humor, the sharing of secrets, the establishment of informal agreements, and the employment of a vocabulary or idiom.

Research Goals
• Explore large volumes of personal letters (more than a million pages of correspondence).
• Identify patterns of language that indicate the creation and extension, or alternatively the proposal and rejection, of some form of private language.
• Develop a shared personal vocabulary as a result of mining personal letters.
• Develop new visual tools to search linguistic constructions of intimacy.

Research Questions
• What kinds of evidence for the development and extension of private language can be found by analytical means, and how can that analysis be modified or adapted in order to improve the quality of evidence found?
• What can combing through a large body of epistolary material contribute to our understanding of the phenomenon of communication of a personal kind at a distance?
• How can the visual details of the interface and related experimental visualizations be adjusted to best support the kind of iterative inquiry and analysis necessary for algorithmic criticism of personal communication?

Data Sets
• The half million items in the Enron email collection, available at www.cs.cmu.edu/~enron/.
• British and Irish Women's Letters and Diaries (1500-1950), which includes the experiences of approximately 500 women, in over 100,000 pages of diaries and letters.
• North American Immigrant Letters, Diaries and Oral Histories (1800-1950), which includes 2,162 authors and approximately 100,000 pages of information.
• North American Women's Letters and Diaries, which includes the immediate experiences of 1,325 women and 150,000 pages of diaries and letters.
Note: as the project plan has matured, we have added more items to this collection.

The Proposed Intimacy Mining Technique
• To identify private language used in personal letters, the following techniques are applied:
  ◦ Natural Language Processing (NLP) techniques, including part-of-speech (POS) analysis, deep sentence and paragraph parsing, and text chunking.
• To build a local vocabulary for each private language, the following approach is proposed:
  ◦ A two-level knowledge extraction approach: named entity extraction and concept extraction.

Interactive Visualization
Goals of the interactive visualization component of the system are:
◦ To produce interactive visualizations to support literary scholars in working through the results of the text mining activities.
◦ To provide the humanities scholar studying the letters with a way to zoom from representations of larger patterns into intermediate representations of details, all within the context of having available at any time a reading pane containing the actual texts of the letters or email messages.

References
• Svensson, P. (2010) The Landscape of Digital Humanities. Digital Humanities Quarterly, http://digitalhumanities.org/dhq/vol/4/1/000080/000080.html
• Grobelnik, M., Mladenic, D., and Milic-Frayling, N. (2001) Text Mining as Integration of Several Related Research Areas, KDD’01 Workshop on Text Mining.
• Song, M., Ruecker, S., and Youngman, P. (2009) Indications of Emotional Connection: Epistolary Text Mining for Intimate Language, BookOnline 09, October 2, Corfu, Greece

Questions?
Thank you!