A Study of Applying Text Mining for Big
Data in Digital Humanities
Min Song, Associate Professor
Department of Library and Information Science
Yonsei University
Outline
• Definition and History of Digital Humanities (DH)
• Examples of DH Projects
• Why Text Mining?
• Big Data in Digital Humanities
• Solution for Big Data
• Case Study: Indications of Emotional Connection: Epistolary Text Mining for Intimate Language
About Me
• Senior Software Engineer at Thomson Reuters in Philadelphia (1999 to 2006)
• Associate Professor, Department of Information Systems, New Jersey Institute of Technology (2006 to 2011)
• Bachelor's degree in Library Science, Yonsei University
• Master's degree in Library and Information Science, Indiana University
• Ph.D. from the School of Information Science and Technology, Drexel University
• Area of specialization: Text Mining

Definition and History of Digital Humanities

What is Digital Humanities?
◦ The digital humanities is an area of
research, teaching, and creation concerned
with the intersection of computing and the
disciplines of the humanities (Svensson, 2010).
◦ Largely a focus on methods in humanities
research
◦ Challenge: using the vast amount of digital
resources being created in a meaningful,
scholarly way
History of Digital Humanities: 1949-2011
• 1949: Father Roberto Busa began his index of every word in the works of St. Thomas Aquinas (11M words); he visited Thomas Watson and enlisted IBM
• 1966: Computers and the Humanities founded
• 1978: Association for Computers and the Humanities founded
• 1985: Perseus Project begun at Harvard
• 1996: First draft of XML spec released
• 2006: NEH Office of Digital Humanities
• 2011: DH 2011, Stanford University
Example Projects from the last 20 years
1. LORE (Australia)
2. Old Bailey (UK)
3. TAPoR (Canada)
4. NINES (USA)
5. MONK Project (USA)
1: Aus-e-lit (lore tool)
“LORE: Literature Object Re-use and Exchange
LORE is an extension for the Firefox Web Browser which
aims to support the digital practice of Literature
scholars and teachers by enabling them to:

Author Compound-objects:
◦ Gather web resources and organise and tag them,
describe links between them and make notes about
them via the graphical editor.
◦ Discover compound objects that others have
created about related topics and web resources.
◦ Publish the information that scholars have
assembled as compound objects so that others can
find them and reuse the information.
◦ Communicate the information using automatically
generated documents, slideshows and diagrams.”
[Figure: Aus-e-Lit (LORE tool)]
http://www.versi.edu.au/
2: Old Bailey
“The Old Bailey Proceedings Online makes available a fully
searchable, digitised collection of all surviving editions of the
Old Bailey Proceedings from 1674 to 1913. It allows
access to over 197,000 trials and biographical details of
approximately 2,500 men and women executed at Tyburn,
free of charge for non-commercial use.
In addition to the text, accessible through both keyword and
structured searching, this website provides digital images of
all 190,000 original pages of the Proceedings, 4,000 pages
of Ordinary's Accounts, advice on methods of searching
this resource, information on the historical and legal
background to the Old Bailey court and its Proceedings,
and descriptions of published and manuscript materials
relating to the trials covered. Contemporary maps, and
images have also been provided”
See: http://www.oldbaileyonline.org/static/Project.jsp
[Figure: Old Bailey]
3: TAPoR
Launched in 2003
TAPoR is a portal for:
• Collecting texts: TAPoR lets you keep a library of references with links to original documents on the web or elsewhere
• Analysing texts: you can then pass these texts to tools that analyse the text and then store the results
• Tools on TAPoR are web services that do not run on the TAPoR server

[Figure: TAPoR]
4: NINES
Networked Infrastructure for Nineteenth-Century Electronic Scholarship
Goals:
• To serve as a peer-reviewing body for digital work in the long 19th century (1770-1920), British and American;
• Peer-reviewing and legitimating digital literary editing, to support scholars’ priorities and best practices in the creation of digital research materials;
• To develop software tools for new and traditional forms of research and critical analysis.
• It provides scholars with access to a federated digital environment and a suite of computerised analytic and interpretive tools.
• NINES is currently aggregating 670,373 peer-reviewed digital objects from 96 federated sites.
5: MONK Project
• Focus: apply text-mining tools to digital libraries in the humanities; facilitate “reading at library scale.”
• Funded for two years (2007-2009) by the Andrew W. Mellon Foundation
• Involved faculty and staff at Illinois (GSLIS and NCSA), Northwestern, Nebraska, Maryland, Alberta, McMaster
• Content (150M words of literary text) contributed by Virginia, Indiana, UNC, ProQuest, Cengage
• Coverage: literature of many genres, in English, from 1600 to the 1920s
Text Mining

“The objective of Text Mining is to exploit
information contained in textual
documents in various ways, including
…discovery of patterns and trends in
data, associations among entities,
predictive rules, etc.” (Grobelnik et al.,
2001)
Why Do We Need Text Mining for
Digital Humanities?

Text mining is now an essential part of
Digital Humanities
◦ Reveal hidden patterns
◦ Generate new research questions
◦ Confirm or counter intuition
◦ Help understand historical sources
Why Is Text Mining Hard? Context
in Computing
• Numeric computing is easy: computers are designed for it, and math uses a formal, unambiguous language
• Text computing is HARD: meaning is all contextual
• Many different languages
• Ambiguous meaning
• Big Data

Different Languages
• All computers “speak” binary and use formal, unambiguous expressions.
• There are currently 6,900 known active human languages. Social media uses slang, abbreviations, and invented words.
• Text must be translated into a common normalization language, or cross-lingual tools must be used. Most tools are only available for English and major European and Asian languages.
• Machine translation is very expensive, and linguistic errors tend to confuse NLP tools.

“Really Big Data” – Digital Humanities
• Traditional sciences are “small data” compared with the information world of news and social media
• 200 MILLION new tweets a day
• 1 BILLION new Facebook items a day: the average person adds 3 items to Facebook every single day

“Really Big becomes REALLY Big”
• Social media in particular is vastly outpacing traditional information sources
• Entire New York Times 1945-2005 = 18M articles = 2.9 billion words
• 5 BILLION words added to Twitter each DAY (almost twice the total volume of the Times in the last 60 years)

And Even Bigger
• HathiTrust includes Google Books and contains 4% of all books ever printed = 9.4 million digitized works = 3.3 billion pages = 2 trillion words
• Estimated 49.5 trillion words ever printed in books over the last 600 years
• At its current volume, Twitter alone will reach that size in just 27 years. With its current rate of tripling post volume each year, it will take just three years
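The projections above can be checked with a few lines of arithmetic. The sketch below uses only figures quoted on the slides (5 billion words per day on Twitter, 49.5 trillion words ever printed). The 365-day year, the function names, and the reading of "tripling" as a yearly volume that is three times the previous year's are assumptions made for illustration.

```python
# Back-of-the-envelope check of the slide figures.
WORDS_EVER_PRINTED = 49.5e12    # from the slide
TWITTER_WORDS_PER_DAY = 5e9     # from the slide
DAYS_PER_YEAR = 365             # assumption: ignore leap years


def years_at_constant_rate(target, words_per_day):
    """Years needed to reach `target` words if the daily volume never changes."""
    return target / (words_per_day * DAYS_PER_YEAR)


def years_with_growth(target, words_per_day, factor=3, grow_first_year=True):
    """Whole years until the cumulative volume reaches `target` when each
    year's volume is `factor` times the previous year's.

    grow_first_year=True assumes the first projected year is already
    `factor` times the current yearly volume (an assumption).
    """
    yearly = words_per_day * DAYS_PER_YEAR
    if grow_first_year:
        yearly *= factor
    total, years = 0.0, 0
    while total < target:
        total += yearly
        yearly *= factor
        years += 1
    return years


if __name__ == "__main__":
    print(years_at_constant_rate(WORDS_EVER_PRINTED, TWITTER_WORDS_PER_DAY))
    print(years_with_growth(WORDS_EVER_PRINTED, TWITTER_WORDS_PER_DAY))
```

Under these assumptions the constant-rate estimate comes to about 27 years and the tripling estimate to three years. If the first projected year is instead taken at the current volume (`grow_first_year=False`), the loop needs four years.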

What Is the Solution for Big Data?

Cloud Computing
◦ Cloud Computing is a general term used to
describe a new class of network based
computing that takes place over the Internet.
◦ In other words, this is a collection/group of
integrated and networked hardware, software
and Internet infrastructure (called a platform).
Source: http://www.free-pictures-photos.com/
Cloud Architecture
Key Technology: Virtualization
(hardware side)
[Figure: Traditional stack (apps on a single operating system on hardware) compared with virtualized stack (apps on separate OS instances on a hypervisor on hardware)]
MapReduce (software side)
• Model for processing large data sets.
• Contains Map and Reduce functions.
• Runs on a large cluster of machines.
• Many MapReduce programs are executed on Google’s cluster every day.
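As a rough illustration of the Map and Reduce functions mentioned above, the following single-process sketch counts words across a few short texts. It is not a distributed implementation: a real framework would run the map and reduce steps on many machines, and every function name here is hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values emitted for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence on a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    letters = ["Dear friend I miss you", "dear friend write soon"]
    print(word_count(letters))
```

Because each document is mapped independently and each key is reduced independently, both steps can be spread over a cluster, which is what makes the model suitable for large data sets.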

Case Study: Indications of Emotional
Connection: Epistolary Text Mining
for Intimate Language
Research conducted from 2009 to 2010
• Song, M., Ruecker, S., and Youngman, P. (2009) Indications
of Emotional Connection: Epistolary Text Mining for
Intimate Language, BookOnline 09, October 2, Corfu,
Greece
Research Problem

The personal letter is a complex genre:
◦ A wide range of individual variation is involved in
the personal communications between two
people.
◦ Writers use various techniques, such as self-revelation,
humor, the sharing of secrets, the
establishment of informal agreements, and the
employment of a vocabulary or idiom.
Research Goals
• Explore large volumes of personal letters (more than a million pages of correspondence).
• Identify patterns of language that indicate the creation and extension, or alternatively the proposal and rejection, of some form of private language.
• Develop a shared personal vocabulary as a result of mining personal letters.
• Develop new visual tools to search linguistic constructions of intimacy.

Research Questions
• What kinds of evidence for the development and extension of private language can be found by analytical means, and how can that analysis be modified or adapted in order to improve the quality of evidence found?
• What can combing through a large body of epistolary material contribute to our understanding of the phenomenon of communication of a personal kind at a distance?
• How can the visual details of the interface and related experimental visualizations be adjusted to best support the kind of iterative inquiry and analysis necessary for algorithmic criticism of personal communication?

Data Sets
• The half million items in the Enron email collection, available at www.cs.cmu.edu/~enron/.
• British and Irish Women's Letters and Diaries (1500-1950), which includes the experiences of approximately 500 women, in over 100,000 pages of diaries and letters.
• North American Immigrant Letters, Diaries and Oral Histories (1800-1950), which includes 2,162 authors and approximately 100,000 pages of information.
• North American Women's Letters and Diaries, which includes the immediate experiences of 1,325 women and 150,000 pages of diaries and letters.
• Note: as the project plan has matured, we have added more items to this collection
The Proposed Intimacy Mining Technique

• To identify private language used in personal letters, the following techniques are applied:
◦ Natural Language Processing (NLP) techniques, including part-of-speech (POS) analysis, deep sentence and paragraph parsing, and text chunking.
• To build a local vocabulary for each private language, the following approach is proposed:
◦ A two-level knowledge extraction approach: named entity extraction and concept extraction.
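The two-level idea can be sketched in a few lines of Python. The slides do not specify the actual algorithms, so both levels below are deliberately naive stand-ins: capitalised mid-sentence tokens stand in for named entity extraction, and repeated content-word bigrams stand in for concept extraction. The stopword list, the function names and the sample letters are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical, minimal stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "i", "you",
             "we", "our", "is", "at", "my", "me", "it", "for"}


def extract_named_entities(text):
    """Level 1 (naive stand-in): capitalised tokens that do not start a sentence."""
    entities = Counter()
    for sentence in re.split(r"[.!?]+\s*", text):
        tokens = re.findall(r"[A-Za-z']+", sentence)
        for token in tokens[1:]:  # skip the sentence-initial token
            if token[0].isupper() and token.lower() not in STOPWORDS:
                entities[token] += 1
    return entities


def extract_concepts(text, min_count=2):
    """Level 2 (naive stand-in): content-word bigrams seen at least min_count times.

    Stopwords are dropped first, so a bigram may span a removed word
    or a sentence boundary.
    """
    tokens = [t for t in re.findall(r"[a-z']+", text.lower())
              if t not in STOPWORDS]
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {" ".join(pair): n for pair, n in bigrams.items() if n >= min_count}


def build_local_vocabulary(letters):
    """Combine both levels into one vocabulary for a set of letters."""
    text = " ".join(letters)
    return {"entities": extract_named_entities(text),
            "concepts": extract_concepts(text)}


if __name__ == "__main__":
    letters = ["Meet me at the old mill. Bring Anna to the old mill.",
               "I saw Anna in Corfu."]
    print(build_local_vocabulary(letters))
```

A production system would replace both stand-ins with the POS tagging, parsing and chunking components listed above; the sketch only shows how the output of two extraction levels can be merged into one local vocabulary per correspondence.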
Interactive Visualization

Goals of the interactive visualization
component of the system are:
◦ To produce interactive visualizations to support
literary scholars in working through the results of
the text mining activities.
◦ To provide the humanities scholar studying the
letters with a way to zoom from representations
of larger patterns into intermediate
representations of details, all within the context of
having available at any time a reading pane
containing the actual texts of the letters or email
messages.
References
• Svensson, P. (2010) The Landscape of Digital Humanities. Digital Humanities Quarterly, http://digitalhumanities.org/dhq/vol/4/1/000080/000080.html
• Grobelnik, M., Mladenic, D., and Milic-Frayling, N. (2001) Text Mining as Integration of Several Related Research Areas, KDD’01 Workshop on Text Mining.
• Song, M., Ruecker, S., and Youngman, P. (2009) Indications of Emotional Connection: Epistolary Text Mining for Intimate Language, BookOnline 09, October 2, Corfu, Greece

Questions?

Thank you!