Lauren Marino
CPSC 445 HW1
Intuitively, most people would not say literature was quantifiable, but data
mining is already being put to use in the analysis of poetry in languages as diverse as
Chinese, Japanese, and English. In 1998, Mayumi Yamasaki, Masayuki Takeda,
Tomoko Fukuda, and Ichiro Nanri used machine learning to discover characteristic
patterns from collections of classical Japanese poems. They first put together a system
of grammar for identifying patterns in the poems, then used data mining techniques to
search out those patterns and compare the frequency of the occurrence of certain patterns
across five different collections of classical Japanese poems. Though no definite
conclusions were reached, they found that some patterns occurred frequently in some
collections but not at all in others, while other patterns occurred in all five. These
tendencies suggested directions for further research.
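The comparison step described above can be illustrated with a small sketch. It assumes each pattern is a plain substring, which is a simplification standing in for the grammar-based patterns the researchers actually defined; the function name, collection names, and poems below are made up for illustration.

```python
from typing import Dict, List


def pattern_frequencies(
    collections: Dict[str, List[str]], patterns: List[str]
) -> Dict[str, Dict[str, float]]:
    """For each pattern, return the fraction of poems in each collection containing it."""
    table: Dict[str, Dict[str, float]] = {}
    for pattern in patterns:
        row: Dict[str, float] = {}
        for name, poems in collections.items():
            # Substring matching is an assumption; the original work used its own pattern grammar.
            hits = sum(1 for poem in poems if pattern in poem)
            row[name] = hits / len(poems) if poems else 0.0
        table[pattern] = row
    return table


# Toy, invented data: two hypothetical collections of short "poems".
collections = {
    "A": ["spring wind blows", "autumn moon rises", "spring rain falls"],
    "B": ["autumn leaves fall", "winter snow falls"],
}
freqs = pattern_frequencies(collections, ["spring", "autumn"])
for pattern, row in freqs.items():
    print(pattern, row)
```

A pattern such as "spring" that appears in one collection but never in another is the kind of contrast the study reports; a pattern such as "autumn" that appears everywhere corresponds to the shared patterns.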
In August 2004, researchers Yong Yi, Zhong-shi He, Liang-yan Li, and Tian Yu
presented their method of using machine learning to classify traditional Chinese poetry
into Bold-and-Unrestrained or Graceful-and-Restrained styles. Because poetry style is
most often determined subjectively and intuitively by a human reader, it is difficult to
derive quantitative principles or format rules from human classifications. Through
machine learning, the researchers hoped to find a quantitative model for identifying
the style of traditional Chinese poetry. Using a naïve Bayesian classifier, they
were able to identify poetry style based on the occurrence of 1087 commonly used
Chinese characters with approximately 90 percent accuracy.
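To make the idea concrete, here is a minimal sketch of a naïve Bayesian classifier over character occurrences. It is not the researchers' implementation: the multinomial model, the add-one smoothing, the class name, the four-symbol vocabulary, and the training strings are all assumptions chosen for illustration, with Latin letters standing in for Chinese characters.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes over single characters (illustrative sketch)."""

    def __init__(self, vocabulary):
        self.vocab = set(vocabulary)
        self.class_counts = Counter()                # poems seen per style
        self.feature_counts = defaultdict(Counter)   # character counts per style

    def fit(self, docs, labels):
        for doc, label in zip(docs, labels):
            self.class_counts[label] += 1
            for ch in doc:
                if ch in self.vocab:
                    self.feature_counts[label][ch] += 1
        return self

    def log_scores(self, doc):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            # Add-one smoothing (an assumption) keeps unseen characters from zeroing a class.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for ch in doc:
                if ch in self.vocab:
                    score += math.log((counts[ch] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, doc):
        scores = self.log_scores(doc)
        return max(scores, key=scores.get)


# Toy, invented training data: "a" and "b" mark one style, "x" and "y" the other.
train_docs = ["aab", "aba", "xxy", "yxy"]
train_labels = ["bold", "bold", "graceful", "graceful"]
model = NaiveBayes(vocabulary="abxy").fit(train_docs, train_labels)
print(model.predict("ab"), model.predict("xy"))
```

The classifier simply asks which style makes the observed characters more probable, treating each character as independent of the others. That independence assumption is the "naïve" part.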
In November 2004, participants at McMaster University, Open Sky Solutions, the
University of Alberta, University of Georgia, University of Illinois, University of
Maryland, University of Nebraska, and the University of Virginia came together to work
on the Nora project, a software tool for “discovering, visualizing, and exploring
significant patterns across large collections of full text humanities resources in existing
digital libraries.” (www.noraproject.org/description.php) Through the cooperation of
several digital libraries, the Nora project has gained access to approximately 10,000
literary texts, which roughly amounts to 5GB of data. That’s only a small fraction of the
data stored in the world’s digital libraries.
One of the first experiments for the Nora project was the investigation of erotic
language in the writings of poet Emily Dickinson. Initially, a user rates how erotic each
document in a training set is on a scale of 1 to 5. The software then attempts to
classify the rest of the documents. Most interesting, though, is that by using a naïve
Bayesian classification the software can tell the user which individual words it thought
were potential indicators of the erotic. One researcher became particularly excited upon
seeing the word “mine” rank high on that list of words.
“The minute I saw it, I had one of those ‘I knew that’ moments. Besides
possessiveness, ‘mine’ connotes delving deep, plumbing, penetrating--all
things we associate with the erotic at one point or another. And Emily
Dickinson was, by her own accounting and metaphor, a diver who
relished going for the pearls. So ‘mine’ should have been identified as a
‘likely hot’ word, but has not been, oddly enough, in the extensive
literature on Dickinson’s desires.”
(http://www.thevalve.org/go/valve/article/poetry_patterns_and_provocation_the_nora_project/)
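One way a naïve Bayesian classifier can report indicator words like this is to rank each word by how much more likely it is under one class than the other. The sketch below assumes a two-class simplification of the 1 to 5 rating scale, add-one smoothing, and whitespace tokenization; the function name and the documents are invented, and none of this is taken from the Nora software itself.

```python
import math
from collections import Counter


def indicator_words(positive_docs, negative_docs, top_n=3):
    """Rank words by the smoothed log-ratio P(word | positive) / P(word | negative)."""
    pos = Counter(w for d in positive_docs for w in d.lower().split())
    neg = Counter(w for d in negative_docs for w in d.lower().split())
    vocab = set(pos) | set(neg)
    # Add-one smoothing (an assumption) so words absent from one class get a finite score.
    pos_total = sum(pos.values()) + len(vocab)
    neg_total = sum(neg.values()) + len(vocab)
    scores = {
        w: math.log((pos[w] + 1) / pos_total) - math.log((neg[w] + 1) / neg_total)
        for w in vocab
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Toy, invented documents; not drawn from any real corpus.
rated_high = ["pearl sea pearl", "sea pearl tide"]
rated_low = ["stone road", "road dust stone road"]
ranked = indicator_words(rated_high, rated_low)
print(ranked)
```

Words at the top of such a list are the ones the model leans on most when assigning the higher rating, which is what lets a scholar inspect and question the classifier's reasoning.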
So even in its earlier stages, the Nora project was able to surface something that,
by this researcher’s account, the extensive scholarship on Dickinson had overlooked. The Nora project eventually expanded
beyond experiments on the works of Dickinson to encompass a variety of 19th-century
British and American literature, and even now the Nora project team is working to
develop software that uses data mining to aid in the work of humanities scholars. In
January 2007, the Nora project merged with the Wordhoard project at Northwestern
University to create MONK (Metadata Offer New Knowledge), a digital environment
which aims to help scholars identify and analyze patterns in the texts they study. By
June 2008, they hope to install beta applications of MONK alongside several large digital
text collections so that any scholar may use data mining to gain insight into literary texts.
References
[1] The MONK project website, http://monkproject.org/
[2] Yi, Yong et al. “Studies on Traditional Chinese Poetry Style Identification”, August
2004, http://ieeexplore.ieee.org/iel5/9459/30022/01378534.pdf
[3] Yamasaki, Mayumi et al. “Discovering Characteristic Patterns from Collections of
Classical Japanese Poems”, 1998,
http://www.springerlink.com/content/b8620534v6p70154/fulltext.pdf
[4] The Nora project website, http://www.noraproject.org/
[5] Kirschenbaum, Matthew. “Poetry, Patterns, and Provocation: The nora Project”,
January 2006,
http://www.thevalve.org/go/valve/article/poetry_patterns_and_provocation_the_nora_project