The Socialist Network
Matje van de Camp*, Antal van den Bosch¹
Tilburg center for Cognition and Communication, Tilburg University, Warandelaan 2,
5037 AB Tilburg, Netherlands
Abstract
We develop and test machine learning-based tools for the classification of personal relationships in biographical texts, and for the induction of social networks from these classifications. A case study is presented based on several hundred biographies of notable persons in the Dutch social movement. Our classifiers mark relations between two persons (one being the topic of a biography, the other being mentioned in this biography) as positive, negative, or unknown, and do so at an above-baseline level. A training set centering on a historically important person is contrasted with a multi-person training set; the latter is found to produce the most robust generalization performance. Frequency-ranked positive and negative relationships predicted by the best-performing classifier, presented in the form of person-centered social networks, are scored by a domain expert; the mean average precision results indicate that our system is better at classifying and ranking positive relations (around 70% MAP) than negative relations (around 40% MAP).
Keywords: text mining, machine learning, social network extraction, sentiment
analysis, social history
1. Introduction
Web sites such as Facebook, Twitter and, more recently, Google+, allow us
to communicate with people across the globe with great ease. We are able to find
* Corresponding author. Tilburg center for Cognition and Communication, Tilburg University, Warandelaan 2 D335, 5037 AB Tilburg, Netherlands, +31 13 4662433.
Email addresses: [email protected] (Matje van de Camp), [email protected] (Antal van den Bosch)
¹ Current address: Centre for Language Studies, Radboud University Nijmegen, Nijmegen, Netherlands

Preprint submitted to Decision Support Systems, September 5, 2011
like-minded people anywhere in the world and share ideas with them. These Social
Networking Services are not the only web sites that successfully utilize the innate
desire of people to connect with one another. For instance, many online stores
use the principles of social networks to create a digital version of word-of-mouth
advertising by allowing users to write reviews of their products [4] or by targeting
specific influential opinion leaders within a network with their advertisements [14,
15]. Some news sites allow for the building of an online reputation by letting users
indicate whether they agree or disagree with someone’s comments [10].
The networks that arise from these activities are digitally recorded, creating a
multitude of data on human interactions. The availability of such structured data
has sparked new interest in the field of Social Network Extraction and Analysis.
However, little research has been done on the formation and evaluation of offline
social networks from a computational modeling point of view. This is largely due
to the scarcity of digitally available real-world records of such networks, with genealogy as a notable exception. Nevertheless, we do have other, often secondary,
sources that contain indirect traces of these networks. By applying the technology
of today to the heritage of our past, it may be possible to uncover yet unknown
patterns and provide a better insight into our society’s social development, whether
it be on- or offline.
In this paper we present a case study based on historical biographical information, so-called secondary historical sources, describing people in a particular
domain, region and time frame: the Dutch social movement between the mid-19th
and mid-20th century. “Social movement” refers to the socio-political-economic complex of ideologies, workers’ unions, political organizations, and art movements that arose from the ideas of Karl Marx (1818–1883) and his followers. In the Netherlands, a network of persons unfolded over time around leading figures such as Ferdinand Domela Nieuwenhuis (1846–1919) and Pieter Jelles Troelstra (1860–1930).
Although this network is implicit in all the primary and secondary historical writings documenting the period, and partly explicit in the minds of experts studying
the domain, there is no explicitly modeled social network of this group of persons.
Yet, it would potentially benefit further research in social history to have this in the
form of a computational model.
In our study we focus on detecting and labeling relations between two persons,
where one of the persons, A, is the topic of a biographical article, and the other
person, B, is mentioned in that article. The genre of biographical articles allows
us to assume that person A is topical throughout the text. What remains is to
determine whether the mention of person B can be labeled as positive or negative.
More fine-grained labels are possible, but the primary aim of our case study is to
build a basic network from robustly recognized person-to-person relations at the
highest possible accuracy. As our data consists of only several hundred articles describing a number of people of roughly the same order of magnitude, we are
facing data sparsity, and thus are limited in the granularity of the labels we wish to
predict.
This paper is structured as follows. Section 2 describes the project within which
this research is conducted. After a brief survey of related research in Section 3, we
describe our data and our annotation scheme in Section 4. In Section 5 we describe
how we implement relation classification as a supervised machine learning task.
The outcomes of the experiments on our data are provided in Section 6, followed
by an expert evaluation of outcomes on unseen data in Section 7. We discuss our
findings, formulate conclusions, and identify points for future research in Section 8.
2. HiTiME
The research presented here is conducted within the HiTiME project, which
stands for Historical Timeline Mining and Extraction. The project is a collaboration between Tilburg University and the International Institute of Social History
(IISH), which is located in Amsterdam.
The main goal of HiTiME is to create an extensible knowledge base regarding
the history of the social movement based on data provided by IISH, and to make
this knowledge base searchable and browsable in a meaningful way. Many of the
data sources available at IISH are textual, and contain overlapping or complementary information, but thus far, few meaningful links exist between these sources.
Historical researchers who wish to gather information on a specific topic spend
considerable amounts of time and energy sifting through all that is available in
search of what is relevant to them. This method of searching also assumes that
researchers already have a clear picture of what it is they are looking for and where
to find it, decreasing the chances of them finding the unexpected. Text mining and
information extraction can provide the tools needed to automatically cluster pieces
of related data, making the search process far easier and leaving more time for
innovative research and serendipitous findings.
The data provided by IISH to HiTiME consists of both primary and secondary
historical sources of text. Included are (auto)biographies, letters, archive listings,
and databases. An obvious entry point into all this data is formed by named entities,
since most of our cultural history is made up of the thoughts and actions of people
and organizations. These thoughts and actions are expressed through the various
ways that the entities relate to each other. We consider the network of interpersonal
relations to be the most appropriate starting point, since people are the most atomic
entities in this scenario. By tracking the interconnectivity between them over time,
we assume that the network of organizations will at least partly unfold as well.
A biography generally contains information about one specific person, compressed to present only the most relevant facts, often in chronological order. One
of the data sources maintained by IISH is the Biographical Dictionary of Socialism and the Labour Movement in the Netherlands (henceforth BWSA). It contains
biographies of many of the main actors in the rise of Dutch socialism and forms a
suitable case study for our system.
3. Theory
Our research combines Social Network Extraction and Sentiment Analysis. We
briefly review related research in both areas.
3.1. Social Network Extraction
A widely used method for determining the relatedness of two entities was first
introduced by Kautz et al. [12]. They compute the relatedness between two entities by normalizing their co-occurrence count on the Web with their individual hit
counts using the Jaccard coefficient. If the coefficient reaches a certain threshold,
the entities are considered to be related. For disambiguation purposes, keywords
are added to the queries when obtaining the hit counts.
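To make the measure concrete, here is a minimal sketch of this hit-count normalization; the hit counts and the threshold below are illustrative placeholders, not values from [12]:

```python
def jaccard_relatedness(hits_a: int, hits_b: int, hits_ab: int) -> float:
    """Jaccard coefficient over Web hit counts: the co-occurrence count
    normalized by the size of the union of the two result sets."""
    union = hits_a + hits_b - hits_ab
    return hits_ab / union if union > 0 else 0.0

# Hypothetical hit counts for the queries "A", "B", and "A" AND "B"
# (possibly extended with a disambiguating keyword):
THRESHOLD = 0.01  # an assumed cut-off; neither value is from the paper
related = jaccard_relatedness(12_000, 8_500, 340) >= THRESHOLD
```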
Matsuo et al. [16] apply the same method to find connections between members
of a closed community of researchers. They gather person names from conference
attendance lists to create the nodes of the network. The affiliations of each person
are added to the queries as a crude form of named entity disambiguation. When a
connection is found, the relation is labeled by applying minimal rules, based on the
occurrence of manually selected keywords, to the contents of websites where both
entities are mentioned.
A more elaborate approach to network mining is taken by Mika [17] in his presentation of the Flink system. In addition to Web co-occurrence counts of person
names, the system uses data mined from other—highly structured—sources such
as email headers, publication archives, and so-called Friend-Of-A-Friend (FOAF)
profiles. Co-occurrence counts of a name and different interests taken from a predefined set are used to determine a person’s expertise and to enrich their profile.
These profiles are then used to resolve named entity co-reference and to find new
connections.
Elson et al. [9] use quoted speech attribution to reconstruct the social networks
of the characters in a novel. Though this work is most related through the type
of data used, their method can be considered complementary to ours: where they
relate entities based on their conversational interaction without further analysis of
the content, we try to find connections based solely on the words that occur in the
text.
Efforts in more general relation extraction from text have focused on finding
recurring patterns and transforming them into triples (RDF). Relation types and
labels are then deduced from the most common patterns [20, 8]. These approaches
work well for the induction and verification of straightforwardly verbalized factoids, but they are too restricted to capture the multitude of aspects that surround
human interaction; a case in point is the kind of relation between two persons,
which people can usually infer from the text mentioning the relation, but which is rarely
explicitly described in a single triple.
3.2. Sentiment Analysis
Sentiment analysis is concerned with locating and classifying the subjective
information contained in a text. Subjectivity is inherently dependent on human
interpretation and emotion. A machine can be taught to mimic these aspects, given
enough examples, but the interaction of the two is what makes humans able to
understand, for instance, sarcasm or irony [21].
Although the general distinction between negative and positive is intuitive for
humans to make, subjectivity and sentiment are very much domain and context
dependent. Depending on the domain and context, a single sentence can have
opposite meanings [18].
The sentiment or tone of a text can be measured with varying granularities: at
the word level, in clauses, sentences [23], paragraphs [1], or even at the level of
the text as a whole [5]. Many of the approaches to automatically solving tasks like
these involve using lists of positively and negatively polarized words or phrases to
calculate the overall sentiment [19]. As shown by Kim and Hovy [13], the order
of the words potentially influences the interpretation of a text. Pang et al. [19] also
found that the presence of a word is more important than the number of times it
appears.
Word sense disambiguation can be a useful tool in determining polarity. Turney [22] proposed a simple yet effective way to determine polarity at the word level. He calculates the difference between the pointwise mutual information of a phrase and the word 'excellent' and that of the same phrase and the word 'poor'.
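In [22] this difference is estimated from search-engine hit counts, using the NEAR operator:

```latex
\mathrm{SO}(p) = \mathrm{PMI}(p,\text{excellent}) - \mathrm{PMI}(p,\text{poor})
  = \log_2 \frac{\mathrm{hits}(p\ \mathrm{NEAR}\ \text{excellent}) \cdot \mathrm{hits}(\text{poor})}
                {\mathrm{hits}(p\ \mathrm{NEAR}\ \text{poor}) \cdot \mathrm{hits}(\text{excellent})}
```

A positive semantic orientation SO(p) marks the phrase p as positively polarized, a negative one as negatively polarized.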
Though much of the research in sentiment analysis aims at finding and labeling
explicit mentions of sentiment, in most cases sentiment or emotion is expressed
only implicitly. With this in mind Balahur et al. [2] try to find situation descriptions
that trigger certain emotional reactions based on common sense knowledge and
gather models of these situations in a knowledge base.
Most related to our own research is the work of Van Atteveldt et al. [23]. They
classify connections between politicians and political issues as either positive or
negative, based on news coverage of the Dutch parliamentary elections of 2006,
using syntactic analysis and word similarity measures. They compare the predictive power of sentence-based features, predicate-based features, and a combination
thereof, and find that the combination yields the best results.
4. Data and Annotation
4.1. Data
Much of the research performed on social network extraction starts with an explicit record of the network and attempts to find evidence to confirm this record. In
contrast, the only explicit information we have in this study is a person’s biography.
We use the Biographical Dictionary of Socialism and the Workers’ Movement
in the Netherlands (BWSA) as input for our system. This digital resource consists
of 574 biographical articles, in Dutch, relating to the most notable actors within
the domain. The texts are accompanied by a database that holds such metadata
as a person’s full name and known aliases, dates of birth and death, and a short
description of the role they played within the Workers’ Movement. The articles
were written by over 200 different authors, thus the use of vocabulary varies greatly
across the texts. The length of the biographies also varies: the shortest text has 308
tokens while the longest has 7,188 tokens. The mean length is 1,546 tokens with a
standard deviation of 784.
The biographies are all available online² and are accessible either through an
alphabetized index or by querying the articles using string search with minimal
Boolean operators. A query result links to the full article in which the search terms
are highlighted. Each first mention of a person that is not the main entity and that
also has a biography in the BWSA links to that person’s biography. These links
were all added manually. To do this for all mentions would be an arduous task.
Still, since a biography spans an entire life and relationships are not static, we need
to locate every mention of every person in all biographies and classify the context
in which they are mentioned to determine their position within the social network
at different times.
Apart from the mention of a person’s name, the personal relation itself is, in
most cases, not explicitly mentioned. In rare cases words like “friend” or “opponent” are used. The subtle use of language can make it difficult even for humans to
judge whether someone is more positively or negatively related to someone else.
Against these odds we aim to investigate how well we can solve the task of
recreating the social network of Dutch socialism from biographical material using
as little external knowledge as possible.
² http://www.iisg.nl/bwsa/
We create two manually annotated subsets from the BWSA: one that is specifically centered around a single entity, namely Ferdinand Domela Nieuwenhuis
(FDN set), and one that contains data on randomly selected entities (Generic set).
Each instance in these data sets consists of five sentences of which the middle
one, the focus sentence, contains the entity mention that we want to relate to the
main entity of the biography. We aim to investigate which of these sets is the most
representative when applied to the entire data set.
Domela, as he is commonly known, was a controversial key figure in the starting period of the Dutch social movement. He started his career as a Lutheran
preacher, but soon parted with the church to become the pioneer of Dutch social
democracy. When a growing division between socialists and anarchists divided
the party, he turned his back on politics and became an anarchist himself. These
changes in ideology combined with his renown may provide sufficiently rich and
varied data to form a representative sample of the relations we want our classifiers
to discover in unseen data. However, since interpersonal relations are inherently
linked with personalities, placing the focus on one particular entity might skew the
data in unpredictable ways. Using randomly selected instances of random people
mentioned in random biographies should avoid this personality bias.
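A minimal sketch of how such five-sentence instances can be cut from a sentence-split biography; the function and variable names are our own, since the paper does not publish its extraction code:

```python
def make_fragment(sentences: list[str], focus_idx: int, width: int = 2) -> list[str]:
    """Return the five-sentence window (the focus sentence plus the two
    sentences on either side) that forms one annotation instance;
    the window is truncated at biography boundaries."""
    start = max(0, focus_idx - width)
    end = min(len(sentences), focus_idx + width + 1)
    return sentences[start:end]

# One instance per located mention of person B in person A's biography:
# fragments = [make_fragment(bio_sentences, i) for i in mention_indices]
```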
4.2. Annotation
The FDN set was annotated by two human annotators; the generic set was
annotated by one annotator. The task description for both sets was the following:
Given the fragment under consideration:
(a) Does it describe a relation between the main entity and the entity
mentioned in the focus sentence?
(b) If so, is the described relation of a negative, neutral or positive
nature?
Figure 1 shows some example fragments; the focus sentence of each fragment is marked with [focus]. The first fragment clearly describes an active collaboration between
person A (Ansing) and person B (Domela), which is classified as a positive relation.
Though the relation is, in this case, obviously two-sided, we only annotate the
relation from A to B, since the current context is A’s biography. If the relation
from B to A is of enough importance in the context of B’s life, it will most likely
also be mentioned in B’s biography. By extracting all relations from all biographies
in this manner, we will be able to reconstruct the entire network.
From the second fragment we can conclude beyond doubt that Paris and Baart
were aware of each other and moved within the same circles. However, nowhere
in the fragment are they directly linked to one another and no hint is given whether
their connection was positive or negative.
The third fragment is different in the sense that person A (Mannoury) and person B (Wijnkoop) are never mentioned in the same sentence. The focus sentence
tells us of an opposition against persons B and C (Van Ravesteyn), but the fragment
does not give us any information about person A’s connection to that opposition.
The last sentence does place him opposite the “party leaders”, which actually refers
to Wijnkoop and Van Ravesteyn, but it is impossible to infer this without exact domain knowledge. When looking at the entire biography, we see that ten sentences
before this fragment it is said that “during World War I he (Mannoury) and, among
others, H. Gorter formed an opposition against the two party leaders D.J. Wijnkoop
and W. van Ravesteyn” (our translation). The information needed to tie all the hints
together is not included in the fragment itself.
Due to these intricacies in the data and our choice to present fragments of only
five sentences, it is difficult for humans to perform the task at hand in many cases.
Unsurprisingly, the inter-annotator agreement for the FDN set was fairly low. The
annotators agreed on the existence of a relation in 74.9% of the cases. On the
negative, neutral, and positive classes they agreed at 60.8%, 24.2%, and 66.5%,
respectively. All disagreements were resolved through discussion. The neutral
class proved to be most difficult to resolve. A logical explanation for this is that
while positive and negative judgments can both have varying intensities, neutral is
a point midway between the extremes. Any deviation from this point makes it not
neutral, and a point is much more difficult to capture than a range.
A similar issue surfaced when looking at the disagreements for the unrelated
class. The mere fact that a person is mentioned in another person’s biography is
often enough to assume that they are in some way connected, even though it may
not be explicitly said that they are. Therefore, we decided to group together the
neutral and unrelated classes into a single class of instances with unknown polarity.
This approach allows us to treat every mention of a named entity as a connection
of which we only have to determine whether the polarity leans to being negative or
positive.
The resulting class distributions for both sets are listed in Table 1. They are
roughly the same for both data sets and seem to be a fair representation of human
relations. In general, people are more likely to approach someone in a positive
manner, if only to avoid unnecessary conflict, which is reflected by the larger positive class. The unknown class possibly represents people we are somehow associated with, but as yet have no real opinion of. Once we do form an opinion
about them, they move either to the positive or the negative class. Since in general
people tend to avoid conflict and strive for social bonds, it is plausible to assume
that negative relations are lower in number and shorter in duration than positive
Ansing [PER-A] and Domela Nieuwenhuis [PER-B] were in written contact with each
other since August 1878.
Domela Nieuwenhuis [PER-B] probably wrote uplifting words in his letter to
Ansing [PER-A], which was not preserved, after reading Pekelharing's [PER-C]
report of the program convention of the ANWV in Vragen des Tijds, which was all
but flattering for Ansing [PER-A].
In this letter, Domela [PER-B] also offered his services to Ansing [PER-A] and
his friends. [focus]
Domela Nieuwenhuis [PER-B] used this opportunity to ask Ansing [PER-A] several
questions about the conditions of the workers, the same that he had already asked
in a letter to the ANWV in 1877, which had been left unanswered.
Ansing [PER-A] answered the questions extensively.

Paris [PER-A] joined a group of ceramists around Servaas Baart [PER-B].
In 1892, he belonged to one of the seven founders of the ceramists association
Loon naar Werk, the first trade union in Maastricht.
In 1897 he became secretary and soon he took over a part of the daily work of
trade union chairman Baart [PER-B], who was occupied by other activities within
the socialist movement. [focus]
Around 1900 Paris [PER-A] succeeded Baart [PER-B], who had become administrator
of the cooperation Het Volksbelang, as chairman of Loon naar Werk.
Paris [PER-A] was among the small group of socialists who revived the ailing
SDAP department in Maastricht at the end of last century.

When the office found out that the police was informed of the meeting, they
continued the discussions at his home.
In 1925 Mannoury [PER-A] played an important role in the crisis within the CPN.
For some time there had been an opposition within the party against
Wijnkoop [PER-B] and Van Ravesteyn [PER-C]. [focus]
The contrasts became clearer during preparations for the parliamentary elections
in July 1925.
When the party leaders wanted to defer from a Comintern resolution regarding the
candidacy at the party congress in May, Mannoury [PER-A] saw it as a breach of
discipline.
Figure 1: English translations of some example fragments.
Class        Generic         FDN
             No.     %       No.     %
negative      86   16.1       74   16.7
positive     238   44.6      212   48.0
unknown      210   39.3      156   35.3
total        534  100        442  100

Table 1: Class distribution
ones, which would explain the small size of the negative class.
5. Relation Extraction and Classification
5.1. Relation Extraction
First, all biographies are lemmatized, POS-tagged and parsed using Frog, a
morpho-syntactic analyzer for Dutch [25]. Next, we use a custom-made NER
module to locate all person names in the data. The NER module consists of a
classifier-based sequence processing tool trained on contemporary newspaper data
and a subset of 70 biographies of the BWSA, all fully annotated with named entity
categories Person, Organization, Location and Miscellaneous. In testing, the module only reached a score of 48.3% on the Person category. Therefore a considerable
amount of noise exists in the NER output.
To identify the person to which a named entity refers, the name is split into
chunks representing first name, initials, infix and surname. These chunks, as far
as they are included in the string, are then matched against the BWSA database.
If no match is found, the name is added to the database as a new person. For
now, however, we treat the network as a closed community in testing, by only
extracting those fragments in which person B is one that already has a biography
in the BWSA. At a later stage, biographies of people from outside the BWSA can
be gathered and used to determine their position within the network. This step
filters out most of the NER errors.
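A rough sketch of this chunk-and-match step; the infix list (Dutch tussenvoegsels), the heuristics, and the database layout are all our own assumptions, not the paper's actual module:

```python
import re

DUTCH_INFIXES = {"van", "de", "der", "den", "te", "ten", "ter"}  # not exhaustive

def chunk_name(name: str) -> dict:
    """Split a recognized name into first name, initials, infix and surname."""
    tokens = name.split()
    parts = {"first": "", "initials": "", "infix": "", "surname": ""}
    i = 0
    # A leading run of initials like "D.J.", or else a capitalized first name.
    if re.fullmatch(r"(?:[A-Z]\.)+", tokens[0]):
        parts["initials"] = tokens[0]; i = 1
    elif len(tokens) > 1 and tokens[0][0].isupper():
        parts["first"] = tokens[0]; i = 1
    # Lower-cased infix tokens between first name and surname.
    infix = []
    while i < len(tokens) - 1 and tokens[i].lower() in DUTCH_INFIXES:
        infix.append(tokens[i].lower()); i += 1
    parts["infix"] = " ".join(infix)
    parts["surname"] = " ".join(tokens[i:])
    return parts

def match_person(name: str, db: list[dict]) -> dict | None:
    """Match the chunks present in the string against BWSA database records
    (dicts with the same four keys, in this sketch)."""
    chunks = chunk_name(name)
    for record in db:
        if chunks["surname"] == record["surname"] and all(
                not chunks[key] or chunks[key] == record[key]
                for key in ("first", "initials", "infix")):
            return record
    return None  # unknown person: would be added to the database as new
```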
5.2. Relation Classification
We define several sets of features to train the classifiers. They can be divided
into two categories: lexical features that are based on the lemmata of the fragments,
and co-occurrence features that give an initial indication of the relatedness between
person A and B. Almost all of the feature sets are based on the text itself, in order
to keep the dependence on external information sources to a minimum. The feature
sets and classification process are described in detail below.
5.2.1. Lexical Features
We define five sets of lexical features. Preliminary investigations revealed that
keeping only verbs and nouns works best for the classification of the relations [24].
Person names were found to be of less significance, but their inclusion does not hurt
the classification either, so we decided to include them as well for the formation
of dependency triples. All person names that are mentioned in each fragment are
anonymized by replacing them with 'PER-X', where X is a letter from A to Z. The main person of the biography is always denoted by 'PER-A' and the person we want to relate that person to is denoted by 'PER-B'. The number of persons mentioned in a biography never exceeds 26.
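A minimal sketch of this anonymization, assuming the names in a fragment have already been located and resolved by the modules from Section 5.1:

```python
import string

def anonymize(fragment: str, main_person: str, focus_person: str,
              other_persons: list[str]) -> str:
    """Replace every located person name in the fragment with PER-X.
    The biography's subject is always PER-A and the focus mention PER-B;
    remaining names get the next free letters (never more than 26)."""
    mapping = {main_person: "PER-A", focus_person: "PER-B"}
    letters = iter(string.ascii_uppercase[2:])  # C, D, E, ...
    for name in other_persons:
        mapping.setdefault(name, f"PER-{next(letters)}")
    # Replace longer names first so that e.g. "Domela Nieuwenhuis" is not
    # partially rewritten by the entry for "Domela".
    for name, label in sorted(mapping.items(), key=lambda kv: -len(kv[0])):
        fragment = fragment.replace(name, label)
    return fragment
```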
• lemmata - The first set consists of only the verbs and nouns in their canonical
form plus the anonymized person names. This set is the simplest in its form
and will serve as a second baseline next to the majority class baseline. There
are 5,184 lemmata in this set.
The dependency parse allows us to use more structured features. We create
two separate sets from the parse. We expect that the triples will reveal patterns that
give more structured information to the classifier about how the person mentions
are related to one another within the fragment and thus help the classification.
• triples - All dependency triples consisting of head-relation-dependent that
contain a verb, a noun or an anonymized person name as either the head or
the dependent. This set has a total of 16,000 features.
• tuples - For this set, we split the dependency triples into two tuples: head-relation and relation-dependent. It has a total of 12,011 features.
Since our data sets are relatively small, we expect that the lemma-based dependency features will not cover the entire data set and therefore will not perform well in the classification. In an attempt to make the data more general, we use Brouwers' thesaurus for Dutch (1989), which is similar in structure to Roget's Thesaurus for English. The same thesaurus is used by Van Atteveldt et al. [23] for word sense disambiguation in their classification of relations in Dutch political texts. The thesaurus contains 10 primary classes, which are further divided into 41 subdivisions and 997 sections. A single lemma can fall into different categories on each level, depending on its sense. For each dependency triple in the sets described above, we look up the main categories of its head and dependent (a sketch of this mapping follows the list below). We create two feature sets in this manner. They combine the structure of the dependency parse with a more general representation of the lemmata in the text and are the smallest in size. We expect these feature sets to perform best.
• thesaurus - A selection of the dependency triples from the triples set described above of which both the head and the dependent are found in the
thesaurus. The head and dependent are replaced with their associated main
class in the thesaurus. If for either there exists more than one entry with
a different main class, multiple triples are created until all combinations of
senses of both words are included. This set contains 1,896 features.
• thesaurus + WSD - This set is the same as thesaurus, except that only the
first, most common sense of both head and dependent are included. This can
be seen as a crude form of word sense disambiguation. The set contains a
total of 913 features.
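A sketch of how both thesaurus sets can be derived from the dependency triples; we assume here that the thesaurus is available as a mapping from lemma to a frequency-ordered list of main classes (an assumption about the resource's layout):

```python
def thesaurus_triples(triples, thesaurus, wsd=False):
    """Map dependency triples (head, relation, dependent) onto thesaurus
    main classes. With wsd=False, all combinations of the senses of head
    and dependent are generated; with wsd=True, only the most common
    sense of each is kept (the crude WSD of the second set)."""
    features = set()
    for head, relation, dependent in triples:
        if head not in thesaurus or dependent not in thesaurus:
            continue  # keep only triples with both words in the thesaurus
        head_classes = thesaurus[head]
        dep_classes = thesaurus[dependent]
        if wsd:
            head_classes, dep_classes = head_classes[:1], dep_classes[:1]
        for hc in head_classes:
            for dc in dep_classes:
                features.add((hc, relation, dc))
    return features
```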
5.2.2. Co-occurrence Features
Previous research has shown that co-occurrence features do not significantly enhance the classification when combined with lemmata [24]. However, combined with the more informative dependency triples they might help to better distinguish between positive and negative relations. We define four different measures of co-occurrence. For all measures, the co-occurrence count is divided by the sum of all individual occurrences of persons A and B (see the sketch after the list).
• fragment - The number of sentences within the fragment that contain a mention of both person A and person B.
• document - The number of sentences in the current biography in which both
person A and B are mentioned.
• BWSA - The number of biographies in which both person A and B are mentioned.
• Yahoo! - The number of hits for a query containing the full names of persons
A and B on the search engine Yahoo!.
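A sketch of the shared normalization, with illustrative count names for the four scopes:

```python
def co_occurrence(count_ab: int, count_a: int, count_b: int) -> float:
    """Normalize a joint count by the summed individual occurrence
    counts of persons A and B, as described above."""
    total = count_a + count_b
    return count_ab / total if total > 0 else 0.0

# Applied at the four scopes (the count variables are illustrative):
# fragment : co_occurrence(n_shared_sents_fragment, n_a_fragment, n_b_fragment)
# document : co_occurrence(n_shared_sents_bio, n_a_bio, n_b_bio)
# BWSA     : co_occurrence(n_bios_with_both, n_bios_a, n_bios_b)
# Yahoo!   : co_occurrence(hits_a_and_b, hits_a, hits_b)
```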
5.2.3. Classification
We take a supervised machine learning approach to classifying the relations.
To perform the experiments, we make use of Weka, a Java-based machine learning
workbench [11]. We test which of the following classification algorithms performs
best on the task: Naive Bayes; JRip, a Java implementation of Cohen’s Ripper algorithm [6], with both the number of optimizations and seeds set to 5; LibSVM [3]
with a linear kernel and a cost factor of 0.5; and the k-nearest neighbor algorithm
with k set to 1 [7]. Support vector machines and k-nearest neighbor algorithms are
especially suited for text classification; we therefore expect them to do better than
Naive Bayes or the rule-based JRip algorithm. However, we decided to include
these methods to investigate the degree of robustness of the data.
All features are binary, except for the co-occurrence features, which are real-valued. For each algorithm, we implement two different setups (a sketch follows the list):
• pipe - A cascading pipeline which first classifies and filters out the positive
instances, then classifies the negative. The instances that have received no
label at the end of the pipeline are classified as unknown.
• joint - A joint learning setup where all three classes are classified simultaneously.
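A minimal sketch of the two setups, using scikit-learn's LinearSVC as a stand-in for the Weka classifiers actually used; X is a NumPy feature matrix, y a NumPy array of the labels 'positive', 'negative', 'unknown':

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_pipe(X, y):
    """Cascade: first positive vs rest, then negative vs rest on what
    remains; anything left unlabeled becomes 'unknown'."""
    pos_clf = LinearSVC().fit(X, y == "positive")
    rest = y != "positive"
    neg_clf = LinearSVC().fit(X[rest], y[rest] == "negative")
    return pos_clf, neg_clf

def predict_pipe(pos_clf, neg_clf, X):
    labels = np.full(len(X), "unknown", dtype=object)
    is_pos = pos_clf.predict(X).astype(bool)
    labels[is_pos] = "positive"
    rest_idx = np.flatnonzero(~is_pos)
    is_neg = neg_clf.predict(X[rest_idx]).astype(bool)
    labels[rest_idx[is_neg]] = "negative"
    return labels

def train_joint(X, y):
    """Joint setup: one multi-class classifier over all three labels."""
    return LinearSVC().fit(X, y)
```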
6. Results
6.1. Cross Validation
A 90–10 split is made of both data sets, creating two large training sets, and
two smaller sets which we set aside for testing. As a first experiment and to set
a baseline with the simplest type of features, we compare performance for all five
classifiers on the lemma features. The results are displayed in the top row of Tables 2 and 3. All reported scores are macro averaged F-measures. For both data
sets all classifiers score above the majority class baseline on the lemma feature set.
LibSVM gives the highest score on the Generic set, while JRip gives the highest
score on the FDN set, both in a pipeline setup.
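For clarity: macro averaging here is the unweighted mean of the per-class F-scores, which is consistent with the per-class and macro-averaged figures reported in Tables 5 and 6:

```latex
F_{\text{macro}} = \frac{1}{|C|} \sum_{c \in C} \frac{2\, P_c R_c}{P_c + R_c},
\qquad C = \{\text{positive}, \text{negative}, \text{unknown}\},
```

where P_c and R_c are the precision and recall of class c.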
When we look at the F-measures for the different classes we see that LibSVM
only reaches a score of 2.2% on the negative class for the Generic set. JRip on the
FDN set attains 10.3% on this same class. JRip’s better performance could be due
to the fact that the FDN set is more homogeneous when it comes to the people and
issues it describes, since the set is focused around a single entity, thus making it
easier to separate the classes using simple rules.
Next, we consecutively replace the lemmata with the triples and the tuples,
to test whether adding minimal grammatical structure improves the classification.
These results are displayed in the middle rows of Tables 2 and 3.
In most cases, the tuples set outperforms the triples set. This is most likely due
to its smaller size. Since there are fewer features, there is bound to be more overlap
between instances, making the classes more easily separable. The tuples set, however, is more than twice the size of the lemmata set. This has a detrimental effect
on performance and thus the scores rarely exceed the lemmata baseline, which is in
line with our expectations. The JRip classifier, however, is not affected by this explosion of features due to its strong built-in feature selection capabilities. It attains
its highest scores using the tuples set for both data sets in both setups.
             Naive Bayes      JRip             LibSVM           1-NN
             pipe    joint    pipe    joint    pipe    joint    pipe    joint
lemmata      31.2*   28.0     27.7    27.7     34.2    32.4     31.6    32.9*
triples      22.1    23.0     25.7    23.9     31.7    32.2     24.8    26.4
tuples       25.7    25.3     32.5*   31.4*    33.5    30.5     28.7    32.1
thesaurus    29.9    33.2     31.5    28.9     35.7*   31.6     33.9*   30.0
th. + WSD    30.9    34.2*    29.0    28.0     33.9    35.4*    32.8    31.9

Table 2: Macro-averaged F-measures for 10-fold cross-validation on 90% of the
Generic set; majority class baseline 20.6%. The highest score for each classifier
and setup over all feature sets is marked with an asterisk.
             Naive Bayes      JRip             LibSVM           1-NN
             pipe    joint    pipe    joint    pipe    joint    pipe    joint
lemmata      31.6*   22.6     33.5*   30.3     32.7    32.9     33.1    32.7
triples      26.0    21.8     28.1    28.2     31.8    31.6     34.0*   29.3
tuples       29.4    23.3     33.5*   32.2*    29.6    29.1     33.2    31.6
thesaurus    29.3    31.5*    31.1    31.4     33.7*   32.1     31.7    34.0*
th. + WSD    30.3    28.1     30.7    27.3     32.6    33.3*    33.0    31.4

Table 3: Macro-averaged F-measures for 10-fold cross-validation on 90% of the
FDN set; majority class baseline 21.6%. The highest score for each classifier and
setup over all feature sets is marked with an asterisk.
The results on the thesaurus feature sets are displayed in the bottom rows of
Tables 2 and 3. As hypothesized, these sets produce the best performance with
most classifiers. The crude word sense disambiguation of the second thesaurus set
does not consistently improve the overall classification scores. When looking at
the individual classes’ scores, we see that it mostly has a negative effect on the
negative class.
Overall, LibSVM attains the highest scores for the Generic set with the ambiguous thesaurus set in a pipeline setting. The F-scores for the positive, negative
and unknown classes are, respectively, 48.2%, 13.7%, and 45.3%. The joint learning setup for the same data set and algorithm achieves the second best score with
the disambiguated thesaurus set, with scores of 48.6%, 14.8%, and 42.8% for the
respective classes. When we look at the precision and recall scores we see that in
the joint learning setup precision slightly decreases for the positive and negative
classes, in favor of recall.
For the FDN data set, 1-NN scores the best. It achieves the same F-score in
the pipeline setup with the triples feature set, as it does in the joint setup with the
undisambiguated thesaurus features. The F-scores for the positive, negative and
unknown classes in the triples pipeline are 60.6%, 10.7%, and 30.7%, respectively.
Generic            LibSVM   1-NN
thesaurus + WSD     35.4    31.9
+ fragment          35.4    31.4
+ document          34.1    31.4
+ BWSA              36.7*   32.9
+ Yahoo!            34.9    32.8
+ all co-occ.       35.0    33.1*

FDN                LibSVM   1-NN
thesaurus           32.1    34.0*
+ fragment          33.9    32.8
+ document          33.6    33.1
+ BWSA              32.1    34.0*
+ Yahoo!            33.0    34.0*
+ all co-occ.       35.3*   32.1

Table 4: Macro-averaged F-measures for 10-fold cross-validation on 90% of the
Generic and FDN sets. The highest score for each data set and classifier over all
feature sets is marked with an asterisk.
For the thesaurus joint learning classifier they are 52.5%, 20.3%, and 29.2%, respectively. In the joint learning setup, again, we see a decrease in precision, this
time for the negative and unknown classes. For the negative class it drops from
44.4% to 17.4%, while recall increases from 6.1% to 24.2%. For the positive class,
precision slightly increases, while recall drops from 80.0% to 56.3%. This indicates that in the pipeline setup the first classifier tends to give priority to the
positive class in its classification. As a consequence few instances get passed on
to the second classifier, which leads to a high precision but low recall on the small
negative class.
The task of the system, ultimately, is to judge whether connections drawn
from a neutral or non-polarized network are either more positive or more negative. Thus, it is more important to correctly classify an instance than to find all
connections belonging to each class. We therefore favor precision over recall, especially with regards to the large positive class, and choose the joint learning setup
over the pipeline.
As a final experiment we pair the best performing feature set for each data
set with all or one of the co-occurrence features in a joint learning setup. For the
Generic data set, the best performing feature set is the disambiguated thesaurus set;
for the FDN data set it is the ambiguous thesaurus set. The results are displayed in
Table 4.
LibSVM is the best scoring algorithm for both data sets. There seems to be
no consistency as to which co-occurrence feature works best. For the Generic set,
the BWSA co-occurrence measure produces the highest score; for the FDN set all
co-occurrence measures together perform best.
6.2. Evaluation on Held Out Data
Overall, performance on the Generic set is consistently better than performance on the FDN set in cross-validation. To assess their performance on unseen
Generic (thesaurus + WSD + BWSA)
             Precision   Recall   F-measure
positive        48.5      66.7      56.1
negative         0.0       0.0       0.0
unknown         33.3      23.8      27.8
macro avg.      27.3      30.2      28.0

FDN (thesaurus + all co-occ.)
             Precision   Recall   F-measure
positive        47.4      40.9      43.9
negative        28.6      25.0      26.7
unknown         20.0      25.0      22.2
macro avg.      32.0      30.3      30.9

Table 5: Top: Results when training on 90% and testing on 10% of the Generic
set; majority class baseline 20.5%. Bottom: Results when training on 90% and
testing on 10% of the FDN set; majority class baseline 21.6%.
FDN, tested on Generic (thesaurus + all co-occ.)
             Precision   Recall   F-measure
positive        45.8      45.8      45.8
negative         0.0       0.0       0.0
unknown         35.7      47.6      40.8
macro avg.      27.2      31.1      28.9

Generic, tested on FDN (thesaurus + WSD + BWSA)
             Precision   Recall   F-measure
positive        57.9      50.0      53.7
negative        16.7      12.5      14.3
unknown         33.3      43.8      37.8
macro avg.      36.0      35.4      35.3

Table 6: Top: Results when training on 90% of the FDN set and testing on 10%
of the Generic set; majority class baseline 20.5%. Bottom: Results when training
on 90% of the Generic set and testing on 10% of the FDN set; majority class
baseline 21.6%.
data, we test the best LibSVM joint learning classifier for each data set on a held
out test set. The results are listed in Table 5.
Unfortunately the classifier trained on the Generic set is unable to correctly
classify any instances as negative. The accuracy on this class is 72.2%, which
means that the classifier does not completely resort to majority class voting and
does select some instances as negative (though not the correct ones). The FDN
classifier performs better on the negative class, but consequently has a lower recall
on the positive class, leading to a comparably low average F-measure.
To test whether classifiers trained on the two data sets generalize well to their
mutual test sets, we test the Generic classifier on the held out test set of the FDN
data set, and vice versa. The results are listed in Table 6. Again, when testing on
the held out Generic set, we see that no instances are ever classified as negative.
The macro averages of precision, recall and F-measure are roughly the same as
when we train and test on the Generic set.
The Generic classifier performs better when tested on the FDN set. In fact, it
outperforms the FDN classifier for both the positive and the unknown class, leading
to F-measures that are 10 to 15 points higher. Its performance on the negative class
Figure 2: Left: Frequency distribution of the number of instances per connection. Right: Mean average precision at different cut-offs of the number of instances per connection.
is worse than when trained on FDN, but higher than when trained and tested on
Generic. We conclude from this that the task of classifying generic relationships
from a large pool of connections between different people is more difficult than
classifying relationships that only pertain to one specific entity, such as Domela.
7. Expert Evaluation
In order to evaluate the system's performance in a real-world setting, we train
a classifier on the full Generic data set with the best selected features and use
this to classify all remaining person mentions in the BWSA. We extract top-20
lists of both negative and positive relations for four main characters in the data:
Domela, Eduard Douwes Dekker (1820–1887), Franc van der Goes (1858–1939),
and Henriette van der Schalk (1869–1952). These persons are chosen based on the
above-average length of their biography, which we take as an indication of their
importance throughout the data, to make sure enough connections are found to
create the top-20 lists. The lists are ranked by frequency, placing the connection
for which most instances were found at the top.
In total, the ranked lists are made up of 329 instances: 119 negative instances,
with an average of 1.5 (± 0.1) instances per connection, and 210 positive instances,
with an average of 2.6 (± 0.9) instances per connection. The left graph in Figure 2
shows the frequency distribution of the number of instances returned per relation
for both the positive and negative classes over all ranked lists. The maximum number of instances returned for the negative class is 6; for the positive class the maximum is 11. Besides this, the most notable difference we see is that the connections
Figure 3: Validated network of positive relations of Ferdinand Domela Nieuwenhuis.
for which only one instance is returned constitute over 65% of the negative results,
but only around 25% of the positive results.
We asked an expert in the field of Dutch social history to judge these ranked
lists of friends and foes. The task for the negative relations was to indicate whether
the person mentioned ever had a (major or minor) conflict with the main person.
For the positive relations the task was to indicate whether the person mentioned
ever had a (long or short) positive encounter with the main person. The intensity of
the relation is not factored in. Since it is unrealistic to expect one person to know
everything there is to know about the domain in such detail, instead of only 'yes' or 'no', the expert could also choose 'unknown' in case of doubt or unfamiliarity
with the person(s) in question.
Of the positive rankings, the expert judged 63.8% (± 18.0) to be correct and
7.5% (± 9.6) to be incorrect. 28.8% (± 8.5) of the connections were marked as
unknown. For the negative connections, these numbers are: 21.3% (± 9.5) correct,
37.5% (± 11.9) incorrect and 41.3% (± 10.3) unknown. The unknown relations
are counted as irrelevant results when calculating precision over the ranked lists.
The mean average precision on the positive class is 71.3%. For the negative class
the mean average precision is 36.8%.
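A sketch of this computation; average precision is normalized here by the number of relevant ('yes') items found in each list, one common convention when the full set of true relations is unknown (the paper does not spell out its normalization):

```python
def average_precision(judgments: list[str]) -> float:
    """AP over one frequency-ranked top-20 list. 'yes' counts as relevant;
    both 'no' and 'unknown' count as irrelevant."""
    hits, ap_sum = 0, 0.0
    for rank, judgment in enumerate(judgments, start=1):
        if judgment == "yes":
            hits += 1
            ap_sum += hits / rank
    return ap_sum / hits if hits else 0.0

def mean_average_precision(ranked_lists: list[list[str]]) -> float:
    """MAP over the four persons' ranked lists of one polarity."""
    return sum(average_precision(l) for l in ranked_lists) / len(ranked_lists)
```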
The right graph in Figure 2 shows, per class, how the mean average precision
changes when the results are cut off at a certain number of instances. For example,
if only positive relations are returned for which the number of instances is at least
3, the mean average precision is 87.6%, which is the highest score for this class.
Overall, the classifier performs extremely well at classifying positive relations.
Figure 3 shows the extracted network of people that are positively related to Ferdinand Domela Nieuwenhuis. Correct connections are signified by a green node,
incorrect connections by a red node. The ranked positive list for Domela actually
contained the fewest instances and the most incorrect and unknown relations of all four lists. This could be due to the low performance of the named entity
recognition module: it recognizes only about half of the person names, so many
mentions of Domela or connections to him might not have been included. Also,
there are several people in the data with the same last name, Nieuwenhuis. It is not
unlikely that the named entity disambiguation module might have attributed some
entity mentions to the wrong person, resulting in misplaced or missing connections
in the network.
Clearly, the classification of negative relations is very unreliable. The negative class reaches its peak at a cut-off of 2 instances, with a score of only 44.1%,
while keeping all found negative instances still results in a score of 43.0%. Since
the 1-instance relations constitute more than half of the returned negative connections, filtering them would greatly reduce the number of relevant results. However,
including them will increase the amount of noise in equal measure, making the
results unusable in a user-oriented context.
8. Discussion and Future Research
We have presented a system for the classification of personal relationships in
biographical texts, which can be visualized as personal social networks of historical figures. We showed that our classifiers are able to label these relations above a
majority class baseline score. We find that a training set containing relations surrounding multiple persons produces more desirable results than a set that focuses
on one specific entity.
We have tested whether adding minimal syntactic structure to text-based features improves the classification and have found this to be the case when they are
reduced to a more generalized representation using thesaurus categories.
When selecting a classifier, it is debatable whether to choose precision over
recall, or vice versa. Giving precedence to recall will result in more noise in the
output. On a small scale, as with the current data, this can be partially filtered out
by setting a threshold for the probability or frequency needed to introduce a new
connection into the network. When the amount of data increases, however, the
amount of noise does as well. With regards to the overall goal of the encompassing
project – the construction of a browsable knowledge base – this is a very important
issue to consider. Stating only proven facts will limit serendipity, but making too
many guesses will harm the system’s credibility. To make sure that only information with a certain degree of certainty is presented to the user, we give priority to
precision.
Support vector machines and 1-NN algorithms are known to do well on text classification tasks. Indeed these algorithms achieve the highest scores. Ultimately,
we chose LibSVM in a joint learning setup as the best classifier.
The classifier proves to be better at classifying positive relations than negative relations. This is no surprise as the negative class makes up only 16% of the
training data. Adding training data would likely boost performance, though the
annotation of these relationships in this type of unstructured data has proven to be
a difficult task. As an alternative to adding more information of the same type, making
better use of metadata and other sources of information, such as publication records
or personal letters, might be a valid avenue to improve results.
Improvement of the named entity recognition and disambiguation modules will
help to extract more and more accurate relations. Including other types of entities,
such as organizations and locations in the analysis of the text fragments could be
another way to add information. In future work we intend to include organizational
entities in the social networks as they go hand in hand with the people that formed
them.
Acknowledgments
The HiTiME project is funded by the Netherlands Organisation for Scientific
Research (NWO) as part of the Continuous Access To Cultural Heritage (CATCH)
programme. We would like to thank the International Institute of Social History
for making their data available for this research, and Marien van der Heijden, for
supporting us with his expert knowledge and feedback.
References
[1] Anand, P., Walker, M., Abbott, R., Fox Tree, J. E., Bowmani, R., Minor,
M., June 2011. Cats rule and dogs drool!: Classifying stance in online debate. In: Proceedings of the 2nd Workshop on Computational Approaches
to Subjectivity and Sentiment Analysis (WASSA 2.011). Association for
Computational Linguistics, Portland, Oregon, pp. 1–9.
[2] Balahur, A., Hermida, J. M., Montoyo, A., June 2011. Detecting implicit
expressions of sentiment in text based on commonsense knowledge. In:
Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA 2.011). Association for Computational Linguistics, Portland, Oregon, pp. 53–60.
[3] Chang, C.-C., Lin, C.-J., 2001. LIBSVM: a library for support vector machines. Software available at: http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[4] Chen, C. C., Tseng, Y.-D., 2011. Quality evaluation of product reviews
using an information quality framework. Decision Support Systems 50,
755–768.
[5] Clarke, D., Lane, P., Hender, P., June 2011. Developing robust models
for favourability analysis. In: Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA
2.011). Association for Computational Linguistics, Portland, Oregon, pp.
44–52.
[6] Cohen, W. W., 1995. Fast effective rule induction. In: Proceedings of the Twelfth International Conference on Machine Learning. pp. 115–123.
[7] Cover, T., Hart, P., 1967. Nearest neighbor pattern classification. IEEE Transactions on Information Theory 13 (1), 21–27.
[8] Culotta, A., McCallum, A., Betz, J., 2006. Integrating probabilistic extraction models and data mining to discover relations and patterns in text. Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational
Linguistics (HLT-NAACL), 296–303.
[9] Elson, D. K., Dames, N., McKeown, K. R., 2010. Extracting social networks from literary fiction. Proceedings of the 48th Annual Meeting of the
Association for Computational Linguistics (ACL), 138–147.
[10] Ganley, D., Lampe, C., 2009. The ties that bind: Social network principles
in online communities. Decision Support Systems 47, 266–274.
[11] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I. H., 2009. The WEKA data mining software: an update. SIGKDD Explorations Newsletter 11 (1), 10–18.
[12] Kautz, H., Selman, B., Shah, M., 1997. The hidden web. AI Magazine 18,
27–36.
[13] Kim, S.-M., Hovy, E., 2006. Automatic identification of pro and con reasons in online reviews. Proceedings of the COLING/ACL Main Conference Poster Sessions, 483–490.
[14] Kiss, C., Bichler, M., 2008. Identification of influencers–measuring influence in customer networks. Decision Support Systems 46, 233–253.
[15] Li, F., Du, T. C., 2011. Who is talking? an ontology-based opinion leader
identification framework for word-of-mouth marketing in online social
blogs. Decision Support Systems 51, 190–197.
[16] Matsuo, Y., Tomobe, H., Hasida, K., Ishizuka, M., 2004. Finding social
network for trust calculation. European Conference on Artificial Intelligence (ECAI).
[17] Mika, P., 2005. Flink: Semantic web technology for the extraction and
analysis of social networks. Web Semantics: Science, Services and Agents
on the World Wide Web 3 (2–3), 211–223.
[18] Pang, B., Lee, L., 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2 (1–2), 1–135.
[19] Pang, B., Lee, L., Vaithyanathan, S., 2002. Thumbs up? Sentiment classification using machine learning techniques. Proceedings of the Conference
on Empirical Methods in Natural Language Processing (EMNLP), 79–86.
[20] Ravichandran, D., Hovy, E., 2002. Learning surface text patterns for a
question answering system. Proceedings of the 40th Annual Meeting of
the Association for Computational Linguistics (ACL), 41–47.
[21] Reyes, A., Rosso, P., June 2011. Mining subjective knowledge from customer reviews: A specific case of irony detection. In: Proceedings of the
2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA 2.011). Association for Computational Linguistics, Portland, Oregon, pp. 118–124.
[22] Turney, P. D., 2002. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), 417–424.
[23] van Atteveldt, W., Kleinnijenhuis, J., Ruigrok, N., Schlobach, S., 2008. Good news or bad news? Conducting sentiment analysis on Dutch text to distinguish between positive and negative relations. Journal of Information Technology & Politics 5 (1), 73–94.
[24] van de Camp, M., van den Bosch, A., 2011. A link to the past: Constructing historical social networks. Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA
2.011), 61–69.
[25] van den Bosch, A., Busser, B., Canisius, S., Daelemans, W., 2007. An efficient memory-based morphosyntactic tagger and parser for Dutch. Computational Linguistics in the Netherlands: Selected Papers from the Seventeenth CLIN Meeting, 99–114.