The Socialist Network

Matje van de Camp*, Antal van den Bosch[1]

Tilburg center for Cognition and Communication, Tilburg University, Warandelaan 2, 5037 AB Tilburg, Netherlands

* Corresponding author. Tilburg center for Cognition and Communication, Tilburg University, Warandelaan 2 D335, 5037 AB Tilburg, Netherlands, +31 13 4662433.

Abstract

We develop and test machine learning-based tools for the classification of personal relationships in biographical texts, and the induction of social networks from these classifications. A case study is presented based on several hundreds of biographies of notable persons in the Dutch social movement. Our classifiers mark relations between two persons (one being the topic of a biography, the other being mentioned in this biography) as positive, negative, or unknown, and do so at an above-baseline level. A training set centering on a historically important person is contrasted against a multi-person training set; the latter is found to produce the most robust generalization performance. Frequency-ranked predictions of positive and negative relationships predicted by the best-performing classifier, presented in the form of person-centered social networks, are scored by a domain expert; the mean average precision results indicate that our system is better at classifying and ranking positive relations (around 70% MAP) than negative relations (around 40% MAP).

Keywords: text mining, machine learning, social network extraction, sentiment analysis, social history

1. Introduction

Web sites such as Facebook, Twitter and, more recently, Google+, allow us to communicate with people across the globe with great ease. We are able to find
like-minded people anywhere in the world and share ideas with them. These Social Networking Services are not the only web sites that successfully utilize the innate desire of people to connect with one another. For instance, many online stores use the principles of social networks to create a digital version of word-of-mouth advertising by allowing users to write reviews of their products [4] or by targeting specific influential opinion leaders within a network with their advertisements [14, 15]. Some news sites allow for the building of an online reputation by letting users indicate whether they agree or disagree with someone's comments [10]. The networks that arise from these activities are digitally recorded, creating a multitude of data on human interactions. The availability of such structured data has sparked new interest in the field of Social Network Extraction and Analysis. However, little research has been done on the formation and evaluation of offline social networks from a computational modeling point of view. This is largely due to the scarcity of digitally available real-world records of such networks, with genealogy as a notable exception. Nevertheless, we do have other, often secondary, sources that contain indirect traces of these networks. By applying the technology of today to the heritage of our past, it may be possible to uncover yet unknown patterns and provide a better insight into our society's social development, whether it be on- or offline.

Email addresses: [email protected] (Matje van de Camp), [email protected] (Antal van den Bosch)
[1] Current address: Centre for Language Studies, Radboud University Nijmegen, Nijmegen, Netherlands

Preprint submitted to Decision Support Systems, September 5, 2011
In this paper we present a case study based on historical biographical information, so-called secondary historical sources, describing people in a particular domain, region and time frame: the Dutch social movement between the mid-19th and mid-20th century. "Social movement" refers to the social-political-economic complex of ideologies, workers' unions, political organizations, and art movements that arose from the ideas of Karl Marx (1818–1883) and his followers. In the Netherlands, a network of persons unfolded over time with leader figures such as Ferdinand Domela Nieuwenhuis (1846–1919) and Pieter Jelles Troelstra (1860–1930). Although this network is implicit in all the primary and secondary historical writings documenting the period, and partly explicit in the minds of experts studying the domain, there is no explicitly modeled social network of this group of persons. Yet, it would potentially benefit further research in social history to have this in the form of a computational model. In our study we focus on detecting and labeling relations between two persons, where one of the persons, A, is the topic of a biographical article, and the other person, B, is mentioned in that article. The genre of biographical articles allows us to assume that person A is topical throughout the text. What remains is to determine whether the mention of person B can be labeled as positive or negative. More fine-grained labels are possible, but the primary aim of our case study is to build a basic network from robustly recognized person-to-person relations at the highest possible accuracy. As our data only consist of several hundred articles describing a number of people of roughly the same order of magnitude, we are facing data sparsity, and thus are limited in the granularity of the labels we wish to predict.

This paper is structured as follows. Section 2 describes the project within which this research is conducted.
After a brief survey of related research in Section 3, we describe our data and our annotation scheme in Section 4. In Section 5 we describe how we implement relation classification as a supervised machine learning task. The outcomes of the experiments on our data are provided in Section 6, followed by an expert evaluation of outcomes on unseen data in Section 7. We discuss our findings, formulate conclusions, and identify points for future research in Section 8. 2. HiTiME The research presented here is conducted within the HiTiME project, which stands for Historical Timeline Mining and Extraction. The project is a collaboration between Tilburg University and the International Institute of Social History (IISH), which is located in Amsterdam. The main goal of HiTiME is to create an extensible knowledge base regarding the history of the social movement based on data provided by IISH, and to make this knowledge base searchable and browsable in a meaningful way. Many of the data sources available at IISH are textual, and contain overlapping or complementary information, but thus far, few meaningful links exist between these sources. Historical researchers who wish to gather information on a specific topic spend considerable amounts of time and energy sifting through all that is available in search of what is relevant to them. This method of searching also assumes that researchers already have a clear picture of what it is they are looking for and where to find it, decreasing the chances of them finding the unexpected. Text mining and information extraction can provide the tools needed to automatically cluster pieces of related data, making the search process far easier and leaving more time for innovative research and serendipitous findings. The data provided by IISH to HiTiME consists of both primary and secondary historical sources of text. Included are (auto)biographies, letters, archive listings, and databases. 
An obvious entry point into all this data is formed by named entities, since most of our cultural history is made up of the thoughts and actions of people and organizations. These thoughts and actions are expressed through the various ways that the entities relate to each other. We consider the network of interpersonal relations to be the most appropriate starting point, since people are the most atomic entities in this scenario. By tracking the interconnectivity between them over time, we assume that the network of organizations will at least partly unfold as well. In general, a biography contains information about one specific person, compressed to present only the most relevant facts, often in chronological order. One of the data sources maintained by IISH is the Biographical Dictionary of Socialism and the Labour Movement in the Netherlands (henceforth BWSA). It contains biographies of many of the main actors in the rise of Dutch socialism and forms a suitable case study for our system.

3. Theory

Our research combines Social Network Extraction and Sentiment Analysis. We briefly review related research in both areas.

3.1. Social Network Extraction

A widely used method for determining the relatedness of two entities was first introduced by Kautz et al. [12]. They compute the relatedness between two entities by normalizing their co-occurrence count on the Web with their individual hit counts using the Jaccard coefficient. If the coefficient reaches a certain threshold, the entities are considered to be related. For disambiguation purposes keywords are added to the queries when obtaining the hit counts. Matsuo et al. [16] apply the same method to find connections between members of a closed community of researchers. They gather person names from conference attendance lists to create the nodes of the network. The affiliations of each person are added to the queries as a crude form of named entity disambiguation.
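The hit-count-based relatedness test shared by these approaches can be sketched as follows. The function names and the threshold value are our own illustrative choices; Kautz et al. and Matsuo et al. tune such thresholds per domain:

```python
def jaccard_relatedness(hits_a, hits_b, hits_ab):
    """Jaccard coefficient over search-engine hit counts:
    the co-occurrence count divided by the size of the union
    of the two result sets."""
    union = hits_a + hits_b - hits_ab
    return hits_ab / union if union > 0 else 0.0

def are_related(hits_a, hits_b, hits_ab, threshold=0.01):
    # Two entities count as related once the coefficient
    # exceeds a manually chosen threshold (placeholder value).
    return jaccard_relatedness(hits_a, hits_b, hits_ab) >= threshold
```

For example, two names with 30 and 50 individual hits and 20 joint hits score 20 / (30 + 50 - 20) ≈ 0.33, well above a 0.01 threshold, whereas two very frequent names with only a handful of joint hits fall below it.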
When a connection is found, the relation is labeled by applying minimal rules, based on the occurrence of manually selected keywords, to the contents of websites where both entities are mentioned. A more elaborate approach to network mining is taken by Mika [17] in his presentation of the Flink system. In addition to Web co-occurrence counts of person names, the system uses data mined from other, highly structured, sources such as email headers, publication archives, and so-called Friend-Of-A-Friend (FOAF) profiles. Co-occurrence counts of a name and different interests taken from a predefined set are used to determine a person's expertise and to enrich their profile. These profiles are then used to resolve named entity co-reference and to find new connections. Elson et al. [9] use quoted speech attribution to reconstruct the social networks of the characters in a novel. Though this work is most related through the type of data used, their method can be considered complementary to ours: where they relate entities based on their conversational interaction without further analysis of the content, we try to find connections based solely on the words that occur in the text. Efforts in more general relation extraction from text have focused on finding recurring patterns and transforming them into triples (RDF). Relation types and labels are then deduced from the most common patterns [20, 8]. These approaches work well for the induction and verification of straightforwardly verbalized factoids, but they are too restricted to capture the multitude of aspects that surround human interaction; a case in point is the kind of relation between two persons, which people can usually infer from the text mentioning the relation, but which is rarely explicitly described in a single triple.

3.2. Sentiment Analysis

Sentiment analysis is concerned with locating and classifying the subjective information contained in a text.
Subjectivity is inherently dependent on human interpretation and emotion. A machine can be taught to mimic these aspects, given enough examples, but the interaction of the two is what makes humans able to understand, for instance, sarcasm or irony [21]. Although the general distinction between negative and positive is intuitive for humans to make, subjectivity and sentiment are very much domain and context dependent. Depending on the domain and context, a single sentence can have opposite meanings [18]. The sentiment or tone of a text can be measured at varying granularities: at the word level, in clauses, sentences [23], paragraphs [1], or even at the level of the text as a whole [5]. Many of the approaches to automatically solving tasks like these involve using lists of positively and negatively polarized words or phrases to calculate the overall sentiment [19]. As shown by Kim and Hovy [13], the order of the words potentially influences the interpretation of a text. Pang et al. [19] also found that the presence of a word is more important than the number of times it appears. Word sense disambiguation can be a useful tool in determining polarity. Turney [22] proposed a simple yet effective way to determine polarity at the word level: he calculates the difference between the pointwise mutual information of a phrase with the word 'excellent' and of the same phrase with the word 'poor'. Though much of the research in sentiment analysis aims at finding and labeling explicit mentions of sentiment, in most cases sentiment or emotion is expressed only implicitly. With this in mind, Balahur et al. [2] try to find situation descriptions that trigger certain emotional reactions based on common sense knowledge and gather models of these situations in a knowledge base. Most related to our own research is the work of Van Atteveldt et al. [23].
They classify connections between politicians and political issues as either positive or negative, based on news coverage of the Dutch parliamentary elections of 2006, using syntactic analysis and word similarity measures. They compare the predictive power of sentence-based features, predicate-based features, and a combination thereof, and find that the combination yields the best results.

4. Data and Annotation

4.1. Data

Much of the research performed on social network extraction starts with an explicit record of the network and attempts to find evidence to confirm this record. In contrast, the only explicit information we have in this study is a person's biography. We use the Biographical Dictionary of Socialism and the Labour Movement in the Netherlands (BWSA) as input for our system. This digital resource consists of 574 biographical articles, in Dutch, relating to the most notable actors within the domain. The texts are accompanied by a database that holds such metadata as a person's full name and known aliases, dates of birth and death, and a short description of the role they played within the Labour Movement. The articles were written by over 200 different authors; thus, the use of vocabulary varies greatly across the texts. The length of the biographies also varies: the shortest text has 308 tokens while the longest has 7,188 tokens. The mean length is 1,546 tokens with a standard deviation of 784. The biographies are all available online[2] and are accessible either through an alphabetized index or by querying the articles using string search with minimal Boolean operators. A query result links to the full article in which the search terms are highlighted. Each first mention of a person that is not the main entity and that also has a biography in the BWSA links to that person's biography. These links were all added manually. To do this for all mentions would be an arduous task.
Still, since a biography spans an entire life and relationships are not static, we need to locate every mention of every person in all biographies and classify the context in which they are mentioned to determine their position within the social network at different times. Apart from the mention of a person's name, the personal relation itself is, in most cases, not explicitly mentioned. In rare cases words like "friend" or "opponent" are used. The subtle use of language can make it difficult even for humans to judge whether someone is more positively or negatively related to someone else. Against these odds we aim to investigate how well we can solve the task of recreating the social network of Dutch socialism from biographical material using as little external knowledge as possible.

[2] http://www.iisg.nl/bwsa/

We create two manually annotated subsets from the BWSA: one that is specifically centered around a single entity, namely Ferdinand Domela Nieuwenhuis (FDN set), and one that contains data on randomly selected entities (Generic set). Each instance in these data sets consists of five sentences of which the middle one, the focus sentence, contains the entity mention that we want to relate to the main entity of the biography. We aim to investigate which of these sets is the most representative when applied to the entire data set. Domela, as he is commonly known, was a controversial key figure in the starting period of the Dutch social movement. He started his career as a Lutheran preacher, but soon parted with the church to become the pioneer of Dutch social democracy. When a growing division between socialists and anarchists divided the party, he turned his back on politics and became an anarchist himself. These changes in ideology combined with his renown may provide sufficiently rich and varied data to form a representative sample of the relations we aim our classifiers to discover in unseen data.
However, since interpersonal relations are inherently linked with personalities, placing the focus on one particular entity might skew the data in unpredictable ways. Using randomly selected instances of random people mentioned in random biographies should avoid this personality bias.

4.2. Annotation

The FDN set was annotated by two human annotators; the Generic set was annotated by one annotator. The task description for both sets was the following: Given the fragment under consideration: (a) Does it describe a relation between the main entity and the entity mentioned in the focus sentence? (b) If so, is the described relation of a negative, neutral or positive nature? Figure 1 shows some example fragments. The focus sentences are displayed in bold text. The first fragment clearly describes an active collaboration between person A (Ansing) and person B (Domela), which is classified as a positive relation. Though the relation is, in this case, obviously two-sided, we only annotate the relation from A to B, since the current context is A's biography. If the relation from B to A is of enough importance in the context of B's life, it will most likely also be mentioned in B's biography. By extracting all relations from all biographies in this manner, we will be able to reconstruct the entire network. From the second fragment we can conclude beyond doubt that Paris and Baart were aware of each other and moved within the same circles. However, nowhere in the fragment are they directly linked to one another and no hint is given whether their connection was positive or negative. The third fragment is different in the sense that person A (Mannoury) and person B (Wijnkoop) are never mentioned in the same sentence. The focus sentence tells us of an opposition against persons B and C (Van Ravesteyn), but the fragment does not give us any information about person A's connection to that opposition.
The last sentence does place him opposite the "party leaders", which actually refers to Wijnkoop and Van Ravesteyn, but it is impossible to infer this without exact domain knowledge. When looking at the entire biography, we see that ten sentences before this fragment it is said that "during World War I he (Mannoury) and, among others, H. Gorter formed an opposition against the two party leaders D.J. Wijnkoop and W. van Ravesteyn" (our translation). The information needed to tie all the hints together is not included in the fragment itself. Due to these intricacies in the data and our choice to present fragments of only five sentences, it is difficult for humans to perform the task at hand in many cases. Unsurprisingly, the inter-annotator agreement for the FDN set was fairly low. The annotators agreed on the existence of a relation in 74.9% of the cases. On the negative, neutral, and positive classes they agreed at 60.8%, 24.2%, and 66.5%, respectively. All disagreements were resolved through discussion. The neutral class proved to be the most difficult to resolve. A logical explanation for this is that while positive and negative judgments can both have varying intensities, neutral is a point midway between the extremes. Any deviation from this point makes it not neutral, and a point is much more difficult to capture than a range. A similar issue surfaced when looking at the disagreements for the unrelated class. The mere fact that a person is mentioned in another person's biography is often enough to assume that they are in some way connected, even though it may not be explicitly said that they are. Therefore, we decided to group together the neutral and unrelated classes into a single class of instances with unknown polarity. This approach allows us to treat every mention of a named entity as a connection of which we only have to determine whether the polarity leans to being negative or positive. The resulting class distributions for both sets are listed in Table 1.
They are roughly the same for both data sets and seem to be a fair representation of human relations. In general, people are more likely to approach someone in a positive manner, if only to avoid unnecessary conflict, which is reflected by the larger positive class. The unknown class possibly represents people we are somehow associated with, but as of yet have no real opinion of. Once we do form an opinion about them, they move either to the positive or the negative class. Since in general people tend to avoid conflict and strive for social bonds, it is plausible to assume that negative relations are lower in number and shorter in duration than positive ones, which would explain the small size of the negative class.

Fragment 1: Ansing [PER-A] and Domela Nieuwenhuis [PER-B] were in written contact with each other since August 1878. Domela Nieuwenhuis [PER-B] probably wrote uplifting words in his letter to Ansing [PER-A], which was not preserved, after reading Pekelharing's [PER-C] report of the program convention of the ANWV in Vragen des Tijds, which was all but flattering for Ansing [PER-A]. In this letter, Domela [PER-B] also offered his services to Ansing [PER-A] and his friends. Domela Nieuwenhuis [PER-B] used this opportunity to ask Ansing [PER-A] several questions about the conditions of the workers, the same that he had already asked in a letter to the ANWV in 1877, which had been left unanswered. Ansing [PER-A] answered the questions extensively.

Fragment 2: Paris [PER-A] joined a group of ceramists around Servaas Baart [PER-B]. In 1892, he was one of the seven founders of the ceramists association Loon naar Werk, the first trade union in Maastricht. In 1897 he became secretary and soon he took over a part of the daily work of trade union chairman Baart [PER-B], who was occupied by other activities within the socialist movement. Around 1900 Paris [PER-A] succeeded Baart [PER-B], who had become administrator of the cooperation Het Volksbelang, as chairman of Loon naar Werk. Paris [PER-A] was among the small group of socialists who revived the ailing SDAP department in Maastricht at the end of last century. When the office found out that the police were informed of the meeting, they continued the discussions at his home.

Fragment 3: In 1925 Mannoury [PER-A] played an important role in the crisis within the CPN. For some time there had been an opposition within the party against Wijnkoop [PER-B] and Van Ravesteyn [PER-C]. The contrasts became clearer during preparations for the parliamentary elections in July 1925. When the party leaders wanted to defer from a Comintern resolution regarding the candidacy at the party congress in May, Mannoury [PER-A] saw it as a breach of discipline.

Figure 1: English translations of some example fragments. The focus sentences are marked in bold in the original layout.

                 Generic           FDN
  Class        No.      %       No.      %
  negative      86     16.1      74     16.7
  positive     238     44.6     212     48.0
  unknown      210     39.3     156     35.3
  total        534    100       442    100

Table 1: Class distribution

5. Relation Extraction and Classification

5.1. Relation Extraction

First, all biographies are lemmatized, POS-tagged and parsed using Frog, a morpho-syntactic analyzer for Dutch [25]. Next, we use a custom-made NER module to locate all person names in the data. The NER module consists of a classifier-based sequence processing tool trained on contemporary newspaper data and a subset of 70 biographies of the BWSA, all fully annotated with the named entity categories Person, Organization, Location and Miscellaneous. In testing, the module only reached a score of 48.3% on the Person category; a considerable amount of noise therefore exists in the NER output. To identify the person to which a named entity refers, the name is split into chunks representing first name, initials, infix and surname. These chunks, as far as they are included in the string, are then matched against the BWSA database. If no match is found, the name is added to the database as a new person.
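The chunk-and-match step described above can be sketched as follows. The infix list, the record format, and the matching policy are our own assumptions for illustration, not the actual HiTiME implementation:

```python
# Hypothetical sketch of Dutch name chunking and database matching.
INFIXES = {"van", "de", "den", "der", "te", "ten", "ter"}

def chunk_name(name):
    """Split a Dutch person name into first name, initials,
    infix, and surname chunks; any chunk may be absent."""
    tokens = name.split()
    chunks = {"first": None, "initials": None, "infix": None, "surname": None}
    i = 0
    # Leading initials such as "P.J." (uppercase letters with periods).
    if i < len(tokens) and "." in tokens[i] and tokens[i].rstrip(".").isupper():
        chunks["initials"] = tokens[i]
        i += 1
    elif i < len(tokens) - 1:
        chunks["first"] = tokens[i]
        i += 1
    # Collect lowercase infix tokens ("van", "den", ...).
    infix = []
    while i < len(tokens) - 1 and tokens[i].lower() in INFIXES:
        infix.append(tokens[i])
        i += 1
    if infix:
        chunks["infix"] = " ".join(infix)
    chunks["surname"] = " ".join(tokens[i:])
    return chunks

def match(chunks, records):
    """Return database records that agree with every chunk
    present in the query name; absent chunks match anything."""
    return [r for r in records
            if all(v is None or r.get(k, "").lower() == v.lower()
                   for k, v in chunks.items())]
```

For instance, "Antal van den Bosch" chunks into first name "Antal", infix "van den", and surname "Bosch", so it matches only records carrying those exact fields, while "P.J. Troelstra" is matched on initials and surname alone.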
For now, however, we treat the network as a closed community in testing, by only extracting those fragments in which person B is one that already has a biography in the BWSA. At a later stage, biographies of people from outside the BWSA can be gathered and used to determine their position within the network. This step filters out most of the NER errors.

5.2. Relation Classification

We define several sets of features to train the classifiers. They can be divided into two categories: lexical features that are based on the lemmata of the fragments, and co-occurrence features that give an initial indication of the relatedness between person A and B. Almost all of the feature sets are based on the text itself, in order to keep the dependence on external information sources to a minimum. The feature sets and classification process are described in detail below.

5.2.1. Lexical Features

We define five sets of lexical features. Preliminary investigations revealed that keeping only verbs and nouns works best for the classification of the relations [24]. Person names were found to be of less significance, but their inclusion does not hurt the classification either, so we decided to include them as well for the formation of dependency triples. All person names that are mentioned in each fragment are anonymized by replacing them with 'PER-X' where X is a letter from A to Z. The main person of the biography is always denoted with 'PER-A' and the person we want to relate that person to is denoted with 'PER-B'. The maximum number of persons mentioned in a biography is always below 26.

• lemmata - The first set consists of only the verbs and nouns in their canonical form plus the anonymized person names. This set is the simplest in its form and will serve as a second baseline next to the majority class baseline. There are 5,184 lemmata in this set.

The dependency parse allows us to use more structured features. We create two separate sets from the parse.
We expect that the triples will reveal patterns that give more structured information to the classifier about how the person mentions are related to one another within the fragment and thus help the classification.

• triples - All dependency triples consisting of head-relation-dependent that contain a verb, a noun or an anonymized person name as either the head or the dependent. This set has a total of 16,000 features.

• tuples - For this set, we split the dependency triples into two tuples: head-relation and relation-dependent. It has a total of 12,011 features.

Since our data sets are of a relatively small size, we expect that the lemma-based dependency features will not cover the entire data and therefore that they will not perform well in the classification. In an attempt to make the data more general, we use Brouwers' thesaurus for Dutch (1989), which is similar in structure to Roget's Thesaurus for English. The same thesaurus is used by Van Atteveldt et al. [23] for word sense disambiguation in their classification of relations in Dutch political texts. The thesaurus contains 10 primary classes, which are further divided into 41 subdivisions and 997 sections. A single lemma can fall into different categories on each level, depending on its sense. For each of the dependency triples of the sets described above we look up its main category. We create two different feature sets in this manner. They combine the structure of the dependency parse with a more general representation of the lemmata in the text and are the smallest in size. We expect these feature sets to perform best.

• thesaurus - A selection of the dependency triples from the triples set described above of which both the head and the dependent are found in the thesaurus. The head and dependent are replaced with their associated main class in the thesaurus.
If for either there exists more than one entry with a different main class, multiple triples are created until all combinations of senses of both words are included. This set contains 1,896 features.

• thesaurus + WSD - This set is the same as thesaurus, except that only the first, most common sense of both head and dependent is included. This can be seen as a crude form of word sense disambiguation. The set contains a total of 913 features.

5.2.2. Co-occurrence Features

Previous research has shown that co-occurrence features do not significantly enhance the classification when combined with lemmata [24]. However, combined with the more informative dependency triples they might help to better distinguish between positive and negative relations. We define four different measures of co-occurrence. For all of the measures the counts are divided by the sum of all occurrences of either person A or B.

• fragment - The number of sentences within the fragment that contain a mention of both person A and person B.

• document - The number of sentences in the current biography in which both person A and B are mentioned.

• BWSA - The number of biographies in which both person A and B are mentioned.

• Yahoo! - The number of hits for a query containing the full names of persons A and B on the search engine Yahoo!.

5.2.3. Classification

We take a supervised machine learning approach to classifying the relations. To perform the experiments, we make use of Weka, a Java-based machine learning workbench [11]. We test which of the following classification algorithms performs best on the task: Naive Bayes; JRip, a Java implementation of Cohen's Ripper algorithm [6], with both the number of optimizations and seeds set to 5; LibSVM [3] with a linear kernel and a cost factor of 0.5; and the k-nearest neighbor algorithm with k set to 1 [7].
Support vector machines and k-nearest neighbor algorithms are especially suited for text classification; therefore we expect them to do better than Naive Bayes or the rule-based JRip algorithm. However, we decided to include these methods to investigate the degree of robustness of the data. All features are binary except for the co-occurrence features, which are real-valued. For each algorithm, we implement two different setups:

• pipe - A cascading pipeline which first classifies and filters out the positive instances, then classifies the negative. The instances that have received no label at the end of the pipeline are classified as unknown.

• joint - A joint learning setup where all three classes are classified simultaneously.

6. Results

6.1. Cross Validation

A 90–10 split is made of both data sets, creating two large training sets and two smaller sets which we set aside for testing. As a first experiment and to set a baseline with the simplest type of features, we compare performance for all classifiers on the lemma features. The results are displayed in the top row of Tables 2 and 3. All reported scores are macro-averaged F-measures. For both data sets all classifiers score above the majority class baseline on the lemma feature set. LibSVM gives the highest score on the Generic set, while JRip gives the highest score on the FDN set, both in a pipeline setup. When we look at the F-measures for the different classes we see that LibSVM only reaches a score of 2.2% on the negative class for the Generic set. JRip on the FDN set attains 10.3% on this same class. JRip's better performance could be due to the fact that the FDN set is more homogeneous when it comes to the people and issues it describes, since the set is focused around a single entity, thus making it easier to separate the classes using simple rules.
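All reported scores are macro-averaged F-measures; as a reference, a minimal sketch of the metric under standard precision and recall definitions (the class labels are the three used in this study):

```python
def macro_f1(gold, predicted, classes=("negative", "positive", "unknown")):
    """Macro-averaged F-measure: compute F1 per class from
    per-class precision and recall, then take the unweighted
    mean, so small classes such as 'negative' weigh as much
    as large ones."""
    scores = []
    for c in classes:
        tp = sum(1 for g, p in zip(gold, predicted) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, predicted) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, predicted) if g == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)
```

Because every class contributes equally to the mean, a classifier that ignores the small negative class is penalized here even if its overall accuracy is high, which is why the majority class baseline scores so low in Tables 2 and 3.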
Next, we consecutively replace the lemmata with the triples and the tuples, to test whether adding minimal grammatical structure improves the classification. These results are displayed in the middle rows of Tables 2 and 3. In most cases, the tuples set outperforms the triples set. This is most likely due to its smaller size: with fewer features, there is bound to be more overlap between instances, making the classes more easily separable. The tuples set, however, is still more than twice the size of the lemmata set. This has a detrimental effect on performance, and thus the scores rarely exceed the lemmata baseline, which is in line with our expectations. The JRip classifier is not affected by this explosion of features, due to its strong built-in feature selection capabilities: it attains its highest scores using the tuples set for both data sets in both setups.

            Naive Bayes      JRip             LibSVM           1-NN
            pipe    joint    pipe    joint    pipe    joint    pipe    joint
lemmata     31.2*   28.0     27.7    27.7     34.2    32.4     31.6    32.9*
triples     22.1    23.0     25.7    23.9     31.7    32.2     24.8    26.4
tuples      25.7    25.3     32.5*   31.4*    33.5    30.5     28.7    32.1
thesaurus   29.9    33.2     31.5    28.9     35.7*   31.6     33.9*   30.0
th. + WSD   30.9    34.2*    29.0    28.0     33.9    35.4*    32.8    31.9

Table 2: Macro averaged F-measures for 10-fold cross-validation on 90% of the Generic set; majority class baseline 20.6%. The highest score for each classifier and setup over all feature sets is marked with an asterisk.

            Naive Bayes      JRip             LibSVM           1-NN
            pipe    joint    pipe    joint    pipe    joint    pipe    joint
lemmata     31.6*   22.6     33.5*   30.3     32.7    32.9     33.1    32.7
triples     26.0    21.8     28.1    28.2     31.8    31.6     34.0*   29.3
tuples      29.4    23.3     33.5*   32.2*    29.6    29.1     33.2    31.6
thesaurus   29.3    31.5*    31.1    31.4     33.7*   32.1     31.7    34.0*
th. + WSD   30.3    28.1     30.7    27.3     32.6    33.3*    33.0    31.4

Table 3: Macro averaged F-measures for 10-fold cross-validation on 90% of the FDN set; majority class baseline 21.6%. The highest score for each classifier and setup over all feature sets is marked with an asterisk.
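The macro-averaged F-measures reported in Tables 2 and 3 weight each class equally, so the small negative class counts as much as the large positive one. A stdlib-only sketch of the computation (class names and toy labels are illustrative):

```python
def macro_f1(gold, pred, classes):
    """Unweighted mean of per-class F1 over gold and predicted labels."""
    f_scores = []
    for c in classes:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f_scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f_scores) / len(f_scores)

gold = ["pos", "pos", "neg", "unk"]
pred = ["pos", "neg", "neg", "unk"]
print(round(macro_f1(gold, pred, ["pos", "neg", "unk"]), 3))  # 0.778
```

This equal weighting explains why a classifier that collapses on the negative class (F of 0.0 or 2.2%) is pulled down sharply in the macro average even when it does well on the dominant positive class.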
The results on the thesaurus feature sets are displayed in the bottom rows of Tables 2 and 3. As hypothesized, these sets produce the best performance with most classifiers. The crude word sense disambiguation of the second thesaurus set does not consistently improve the overall classification scores. When looking at the individual classes' scores, we see that it mostly has a negative effect on the negative class. Overall, LibSVM attains the highest scores for the Generic set with the ambiguous thesaurus set in a pipeline setting. The F-scores for the positive, negative and unknown classes are, respectively, 48.2%, 13.7%, and 45.3%. The joint learning setup for the same data set and algorithm achieves the second best score with the disambiguated thesaurus set, with scores of 48.6%, 14.8%, and 42.8% for the respective classes. When we look at the precision and recall scores, we see that in the joint learning setup precision slightly decreases for the positive and negative classes, in favor of recall. For the FDN data set, 1-NN scores best. It achieves the same F-score in the pipeline setup with the triples feature set as it does in the joint setup with the undisambiguated thesaurus features. The F-scores for the positive, negative and unknown classes in the triples pipeline are 60.6%, 10.7%, and 30.7%, respectively. For the thesaurus joint learning classifier they are 52.5%, 20.3%, and 29.2%, respectively.

Generic            LibSVM    1-NN
thesaurus + WSD    35.4      31.9
+ fragment         35.4      31.4
+ document         34.1      31.4
+ BWSA             36.7*     32.9
+ Yahoo!           34.9      32.8
+ all co-occ.      35.0      33.1*

FDN                LibSVM    1-NN
thesaurus          32.1      34.0*
+ fragment         33.9      32.8
+ document         33.6      33.1
+ BWSA             32.1      34.0*
+ Yahoo!           33.0      34.0*
+ all co-occ.      35.3*     32.1

Table 4: Macro averaged F-measures for 10-fold cross-validation on 90% of the Generic and FDN sets. The highest score for each data set and classifier over all feature sets is marked with an asterisk.
In the joint learning setup, again, we see a decrease in precision, this time for the negative and unknown classes. For the negative class it drops from 44.4% to 17.4%, while recall increases from 6.1% to 24.2%. For the positive class, precision slightly increases, while recall drops from 80.0% to 56.3%. This indicates that in the pipeline setup the first classifier tends to give priority to the positive class in its classification. As a consequence, few instances get passed on to the second classifier, which leads to high precision but low recall on the small negative class. The task of the system, ultimately, is to judge whether connections drawn from a neutral or non-polarized network are either more positive or more negative. Thus, it is more important to correctly classify an instance than to find all connections belonging to each class. We therefore favor precision over recall, especially with regard to the large positive class, and choose the joint learning setup over the pipeline.

As a final experiment, we pair the best performing feature set for each data set with all or one of the co-occurrence features in a joint learning setup. For the Generic data set, the best performing feature set is the disambiguated thesaurus set; for the FDN data set it is the ambiguous thesaurus set. The results are displayed in Table 4. LibSVM is the best scoring algorithm for both data sets. There seems to be no consistency as to which co-occurrence feature works best. For the Generic set, the BWSA co-occurrence measure produces the highest score; for the FDN set, all co-occurrence measures together perform best.

6.2. Evaluation on Held Out Data

Overall, performance on the Generic set is consistently better than performance on the FDN set in cross validation. To assess their performance on unseen data, we test the best LibSVM joint learning classifier for each data set on a held-out test set. The results are listed in Table 5.

Generic (thesaurus + WSD + BWSA)
             Precision   Recall   F-measure
positive     48.5        66.7     56.1
negative     0.0         0.0      0.0
unknown      33.3        23.8     27.8
macro avg.   27.3        30.2     28.0

FDN (thesaurus + all co-occ.)
             Precision   Recall   F-measure
positive     47.4        40.9     43.9
negative     28.6        25.0     26.7
unknown      20.0        25.0     22.2
macro avg.   32.0        30.3     30.9

Table 5: Top: results when training on 90% and testing on 10% of the Generic set; majority class baseline 20.5%. Bottom: results when training on 90% and testing on 10% of the FDN set; majority class baseline 21.6%.

Unfortunately, the classifier trained on the Generic set is unable to correctly classify any instances as negative. The accuracy on this class is 72.2%, which means that the classifier does not completely resort to majority class voting and does select some instances as negative (though not the correct ones). The FDN classifier performs better on the negative class, but consequently has a lower recall on the positive class, leading to a comparably low average F-measure.

To test whether classifiers trained on the two data sets generalize well to their mutual test sets, we test the Generic classifier on the held-out test set of the FDN data set, and vice versa. The results are listed in Table 6. Again, when testing on the held-out Generic set, we see that no instances are ever classified as negative.

FDN, tested on Generic (thesaurus + all co-occ.)
             Precision   Recall   F-measure
positive     45.8        45.8     45.8
negative     0.0         0.0      0.0
unknown      35.7        47.6     40.8
macro avg.   27.2        31.1     28.9

Generic, tested on FDN (thesaurus + WSD + BWSA)
             Precision   Recall   F-measure
positive     57.9        50.0     53.7
negative     16.7        12.5     14.3
unknown      33.3        43.8     37.8
macro avg.   36.0        35.4     35.3

Table 6: Top: results when training on 90% of the FDN set and testing on 10% of the Generic set; majority class baseline 20.5%. Bottom: results when training on 90% of the Generic set and testing on 10% of the FDN set; majority class baseline 21.6%.
The macro averages of precision, recall and F-measure are roughly the same as when we train and test on the Generic set. The Generic classifier performs better when tested on the FDN set. In fact, it outperforms the FDN classifier for both the positive and the unknown class, leading to F-measures that are 10 to 15 points higher. Its performance on the negative class is worse than when trained on FDN, but higher than when trained and tested on Generic. We conclude from this that the task of classifying generic relationships from a large pool of connections between different people is more difficult than classifying relationships that only pertain to one specific entity, such as Domela.

Figure 2: Left: Frequency distribution of the number of instances per connection. Right: Mean average precision as a function of the minimum number of instances per connection.

7. Expert Evaluation

In order to evaluate the system's performance in a real-world setting, we train a classifier on the full Generic data set with the best selected features and use it to classify all remaining person mentions in the BWSA. We extract top-20 lists of both negative and positive relations for four main characters in the data: Domela, Eduard Douwes Dekker (1820–1887), Franc van der Goes (1858–1939), and Henriette van der Schalk (1869–1952). These persons are chosen based on the above-average length of their biographies, which we take as an indication of their importance throughout the data, to make sure enough connections are found to create the top-20 lists. The lists are ranked by frequency, placing the connection for which the most instances were found at the top.
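The frequency ranking described above can be sketched as follows: each classified instance is treated as a (main person, other person, label) triple, and connections are ranked by the number of supporting instances. The triples below are placeholders, not relations from the data.

```python
from collections import Counter

# Placeholder classifier output: (main person, other person, label) triples.
instances = [
    ("Main", "A", "positive"),
    ("Main", "A", "positive"),
    ("Main", "A", "positive"),
    ("Main", "B", "positive"),
    ("Main", "C", "negative"),
]

def ranked_list(instances, main, label, k=20):
    """Top-k connections of one polarity for one main person,
    ranked by the number of supporting instances."""
    counts = Counter(other for m, other, lab in instances
                     if m == main and lab == label)
    return counts.most_common(k)

print(ranked_list(instances, "Main", "positive"))  # [('A', 3), ('B', 1)]
```

Ranking by raw instance count is a deliberately simple confidence proxy; the cut-off analysis in the next section shows how precision varies with this count.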
In total, the ranked lists are made up of 329 instances: 119 negative instances, with an average of 1.5 (± 0.1) instances per connection, and 210 positive instances, with an average of 2.6 (± 0.9) instances per connection. The left graph in Figure 2 shows the frequency distribution of the number of instances returned per relation for both the positive and negative classes over all ranked lists. The maximum number of instances returned for the negative class is 6; for the positive class the maximum is 11. Besides this, the most notable difference we see is that the connections for which only one instance is returned constitute over 65% of the negative results, but only around 25% of the positive results.

We asked an expert in the field of Dutch social history to judge these ranked lists of friends and foes. The task for the negative relations was to indicate whether the person mentioned ever had a (major or minor) conflict with the main person. For the positive relations, the task was to indicate whether the person mentioned ever had a (long or short) positive encounter with the main person. The intensity of the relation is not factored in. Since it is unrealistic to expect one person to know everything there is to know about the domain in such detail, instead of only 'yes' or 'no' the expert could also choose 'unknown' in case of doubt or unfamiliarity with the person(s) in question. Of the positive rankings, the expert judged 63.8% (± 18.0) to be correct and 7.5% (± 9.6) to be incorrect; 28.8% (± 8.5) of the connections were marked as unknown. For the negative connections, these numbers are: 21.3% (± 9.5) correct, 37.5% (± 11.9) incorrect and 41.3% (± 10.3) unknown. The unknown relations are counted as irrelevant results when calculating precision over the ranked lists. The mean average precision on the positive class is 71.3%.

Figure 3: Validated network of positive relations of Ferdinand Domela Nieuwenhuis.
For the negative class, the mean average precision is 36.8%. The right graph in Figure 2 shows, per class, how the mean average precision changes when the results are cut off at a certain number of instances. For example, if only positive relations are returned for which the number of instances is at least 3, the mean average precision is 87.6%, which is the highest score for this class. Overall, the classifier performs extremely well on classifying positive relations. Figure 3 shows the extracted network of people that are positively related to Ferdinand Domela Nieuwenhuis. Correct connections are signified by a green node, incorrect connections by a red node. The ranked positive list for Domela actually contained the fewest instances and the most incorrect and unknown relations of all four lists. This could be due to the low performance of the named entity recognition module: it recognizes only about half of the person names, so many mentions of Domela or connections to him might not have been included. Also, there are several people in the data with the same last name, Nieuwenhuis. It is not unlikely that the named entity disambiguation module attributed some entity mentions to the wrong person, resulting in misplaced or missing connections in the network.

Clearly, the classification of negative relations is very unreliable. The negative class reaches its peak at a cut-off of 2 instances, with a score of only 44.1%, while keeping all found negative instances still results in a score of 43.0%. Since the 1-instance relations constitute more than half of the returned negative connections, filtering them out would greatly reduce the number of relevant results. However, including them increases the amount of noise in equal measure, making the results unusable in a user-oriented context.
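The mean average precision used in this evaluation, with 'unknown' judgments counted as irrelevant, can be sketched as follows. This is an illustrative reconstruction with toy judgments; the paper's exact averaging conventions may differ in detail.

```python
def average_precision(ranked_judgments):
    """Average precision over one ranked list of expert judgments
    ('yes', 'no', 'unknown'); 'unknown' counts as irrelevant. The
    average is taken over the relevant items found in the list."""
    hits, total = 0, 0.0
    for rank, judgment in enumerate(ranked_judgments, start=1):
        if judgment == "yes":
            hits += 1
            total += hits / rank   # precision at this relevant rank
    return total / hits if hits else 0.0

def mean_average_precision(lists):
    return sum(average_precision(lst) for lst in lists) / len(lists)

# Two toy ranked lists of expert judgments.
lists = [["yes", "unknown", "yes", "no"], ["no", "yes"]]
print(round(mean_average_precision(lists), 3))  # 0.667
```

Because 'unknown' is irrelevant here, a connection the expert cannot verify hurts the score exactly as much as one judged wrong; this is a conservative choice that penalizes obscure but possibly correct connections.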
8. Discussion and Future Research

We have presented a system for the classification of personal relationships in biographical texts, which can be visualized as personal social networks of historical figures. We showed that our classifiers are able to label these relations above a majority class baseline score. We find that a training set containing relations surrounding multiple persons produces more desirable results than a set that focuses on one specific entity. We have tested whether adding minimal syntactic structure to text-based features improves the classification, and have found this to be the case when the features are reduced to a more generalized representation using thesaurus categories.

When selecting a classifier, it is debatable whether to choose precision over recall, or vice versa. Giving precedence to recall will result in more noise in the output. On a small scale, as with the current data, this can be partially filtered out by setting a threshold for the probability or frequency needed to introduce a new connection into the network. When the amount of data increases, however, the amount of noise does as well. With regard to the overall goal of the encompassing project, the construction of a browsable knowledge base, this is a very important issue to consider. Stating only proven facts will limit serendipity, but making too many guesses will harm the system's credibility. To make sure that only information with a certain degree of certainty is presented to the user, we give priority to precision.

Support vector machines and 1-NN algorithms are known to do well on text classification tasks. Indeed, these algorithms achieve the highest scores. Ultimately, we chose LibSVM in a joint learning setup as the best classifier. The classifier proves to be better at classifying positive relations than negative relations. This is no surprise, as the negative class makes up only 16% of the training data.
Adding training data would likely boost performance, though the annotation of these relationships in this type of unstructured data has proven to be a difficult task. As an alternative to adding more information of the same type, making better use of metadata and other sources of information, such as publication records or personal letters, might be a valid avenue to improve results. Improvement of the named entity recognition and disambiguation modules will help to extract more, and more accurate, relations. Including other types of entities, such as organizations and locations, in the analysis of the text fragments could be another way to add information. In future work we intend to include organizational entities in the social networks, as they go hand in hand with the people that formed them.

Acknowledgments

The HiTiME project is funded by the Netherlands Organisation for Scientific Research (NWO) as part of the Continuous Access To Cultural Heritage (CATCH) programme. We would like to thank the International Institute of Social History for making their data available for this research, and Marien van der Heijden for supporting us with his expert knowledge and feedback.

References

[1] Anand, P., Walker, M., Abbott, R., Fox Tree, J. E., Bowmani, R., Minor, M., June 2011. Cats rule and dogs drool!: Classifying stance in online debate. In: Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA 2.011). Association for Computational Linguistics, Portland, Oregon, pp. 1–9.

[2] Balahur, A., Hermida, J. M., Montoyo, A., June 2011. Detecting implicit expressions of sentiment in text based on commonsense knowledge. In: Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA 2.011). Association for Computational Linguistics, Portland, Oregon, pp. 53–60.

[3] Chang, C.-C., Lin, C.-J., 2001. LIBSVM: a library for support vector machines.
Software available at: http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[4] Chen, C. C., Tseng, Y.-D., 2011. Quality evaluation of product reviews using an information quality framework. Decision Support Systems 50, 755–768.

[5] Clarke, D., Lane, P., Hender, P., June 2011. Developing robust models for favourability analysis. In: Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA 2.011). Association for Computational Linguistics, Portland, Oregon, pp. 44–52.

[6] Cohen, W. W., 1995. Fast effective rule induction. In: Proceedings of the Twelfth International Conference on Machine Learning. pp. 115–123.

[7] Cover, T., Hart, P., 1967. Nearest-neighbour classification. IEEE Transactions on Information Theory 13.

[8] Culotta, A., McCallum, A., Betz, J., 2006. Integrating probabilistic extraction models and data mining to discover relations and patterns in text. Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (HLT-NAACL), 296–303.

[9] Elson, D. K., Dames, N., McKeown, K. R., 2010. Extracting social networks from literary fiction. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL), 138–147.

[10] Ganley, D., Lampe, C., 2009. The ties that bind: Social network principles in online communities. Decision Support Systems 47, 266–274.

[11] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I. H., 2009. The WEKA data mining software: an update. SIGKDD Explor. Newsl. 11, 10–18.

[12] Kautz, H., Selman, B., Shah, M., 1997. The hidden web. AI Magazine 18, 27–36.

[13] Kim, S.-M., Hovy, E., 2006. Automatic identification of pro and con reasons in online reviews. Proceedings of the COLING/ACL Main Conference Poster Sessions, 483–490.

[14] Kiss, C., Bichler, M., 2008. Identification of influencers: measuring influence in customer networks.
Decision Support Systems 46, 233–253.

[15] Li, F., Du, T. C., 2011. Who is talking? An ontology-based opinion leader identification framework for word-of-mouth marketing in online social blogs. Decision Support Systems 51, 190–197.

[16] Matsuo, Y., Tomobe, H., Hasida, K., Ishizuka, M., 2004. Finding social network for trust calculation. European Conference on Artificial Intelligence (ECAI).

[17] Mika, P., 2005. Flink: Semantic web technology for the extraction and analysis of social networks. Web Semantics: Science, Services and Agents on the World Wide Web 3 (2–3), 211–223.

[18] Pang, B., Lee, L., 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2 (1–2), 1–135.

[19] Pang, B., Lee, L., Vaithyanathan, S., 2002. Thumbs up? Sentiment classification using machine learning techniques. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 79–86.

[20] Ravichandran, D., Hovy, E., 2002. Learning surface text patterns for a question answering system. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), 41–47.

[21] Reyes, A., Rosso, P., June 2011. Mining subjective knowledge from customer reviews: A specific case of irony detection. In: Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA 2.011). Association for Computational Linguistics, Portland, Oregon, pp. 118–124.

[22] Turney, P. D., 2002. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), 417–424.

[23] van Atteveldt, W., Kleinnijenhuis, J., Ruigrok, N., Schlobach, S., 2008. Good news or bad news? Conducting sentiment analysis on Dutch text to distinguish between positive and negative relations. Journal of Information Technology & Politics 5, 73–94.

[24] van de Camp, M., van den Bosch, A., 2011.
A link to the past: Constructing historical social networks. Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA 2.011), 61–69.

[25] van den Bosch, A., Busser, B., Canisius, S., Daelemans, W., 2007. An efficient memory-based morphosyntactic tagger and parser for Dutch. Computational Linguistics in the Netherlands: Selected Papers from the Seventeenth CLIN Meeting, 99–114.