A Multi-Agent System with Reinforcement Learning Agents for Biomedical Text Mining

Michael Camara, Oliver Bonham-Carter, Janyl Jumadinova
Allegheny College, Department of Computer Science, 520 North Main Street, Meadville, PA
University of Nebraska at Omaha, College of Information Science and Technology, 6001 Dodge Street, Omaha, NE

ABSTRACT
Due to the rapid growth of information in the biomedical literature and biomedical databases, researchers and practitioners in the biomedical field require efficient methods of handling and extracting useful information. We present a novel framework for biomedical text mining based on a learning multi-agent system. Our distributed system comprises several software agents, each of which uses a reinforcement learning method to update the sentiment of relevant text drawn from a particular set of research articles related to specific keywords. Our system was tested on biomedical research articles from PubMed, where the goal of each agent is to accrue utility by correctly determining the relevant information that is communicated to other agents. Our results, obtained on abstracts collected from PubMed related to muscular atrophy, Alzheimer's disease, and diabetes, show that our system is able to appropriately learn the sentiment score related to specific keywords through parallel and distributed analysis of the documents by multiple software agents.

Categories and Subject Descriptors
J.3 [Applied Computing]: Life and medical sciences

General Terms
Design, Experimentation

Keywords
Multi-agent, reinforcement learning, biomedical text mining

1. INTRODUCTION
With the exponential growth of scientific information over the last several centuries [21], the volume of published biomedical research has been expanding at a particularly rapid rate. As of 2015, the PubMed database contains over 24 million records, and between 1986 and 2010 the total number of citations in PubMed grew at an annual rate of 4% [11]. Due to this explosive growth, it is very challenging to keep up to date with all of the new research discoveries and methods within biomedical research. Poor communication creates a disconnection between highly specialized fields and subfields [5], and it is a challenge to share the wealth of knowledge from discovery between the research disciplines. With the current rate of growth in published biomedical research, it is very likely that important connections between individual elements of biomedical knowledge, connections that could lead toward practical use in the forms of diagnosis, prevention, and treatment, are not being made.
Text mining is an automated approach to harvesting knowledge from a wide and diverse corpus of records. Text mining and information extraction have been gaining popularity as a way to help researchers cope with a large volume of information [12]. Biomedical text mining allows researchers to identify needed information more efficiently and to discover relationships obscured by the sheer volume of available literature. Information extraction and text data analysis can be particularly relevant and useful in biomedical research, in which current information about complex processes involving genes, proteins, and phenotypes is crucial. Various data mining tools and methods have recently been established to help researchers and physicians make more efficient use of the existing research [11, 15]. Most of the existing tools operate as a centralized data mining process and do not incorporate information learned during this process. However, it may be necessary to collect and analyze different distributed sources of voluminous data, potentially stored across different databases. Distributed data mining (DDM) enables the mining of data regardless of where it is stored [19]. Many distributed data mining systems adopt the multi-agent system architecture [6, 8]. A multi-agent system is a collection of intelligent software agents that interact with each other. An intelligent agent is an autonomous entity or software program that automatically performs tasks on behalf of the user to accomplish some goal. A multi-agent system can achieve an efficient, flexible, and adaptable design that allows for better communication, interaction, and management among the different components of a distributed data mining system [3].

In this paper, we present a multi-agent based framework with learning that processes PubMed research article abstracts in a distributed way. Our system allows for the integration of the information learned from the extracted data by the individual agents throughout the information extraction process. Software agents within our system utilize a reinforcement learning algorithm, which provides a reliable learning solution despite the lack of labeled data in our experiments. We tested our framework on PubMed research articles related to muscular atrophy, Alzheimer's disease, and diabetes. In our experiments, the software agents in our system were tasked with learning sentiment information as they processed and analyzed each abstract from the assigned data set. Abstracts are short texts that describe a larger body of work. Since they must be specific, the knowledge we extract from them is likely to be factual and unambiguous. Our experiments are based on this factual corpus. Our results demonstrate that the proposed multi-agent text mining framework is able to appropriately learn the sentiment score related to specific keywords through parallel and distributed analysis of the documents by multiple software agents.

In summary, the contributions of this paper are as follows:

1. A multi-agent system for distributed text mining of biomedical information.
2. An application of a reinforcement learning algorithm by each software agent to correlate information.
3. The development of a novel scientific workflow involving data extraction and processing of the data.
4. An experimental study that serves as a proof of concept for our idea: an automated text mining framework for data analysis, driven by a multi-agent system.
2. RELATED WORK
Text mining is becoming a more prominent research tool for coping with the increasing availability of large data. Previous text mining tools have concentrated on extracting information for human consumption (text summarization, document retrieval), assessing document similarity (document clustering, key-phrase identification), and extracting structured information (entity extraction, information extraction) [20, 21]. In recent years, important applications of text mining have emerged in biomedical research, whose data is now more readily available. Research in text mining for biomedical use has concentrated on several different aspects. Some studies are concerned with named entity recognition, which identifies, within a collection of text, all of the instances of a specific name [5, 23]. Other researchers have developed techniques for text classification that attempt to automatically determine whether a document, or part of a document, has particular characteristics of interest [10]. Other types of tools were created for relationship extraction, which detects occurrences of a specified type of relationship between a pair of entities of given types [22].

There are a number of articles that survey the existing search tools for biomedical literature to elucidate the differences between them. For example, Kim et al. [9] examine the underlying steps and categorization of several biomedical literature search tools, and give an overview of the input and output formats used in these tools. The authors then classify the tools by their behavior patterns, namely: starting, chaining, browsing, differentiating, monitoring, and extracting. They conclude that although different tools perform better in these different facets, there is a need to develop better solutions for differentiating documents. In another review article, Lu [11] examines 28 unique web tools for searching the biomedical literature, similar to the National Center for Biotechnology Information's (NCBI) PubMed web service. All of the reviewed tools search through a large collection of biomedical literature based on a user-specified query, returning selected documents that, ideally, closely match the user's request. Each of the selected tools was placed into one of four categories based on its most notable features: ranking search results, clustering results into topics, extracting and displaying semantics and relations, or improving the search interface and retrieval experience. After analyzing each in turn, Lu highlights some key features; for instance, the usefulness of document clustering to group together similar documents using biomedical controlled vocabularies/ontologies as representative topic terms.

Recently, some researchers have used multi-agent systems for text mining in order to create distributed data mining systems [3, 8]. Di Fatta and Fortino [7] propose a multi-agent system framework for distributed data mining based on a peer-to-peer model. They highlight the importance of such distributed systems due to the overwhelming increase in the amount of data and high performance costs. The authors use their model for data collection on molecular structures, and based on their results they claim that the framework is further suitable for large-scale, heterogeneous computational environments such as grids.
Chaimontree et al. [4] describe a generic multi-agent data mining (MADM) framework that is able to handle large amounts of data, provides adequate security, and yields the best clusters possible through multi-agent interactions. Their framework uses JADE (Java Agent Development Environment) along with five different types of agents, each with its own distinct purpose. By comparing the F-measure of several published clustering algorithms on given data sets, the best clustering algorithm is identified. In [18], Tong et al. describe a data mining multi-agent system (DMMAS) to model chronic disease data, with type II diabetes as their primary study domain. They focus on improving the efficiency of data mining using parallel and distributed data mining algorithms. Chao and Wong [6] propose a multi-agent data mining workbench to assist physicians in making objective and reliable diagnoses. Their proposed multi-agent learning paradigm, called DiaMAS, makes use of heterogeneous intelligent data mining agents, each being responsible for a specific and independent task.

In this paper, we present a learning multi-agent system for mining biomedical research articles. In contrast to the previous works that utilize a multi-agent system framework for data mining, we consider exploring and finding relations within the text of biomedical research articles. Another contribution our paper makes is an application of reinforcement learning by the individual agents to better learn appropriate relations through sentiment. Finally, we present a completely automated workflow, starting with data collection and storage, and finishing with the analysis.

3. TEXTMED: A LEARNING MULTI-AGENT SYSTEM

3.1 Data Collection and Preprocessing
In order to collect the data necessary for our multi-agent system, we first created a framework called Lister, which automatically downloads and decompresses every article available through the PubMed database [14] in NXML format. Lister then parses all of the available articles for a given keyword using a bag-of-words approach, producing the abstract of each article containing that keyword at least once. All forms of punctuation are then removed from these abstracts, in addition to irrelevant stop words that could potentially interfere with subsequent analysis. This process can be repeated indefinitely for any number of unique keywords, producing a subset of relevant abstracts after each execution. At the end of the data collection, the data set D obtained from Lister is divided into subsets D_i, so that each subset can be processed by one software agent i.

Another aspect of preprocessing the data involves the creation of a list of keywords to isolate in the corpus. This list conforms to the MeSH (Medical Subject Headings) descriptors created and maintained by the U.S. National Library of Medicine, containing over 27,000 unique entries relevant specifically to the biomedical field in 2015 [13]. The software agents learn sentiment related to the secondary keywords obtained from the list of MeSH keywords and the primary keyword used by Lister. Figure 1 shows the general workflow for our multi-agent system.

Figure 1: The overview of our multi-agent system workflow.

All of the results obtained after the analysis by our system are stored locally in a database for quick and convenient access.

3.2 Learning Intelligent Agents
Our multi-agent system, called TextMed, begins by creating separate agents for each subset of abstracts created by Lister.
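The paper does not include Lister's source code; the following minimal Python sketch only illustrates how the abstracts collected in Section 3.1 might be cleaned and split into the per-agent subsets just mentioned, assuming the abstracts are already available as plain-text strings. The stop-word list, the helper names (preprocess, collect, partition), and the toy abstracts are illustrative assumptions, not part of the actual tool.

```python
import string

# A small illustrative stop-word list; the paper does not specify which list Lister uses.
STOP_WORDS = {"the", "a", "an", "of", "and", "or", "in", "to", "is", "are", "for", "with"}

def preprocess(abstract: str) -> str:
    """Strip punctuation and stop words from an abstract, as Lister is described to do."""
    text = abstract.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(word for word in text.split() if word not in STOP_WORDS)

def collect(abstracts: list[str], primary_keyword: str) -> list[str]:
    """Keep only abstracts mentioning the primary keyword at least once (bag-of-words check)."""
    keyword = primary_keyword.lower()
    return [preprocess(a) for a in abstracts if keyword in a.lower()]

def partition(dataset: list[str], num_agents: int) -> list[list[str]]:
    """Split the data set D into subsets D_i, one per agent i."""
    return [dataset[i::num_agents] for i in range(num_agents)]

# Example usage with toy abstracts standing in for PubMed-derived text.
abstracts = ["Diabetes is a chronic disease ...", "A study of muscular atrophy ..."]
D = collect(abstracts, "diabetes")
subsets = partition(D, num_agents=10)   # one subset per software agent
```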
Figure 2 describes the workflow of one individual agent i; since the agents in our system are homogeneous, the same process applies to all of the agents.

Figure 2: Individual agent workflow, demonstrated on agent i.

Each agent starts by parsing its designated data set using the same list of keywords created previously. The agent first determines whether a given keyword, or its pluralized form, appears in the abstract at least once. If the keyword does not appear in the abstract, then the next keyword is tested; otherwise, the agent begins the sentiment analysis of each match using a variable that we call proximity. The proximity variable indicates the number of words to the left and to the right of a keyword match that will be included in the subsequent sentiment analysis.

Sentiment analysis is performed by a program called SentiStrength [17]. This program was originally designed to perform sentiment analysis on short social web texts, such as Twitter and YouTube comments, but we have leveraged its capabilities for use with biomedical texts [16]. In its simplest form, SentiStrength accepts a string of text as input and produces both positive and negative sentiment scores as output: the higher the magnitude of a score, the stronger the positive or negative sentiment of the text. By combining the positive and negative scores we create a composite score in the range of -4 (strong negativity) to 4 (strong positivity), with a score of 0 indicating neutral sentiment. If multiple matches of a keyword are found in a document, then sentiment analysis is performed on each match individually, and the results are averaged to generate the local calculated sentiment, ls_{k,d}, of the keyword k in the document d for the proximity value tested.

After agent i processes each document and calculates the local sentiment for keyword k in that document, it also updates the global sentiment value, gs_k, of keyword k across all documents that all agents have processed so far. Initially, when a particular keyword is found for the first time in one of the documents, the corresponding keyword object records the local calculated sentiment as the new global sentiment for the given proximity value (gs_k = ls_{k,d}). However, if the keyword has been found by one of the agents in previous documents, then the agent uses a reinforcement learning algorithm to produce a new learned global sentiment.

Reinforcement learning has been frequently used and studied in multi-agent systems [1, 2]. This learning paradigm relies on agents choosing an action and then receiving a scalar reward based on how the state of the environment changes. In our experiments, the state of the environment corresponds to the global sentiment score for each keyword. By maintaining historical data on the reward given for each action performed, we can design our agents to seek the best reward possible based on both short-term and long-term goals. This type of agent learning was chosen due to the lack of existing sentiment data for the keywords tested in the scope of biomedical research. Without such reference data, we needed a way to determine the validity of each calculated sentiment score, and reinforcement learning provided a reliable solution to the problem. The reinforcement learning algorithm used by each agent is presented as Algorithm 1 and is described in more detail below.
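Before turning to Algorithm 1, the following minimal Python sketch illustrates the sentiment-scoring step just described: it cuts a proximity window around each keyword match, combines SentiStrength-style positive and negative scores into a composite value (assumed here to be their sum), and averages the per-match scores into ls_{k,d}. The score_text callable is a stand-in for a call to the real SentiStrength tool; the plural handling and the clamping to [-4, 4] are simplifying assumptions.

```python
def proximity_window(tokens, index, proximity):
    """Take `proximity` words on each side of the keyword match at tokens[index]."""
    return tokens[max(0, index - proximity): index + proximity + 1]

def composite_score(positive, negative):
    """Combine SentiStrength-style positive (1..5) and negative (-1..-5) scores into a
    single composite value, clamped here to the [-4, 4] range used by TextMed."""
    return max(-4, min(4, positive + negative))

def local_sentiment(document, keyword, proximity, score_text):
    """Average the composite score over every proximity window around `keyword`,
    producing ls_{k,d}; `score_text` stands in for a call to SentiStrength and is
    expected to return a (positive, negative) pair for a string of text."""
    tokens = document.split()
    scores = []
    for i, token in enumerate(tokens):
        if token == keyword or token == keyword + "s":   # the keyword or its plural form
            window = " ".join(proximity_window(tokens, i, proximity))
            positive, negative = score_text(window)
            scores.append(composite_score(positive, negative))
    return sum(scores) / len(scores) if scores else None

# Example with a trivial stand-in scorer that always returns (1, -2).
ls = local_sentiment("chronic pain worsens diabetes outcomes", "diabetes", 5,
                     lambda text: (1, -2))
```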
Algorithm 1: Agent Reinforcement Learning Algorithm
  Data: ls_{k,d} for all k, d
  Result: gs_k for all k
  initialization;
  for each document d in D do
      for each keyword k do
          choose policy p that gives the best local reward;
          if random variable > small value then
              choose a random policy p_r;
          else if there exists a better policy p' in the Q-table than p then
              choose policy p';
          else
              choose policy p;
          end
          update the reward using Equation 1;
          update the Q-table using Equation 2;
      end
      update the utility using Equation 3;
  end

In our system, the reward R_k for keyword k after N documents containing the keyword have been processed is calculated as the sum of the magnitudes of the differences between the anticipated next state (global sentiment) and the local sentiment for each document, divided by the total number of processed documents so far containing the keyword, as shown in Equation 1:

R_k = \frac{\sum_{d=1}^{N} |gs_k - ls_{k,d}|}{N}    (1)

Smaller values are favored over larger ones, since they indicate smaller variance with respect to the sentiment calculated in previously parsed documents. During reinforcement learning, agent i chooses among six policies to obtain the most favorable reward. These policies apply a simple change to the local sentiment to produce the anticipated next state, as shown in Table 1. The expected reward is calculated for each policy using Equation 1, and the policy yielding the smallest, most favorable reward is chosen. This reward is recorded, along with the corresponding action needed to change the local sentiment to the next state chosen by the policy.

Table 1: Policies used in the reinforcement learning algorithm
Policy # | Description of the global sentiment calculation
0        | gs_k = ls_{k,d}
1        | If ls_{k,d} < gs_k, then gs_k = ls_{k,d} + 1. Otherwise, gs_k = ls_{k,d} - 1.
2        | If ls_{k,d} < gs_k, then gs_k = ls_{k,d} · 2. Otherwise, gs_k = ls_{k,d} / 2.
3        | If gs_k > 0, then gs_k = ls_{k,d} / gs_k.
4        | If ls_{k,d} > 0, then gs_k = gs_k / ls_{k,d}.
5        | If ls_{k,d} < gs_k, then gs_k = ls_{k,d} + 2. Otherwise, gs_k = ls_{k,d} - 2.
random   | Randomly choose to increase or decrease gs_k by 1.

While the previous calculations attempt to maximize the short-term reward, we use Q-learning (a model-free reinforcement learning technique) to take the long-term reward into account as well. Each keyword object contains a Q-table: a two-dimensional array representing historical rewards for each state-action pair, with each entry initialized to the largest, least favorable reward possible. The learning algorithm iterates through all possible pairings in the Q-table given the current local sentiment, eventually finding the smallest, most favorable reward and the corresponding action needed to attain the indicated next state. The reward values obtained from these two approaches are compared, the smaller, more favorable value is chosen as the optimal reward, and its corresponding action is marked as the optimal action. Finally, in order to address the exploration-exploitation trade-off, we introduce an infrequent random chance for the optimal action to increase or decrease by 1, allowing potentially better states to be reached and recorded. The next state is determined by summing the local sentiment with the optimal action, and this value is used to explicitly adjust the global learned sentiment for the keyword. The Q-table for the keyword is then updated following the formula in Equation 2, where δ ∈ (0, 1) is the learning rate and γ ∈ (0, 1) is the discount factor; both are arbitrarily set to 0.5 by default in our experiments.
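Algorithm 1 is given only as pseudocode; the following Python sketch shows one possible reading of a single learning trial for one keyword. It assumes δ = γ = 0.5 as in our experiments, a hypothetical exploration probability EPSILON, integer sentiment states clamped to [-4, 4], policies 0-5 approximated from Table 1 with rounding, a Q-table indexed by (local sentiment, action) pairs, and entries initialized to an assumed least-favorable reward of 8 (the largest possible |gs_k - ls_{k,d}|). Equations 2 and 3, referenced below, appear later in the text.

```python
import random
from collections import defaultdict

DELTA, GAMMA = 0.5, 0.5   # learning rate and discount factor, as in the paper
EPSILON = 0.05            # assumed small exploration probability
Q_INIT = 8.0              # assumed least-favorable initial reward (max |gs - ls| on [-4, 4])

def clamp(value):
    """Keep a sentiment value inside the composite range [-4, 4]."""
    return int(max(-4, min(4, round(value))))

def candidate_states(ls, gs):
    """Anticipated next global sentiments produced by policies 0-5 (approximating Table 1)."""
    return [clamp(c) for c in (
        ls,
        ls + 1 if ls < gs else ls - 1,
        ls * 2 if ls < gs else ls / 2,
        ls / gs if gs > 0 else gs,
        gs / ls if ls > 0 else gs,
        ls + 2 if ls < gs else ls - 2,
    )]

def reward(candidate_gs, history):
    """Equation 1: mean absolute difference between the anticipated global sentiment
    and the local sentiments observed so far for this keyword."""
    return sum(abs(candidate_gs - ls) for ls in history) / len(history)

def learning_trial(ls, gs, history, q_table):
    """One reinforcement-learning trial for one keyword in one document.
    Returns the new global learned sentiment and the reward for this trial."""
    history.append(ls)
    # Short-term view: the policy whose anticipated next state minimizes Equation 1.
    best_state = min(candidate_states(ls, gs), key=lambda c: reward(c, history))
    best_reward = reward(best_state, history)
    # Long-term view: the smallest historical reward recorded in the Q-table row for ls.
    row = q_table[ls]
    q_action = min(row, key=row.get) if row else best_state - ls
    q_reward = row.get(q_action, Q_INIT)
    # The smaller (more favorable) reward wins; its action becomes the optimal action.
    if best_reward <= q_reward:
        action, optimal_reward = best_state - ls, best_reward
    else:
        action, optimal_reward = q_action, q_reward
    # Infrequent random exploration: nudge the optimal action by +/- 1.
    if random.random() < EPSILON:
        action += random.choice([-1, 1])
    next_gs = clamp(ls + action)
    # Equation 2: blend the stored Q value with the newly calculated rewards.
    row[action] = (1 - DELTA) * row.get(action, Q_INIT) + \
                  DELTA * (best_reward + GAMMA * optimal_reward)
    return next_gs, reward(next_gs, history)

def document_utility(keyword_rewards):
    """Equation 3: the agent's utility for a document, the average of the keyword rewards."""
    return sum(keyword_rewards) / len(keyword_rewards)

# Per-keyword state kept by an agent (shown here for a single keyword).
q_table = defaultdict(dict)   # q_table[local_sentiment][action] -> historical reward
history = []                  # local sentiments seen so far for the keyword
gs = 0                        # global learned sentiment, updated after every trial
gs, r = learning_trial(-2, gs, history, q_table)   # e.g. local sentiment -2 observed
```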
By using this reward-based learning process, each agent incorporates feedback from its own historical experience through the local sentiment, and also from the other agents' experience by considering the global sentiment when calculating the reward.

Q[ls][gs] = (1 - \delta) \, Q[ls][gs] + \delta \cdot (calculatedBestReward + \gamma \cdot optimalReward)    (2)

Once an agent has finished parsing one of its documents, it calculates and stores the amount of utility it has earned from that document. The utility U_{i,d} of agent i for document d is calculated as the average reward given among all keywords found in that document, as shown in Equation 3, with smaller utility values indicating smaller, more optimal rewards.

U_{i,d} = \frac{\sum_{k=1}^{numKeywords} R_{k,d}}{numKeywords}    (3)

After all of the agents have finished parsing their assigned documents, the data collected from each keyword object is exported into an SQL database. Figure 3 shows the schema for our database. The data is divided into five distinct tables: Primary Keyword, Secondary Keyword, Document, Match, and Learning. This allows greater flexibility for identifying correlations between different keywords across multiple documents. Table 2 describes these tables in greater detail.

Figure 3: The schema describing our database structure.

4. EXPERIMENTAL RESULTS
During our experimental analysis we studied the effect of various parameters within our system, such as the proximity parameter and the number of agents, and we evaluated the output of our reinforcement learning algorithm, including the relationship between the sentiment values and the reward values. We produced heatmap graphs to display our results, in addition to reward and utility graphs. A heatmap is a color-coded matrix of numerical values which have been clustered across the top and the side. The heatmaps convey three distinct parameters as represented by the x-axis, the y-axis, and a scale that ranges between blue and red (cooler and warmer colors, respectively). Heatmaps are useful for comparing several different large sets of data in terms of their numeric information.

4.1 Data Sets
We conducted experiments on three data sets of varying sizes for 10, 50, and 100 agents, as shown in Table 3, consisting of research article abstracts obtained from the PubMed database. All three data sets were obtained by using our Lister tool, which first downloaded and decompressed all of the articles available through the PubMed database into NXML files. We used the PubMed data from June 2015. Lister then collected abstracts containing one of the keywords from the downloaded articles. We arbitrarily chose the keywords muscular atrophy, Alzheimer's, and diabetes, and used each separately to obtain abstracts containing these words, which were further converted into plain text format by Lister.

Table 3: The data sets used in our experiments.
Data Set (Primary Keyword) | Number of Abstracts
Muscular Atrophy           | 400
Alzheimer's                | 2700
Diabetes                   | 10000

We then ran our TextMed system on each of these data sets by specifying the number of agents to use during execution. Upon completion, TextMed outputs the data into an SQL database, and also transforms the data into visual graphs for analysis. Figure 4 shows a heatmap describing each data set that we tested, with each experiment using 50 agents.
Table 2: Description of our database tables.
Table Name        | Description
Primary Keyword   | Contains all of the keywords used by Lister to generate the documents parsed by each agent.
Secondary Keyword | Contains all of the MeSH keywords used by each agent to parse its subset of documents. Further contains global information for each keyword across all documents parsed, including total count, average calculated sentiment (prior to learning), final learned sentiment, and accumulated reward.
Document          | Contains all documents that have at least one secondary keyword in their body of text. Further contains local information for each keyword that occurs in a particular document, including local count and average calculated sentiment (prior to learning).
Match             | Contains the matching text for every secondary keyword match in each document, delimited based on the proximity value tested. Further contains the calculated sentiment for each match (prior to learning).
Learning          | Contains the learning trials performed for every secondary keyword in each document in which it appears at least once. Order is maintained based on a parseOrder parameter (a ranking metric) in each keyword object, which increments by one after a learning trial is performed. Data is stored during each learning trial to show progression as the parseOrder increases. This data includes: local calculated sentiment (prior to learning), average global calculated sentiment (prior to learning), global learned sentiment (immediately after learning), and reward given per trial.

Figure 4: Heatmap showing the number of occurrences of the top 50 keywords with sentiment values for the three data sets used in our experiments with 50 agents (A: muscular atrophy, B: Alzheimer's disease, C: diabetes). The x-axis represents the top 50 keywords, indicating which of the MeSH keywords appeared most frequently in the documents that were parsed. The y-axis represents global learned sentiment. The scale is a log-normalized representation of the global count, with warmer colors indicating many keyword matches and cooler colors indicating few keyword matches. The dark blue color (the coolest color) under a specific sentiment value indicates non-applicability, since a keyword can have only one sentiment value associated with it.

The x-axis represents the top 50 keywords by global count, indicating which of the MeSH keywords appeared most frequently in the documents that were parsed. The y-axis represents the global learned sentiment, indicating the final sentiment value for a keyword after all documents have been parsed and all reinforcement learning has completed. To improve the clarity of our results, we used a log-normalized representation of the global count, with warmer colors indicating many keyword matches and cooler colors indicating fewer keyword matches. Higher counts allow for more reinforcement learning to occur, causing agents to adjust the global sentiment more accurately as more matches are found. Thus, the heatmaps in Figure 4 indicate the estimated reliability of the sentiment scores for the keywords listed, with warmer cells showing more reliable scores for a given keyword and cooler cells showing less reliable scores. Further, they show that most of the frequently occurring keywords have negative or neutral sentiment values. We posit that this is due to the scope of the documents being analyzed, all of which are related to the biomedical field.
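As an illustration only (not the plotting code used for Figure 4), the following matplotlib sketch builds a keyword-by-sentiment count matrix from invented toy data and applies the log normalization described above before rendering it as a heatmap; the keyword list, counts, and color map are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

# counts[s][k]: how many matches of keyword k ended up with learned global sentiment s.
keywords = ["disease", "pain", "therapy", "brain"]          # illustrative keywords only
sentiments = list(range(-4, 5))                             # y-axis: global learned sentiment
counts = np.random.randint(0, 500, size=(len(sentiments), len(keywords)))  # toy counts

log_counts = np.log1p(counts)   # log-normalized global count, as described for Figure 4

fig, ax = plt.subplots()
im = ax.imshow(log_counts, cmap="coolwarm", aspect="auto")  # warm = many matches, cool = few
ax.set_xticks(range(len(keywords)))
ax.set_xticklabels(keywords, rotation=90)
ax.set_yticks(range(len(sentiments)))
ax.set_yticklabels([str(s) for s in sentiments])
ax.set_xlabel("Top keywords by global count")
ax.set_ylabel("Global learned sentiment")
fig.colorbar(im, label="log(1 + global count)")
plt.show()
```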
When SentiStrength calculates a sentiment score for a string of text, words that are generally considered bad or unfavorable are given high negative values. Such negative words include pain, disease, and death, among many others. Given the propensity of these words to appear in texts concerning ailments like Alzheimer's disease or muscular atrophy, it seems rational to expect lower sentiment scores in general.

4.2 Proximity Parameter

Figure 6: Proximity comparison for the Alzheimer's disease data set for the top 50 keywords used in a system with 50 agents. The x-axis represents the top 50 most frequently found keywords, the y-axis represents the proximity values tested, and the scale represents the average reward given to a keyword at a particular proximity value. The reward was negated for better visual representation, so that warmer colors represent more optimal reward.

For our next set of experiments, we analyzed the proximity parameter to assess its effect on the results. Figure 5 shows the effects of proximity by comparing sentiment and reward values across three distinct proximity values. In these graphs, the x-axis represents the top 50 keywords by global count, the y-axis represents the sentiment values, and the scale represents the average reward given to a keyword. This reward value is calculated by dividing the accumulated reward of the keyword by the number of documents that keyword appears in. In this and the following heatmaps (Figures 5, 6, and 10) the reward value was negated so that warmer colors represent better reward. All three heatmaps were generated by running the Alzheimer's disease data set on TextMed with 50 agents; Figure 5(A) uses a proximity of 5, Figure 5(B) a proximity of 10, and Figure 5(C) a proximity of 20.

By comparing these three heatmaps, one can see that the sentiment scores vary considerably. Figure 5(A) has generally neutral sentiment scores with a moderate number of -1 sentiment scores and a few -2 sentiment scores. Conversely, Figure 5(B) has generally -1 sentiment scores, with some neutral and -2 sentiment scores as well. Figure 5(C) also has generally -1 sentiment scores, but it has the highest number of -2 sentiment scores and only one keyword with a neutral sentiment score. As explained previously, these generally negative sentiment scores are likely due to our biomedical data source combined with the way SentiStrength assigns positive and negative scores to certain words. Further, the fluctuation of scores follows from the number of words entered into the sentiment analysis, as determined by proximity. Proximities of 5, 10, and 20 allow for 10, 20, and 40 words surrounding a keyword match to be analyzed by SentiStrength. This allows for considerable variability in the sentiment score, which can be influenced by both positive and negative words to a small or large degree.

Considering the reward values (the colors of the heatmaps), we observed that, in general, the reward values agree for individual keywords across the three tested proximity values. That is, if a specific keyword had a warmer color, indicating a better reward, in the heatmap corresponding to one proximity value, it generally had a warmer color in the heatmaps corresponding to the other proximity values as well.
Upon closer examination, we found that the sentiment values matched across all three tested proximity heatmaps for 25% of the keywords, matched in two out of the three heatmaps for 63% of the keywords, and did not match at all (with a different sentiment value assigned to the keyword for each tested proximity value) for 12% of the keywords. Thus, we concluded that the proximity is an important parameter to consider.

In order to further analyze the proximity parameter, we ran our TextMed program while testing proximity values between 1 and 25, which determine the amount of text around a keyword match to include in SentiStrength's local sentiment calculation. All global values, such as the global learned sentiment and the accumulated reward, were stored separately for each proximity value so that they did not influence each other. The heatmap in Figure 6 shows the relationship between the various proximity values and the average reward given to certain keywords. As with the previous heatmaps, the x-axis represents the top 50 keywords by global count. The y-axis represents the proximity values tested, while the scale represents the average reward given to a keyword at the proximity value tested. Again, the reward was negated, so that warmer colors correspond to more optimal reward.

The highest concentration of warmer colors in Figure 6 occurs along the row where proximity equals 1, while the highest concentration of cooler colors appears along the row with the largest proximity value tested, 25. There further appears to be a gradual transition between proximities 1 and 25, with intermediate values exhibiting lighter red hues that eventually turn blue. One can rationalize this result by first examining the way SentiStrength calculates local sentiment. SentiStrength uses different parts of speech in a string of text to grant positive and negative scores, which are then added together to form the composite score that TextMed records. When the proximity value is very small, there is a smaller chance for the string of text to contain any of the parts of speech that SentiStrength looks for when calculating its score. Therefore, the calculated sentiment is likely to remain around the same value for small proximity values, and conversely there is a larger possibility for variance as the proximity becomes larger. Further, with our reinforcement learning algorithm, smaller reward values are given when the difference between the current calculated local sentiment and the historical local sentiment values is smaller. Thus, when the proximity value was small and there was little fluctuation in the local calculated sentiment, the reward given during reinforcement learning was also small, resulting in the bright red rows in this heatmap.

Figure 5: Heatmap showing the relationship between sentiment values and the reward for the top 50 keywords for the Alzheimer's disease data set used in our experiments with 50 agents for three different proximity values (A: proximity 5, B: proximity 10, C: proximity 20). In all of these heatmaps, the x-axis represents the top 50 most frequently found keywords, the y-axis represents the sentiment values, and the scale represents the average reward given to a keyword. The reward was negated for better visual representation, so that warmer colors represent more optimal reward.

4.3 Learning Parameters

Table 4: Individual agent's accumulated utility in a system containing 10, 50, and 100 agents for the Alzheimer's disease data set.
Number of agents | Accumulated utility in TextMed per single agent
10               | 165
50               | 33
100              | 15

For our next set of experiments, we analyzed the sentiment, reward, and utility values obtained by TextMed. Figure 7 shows the individual agent utility per document for a system with (A) 10, (B) 50, and (C) 100 agents with a proximity value of 10. We observed that with more agents, each agent being responsible for processing and analyzing fewer documents, there was less fluctuation in the agents' utility values. We also noted that with a larger number of agents, each agent received less accumulated utility, as displayed in Table 4.

Figure 7: Individual agent utility for each document in a system with a varying number of agents (A: 10 agents, B: 50 agents, C: 100 agents) with a proximity value of 10 for the Alzheimer's disease data set.

In Figure 8 we can see that the reward for a keyword gets smaller (better) with each processed document, indicating that the learning process is able to improve over time. We also observe that the accumulated reward, calculated by adding the rewards received after processing each individual document, increases over time as expected.

Figure 8: (A) Reward after processing each document and (B) accumulated reward across documents for the Alzheimer's disease data set for the amyotrophic lateral sclerosis secondary keyword in a system with 100 agents.

From Figure 9 we observed that the local sentiment (returned by SentiStrength) fluctuated considerably, while the global learned sentiment, obtained after the agents completed the reinforcement learning stage for each processed document, stabilized over time. This indicates that the agents are able to learn the correct sentiment together while processing the documents.

Figure 9: (A) Local sentiment after processing each document and (B) global learned sentiment for the Alzheimer's disease data set for the amyotrophic lateral sclerosis secondary keyword in a system with 100 agents.

During our final set of experiments, we further analyzed the relationship between the sentiment and the reward values. The heatmap in Figure 10 shows the magnitude of the reward for different sentiment values assigned to various keywords. As before, the x-axis represents the top 50 keywords by global count. The y-axis represents the sentiment values, while the scale represents the average reward given to a keyword, which was again negated, so that warmer colors represent a more optimal reward. We observed that the keywords with more negative sentiment values tend to get better reward values. Since the reward indicates the learning progress, or how strongly the agents believe the keyword has been assigned a correct sentiment value, we assume that the agents were able to learn and stabilize their decision on the sentiment values for more negative words sooner.

Figure 10: Heatmap showing the relationship between sentiment values and the reward for the top 50 keywords with proximity 10 for different data sets (A: muscular atrophy, B: diabetes) in a system with 50 agents.

We also note that the final sentiment values obtained for 10, 50, and 100 agents were similar, and thus we do not reproduce all of the results from our experiments in this paper. In summary, we conclude that our multi-agent system with learning scales well with the number of agents and that, in general, it is able to efficiently and correctly learn the sentiment values of the keywords.

5. CONCLUSION
We developed a multi-agent system with reinforcement learning that is able to intelligently mine and analyze biomedical research articles. The software agents comprising our system collectively learn the sentiment pertaining to specific keywords as each agent processes its assigned subset of data. We conducted an experimental study as a proof of concept of our system, using PubMed research articles related to the keywords muscular atrophy, Alzheimer's disease, and diabetes.

In the future, we plan to modify some of the details of SentiStrength to make it more appropriate for biomedical data, or, as an alternative, develop our own sentiment analysis tool aimed at biomedical data.
We will also conduct more in-depth experiments, with each agent being tasked with processing and analyzing a data set from a different database, potentially stored on different physical machines. Finally, we would like to conduct experiments on other biomedical data, for example involving clinical data.

6. REFERENCES
[1] S. Abdallah and V. Lesser. Multiagent reinforcement learning and self-organization in a network of agents. In Proceedings of the 6th International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS '07, pages 39:1–39:8, New York, NY, USA, 2007. ACM.
[2] S. Abdallah and V. Lesser. Non-linear dynamics in multiagent reinforcement learning algorithms. In Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems - Volume 3, AAMAS '08, pages 1321–1324, 2008.
[3] S. W. Baik, J. Bala, and J. S. Cho. Agent based distributed data mining. In Parallel and Distributed Computing: Applications and Technologies, pages 42–45. Springer, 2005.
[4] S. Chaimontree, K. Atkinson, and F. Coenen. Multi-agent based clustering: Towards generic multi-agent data mining. 6171:115–127, 2010.
[5] J. T. Chang, H. Schütze, and R. B. Altman. GAPSCORE: finding gene and protein names one word at a time. Bioinformatics, 20(2):216–225, Jan. 2004.
[6] S. Chao and F. Wong. A multi-agent learning paradigm for medical data mining diagnostic workbench. Data Mining and Multi-agent Integration, 1:177, 2009.
[7] G. Di Fatta and G. Fortino. A customizable multi-agent system for distributed data mining. In Proceedings of the 2007 ACM Symposium on Applied Computing, SAC '07, pages 42–47, 2007.
[8] C. Giannella, R. Bhargava, and H. Kargupta. Multi-agent systems and distributed data mining. pages 1–15, 2004.
[9] J.-J. Kim and D. Rebholz-Schuhmann. Categorization of services for seeking information in biomedical literature: a typology for improvement of practice. Briefings in Bioinformatics, 9(6):452–465, 2008.
[10] F. Liu, T.-K. Jenssen, V. Nygaard, J. Sack, and E. Hovig. FigSearch: a figure legend indexing and classification system. Bioinformatics, 20(16):2880–2882, Nov. 2004.
[11] Z. Lu. PubMed and beyond: a survey of web tools for searching biomedical literature. Database, 2011, 2011.
[12] C. Nédellec, R. Bossy, J.-D. Kim, J.-J. Kim, T. Ohta, S. Pyysalo, and P. Zweigenbaum. Overview of BioNLP Shared Task 2013. In Proceedings of the BioNLP Shared Task 2013 Workshop, pages 1–7, 2013.
[13] U.S. National Library of Medicine. MeSH keywords. http://www.nlm.nih.gov/mesh/, June 2015.
[14] NCBI PubMed. http://www.ncbi.nlm.nih.gov/pubmed/, June 2015.
[15] T. C. Rindflesch, H. Kilicoglu, M. Fiszman, G. Rosemblat, and D. Shin. Semantic MEDLINE: An advanced information management application for biomedicine. Information Services & Use, 31(1-2):15–21, 2011.
[16] M. Thelwall. Heart and soul: Sentiment strength detection in the social web with SentiStrength. Proceedings of the CyberEmotions, pages 1–14, 2013.
[17] M. Thelwall. SentiStrength. sentistrength.wlv.ac.uk/, June 2015.
[18] C. Tong, D. Sharma, and F. Shadabi. A multi-agents approach to knowledge discovery. In Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Volume 03, pages 571–574, 2008.
[19] G. Tsoumakas and I. Vlahavas. Distributed data mining. Encyclopedia of Data Warehousing and Mining, 2009.
[20] I. H. Witten. Text mining. Practical Handbook of Internet Computing, pages 14-1, 2005.
[21] A. S. Yeh, L. Hirschman, and A. A. Morgan. Evaluation of text data mining for database curation: lessons learned from the KDD Challenge Cup. CoRR, cs.CL/0308032, 2003.
[22] H. Yu and E. Agichtein. Extracting synonymous gene and protein terms from biological literature. Bioinformatics, 19(suppl 1):i340–i349, 2003.
[23] G. Zhou, J. Zhang, J. Su, D. Shen, and C. Tan. Recognizing names in biomedical texts: a machine learning approach. Bioinformatics, 20(7):1178–1190, 2004.