Jillian Green

Volca-Net: A Collaborative Learner Network System

Abstract

Mount Erebus, located in Antarctica, is an ideal volcano for study because of its exposed lava lake, which constantly swirls with fresh magma. A network of sensors has been set up to record events at the volcano, but because of its environment, the sensors often record not only volcanic events but also icequakes and other events occurring at the South Pole. The effort required for human classification of the data has led scientists to turn to software for automatic event classification. Previous work, however, has analyzed the events only at a central offsite data archive, after the events have occurred. Our system has the potential to analyze events correctly on the spot, meaning that each sensor's data collection rates can be adjusted as an event is occurring. Our project consists of a network of these sensors, each running machine learning clustering algorithms and collaborating with the others by exchanging information about their data. Our goal was to evaluate the usefulness of collaboration among sensors and to determine whether, by querying its neighbors, each sensor can reach peak accuracy faster.

Introduction

The study of volcanoes has proven quite valuable, not only leading scientists to fascinating discoveries about Earth's formation and current state, but also offering significant information about the nature and hazards of volcanoes on Earth as well as on other planets. Mount Erebus in Antarctica is an excellent volcano for study because of its harsh environmental surroundings, consistent activity, and exposed lava lake (Aster et al., 2004). Scientists have been drawn to Erebus for its likeness to other volcanoes, such as Vesuvius, Nyiragongo, and Erta'Ale, as well as for its lava lake, which constantly bubbles with fresh magma.
Because of this, a network of seismometers, broadband sensors, infrasonic microphones, infrared sensors, tiltmeters, and video cameras monitors Mount Erebus and its lake at all times (Aster et al., 2004). One of the questions that emerges from this rich collection of data is exactly what we can determine from it. One capability that would prove extremely beneficial is the ability to classify the different events taking place at the volcano based on their characteristics. Understanding what features distinguish an explosive eruption from an ash-vent eruption, for example, would be highly useful. It would improve our ability to predict such events and to identify changes that take place before they occur. If we could determine the type of event taking place as it is happening, or even before it occurs, we might not only adjust the sensors' responses accordingly but also save the many hours it takes for humans to classify these events manually.

Machine learning algorithms are ideal for such event classification because they analyze data in order to construct automated classifiers or to identify trends and patterns in the data. In general, machine learning systems are trained on a data set either to learn an appropriate model from which to extract information or to learn to distinguish between relevant characteristics of the data for classification. After training, the system is usually evaluated on a separate test set to measure accuracy and performance before being used on new data. Only one type of machine learning algorithm, the neural network, has previously been applied to volcanic data (Langer et al., 2003; Scarpetta et al., 2005). These experiments, developed for classifying volcanic events, have always taken place post-event, on a collection of previously gathered data.
Part of what makes our project valuable is that it offers event classification while the event is taking place, and possibly even before, if there is enough precursory activity. This not only saves the time and resources it takes to classify data manually; it also supplies knowledge of the event early enough to adjust data collection techniques and to aid other event-related experiments. The ability to correctly classify events at Erebus before or during their onset would open up a whole new realm of experimental capabilities for the scientists currently studying its behavior.

Our project aims to correctly identify volcanic events through collaborative clustering among the network of sensors at Erebus. The sensors query each other for data relationships as they record, receiving information about the event currently taking place. We developed methods for each learner to choose two events, query its neighbors for the relationship between them, and create pairwise constraints for these events based on the responses received. The motivation behind this collaboration is that, with more information provided by neighboring sensor locations, each sensor may analyze its own data more accurately and thus reach correct conclusions faster than when working in isolation.

Overall, incorporating collaboration into a network of clusterers yields improvement. The rate at which the network advances through collaboration depends largely on the correctness of the constraints adopted, which in turn depends heavily on the accuracy of the neighboring clusterers and on the points actively selected for query. Collaboration has proven to be a new and helpful tool for machine learning algorithms, and further improvement within a collaborating network of clusterers is probable with variations on this method.
Theoretical Background

Java

Java is a programming language developed by Sun Microsystems that derives much of its syntax from C and C++. It is most useful for its platform portability, which makes it ideal for networking and for other projects that span multiple platforms. Most Java technologies are free software under the GNU General Public License, making them easy to obtain and maintain. We chose Java to construct a simulation of the sensor network currently in place at Mount Erebus because of its cross-platform compatibility and because of Weka's machine learning implementations in Java.

Weka

Weka is an open-source Java software library developed by the University of Waikato for machine learning and data mining (Witten and Frank, 2005). Its algorithms can be applied standalone to an input dataset or called from your own Java code. Weka includes methods that support clustering, visualization, pre-processing, classification, and other machine learning tasks. Weka was helpful to us as a tool for representing the volcano sensor data as a set of Instances whose features can be manipulated. Its graphical user interface was also useful for determining the most pertinent data features to extract, as being able to observe the feature space helps make certain data relationships apparent.

The Data

Weka machine learning algorithms typically take in data organized into an .arff file (attribute-relation file format) that contains any number of attributes about the data and, if it is training data, a class attribute. The class attribute is the true class of the data item. In supervised machine learning algorithms, the class attribute assists in training, but in unsupervised algorithms (such as clustering) the class attribute is unknown to the algorithm and is used only for performance evaluation.
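For illustration, a minimal .arff file for this kind of event data might look like the following. The attribute names echo features discussed later in this report (peak frequency and offset), but the relation name and the data values themselves are invented for this example, not taken from the actual Erebus files:

```
% Hypothetical example of the attribute-relation file format
@relation erebus-events

@attribute peak_frequency numeric
@attribute offset numeric
@attribute class {eruption, icequake}

@data
2.4, 0.8, eruption
6.1, 0.2, icequake
```

A clustering algorithm would ignore the class column during learning and use it only afterward, to score the partition it produced.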
We developed and tested our project using benchmark UCI data sets (Asuncion and Newman, 2007) to get an idea of performance before working on the actual volcanic data. We were able to obtain data from the sensor nodes at Erebus by working with the New Mexico Institute of Mining and Technology. They offered us a set of data collected over the course of two months at the volcano, relating to 36 eruptions and 6 icequakes, already painstakingly classified by hand by their researchers. We used this labeled data set to evaluate our network of machine-learning clusterers using measures of agreement between data partitions, which will be explained in further detail later.

We analyzed the data from the volcano in order to extract the most relevant features and use them for clustering. Figure 1 shows an explosive eruption from January 1st, 2006, as recorded by six different stations. One can observe how each sensor node recorded the event from its own perspective, as each graph differs to some degree even though they were all documenting the same event. These differences come from the sensor's location relative to the event and from its method of data collection. Based on these perspectives, we tried to localize where the events were happening, and also how an event's features and impact differed at different areas of the volcano.

Figure 1: Seismic data of a January 1st, 2006 explosive eruption as recorded at the six stations CON, EIS, LEH, NKB, RAY, and HOO. The x-axis is time, in seconds. Each sensor records the event differently, depending on the event's location in relation to the sensor.

By analyzing the data, we discovered that certain attributes seemed to most clearly define the difference between an icequake and an eruption (the two primary events we were trying to distinguish), such as peak seismic frequency and offset (elapsed time from detection at EIS to detection at the current node).
However, these attributes are not always uniform, as not all of the sensors measure the same thing. We received data that came predominantly from four seismic sensors and one infrasonic (acoustic) sensor, with the former recording fluctuations in seismic frequencies and the latter documenting pitch variations. One of the goals of our project was to determine the value of the acoustic sensor by running multiple clustering rounds with and without its data. We wanted to determine whether the acoustic data helped improve the network's ability to distinguish between eruptions and icequakes, or whether it actually confused the rest of the network because of its dissimilar data. We expect eruptions to have an acoustic signal but icequakes to be silent.

Algorithm Descriptions and Implementations

Our project focuses on collaborative machine learning and topology exploitation (Lane and Wagstaff, 2007). The largest contribution of this project is experimenting with collaborative learners, as previous work has only analyzed data from a single station in isolation (Langer et al., 2003; Scarpetta et al., 2005). A machine learning system is generally trained on a portion of the data until it reaches peak accuracy, and is then tested for correctness. The idea behind this collaboration of machine learners is that each sensor node at Erebus is an individual machine learner, running its algorithm on the data it collects at its location, but maintaining contact with the other sensors (learners) at the volcano. Ideally, each node will query the other learners in the network, obtain information from them about events currently occurring at the volcano, and individually reach peak accuracy faster because of this added input. Another major contribution of this project is extending the evaluation of volcanic data to a broader class of machine learning algorithms, particularly clustering.
Clustering is a machine learning technique that groups data together based on similarity, even when pre-existing labels are not available. It is especially useful for extracting trends, for it determines natural relationships between the data, identifying populations that tend to group together. Data relating to similar event types are expected to cluster together based on their input similarities. Since clustering algorithms group data without prior labels, they have the potential to single out irregular types of events, as well as new and different ones that have never been identified before.

We chose to incorporate two different versions of k-means clustering, the most straightforward of clustering algorithms. K-means clusters as follows. First, the number of clusters k is chosen, and k cluster centers are randomly placed in the feature space, the area where all of the data is plotted. Next, each data point is assigned to the cluster of the centroid it is closest to. In the final step of the iteration, each center moves to the mean of all the points it owns. This process repeats until no item assignments change, at which point convergence is reached.

Figure 2: The k-means clustering objective function: the sum, over all clusters, of the squared distances from each data point to the centroid of its assigned cluster.

The two versions of k-means clustering we used are called PCKMeans and MPCKMeans. Both incorporate constraints into k-means and provide significant improvement on the original algorithm (Bilenko et al., 2004). A pairwise constraint takes two data items and assigns either a "must-link" or a "cannot-link" relationship to them. Pairwise constraints are usually developed from pre-existing knowledge of the data, but in our project they were developed from neighboring sensors' answers to queries issued about the data.
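As a minimal sketch, the plain k-means loop described above can be written as follows. Our actual implementation used Java and Weka; this is an independent Python illustration, and all names in it are our own:

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means: random centers, assign, recenter, repeat until stable."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)     # random initial centroids
    assign = None
    for _ in range(iters):
        # assign each point to its nearest center
        new_assign = [min(range(k), key=lambda j: dist2(p, centers[j]))
                      for p in points]
        if new_assign == assign:        # no assignment changed: converged
            break
        assign = new_assign
        # move each center to the mean of the points it owns
        for j in range(k):
            owned = [p for p, a in zip(points, assign) if a == j]
            if owned:
                centers[j] = tuple(sum(c) / len(owned) for c in zip(*owned))
    return assign, centers
```

On two well-separated groups of points, the loop converges in a few iterations to the natural two-cluster split regardless of which points are drawn as initial centers.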
For example, sensor node 1 may query the other sensors in the network: "What kind of relationship did you obtain for data items 36 and 45?" The other nodes respond with either "I grouped them together in the same cluster" or "I put them in different clusters," which translates to a must-link or a cannot-link, respectively. We implemented several unification strategies for how the original node deals with this new information. One unification strategy is majority vote (MV), where a node receives pairwise relationships from all other nodes in the network and adopts the constraint chosen by the majority of the other learners. Continuing the example above, if sensor node 2 responds with a must-link but nodes 3 and 4 both respond with a cannot-link, then node 1 (the node that originally issued the query) will create a cannot-link constraint between items 36 and 45, since that link received more votes. With majority vote, a tie is possible if there is an odd number of sensors (which means an even number of voting neighbors). In the case of a tie, the querying learner abstains for that round and does not adopt a new constraint at all. Another unification strategy is consensus vote (CV), where the querying learner adopts a new constraint only if every other learner in the network agrees on the type of link between the two data points. The motivation behind consensus vote is that while CV learners may abstain more often and thus accept fewer constraints, the constraints they do adopt are more likely to be correct.

PCKMeans makes use of these pairwise constraints for clustering as well as for centroid initialization. It clusters with respect to the constraints and calculates a penalty for any constraint that must be violated.
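The two unification strategies above can be sketched in a few lines. The function name and the "must"/"cannot" vote encoding are illustrative, not taken from our Java implementation:

```python
def unify(votes, strategy="MV"):
    """Combine neighbors' answers ('must' or 'cannot') into one constraint.

    Returns 'must', 'cannot', or None when the learner abstains
    (a tie under majority vote, or any disagreement under consensus vote).
    """
    must = votes.count("must")
    cannot = votes.count("cannot")
    if strategy == "MV":
        if must > cannot:
            return "must"
        if cannot > must:
            return "cannot"
        return None          # tie: abstain this round
    if strategy == "CV":
        if must == len(votes):
            return "must"
        if cannot == len(votes):
            return "cannot"
        return None          # no consensus: abstain
    raise ValueError(strategy)
```

With the three-neighbor example from the text (one must-link vote, two cannot-link votes), majority vote yields a cannot-link, while consensus vote would abstain.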
Sometimes constraints must be violated in order to achieve convergence, and PCKMeans chooses the violations that result in the smallest penalty while still giving the least variance within its clusters. PCKMeans also uses the given pairwise constraints to choose better initial cluster centers (instead of the random initialization of regular k-means), which can lead to faster convergence and higher accuracy. For example, if there is a cannot-link constraint between two data points, those two points may be initialized as two of the centroids themselves. Similarly, if there is a must-link constraint between two data items, a cluster center may be initialized at the point exactly between them. Both of these scenarios improve on random initialization, as they may cut out a couple of iterations, or even save a centroid from being terribly misplaced (Bilenko et al., 2004).

Figure 3: Example of how a pairwise constraint ("What if we knew that these two points should be in the same cluster?") would force the clusters to be arranged differently from the original cluster assignments obtained during unconstrained clustering.

MPCKMeans similarly uses pairwise constraints, but also incorporates learned distance metrics into the algorithm. Metric learning generally adjusts the distance metric within the feature space to satisfy the training data, that is, the data that has been correctly labeled (Bilenko et al., 2004). Since our learners are not given any labeled data to begin with, MPCKMeans instead adjusts the distance metric based on the pairwise constraints and the unlabeled data. For example, if the clusterer has a must-link constraint between two points very far apart in the feature space, this suggests that the metric should be adjusted in some way to make these points naturally lie closer together.
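The constraint-informed initialization described above can be sketched as follows. This mirrors only the two examples given in the text (cannot-link endpoints become centers; a must-link midpoint becomes a center); the actual PCKMeans initialization is more elaborate, and the function name and fill-in rule for leftover slots are our own invention:

```python
def seed_centers(points, k, must_links, cannot_links):
    """Pick initial centers from constraints instead of at random.

    Endpoints of a cannot-link become separate centers, and the midpoint
    of a must-link becomes a single center. Any remaining slots are filled
    deterministically from leftover points (a real implementation might
    choose them randomly or from constraint neighborhoods)."""
    centers = []
    for i, j in cannot_links:
        if len(centers) + 2 > k:
            break
        centers.append(points[i])       # cannot-linked points must end up
        centers.append(points[j])       # in different clusters
    for i, j in must_links:
        if len(centers) >= k:
            break
        mid = tuple((a + b) / 2 for a, b in zip(points[i], points[j]))
        centers.append(mid)             # one center between must-linked points
    for p in points:                    # fill any remaining slots
        if len(centers) >= k:
            break
        if p not in centers:
            centers.append(p)
    return centers[:k]
```

A cannot-link between the first and last points of a data set, for instance, immediately places two centers at opposite ends of the feature space instead of leaving their positions to chance.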
The same idea applies to cannot-link constraints, where two very near points that should not be in the same cluster suggest a stretching of the feature space. Metric learning has proven to be an effective form of semi-supervised clustering, as not all attributes affect distance measurements in the same way. In this report, however, we focus primarily on the results obtained with PCKMeans.

It has also been shown that actively choosing the data points to query for constraints is a significant improvement over random selection (Basu et al., 2004; Xu et al., 2005). For example, if a clusterer receives a constraint about two items that are already in or near the centers of its clusters, the constraint might not change its clustering at all, but a constraint about two points it is less confident about might cause it to rearrange its centers to adapt to the new information. We developed several active selection strategies that a clusterer uses to choose which data points to query its neighbors about. The goal of this active selection is to choose the points that are most unknown to the clusterer, presumably the points that lie on the cluster borders rather than near a centroid. The tricky part about active selection within our network is that if the query is too difficult, the neighboring sensors are more likely to get the link wrong, but if the query is too easy (or already known), it will have no effect on the clustering. To implement active selection, after initial clustering the clusterer calculates, for each point, its distances to the two closest cluster centers and takes their difference; the point with the smallest difference is the one situated most evenly between two cluster centers.
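The distance-difference computation just described can be sketched as follows. The strategy names mirror those used in this report, but the function itself is an illustrative Python sketch, not our Java code:

```python
def select_query_pair(points, centers, strategy="activeLimboClose"):
    """Pick two points to ask neighbors about.

    A 'limbo' point sits nearly equidistant from its two closest centers
    (smallest distance difference); a 'close' point sits right next to
    one center (most confident assignment)."""
    def dists(p):
        # sorted squared distances from p to every center
        return sorted(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
    margins = [(dists(p)[1] - dists(p)[0], i) for i, p in enumerate(points)]
    limbo = min(margins)[1]                       # smallest margin: most uncertain
    if strategy == "activeTwoLimbos":
        second = min(m for m in margins if m[1] != limbo)[1]
        return limbo, second
    # activeLimboClose: pair the limbo point with the most confident point
    close = min((dists(p)[0], i) for i, p in enumerate(points) if i != limbo)[1]
    return limbo, close
```

With centers at the two ends of a line of points, the midpoint is selected as the limbo point and a point sitting on a center is selected as its confident partner.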
We developed a few forms of active selection: one actively chooses two of these in-between points (which we called "limbo" points), and another chooses one "limbo" point and pairs it with one point that is very near a cluster center. The motivation for this choice is that if a "limbo" point is paired with a better-known point (a point whose cluster membership is held with high confidence), the pairwise relationship between the two may be very helpful in determining the cluster to which the "limbo" point should belong. Even with three or more clusters, this form of active selection would still inform the clusterer of some important data relationships. These forms of active selection were compared with random selection, in which the clusterer randomly chooses two of its data instances to query its neighbors about. There is also another option incorporated into our experiment, called broadcast. When a learner assembles a new constraint from its neighbors' answers to its query, it can either keep the constraint for itself or broadcast it to the other learners. Broadcasting can be very effective in boosting performance if the constraints being passed around are correct, or can have a very negative effect if the constraints are noisy.

Performance Evaluation

Clustering evaluation is usually done in terms of the Adjusted Rand Index, which measures similarity between data partitions through pairs of data items (Hubert and Arabie, 1985). Since clustering is done without labels, one cannot evaluate it in terms of the percentage of items correctly classified. The Adjusted Rand Index can be used to compare the similarity between two partitions or, if the true labels of the data are known, to calculate similarity to the correct partition of the data.
To calculate the Rand Index, you go through every pair of items in the data and check whether the two items were put into the same cluster or into different clusters. You then increment the appropriate variable according to how the other clusterer being compared treated those items. If both clusterers put them together, or both put them apart, the pair increments a or b, the variables that count the pairs clustered similarly by both clusterers. If you are comparing your clusterer against the true data labels, you simply check whether the pair of items falls in the same class or in different classes. After going through all of the pairs of data items, the number of similarly paired items (a + b) is divided by the total number of pairs (see Figure 4 below).

a = same cluster in X, same cluster in Y
b = different clusters in X, different clusters in Y
c = same cluster in X, different clusters in Y
d = different clusters in X, same cluster in Y

Figure 4: Upper left, the variables defined for calculating the Rand Index between two clusterers, X and Y. Lower, the Rand Index, (a + b) / (a + b + c + d); upper right, the Adjusted Rand Index, which compensates for agreement by chance.

The Adjusted Rand Index compensates for agreement by chance, while the regular Rand Index simply calculates the percentage of correctly grouped pairs. While the Rand Index gives a number between zero (not similar at all) and one (exactly the same partitions), the Adjusted Rand Index can go below zero if the agreement is less than that expected by random chance. We used the Adjusted Rand Index to evaluate our network, comparing our partitions against the true classes in the original data set from New Mexico Tech.

Figure 5 compares the performance of oracle, which provides an idea of the potential of each selection strategy because every data pair queried is given the correct link; majority vote (MV); consensus vote (CV); and self, in which the clusterer does not collaborate and uses only constraints from its own clustering.
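The pair-counting evaluation described above can be sketched as follows, using the standard chance-correction formula for the Adjusted Rand Index (the function names are illustrative):

```python
from itertools import combinations
from math import comb
from collections import Counter

def rand_index(x, y):
    """Fraction of item pairs on which two labelings x and y agree,
    i.e. (a + b) / (a + b + c + d) in the notation of Figure 4."""
    agree = total = 0
    for i, j in combinations(range(len(x)), 2):
        total += 1
        agree += (x[i] == x[j]) == (y[i] == y[j])   # together/apart in both
    return agree / total

def adjusted_rand_index(x, y):
    """Rand index corrected for chance via the contingency-table formula."""
    n = len(x)
    sum_ij = sum(comb(v, 2) for v in Counter(zip(x, y)).values())
    sum_a = sum(comb(v, 2) for v in Counter(x).values())
    sum_b = sum(comb(v, 2) for v in Counter(y).values())
    expected = sum_a * sum_b / comb(n, 2)           # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)
```

Identical partitions score 1.0 even when the cluster labels are permuted, and a partition that agrees with the reference less often than chance scores below zero, as noted above.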
In this experiment, one pairwise constraint is adopted by one clusterer at each round. One can see from these results that collaboration is indeed useful; the clusterers that have only themselves to rely on (instead of their neighbors) cannot improve at all after the initial clustering. These results are from a four-learner network clustering the data from Mount Erebus, using the activeLimboClose selection strategy, in which each learner queries a pair consisting of one "limbo" point (a point that falls directly between two cluster centers, creating doubt about which cluster it belongs to) and one point that is close to a cluster centroid. Figure 5 shows that there is significant room for improvement as far as majority and consensus vote are concerned; however, the fact that the oracle can improve so much shows promise for collaborative machine learning in general. Collaboration still improves the performance of MV (even if only for certain rounds), although not as much as oracle. It should also be noted that the data collected from Mount Erebus is a particularly difficult dataset.

Figure 5: PCKMeans with a 4-learner network. Each unification strategy (oracle, MV, CV, and self) is shown using the activeLimboClose selection strategy. Average over 10 trials.

One might also notice that majority vote tends to peak at some point and then drop below where it began. This is possibly due to the network as a whole converging toward a wrong solution because of incorrect constraints. As each learner adopts new constraints based on its neighboring sensors' responses, it learns according to their solutions, regardless of whether those solutions are correct. The different selection strategies had a great impact on the results, as can be seen in Figure 6. Even with only three different strategies, oracle and MV both differ considerably in performance.
It is interesting to note that random and activeTwoLimbos perform best with oracle, whereas activeLimboFar does not do as well. But with MV (where the constraints are sometimes incorrect), activeLimboFar does much better than random selection.

Figure 6: PCKMeans on the MEVO data set with a 4-learner network, showing three different selection strategies: random, activeLimboFar, and activeTwoLimbos. Average over 10 trials.

The results obtained from the different selection strategies caused us to consider the correctness of the constraints as a decisive factor in clustering performance. Because of this, we delved into constraint cleanup: trying to ensure that the constraints adopted by clusterers in the network do not contradict each other. We developed a method that goes through each clusterer's list of constraints and calculates all of the implied constraints. An implied constraint is a constraint that follows logically from other constraints. For example, if a clusterer has a must-link constraint between items 1 and 2, and a must-link constraint between items 2 and 3, then it follows that there is an implied must-link constraint between items 1 and 3 as well. In this method, we calculate all of the implied constraints associated with each newly generated constraint and delete the older constraint in the case of a conflict. (It is assumed that, because learning is happening, the newer constraints are more accurate.) Figure 7 shows PCKMeans running a 10-learner network on the Mount Erebus data, emphasizing the differences made by constraint cleaning. The figure shows that tidying up each learner's constraint pool does make a difference, at times keeping the average per-learner ARI in the network slightly higher than it would otherwise be, and also smoothing out some dips in ARI. It is also interesting to note how the cleaning has more or less impact at different times during the experiment according to the selection strategy.
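The implication and cleanup rules described above can be sketched as follows. The two rules encoded here are that must(a, b) and must(b, c) imply must(a, c), and that must(a, b) and cannot(b, c) imply cannot(a, c); the function names and the constraint encoding are illustrative, not from our Java implementation:

```python
def implied_constraints(constraints):
    """Close a list of (kind, (i, j)) constraints under the implication rules."""
    must = {frozenset(p) for k, p in constraints if k == "must"}
    cannot = {frozenset(p) for k, p in constraints if k == "cannot"}
    changed = True
    while changed:                      # iterate until no new constraint appears
        changed = False
        for ab in list(must):
            for bc in list(must):       # must + must sharing one item -> must
                shared = ab & bc
                if len(shared) == 1 and ab != bc:
                    ac = (ab | bc) - shared
                    if ac not in must:
                        must.add(ac)
                        changed = True
            for bc in list(cannot):     # must + cannot sharing one item -> cannot
                shared = ab & bc
                if len(shared) == 1:
                    ac = (ab | bc) - shared
                    if ac not in cannot:
                        cannot.add(ac)
                        changed = True
    return must, cannot

def adopt(pool, new):
    """Add a new constraint, dropping any older one it directly contradicts.

    pool is ordered oldest-to-newest; newer constraints are trusted more."""
    kind, pair = new
    opposite = "cannot" if kind == "must" else "must"
    pool[:] = [c for c in pool
               if not (c[0] == opposite and set(c[1]) == set(pair))]
    pool.append(new)
```

For instance, from must(1, 2), must(2, 3), and cannot(3, 4), the closure yields the implied constraints must(1, 3), cannot(2, 4), and cannot(1, 4).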
For random selection, constraint cleaning barely changes performance until at least round 450, whereas for activeTwoLimbos it changes the constraint pools enough to affect performance by round 300. This indicates that certain selection strategies tend to generate more conflicting constraints than others, and that there is a need to reconcile them.

Figure 7: PCKMeans with a 10-learner network on the MEVO data, showing the effects of constraint cleaning compared to no constraint cleaning. Average over 10 trials.

The Mount Erebus data set is a very difficult one to cluster, as some of the events quite literally map directly on top of each other in the feature space. In order to get a more accurate idea of the contribution of collaborative machine learning, we experimented with removing a couple of events, making the data set slightly easier to cluster. We hypothesized that a higher initial ARI would provide more room for learning, as a low start is particularly difficult to recover from. We developed this idea by studying the results of testing the network on the UCI data sets (Asuncion and Newman, 2007), where we noticed more improvement when the initial ARI was higher, most likely because the constraints are then more likely to be correct. The modified data set caused the initial ARI (with no collaboration) to jump from approximately 17 to 20, and with majority vote the network was able to reach a peak ARI of about 49 instead of around 25. This shows the potential of collaborative clustering networks in general, as much learning is achieved within the network without any outside source of information. Another variation on our techniques is the number of clusterers placed at each sensor node in the network.
Even though they would be clustering the same data, because randomly selected cluster centers are at the root of k-means clustering, their clusterings might still differ, and the added clusterers might be able to offer more information to the network. Figure 8 shows the results of placing one, two, and three clusterers at each node. The contribution of these additional clusterers is minimal but present nonetheless: the peak ARI reached with majority vote jumps from roughly 48 to over 50 with the added clusterers. In these figures, active selection refers to the activeTwoLimbos strategy.

Figure 8: Results for PCKMeans clustering with one, two, and three clusterers at each sensor node (panels, left to right: 1, 2, and 3 clusterers per sensor). Active selection refers to activeTwoLimbos. Average over 10 seeds.

Conclusions and Future Work

This project has shown that collaboration can be quite beneficial to a network of machine learners. Even if the constraints adopted are faulty, learning can and does still occur. Future research in this area should focus on ensuring correct constraints, perhaps by incorporating link or learner confidences. Further work can also be done with querying: each clusterer might choose its own high-confidence points to broadcast to the others in the network, rather than querying based on its own pool. This would in effect reverse the type of constraint adoption strategy we have developed, but it might offer some different collaborative contributions. Other variations on constraint adoption might also be researched further. This experiment should also be run on a larger dataset from Mount Erebus, when one becomes available.
The difficulty of the data being clustered has a large impact on the performance of the collaborating clusterers, as can be seen when comparing the performance of the network on the Mount Erebus data with the UCI datasets and the slightly smaller Erebus data set. It is probable that with more data available for clustering, the clusterers would converge to better solutions as a network.

Acknowledgements

This research was carried out in part at the Jet Propulsion Laboratory, California Institute of Technology, and was sponsored in part by the Summer Undergraduate Research Fellowship program and the National Aeronautics and Space Administration.

References

Aster, R., et al. (2004). "Real-time data received from Mount Erebus volcano, Antarctica." Eos, 85:10, p. 97-104.

Asuncion, A. and Newman, D.J. (2007). UCI Machine Learning Repository [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, School of Information and Computer Science.

Basu, S., Banerjee, A., and Mooney, R.J. (2004). "Active Semi-Supervision for Pairwise Constrained Clustering." Proceedings of the SIAM International Conference on Data Mining (SDM-2004), Lake Buena Vista, FL.

Bilenko, M., Basu, S., and Mooney, R.J. (2004). "Integrating Constraints and Metric Learning in Semi-Supervised Clustering." Proceedings of the 21st International Conference on Machine Learning (ICML-2004), p. 81-88, Banff, Canada.

Hubert, L. and Arabie, P. (1985). "Comparing Partitions." Journal of Classification, vol. 2, p. 193-218.

Lane, T. and Wagstaff, K. (2007). "Synergistic Machine Learning: Collaboration and Topology Exploitation in Dynamic Environments." NSF proposal to the Division of Information and Intelligent Systems, accepted July 24, 2007.

Langer, H., Falsaperla, S., and Thompson, G. "Application of Artificial Neural Networks for the classification of the seismic transients at Soufriere Hills volcano, Montserrat."
Geophysical Research Letters, vol. 30, no. 21, 2003.

MacQueen, J.B. (1967). "Some Methods for Classification and Analysis of Multivariate Observations." Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press, p. 281-297.

Scarpetta, S., Giudicepietro, F., Ezin, E.C., Petrosino, S., Del Pezzo, E., Martini, M., and Marinaro, M. (2005). "Automatic Classification of Seismic Signals at Mt. Vesuvius Volcano, Italy, Using Neural Networks." Bulletin of the Seismological Society of America, vol. 95, no. 1, p. 185-196.

Witten, I.H. and Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques, 2nd edition. Morgan Kaufmann, San Francisco.

Xu, Q., desJardins, M., and Wagstaff, K.L. (2005). "Active Constrained Clustering by Examining Spectral Eigenvectors." Proceedings of the Eighth International Conference on Discovery Science, p. 294-307.