Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
School of Computer and Information Science Minor Thesis Analyzing the fragmentation of coselection data due to volatile search results Nathan Ronald Williams 2012 Abstract Initial investigations have indicated coselections are an effective way to cluster web pages under a shared meaning. The idea is that URLs coselected under the search term tend to be the result of the same objective by the user. Though there are some variances, it has been shown to be strongly effective at generating sense-singular results given a high enough threshold. While the clusters may be sense-singular, there are frequently numerous clusters generated for the same sense. Approximately one sense-singular cluster per sense should be expected and hence counting clusters would indicate ambiguity in search terms. However, in many of the cases, search terms appear to be ambiguous because they have multiple clusters in the results, even though that should not be the case. A key factor speculated is the effect of time on the top results as they are subject to change. This could be causing temporal fragmentation of clusters since there is only a certain window of opportunity for two URLs to both be high enough in the search results to be selected together – they have to be in the top N (usually 10) results to be coselected. Through using the time stamp associated with the data, we aim to uncover the evolution of clusters in time. The initial proposal of this paper was to first to analyse the effect of time and whether it is having a big effect on the segregation of clusters. The activity of links and clusters were plotted out with analysis of whether there was enough activity in common between clusters to suggest they have had sufficient coselection chance. Expanding on that was a proposal of a potential solution by first performing clustering and then second joining disparate clusters by lowering the threshold for clusters that have few URLs active at the same time. While results were inconclusive due to a lack of data to collect, it is hoped that the methodologies formed will be relevant to future studies as greater data is collected from various sources. Table of Contents 1 2 Introduction .......................................................................................................................................... 1 1.1 Motivation..................................................................................................................................... 2 1.2 Research Questions ...................................................................................................................... 2 Literature Review .................................................................................................................................. 3 2.1 2.1.1 Clickthrough .......................................................................................................................... 3 2.1.2 Detecting ambiguity .............................................................................................................. 5 2.2 3 Related Work ................................................................................................................................ 8 2.2.1 Coselections .......................................................................................................................... 8 2.2.2 Search Engine Results ........................................................................................................... 8 Methodology......................................................................................................................................... 9 3.1 TimeStamping ............................................................................................................................... 9 3.2 Measuring Activity ...................................................................................................................... 10 3.2.1 Loss of activity ..................................................................................................................... 10 3.2.2 Loss of URLs......................................................................................................................... 10 3.3 4 Background ................................................................................................................................... 3 Cluster Disparate......................................................................................................................... 11 Results and Discussion ........................................................................................................................ 12 4.1 Analysis of Data ........................................................................................................................... 12 4.2 Measurement of activity ............................................................................................................. 14 4.3 Measurement of Loss of URLs .................................................................................................... 14 4.4 Cluster Disparate......................................................................................................................... 14 5 Conclusion ........................................................................................................................................... 15 6 References .......................................................................................................................................... 16 7 Appendix ............................................................................................................................................. 20 7.1 Appendix A: Coselection Count of terms with at least one cluster ............................................ 20 7.2 Appendix B: Accuracy of cluster disparate on URLs with an existing cluster ............................ 21 7.3 Appendix C: URL distribution ...................................................................................................... 23 1 Introduction Trails of data generated from users interacting with search engines provide a significant resource for classifying information on the World Wide Web. The patterns of user behaviour found in the search logs help indicate the context a user is applying to a search term. It has been proposed that this information can aid in ambiguity and synonym detection (Ashman et al, 2011) amongst other useful tools. Initial progress began with clickthrough data which proved a useful source for clustering resources together (Beeferman & Berger 2000). The process involves gathering URLs selected by users under a search term. Though it was useful to find which search terms had URLs in common, ultimately coselection data would provide a more useful metric for indicating the relevance of URL to URL by wrapping up URLs selected together in the one query. While the exact relationship between URLs can vary depending on user intent, Ashman et al (2011) have found that users generally search on a term with a single semantic purpose in mind. Though users may occasionally choose something irrelevant to an objective, the majority of coselected URLs seem to indicate a strong mutual relevance, much more so than many other selection methodologies. This process overcomes two significant hurdles to past terminology detection. The first one is that the process exploits the important factor that users are making a direct judgement on the information that full fills their needs for the terms they have specified. By contrast, semantic and lexical analysis has struggled from being unguided and lacking human involvement (Tamir & Rapp 2003). Meanwhile the oldest method of using human judgement is a very time consuming to get a complete picture (Riloff 1993). Coselection overcomes these two hurdles by providing user relevance judgement from an activity people perform ubiquitously in their daily lives. This thesis proposes to detect ambiguity by counting the number of clusters generated. This method of detecting ambiguity first involves using coselections as a similarity measure to aggregate semanticallysimilar collections of URLs. This is created by first forming a term graph of URLs for each term where edge weights indicate how many times a URL is selected in common with another URL. Clusters are then formed by aggregating URLs that are regularly coselected together enough to indicate they are of the same sense. Each cluster should therefore represent part or all of a sense a search term can be used in. 1 Figure 1-1 Term graph for "pernstejn" with vertices corresponding to Web resources (Asman et al 2011) Experiments so far indicate that clusters can be successfully resolved to sense singularity, however, currently there are also many clusters for a single sense. This research aims to address issues that can reduce the number of clusters to something more meaningful. One major issue that appears to be creating more clusters is the effect of volatile top search results over time. If changes happen too abruptly old URLs can’t be coselected with new URLs which create division in clusters. The true extent of this effect will be measured and research possible solutions to reduce the effect. 1.1 Motivation Too many clusters for the one meaning results in a large amount of disparate chunks of data that are difficult to analyse, however that they were still mostly sense singular indicated a solid platform to build on. Ideally, by aggregating those semantically-similar clusters, it will become feasible to judge whether the term is ambiguous or not, by counting the number of clusters. It was postulated in Ashman et al. (2011) that more clusters indicated more potential ambiguity although the data they investigated did not confirm or deny this. Ashman et al (2011) speculated that more data would bring together bigger clusters. Though such a quantity of data would likely be met by major search engines for common terms, it is not however available to current research. An issue speculated to be breaking up the results is the effect of volatile top search results. Since user’s mainly select from the first page (Jansen & Spink et al 2006), if there aren’t smooth gradual transitions over time, there will be breaking in clusters due to lack of coselections to join them together. To find the effect of time, this paper proposes to time stamp the data to uncover the evolution of clusters in time, thereby discovering whether there are broken links between large clusters of the same meaning at a certain period in time. This paper therefore aims to discover the effect this phenomenon has on the clustering and look into possible solutions for overcoming this problem. 1.2 Research Questions QUESTION 1 Does the effect of volatile top results create separation in clusters? If users mostly select only from the first page of results (Jansen & Spink et al 2006), we expect to find that rapid change means the opportunity for coselections to be created drops significantly as soon as one of the participating pair leaves. QUESTION 1.1 Can clusters separated by time be brought together and retain sense-singularity? We plan to compare the outputs of standard clustering against aggregation of temporallyseparate clusters to see whether it marks an improvement on the whole-dataset cluster aggregation process. QUESTION 2 Given a set of these aggregated clusters, is it feasible to determine via cluster cardinality whether or not a given term is ambiguous? 2 We postulate that an improved clustering mechanism will successfully aggregate at least some formerlydistinct clusters with the same semantics. This should result in a significant correlation between the number of clusters for a term and a ground-truthed value for the ambiguity of that term. 2 Literature Review 2.1 Background 2.1.1 Clickthrough Clickthrough data provides a key resource on how search engines are used. It records a history of a user’s interactions however doesn’t resolutely state what the user’s intentions are and how successful they were. Nevertheless, it has been speculated that in the great quantities that are possible on the World Wide Web, a good analysis of the data can be even more accurate than using explicit feedback from the user (Dou et al 2008). As a result the data has been a key area of interest since search engines became a widespread means to browse the internet. Applications The first recorded use was when Lieberman (1995) applied the resource to dynamic personalized tools for browsing the web in an application called Letizia. Since then it has proven popular for a wide range of different applications, of which, one of the most attractive has been to improve the output of the search rankings. Many researchers have tackled this issue (Joachims 2002)(Agichtein et al 2006)(Carterette & Jones 2007)(Gao et al 2009)(Dupret & Liao 2010), all keen to increase the accuracy of ranking automation to a universal range of knowledge that is so large that it cannot be done completely manually. This field has been of great value to creating accurate interpretations of clickthrough data that are otherwise mixed by fuzzy interpretations of user decisions. Numerous other components of search engines have also benefited from feeding back search log data. Sun et al (2005) found they could improve the captions of search results by determining significant words in regularly clicked documents. That query’s with terms that are in the caption are more helpful is also supported by Clarke et al (2007) who performed a wide analysis of what makes a good caption, supported by the click through data they applied. More broadly the information collated can deconstruct how search engines are used. Pass et al (2006) derived a wide set of statistics that help create a picture of user behavior, while Ashkan et al (2008) went into more detail to uncover the intent of a user’s search as transactional, navigational or commercial means which was proposed to assist associated advertisements. Clickthrough data also has value as a means the extract meaningful associations between URLs and categorize the broad range of information on the Internet. For instance Xu et al (2009) performed Named Entity Mining on the data, compiling the links from specific search terms into various information types. Another idea has been to cluster resources together where a number of links are commonly selected for multiple search terms (Beeferman and Berger 2000). This idea of clustering links 3 into useful groups is a core foundation of the work Ashman et al (2011) have used that is the ground work this paper is derived from. Clustering The idea of clustering URLs and search terms was first initiated by Beeferman and Berger (2000). Search terms that had many of the same URLs selected were grouped together along with their associated URLs. This resulted in clumps of information consisting of common ground. Its main limitation was that it required top search results to have the same URLs available for multiple terms, but nonetheless was capable of grouping a vast array of terms for many data sets. A useful application for the results was proposed to provide search term recommendations. As a direct expansion to Beeferman and Berger, Chan et al (2004) noted the algorithm was subject to noise as a link only using very few clicks in common could be included. They proposed that a cut off proportional to how many times the links have been chosen under those terms against the overall number of document selections for those terms. This effectively eliminated noise as the major source of error. Further, by considering each user’s actions separately Leung et al (2008) found they could disambiguate a user’s intentions. If a user’s choice fit the mould of a certain group of users that chose the fruit apple, they could receive results on that specific option over apple computers which fit a different group of users. Similarly Gao et al (2010) suggest search terms that are clustered together can be considered synonyms. To do so they followed a particularly novel approach where queries are considered similar if the titles of its main URLs regularly selected in clickthrough data feature the same bi-phrases. This helped extend on Beeferman and Berger’s concept by being able to cluster multiple search terms even if they don’t feature common URLs in the top search results while also aiding a common language metric between search terms and titles/documents. A unique form of clustering was further introduced by Ashman et al (2011), which utilized coselections as a similarity metric between URLs. The binding together under the one search sense was much stronger in this scenario than that of using the entire search history of individual users as a similarity metric. Greater detail can be found in 2.2.1 Coselections. Acccuracy A key factor in clustering information sources is the presumption that documents selected tend to be relevant to the user’s original search sense. It has been speculated the usefulness of abstracts may be an issue but they were verified to be helpful 82.6% (Joachims et al 2007). Moreover the larger issues found are the user’s decision making and their tendency to browse, which account for the total accuracy of documents being relevant to the search sense at only 52% (Scholer et al 2008). Nevertheless Dou et al (2008) asserts that with a good analysis of the data combined with the huge quantities possible on the World Wide Web, it may produce better results that of direct user feedback. 4 One of the most common methodologies to smooth out the results is the idea that the higher the document in the search results, the more likely it is to be picked. The main factor is the trust bias the user has in the search engine that the higher results are more accurate. Therefore the quality of the retrieval system is a big factor in influencing which documents are selected. Further Granka et al (2004) verified through eye tracking that users have a tendency to analyze results from top to bottom most of the time. As an extension of these ideas Joachims et al (2002) created a methodology for analysis that asserted any document selected is more relevant than all the documents above it that weren’t selected by the user in the one session. This represents the main methodology that has been extended upon to determine relevance of the document to the search term. Another consideration suggests that image search may be substantially more accurate than text search. The theory is that images are a more direct description than what captions are, which proposed to lessen the problem of bad user judgment. These results were found to be extraordinarily accurate to about 88% (Smith & Ashman 2009). However recent research has suggested that text search are in fact just as good and the lesser quantities of image search data is a challenge to resolve. 2.1.2 Detecting ambiguity In 2007 Ashman et al proposed a Global Perpetual Dictionary of everything. A key component was that search log data would be able to run an automatic scan for ambiguous terms without the need for human involvement. This research represents a shift towards involving user implicit judgments that are carried out through a ubiquitous activity rather than analyzing the structure of a discourse. Disambiguation Disambiguation has been of interest to the field of computing since its early days in the 1950s. The first identified field for disambiguation was in the context of machine translation of languages. Ambiguous terms were one of the key constraints to otherwise providing one to one translation. The early philosophy of the task was that one word at a time cannot determine the meaning of an ambiguous term, but given the context it can be resolved (Weaver 1955). Thus disambiguation involved forming a methodology of picking the right meaning of each ambiguous term based on what a sentence implied. A key aspect of disambiguation is that it relies on a resource for the representations of each meaning in ambiguous terms. The enormous knowledge base required across the entire diction has been a significant hurdle to assigning a word its implied sense from a sentence (Gale et al 1992). One popular source has been Machine-Readable Dictionaries (MRDs) but so far has not successfully accomplished the automatic extraction of large knowledge bases (Amsler 1980). WordNet is the only MRD that is widely available today and is limited by its hand creation. Finding the ambiguous term senses would be the first step towards a comprehensive MRD, as is the case with WordNet which uses a synset tree of words to represent the meanings (Tamir & Rapp 2003). Ashman et al’s (2007) proposal represents a way of discovering ambiguous terms and potentially their meanings in a more comprehensive fashion by making valid interpretations on the way they are used in an implicit activity. Machine Readable Dicitonaries 5 Machine readable dictionaries are currently the most common resource for the structure of word senses. WordNet, the most fully formed today, features words grouped into synsets of the same sense referencing other synsets with key relationships. Ambiguity is represented where a term is found in multiple synsets (senses). The biggest difference between each sense is they should reference a different set of hypernyms implying they are a different kind of entity. Nevertheless, using MRDs as a resource for ambiguity has been limited by the unclearly defined boundaries of senses. There have been complaints MRDs often make unnecessary and difficult "forcedchoices" (Dolan 1994). Attempts have been made to address this such as clustering with the aid of a thesaurus to help eliminate distinctions that are unnecessarily fine grained (Chen et al 1998). These tough distinctions made in an MRD can lead to too many unimportant senses which clutter up tasks such as disambiguation (McCarthy et al 2004). It has been suggested ranking of sense relevance can therefore be of value to distinguish which are most useful. These sorts of findings pose challenges to clustering coselection data as it is not clear whether users will mainly coselect on major senses or if minor ones will be distinguished. Most likely it is presumed the minor senses will build into major ones by the nature of uses selecting even small similarities, but the nature of the boundaries need to be investigated. Traditionally all the crafting of MRDs has been done manually by hand and research, but increasingly there have been ways of finding relationships through more automated forms. A common methodology for determining hypernyms has been to look through discourse for commonly occurring patterns: “Bruises[,] wounds[,] broken bones [or other] injuries . . .”: where the nouns are implied to be a type of injury (Hearst 1992) “Boeing[, a] defense contractor": where “defense contractor” is an appositive of Boeing (Caraballo 1999). By finding hypernyms this way, nouns sharing the same hypernyms likely indicate a shared sense allowing somewhat of a synset to form (Caraballo 1999). In another scenario, senses and their description can be found more directly using a complex set of trigger words, though is limited to specialized topics of content by its less generic nature (Riloff 1993). Meanwhile Agirre et al (2000) have used the world wide web to enrich and refine the current content in WordNet. Despite these efforts, it is still a long way in breadth and accuracy from seeing the complete automation of further MRD construction, beyond providing possible guides. This paper embarks on adding to the knowledge of ambiguous terms with the clusters of coselected URLs ideally representing the major senses. Word Sense Induction The most direct field of finding ambiguity is word sense induction since the best way to find ambiguous terms is to find the distinct senses. In word sense induction, these senses are typically found by assigning words commonly used with the target word to clusters which are formed by their usage in 6 discourse (Pantel and Lin 2002) (Dorrow & Widdows 2003). Thus far an accuracy of 72% that a cluster correlates to a correct sense has been achieved. Increasingly, the World Wide Web represents a large scale resource that is easy to access for gathering word senses. The use of Google to mine the web for senses was suggested by Tamir & Rapp (2003), they were inspired by the work of both Gale et al (1992) and Yarkowsky (1995) which suggest ambiguous words only occur in one sense in a given document and that words close to a term give some indication of the sense of the target word. This leads to the assumption that a good indicator of ambiguity is when two words commonly occur with the target word but rarely occur together with the target word. Ultimately the resolution of finding two words that represent two different meanings of an ambiguous term was successful, but often the association of the words to the meaning is weak since regularly used words are rare. In spite of these efforts, the biggest difficulty so far is the need to narrow down terms and search senses, the processes so far are too complex for a full breadth. Since clustering coselection search data does not require analysis of masses of discourse in the same way, the use of implicit judgments as a one to one relationship provides a way to streamline the complexity of such a task. Disambiguation in IR Disambiguation has also featured in the context of Information Retrieval. The philosophy has been that the results provided by IR can be more successful when all entire resources are cleared of ambiguity. The first step involves replacing all the ambiguous terms in every discourse with words that correlate to the more specific meaning implied by its context (Voorhees 1993) (Sussna 1993). However this proved to be a futile attempt as it only produced worse search results than the original unedited recourse. Conversely Sanderson (1997) has measured the effect of increased ambiguity within discourses on IR, the results found that by appending random words to those in the discourse, the added ambiguity did not hinder the success of an IR system as much as expected. Disambiguation of resources was therefore a challenge to find small gains, and a difficult one due to the low accuracy of current disambiguators. A knock on effect found by Krovetz & Croft (1992) on the weakness of disambiguation in IR related to the way search terms were used. They found two major challenges that were causing in effective results: 1. Most ambiguous terms have a dominant meaning so most results feature the resources that most users were searching for. 2. Search terms often over come disambiguation by collation, the more words found in the search term, the more likely other words will imply their meaning like a sentence would. With all search terms being searched on, more searches are quickly funneled into the context of them all. These two examples highlight that queries need to be short for ambiguity to be an issue in both the term and the discourse. Further, by the nature of a search engine, small search terms can overcome their ambiguity by being applied back in with added words for more specific results. 7 2.2 Related Work 2.2.1 Coselections A special use of clickthrough data immerged in collecting it as co selections where multiple URLs are chosen in the one session (Smith et al 2009). It provides a direct similarity metric between URL to URL under the assumption that users usually select with a specific search sense in mind. By extension of this idea, each separate cluster should ideally be unambiguous and represent an individual meaning for the search terms. Through the DBSCAN algorithm, Ashman et al (2011) successfully reduced clusters to single sense however multiple clusters for the same sense regularly immerged. A major factor in the lack of accuracy was the small amount of data available. Additionally it was speculated that changes in top results would cause fragments in clusters since two URLs need to be there at the same time to have a good chance of being coselected. This paper aims to address this issue by discovering how significantly time is affecting clusters and provide some potential solutions to overcome it. Caon et al (2012) further expanded on this work by utilizing a cluster by overlap method to link similar clusters over different search terms. These relationships should indicate that two terms with a cluster in common have a similar usage which suggests they are synonyms. Through tweaking the DBSCAN algorithm parameters, these results were successfully resolved to a large number of positives that were all verified in accuracy. 2.2.2 Search Engine Results For two results to be coselected together, their position in the results is important. Users only tend to view the first page of results about 85.2% of the time, which has been increasing over the years as search engine accuracy and the trust invested in them by users has grown (Silversteen et al 1999) (Jansen & Spink 2006). URLs therefore have a much smaller chance of being coselected if they aren’t both on the first page. This issue becomes more apparent with top results frequently evolving. New pages are regularly being added and the measurements used to rank them also fluctuate. With this evolution, many URLs have a limited time to appear together on the first page before they lose their main chance of being coselected, while many others will never spend time together. Often it is the case results will show very little consistency over time and can jump up and down with very little to suggest major change occurred (BarIlan 1999). As with the case with Google, there are many different measurements used for rankings such as anchor pages that link to the main page. These anchor pages add to a page’s significance but also feature additional descriptions helpful for its relevance (Brin & Page 1998). With so many variables, results can change for many different reasons. Nevertheless, Bar-Ilan et al (2006) plotted changes on daily intervals and found Google to be mostly stable with small changes happening often but usually only in increments and rarely making big jumps (see figure 2-1). The effect of these changes have on clusters can be mainly be deduced down to how often URLs occur together in the first page irrespective of how many times it has gone in and out, as this indicates it’s coselection chance. It is important to analyze the effect this has on the construction of clusters. 8 Figure 2-1 The top-ten results of the query "DNA eveidence" on google (Bar-Ilan et al 2006) An additional challenge posed to clustering coselections is that the vast majority of search terms are used very few times. Silversteen et al (1999) found, out of a very large set of 154 million queries, only 13.4% of them occurred more than 3 times. By contrast, the top 100 query terms tend to be used as much as 20% of the time (Spink et al 2002). Good results are therefore constrained to a smaller set of terms as many don’t even have a chance of being clustered, let alone having sufficient detail to be successful. For broader results, clustering needs to be as efficient as possible and make best use of as little data as possible. 3 Methodology 3.1 TimeStamping In order to analyze when clusters are broken due to volatile top search results, each coselection was time stamped. All the data from the log files was reprocessed since the extracted coselections currently do not contain the time information. Without the old parser this involved doing it all over again. The log files were extracted from a server for a workgroup of computers where users are more likely to have similar search query. Out of all the interactions possible through the server, queries from Google, which is by far the most popular search engine (ebiz 2012), are extracted. Within each transaction exists a time stamp which is used as a record of when each interaction occurred. This allows tracking the evolution of clusters in time to determine how clusters have been broken. Though it is impossible to back track to say when and where the URLs were part of the top results, the activity of the URLs provide some observations. 9 Figure 3-1 An example of server log data 3.2 Measuring Activity The primary methodology is formed from the conjecture that URLs forming part of a coselection are in or very close to the top search results and when no longer selected, may have fallen outside. The major concern is how disparate clusters are, which is measured by how often URLs selected in one cluster are selected at the same time as URLs in another. By measuring disparateness this way, the activity of any two clusters can be correlated to one of three main outcomes: Completely disparate clusters: These clusters were likely impossible to join by the coselection metric as they have not had two URLs selected in the same time frame. Too many of these indicate volatility of results is causing disruption in the clustering. Slightly disparate clusters: These clusters still have some URLs selected at the same time but only very rarely. Therefore they have less chance of gathering enough coselections to join and with less chance, a smaller epsilon value may be more appropriate. Non-disparate clusters: These clusters frequently overlap in activity but due to few coselections in common, they are two distinct clusters. This is the ideal scenario which likely indicates a strong correlation to clustering success. If multiple clusters still represent the same meaning in this scenario, then it is a limitation of clustering coselections itself. Furthermore, we speculate in this scenario the more coselections; the more likely it is to be accurate so how often two nondisparate clusters represent the same meaning should be measured against the coselection size. Since clusters representing the same meaning are the main factor being evaluated, individual URLs are of less significance however the same can be applied to them, particularly for small amounts of data to indicate where they are situated. 3.2.1 Loss of activity Since many URLs are active on and off over a period of time, a measurement of complete inactivity would then likely have to provide leeway of a small time period, allowing for a URL to chain together activity over some time. Scenarios where there have been regular activity on a URL followed by a complete halt, provide our biggest indication that a popular URL has completely dropped off for good. This is an alternative to disparateness that is slightly clearer cut a URL has likely fallen outside of the top results. The amount of URLs completely segregated between two clusters then provides an idea of how many potential joins were greatly affected by volatile results. 3.2.2 Loss of URLs An overall measurement of the loss of URLs may also be formed for search terms with a lot of activity, which may provide on average an indicator of how often URLs drop out of the top search results. To analyze the results, information is gathered in time fragments that may be analyzed weekly, fortnightly or monthly. For each segment of time there should be a fairly similar number of URLs chosen, mainly those in the top results but also a small number outside. As time increases the total number of URLs selected over all time increases while the number selected specifically in each time period should remain roughly the same. This provides a key correlation to how many URLs drop outside the top 10 results. Additionally in rare cases some URLs will come in and out of the top results, when they arrive back, they will also be added provided they reach the threshold of inactive iterations to be considered completely inactive. An average number of new URLs per iteration over all the significant search terms would then determine how often top results change. 3.3 Cluster Disparate In order to improve clusters, a cluster disparate function has been proposed to find clusters that are rarely active at the same time and allow a looser epsilon value. To determine disparateness, a timeline of URL usage is required where an individual iteration lasts a small period of time like a fortnight or a month, the exact time being determined by how often URLs tend to drop off the top results discussed earlier. Each pair of clusters that have few iterations selected in common with each other are considered disparate. Since these clusters have much less chance to gather the coselections necessary to join, the epsilon value would therefore be reduced to make clustering easier. Measuring Disparateness (Amount of times urls are selected) Completely Disparate Clusters (No coselections possible) Clusters URLs Cluster1 url1-1 url1-2 Cluster2 url2-1 url2-2 url2-3 1 2 3 3 4 4 1 4 4 3 5 6 6 2 1 7 4 7 Iterations (Months) 8 9 10 11 12 1 3 1 2 3 4 5 13 14 2 1 3 13 14 2 1 3 15 16 17 18 19 15 16 17 18 19 Disparate Clusters (lower epsilon may be helpful) Clusters URLS Cluster1 url1-1 url1-2 Cluster2 url2-1 url2-2 url2-3 1 2 3 3 4 4 1 4 5 6 7 Iterations (Months) 8 9 10 11 12 2 1 4 3 6 2 1 7 4 1 3 1 2 3 4 5 1 2 Figure 3-2 An example of disparateness To measure whether a pair is in fact disparate, there are two major cut offs. The first one is measuring the aggregating the number of URLs active in the cluster with fewer URLs active for each iteration. ie 11 totalTimeIterations α = ∑ ( Min(clusterA.URLsActive(ti) ,clusterB.URLsactive(ti))) ti=0 The reason not every URL pair is measured is that it would increase the count by multiple which is not proportional to the coselection chance that is dependent on the amount of URLs available to co-select. This measurement is referred to as α and is the maximum cut off possible for a pair of clusters to be considered disparate. The second measurement is a finer grained cut off that is proportional to the total number of URLs featured in the smaller cluster. It is only for small clusters that require a smaller cut off than the maximum possible, since disparateness is less meaningful for smaller clusters that are much more likely to have less activity than larger ones. The measurement finds the amount of α per number of URLs in the smaller cluster and determines the rate it needs to increase since the more URLs, the less need to make the distinction. If a pair of clusters passes these two checks, they are determined to be disparate. In this scenario a smaller epsilon threshold for joining may be considered appropriate. The key is determining what values constitutes disparateness and whether the lower epsilon value is valid for not adding clusters of the same sense. 4 Results and Discussion 4.1 Analysis of Data Unfortunately the server log data did not produce enough coselections as was speculated, only approximately 1 Google URL request was found per 750 lines. The vast amount of interactions through the server due to various applications and other tasks unexpectedly dominated web requests. As a result for the barest minimum of epsilon 3 and minimum nodes 2, only one term formed 2 clusters and 55 just one cluster. In most of these cases there was far less sufficient coselection information to make valid judgements, at most only just meeting the epsilon and/or minimum nodes criteria. The primary issue seems to be the quantity of searches available, the data set only uncovered a total of 57,920 searches, as a result the proportion of unique searches that occurred at least twice was very small, 11.6%. This contrasts with the findings of Silversteen et al (1999) who had a much larger data set available by approximately 10,000 times, with 36.3% of unique searches occurring more than once. We speculated that by having the data come from a select workgroup of computers with users having similar tasks to perform, the repetition of results should be much higher for the amount of data available. While this may still be the case, 57,920 searches are still insufficient to gain enough repetition. The challenge posed by such a small amount of data becomes accentuated when using coselections. Coselections only account for 23.9% of total searches. With searches rarely being selected multiple times, the chance that it will be a second coselection is even rarer. 12 Coselection Searches Total (non unique) Unique With at least two occurrences With at least three occurrences With at least four occurrences Count 13858 12545 739 (5.89%) 191 (1.52%) 98 (0.78%) Table 4-1 Coselection data collected Searches Total (non unique) Unique With at least two occurrences With at least three occurrences With at least four occurrences Count 57920 39525 4582 (11.59%) 1845 (4.67%) 1115 (2.82%) Table 4-2 Search Data collected Coselections are also hampered by the necessity for the same two URLs to be selected in two different searches in order for the relationship to gain greater significance. Just having one of those URLs in a second coselection search does not increase the relationship for any of those by more than one. This becomes an issue as out of the top 10 results there are 45 coselection relationships possible. The problem posed is somewhat offset by large coselections that cover many of the relationships as well as the tendency of users to more likely select from the top and that the most relevant URLs are more likely to be selected. However it is particularly significant since 60.0% of coselections only feature 2 URLs and the lack of data does not lessen its influence. Coselection relationships Coselection searches Aggregate of URLs in each coselection query Total coselection edges Total coselection weights Count 13858 38908 50712 51528 Table 4-3 Coselection Relationships Data Since coselection relationships grow triangular against the amount of URLs in a query, often the terms with the highest aggregate coselection-weights were those with many coselection relationships of just a weight of 1 each caused by a single query with many URLs. These results typically embodied a type of content like “luxury holiday” (406 coselections, 28 URLs in 1 search) or “military aircraft map textures”, (351 coselections, 26 URLs in 1 search), where a user selects a category like a browse function and multiple items are compared for the most appropriate. In such instances a single coselection search would far exceed the top 10 results typical of a first page. Very few examples featured very high weights, one of the best results was “free sound effects” with 125 coselection-weights, but had 97 edges between 23 URLs. That leaves only leaves 28 coselections between 97 edges to increase their weights above 1. Rarely were these proportions exceed, but with more data such occurrences are expected to be more regular and bigger in size as the effect of one or two user clicking loads of links are outweighed by the majority. 13 By far the biggest activity over an extended period of time occurred on “teeside” and “teeside university”, however very few of these included coselections. Approximately 3-4 links had consistent activity over 3 years with other links being used only on occassions. Likely this is due to its navigational sense in only needing to find one of the 3-4 major portals for Teeside University. 4.2 Measurement of activity Little could be deduced about the activity due to such a small results set. With only one search term featuring more than one cluster, nothing was conclusive on cluster disparateness or URL segregation. Using the search term with 2 cluster, it indicated disparateness may be a factor as it only measured an α of 4 with 2 URLs in the smallest cluster, but conversely that may not indicate much as there were not any months where the less active cluster was active at a different time. URLs individually provided more significant findings as they are always expected to be low on data, since as coselections grow, they would otherwise already be in clusters. Remarkably nearly all URLs were active at the same time as another URL indicating a coselection chance with the URLs featured, though this rarely meant a coselection did occur. For terms where a cluster was found, only 54 out of 432 URLs did not have a month active in common with a cluster and only 15 were active greater than 2 times with the same cluster. Most of the URLs therefore fall into the slightly disparate range where a lower epsilon value may be valid, though this disparateness may not be a great indication as most of these URLs were never active at times the clusters weren’t. Cluster Disparate on URLs (α<= 2) Search Terms with a cluster Potential URLs URLs with > 2 months in common URLs with no months in common Count 56 363 15 54 Table 4-4 Disparateness of URLs to existing cluster 4.3 Measurement of Loss of URLs Since to gather the average loss of URLs in the way mentioned required an enormous amount of data for a select few search terms, this was unable to be met. 4.4 Cluster Disparate Since nothing of influence could be found on the disparateness of clusters, findings are yet to show its effectiveness. Of the one search term, “free sound effects” with more than two clusters, it did highlight some potential in merging clusters since the two clusters were distinctly of the same meaning and the measurement of α was only 4 for 2 URLs in the smallest cluster. By halving the epsilon from 3 to 2, a join was found between the 2 clusters which would indicate success of the function, however many more results are needed to find it’s true significance. 14 With 56 search terms featuring at least one cluster, we proceeded to trial using cluster disparate on merging individual URLs with a current cluster to gather its accuracy where there are small amounts of coselections. In actual results, the disparate function proved to be of little significance as out of 432 URLs only 15 were discarded for having more than 2 months in common with the cluster. In spite of this, only 1 false positive URL was added to a cluster compared with 47 true positives. Cluster Disparate on URLs (α<= 2 ⇒ epsilon=2) True positive URLs added to a cluster False positive URLs added to a cluster Unknown pages added to a cluster Search Terms with a cluster Count 47 1 1 56 Table 4-5 Accuracy of Cluster Disparate Function on URLs Such results would seem to suggest the strength of clustering coselections is so strong that a lower epsilon of 2 rather than 3 may be sufficient for joining URLs irrespective of disparateness. However, these results may not be indicative of broader findings due to 3 key reasons: Most of these queries featured were unambiguous; therefore all URLs are likely to point to the same meaning regardless. Most queries that are ambiguous tend to show the dominant meaning in most of the top results, therefore the other meanings rarely get clicked on (Krovetz & Croft 1992). Google is renowned for its URLs consistently being strongly relevant to the search term, so completely irrelevant URLs are rarely a factor, let alone coselected multiple times. For these same reasons, none of the 15 URLs discarded for having too many months in common with a cluster were of a different search sense. Moreover, it is apparent most searches aren’t coselections, so many of the URLs with most activity are still yet to have a solid chance at joining other URLs. It may then be applicable to scale the months in common value proportional to the overall amount of coselections in the search term, allowing for a bigger margin where there are fewer relationships. 5 Conclusion The biggest challenge posed for future study is to gather many more coselections. Even with data being collected from the same workgroup of computers, the amount of repetition of results was not great enough to form significant clusters regularly. By being used less than a quarter of the time than single selections, the chance a search term will feature coselections on multiple occasions is even less. Nevertheless, the methodologies of measuring disparateness and segregation based on URL activity appear to be sound as a means to determine coselection chance between two clusters. These are key indicators to what extent volatility of results may be posing a problem for different clusters forming of the same sense. 15 With coselections being difficult to accumulate, clustering needs to improve effectiveness even as the data set grows. A cluster disparate function was suggested for drawing together clusters that were fragmented by evolution in time, though no conclusions could be drawn on its effectiveness. While drawing URLs onto a cluster through this methodology appeared successful, it was offset by major limitations in the data set available. The ideal that coselections can determine ambiguity from cluster cardinality remains elusive, though it appears difficult to attain full accuracy using coselections as the lone similarity metric due to sparsity of activity. 6 References Agichtein, E, Brill, E, and Dumais, S, 2006, Improving web search ranking by incorporating user behavior information, Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, ACM. Amsler, R, 1980, The Structure of Merriam-Webster Pocket Dictionary, Ph. D. thesis, University of Texas at Austin, Austin. Ankerst , M, Breunig, M, Kriegel, H & Sander, J, 1999, OPTICS: Ordering points to identify the clustering structure, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, pp 49-60. Ashkan, A, Clarke, C, Agichtein, E, & Guo, Q, 2008, Characterizing query intent from sponsored search clickthrough data, In SIGIR Workshop. Ashman, H, Antunovic, M, Chaprasit, S, Smith, G & Truran, M, 2011, Implicit association via crowdsourced coselection, Proc. Hypertext 2011, June 2011, 7-16, ACM. Ashman, H, Zhou, D, Goulding, J, Brailsford, T, & Truran, M, 2007, The Global Perpetual Dictionary of Everything, Proc. Ausweb, http://ausweb.scu.edu.au/aw07/papers/ refereed/ashman/paper.html. Beeferman, D & Berger, A, 2000, Agglomerative clustering of a search engine query log, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, pp 407-416. Birant, D & Kut, A, 2007, ST-DBSCAN: An algorithm for clustering spatial–temporal data, Data & Knowledge Engineering, Vol. 60, Jan 2007, pp 208–221. Caon, G, Antunovic, M, Truran, M & Ashman, H, 2012, Finding synonyms and other semantically-similar terms from coselection data, UniSA, SA. 16 Carterette, B, & Jones, R, 2007, Evaluating search engines by modeling the relationship between relevance and clicks, Computer Science Department Faculty Publication Series. Chan, W, Leung, W & Lee, D, 2004, Clustering search engine query log containing noisy clickthroughs, 2004 International Symposium on Applications and the Internet. Chen, J, & Chang, J, 1998, Topical clustering of MRD senses based on information retrieval techniques, Computational Linguistics. Clarke, C, Agichtein, E, Dumais, S, & White, R, 2007, The influence of caption features on clickthrough patterns in web search, In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 135-142, ACM. Dorow, B & Widdows D, 2003, Discovering corpus-specific word senses, Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics, Vol. 2, Stroudsburg, PA, pp 79-82. Dou, Z, Ruihua, S, Xiaojie, Y, & Ji-Rong W, Are click-through data adequate for learning web search rankings?, In Proceeding of the 17th ACM conference on Information and knowledge management, pp. 73-82, ACM. Dupret, G, & Ciya, L, 2010, A model to estimate intrinsic document relevance from the clickthrough logs of a web search engine, Proceedings of the third ACM international conference on Web search and data mining, ACM. Gale, W, Church, K & Yarowsky, D, 1992, A Method for Disambiguating Word Senses in a Large Corpus, Computers and the Humanities, 26, pp 415-439. Gao, J, Wei, Y, Xiao, L, Kefeng, D, and Jian-Yun, N, 2009, Smoothing clickthrough data for web search ranking, In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, pp. 355-362, ACM. Gao, J, Xiaodong H, & Jian-Yun, N, 2010, Clickthrough-based translation models for web search: from word models to phrase models, Proceedings of the 19th ACM international conference on Information and knowledge management, ACM. Granka, L, Joachims, T, & Gay, G, 2004, Eye-tracking analysis of user behavior in WWW search, Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, ACM. Guha, S, Rastogi, R & Shim, K, 1998, CURE: an efficient clustering algorithm for large databases, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, pp 73-84. Joachims, T, 2002, Optimizing search engines using clickthrough data, Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, ACM. 17 Joachims, T, Granka, L, Pan, B, Hembrooke, H, Radlinski, F, and Gay, G, 2007, Evaluating the accuracy of implicit feedback from clicks and query reformulations in Web search, ACM Trans. Inf. Syst., vol. 25, pp 7. Karypis, G, Eui-Hong, H & Kumar, V, 1999, Chameleon: hierarchical clustering using dynamic modeling, Computer, Vol. 32, pp 68-75. Leung, K, Wilfred N, & Dik L, 2008, Personalized concept-based clustering of search engine queries, Knowledge and Data Engineering, IEEE Transactions. Lieberman, H, 1995, Letizia: An agent that assists web browsing, International Joint Conference on Artificial Intelligence, Vol. 14, Lawrence Erlbaum Associates Ltd. Pantel, P & Lin, D, 2002, Discovering word senses from text, Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp 613-619. Pass, G, Abdur, C, & Cayley, T, 2006, A picture of search, Proceedings of the 1st international conference on Scalable information systems. Riloff, E, 1993, Automatically constructing a dictionary for information extraction tasks, Proceedings of the National Conference on Artificial Intelligence, John Wiley & Sons Ltd. Scholer, F, Shokouhi, M, Billerbeck, B & Turpin, A, Using Clicks as Implicit Judgments: Expectations Versus Observations, Advances in Information Retrieval, 2008, pp 28‐39. Smith, G, & Ashman, H, 2009, "Evaluating implicit judgements from image search interactions." Smith, G, Brailsford, T, Donner, C, Hooijmaijers, D, Truran, M, Goulding, J, & Ashman, H, 2005, Generating unambiguous URL clusters from web search, Proceedings of the 2009 workshop on Web Search Click Data, pp. 28-34, ACM. Sun, J, Hua-Jun, Z, Liu, H, Lu, Y, Chen, Z, 2005, CubeSVD: a novel approach to personalized Web search, Proceedings of the 14th international conference on World Wide Web, pp 382-390, ACM. Tamir, R, & Rapp, R, 2003, Mining the Web to discover the meanings of an ambiguous word, Data Mining, 3rd IEEE International Conference on, 19-22 Nov. 2003, pp 645- 648. Voorhees, E, 1993, Using WordNet to disambiguate word senses for text retrieval, Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval, ACM. Weaver, W, 1955, Translation, Machine Translation of Languages, John Wiley & Sons, pp 15-23. Xu, G, Yang, Y, & Li, H, 2009, Named entity mining from click-through data using weakly supervised latent dirichlet allocation, Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM. 18 Yarowsky, D, 1995, Unsupervised word sense disambiguation rivaling supervised methods, ACL ’95 Proceedings of the 33rd annual meeting on Association for Computational Linguistics, pp 189-196. Zhang, T, Ramakrishnan, R, Livny, M, 1996, BIRCH: an efficient data clustering method for very large databases, Proceedings of the 1996 ACM SIGMOD international conference on Management of data, pp 103-114. 19 7 Appendix 7.1 Appendix A: Coselection Count of terms with at least one cluster Note: Words are in most logical order but order is unimportant in clustering Most clusters (not above) data protection act c++ connect 4 free sound effects teeside python time difference pydoc teeside university bridge transporter wet n wild python dictionary spaceship .wav c++ xsi tutorials teeside internet sdl_close python string contains sci entertainment quote fur affinity sound effects gp2x game piracy games age rating “cavazza marc” or “marc cavazza” games piracy set xsl variable reference to undefined smashing magazine pakistan news boom toon tutorials pro gaming teams textures don’t stop me now midis pegi Coselecti on total weights 385 294 125 25 37 22 38 10 3 96 6 11 25 13 10 11 12 9 119 11 17 17 7 Coselecti on edges URLs Clusters 340 237 97 16 26 17 26 8 1 87 5 6 17 10 8 8 7 4 114 7 15 13 4 14 5 6 7 3 3 7 3 3 3 4 5 3 4 3 3 3 4 3 3 3 3 4 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 78 12 9 15 68 26 22 23 52 12 69 10 6 9 12 16 18 16 47 7 3 3 3 3 17 3 3 4 4 5 1 1 1 1 1 1 1 1 1 1 20 c++ ternary operator pound euro avfc c++ string imbd wwe spoilers tees.ac.uk teeside uni messenger web photo portfolio hotmail 1998 data protection act linux commands zero punctuation blackboard tees sdl sound teeside uni free textures sdl_mustlock c++ random int arm.linux.rules sdl SUM 24 3 6 15 4 6 47 19 14 43 45 24 9 8 5 9 13 24 24 24 5 9 2000 14 1 1 8 2 4 44 16 10 33 41 13 5 6 3 6 8 20 20 14 3 6 1561 6 3 6 6 3 3 4 3 4 3 3 5 3 3 3 3 4 3 3 3 3 3 227 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 56 7.2 Appendix B: Accuracy of cluster disparate on URLs with an existing cluster Search Term Amount of positive URLs Amount of false positive URLs Potential URLs 0 0 0 Amount of borderlin e/ unknow n URLs 0 0 0 free sound effect teeside python time difference pydoc teeside university bridge transporter 2 1 4 2 1 0 0 0 0 0 0 0 21 URLs with no months in common 16 9 6 URLs with too many months in common 0 2 0 5 17 6 0 2 0 1 0 0 0 0 0 wet n wild python dictionary spaceship .wav c++ xsi tutorials teeside intranet sdl_close python string contains sci entertainment quote fur affinity sound effects gp2x game piracy games age ratings sdl_init “cavazza marc” or “marc cavazza” games piracy xsl set variable reference to undefinied smashing magazine pakistan news boom toon tutorials data protection act pro gaming teams textures don’t stop me now midis pegi c++ ternary operator c++ connect 4 euro pound avfc c++ string imbd 0 3 0 1 1 0 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 24 2 4 9 9 3 4 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 2 0 0 5 0 1 0 1 1 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 25 4 5 6 3 3 1 0 1 0 0 0 0 0 0 0 0 0 3 1 1 0 1 0 0 0 0 0 0 16 4 0 0 0 0 0 0 0 0 0 0 2 0 0 0 1 0 0 0 0 1 6 0 0 0 2 2 0 0 17 0 20 1 1 0 0 0 0 0 0 0 8 7 10 0 0 0 0 0 0 0 1 0 0 0 0 5 4 0 0 0 0 6 0 0 1 0 0 0 0 0 0 0 0 0 0 0 23 0 0 6 2 1 0 0 0 0 0 0 1 0 0 22 wwe spoilers tees.ac.uk teeside uni web messenger photo portfolio hotmail data protection act 1998 linux commands zero punctuation tees blackboard sound sdl teeside uni free textures sdl_mustlock c++ random int arm.linux.rules sdl SUM 0 0 1 0 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 14 9 9 8 4 2 0 0 0 2 0 2 0 0 0 0 0 0 15 0 0 0 0 1 0 2 1 0 0 0 47 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 1 1 4 1 3 7 8 7 2 1 4 363 1 0 1 0 1 0 0 0 0 0 15 2 4 1 0 0 0 0 0 2 0 54 7.3 Appendix C: URL distribution Most coselections For links not in a cluster of (3 epsilon, 2 minimum nodes) No. No. No. No. No. No. No. No. of never active active active active active active links active with with with in in in with other other other only only more other links links links 1 2 than links in in in mont mont 2 only only more h h mont 1 2 than iterat iterat h mont mont 2 ion ions iterat h h mont ions iterat iterat h ion ions iterat ions 42 0 34 2 1 34 2 1 data protection act c++ connect 4 33 free sound effects 23 teeside 14 0 0 0 23 15 7 0 1 2 23 1 0 2 23 15 7 0 1 2 1 0 2 python time difference pydoc teeside university bridge transporter wet n wild python dictionary spaceship .wav c++ xsi tutorials teeside internet sdl_close python string contains sci entertainment quote fur affinity sound effects gp2x game piracy games age rating “cavazza marc” or “marc cavazza” games piracy set xsl variable reference to undefined smashing magazine pakistan news boom toon tutorials pro gaming teams textures don’t stop me now midis pegi c++ ternary operator pound euro avfc 8 0 2 4 0 2 4 0 8 22 8 0 0 0 4 17 6 2 0 0 0 2 0 4 17 6 2 0 0 0 2 0 2 27 4 7 13 11 6 6 0 0 0 0 0 0 0 0 0 21 2 4 7 8 4 3 0 3 0 1 3 1 0 1 0 0 0 0 0 0 0 0 0 19 2 4 7 8 4 3 0 4 0 0 3 1 0 1 0 0 0 1 0 0 0 0 8 1 5 0 0 5 1 0 4 28 7 7 8 4 0 0 1 0 0 1 1 20 2 5 4 3 0 5 1 0 2 0 1 1 1 0 0 0 1 20 3 5 4 3 0 5 1 0 2 1 1 1 1 0 0 0 18 6 4 0 0 0 16 3 1 0 1 1 0 0 0 16 3 1 0 1 1 0 0 0 5 0 2 0 0 2 0 0 6 11 0 1 1 7 0 0 0 0 1 8 0 0 0 0 10 10 13 0 0 0 8 5 10 0 2 0 0 0 0 8 5 10 0 2 0 0 0 0 7 7 0 0 4 4 1 0 0 0 4 4 1 0 0 0 2 4 0 1 0 0 0 0 0 0 0 0 0 0 0 1 24 c++ string imbd wwe spoilers tees.ac.uk teeside uni messenger web photo portfolio hotmail 1998 data protection act linux commands zero punctuation blackboard tees sdl sound teeside uni free textures sdl_mustlock c++ random int arm.linux.rules sdl SUM 8 4 4 16 12 13 10 23 6 0 0 0 0 0 1 0 1 0 5 2 2 13 9 8 8 14 1 1 0 0 1 0 0 0 1 1 0 0 0 0 1 2 0 5 0 5 2 2 13 9 9 8 15 1 1 0 0 0 0 0 0 1 1 0 0 0 1 1 2 0 5 0 6 10 5 5 11 10 9 6 5 6 572 1 0 0 0 0 0 0 0 0 0 8 1 7 1 2 6 6 6 2 3 3 357 0 0 1 1 1 0 1 0 0 1 42 2 1 1 0 1 2 0 0 0 0 24 2 7 1 2 6 6 6 2 3 3 360 0 0 1 1 1 0 1 0 0 1 43 1 1 1 0 1 2 0 0 0 0 26 25