School of Computer and Information Science
Minor Thesis
Analyzing the fragmentation of coselection data
due to volatile search results
Nathan Ronald Williams
2012
Abstract
Initial investigations have indicated that coselections are an effective way to cluster web pages under a
shared meaning. The idea is that URLs coselected under a search term tend to be the result of the
same objective by the user. Though there is some variance, the approach has been shown to be strongly
effective at generating sense-singular results given a high enough threshold.
While the clusters may be sense-singular, there are frequently numerous clusters generated for the
same sense. Approximately one sense-singular cluster per sense should be expected and hence counting
clusters would indicate ambiguity in search terms. However, in many of the cases, search terms appear
to be ambiguous because they have multiple clusters in the results, even though that should not be the
case.
One speculated key factor is the effect of time on the top results, which are subject to change. This could
be causing temporal fragmentation of clusters since there is only a certain window of opportunity for
two URLs to both be high enough in the search results to be selected together – they have to be in the
top N (usually 10) results to be coselected. Through using the time stamp associated with the data, we
aim to uncover the evolution of clusters in time.
The initial proposal of this paper was first to analyse the effect of time and whether it is having a large
effect on the segregation of clusters. The activity of links and clusters was plotted, with analysis of
whether there was enough activity in common between clusters to suggest they have had sufficient
chance of coselection. Expanding on that, a potential solution was proposed: first performing
clustering and then joining disparate clusters by lowering the threshold for clusters that have
few URLs active at the same time.
While results were inconclusive due to a lack of available data, it is hoped that the methodologies
formed will be relevant to future studies as more data is collected from various sources.
Table of Contents
1 Introduction
1.1 Motivation
1.2 Research Questions
2 Literature Review
2.1 Background
2.1.1 Clickthrough
2.1.2 Detecting ambiguity
2.2 Related Work
2.2.1 Coselections
2.2.2 Search Engine Results
3 Methodology
3.1 TimeStamping
3.2 Measuring Activity
3.2.1 Loss of activity
3.2.2 Loss of URLs
3.3 Cluster Disparate
4 Results and Discussion
4.1 Analysis of Data
4.2 Measurement of activity
4.3 Measurement of Loss of URLs
4.4 Cluster Disparate
5 Conclusion
6 References
7 Appendix
7.1 Appendix A: Coselection Count of terms with at least one cluster
7.2 Appendix B: Accuracy of cluster disparate on URLs with an existing cluster
7.3 Appendix C: URL distribution
1 Introduction
Trails of data generated from users interacting with search engines provide a significant resource for
classifying information on the World Wide Web. The patterns of user behaviour found in the search logs
help indicate the context a user is applying to a search term. It has been proposed that this information
can aid in ambiguity and synonym detection (Ashman et al, 2011) amongst other useful tools.
Initial progress began with clickthrough data which proved a useful source for clustering resources
together (Beeferman & Berger 2000). The process involves gathering URLs selected by users under a
search term. Though it was useful to find which search terms had URLs in common, ultimately
coselection data would provide a more useful metric for indicating the relevance of URL to URL by
wrapping up URLs selected together in the one query. While the exact relationship between URLs can
vary depending on user intent, Ashman et al (2011) have found that users generally search on a term
with a single semantic purpose in mind. Though users may occasionally choose something irrelevant to
an objective, the majority of coselected URLs seem to indicate a strong mutual relevance, much more so
than many other selection methodologies.
This process overcomes two significant hurdles to past terminology detection. First, the
process exploits the fact that users are making a direct judgement on the information that
fulfils their needs for the terms they have specified. By contrast, semantic and lexical analysis has
suffered from being unguided and lacking human involvement (Tamir & Rapp 2003). Second, the
oldest method, explicit human judgement, is too time consuming to yield a complete picture (Riloff
1993). Coselection overcomes these two hurdles by providing user relevance judgement from an activity
people perform ubiquitously in their daily lives.
This thesis proposes to detect ambiguity by counting the number of clusters generated. This method of
detecting ambiguity first involves using coselections as a similarity measure to aggregate
semantically-similar collections of URLs. This is created by first forming a term graph of URLs for each term where
edge weights indicate how many times a URL is selected in common with another URL. Clusters are then
formed by aggregating URLs that are regularly coselected together enough to indicate they are of the
same sense. Each cluster should therefore represent part or all of a sense a search term can be used in.
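The term-graph construction described above can be sketched as follows. This is a minimal illustration, assuming each session is supplied as a list of the URLs selected for one query under a single search term. The function names are hypothetical, and grouping by connected components over a fixed edge-weight threshold is a simple stand-in chosen for the example, not the clustering procedure of the cited work.

```python
from collections import defaultdict
from itertools import combinations


def build_term_graph(sessions):
    """Count, for one search term, how often each pair of URLs is coselected.

    `sessions` is an iterable of URL collections, one per query session
    (an assumed input layout). Returns {(url_a, url_b): weight} with
    url_a < url_b.
    """
    weights = defaultdict(int)
    for urls in sessions:
        # Each unordered pair of distinct URLs in a session is one coselection.
        for a, b in combinations(sorted(set(urls)), 2):
            weights[(a, b)] += 1
    return dict(weights)


def cluster_by_threshold(weights, threshold):
    """Group URLs joined by edges whose weight is at least `threshold`.

    URLs with no edge reaching the threshold are left unclustered.
    """
    parent = {}

    def find(u):
        # Union-find lookup with path halving.
        parent.setdefault(u, u)
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for (a, b), w in weights.items():
        if w >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for u in list(parent):
        groups[find(u)].add(u)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


sessions = [["a", "b"], ["a", "b"], ["b", "c"], ["d", "e"], ["d", "e"]]
graph = build_term_graph(sessions)
print(cluster_by_threshold(graph, threshold=2))  # [['a', 'b'], ['d', 'e']]
```

With the threshold lowered to 1, the single coselection of "b" and "c" is enough to pull "c" into the first cluster, which illustrates why the threshold trades cluster size against sense-singularity.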
Figure 1-1 Term graph for "pernstejn" with vertices corresponding to Web resources (Ashman et al 2011)
Experiments so far indicate that clusters can be successfully resolved to sense singularity, however,
currently there are also many clusters for a single sense. This research aims to address issues that can
reduce the number of clusters to something more meaningful. One major issue that appears to be
creating more clusters is the effect of volatile top search results over time. If changes happen too
abruptly, old URLs cannot be coselected with new URLs, which creates division in clusters. The true extent of
this effect will be measured, and possible solutions to reduce it will be researched.
1.1 Motivation
Too many clusters for the one meaning results in a large number of disparate chunks of data that are
difficult to analyse. However, the fact that they were still mostly sense-singular indicated a solid platform to build
on. Ideally, by aggregating those semantically-similar clusters, it will become feasible to judge whether
the term is ambiguous or not, by counting the number of clusters. It was postulated in Ashman et al.
(2011) that more clusters indicated more potential ambiguity although the data they investigated did
not confirm or deny this.
Ashman et al (2011) speculated that more data would bring together bigger clusters. Though such a
quantity of data would likely be available to major search engines for common terms, it is not
available to current research. An issue speculated to be breaking up the results is the effect of volatile
top search results. Since users mainly select from the first page (Jansen & Spink et al 2006), if there
are no smooth, gradual transitions over time, clusters will break apart for lack of coselections
to join them together. To find the effect of time, this paper proposes to time stamp the data to uncover
the evolution of clusters in time, thereby discovering whether there are broken links between large
clusters of the same meaning at a certain period in time.
This paper therefore aims to discover the effect this phenomenon has on the clustering and look into
possible solutions for overcoming this problem.
1.2 Research Questions
QUESTION 1 Does the effect of volatile top results create separation in clusters?
If users mostly select only from the first page of results (Jansen & Spink et al 2006), we expect to find
that rapid change means the opportunity for coselections to be created drops significantly as soon as
one of the participating pair leaves.
QUESTION 1.1 Can clusters separated by time be brought together and retain sense-singularity?
We plan to compare the outputs of standard clustering against aggregation of temporally-separate
clusters to see whether it marks an improvement on the whole-dataset cluster
aggregation process.
QUESTION 2 Given a set of these aggregated clusters, is it feasible to determine via cluster cardinality
whether or not a given term is ambiguous?
We postulate that an improved clustering mechanism will successfully aggregate at least some
formerly-distinct clusters with the same semantics. This should result in a significant correlation between the
number of clusters for a term and a ground-truthed value for the ambiguity of that term.
2 Literature Review
2.1 Background
2.1.1 Clickthrough
Clickthrough data provides a key resource on how search engines are used. It records a history of a
user’s interactions but does not definitively state what the user’s intentions were or how successful
they were. Nevertheless, it has been speculated that in the great quantities that are possible on the
World Wide Web, a good analysis of the data can be even more accurate than using explicit feedback
from the user (Dou et al 2008). As a result the data has been a key area of interest since search engines
became a widespread means to browse the internet.
Applications
The first recorded use was when Lieberman (1995) applied the resource to dynamic personalized tools
for browsing the web in an application called Letizia. Since then it has proven popular for a wide range
of different applications, one of the most attractive being to improve the output of the
search rankings. Many researchers have tackled this issue (Joachims 2002) (Agichtein et al
2006) (Carterette & Jones 2007) (Gao et al 2009) (Dupret & Liao 2010), all keen to extend the accuracy of
automated ranking to a range of knowledge so large that it cannot be ranked completely
manually. This field has been of great value in creating accurate interpretations of clickthrough data, which
are otherwise clouded by the fuzziness of user decisions.
Numerous other components of search engines have also benefited from feeding back search log data.
Sun et al (2005) found they could improve the captions of search results by determining significant
words in regularly clicked documents. That captions containing terms from the query are more helpful is
also supported by Clarke et al (2007), who performed a wide analysis of what makes a good caption,
supported by the clickthrough data they applied.
More broadly, the information collated can reveal how search engines are used. Pass et al (2006)
derived a wide set of statistics that help create a picture of user behavior, while Ashkan et al (2008)
went into more detail to classify the intent of a user’s search as transactional, navigational or
commercial, which was proposed as a way to assist associated advertisements.
Clickthrough data also has value as a means to extract meaningful associations between URLs and
categorize the broad range of information on the Internet. For instance Xu et al (2009) performed
Named Entity Mining on the data, compiling the links from specific search terms into various
information types. Another idea has been to cluster resources together where a number of links are
commonly selected for multiple search terms (Beeferman and Berger 2000). This idea of clustering links
into useful groups is a core foundation of the work of Ashman et al (2011), which is the groundwork
from which this paper is derived.
Clustering
The idea of clustering URLs and search terms was initiated by Beeferman and Berger (2000). Search
terms that had many of the same URLs selected were grouped together along with their associated
URLs. This resulted in groups of information sharing common ground. Its main limitation was that
it required top search results to have the same URLs available for multiple terms, but it was nonetheless
capable of grouping a vast array of terms for many data sets. One proposed application of the results was
to provide search term recommendations.
As a direct expansion of Beeferman and Berger, Chan et al (2004) noted that the algorithm was subject to
noise, as a link with very few clicks in common could be included. They proposed a cut-off
based on how many times the links have been chosen under those terms as a proportion of the overall
number of document selections for those terms. This effectively eliminated noise as the major source of
error.
Further, by considering each user’s actions separately Leung et al (2008) found they could disambiguate
a user’s intentions. If a user’s choice fit the mould of a certain group of users that chose the fruit apple,
they could receive results on that specific option over apple computers which fit a different group of
users.
Similarly, Gao et al (2010) suggest that search terms that are clustered together can be considered synonyms.
To do so they followed a particularly novel approach where queries are considered similar if the titles of
their main URLs regularly selected in clickthrough data feature the same bi-phrases. This extended
Beeferman and Berger’s concept by making it possible to cluster multiple search terms even if they do not
feature common URLs in the top search results, while also supporting a common language metric between
search terms and titles/documents.
A unique form of clustering was further introduced by Ashman et al (2011), which utilized coselections
as a similarity metric between URLs. The binding together under the one search sense was much
stronger in this scenario than that of using the entire search history of individual users as a similarity
metric. Greater detail can be found in 2.2.1 Coselections.
Accuracy
A key factor in clustering information sources is the presumption that documents selected tend to be
relevant to the user’s original search sense. It has been speculated that the usefulness of abstracts may be an
issue, but they were verified to be helpful in 82.6% of cases (Joachims et al 2007). The larger issues found
are the user’s decision making and their tendency to browse, which bring the overall accuracy of
selected documents being relevant to the search sense to only 52% (Scholer et al 2008). Nevertheless Dou et al
(2008) assert that a good analysis of the data, combined with the huge quantities possible on the
World Wide Web, may produce better results than direct user feedback.
One of the most common methodologies to smooth out the results is the idea that the higher the
document in the search results, the more likely it is to be picked. The main factor is the trust bias the
user has in the search engine that the higher results are more accurate. Therefore the quality of the
retrieval system is a big factor in influencing which documents are selected. Further Granka et al (2004)
verified through eye tracking that users have a tendency to analyze results from top to bottom most of
the time. As an extension of these ideas Joachims et al (2002) created a methodology for analysis that
asserted any document selected is more relevant than all the documents above it that weren’t selected
by the user in the one session. This represents the main methodology that has been extended upon to
determine relevance of the document to the search term.
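The "clicked document is preferred over skipped documents above it" rule can be illustrated with a short sketch. The function name and the input layout (a ranked list of document identifiers plus the set of clicked ones) are assumptions made for this example, not taken from the cited work.

```python
def click_skip_above_preferences(ranking, clicked):
    """Return (preferred, less_preferred) pairs for one result list.

    A clicked document is taken to be more relevant than every
    unclicked document ranked above it in the same session.
    """
    clicked = set(clicked)
    pairs = []
    for pos, doc in enumerate(ranking):
        if doc in clicked:
            for above in ranking[:pos]:
                if above not in clicked:
                    pairs.append((doc, above))
    return pairs


# d1 and d3 were clicked; d2 was skipped although it ranked above d3.
print(click_skip_above_preferences(["d1", "d2", "d3", "d4"], {"d1", "d3"}))
# [('d3', 'd2')]
```

Note that no preference is derived for d4: documents below the last click may simply not have been examined.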
Another consideration suggests that image search may be substantially more accurate than text search.
The theory is that images are a more direct description than captions are, which was proposed to
lessen the problem of bad user judgment. These results were found to be highly accurate, at
about 88% (Smith & Ashman 2009). However, recent research has suggested that text search is in fact
just as good, and the smaller quantity of image search data is a challenge to resolve.
2.1.2 Detecting ambiguity
In 2007 Ashman et al proposed a Global Perpetual Dictionary of everything. A key component was that
search log data would be able to run an automatic scan for ambiguous terms without the need for
human involvement. This research represents a shift towards involving user implicit judgments that are
carried out through a ubiquitous activity rather than analyzing the structure of a discourse.
Disambiguation
Disambiguation has been of interest to the field of computing since its early days in the 1950s. The first
identified field for disambiguation was in the context of machine translation of languages. Ambiguous
terms were one of the key constraints to otherwise providing one to one translation. The early
philosophy of the task was that one word at a time cannot determine the meaning of an ambiguous
term, but given the context it can be resolved (Weaver 1955). Thus disambiguation involved forming a
methodology of picking the right meaning of each ambiguous term based on what a sentence implied.
A key aspect of disambiguation is that it relies on a resource for the representations of each meaning in
ambiguous terms. The enormous knowledge base required across the entire vocabulary has been a
significant hurdle to assigning a word its implied sense from a sentence (Gale et al 1992). One popular
source has been Machine-Readable Dictionaries (MRDs), but these have so far not enabled the
automatic extraction of large knowledge bases (Amsler 1980). WordNet is the only MRD that is widely
available today and is limited by its hand creation. Finding the ambiguous term senses would be the first
step towards a comprehensive MRD, as is the case with WordNet which uses a synset tree of words to
represent the meanings (Tamir & Rapp 2003). Ashman et al’s (2007) proposal represents a way of
discovering ambiguous terms and potentially their meanings in a more comprehensive fashion by
making valid interpretations on the way they are used in an implicit activity.
Machine Readable Dictionaries
Machine readable dictionaries are currently the most common resource for the structure of word
senses. WordNet, the most fully formed today, features words grouped into synsets of the same sense
referencing other synsets with key relationships. Ambiguity is represented where a term is found in
multiple synsets (senses). The biggest difference between each sense is they should reference a
different set of hypernyms implying they are a different kind of entity.
Nevertheless, using MRDs as a resource for ambiguity has been limited by the unclearly defined
boundaries of senses. There have been complaints MRDs often make unnecessary and difficult
"forced-choices" (Dolan 1994). Attempts have been made to address this such as clustering with the aid of a
thesaurus to help eliminate distinctions that are unnecessarily fine grained (Chen et al 1998).
These tough distinctions made in an MRD can lead to too many unimportant senses which clutter up
tasks such as disambiguation (McCarthy et al 2004). It has been suggested ranking of sense relevance
can therefore be of value to distinguish which are most useful. These sorts of findings pose challenges to
clustering coselection data as it is not clear whether users will mainly coselect on major senses or if
minor ones will be distinguished. Most likely the minor senses will build into major ones
by the nature of users selecting even small similarities, but the nature of the boundaries needs to be
investigated.
Traditionally, all the crafting of MRDs has been done manually, by hand and through research, but increasingly
there have been ways of finding relationships through more automated forms. A common methodology
for determining hypernyms has been to look through discourse for commonly occurring patterns:
• “Bruises[,] wounds[,] broken bones [or other] injuries . . .”: where the nouns are implied to be
a type of injury (Hearst 1992)
• “Boeing[, a] defense contractor”: where “defense contractor” is an appositive of Boeing
(Caraballo 1999).
By finding hypernyms this way, nouns sharing the same hypernyms likely indicate a shared sense
allowing somewhat of a synset to form (Caraballo 1999). In another scenario, senses and their
descriptions can be found more directly using a complex set of trigger words, though this approach is limited to
specialized topics by its less generic nature (Riloff 1993). Meanwhile Agirre et al (2000) have
used the World Wide Web to enrich and refine the current content in WordNet.
Despite these efforts, it is still a long way in breadth and accuracy from seeing the complete automation
of further MRD construction, beyond providing possible guides. This paper embarks on adding to the
knowledge of ambiguous terms with the clusters of coselected URLs ideally representing the major
senses.
Word Sense Induction
The most direct field of finding ambiguity is word sense induction since the best way to find ambiguous
terms is to find the distinct senses. In word sense induction, these senses are typically found by
assigning words commonly used with the target word to clusters which are formed by their usage in
discourse (Pantel and Lin 2002) (Dorow & Widdows 2003). Thus far an accuracy of 72% that a cluster
correlates to a correct sense has been achieved.
Increasingly, the World Wide Web represents a large scale resource that is easy to access for gathering
word senses. The use of Google to mine the web for senses was suggested by Tamir & Rapp (2003), who
were inspired by the work of both Gale et al (1992) and Yarowsky (1995), which suggests that ambiguous
words only occur in one sense in a given document and that words close to a term give some indication
of the sense of the target word. This leads to the assumption that a good indicator of ambiguity is when
two words each commonly occur with the target word but rarely occur together with it.
Ultimately, finding two words that represent two different meanings of an ambiguous
term was successful, but often the association of the words to the meaning is weak since regularly used
words are rare.
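The co-occurrence indicator described above can be made concrete with a toy sketch. It assumes each context is already reduced to a set of words; the function name and the two cut-off parameters are hypothetical, and this is an illustration of the stated assumption only, not the procedure used by Tamir & Rapp.

```python
from itertools import combinations


def ambiguity_candidates(contexts, target, min_with_target, max_together):
    """Find word pairs that each co-occur often with `target` but rarely
    appear together in the same context as `target`.

    `contexts` is an iterable of sets of words (an assumed input layout).
    """
    with_target = {}  # word -> number of target contexts containing it
    together = {}     # (word_a, word_b) -> number of shared target contexts
    for words in contexts:
        if target not in words:
            continue
        others = sorted(set(words) - {target})
        for w in others:
            with_target[w] = with_target.get(w, 0) + 1
        for pair in combinations(others, 2):
            together[pair] = together.get(pair, 0) + 1

    frequent = sorted(w for w, n in with_target.items() if n >= min_with_target)
    return [
        (a, b)
        for a, b in combinations(frequent, 2)
        if together.get((a, b), 0) <= max_together
    ]


contexts = [
    {"palm", "tree", "beach"},
    {"palm", "tree", "leaf"},
    {"palm", "hand", "finger"},
    {"palm", "hand", "wrist"},
]
print(ambiguity_candidates(contexts, "palm", min_with_target=2, max_together=0))
# [('hand', 'tree')]
```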
In spite of these efforts, the biggest difficulty so far is the need to narrow down terms and search
senses; the processes so far are too complex to cover the full breadth of terms. Since clustering coselection search data
does not require analysis of masses of discourse in the same way, the use of implicit judgments as a one
to one relationship provides a way to streamline the complexity of such a task.
Disambiguation in IR
Disambiguation has also featured in the context of Information Retrieval (IR). The philosophy has been that
the results provided by IR can be more successful when all resources are cleared of ambiguity.
The first step involves replacing all the ambiguous terms in every discourse with words that correlate to
the more specific meaning implied by its context (Voorhees 1993) (Sussna 1993). However, this proved
to be a futile attempt as it only produced worse search results than the original unedited resource.
Conversely, Sanderson (1997) measured the effect of increased ambiguity within discourses on IR.
The results showed that when random words were appended to those in the discourse, the added ambiguity did
not hinder the success of an IR system as much as expected. Disambiguation of resources was therefore
a challenge offering only small gains, and a difficult one due to the low accuracy of current disambiguators.
A related finding by Krovetz & Croft (1992) on the weakness of disambiguation in IR concerned
the way search terms were used. They found two major factors that were causing ineffective
results:
1. Most ambiguous terms have a dominant meaning, so most results feature the resources that
most users were searching for.
2. Search terms often achieve disambiguation by collocation: the more words found in the search
term, the more likely other words will imply their meaning like a sentence would. With all search
terms being searched on, more searches are quickly funneled into the context of them all.
These two examples highlight that queries need to be short for ambiguity to be an issue in both the
term and the discourse. Further, by the nature of a search engine, small search terms can overcome
their ambiguity by being applied back in with added words for more specific results.
2.2 Related Work
2.2.1 Coselections
A special use of clickthrough data emerged in collecting it as coselections, where multiple URLs are
chosen in the one session (Smith et al 2009). It provides a direct URL-to-URL similarity metric
under the assumption that users usually select with a specific search sense in mind. By extension of this
idea, each separate cluster should ideally be unambiguous and represent an individual meaning for the
search terms.
Through the DBSCAN algorithm, Ashman et al (2011) successfully reduced clusters to a single sense;
however, multiple clusters for the same sense regularly emerged. A major factor in the lack of accuracy
was the small amount of data available. Additionally it was speculated that changes in top results would
cause fragments in clusters since two URLs need to be there at the same time to have a good chance of
being coselected. This paper aims to address this issue by discovering how significantly time is affecting
clusters and provide some potential solutions to overcome it.
Caon et al (2012) further expanded on this work by utilizing a cluster by overlap method to link similar
clusters over different search terms. These relationships should indicate that two terms with a cluster in
common have a similar usage which suggests they are synonyms. Through tweaking the DBSCAN
algorithm parameters, these results were successfully resolved to a large number of positives that were
all verified in accuracy.
2.2.2 Search Engine Results
For two results to be coselected, their position in the results is important. Users tend to
view only the first page of results about 85.2% of the time, a figure which has been increasing over the years as
search engine accuracy and the trust invested in them by users have grown (Silverstein et al 1999)
(Jansen & Spink 2006). URLs therefore have a much smaller chance of being coselected if they are not
both on the first page.
This issue becomes more apparent with top results frequently evolving. New pages are regularly being
added and the measurements used to rank them also fluctuate. With this evolution, many URLs have a
limited time to appear together on the first page before they lose their main chance of being coselected,
while many others will never spend time together. Results often show very little
consistency over time and can jump up and down with very little to suggest major change occurred (Bar-Ilan 1999).
As is the case with Google, there are many different measurements used for rankings, such
as anchor pages that link to the main page. These anchor pages add to a page’s significance but also
feature additional descriptions helpful for its relevance (Brin & Page 1998). With so many variables,
results can change for many different reasons.
Nevertheless, Bar-Ilan et al (2006) plotted changes on daily intervals and found Google to be mostly
stable with small changes happening often but usually only in increments and rarely making big jumps
(see Figure 2-1). The effect these changes have on clusters comes down mainly to how
often URLs occur together on the first page, irrespective of how many times each has moved in and out, as this
indicates their chance of coselection. It is important to analyze the effect this has on the construction of
clusters.
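As a rough illustration of that measure, the sketch below counts how many result snapshots place two URLs on the first page together. It assumes a series of ranked result lists (for example one per day) is available; the function name and input layout are hypothetical, and no such snapshots are available for the data in this study.

```python
def copresence_count(snapshots, url_a, url_b, top_n=10):
    """Count snapshots in which both URLs appear in the top `top_n` results.

    `snapshots` is a sequence of ranked URL lists, one per observation
    (an assumed input layout). How often a URL enters or leaves the
    first page does not matter, only how often the pair is there together.
    """
    return sum(
        1
        for ranking in snapshots
        if url_a in ranking[:top_n] and url_b in ranking[:top_n]
    )


snapshots = [
    ["a", "b", "c"],
    ["a", "c", "d"],
    ["b", "a", "d"],
]
print(copresence_count(snapshots, "a", "b", top_n=2))  # 2
print(copresence_count(snapshots, "a", "d", top_n=2))  # 0
```

A pair with a count of zero never had the opportunity to be coselected from the first page, however relevant the two URLs are to each other.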
Figure 2-1 The top-ten results of the query "DNA evidence" on Google (Bar-Ilan et al 2006)
An additional challenge posed to clustering coselections is that the vast majority of search terms are
used very few times. Silverstein et al (1999) found, out of a very large set of 154 million queries, only
13.4% of them occurred more than 3 times. By contrast, the top 100 query terms tend to be used as
much as 20% of the time (Spink et al 2002). Good results are therefore constrained to a smaller set of
terms as many don’t even have a chance of being clustered, let alone having sufficient detail to be
successful. For broader results, clustering needs to be as efficient as possible and make best use of as
little data as possible.
3 Methodology
3.1 TimeStamping
In order to analyze when clusters are broken due to volatile top search results, each coselection was
time stamped. All the data from the log files was reprocessed, since the previously extracted coselections
do not contain the time information. Without the old parser, this meant repeating the extraction from scratch. The log
files were taken from a server for a workgroup of computers, where users are more likely to have
similar search queries. Out of all the interactions possible through the server, queries to Google, which
is by far the most popular search engine (ebiz 2012), were extracted. Each transaction carries a time
stamp, which is used as a record of when each interaction occurred. This allows the evolution of
clusters to be tracked in time to determine how clusters have been broken. Though it is impossible to backtrack to say
when and where the URLs were part of the top results, the activity of the URLs provides some
observations.
Figure 3-1 An example of server log data
3.2 Measuring Activity
The primary methodology is formed from the conjecture that URLs forming part of a coselection are in
or very close to the top search results and when no longer selected, may have fallen outside. The major
concern is how disparate clusters are, which is measured by how often URLs selected in one cluster are
selected at the same time as URLs in another. By measuring disparateness this way, the activity of any
two clusters can be correlated to one of three main outcomes:
• Completely disparate clusters: These clusters were likely impossible to join by the coselection
metric, as no two of their URLs were ever selected in the same time frame. Too many of these
indicate that volatility of results is disrupting the clustering.
• Slightly disparate clusters: These clusters still have some URLs selected at the same time, but
only very rarely. They therefore have less chance of gathering enough coselections to join, and
with less chance, a smaller epsilon value may be more appropriate.
• Non-disparate clusters: These clusters frequently overlap in activity but, having few coselections
in common, remain two distinct clusters. This is the ideal scenario, which likely indicates a
strong correlation to clustering success. If multiple clusters still represent the same meaning in
this scenario, then it is a limitation of clustering coselections itself. Furthermore, we speculate
that in this scenario the more coselections there are, the more accurate the clustering is likely to be, so how often two
non-disparate clusters represent the same meaning should be measured against the coselection size.
Since clusters representing the same meaning are the main factor being evaluated, individual URLs are
of less significance. However, the same measurement can be applied to them, particularly where data is scarce, to
indicate where they are situated.
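The three outcomes above can be sketched as a small classifier. This is only an illustration: the representation of a cluster as a mapping from URL to the set of iterations in which it was selected, the function names and the `rare_limit` cut-off for "only very rarely" are all assumptions rather than part of the method as implemented.

```python
from typing import Dict, Set

# Hypothetical representation: a cluster maps each URL to the set of
# iterations (for example month numbers) in which that URL was selected.
Cluster = Dict[str, Set[int]]


def shared_iterations(a: Cluster, b: Cluster) -> int:
    """Count the iterations in which both clusters had a URL selected."""
    active_a = set().union(*a.values())
    active_b = set().union(*b.values())
    return len(active_a & active_b)


def classify_pair(a: Cluster, b: Cluster, rare_limit: int = 2) -> str:
    """Place a cluster pair into one of the three outcomes.

    rare_limit is an assumed cut-off for "only very rarely".
    """
    shared = shared_iterations(a, b)
    if shared == 0:
        return "completely disparate"
    if shared <= rare_limit:
        return "slightly disparate"
    return "non-disparate"


cluster1 = {"url1-1": {1, 2, 3}, "url1-2": {2, 4}}
cluster2 = {"url2-1": {13, 14}, "url2-2": {15}}
cluster3 = {"url3-1": {4, 5}}
cluster4 = {"url4-1": {1, 2, 3, 4}}

print(classify_pair(cluster1, cluster2))  # completely disparate
print(classify_pair(cluster1, cluster3))  # slightly disparate
print(classify_pair(cluster1, cluster4))  # non-disparate
```

The same function can be applied to a single URL by wrapping it in a one-entry cluster.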
3.2.1 Loss of activity
Since many URLs are active on and off over a period of time, a measurement of complete inactivity
has to allow a leeway of a small time period, so that a URL can chain together
activity over some time. Scenarios where there has been regular activity on a URL followed by a
complete halt provide our biggest indication that a popular URL has dropped off for good.
This is an alternative to disparateness that gives a slightly clearer-cut sign that a URL has fallen outside of the top
results. The number of URLs completely segregated between two clusters then provides an idea of how
many potential joins were greatly affected by volatile results.
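A minimal sketch of this check follows, assuming a URL's activity is available as a set of iteration numbers. The thresholds `min_active` (what counts as regular activity) and `leeway` (how long a silence is tolerated) are hypothetical values chosen for illustration.

```python
from typing import Optional, Set


def dropped_off(active: Set[int], last_iteration: int,
                min_active: int = 3, leeway: int = 2) -> Optional[int]:
    """Return the last active iteration if the URL appears to have halted.

    A URL counts as dropped off when it was active in at least min_active
    iterations and has then been silent for more than leeway iterations,
    measured up to last_iteration. Both thresholds are assumed values.
    """
    if len(active) < min_active:
        return None
    last_seen = max(active)
    if last_iteration - last_seen > leeway:
        return last_seen
    return None


print(dropped_off({1, 2, 3, 4}, last_iteration=12))  # 4
print(dropped_off({1, 2, 11}, last_iteration=12))    # None
print(dropped_off({1}, last_iteration=12))           # None
```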
3.2.2 Loss of URLs
An overall measurement of the loss of URLs may also be formed for search terms with a lot of activity,
which may indicate, on average, how often URLs drop out of the top search results. To
analyze the results, information is gathered in time fragments that may be weekly, fortnightly
or monthly. For each segment of time there should be a fairly similar number of URLs chosen, mainly
those in the top results but also a small number outside. As time passes, the total number of URLs
selected over all time increases, while the number selected in each individual time period should
remain roughly the same. The difference provides a key correlation to how many URLs drop outside the top
results. Additionally, in rare cases some URLs will move in and out of the top results; when they
return, they are counted again, provided they were inactive for enough iterations to be considered
completely inactive. The average number of new URLs per iteration over all the significant search terms
would then determine how often top results change.
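The counting described here can be sketched as follows. The input format (one set of selected URLs per iteration) and the `inactive_threshold` value are assumptions made for the example.

```python
from typing import Dict, List, Set


def new_urls_per_iteration(selections: List[Set[str]],
                           inactive_threshold: int = 3) -> List[int]:
    """Count the URLs treated as new in each iteration.

    A URL is new the first time it is selected, and again when it returns
    after at least inactive_threshold consecutive inactive iterations.
    """
    last_seen: Dict[str, int] = {}
    counts = []
    for i, urls in enumerate(selections):
        new = 0
        for url in urls:
            previous = last_seen.get(url)
            if previous is None or i - previous - 1 >= inactive_threshold:
                new += 1
            last_seen[url] = i
        counts.append(new)
    return counts


# Hypothetical selections for one search term over five iterations.
selections = [{"a", "b"}, {"a", "c"}, {"a"}, {"a"}, {"a", "b"}]
counts = new_urls_per_iteration(selections)
print(counts)  # [2, 1, 0, 0, 1]

# Average turnover, leaving out the first iteration where every URL is new.
print(sum(counts[1:]) / len(counts[1:]))  # 0.5
```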
3.3 Cluster Disparate
In order to improve clusters, a cluster disparate function has been proposed to find clusters that are
rarely active at the same time and allow them a looser epsilon value. To determine disparateness, a timeline
of URL usage is required in which an individual iteration lasts a small period of time, such as a fortnight or a
month, the exact period being determined by how often URLs tend to drop off the top results, as discussed
earlier. Each pair of clusters that has few iterations in common is
considered disparate. Since these clusters have much less chance to gather the coselections necessary
to join, the epsilon value would be reduced to make joining easier.
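As an illustration of how such a timeline might be built from time-stamped selections, the sketch below buckets each selection into a monthly iteration. The record format (URL paired with a `datetime`) and the function names are assumptions; the actual log format is not reproduced here.

```python
from collections import defaultdict
from datetime import datetime


def month_iteration(stamp, start):
    """Number of calendar months from start to stamp."""
    return (stamp.year - start.year) * 12 + (stamp.month - start.month)


def build_timeline(selections, start):
    """Map each URL to the set of monthly iterations in which it was selected."""
    timeline = defaultdict(set)
    for url, stamp in selections:
        timeline[url].add(month_iteration(stamp, start))
    return dict(timeline)


# Hypothetical time-stamped selections: (URL, time of selection).
selections = [
    ("url1-1", datetime(2010, 1, 12)),
    ("url1-1", datetime(2010, 2, 3)),
    ("url1-2", datetime(2010, 2, 20)),
    ("url2-1", datetime(2011, 3, 8)),
]
timeline = build_timeline(selections, start=datetime(2010, 1, 1))
print(timeline)
```

A fortnightly iteration would only need a different bucketing function, for example one dividing the elapsed days by 14.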
[Figure 3-2: Measuring disparateness (number of times URLs are selected). Two timeline panels, "Completely Disparate Clusters (no coselections possible)" and "Disparate Clusters (lower epsilon may be helpful)", each listing Cluster1 (url1-1, url1-2) and Cluster2 (url2-1, url2-2, url2-3) against Iterations (Months).]
Figure 3-2 An example of disparateness
To measure whether a pair is in fact disparate, there are two major cut-offs. The first aggregates,
over every iteration, the number of URLs active in whichever cluster has fewer URLs active in that iteration, i.e.
\[ \alpha = \sum_{t_i = 0}^{\mathrm{totalTimeIterations}} \min\bigl(\mathrm{clusterA.URLsActive}(t_i),\ \mathrm{clusterB.URLsActive}(t_i)\bigr) \]
The reason not every URL pair is measured is that doing so would multiply the count in a way that is not
proportional to the coselection chance, which depends on the number of URLs available to coselect.
This measurement is referred to as α, and it is subject to the maximum cut-off allowed for a pair of clusters to be
considered disparate.
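The α measurement can be sketched directly from its definition. The cluster representation (URL mapped to the set of iterations in which it was active) and the function names are assumptions; the summation range follows the formula, running from 0 to totalTimeIterations inclusive.

```python
def urls_active(cluster, t):
    """Number of URLs in the cluster that were selected in iteration t."""
    return sum(1 for iterations in cluster.values() if t in iterations)


def alpha(cluster_a, cluster_b, total_time_iterations):
    """Sum, over all iterations, the smaller of the two active-URL counts."""
    return sum(
        min(urls_active(cluster_a, t), urls_active(cluster_b, t))
        for t in range(total_time_iterations + 1)
    )


# Hypothetical activity, keyed by URL, as sets of month numbers.
cluster_a = {"url1-1": {1, 2, 3}, "url1-2": {3, 4}}
cluster_b = {"url2-1": {3, 4}, "url2-2": {4}, "url2-3": {3, 5}}

print(alpha(cluster_a, cluster_b, total_time_iterations=12))  # 3
```

In this example the clusters overlap in months 3 and 4, contributing min(2, 2) = 2 and min(1, 2) = 1. Counting every URL pair instead would give 2 × 2 + 1 × 2 = 6, which is the inflation the minimum is meant to avoid.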
The second measurement is a finer-grained cut-off that is proportional to the total number of URLs
featured in the smaller cluster. It applies only to small clusters, which require a smaller cut-off than the
maximum possible, since disparateness is less meaningful for smaller clusters that are much more likely
to have less activity than larger ones. The measurement finds the amount of α per URL in
the smaller cluster and determines the rate at which the cut-off needs to increase, since the more URLs there are, the less need there is to
make the distinction.
If a pair of clusters passes these two checks, it is determined to be disparate. In this scenario a
smaller epsilon threshold for joining may be considered appropriate. The key is determining what values
constitute disparateness and whether the lower epsilon value is valid, that is, whether it avoids joining
clusters that are not of the same sense.
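One possible reading of the two checks is sketched below: the effective cut-off is the per-URL rate multiplied by the size of the smaller cluster, capped at the maximum cut-off. The parameter values (`max_alpha`, `rate_per_url`) and the epsilon values of 3 and 2 used as defaults are assumptions for illustration, since the appropriate values are exactly what remains to be determined.

```python
def is_disparate(alpha, smaller_cluster_urls, max_alpha=10, rate_per_url=2.0):
    """Apply both cut-offs to a cluster pair.

    The cut-off grows with the size of the smaller cluster but never
    exceeds max_alpha. Both parameter values are assumed.
    """
    cut_off = min(max_alpha, rate_per_url * smaller_cluster_urls)
    return alpha <= cut_off


def joining_epsilon(alpha, smaller_cluster_urls,
                    normal_epsilon=3, relaxed_epsilon=2):
    """Pick the epsilon used when deciding whether to join the pair."""
    if is_disparate(alpha, smaller_cluster_urls):
        return relaxed_epsilon
    return normal_epsilon


print(joining_epsilon(alpha=4, smaller_cluster_urls=2))   # 2
print(joining_epsilon(alpha=4, smaller_cluster_urls=1))   # 3
print(joining_epsilon(alpha=12, smaller_cluster_urls=8))  # 3
```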
4 Results and Discussion
4.1 Analysis of Data
Unfortunately the server log data did not produce as many coselections as had been expected; only
approximately 1 Google URL request was found per 750 lines. The vast number of interactions passing through
the server from various applications and other tasks unexpectedly dwarfed web requests. As a
result, even at the barest minimum of epsilon 3 and minimum nodes 2, only one term formed 2 clusters and
55 formed just one cluster. In most of these cases there was far too little coselection information to make
valid judgements, with clusters at most only just meeting the epsilon and/or minimum nodes criteria.
The primary issue seems to be the quantity of searches available. The data set uncovered a total of only
57,920 searches; as a result, the proportion of unique searches that occurred at least twice was very
small, at 11.6%. This contrasts with the findings of Silverstein et al (1999), who had a data set
approximately 10,000 times larger, in which 36.3% of unique searches occurred more than once. We
speculated that, with the data coming from a select workgroup of computers whose users have
similar tasks to perform, the repetition of results would be much higher for the amount of data
available. While this may still be the case, 57,920 searches are insufficient to gain enough repetition.
The challenge posed by such a small amount of data is accentuated when using coselections.
Coselections account for only 23.9% of total searches. With searches rarely being repeated,
the chance that a repeated search is also a second coselection is rarer still.
Coselection Searches                Count
Total (non unique)                  13858
Unique                              12545
With at least two occurrences       739 (5.89%)
With at least three occurrences     191 (1.52%)
With at least four occurrences      98 (0.78%)
Table 4-1 Coselection data collected
Searches                            Count
Total (non unique)                  57920
Unique                              39525
With at least two occurrences       4582 (11.59%)
With at least three occurrences     1845 (4.67%)
With at least four occurrences      1115 (2.82%)
Table 4-2 Search Data collected
Coselections are also hampered by the necessity for the same two URLs to be selected together in two different
searches in order for their relationship to gain greater significance. Having just one of those URLs in a
second coselection search does not increase the weight of any one relationship by more than one. This
becomes an issue as, among the top 10 results alone, there are 45 possible coselection relationships. The
problem is somewhat offset by large coselections that cover many of the relationships, as well as by
the tendency of users to select from the top and to select the most relevant URLs.
However, it remains particularly significant since 60.0% of coselections feature only 2 URLs, and
the lack of data does nothing to lessen its influence.
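The way edges and weights accumulate can be illustrated with a short sketch. The function name and the toy searches are hypothetical; each search is taken to be the list of URLs selected together, and every unordered pair within a search adds one to the weight of that relationship.

```python
from collections import Counter
from itertools import combinations
from math import comb


def coselection_weights(searches):
    """Weight of each URL pair: the number of searches selecting both URLs."""
    weights = Counter()
    for urls in searches:
        for pair in combinations(sorted(set(urls)), 2):
            weights[pair] += 1
    return weights


# Ten results give comb(10, 2) possible relationships.
print(comb(10, 2))  # 45

# Hypothetical coselection searches over four URLs.
searches = [["a", "b"], ["a", "b", "c"], ["a", "d"]]
weights = coselection_weights(searches)
print(len(weights))           # 4 edges
print(sum(weights.values()))  # total weight 5
print(weights[("a", "b")])    # 2
```

Only the pair selected together in two separate searches reaches a weight above 1; the third search shares a single URL with the others and therefore creates a new edge of weight 1 rather than strengthening an existing one.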
Coselection relationships                     Count
Coselection searches                          13858
Aggregate of URLs in each coselection query   38908
Total coselection edges                       50712
Total coselection weights                     51528
Table 4-3 Coselection Relationships Data
Since the number of coselection relationships grows triangularly with the number of URLs in a query, the terms
with the highest aggregate coselection weights were often those with many coselection relationships of
weight 1 each, caused by a single query with many URLs. These results typically embodied a type of
content like “luxury holiday” (406 coselections, 28 URLs in 1 search) or “military aircraft map textures”
(351 coselections, 26 URLs in 1 search), where a user selects a category, as with a browse function, and
compares multiple items to find the most appropriate. In such instances a single coselection search
would far exceed the top 10 results typical of a first page.
Very few examples featured very high weights. One of the best results was “free sound effects” with 125
coselection weights, but it had 97 edges between 23 URLs. That leaves only 28 coselections
across the 97 edges to raise their weights above 1. Rarely were these proportions exceeded, but with
more data such occurrences are expected to be more regular and bigger in size, as the effect of one or
two users clicking many links is outweighed by the majority.
By far the biggest activity over an extended period of time occurred on “teeside” and “teeside
university”; however, very few of these searches included coselections. Approximately 3-4 links had consistent
activity over 3 years, with other links being used only occasionally. This is likely due to the navigational
sense of these terms, where the user only needs to find one of the 3-4 major portals for Teesside University.
4.2 Measurement of activity
Little could be deduced about the activity from such a small result set. With only one search term
featuring more than one cluster, nothing was conclusive on cluster disparateness or URL segregation.
The search term with 2 clusters indicated that disparateness may be a factor, as it measured an α
of only 4 with 2 URLs in the smallest cluster. Conversely, that may not indicate much, as there were no
months in which the less active cluster was active at a different time from the other.
URLs individually provided more significant findings, as they are always expected to be low on data, since
with more coselections they would otherwise already be in clusters. Remarkably, nearly all URLs were
active at the same time as another URL, indicating a coselection chance with the URLs featured, though
this rarely meant a coselection did occur. For terms where a cluster was found, only 54 out of 432 URLs
did not have a month active in common with a cluster and only 15 were active more than 2 times with
the same cluster. Most of the URLs therefore fall into the slightly disparate range, where a lower epsilon
value may be valid, though this disparateness may not be a strong indication, as most of these URLs were
never active at times the clusters weren’t.
Cluster Disparate on URLs (α<= 2)   Count
Search Terms with a cluster         56
Potential URLs                      363
URLs with > 2 months in common      15
URLs with no months in common       54
Table 4-4 Disparateness of URLs to existing cluster
4.3 Measurement of Loss of URLs
Gathering the average loss of URLs in the way described requires an enormous amount of data
for a select few search terms, which the data set could not provide.
4.4 Cluster Disparate
Since nothing of influence could be found on the disparateness of clusters, findings are yet to show its
effectiveness. The one search term with more than one cluster, “free sound effects”, did highlight
some potential in merging clusters, since the two clusters were distinctly of the same meaning and the
measurement of α was only 4 for 2 URLs in the smallest cluster. By lowering the epsilon from 3 to 2, a join
was found between the 2 clusters, which would indicate success of the function; however, many more
results are needed to find its true significance.
With 56 search terms featuring at least one cluster, we proceeded to trial the cluster disparate function on
merging individual URLs with an existing cluster, to gauge its accuracy where there are small numbers of
coselections. In practice, the disparate function proved to be of little significance, as out of 432
URLs only 15 were discarded for having more than 2 months in common with the cluster. In spite of this,
only 1 false positive URL was added to a cluster compared with 47 true positives.
Cluster Disparate on URLs (α<= 2 ⇒ epsilon=2)   Count
True positive URLs added to a cluster           47
False positive URLs added to a cluster          1
Unknown pages added to a cluster                1
Search Terms with a cluster                     56
Table 4-5 Accuracy of Cluster Disparate Function on URLs
Such results would seem to suggest that clustering by coselections is so strong that a lower
epsilon of 2 rather than 3 may be sufficient for joining URLs irrespective of disparateness. However,
these results may not be indicative of broader findings for 3 key reasons:
• Most of the queries featured were unambiguous; therefore all URLs are likely to point to the
same meaning regardless.
• Most queries that are ambiguous tend to show the dominant meaning in most of the top
results; therefore the other meanings rarely get clicked on (Krovetz & Croft 1992).
• Google is renowned for its URLs consistently being strongly relevant to the search term, so
completely irrelevant URLs are rarely a factor, let alone coselected multiple times.
For these same reasons, none of the 15 URLs discarded for having too many months in common with a
cluster were of a different search sense. Moreover, it is apparent that most searches aren’t coselections, so
many of the URLs with the most activity are yet to have a solid chance of joining other URLs. It may then
be appropriate to scale the months-in-common value in proportion to the overall number of coselections
in the search term, allowing a bigger margin where there are fewer relationships.
5 Conclusion
The biggest challenge posed for future study is to gather many more coselections. Even with data being
collected from the same workgroup of computers, the amount of repetition of results was not great
enough to form significant clusters regularly. Since coselections make up less than a quarter of all searches,
the chance that a search term will feature coselections on multiple occasions is even smaller.
Nevertheless, the methodologies of measuring disparateness and segregation based on URL activity
appear to be sound as a means to determine coselection chance between two clusters. These are key
indicators of the extent to which volatility of results may be causing different clusters to form for
the same sense.
With coselections being difficult to accumulate, clustering needs to improve in effectiveness even as the
data set grows. A cluster disparate function was suggested for drawing together clusters that were
fragmented by evolution in time, though no conclusions could be drawn on its effectiveness. While
drawing URLs onto a cluster through this methodology appeared successful, this was offset by major
limitations in the available data set.
The ideal that coselections can determine ambiguity from cluster cardinality remains elusive, as it
appears difficult to attain full accuracy using coselections as the lone similarity metric due to sparsity of
activity.
6 References
Agichtein, E, Brill, E, and Dumais, S, 2006, Improving web search ranking by incorporating user behavior
information, Proceedings of the 29th annual international ACM SIGIR conference on Research and
development in information retrieval, ACM.
Amsler, R, 1980, The Structure of Merriam-Webster Pocket Dictionary, Ph. D. thesis, University of Texas
at Austin, Austin.
Ankerst , M, Breunig, M, Kriegel, H & Sander, J, 1999, OPTICS: Ordering points to identify the clustering
structure, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, pp
49-60.
Ashkan, A, Clarke, C, Agichtein, E, & Guo, Q, 2008, Characterizing query intent from sponsored search
clickthrough data, In SIGIR Workshop.
Ashman, H, Antunovic, M, Chaprasit, S, Smith, G & Truran, M, 2011, Implicit association via crowdsourced coselection, Proc. Hypertext 2011, June 2011, 7-16, ACM.
Ashman, H, Zhou, D, Goulding, J, Brailsford, T, & Truran, M, 2007, The Global Perpetual Dictionary of
Everything, Proc. Ausweb, http://ausweb.scu.edu.au/aw07/papers/refereed/ashman/paper.html.
Beeferman, D & Berger, A, 2000, Agglomerative clustering of a search engine query log, Proceedings of
the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, pp 407-416.
Birant, D & Kut, A, 2007, ST-DBSCAN: An algorithm for clustering spatial–temporal data, Data &
Knowledge Engineering, Vol. 60, Jan 2007, pp 208–221.
Caon, G, Antunovic, M, Truran, M & Ashman, H, 2012, Finding synonyms and other semantically-similar
terms from coselection data, UniSA, SA.
Carterette, B, & Jones, R, 2007, Evaluating search engines by modeling the relationship between
relevance and clicks, Computer Science Department Faculty Publication Series.
Chan, W, Leung, W & Lee, D, 2004, Clustering search engine query log containing noisy clickthroughs,
2004 International Symposium on Applications and the Internet.
Chen, J, & Chang, J, 1998, Topical clustering of MRD senses based on information retrieval techniques,
Computational Linguistics.
Clarke, C, Agichtein, E, Dumais, S, & White, R, 2007, The influence of caption features on clickthrough
patterns in web search, In Proceedings of the 30th annual international ACM SIGIR conference on
Research and development in information retrieval, pp. 135-142, ACM.
Dorow, B & Widdows D, 2003, Discovering corpus-specific word senses, Proceedings of the tenth
conference on European chapter of the Association for Computational Linguistics, Vol. 2, Stroudsburg,
PA, pp 79-82.
Dou, Z, Ruihua, S, Xiaojie, Y, & Ji-Rong W, Are click-through data adequate for learning web search
rankings?, In Proceeding of the 17th ACM conference on Information and knowledge management, pp.
73-82, ACM.
Dupret, G, & Ciya, L, 2010, A model to estimate intrinsic document relevance from the clickthrough logs
of a web search engine, Proceedings of the third ACM international conference on Web search and data
mining, ACM.
Gale, W, Church, K & Yarowsky, D, 1992, A Method for Disambiguating Word Senses in a Large Corpus,
Computers and the Humanities, 26, pp 415-439.
Gao, J, Wei, Y, Xiao, L, Kefeng, D, and Jian-Yun, N, 2009, Smoothing clickthrough data for web search
ranking, In Proceedings of the 32nd international ACM SIGIR conference on Research and development
in information retrieval, pp. 355-362, ACM.
Gao, J, Xiaodong H, & Jian-Yun, N, 2010, Clickthrough-based translation models for web search: from
word models to phrase models, Proceedings of the 19th ACM international conference on Information
and knowledge management, ACM.
Granka, L, Joachims, T, & Gay, G, 2004, Eye-tracking analysis of user behavior in WWW search,
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in
information retrieval, ACM.
Guha, S, Rastogi, R & Shim, K, 1998, CURE: an efficient clustering algorithm for large databases,
Proceedings of the 1998 ACM SIGMOD international conference on Management of data, pp 73-84.
Joachims, T, 2002, Optimizing search engines using clickthrough data, Proceedings of the eighth ACM
SIGKDD international conference on Knowledge discovery and data mining, ACM.
Joachims, T, Granka, L, Pan, B, Hembrooke, H, Radlinski, F, and Gay, G, 2007, Evaluating the accuracy of
implicit feedback from clicks and query reformulations in Web search, ACM Trans. Inf. Syst., vol. 25, pp 7.
Karypis, G, Eui-Hong, H & Kumar, V, 1999, Chameleon: hierarchical clustering using dynamic modeling,
Computer, Vol. 32, pp 68-75.
Leung, K, Wilfred N, & Dik L, 2008, Personalized concept-based clustering of search engine queries,
Knowledge and Data Engineering, IEEE Transactions.
Lieberman, H, 1995, Letizia: An agent that assists web browsing, International Joint Conference on
Artificial Intelligence, Vol. 14, Lawrence Erlbaum Associates Ltd.
Pantel, P & Lin, D, 2002, Discovering word senses from text, Proceedings of the eighth ACM SIGKDD
international conference on Knowledge discovery and data mining, pp 613-619.
Pass, G, Abdur, C, & Cayley, T, 2006, A picture of search, Proceedings of the 1st international conference
on Scalable information systems.
Riloff, E, 1993, Automatically constructing a dictionary for information extraction tasks, Proceedings of
the National Conference on Artificial Intelligence, John Wiley & Sons Ltd.
Scholer, F, Shokouhi, M, Billerbeck, B & Turpin, A, Using Clicks as Implicit Judgments: Expectations
Versus Observations, Advances in Information Retrieval, 2008, pp 28‐39.
Smith, G, & Ashman, H, 2009, "Evaluating implicit judgements from image search interactions."
Smith, G, Brailsford, T, Donner, C, Hooijmaijers, D, Truran, M, Goulding, J, & Ashman, H, 2005,
Generating unambiguous URL clusters from web search, Proceedings of the 2009 workshop on Web
Search Click Data, pp. 28-34, ACM.
Sun, J, Hua-Jun, Z, Liu, H, Lu, Y, Chen, Z, 2005, CubeSVD: a novel approach to personalized Web search,
Proceedings of the 14th international conference on World Wide Web, pp 382-390, ACM.
Tamir, R, & Rapp, R, 2003, Mining the Web to discover the meanings of an ambiguous word, Data
Mining, 3rd IEEE International Conference on, 19-22 Nov. 2003, pp 645- 648.
Voorhees, E, 1993, Using WordNet to disambiguate word senses for text retrieval, Proceedings of the
16th annual international ACM SIGIR conference on Research and development in information retrieval,
ACM.
Weaver, W, 1955, Translation, Machine Translation of Languages, John Wiley & Sons, pp 15-23.
Xu, G, Yang, Y, & Li, H, 2009, Named entity mining from click-through data using weakly supervised
latent dirichlet allocation, Proceedings of the 15th ACM SIGKDD international conference on Knowledge
discovery and data mining, ACM.
Yarowsky, D, 1995, Unsupervised word sense disambiguation rivaling supervised methods, ACL ’95
Proceedings of the 33rd annual meeting on Association for Computational Linguistics, pp 189-196.
Zhang, T, Ramakrishnan, R, Livny, M, 1996, BIRCH: an efficient data clustering method for very large
databases, Proceedings of the 1996 ACM SIGMOD international conference on Management of data, pp
103-114.
7 Appendix
7.1 Appendix A: Coselection Count of terms with at least one cluster
Note: Words are in most logical order but order is unimportant in clustering
Most clusters (not above)         Coselection    Coselection  URLs  Clusters
                                  total weights  edges
data protection act               385            340          14    1
c++ connect 4                     294            237          5     1
free sound effects                125            97           6     2
teeside                           25             16           7     1
python time difference            37             26           3     1
pydoc                             22             17           3     1
teeside university                38             26           7     1
bridge transporter                10             8            3     1
wet n wild                        3              1            3     1
python dictionary                 96             87           3     1
spaceship .wav                    6              5            4     1
c++                               11             6            5     1
xsi tutorials                     25             17           3     1
teeside internet                  13             10           4     1
sdl_close                         10             8            3     1
python string contains            11             8            3     1
sci entertainment quote           12             7            3     1
fur affinity                      9              4            4     1
sound effects                     119            114          3     1
gp2x                              11             7            3     1
game piracy                       17             15           3     1
games age rating                  17             13           3     1
“cavazza marc” or “marc cavazza”  7              4            4     1
games piracy                      78             69           3     1
set xsl variable                  12             10           3     1
reference to undefined            9              6            3     1
smashing magazine                 15             9            3     1
pakistan news                     68             12           17    1
boom toon tutorials               26             16           3     1
pro gaming teams                  22             18           3     1
textures                          23             16           4     1
don’t stop me now midis           52             47           4     1
pegi                              12             7            5     1
c++ ternary operator              24             14           6     1
pound euro                        3              1            3     1
avfc                              6              1            6     1
c++ string                        15             8            6     1
imbd                              4              2            3     1
wwe spoilers                      6              4            3     1
tees.ac.uk                        47             44           4     1
teeside uni                       19             16           3     1
messenger web                     14             10           4     1
photo portfolio                   43             33           3     1
hotmail                           45             41           3     1
1998 data protection act          24             13           5     1
linux commands                    9              5            3     1
zero punctuation                  8              6            3     1
blackboard tees                   5              3            3     1
sdl sound                         9              6            3     1
teeside uni                       13             8            4     1
free textures                     24             20           3     1
sdl_mustlock                      24             20           3     1
c++ random int                    24             14           3     1
arm.linux.rules                   5              3            3     1
sdl                               9              6            3     1
SUM                               2000           1561         227   56
7.2 Appendix B: Accuracy of cluster disparate on URLs with an existing cluster
Search Term                       Amount of   Amount of   Amount of   Potential   URLs with   URLs with
                                  positive    false       borderline/ URLs        too many    no months
                                  URLs        positive    unknown                 months in   in common
                                              URLs        URLs                    common
free sound effect                 2           0           0           16          0           0
teeside                           1           0           0           9           2           0
python time difference            4           0           0           6           0           0
pydoc                             2           0           0           5           0           1
teeside university                1           0           0           17          2           0
bridge transporter                0           0           0           6           0           0
wet n wild                        0           0           0           0           0           0
python dictionary                 3           0           0           24          0           0
spaceship .wav                    0           0           0           2           0           0
c++                               1           0           0           4           1           0
xsi tutorials                     1           0           0           9           0           0
teeside intranet                  0           0           0           9           0           0
sdl_close                         2           0           0           3           0           1
python string contains            1           0           0           4           0           0
sci entertainment quote           2           0           0           5           0           1
fur affinity                      0           0           0           1           1           0
sound effects                     1           0           0           25          0           0
gp2x                              1           0           0           4           1           0
game piracy                       0           0           0           5           0           0
games age ratings                 2           0           0           6           0           0
sdl_init                          0           0           0           3           0           3
“cavazza marc” or “marc cavazza”  0           0           0           3           0           1
games piracy                      1           0           0           16          0           0
xsl set variable                  0           0           0           4           0           0
reference to undefined            1           0           0           0           0           0
smashing magazine                 0           0           0           2           0           0
pakistan news                     0           0           0           1           0           0
boom toon tutorials               1           0           0           6           0           2
data protection act               2           0           0           17          0           20
pro gaming teams                  1           0           0           8           0           0
textures                          1           0           0           7           0           0
don’t stop me now midis           0           0           0           10          0           0
pegi                              0           0           0           5           0           0
c++ ternary operator              1           0           0           4           0           0
c++ connect 4                     6           0           0           23          1           0
euro pound                        0           0           0           0           0           0
avfc                              0           0           0           0           0           1
c++ string                        1           0           0           6           0           0
imbd                              0           0           0           2           0           0
wwe spoilers                      0           0           0           2           0           0
tees.ac.uk                        0           0           0           14          0           0
teeside uni                       1           0           0           9           0           0
web messenger                     0           0           0           9           2           0
photo portfolio                   2           0           0           8           0           0
hotmail                           1           0           0           4           2           15
data protection act 1998          0           0           0           2           0           0
linux commands                    0           0           0           1           1           2
zero punctuation                  0           0           0           4           0           4
tees blackboard                   0           0           0           1           1           1
sound sdl                         1           0           0           3           0           0
teeside uni                       0           0           0           7           1           0
free textures                     2           0           0           8           0           0
sdl_mustlock                      1           0           1           7           0           0
c++ random int                    0           0           0           2           0           0
arm.linux.rules                   0           0           0           1           0           2
sdl                               0           1           0           4           0           0
SUM                               47          1           1           363         15          54
7.3 Appendix C: URL distribution
For links not in a cluster of (3 epsilon, 2 minimum nodes):
(1) No. of links
(2) No. never active with other links
(3) No. active with other links in only 1 month iteration
(4) No. active with other links in only 2 month iterations
(5) No. active with other links in more than 2 month iterations
(6) No. active in only 1 month iteration
(7) No. active in only 2 month iterations
(8) No. active in more than 2 month iterations
Most coselections                 (1)   (2)   (3)   (4)   (5)   (6)   (7)   (8)
data protection act               42    0     34    2     1     34    2     1
c++ connect 4                     33    0     23    0     1     23    0     1
free sound effects                23    0     15    1     0     15    1     0
teeside                           14    0     7     2     2     7     2     2
python time difference            8     0     2     4     0     2     4     0
pydoc                             8     0     4     2     0     4     2     0
teeside university                22    0     17    0     2     17    0     2
bridge transporter                8     0     6     0     0     6     0     0
wet n wild                        2     0     0     0     0     0     0     0
python dictionary                 27    0     21    3     0     19    4     0
spaceship .wav                    4     0     2     0     0     2     0     0
c++                               7     0     4     1     0     4     0     1
xsi tutorials                     13    0     7     3     0     7     3     0
teeside internet                  11    0     8     1     0     8     1     0
sdl_close                         6     0     4     0     0     4     0     0
python string contains            6     0     3     1     0     3     1     0
sci entertainment quote           8     1     5     0     0     5     1     0
fur affinity                      4     0     1     0     1     1     0     1
sound effects                     28    0     20    5     1     20    5     1
gp2x                              7     1     2     1     1     3     1     1
game piracy                       7     0     5     0     0     5     0     0
games age rating                  8     0     4     2     0     4     2     0
“cavazza marc” or “marc cavazza”  4     1     3     0     0     3     1     0
games piracy                      18    0     16    0     0     16    0     0
set xsl variable                  6     0     3     1     0     3     1     0
reference to undefined            4     0     1     1     0     1     1     0
smashing magazine                 5     0     2     0     0     2     0     0
pakistan news                     6     0     1     0     0     1     0     0
boom toon tutorials               11    1     7     0     0     8     0     0
pro gaming teams                  10    0     8     0     0     8     0     0
textures                          10    0     5     2     0     5     2     0
don’t stop me now midis           13    0     10    0     0     10    0     0
pegi                              7     0     4     1     0     4     1     0
c++ ternary operator              7     0     4     0     0     4     0     0
pound euro                        2     0     0     0     0     0     0     0
avfc                              4     1     0     0     0     0     0     1
c++ string                        8     0     5     1     0     5     1     0
imbd                              4     0     2     0     0     2     0     0
wwe spoilers                      4     0     2     0     0     2     0     0
tees.ac.uk                        16    0     13    1     0     13    0     1
teeside uni                       12    0     9     0     1     9     0     1
messenger web                     13    1     8     0     2     9     0     2
photo portfolio                   10    0     8     0     0     8     0     0
hotmail                           23    1     14    1     5     15    1     5
1998 data protection act          6     0     1     1     0     1     1     0
linux commands                    6     1     1     0     2     2     0     1
zero punctuation                  10    0     7     0     1     7     0     1
blackboard tees                   5     0     1     1     1     1     1     1
sdl sound                         5     0     2     1     0     2     1     0
teeside uni                       11    0     6     1     1     6     1     1
free textures                     10    0     6     0     2     6     0     2
sdl_mustlock                      9     0     6     1     0     6     1     0
c++ random int                    6     0     2     0     0     2     0     0
arm.linux.rules                   5     0     3     0     0     3     0     0
sdl                               6     0     3     1     0     3     1     0
SUM                               572   8     357   42    24    360   43    26