International Journal on Advanced Computer Theory and Engineering (IJACTE)
________________________________________________________________________

INSTANCE LEVEL CONSTRAINT BASED USER CLUSTERING

1Aniket Rangrej, 2Chandrashekar Muniyappa
Yahoo Research and Development
Embassy Golf Links, Domlur, Bangalore, India
Email: [email protected], [email protected]
ABSTRACT - Clustering with constraints is a fairly recent trend of development which allows a domain expert's interference in the clustering process. We use a modified objective function of the k-means algorithm to incorporate constraints and cluster Twitter users. In this paper, we generate instance level constraints using a normalized cosine measure, perform basic filtering, and apply domain expertise. These techniques can be applied to any similar data set, such as Facebook posts, answers, and comments, for modeling and clustering users. Through experiments we show that the purity of clusters can be improved by applying instance level constraints.

Categories and Subject Descriptors - H.4 [Information Retrieval, Text Mining, Machine Learning]: Clustering Algorithms

General Terms - Data Mining, Clustering, Machine Learning, Must Link Constraint, Cannot Link Constraint

Keywords - Constraints based clustering, Text Mining

I. INTRODUCTION

Document clustering has been studied for quite a while and has wide applications such as search result grouping and categorization. It also forms a base for applications like topic extraction and content filtering. K-means, hierarchical clustering, and some of their variations are among the well known techniques. With the increasing popularity of micro-blogging and social networking, the documents collected from the web are becoming more and more condensed. Such data imposes new challenges in applying pristine document clustering techniques. Clustering such micro-blogs and discussions could lead to new trends in web search and other applications.

In this paper, we cluster Twitter users based on their tweets, which can be used in user recommendation or tweet recommendation. We consider the textual contents of a tweet such as the tweet text, URLs, and hashtags.

II. BACKGROUND AND RELATED WORK

Clustering is an unsupervised learning method in which labels are not known, unlike in the case of classification. Document clustering is an interesting research area which is mainly used to improve the accuracy and performance of search engines by pre-clustering the documents. For any clustering mechanism, features play a very crucial role, since features represent a data point. Thus, a document can be represented by a set of features, which can be modeled as a vector space model. Similarity measure plays a very crucial role in document clustering: similarity between data points can be calculated by some measure such as cosine distance or Jaccard similarity. In the case of short text documents, Jaccard similarity performs better than the cosine measure [1]. But since we concatenate all tweets of a user to make a data point, it is no longer a short text. Thus, we have used cosine similarity for experimentation purposes.

Twitter has been a hot topic for many researchers due to certain characteristics such as short text, the retweet mechanism, etc. Given the popularity and importance of Twitter, plenty of researchers have studied its characteristics [7][8]. Boyd et al. [7] analyzed the retweet mechanism in detail and Nagarajan et al. [9] gave a qualitative examination of retweet practices. Wei Peng analyzed and experimented with constraints based
ISSN (Print): 2319-2526, Volume 3, Issue 1, 2014
clustering in music domain [2]. Ian Davidson et al. [4] cover approaches that make use of constraints in partitional and hierarchical algorithms, either to enforce the constraints or to learn a distance function from them.

III. OUR APPROACH

Document Representation

Features are the characteristics of a data point. In the case of document clustering, the features are the terms occurring in the documents. One of the standard representations of documents is the vector space model, in which documents are represented as a sparse m × n matrix, where m denotes the number of documents and n denotes the number of terms. Each document vector denotes the terms present in the corresponding document: if a particular term is present in the document, the corresponding entry holds the number of occurrences of that term in the document. In this representation, emphasis is given only to the term frequency (TF). There could be high-frequency terms without any significant meaning. For this reason, inverse document frequency is also taken into consideration. Inverse document frequency (IDF) accounts for the number of documents in which a particular term occurs: if the term is present in many documents, the term gets penalized. By considering both of these, the combined measure Term Frequency - Inverse Document Frequency (TF-IDF) is more powerful than either measure individually.
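As a minimal sketch of the representation described above (not the authors' code; the raw-count TF and log(N/df) IDF variant is one common choice), TF-IDF weighting and cosine similarity over sparse document vectors can be written as:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute TF-IDF weights for a list of tokenized documents.
    Sketch: tf = raw term count, idf = log(N / df)."""
    n_docs = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

In practice a library vectorizer (e.g. scikit-learn's TfidfVectorizer) would replace this, but the sketch makes the TF and IDF factors explicit.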



K-Means

K-means is one of the standard clustering algorithms and is a special case of the Expectation-Maximization framework. As part of initialization, k random centroids are selected (not necessarily data points). Each iteration of K-means consists of an E-step and an M-step, and the algorithm terminates on convergence, i.e. when the centroid assignment in the current iteration is the same as in the previous iteration. In the Expectation step, the distances from all data points to the current centroids are calculated. In the Maximization step, new centroids are calculated.

The objective function of the K-means algorithm is as follows:

    J = \sum_{j=1}^{k} \sum_{p_i^{(j)} \in c_j} \left\| p_i^{(j)} - c_j \right\|^2    (1)

where n is the number of documents and p_i^{(j)} is the ith document in the jth cluster. The objective function of K-means tries to minimize the intra-cluster distance among the data points.

Constraints based user clustering

In this paper, we define constraints based clustering of Twitter users. In constraints based clustering, a small amount of supervision is available in the form of Must Link and Cannot Link constraints. Pairs under Must Link constraints should fall in the same cluster; if they do not, a penalty (the cost of violating the constraint) is added. Pairs under Cannot Link constraints should not fall in the same cluster; if they do, a penalty is likewise added. Since simple k-means cannot handle such constraints, we use the pairwise constrained clustering framework, in which the objective function of k-means is modified to add a penalty for each violated constraint.

Let M be the set of Must Link pairs such that (x_i, x_j) \in M implies x_i and x_j should be assigned to the same cluster. Similarly, let C be the set of Cannot Link pairs such that (x_i, x_j) \in C implies x_i and x_j should not be assigned to the same cluster. If these constraints are violated, the cost of violation is added. Let Penalty_M be the penalty such that Penalty_M(c_i, c_j) = 1 if c_i \neq c_j, and similarly let Penalty_C be the penalty such that Penalty_C(c_i, c_j) = 1 if c_i = c_j. Let w_{ij} and \bar{w}_{ij} be the weights corresponding to the Must Link and Cannot Link constraints. Using these notations, we define the objective function of constraints based k-means as follows:

    J = \sum_{j=1}^{k} \sum_{p_i^{(j)} \in c_j} \left\| p_i^{(j)} - c_j \right\|^2 + ML + CL    (2)

    ML = \sum_{(s_i, s_j) \in M} w_{ij} \, Penalty_M(c_i, c_j)    (3)

    CL = \sum_{(s_i, s_j) \in C} \bar{w}_{ij} \, Penalty_C(c_i, c_j)    (4)

The algorithm of constraints based K-means is as follows:

Algorithm 1 Constraints based K-means Algorithm
Require: Data points P = p_1, ..., p_n, distance measure d, number of clusters k, positive constraints M, negative constraints C.

Initialize m neighborhoods from M and C. If m < k, initialize with the centroids of the neighborhood sets and the remaining clusters at random. Else, initialize

using weighted farthest-first traversal starting from the largest neighborhood.

Iterate:
  Step 1: Assign each data point so that J is minimized.
  Step 2: Update the centroid of each cluster.
End
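The iteration above can be sketched in code. This is a hedged illustration, not the authors' implementation: it uses a uniform penalty weight w for both constraint types, plain random initialization instead of the neighborhood-based scheme, and a greedy pass that penalizes a candidate cluster against the assignments already made in the current iteration.

```python
import math
import random

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def constrained_kmeans(points, k, must_link, cannot_link, w=1.0,
                       max_iter=100, seed=0):
    """Sketch of constraints based k-means: assign each point to the
    cluster minimizing squared distance plus penalties for violated
    Must Link / Cannot Link constraints (uniform weight w)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        new_labels = []
        for i, p in enumerate(points):                # E-step with penalties
            best, best_cost = 0, float("inf")
            for c in range(k):
                cost = dist(p, centroids[c]) ** 2
                for a, b in must_link:                # ML partner elsewhere
                    j = b if a == i else a if b == i else None
                    if j is not None and j < len(new_labels) and new_labels[j] != c:
                        cost += w
                for a, b in cannot_link:              # CL partner in same cluster
                    j = b if a == i else a if b == i else None
                    if j is not None and j < len(new_labels) and new_labels[j] == c:
                        cost += w
                if cost < best_cost:
                    best, best_cost = c, cost
            new_labels.append(best)
        if new_labels == labels:                      # converged
            break
        labels = new_labels
        for c in range(k):                            # M-step: recompute centroids
            members = [points[i] for i, l in enumerate(labels) if l == c]
            if members:
                centroids[c] = [sum(x) / len(members) for x in zip(*members)]
    return labels, centroids
```

With well-separated points and a handful of ML/CL pairs, the penalty terms steer the assignment exactly as in equations (2)-(4).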

Constraints Generation

Constraints generation is a crucial part of this work, since wrong constraints can hamper clustering accuracy. We generated positive and negative constraints and found a significant improvement in clustering accuracy.

Constraints based on distance measure

First we generate the distance matrix with normalized cosine similarity. To generate negative constraints, we identify pairs of data points with distance greater than a threshold (a high value): if two users have distance greater than this threshold, it is likely that they belong to different clusters. Similarly, to generate positive constraints, we identify pairs with distance less than a threshold (a small value): if two users have distance less than this small threshold, it is likely that they belong to the same cluster.
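This thresholding can be sketched as follows (a minimal illustration; the threshold names `low` and `high` are ours, and tuning them is left to the experimenter):

```python
def generate_constraints(dist_matrix, low, high):
    """Sketch of constraint generation from a normalized cosine
    distance matrix: pairs closer than `low` become Must Link,
    pairs farther than `high` become Cannot Link."""
    must_link, cannot_link = [], []
    n = len(dist_matrix)
    for i in range(n):
        for j in range(i + 1, n):
            d = dist_matrix[i][j]
            if d < low:
                must_link.append((i, j))
            elif d > high:
                cannot_link.append((i, j))
    return must_link, cannot_link
```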

Figure 1: Workflow of our approach.
Filtering and domain expert's knowledge

We filter out degenerate data points: users with very few tweets, which results in a small document vector, or users whose same tweet is repeated many times, which makes the data point invalid due to a large amount of duplicate data. This approach is particularly efficient when dealing with very large data sets. We also apply some domain expert knowledge in generating the constraints.

IV. EXPERIMENTS AND RESULTS

Dataset

Twitter is one of the largest sources of short text documents. Each tweet contains meta-data and the respective text, which is treated as a document. We collected 40000 tweets from 400 users who have subscribed to 4 unique Twitter lists. In our setting, a user is a data point, modeled as the concatenation of all tweets of that user.

Preprocessing

We removed stop words using a stop word list. As part of preprocessing, we removed URL occurrences from the documents and converted hashtags into normal keywords. We applied the Porter stemming algorithm, which converts each word to its base form.

Evaluation Measure

We created gold standard clusters based on users' subscriptions to Twitter lists. Twitter list L_i is represented as cluster i, and the users belonging to that list are the data points in that cluster. Each algorithmically tagged cluster is assigned to the gold standard class that is most frequent in the cluster. Let N be the number of documents in the corpus. The number of documents in cluster C_j is denoted by |C_j|, and |C_j|_{class=i} denotes the documents common to class i and cluster C_j.
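The preprocessing steps described above can be sketched as follows (a hedged illustration: the stop word list here is a tiny stand-in for the fuller external list the paper uses, and the Porter stemming step, e.g. via NLTK's PorterStemmer, is omitted but would be applied to each returned token):

```python
import re

# Tiny illustrative stop word list; the paper uses a fuller external one.
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "in"}

def preprocess(tweet):
    """Sketch of tweet preprocessing: remove URLs, turn hashtags into
    normal keywords, lowercase, tokenize, and drop stop words."""
    text = re.sub(r"https?://\S+", " ", tweet)   # strip URL occurrences
    text = text.replace("#", " ")                # hashtag -> normal keyword
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]
```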
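This gold-standard purity evaluation can be sketched in code (a minimal illustration; function and variable names are ours, with cluster labels and gold classes given as parallel integer lists):

```python
from collections import Counter

def purity(labels, classes):
    """Sketch of cluster purity: each cluster is credited with the count
    of its most frequent gold-standard class, summed and divided by N."""
    n = len(labels)
    clusters = {}
    for label, cls in zip(labels, classes):
        clusters.setdefault(label, []).append(cls)
    total = 0.0
    for members in clusters.values():
        total += Counter(members).most_common(1)[0][1] / n
    return total
```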
International Journal on Advanced Computer Theory and Engineering (IJACTE)
________________________________________________________________________
So, purity can be represented by

    Purity(C_j) = \frac{1}{|C_j|} \max_i |C_j|_{class=i}    (5)

and the overall purity is the size-weighted average over all clusters:

    Purity = \sum_{j=1}^{k} \frac{|C_j|}{N} \, Purity(C_j)    (6)

Results

Figure 2: Comparison of Twitter user clustering: simple k-means vs. constraints based modified k-means.

Using the constraints based approach, the improvement in clustering accuracy is notable, as shown in the graph:

- Accuracy of k-means without any constraints (baseline) is 64.143%.
- Accuracy of k-means with only CL constraints is 65.024%.
- Accuracy of k-means with only ML constraints is 76.220%.
- Accuracy of k-means with both CL and ML constraints is 78.123%.

V. CONCLUSIONS

Constraints based clustering has higher accuracy than the traditional k-means clustering algorithm. At the same time, identifying the optimal set of constraints, with the right combination of CL and ML constraints, plays a pivotal role. In this paper, we have shown that by combining the techniques described above (distance-based constraint generation, and filtering with domain expertise) we can generate a good set of constraints. In our experimentation, we found ML constraints more useful than CL constraints.

VI. REFERENCES

[1] Rangrej, Aniket, Sayali Kulkarni, and Ashish V. Tendulkar. "Comparative study of clustering techniques for short text documents." Proceedings of the 20th International Conference Companion on World Wide Web. ACM, 2011.

[2] Peng, Wei, Tao Li, and Mitsunori Ogihara. "Music Clustering with Constraints." ISMIR, 2007.

[3] N. Ab Samat, M. A. A. Murad, M. T. Abdullah, and R. Atan. "Malay Documents Clustering Algorithm Based on Singular Value Decomposition." Journal of Theoretical and Applied Information Technology, July 2005.

[4] Davidson, Ian, and Sugato Basu. "A survey of clustering with instance level constraints." ACM Transactions on Knowledge Discovery from Data (2007): 1-41.

[5] Basu, Sugato, Mikhail Bilenko, and Raymond J. Mooney. "Comparing and unifying search-based and similarity-based approaches to semi-supervised clustering." Proceedings of the ICML-2003 Workshop on the Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining, 2003.

[6] Anand, Rajul, and Chandan K. Reddy. "Graph-based clustering with constraints." Advances in Knowledge Discovery and Data Mining. Springer Berlin Heidelberg, 2011. 51-62.

[7] D. Boyd, S. Golder, and G. Lotan. "Tweet, tweet, retweet: Conversational aspects of retweeting on Twitter." 43rd Hawaii International Conference on System Sciences, page 412, 2008.

[8] A. Java, T. Finin, X. Song, and B. Tseng. "Why We Twitter: Understanding Microblogging Usage and Communities." WebKDD '07, 2007.

[9] M. Nagarajan, H. Purohit, and A. Sheth. "A Qualitative Examination of Topical Tweet and Retweet Practices." Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media (ICWSM 2010), pages 295-298, 2010.
