International Journal on Advanced Computer Theory and Engineering (IJACTE)
ISSN (Print): 2319-2526, Volume -3, Issue -1, 2014

INSTANCE LEVEL CONSTRAINT BASED USER CLUSTERING

Aniket Rangrej, Chandrashekar Muniyappa
Yahoo Research and Development, Embassy Golf Links, Domlur, Bangalore, India
Email: [email protected], [email protected]

ABSTRACT - Clustering with constraints is a fairly recent line of development that allows a domain expert to intervene in the clustering process. In this paper, we cluster Twitter users based on their tweets, which can be used in user recommendation or tweet recommendation. We use a modified objective function of the k-means algorithm to incorporate constraints when clustering Twitter users. We generate instance-level constraints using a normalized cosine measure, perform basic filtering, and apply domain expertise. These techniques can be applied to similar data sets, such as Facebook posts, answers, and comments, for modeling and clustering users. Our experiments show that the purity of the clusters can be improved by applying instance-level constraints. We consider the textual content of a tweet, such as the tweet text, URLs, and hashtags.

Categories and Subject Descriptors - H.4 [Information Retrieval, Text Mining, Machine Learning]: Clustering Algorithms
General Terms - Data Mining, Clustering, Machine Learning, Must Link Constraint, Cannot Link Constraint
Keywords - Constraints based clustering, Text Mining

I. INTRODUCTION

Document clustering has been studied for quite a while and has wide applications such as search result grouping and categorization. It also forms a base for applications like topic extraction and content filtering. K-means, hierarchical clustering, and their variations are some of the well-known techniques. With the increasing popularity of micro-blogging and social networking, the documents collected from the web are becoming more and more condensed. Such data imposes new challenges on applying pristine document clustering techniques, and clustering such micro-blogs and discussions could lead to new trends in web search and related applications. Twitter has been a hot topic for many researchers due to characteristics such as short text and its retweet mechanism.

II. BACKGROUND AND RELATED WORK

Clustering is an unsupervised learning method in which labels are not known, unlike in classification. Document clustering is an interesting research area, mainly used to improve the accuracy and performance of search engines by pre-clustering the documents. For any clustering mechanism, features play a crucial role, since features represent a data point: a document can be represented by a set of features modeled in a vector space model. Similarity between data points can then be calculated by a similarity measure such as cosine distance or Jaccard similarity, and this choice plays a crucial role in document clustering. For short text documents, Jaccard similarity performs better than the cosine measure [1]. But since we concatenate all tweets of a user to form one data point, the document is no longer a short text; we therefore use cosine similarity in our experiments.

Given the popularity and importance of Twitter, plenty of researchers have studied its characteristics [7][8]. Boyd et al. [7] analyzed the retweet mechanism in detail, and Nagarajan et al. [9] gave a qualitative examination of retweet practices. Wei Peng et al. analyzed and experimented with constraints-based clustering in the music domain [2]. Ian Davidson et al.
[4] cover approaches that make use of constraints in partitional and hierarchical algorithms, either to enforce the constraints or to learn a distance function from the constraints.

III. OUR APPROACH

Document Representation

Features are the characteristics of a data point. In document clustering, the features are the terms occurring in the documents. A standard representation of documents is the vector space model, in which the documents are represented as a sparse m × n matrix, where m denotes the number of documents and n the number of terms. Each document vector records the terms present in the corresponding document: if a term is present, the corresponding entry holds the number of occurrences of that term in the document. In this representation, emphasis is given only to the term frequency (TF), yet there may be frequent terms without any significant meaning. For this reason, inverse document frequency (IDF) is also taken into consideration: it weighs a term by the number of documents in which it occurs, penalizing terms that appear in many documents. By combining both, the Term Frequency - Inverse Document Frequency (TF-IDF) measure is more powerful than either one alone.

K-Means

K-means is one of the standard clustering algorithms and a special case of the Expectation-Maximization framework. As part of initialization, k random centroids are selected (not necessarily data points). Each iteration consists of an E-step and an M-step, and the algorithm terminates at convergence, i.e., when the centroid assignment in the current iteration is the same as in the previous iteration. In the Expectation step, the distances from all data points to the current centroids are calculated; in the Maximization step, new centroids are computed. The objective function of K-means is:

J = \sum_{j=1}^{k} \sum_{p_i^{(j)} \in S_j} \left\| p_i^{(j)} - c_j \right\|^2    (1)

where n is the number of documents, p_i^{(j)} is the i-th document in the j-th cluster, and S_j is the set of documents assigned to cluster j with centroid c_j. The objective function of K-means tries to minimize the intra-cluster distance among the data points.

Constraints based user clustering

In this paper, we define constraints-based clustering of Twitter users. In constraints-based clustering, a small amount of information is available in the form of Must Link and Cannot Link constraints. Pairs in Must Link constraints should be placed in the same cluster; if not, a penalty (the cost of violating the constraint) is added. Pairs in Cannot Link constraints should not be placed in the same cluster; if they are, a penalty is likewise added. Since simple k-means cannot handle such constraints, we use the pairwise constrained clustering framework, in which the objective function of k-means is modified to add a penalty for each violated constraint. Let M be the set of Must Link pairs such that (x_i, x_j) \in M implies x_i and x_j should be assigned to the same cluster. Similarly, let C be the set of Cannot Link pairs such that (x_i, x_j) \in C implies x_i and x_j should not be assigned to the same cluster. Let c_i and c_j denote the clusters assigned to x_i and x_j, and let w_{ij}^{ML} and w_{ij}^{CL} be the weights corresponding to the Must Link and Cannot Link constraints. Using these notations, we define the objective function of constraints-based k-means as:

J = \sum_{j=1}^{k} \sum_{p_i^{(j)} \in S_j} \left\| p_i^{(j)} - c_j \right\|^2 + \sum_{(x_i, x_j) \in M} w_{ij}^{ML} \, \mathrm{Penalty}_M(c_i, c_j) + \sum_{(x_i, x_j) \in C} w_{ij}^{CL} \, \mathrm{Penalty}_C(c_i, c_j)    (2)

where the penalties are

\mathrm{Penalty}_M(c_i, c_j) = 1 \text{ if } c_i \neq c_j, \text{ 0 otherwise}    (3)

\mathrm{Penalty}_C(c_i, c_j) = 1 \text{ if } c_i = c_j, \text{ 0 otherwise}    (4)

The algorithm of constraints-based K-means is given as follows:

Algorithm 1 Constraints based K-means Algorithm
Require: Data points P = p_1, ..., p_n; distance measure d; number of clusters k; positive constraints M; negative constraints C.
Initialize m neighborhoods from M and C. If m < k, initialize with the centroids of the neighborhood sets and the remaining clusters at random; else, initialize using weighted farthest-first traversal starting from the largest neighborhood.
Iteration:
Step 1: Assign each data point such that J is minimized.
Step 2: Update the centroids of each cluster.
End

Constraints Generation

Constraints generation is a crucial part of this work, since wrong constraints can hamper clustering accuracy. We generated positive and negative constraints and found significant improvement in clustering accuracy.

Constraints based on distance measure

First we generate the distance matrix using normalized cosine similarity. To generate negative constraints, we identify pairs of data points with distance greater than a (high) threshold: if two users are farther apart than this threshold, it is likely that they belong to different clusters. Similarly, to generate positive constraints, we identify pairs of data points with distance smaller than a (low) threshold: if two users are closer than this threshold, it is likely that they belong to the same cluster.

Figure 1: Workflow of our approach.

Filtering and domain expert's knowledge

We filter out degenerate data points: users with very few tweets, resulting in a small document vector, or users for whom the same tweet is repeated many times, making the data point invalid due to the large amount of duplicate content. This filtering is very efficient when dealing with very large data sets.
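The constraint-generation and penalized-assignment steps above can be sketched as follows. This is a minimal illustration in the spirit of the pairwise constrained (PCKMeans-style [5]) framework, not the authors' implementation: the thresholds, the uniform penalty weight w, and the toy 2-d "user vectors" are assumptions made for the example.

```python
import numpy as np

def cosine_distance_matrix(X):
    """Pairwise cosine distance between the row vectors of X."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return 1.0 - Xn @ Xn.T

def generate_constraints(D, ml_thresh, cl_thresh):
    """Must-link pairs for very close points, cannot-link pairs for very distant ones."""
    ml, cl = [], []
    n = D.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            if D[i, j] < ml_thresh:
                ml.append((i, j))
            elif D[i, j] > cl_thresh:
                cl.append((i, j))
    return ml, cl

def constrained_kmeans(X, k, ml, cl, w=1.0, iters=20, seed=0):
    """Greedy constrained k-means assignment: squared distance plus a penalty
    for each violated must-link (pair split) or cannot-link (pair merged)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        for i in range(len(X)):
            # Base cost: squared Euclidean distance to each centroid.
            costs = ((X[i] - centroids) ** 2).sum(axis=1)
            for a, b in ml:
                if i in (a, b):
                    other = b if i == a else a
                    costs += w * (np.arange(k) != labels[other])  # penalize splitting the pair
            for a, b in cl:
                if i in (a, b):
                    other = b if i == a else a
                    costs += w * (np.arange(k) == labels[other])  # penalize merging the pair
            labels[i] = int(np.argmin(costs))
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

# Toy data: two obvious groups of users (hypothetical 2-d document vectors).
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
D = cosine_distance_matrix(X)
ml, cl = generate_constraints(D, ml_thresh=0.1, cl_thresh=0.5)
labels = constrained_kmeans(X, k=2, ml=ml, cl=cl)
print(ml)  # [(0, 1), (2, 3)]
print(labels)
```

In the paper's notation, the two penalty loops correspond to the must-link and cannot-link sums added to the k-means objective; in practice, the weights w_ij need not be uniform.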
Also, we apply some domain expert's knowledge in generating the constraints.

IV. EXPERIMENTS AND RESULTS

Dataset

Twitter is one of the largest sources of short text documents. Each tweet contains meta-data and the tweet text, which is treated as a document. We collected 40000 tweets from 400 users who had subscribed to 4 unique Twitter lists. In our setting, a user is a data point, modeled by the concatenation of all tweets of that user.

Preprocessing

We removed stop words using a stop word list. As part of preprocessing, we removed URL occurrences from the documents and converted hashtags into normal keywords. We applied the Porter stemming algorithm, which converts each word to its base form.

Evaluation Measure

We created gold standard clusters based on the users' subscriptions to Twitter lists: Twitter list L_i is represented as cluster i, and the users belonging to that list are the data points of the cluster. Each algorithmically produced cluster is assigned to the gold standard class that is most frequent in the cluster. Let N be the number of documents in the corpus, |C_j| the number of documents in cluster C_j, and |C_j|_{class=i} the number of documents common to class i and cluster C_j. Purity can then be represented by

\mathrm{Purity}(C_j) = \frac{1}{|C_j|} \max_i |C_j|_{class=i}    (5)

\mathrm{Purity} = \sum_{j=1}^{k} \frac{|C_j|}{N} \, \mathrm{Purity}(C_j)    (6)

Results

Figure 2: Comparison of Twitter user clustering, simple k-means vs. constraints-based modified k-means.

Using the constraints-based approach, the improvement in clustering accuracy is notable, as shown in the graph:
- Accuracy of k-means without any constraints (baseline) is 64.143%.
- Accuracy of k-means with only CL constraints is 65.024%.
- Accuracy of k-means with only ML constraints is 76.220%.
- Accuracy of k-means with both CL and ML constraints is 78.123%.

V. CONCLUSIONS

Constraints-based clustering achieves higher accuracy than the traditional k-means clustering algorithm. At the same time, identifying the optimal set of constraints, with the right combination of CL and ML constraints, plays a pivotal role. In this paper, we have shown that by combining the techniques described in Sections 3.2.1 and 3.2.2 (constraints based on a distance measure, and filtering with domain expert's knowledge) we can generate a good set of constraints. In our experiments, we found ML constraints more useful than CL constraints.

VI. REFERENCES

[1] Rangrej, Aniket, Sayali Kulkarni, and Ashish V. Tendulkar. "Comparative study of clustering techniques for short text documents." Proceedings of the 20th International Conference Companion on World Wide Web. ACM, 2011.
[2] Peng, Wei, Tao Li, and Mitsunori Ogihara. "Music clustering with constraints." ISMIR, 2007.
[3] N. Ab Samat, M.A.A. Murad, M.T. Abdullah, and R. Atan. "Malay documents clustering algorithm based on singular value decomposition." Journal of Theoretical and Applied Information Technology, July 2005.
[4] Davidson, Ian, and Sugato Basu. "A survey of clustering with instance level constraints." ACM Transactions on Knowledge Discovery from Data (2007): 1-41.
[5] Basu, Sugato, Mikhail Bilenko, and Raymond J. Mooney. "Comparing and unifying search-based and similarity-based approaches to semi-supervised clustering." Proceedings of the ICML-2003 Workshop on the Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining, 2003.
[6] Anand, Rajul, and Chandan K. Reddy. "Graph-based clustering with constraints." Advances in Knowledge Discovery and Data Mining. Springer Berlin Heidelberg, 2011. 51-62.
[7] D. Boyd, S. Golder, and G. Lotan. "Tweet, tweet, retweet: Conversational aspects of retweeting on Twitter." 43rd Hawaii International Conference on System Sciences, page 412, 2010.
[8] A. Java, T. Finin, X. Song, and B. Tseng. "Why we Twitter: Understanding microblogging usage and communities." WebKDD/SNA-KDD '07, 2007.
[9] M. Nagarajan, H. Purohit, and A. Sheth. "A qualitative examination of topical tweet and retweet practices." Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media (ICWSM 2010), pages 295-298, 2010.
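For completeness, the purity measure used in the evaluation can be sketched in a few lines of Python. The cluster assignments and gold-class labels below are hypothetical values chosen only to illustrate the computation.

```python
from collections import Counter

def purity(labels, classes):
    """Overall purity: for each cluster take the count of its most frequent
    gold class, sum these counts, and divide by the total number of points."""
    n = len(labels)
    clusters = {}
    for lab, cls in zip(labels, classes):
        clusters.setdefault(lab, []).append(cls)
    return sum(max(Counter(members).values()) for members in clusters.values()) / n

# Hypothetical example: 8 users, 2 algorithmic clusters, gold classes from Twitter lists.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
classes = ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'a']
print(purity(labels, classes))  # (3 + 3) / 8 = 0.75
```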