Social Communities Detection in Social Media

Ruiqi Hu

Table of Contents

Abstract
Introduction
Research Problem and Research Gap
Present Work
    Methods based on graph structure
    Methods based on node characteristic analysis
    Methods combining graph structure analysis and node characteristic analysis
Evaluation of the related papers and books I read
    Evaluation Methods
    Data Collection
    Data Pre-processing
        Edge Structure
        Node Characteristics
        Data Set Characteristics
        Ground Truth
    Data Demonstration/Example
        Edges file
        Feature Matching file
        Ground Truth file
Results
Conclusion
Reference List

Abstract

Social media tools such as Twitter and Facebook are already part of daily life, and this work helps social media users automatically organize their friends (followers and followees) into different functional communities. Specifically, given a social media user's friend list, we organize these friends into meaningful groups according to the relationship graph drawn from their connections as well as their profile information (node characteristics). This research will benefit not only individual social media users but also enterprise users.
For individuals, automatic social circle detection offers a deeper understanding of their social networks and groups; for companies, a more comprehensive view of their customers is a competitive advantage. In this literature review, I first introduce the context of social circle detection. I then explain the most basic but significant concepts of this area, so that the research questions, gaps and importance of the field are easier to understand. Finally, I briefly introduce some methods from this area and summarise the experiments and other work I have done so far.

Keywords: Social Media, Community Detection

Introduction

The social networks of people and enterprises are large and need to be categorized, yet at the current stage there is no very good way to organize them automatically. Many social media websites allow members to manually cluster their friends into social groups (e.g. "circles" on Google+, and "lists" on Facebook and Twitter). However, these lists or circles are laborious to construct and have to be manually updated whenever the network grows [Mcauley and Leskovec 2014]. Our research offers a way to detect the social circles in these social media networks automatically. For each latent circle we learn its members' connection structure and a circle-specific user profile similarity metric; modelling node membership in multiple circles then allows us to discover overlapping and hierarchically nested circles [Yang and Leskovec 2012]. This research will help social media users organize and categorize their social networks automatically, which will not only save them a lot of time but also give them a clearer understanding of their own social circles. Furthermore, enterprises will benefit from a deeper and more comprehensive insight into the structure of their customer base, helping them precisely target the right group for their products.
It will also provide material, and even evidence, for sociology-related studies. At the current stage, we focus on detecting enterprises' social communities on a few of the biggest and best-known social media websites, such as Facebook and Twitter. My aim is not simply to improve the accuracy of the detection, including the accuracy of the overlapping and nested parts, but to find the meaningful function of each detected group or circle. This work is not finished yet, but I introduce some of it, such as the experiments and their results, in the following chapters.

From reading the related papers and studying previous work, I found that there are two main approaches to detecting social circles automatically. One is to analyse the network structure (the graph of user connections) to cut out the circles; the other is to measure the similarity of node (user) characteristics and cluster users into circles according to that similarity. Many papers also propose methods that combine these two approaches in the hope of better performance.

In this literature review, I identify the research gap and the research problem based on the related papers and books I read. I then briefly explain some of the main algorithms for detecting social circles in social media and introduce the evaluation methods I applied to this existing work. The results are presented at the end of this literature review, as I believe the best way to judge whether the chosen papers are worthwhile or suitable for academic research is to run the experiments and compare the results.

Research Problem and Research Gap

Social circle (or community) detection in social media mainly aims to automatically cluster a social media website member's social network into several different groups.
These groups may contain overlapping and hierarchically nested structures. There are two main gaps in this field. The first is the accuracy of detecting the overlapping and hierarchically nested parts. Many papers have already done very good work on non-overlapping circle detection, and their algorithms work well in practice. However, few methods achieve impressive performance on overlapping and nested social circles. The second gap is that the majority of circle detection work focuses on the accuracy of the detection, for general or special purposes, while little of it explains the function of the detected circles. More specifically, these methods detect the circles but cannot say why the nodes (users) were clustered into a group or what the group is talking about. In other words, they do not know the function of the detected groups.

These two research gaps are my current interests in the social circle detection area. I briefly introduce my recent work in the following chapter.

Present Work

There are two main approaches to detecting or clustering social circles from users' networks. One is graph structure analysis; the other is node (user) characteristic analysis based on users' profile information.

Methods based on graph structure

Each user of a social media website has friends who are connected to him or her, such as the "followers" and "followees" on Twitter and the "friends" on Facebook and Google+ (see Picture 1 for an example).

Picture 1: Users such as Sam Leahey, Sara and Ksar are followers of Asana.

From these directed and undirected connections, we can draw a graph that shows the network structure. Picture 2 is a network graph drawn from such a connection structure.
Picture 2

Picture 3

In Picture 2 and Picture 3, a node represents a user of the social media website and an edge is the connection between two nodes (users). Different colours indicate the different groups detected. By analysing the structure of these networks, for example by counting connections and analysing connection density, we can divide the whole graph into several clear parts (groups).

The following papers present good research on detecting social circles through graph structure analysis. K-means clustering [MacKay 2003] is one of the simplest unsupervised learning algorithms: it repeatedly recalculates the cluster centres and associates each point of the data set with its nearest centre until the assignments no longer change. Another algorithm, BigClam [Yang and Leskovec 2013], assumes that overlaps between communities are densely connected, in contrast with earlier social circle detection algorithms that assume overlaps between circles are sparsely connected. It combines non-negative matrix factorization with block stochastic gradient descent to analyse the network structure and detect social circles. There are also algorithms based on the AGM model, such as AGMFit [Yang and Leskovec 2012], which rests on the assumption that the parts of the network where communities overlap tend to be more densely connected than the non-overlapping parts of communities. The algorithm is developed from the Community-Affiliation Graph Model (AGM), which reliably reproduces the organization of networks into communities as well as the overlapping community structure.
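The density assumption behind AGM can be illustrated with a few lines of Python. This is a minimal sketch, assuming the usual AGM formulation in which each community c that two nodes share creates an edge between them independently with probability p_c, so that the edge probability is 1 minus the product of (1 - p_c) over the shared communities. The function name, the `epsilon` background probability and the numeric values are hypothetical and chosen only for illustration; this is not the AGMFit fitting procedure itself.

```python
from math import prod


def agm_edge_probability(shared_communities, community_probs, epsilon=0.0):
    """Edge probability under an AGM-style model (illustrative sketch).

    Each shared community c independently generates the edge with
    probability community_probs[c]; epsilon is an assumed background
    probability for node pairs that share no community.
    """
    if not shared_communities:
        return epsilon
    return 1.0 - prod(1.0 - community_probs[c] for c in shared_communities)


# Hypothetical per-community edge probabilities.
community_probs = {"A": 0.3, "B": 0.4}

p_one = agm_edge_probability({"A"}, community_probs)        # nodes share only A
p_both = agm_edge_probability({"A", "B"}, community_probs)  # nodes lie in the overlap of A and B
print(p_one, p_both)
```

Two nodes in the overlap of A and B get a higher edge probability than two nodes that share only A, which is the sense in which overlaps are expected to be more densely connected.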
Furthermore, methods that take geometrical information into account, such as GNMF [Cai et al 2008], encode the geometrical information of the data space by constructing a nearest-neighbour graph. Through a matrix factorization objective function, they find a new representation space in which two data points are sufficiently close to each other if they are connected in the graph.

Methods based on node characteristic analysis

On many social media websites such as Facebook and LinkedIn, the majority of members display some profile information to attract other interested or similar members. These profile sections contain personal information such as name, gender, education background and location.

Picture 4

Picture 4 shows a typical Facebook page, from which we can obtain node features (personal information) such as "work and education", "places you have lived", "family and relationships" and even "life events".

Such information can be treated as the features of each node (user). We can analyse it with methods such as cosine similarity or sequence alignment to measure the similarity between two nodes and cluster on that basis [Xu et al 2012]. The nodes in a circle may share common information, such as a similar education background (attending the same university), a similar work background (working, or having worked, in the same field or company) or living in the same area, such as the same city.

Methods combining graph structure analysis and node characteristic analysis

Some research tries to combine these two classical approaches in the hope of better performance in circle clustering. There are many ways to combine the two. One is to learn a weight that combines the results of node characteristic analysis and network structure analysis, and then to give a final result that decides whether a node or edge belongs to a group or not.
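As an illustration of the two ideas above, the following minimal Python sketch computes the cosine similarity between two binary profile feature vectors and blends a structure-based score with a feature-based score through a single weight. The function names, the example feature vectors and the fixed weight are assumptions made for illustration; in the methods surveyed here the weight would be learned rather than set by hand.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a user with no features is treated as similar to nobody
    return dot / (norm_a * norm_b)


def combined_score(structure_score, feature_score, weight=0.5):
    """Weighted blend of a structure score and a feature score (weight is assumed, not learned)."""
    return weight * structure_score + (1.0 - weight) * feature_score


# Hypothetical binary profile features: [same university, same company, same city, same school]
alice = [1, 1, 0, 1]
bob = [1, 1, 0, 0]
carol = [0, 0, 1, 0]

sim_ab = cosine_similarity(alice, bob)
sim_ac = cosine_similarity(alice, carol)
print(sim_ab, sim_ac)

# Suppose Alice and Bob are directly connected (structure score 1.0).
print(combined_score(1.0, sim_ab, weight=0.5))
```

Alice and Bob share two features and therefore score higher than Alice and Carol, who share none.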
Another typical way is to add the node features into the network and restructure the graph according to these node attributes.

The following algorithms, taken from high-quality papers, are typical examples that consider both network structure and node attributes. The most classical one is Nips [McAuley and Leskovec 2014], which treats a user as an ego and builds the ego network from the connections between the ego's friends. It then poses the task of automatically identifying the ego's social circles as a multi-membership node clustering problem. The model considers both network structure and user profile information. Others, such as Censa [Yang et al 2013] and FNMTF [Wang et al 2011], detect communities by combining network structure with node characteristics. Censa statistically models the interaction between network structure and node attributes in order to detect communities more accurately. Another algorithm, DRCC [Gu and Zhou 2009], is based on semi-nonnegative matrix tri-factorization. It assumes that both the data points (e.g. documents) and the features are sampled from manifolds, and constructs two graphs (a data graph and a feature graph) to explore the geometric structure of the data manifold and the feature manifold. Table 1 compares these baselines.

Table 1

| Algorithm | Network Structure? | Node/Edge Features? | Overlapping Communities? | Automatically detect the number of circles? | Group Function? |
|---|---|---|---|---|---|
| Nips | Yes | Yes | Yes | Yes | No |
| K-means | Yes | No | No | No | No |
| BigClam | Yes | No | Yes | Yes | No |
| AGMFit | Yes | No | Yes | Yes | No |
| Censa | Yes | Yes | Yes | Yes | No |
| FNMTF | Yes | Yes | No | No | No |
| GNMF | Yes | No | No | No | No |
| DRCC | Yes | Yes | No | No | No |

Evaluation of the related papers and books I read

I believe the best way to judge the chosen papers, and to decide whether they are worthwhile or suitable for academic research, is not just to note the conference or journal in which they were published but to run the experiments and compare the results.
I therefore downloaded the source code of the algorithms described in the papers I read, collected data sets, and formatted the data for each algorithm in order to run the experiments.

Evaluation Methods

After convergence, the maximum-likelihood predicted groups $\mathcal{G} = \{G_1, \dots, G_n\}$ can be examined against ground-truth data. The task is to make the predicted groups align as closely as possible with the groups $\bar{\mathcal{G}} = \{\bar{G}_1, \dots, \bar{G}_n\}$ labelled by the enterprise itself. We employ the Balanced Error Rate (BER) between a predicted group $G$ and a manually labelled group $\bar{G}$ to measure the alignment [Chen and Lin 2006]:

\[
\mathrm{BER}(G, \bar{G}) = \frac{1}{2}\left( \frac{|G \setminus \bar{G}|}{|G|} + \frac{|G^{c} \setminus \bar{G}^{c}|}{|G^{c}|} \right),
\]

where $G^{c}$ denotes the complement of $G$. The F1 score is also applied:

\[
F_1(G, \bar{G}) = 2 \cdot \frac{\mathrm{precision}(G, \bar{G}) \cdot \mathrm{recall}(G, \bar{G})}{\mathrm{precision}(G, \bar{G}) + \mathrm{recall}(G, \bar{G})}.
\]

The predicted group $G$ and the ground-truth group $\bar{G}$ are treated as the "retrieved" document set and the "relevant" document set respectively, and we compute precision and recall with the following equations:

\[
\mathrm{precision}(G, \bar{G}) = \frac{|G \cap \bar{G}|}{|G|}, \qquad
\mathrm{recall}(G, \bar{G}) = \frac{|G \cap \bar{G}|}{|\bar{G}|}.
\]

As we do not know the correspondence between the groups in $\mathcal{G}$ and $\bar{\mathcal{G}}$, we learn the optimal match through linear assignment by maximizing

\[
\max_{f : \mathcal{G} \to \bar{\mathcal{G}}} \; \frac{1}{|f|} \sum_{G \in \mathrm{dom}(f)} \bigl( 1 - \mathrm{BER}(G, f(G)) \bigr),
\]

where $f$ is a correspondence between $\mathcal{G}$ and $\bar{\mathcal{G}}$. There are two cases in this assignment. If the number of ground-truth groups $|\bar{\mathcal{G}}|$ is larger than the number of predicted groups $|\mathcal{G}|$, then each group $G \in \mathcal{G}$ must be mapped to a match $\bar{G} \in \bar{\mathcal{G}}$. If the number of ground-truth groups $|\bar{\mathcal{G}}|$ is smaller than the number of predicted groups $|\mathcal{G}|$, we do not apply a penalty for over-predictions, which could have been groups but were not included in the ground truth. Similarly, when using the F1 score we learn the optimal match by maximizing

\[
\max_{f : \mathcal{G} \to \bar{\mathcal{G}}} \; \frac{1}{|f|} \sum_{G \in \mathrm{dom}(f)} F_1(G, f(G)).
\]
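The evaluation measures above can be sketched in plain Python as follows. This is a minimal illustration, assuming groups are represented as Python sets of member ids and that the correspondence is one-to-one; the function names and the toy data are hypothetical. The matching is done by brute force over all assignments, which is only practical for a handful of groups; a Hungarian-style solver would normally replace it.

```python
from itertools import permutations


def ber(pred, truth, universe):
    """Balanced Error Rate between a predicted group and a ground-truth group."""
    pred_c, truth_c = universe - pred, universe - truth
    miss_in = len(pred - truth) / len(pred) if pred else 0.0
    miss_out = len(pred_c - truth_c) / len(pred_c) if pred_c else 0.0
    return 0.5 * (miss_in + miss_out)


def f1(pred, truth):
    """F1 score with pred as the 'retrieved' set and truth as the 'relevant' set."""
    overlap = len(pred & truth)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(truth)
    return 2 * precision * recall / (precision + recall)


def best_match_score(pred_groups, truth_groups, score):
    """Maximise the mean score over one-to-one matches (brute force)."""
    k = min(len(pred_groups), len(truth_groups))
    if k == 0:
        return 0.0
    best = 0.0
    if len(pred_groups) <= len(truth_groups):
        # Every predicted group must be matched to a distinct ground-truth group.
        for perm in permutations(range(len(truth_groups)), k):
            total = sum(score(pred_groups[i], truth_groups[j]) for i, j in enumerate(perm))
            best = max(best, total / k)
    else:
        # Surplus predicted groups are left unmatched and are not penalised.
        for perm in permutations(range(len(pred_groups)), k):
            total = sum(score(pred_groups[i], truth_groups[j]) for j, i in enumerate(perm))
            best = max(best, total / k)
    return best


# Toy example with hypothetical member ids 1..8.
universe = set(range(1, 9))
truth_groups = [{1, 2, 3, 4}, {5, 6}]
pred_groups = [{5, 6, 7}, {1, 2, 3, 4}]

best_ber = best_match_score(pred_groups, truth_groups, lambda g, t: 1 - ber(g, t, universe))
best_f1 = best_match_score(pred_groups, truth_groups, f1)
print(best_ber, best_f1)
```

In the toy example the best correspondence pairs {5, 6, 7} with {5, 6} and {1, 2, 3, 4} with itself, so the matching recovers the right pairing even though the groups are listed in a different order.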
Data Collection

We developed our own Twitter data collector to gather Twitter users' connections with others, as well as information about their tweets such as content, publication date and location. We then replaced the real name of each Twitter user with a unique id for privacy reasons. Every company's official Twitter account can group its "friends" using the "list" function provided by Twitter. "A list is a curated group of Twitter users." (quoted from the Twitter Help Centre). Companies normally create lists according to the common potential function that the users in a list share. For example, in the SonyPictures data set, the users in the list named "Television" are expected to post news or tweets related to SonyPictures TV shows. See Picture 5 for an example.

Picture 5

In this work, we collected the users in each company's lists and the latest 100 tweets of each of these users. We obtained 14 data sets covering the media industry, the motor industry, politics, sport, education and the entertainment industry. In total, we used 3,806 nodes (users), 104,418 edges and 320,313 tweets in the experiments. On average, each data set contains 272 nodes, 7,459 edges and 22,879 tweets. Table 2 summarises the data sets.
Table 2

| EgoName | Nodes (Users) | Edges | Tweets | Features | Lists | Description |
|---|---|---|---|---|---|---|
| ABCNews | 729 | 43,032 | 69,018 | 1,000 | 17 | A famous media company |
| NBA | 137 | 2,036 | 12,402 | 1,000 | 7 | National Basketball Association |
| TwitterAU | 531 | 13,718 | 45,324 | 1,000 | 13 | Twitter official account in Australia |
| SonyPictures | 129 | 186 | 10,801 | 1,000 | 5 | A famous film company |
| LiberalAus | 74 | 3,363 | 3,454 | 1,000 | 5 | Liberal Party in Australia |
| AustralianLabor | 346 | 11,484 | 9,646 | 1,000 | 5 | Labour Party in Australia |
| WhiteHouse | 161 | 3,925 | 13,503 | 1,000 | 5 | White House official Twitter account |
| MercedesBenz | 142 | 1,174 | 11,846 | 1,000 | 7 | A very famous automaker |
| Techreview | 155 | 1,220 | 14,024 | 1,000 | 7 | MIT Technology Review Twitter account |
| Cambridge_Uni | 529 | 4,125 | 47,329 | 1,000 | 12 | Cambridge University official Twitter account |
| The_Nationals | 27 | 173 | 1,941 | 1,000 | 3 | The National Party in Australia |
| Greens | 154 | 2,384 | 13,078 | 1,000 | 9 | The Greens party in Australia |
| MGM_Studios | 68 | 621 | 6,414 | 1,000 | 6 | A famous film company |
| BBCNews | 624 | 16,974 | 61,533 | 1,000 | 11 | A famous media company |

Data Pre-processing

Edge Structure

Edge structure is one of the most important parts of our data format, and we use the "following" relationship among the users we collected to build the edges. For example, in the "following" section of RuiqiUal's Twitter page, we found that RuiqiUal followed NAB, so we build an edge between RuiqiUal and NAB, directed from RuiqiUal to NAB. See Diagram 1.

Diagram 1: Edge

Node Characteristics

"The # symbol, called a hashtag, is used to mark keywords or topics in a Tweet. It was created organically by Twitter users as a way to categorize messages. People use the hashtag symbol # before a relevant keyword or phrase (no spaces) in their Tweet to categorize those Tweets and help them show more easily in Twitter Search."

"The @ sign is used to call out usernames in Tweets. People will use your @username to mention you in Tweets, send you a message or link to your profile.
A username is how you're identified on Twitter, and is always preceded immediately by the @ symbol. For instance, Katy Perry is @katyperry." (both passages quoted from the Twitter Help Centre). See Picture 6 for an example of a hashtag and an @ sign on Twitter. To prepare each node's characteristics, we collected each user's latest 100 tweets in each data set and selected the words marked with a hashtag or @ sign as that member's features (node characteristics).

Picture 6: Example: In the Tweet below, @eddie included the hashtag #FF. Users created this as shorthand for "Follow Friday," a weekly tradition where users recommend people that others should follow on Twitter. You'll see this on Fridays.

Data Set Characteristics

In the same way as we pre-process the node characteristics, we put all tweets of each data set together and selected all words marked with a hashtag or @ sign. We counted the frequency of these words in each data set to rank them, and picked the 1,000 most frequent words as the data set's features.

Ground Truth

On Twitter, each company's official account can group its friends using the "list" function provided by Twitter. We therefore treat the lists created by the ego (the company account) as the ground-truth groups against which we evaluate our algorithm. Please note that these lists overlap: a member may belong to more than one list.

Data Demonstration/Example

Each data set is formatted as four files: w, OGND, fea and memberIndex. See Diagram 2 for an example.

Edges file

In the w file, we use a matrix to store the edge information among users. If member 5 has an edge to member 1, the value of w[5][1] is 1; otherwise the value is 0.

MemberIndex file

This file contains the index mapping of each member. Each member's name is replaced by a unique id for privacy reasons.
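The pre-processing steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the collector used in this work: the regular expression, the function names, the toy tweets and the 0-based member index are all hypothetical (the w[5][1] example above may well use 1-based indices).

```python
import re
from collections import Counter

# Assumed token pattern: a '#' or '@' followed by word characters.
TOKEN_PATTERN = re.compile(r"[#@]\w+")


def extract_tokens(tweets):
    """Return all hashtag and @-mention tokens found in a list of tweets."""
    return [token for tweet in tweets for token in TOKEN_PATTERN.findall(tweet)]


def top_features(tweets_by_user, k):
    """Pick the k most frequent hashtag/@ tokens over a whole data set."""
    counts = Counter()
    for tweets in tweets_by_user.values():
        counts.update(extract_tokens(tweets))
    return [token for token, _ in counts.most_common(k)]


def adjacency_matrix(follows, member_index):
    """Build the directed edge matrix w: w[i][j] = 1 if member i follows member j."""
    n = len(member_index)
    w = [[0] * n for _ in range(n)]
    for follower, followee in follows:
        w[member_index[follower]][member_index[followee]] = 1
    return w


# Toy data with anonymised (hypothetical) member ids.
tweets_by_user = {
    "u1": ["Great game #NBA @nba", "#NBA finals tonight"],
    "u2": ["Watching #NBA with @nba", "#FF @u1"],
}
member_index = {"u1": 0, "u2": 1}
follows = [("u1", "u2")]  # u1 follows u2

features = top_features(tweets_by_user, k=2)
w = adjacency_matrix(follows, member_index)
print(features)
print(w)
```

In the experiments described here k would be 1,000 and each user would contribute up to 100 tweets; the toy values only keep the example readable.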
Feature Matching file

As described in the Data Pre-processing part, each data set has 1,000 features, and we also collected each user's features. In this file, the 1,000 data set features are listed horizontally, and for each user we check whether each feature appears in that user's feature list: if so, the entry is 1; otherwise it is 0.

Ground Truth file

The OGND file uses the Matlab cell data format to store which lists each member belongs to. For example, if the value of cell 666 is 7,10, then member 666 belongs to list No. 7 and list No. 10.

Diagram 2

Results

1. First, we set most parameters of each algorithm (including our algorithm) to their default values; for example, alpha, beta and sigma in our algorithm are set to zero. We set the number of clusters equal to the number of circles in the ground truth. See Table 3 and Table 4.

Table 3: BER loss

| Data set | K-Means | BigClam | AgmFit | Censa | FNMTF | GNMF | DRCC | Nips |
|---|---|---|---|---|---|---|---|---|
| ABCNews | 0.41155 | 0.3205 | 0.2469 | 0.2983 | 0.41021 | 0.42538 | 0.40473 | 0.46115 |
| NBA | 0.33789 | 0.1085 | 0.1158 | 0.1048 | 0.33653 | 0.32418 | 0.22761 | 0.36496 |
| TwitterAU | 0.33054 | 0.3119 | 0.2033 | 0.31 | 0.30831 | 0.25614 | 0.27939 | 0.449949 |
| SonyPictures | 0.27829 | 0.3442 | 0.4453 | 0.3384 | 0.35359 | 0.43076 | 0.26848 | 0.299225 |
| LiberalAus | 0.47403 | 0.1567 | 0.1129 | 0.1842 | 0.42942 | 0.39116 | 0.43932 | 0.367568 |
| AustralianLabor | 0.34216 | 0.5898 | 0.1972 | 0.279 | 0.37007 | 0.34124 | 0.34033 | 0.312139 |
| WhiteHouse | 0.34768 | 0.36 | 0.2679 | 0.3631 | 0.36514 | 0.40739 | 0.39263 | 0.299379 |
| MercedesBenz | 0.34216 | 0.234 | 0.231 | 0.2766 | 0.37007 | 0.34124 | 0.34033 | 0.338028 |
| Techreview | 0.40109 | 0.2702 | 0.319 | 0.2832 | 0.41011 | 0.39285 | 0.37492 | 0.347465 |
| Cambridge_Uni | 0.43657 | 0.2486 | 0.3471 | 0.2514 | 0.39805 | 0.45383 | 0.40838 | 0.400756 |
| The_Nationals | 0.43981 | 0.0769 | 0.0577 | 0.0962 | 0.26686 | 0.23512 | 0.3122 | 0.209877 |
| Greens | 0.42582 | 0.2953 | 0.3302 | 0.3008 | 0.39911 | 0.40076 | 0.39257 | 0.395382 |
| MGM_Studios | 0.32609 | 0.2391 | 0.2029 | 0.2094 | 0.37502 | 0.46225 | 0.20839 | 0.375 |
| BBCNews | 0.41512 | 0.1622 | 0.251 | 0.1384 | 0.42348 | 0.44836 | 0.40158 | 0.399912 |
| Average | 0.3719 | 0.2739 | 0.24733 | 0.24996 | 0.36819 | 0.3784 | 0.3347 | 0.35794 |

Table 4: F1 score

| Data set | K-Means | BigClam | AgmFit | Censa | FNMTF | GNMF | DRCC | Nips |
|---|---|---|---|---|---|---|---|---|
| ABCNews | 0.11045 | 0.2934 | 0.3745 | 0.3446 | 0.14475 | 0.15895 | 0.13789 | 0.189125 |
| NBA | 0.40969 | 0.7859 | 0.7689 | 0.7908 | 0.4029 | 0.3858 | 0.57537 | 0.428098 |
| TwitterAU | 0.36574 | 0.3779 | 0.581 | 0.3836 | 0.34456 | 0.42203 | 0.42797 | 0.24676 |
| SonyPictures | 0.55095 | 0.3295 | 0.1137 | 0.3372 | 0.39705 | 0.35035 | 0.5575 | 0.34763 |
| LiberalAus | 0.3136 | 0.6548 | 0.6918 | 0.6117 | 0.34897 | 0.41564 | 0.3167 | 0.621857 |
| AustralianLabor | 0.39616 | 0.3717 | 0.6021 | 0.4422 | 0.34898 | 0.42824 | 0.43044 | 0.631624 |
| WhiteHouse | 0.37887 | 0.2847 | 0.4586 | 0.2785 | 0.36086 | 0.35184 | 0.32832 | 0.504132 |
| MercedesBenz | 0.39616 | 0.5376 | 0.5023 | 0.4507 | 0.34898 | 0.42824 | 0.43044 | 0.389616 |
| Techreview | 0.29113 | 0.4096 | 0.3011 | 0.3804 | 0.2819 | 0.32963 | 0.32258 | 0.313859 |
| Cambridge_Uni | 0.17449 | 0.5054 | 0.3081 | 0.4991 | 0.20433 | 0.13644 | 0.21198 | 0.24009 |
| The_Nationals | 0.34444 | 0.8519 | 0.8889 | 0.8148 | 0.57268 | 0.60158 | 0.55732 | 0.777778 |
| Greens | 0.20823 | 0.4409 | 0.3738 | 0.4144 | 0.27619 | 0.22689 | 0.24314 | 0.350729 |
| MGM_Studios | 0.45046 | 0.5147 | 0.5686 | 0.5686 | 0.30377 | 0.21378 | 0.54318 | 0.369882 |
| BBCNews | 0.25203 | 0.6577 | 0.4803 | 0.7124 | 0.2497 | 0.17784 | 0.28105 | 0.363441 |
| Average | 0.3329 | 0.4893 | 0.4863 | 0.49363 | 0.32589 | 0.32397 | 0.3882 | 0.39636 |

2. Next, we let the algorithms detect the number of circles automatically. Please note that although the Nips paper claims that the algorithm can automatically detect the number of circles, we did not find this function in the source code its authors published, so results for Nips are missing from this part. See Table 5 and Table 6.
Table 5: Auto BER loss

| Data set | BigClam | AgmFit | Censa |
|---|---|---|---|
| ABCNews | 0.3029 | 0.4837 | 0.4133 |
| NBA | 0.1253 | 0.3676 | 0.1048 |
| TwitterAU | 0.3128 | 0.4665 | 0.2494 |
| SonyPictures | 0.3756 | 0.4609 | 0.4142 |
| LiberalAus | 0.2332 | 0.3325 | 0.3339 |
| AustralianLabor | 0.2287 | 0.3319 | 0.2296 |
| WhiteHouse | 0.2814 | 0.4474 | 0.3149 |
| MercedesBenz | 0.2272 | 0.4025 | 0.2946 |
| Techreview | 0.2965 | 0.435 | 0.3421 |
| Cambridge_Uni | 0.2643 | 0.4214 | 0.2415 |
| The_Nationals | 0.1325 | 0.2116 | 0.228 |
| Greens | 0.2946 | 0.4399 | 0.3268 |
| MGM_Studios | 0.269 | 0.2612 | 0.2633 |
| BBCNews | 0.2326 | 0.1384 | 0.1384 |
| Average | 0.1374 | 0.2572 | 0.37446 |

Table 6: Auto F1 score

| Data set | BigClam | AgmFit | Censa |
|---|---|---|---|
| ABCNews | 0.317 | 0.038 | 0.1706 |
| NBA | 0.7287 | 0.2701 | 0.7908 |
| TwitterAU | 0.376 | 0.0716 | 0.4404 |
| SonyPictures | 0.2571 | 0.0853 | 0.1667 |
| LiberalAus | 0.2252 | 0.4167 | 0.406 |
| AustralianLabor | 0.4873 | 0.3449 | 0.4781 |
| WhiteHouse | 0.2436 | 0.1149 | 0.2947 |
| MercedesBenz | 0.5282 | 0.2019 | 0.4049 |
| Techreview | 0.3711 | 0.1376 | 0.2549 |
| Cambridge_Uni | 0.4527 | 0.1588 | 0.5205 |
| The_Nationals | 0.5494 | 0.5926 | 0.4026 |
| Greens | 0.4498 | 0.1374 | 0.3582 |
| MGM_Studios | 0.4559 | 0.4853 | 0.4265 |
| BBCNews | 0.4472 | 0.1787 | 0.7124 |
| Average | 0.43569 | 0.2167 | 0.41702 |

3. We also varied the number of clusters over 3, 5, 7, 9 and 11 to track how the quality of the results changes with the number of circles. We then calculated the average performance of each algorithm over all data sets and drew a trend line chart for each algorithm. The following charts show how each algorithm's average BER and F1 score over all data sets change as the number of detected circles varies (K is the number of detected circles).

[Chart: Average BER_Loss Value for Varying K]

[Chart: Average F1 Score Value for Varying K]

When we set the number of circles to a small value, say three or five, AgmFit and BigClam ignore many nodes, whereas our algorithm keeps all nodes and clusters them into different circles. Some small data sets, such as 'The_Nationals' and 'NBA', cannot be divided into 9 or 11 different circles by some algorithms even when we manually set K to 9 or 11.
For the other, bigger data sets, however, the automatic circle number detection function worked well. Nips achieves better results when the number of circles is small. Generally speaking, the Stanford algorithm achieves its best performance when K is three.

Conclusion

There are three key findings of this literature review:

1. There are two basic approaches to detecting social circles in social media. One is analysis of the relationship network structure; the other is node (user) characteristic analysis.

2. There are three main categories of algorithms for detecting the groups. The first consists of methods based purely on graph structure analysis. These can cluster social circles even perfectly if the graph structure of the data is well organized and clear. However, they usually find it hard to detect the overlapping and nested parts accurately, and may perform badly when the network structure of the data set is complex and confused. The second consists of methods based only on node characteristic analysis. These algorithms ignore the network structure and depend mainly on the similarity among nodes (users) to do the clustering. They may perform well when the data set has complete and sufficient user profile information, but may not work well when the data set lacks such information, as is the case for data from Twitter. The third consists of methods that consider both graph structure and node attributes. Overall, this kind of algorithm may achieve the best performance of the three categories. However, for data sets with a very clear graph structure, algorithms based purely on graph structure analysis may beat these methods, as the node information may confuse an already clean circle classification and lead to poor results.

3.
In future work, I will address the gap of functional group detection, which explains the function and meaning of each detected group by clustering the content of these groups.

Reference List

Yang, J. & Leskovec, J. 2013, 'Overlapping community detection at scale: a nonnegative matrix factorization approach', in Proceedings of the sixth ACM international conference on Web search and data mining, ACM, pp. 587-596.

Lee, D.D. & Seung, H.S. 2001, 'Algorithms for non-negative matrix factorization', in Advances in neural information processing systems, pp. 556-562.

Gu, Q. & Zhou, J. 2009, 'Co-clustering on manifolds', in Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, pp. 359-368.

Yang, J. & Leskovec, J. 2012, 'Community-affiliation graph model for overlapping network community detection', in Data Mining (ICDM), 2012 IEEE 12th International Conference on, IEEE, pp. 1170-1175.

McAuley, J. & Leskovec, J. 2014, 'Discovering social circles in ego networks', ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 8, no. 1, p. 4.

Wang, H., Nie, F., Huang, H. & Makedon, F. 2011, 'Fast nonnegative matrix tri-factorization for large-scale data co-clustering', in IJCAI Proceedings-International Joint Conference on Artificial Intelligence, vol. 22, no. 1, p. 1553.

Yang, J., McAuley, J. & Leskovec, J. 2013, 'Community detection in networks with node attributes', in Data Mining (ICDM), 2013 IEEE 13th International Conference on, IEEE, pp. 1151-1156.

MacKay, D.J.C. 2003, Information theory, inference, and learning algorithms, vol. 7, Cambridge University Press, Cambridge.

Yang, Y.H. 2005, 'Information theory, inference, and learning algorithms', Journal of the American Statistical Association, vol. 100, no. 472, pp. 1461-1462.

McAuley, J. & Leskovec, J. 2012, 'Learning to discover social circles in ego networks', in Advances in neural information processing systems, pp. 539-547.

Cai, D., He, X., Wu, X. & Han, J. 2008, 'Non-negative matrix factorization on manifold', in Data Mining, ICDM'08. Eighth IEEE International Conference on, IEEE, pp. 63-72.

Ding, C., He, X. & Simon, H.D. 2005, 'On the equivalence of nonnegative matrix factorization and spectral clustering', in SDM, vol. 5, pp. 606-610.

Xu, Z.Q., Ke, Y.P., Wang, Y., Cheng, H. & Cheng, J. 2012, 'A model-based approach to attributed graph clustering', in Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, ACM, pp. 505-516.

Zhou, Y., Cheng, H. & Yu, J.X. 2009, 'Graph clustering based on structural/attribute similarities', Proceedings of the VLDB Endowment, vol. 2, no. 1, pp. 718-729.

Barabási, A.L., Gulbahce, N. & Loscalzo, J. 2011, 'Network medicine: a network-based approach to human disease', Nature Reviews Genetics, no. 1, pp. 56-68.

Maier, M., Hein, M. & von Luxburg, U. 2007, 'Cluster identification in nearest-neighbor graphs', in Algorithmic Learning Theory, pp. 196-210.

Kuang, D., Park, H. & Ding, C. 2012, 'Symmetric nonnegative matrix factorization for graph clustering', in SDM, vol. 12, pp. 106-117.

Reagans, R. & McEvily, B. 2003, 'Network structure and knowledge transfer: the effects of cohesion and range', Administrative Science Quarterly, no. 2, pp. 240-267.