Social Communities Detection in Social Media
Ruiqi Hu
Table of Contents
Abstract
Keywords
Stakeholders
Aims
Objectives
Significance
Introduction
Research Problem and Research Gap
Present Work
Methods based on graph structure
Methods based on node characteristic analysis
Methods combining graph structure analysis and node characteristic analysis
Evaluation of the related papers and books I read
Evaluation Methods
Data Collection
Data Pre-processing
Edge Structure
Node Characteristics
Data Set Characteristics
Ground Truth
Data Demonstration/Example
Edges file
Feature Matching file
Ground Truth file
Results
Conclusion
Reference List
Abstract
Social media tools like Twitter and Facebook are already part of everyday life, and my topic
helps social media users automatically organize their friends (followers and followees) into
different functional communities. Specifically, based on a social media user's friend list, we
organize these friends into meaningful groups according to the relationship graph structure
drawn from their connections as well as their profile information (node characteristics). This
research will benefit not only personal social media users but also enterprise users. For
individuals, automatic social circle detection will help them gain a deeper understanding of
their social networks and groups; for companies, those with a more comprehensive view of
their customers will gain an advantage. In this literature review, I will first introduce the
context of social circle detection. Then some of the most basic but significant concepts of this
area will be explained to make the research questions, gaps and importance of this field
easier to understand. Finally, I will briefly introduce some methods of this study and
summarise the experiments and other work I have done so far.
Keywords: Social Media, Community Detection
Stakeholders
This work will mainly benefit the following people:
Marketers: it helps them understand what each group of customers is talking about and
discover potential customer groups.
Heavy social media users: it automatically and accurately classifies their "friends" in social
media into different categories and shows what each group of friends has been talking about
most recently.
Hiring managers: it automatically and accurately classifies their candidates according to their
profiles.
Aims
1. To show that social media community detection can be very useful for marketing and
hiring purposes.
2. To show that social media community detection can offer great convenience to social
media users.
Objectives
1. To improve the accuracy of community detection in social media, as evaluated against
ground truth and compared with six other state-of-the-art baseline algorithms.
2. To automatically give a meaningful label to each detected community, which makes the
detected communities understandable.
Significance
Marketers will be able to discover new groups of customers through social media
community detection, as this research not only detects the communities accurately but also
gives each detected community a meaningful label, so people can easily understand what
each group talks about and cares about. The accuracy of the community detection and of
the labels for each detected community has been evaluated based on ground truth.
Introduction
The social networks of people and enterprises are big and need to be categorized, and at
the current stage there is no very good way to organize them automatically. Many social
media websites allow members to manually cluster their friends into social groups (e.g.
"groups" on Google+, and "lists" on Facebook and Twitter). However, these lists or circles
are laborious to construct and have to be manually updated whenever the network
grows [Mcauley and Leskovec 2014].
Our research offers a way to automatically detect the social circles in these social media
networks. For each latent circle we learn its members' connection structure and a
circle-specific user profile similarity metric. Modelling node membership in multiple circles
then allows us to discover overlapping and hierarchically nested circles [Yang and Leskovec
2012].
This research will help social media users organize and categorize their social networks
automatically, which will not only save them a lot of time but also give them a clearer
understanding of their own social circles. Furthermore, enterprises will benefit from a
deeper and more comprehensive insight into the structure of their customer base, helping
them precisely target the right group for their products. It will also provide material and
even evidence for sociology-related studies.
At the current stage, we focus on detecting enterprises' social communities in social media,
based on a few of the biggest and most famous social media websites such as Facebook
and Twitter.
My personal preference is not simply to improve the accuracy of the detection, including
the accuracy on overlapping and nested parts, but also to find the meaningful function of
each detected group or circle. I am currently working on this and the work is not finished
yet, but I will introduce some of my work, such as experiments and their results, in the
following chapters.
Through reading the related papers and studying previous work, I found that there are two
main approaches to detecting social circles automatically. One is to analyse the network
structure (the users' connection graph) to cut out the circles; the other is to measure the
similarity of node (user) characteristics and cluster the circles according to that similarity.
Many papers have also proposed methods that combine these two approaches in the hope
of better performance.
In this literature review, I will identify the research gap as well as the research problem
based on the related papers and books I read. I will then briefly explain some of the main
algorithms for detecting social circles in social media and introduce the evaluation methods
I applied to evaluate these existing works. The results are presented at the end of this
literature review, as I believe the best way to judge whether the chosen papers are worthy
of or suitable for academic research is to run the experiments and compare the results.
Research Problem and Research Gap
Social circle or community detection in social media mainly aims to automatically cluster a
social media website member's social network into several different groups. These groups
may contain overlapping and hierarchically nested structures.
There are two main gaps in this field. One is the accuracy of detecting the overlapping and
hierarchically nested parts. Many papers have already done very good work on
non-overlapping circle detection and their algorithms work well in practice. However, few
works and methods achieve impressive performance on overlapping and nested social
circle detection.
The other gap is that the majority of circle detection work focuses on the accuracy of circle
detection for general or special purposes, but little of it explains the function of the
detected circles. More specifically, these methods detect the circles but give no idea why
these nodes or users are clustered as a group or what they are talking about. In other
words, they do not know the function of the detected groups.
These two research gaps are my current interests in the social circle detection area. I will
briefly introduce my recent work in the following chapter.
Present Work
Basically, there are two main methods to detect or cluster the social circles in users'
networks. One is graph structure analysis and the other is node (user) characteristic analysis
based on users' profile information.
Methods based on graph structure
Each user of a social media website has friends who are connected to him or her, like the
"followers" and "followees" on Twitter and the "friends" on Facebook and Google Plus
(see picture 1 for an example).
picture 1
Users like Sam Leahey, Sara and Ksar are followers of Asana. We can graph the network
structure according to these connections.
From these directed and undirected connections, we can draw a graph that shows the
network structure. Picture 2 is a network graph drawn from the connection structure.
picture 2
picture 3
In picture 2 and picture 3, a node represents a user of the social media website and an edge
is the connection between two nodes (users). Different colours mark the different groups
detected.
By analysing the structure of these networks, for example by counting the connections and
analysing the connection density, we can divide the whole graph into several clear parts
(groups).
The following papers present good research on detecting social circles based on graph
structure analysis.
K-means clustering [MacKay 2003] is one of the simplest unsupervised learning algorithms.
It repeatedly recalculates the cluster centres and assigns each point in the data set to its
nearest centre until no assignment changes.
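The loop described above can be sketched in a few lines. This is only an illustrative sketch, not the implementation used in the experiments; the function name `kmeans`, the toy points and the random initialisation are assumptions made for the example.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Plain k-means on a list of equal-length numeric tuples (illustrative sketch)."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # assumption: initial centres are k sampled points
    assignment = None
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centre (squared distance).
        new_assignment = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centres[c])))
            for point in points
        ]
        if new_assignment == assignment:  # no point changed cluster: converged
            break
        assignment = new_assignment
        # Update step: move each centre to the mean of the points assigned to it.
        for c in range(k):
            members = [pt for pt, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centre
                centres[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centres

# Hypothetical toy data: two well-separated groups of points.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels, centres = kmeans(points, k=2)
```

Note that k-means needs the number of clusters in advance and gives every point exactly one label, so it cannot express overlapping circles.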
Another algorithm, BigClam [Yang and Leskovec 2013], assumes that overlaps between
communities are densely connected, in contrast with existing social circle detection
algorithms, which assume that overlaps between circles are sparsely connected. It combines
non-negative matrix factorization methods with block stochastic gradient descent to analyse
the network structure and detect social circles.
There are also algorithms based on the AGM model, such as the AGMFit algorithm [Yang and
Leskovec 2012], which rests on the assumption that the parts of the network where
communities overlap tend to be more densely connected than the non-overlapping parts of
communities. The algorithm is developed from the Community-Affiliation Graph Model
(AGM), which reliably reproduces the organization of networks into communities and the
overlapping community structure.
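The density assumption follows from how the AGM generates edges: each community that two nodes share creates an edge between them independently with its own probability, so the chance of an edge grows with the number of shared communities. A minimal sketch of that rule follows; the function name and the probability value 0.3 are illustrative assumptions, and the small background edge probability the model uses for nodes that share no community is omitted.

```python
def agm_edge_probability(shared_community_probs):
    """P(edge) = 1 - prod(1 - p_c) over the communities c shared by two nodes."""
    no_edge = 1.0
    for p in shared_community_probs:
        no_edge *= (1.0 - p)  # probability that community c does NOT create the edge
    return 1.0 - no_edge

# Two nodes sharing one community versus two nodes sharing two communities.
one_shared = agm_edge_probability([0.3])
two_shared = agm_edge_probability([0.3, 0.3])
```

With these illustrative numbers the pair in the overlap of two communities is more likely to be connected than the pair sharing only one, which is the behaviour the assumption describes.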
Furthermore, other methods that consider geometrical information, such as GNMF [Cai et al
2008], encode the geometrical information of the data space by constructing a
nearest-neighbour graph. Through a matrix factorization objective function, they find a new
representation space in which two data points are sufficiently close to each other if they
are connected in the graph.
Methods based on node characteristic analysis
On many social media websites like Facebook and LinkedIn, the majority of members show
some profile information to attract other interesting or similar members. These profile
sections contain a lot of personal information such as name, gender, education background
and location.
picture 4
Picture 4 shows a typical Facebook page, from which we can get node features (personal
information) like "work and education", "place you have lived", "family and relationship"
and even "life events".
Such information can be treated as the features of each node (user), and we can analyse it
with methods like cosine similarity or sequence alignment to measure the similarity between
two nodes and do the clustering [Xu et al 2014].
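As a small illustration of the similarity step, the sketch below computes the cosine similarity between binary profile vectors. The users, the four profile features and the function name are hypothetical and chosen only for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here: an empty profile is similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical binary profile vectors over four features,
# e.g. [same university, same city, same employer, same school].
alice = [1, 1, 0, 1]
bob = [1, 1, 0, 0]
carol = [0, 0, 1, 0]

sim_ab = cosine_similarity(alice, bob)    # two shared features
sim_ac = cosine_similarity(alice, carol)  # no shared features
```

A clustering method can then place users with high pairwise similarity in the same circle.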
The nodes in these circles may share common information, such as a similar education
background (attending the same university), a similar work background (working or having
worked in the same area or company) or living in the same area, such as the same city.
Methods combining graph structure analysis and node characteristic analysis
Some research tries to combine these two classical approaches in the hope of better
performance in circle clustering. There are many ways to combine the two methods. One is
to learn a weight that combines the results of the node characteristic analysis and the
network structure analysis, and then give a final result that decides whether a node or edge
belongs to a group or not. Another typical way is to add the node features to the network
and restructure the graph according to these node attributes.
The following algorithms, taken from some high-quality papers, are typical examples that
consider both network structure and node attributes.
The most classical one is Nips [McAuley and Leskovec 2014], which treats a user as an ego
and builds the ego network from the connections between the ego's friends. It then poses
the task of automatically identifying the ego's social circles as a multi-membership node
clustering problem. The model considers both network structure and user profile
information.
Others, like Censa [Yang et al 2013] and FNMTF [Wang et al 2011], detect communities by
combining network structure with node characteristics. Censa statistically models the
interaction between network structure and node attributes to achieve more accurate
community detection.
Another algorithm, DRCC [Gu et al 2009], is based on semi-nonnegative matrix
tri-factorization. It treats both data points (e.g. documents) and features as sampled from
manifolds and constructs two graphs (a data graph and a feature graph) to explore the
geometric structure of the data manifold and the feature manifold.
Table 1 compares these baselines.
table 1
Algorithm   Network      Node/Edge   Overlapping    Automatically detect     Group
            Structure?   Features?   Communities?   the number of circles?   Function?
Nips        Yes          Yes         Yes            Yes                      No
K-means     Yes          No          No             No                       No
BigClam     Yes          No          Yes            Yes                      No
AGMFit      Yes          No          Yes            Yes                      No
Censa       Yes          Yes         Yes            Yes                      No
FNMTF       Yes          Yes         No             No                       No
GNMF        Yes          No          No             No                       No
DRCC        Yes          Yes         No             No                       No
Evaluation of the related papers and books I read
I believe the best way to judge whether the chosen papers are worthy of or suitable for
academic research is not just to note which conference or journal published them, but to run
the experiments and compare the results. So I downloaded the source code of the
algorithms described in the papers I read, collected the data sets and formatted them for
each algorithm in order to run the experiments.
Evaluation Methods
After convergence, the predicted groups $\mathcal{G} = \{G_1, \dots, G_n\}$ can be
evaluated against ground truth data. The task is to make the predicted groups align as
closely as possible with the enterprise-self-labelled groups
$\bar{\mathcal{G}} = \{\bar{G}_1, \dots, \bar{G}_m\}$.
We employ the Balanced Error Rate (BER) between a predicted group $G$ and a manually
labelled group $\bar{G}$ to measure the alignment [Chen and Lin 2006]:
$$\mathrm{BER}(G, \bar{G}) = \frac{1}{2}\left(\frac{|G \setminus \bar{G}|}{|G|} + \frac{|G^c \setminus \bar{G}^c|}{|G^c|}\right).$$
The F1 score is also applied:
$$F_1(G, \bar{G}) = 2 \cdot \frac{\mathrm{precision}(G, \bar{G}) \cdot \mathrm{recall}(G, \bar{G})}{\mathrm{precision}(G, \bar{G}) + \mathrm{recall}(G, \bar{G})}.$$
The predicted group $G$ and the ground truth group $\bar{G}$ are treated as the "retrieved"
document set and the "relevant" document set respectively, and we compute precision and
recall using the following equations:
$$\mathrm{precision}(G, \bar{G}) = \frac{|G \cap \bar{G}|}{|G|}, \qquad \mathrm{recall}(G, \bar{G}) = \frac{|G \cap \bar{G}|}{|\bar{G}|}.$$
As we do not know the correspondence between the groups in $\mathcal{G}$ and
$\bar{\mathcal{G}}$, we find the optimal match through linear assignment by maximizing
$$\max_{f : \mathcal{G} \to \bar{\mathcal{G}}} \frac{1}{|f|} \sum_{G \in \mathrm{dom}(f)} \bigl(1 - \mathrm{BER}(G, f(G))\bigr),$$
where $f$ is a correspondence between $\mathcal{G}$ and $\bar{\mathcal{G}}$. There are two
cases in this assignment. If the number of ground truth groups $|\bar{\mathcal{G}}|$ is
larger than the number of predicted groups $|\mathcal{G}|$, then each predicted group
$G \in \mathcal{G}$ must be matched to a group $\bar{G} \in \bar{\mathcal{G}}$. If the
number of ground truth groups $|\bar{\mathcal{G}}|$ is smaller than the number of predicted
groups $|\mathcal{G}|$, we do not apply a penalty for over-predictions, which could have
been groups but were not included in the ground truth. Similarly, when using the F1 score
we find the optimal match by maximizing
$$\max_{f : \mathcal{G} \to \bar{\mathcal{G}}} \frac{1}{|f|} \sum_{G \in \mathrm{dom}(f)} F_1(G, f(G)).$$
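The evaluation procedure can be illustrated with a short sketch. It is not the evaluation script used for the experiments: the function names and the toy groups are assumptions, and the optimal match is found by brute force over all one-to-one matchings rather than by a linear assignment solver, which is only practical for a handful of groups.

```python
from itertools import permutations

def ber(pred, truth, universe):
    """Balanced Error Rate between a predicted group and a ground truth group."""
    pred_c, truth_c = universe - pred, universe - truth
    wrong_inside = len(pred - truth) / len(pred) if pred else 0.0
    wrong_outside = len(pred_c - truth_c) / len(pred_c) if pred_c else 0.0
    return 0.5 * (wrong_inside + wrong_outside)

def f1(pred, truth):
    """F1 with pred as the retrieved set and truth as the relevant set."""
    overlap = len(pred & truth)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(truth)
    return 2 * precision * recall / (precision + recall)

def best_match_score(predicted, truth, score):
    """Best average score over one-to-one matchings of predicted to ground truth groups."""
    n_pairs = min(len(predicted), len(truth))
    if n_pairs == 0:
        return 0.0
    best = 0.0
    if len(predicted) <= len(truth):
        # Every predicted group must be matched to a distinct ground truth group.
        for perm in permutations(range(len(truth)), n_pairs):
            total = sum(score(predicted[i], truth[j]) for i, j in enumerate(perm))
            best = max(best, total / n_pairs)
    else:
        # Surplus predicted groups stay unmatched and are not penalised.
        for perm in permutations(range(len(predicted)), n_pairs):
            total = sum(score(predicted[i], truth[j]) for j, i in enumerate(perm))
            best = max(best, total / n_pairs)
    return best

# Hypothetical example: six users, two predicted circles, two ground truth lists.
universe = set(range(6))
predicted = [{0, 1, 2}, {3, 4, 5}]
truth = [{3, 4}, {0, 1, 2}]

ber_score = best_match_score(predicted, truth, lambda g, t: 1 - ber(g, t, universe))
f1_score = best_match_score(predicted, truth, f1)
```

In this toy case the best matching pairs the first predicted circle with the second list and the second circle with the first list; the only error is user 5, who was predicted to be in a circle but is in no ground truth list.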
Data Collection
We developed our own Twitter data collector to gather Twitter users' connections with
others as well as information about their tweets, such as content, publication date and
location. We then replaced the real name of each Twitter user with a unique id for privacy
reasons.
Every company's official Twitter account can group its "friends" using the "list" function
provided by Twitter. "A list is a curated group of Twitter users" (quoted from the Twitter
Help Centre). Companies normally create lists according to the common potential function
that the users in the list share. For example, in the SonyPictures data set, the users in the
list named "Television" are expected to post news or tweets related to SonyPictures TV
shows. See picture 5 for an example.
picture 5
In this work, we collected the users in each company's lists and the latest 100 tweets of
these users. We obtained 14 data sets covering the media industry, motor industry, politics,
sport, education and the entertainment industry. In total, we used 3,806 nodes (users),
104,418 edges and 320,313 tweets in the experiments. On average, each data set contains
272 nodes, 7,459 edges and 22,879 tweets. Table 2 summarises the data sets.
table 2
EgoName           Nodes(Users)   Edges    Tweets   Features   List   Description
ABCNews           729            43,032   69,018   1,000      17     A famous media company
NBA               137            2,036    12,402   1,000      7      National Basketball Association
TwitterAU         531            13,718   45,324   1,000      13     Twitter official account in Australia
SonyPictures      129            186      10,801   1,000      5      A famous film company
LiberalAus        74             3,363    3,454    1,000      5      Liberal Party in Australia
AustralianLabor   346            11,484   9,646    1,000      5      Labour Party in Australia
WhiteHouse        161            3,925    13,503   1,000      5      White House official Twitter account
MercedesBenz      142            1,174    11,846   1,000      7      A very famous automaker
Techreview        155            1,220    14,024   1,000      7      MIT Technology Review Twitter account
Cambridge_Uni     529            4,125    47,329   1,000      12     Cambridge University official Twitter account
The_Nationals     27             173      1,941    1,000      3      The National Party in Australia
Greens            154            2,384    13,078   1,000      9      The Greens party in Australia
MGM_Studios       68             621      6,414    1,000      6      A famous film company
BBCNews           624            16,974   61,533   1,000      11     A famous media company
Data Pre-processing
Edge Structure
Edge structure is one of the most important parts of our data format, and we use the
"following" relationships among the users we collected to build the edges. For example, in
the "following" section of RuiqiUal's Twitter page we found that RuiqiUal follows NAB, so
we build an edge between RuiqiUal and NAB. The edge is directed from RuiqiUal to NAB.
See diagram 1.
diagram 1
Node Characteristics
"The # symbol, called a hashtag, is used to mark keywords or topics in a Tweet. It was
created organically by Twitter users as a way to categorize messages. People use the hashtag
symbol # before a relevant keyword or phrase (no spaces) in their Tweet to categorize those
Tweets and help them show more easily in Twitter Search."
"The @ sign is used to call out usernames in Tweets. People will use your @username to
mention you in Tweets, send you a message or link to your profile. A username is how you're
identified on Twitter, and is always preceded immediately by the @ symbol. For instance,
Katy Perry is @katyperry. "
(Quoted from the Twitter Help Centre.) See picture 6 for an example of a hashtag and an
@ sign on Twitter.
To prepare each node's characteristics, we collected each user's latest 100 tweets in each
data set and selected the words marked with a hashtag or @ sign as this member's
features (node characteristics).
picture 6
Example: In the Tweet below, @eddie included the hashtag #FF. Users created this as shorthand for "Follow Friday," a weekly tradition where users recommend people
that others should follow on Twitter. You'll see this on Fridays.
Data Set Characteristics
Just as we pre-processed the node characteristics, we put all tweets of each data set
together and selected all words with a hashtag or @ sign for each data set. We counted the
frequency of these words in each data set to rank the word list, and picked the 1,000 most
frequent words as the data set's features.
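The selection of hashtag and @ words and the frequency ranking can be sketched as follows. This is an illustration rather than the collector's actual code: the regular expression, the lower-casing of tokens and the example tweets are assumptions.

```python
import re
from collections import Counter

# Assumption: a feature token is '#' or '@' followed by letters, digits or underscores.
TOKEN = re.compile(r"[#@]\w+")

def extract_tokens(tweet):
    """Return the hashtag and @ words of one tweet, lower-cased."""
    return [token.lower() for token in TOKEN.findall(tweet)]

def top_features(tweets, k=1000):
    """Rank all hashtag/@ words of a data set by frequency and keep the k most frequent."""
    counts = Counter(token for tweet in tweets for token in extract_tokens(tweet))
    return [token for token, _ in counts.most_common(k)]

# Hypothetical tweets for a tiny data set.
tweets = [
    "Great game tonight #NBA @katyperry",
    "#NBA finals!",
    "Follow @eddie #FF",
]
features = top_features(tweets, k=2)
```

The same `extract_tokens` helper, applied to one user's latest tweets, would give that user's own feature list.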
Ground Truth
On Twitter, each company's official account groups its friends using the "list" function
provided by Twitter. We therefore treat the lists created by the ego (the company account)
as the ground truth groups for evaluating our algorithm. Please note that these lists
overlap: a member may belong to more than one list.
Data Demonstration/Example
Each data set is formatted as four files: w, OGND, fea and memberIndex. See diagram 2 for
an example.
Edges file
In the w file, we use a matrix to store the edge information among users. If member 5 has
an edge with member 1, then the value of w[5][1] is 1; otherwise the value is 0.
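As an illustration of this format, the sketch below fills such a matrix from a list of "follows" pairs. It assumes that the direction of an edge is kept (row follows column) and uses zero-based indices, whereas the Matlab files are one-based; the helper name and the third user are hypothetical.

```python
def build_edge_matrix(follows, member_index):
    """Return w with w[i][j] = 1 if member i follows member j, and 0 otherwise."""
    n = len(member_index)
    w = [[0] * n for _ in range(n)]
    for follower, followee in follows:
        w[member_index[follower]][member_index[followee]] = 1
    return w

# Hypothetical index mapping (the real files store anonymised ids, not names).
member_index = {"RuiqiUal": 0, "NAB": 1, "user_c": 2}
follows = [("RuiqiUal", "NAB"), ("user_c", "RuiqiUal")]

w = build_edge_matrix(follows, member_index)
```

Here `w[0][1]` is 1 because RuiqiUal follows NAB, while `w[1][0]` stays 0 because NAB does not follow back in this toy example.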
MemberIndex file
This file contains the index mapping of each member. Each member name is replaced by a
unique id for privacy reasons.
Feature Matching file
As described in the Data Pre-processing part, each data set has 1,000 features, and we also
collected each user's features. In this file, we list the data set's 1,000 features horizontally
and check whether each feature appears in each user's feature list; if so, we mark 1,
otherwise the value is 0.
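A minimal sketch of how such a binary user-by-feature matrix can be built is shown below; the function name and the three-feature toy vocabulary are assumptions made for the example (the real data sets use 1,000 features).

```python
def feature_matrix(user_tokens, dataset_features):
    """One row per user, one column per data set feature; 1 if the user has the feature."""
    column = {feature: j for j, feature in enumerate(dataset_features)}
    rows = []
    for tokens in user_tokens:
        row = [0] * len(dataset_features)
        for token in tokens:
            if token in column:  # tokens outside the data set's feature list are ignored
                row[column[token]] = 1
        rows.append(row)
    return rows

# Hypothetical data: three data set features and three users' own feature lists.
dataset_features = ["#nba", "@katyperry", "#ff"]
user_tokens = [["#nba", "#ff"], ["@eddie"], ["@katyperry", "#nba", "#nba"]]

fea = feature_matrix(user_tokens, dataset_features)
```

Repeated tokens still give a single 1, since the matrix records presence rather than frequency.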
Ground Truth file
The OGND file uses the Matlab cell data format to store which lists each member belongs
to. For example, if the value of cell 666 is 7,10, then member 666 belongs to list No. 7 and
list No. 10.
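For evaluation it is convenient to turn this member-to-lists mapping into one set of members per list. The sketch below shows that inversion on a plain dictionary; reading the actual Matlab file is not shown, and the members other than 666 are hypothetical.

```python
from collections import defaultdict

def groups_from_membership(membership):
    """Invert {member: [list ids]} into {list id: set of members}."""
    groups = defaultdict(set)
    for member, list_ids in membership.items():
        for list_id in list_ids:
            groups[list_id].add(member)
    return dict(groups)

# Member 666 belongs to lists 7 and 10, as in the example above.
membership = {666: [7, 10], 667: [7], 668: []}
groups = groups_from_membership(membership)
```

Because a member can appear in several lists, the resulting ground truth groups may overlap, as lists 7 and 10 do here.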
diagram 2
Results
1. First, we set the majority of the parameters of each algorithm (including our algorithm)
to their default values; for example, alpha, beta and sigma in our algorithm are set to zero.
We set the number of clusters equal to the number of circles in the ground truth. See
table 3 and table 4.
table 3
Ber_loss          K-Means   BigClam   AgmFit    Censa     FNMTF     GNMF      DRCC      Nips
ABCNews           0.41155   0.3205    0.2469    0.2983    0.41021   0.42538   0.40473   0.46115
NBA               0.33789   0.1085    0.1158    0.1048    0.33653   0.32418   0.22761   0.36496
TwitterAU         0.33054   0.3119    0.2033    0.31      0.30831   0.25614   0.27939   0.449949
SonyPictures      0.27829   0.3442    0.4453    0.3384    0.35359   0.43076   0.26848   0.299225
LiberalAus        0.47403   0.1567    0.1129    0.1842    0.42942   0.39116   0.43932   0.367568
AustralianLabor   0.34216   0.5898    0.1972    0.279     0.37007   0.34124   0.34033   0.312139
WhiteHouse        0.34768   0.36      0.2679    0.3631    0.36514   0.40739   0.39263   0.299379
MercedesBenz      0.34216   0.234     0.231     0.2766    0.37007   0.34124   0.34033   0.338028
Techreview        0.40109   0.2702    0.319     0.2832    0.41011   0.39285   0.37492   0.347465
Cambridge_Uni     0.43657   0.2486    0.3471    0.2514    0.39805   0.45383   0.40838   0.400756
The_Nationals     0.43981   0.0769    0.0577    0.0962    0.26686   0.23512   0.3122    0.209877
Greens            0.42582   0.2953    0.3302    0.3008    0.39911   0.40076   0.39257   0.395382
MGM_Studios       0.32609   0.2391    0.2029    0.2094    0.37502   0.46225   0.20839   0.375
BBCNews           0.41512   0.1622    0.251     0.1384    0.42348   0.44836   0.40158   0.399912
Average           0.3719    0.2739    0.24733   0.24996   0.36819   0.3784    0.3347    0.35794
table 4
F1 Score          K-Means   BigClam   AgmFit    Censa     FNMTF     GNMF      DRCC      Nips
ABCNews           0.11045   0.2934    0.3745    0.3446    0.14475   0.15895   0.13789   0.189125
NBA               0.40969   0.7859    0.7689    0.7908    0.4029    0.3858    0.57537   0.428098
TwitterAU         0.36574   0.3779    0.581     0.3836    0.34456   0.42203   0.42797   0.24676
SonyPictures      0.55095   0.3295    0.1137    0.3372    0.39705   0.35035   0.5575    0.34763
LiberalAus        0.3136    0.6548    0.6918    0.6117    0.34897   0.41564   0.3167    0.621857
AustralianLabor   0.39616   0.3717    0.6021    0.4422    0.34898   0.42824   0.43044   0.631624
WhiteHouse        0.37887   0.2847    0.4586    0.2785    0.36086   0.35184   0.32832   0.504132
MercedesBenz      0.39616   0.5376    0.5023    0.4507    0.34898   0.42824   0.43044   0.389616
Techreview        0.29113   0.4096    0.3011    0.3804    0.2819    0.32963   0.32258   0.313859
Cambridge_Uni     0.17449   0.5054    0.3081    0.4991    0.20433   0.13644   0.21198   0.24009
The_Nationals     0.34444   0.8519    0.8889    0.8148    0.57268   0.60158   0.55732   0.777778
Greens            0.20823   0.4409    0.3738    0.4144    0.27619   0.22689   0.24314   0.350729
MGM_Studios       0.45046   0.5147    0.5686    0.5686    0.30377   0.21378   0.54318   0.369882
BBCNews           0.25203   0.6577    0.4803    0.7124    0.2497    0.17784   0.28105   0.363441
Average           0.3329    0.4893    0.4863    0.49363   0.32589   0.32397   0.3882    0.39636
2. Then we let the number of circles be detected automatically. Please note that although
the Nips paper claims that the algorithm can automatically detect the number of circles, we
did not find this function in the published source code, so we have no results for Nips in
this part. See table 5 and table 6.
table 5
Auto Ber_loss     BigClam   AgmFit    Censa
ABCNews           0.3029    0.4837    0.4133
NBA               0.1253    0.3676    0.1048
TwitterAU         0.3128    0.4665    0.2494
SonyPictures      0.3756    0.4609    0.4142
LiberalAus        0.2332    0.3325    0.3339
AustralianLabor   0.2287    0.3319    0.2296
WhiteHouse        0.2814    0.4474    0.3149
MercedesBenz      0.2272    0.4025    0.2946
Techreview        0.2965    0.435     0.3421
Cambridge_Uni     0.2643    0.4214    0.2415
The_Nationals     0.1325    0.2116    0.228
Greens            0.2946    0.4399    0.3268
MGM_Studios       0.269     0.2612    0.2633
BBCNews           0.2326    0.1384    0.1384
Average           0.1374    0.2572    0.37446
table 6
Auto F1 Score     BigClam   AgmFit    Censa
ABCNews           0.317     0.038     0.1706
NBA               0.7287    0.2701    0.7908
TwitterAU         0.376     0.0716    0.4404
SonyPictures      0.2571    0.0853    0.1667
LiberalAus        0.2252    0.4167    0.406
AustralianLabor   0.4873    0.3449    0.4781
WhiteHouse        0.2436    0.1149    0.2947
MercedesBenz      0.5282    0.2019    0.4049
Techreview        0.3711    0.1376    0.2549
Cambridge_Uni     0.4527    0.1588    0.5205
The_Nationals     0.5494    0.5926    0.4026
Greens            0.4498    0.1374    0.3582
MGM_Studios       0.4559    0.4853    0.4265
BBCNews           0.4472    0.1787    0.7124
Average           0.43569   0.2167    0.41702
3. We also vary the number of clusters over 3, 5, 7, 9 and 11 to track how the quality of the
results changes with the number of circles. We then calculated the average performance of
each algorithm over all data sets and drew a trend line chart for each algorithm.
The following charts show how each algorithm's average BER and F1 score over all data
sets change as the number of detected circles varies (K is the number of detected circles).
Average BER_Loss Value for Varying K
Average F1 Score Value for Varying K
When we set the number of circles to a small value, say three or five, AgmFit and BigClam
ignore many nodes, but our algorithm keeps all nodes and clusters them into different
circles.
Some small data sets, such as 'The_Nationals' and 'NBA', cannot be divided into 9 or 11
different circles by some algorithms even when we manually set K to 9 or 11. For the
other, bigger data sets, the automatic circle number detection function worked well.
Nips performs better when the number of circles is small. Generally speaking, the Stanford
algorithm performs best when K is three.
Conclusion
There are three key findings of this literature review:
1. There are two basic approaches to detecting social circles in social media. One is
relationship network structure analysis and the other is node (user) characteristic analysis.
2. There are three main categories of algorithms for detecting the groups:
The first consists of methods based purely on graph structure analysis. These can cluster
social circles almost perfectly if the graph structure of the data is well organized and clear.
However, they usually find it hard to detect the overlapping and nested parts accurately and
may perform badly when the network structure of the data set is complex and confused.
The second consists of methods based only on node characteristic analysis. These
algorithms ignore the network structure and depend mainly on the similarity among nodes
(users) to do the clustering. They may perform well when the data set has complete and
sufficient user profile information, but may not work well when the data set lacks such
information, as is the case for data from Twitter.
The third consists of methods that consider both graph structure and node attributes. This
kind of algorithm may achieve the best overall performance compared with the other two
kinds mentioned above. However, for some data sets with a very clear graph structure,
algorithms based purely on graph structure analysis may beat this kind of method, as the
node information may confuse an already clean circle classification and lead to poor results.
3. In future work, I will fill the gap of functional group detection, which explains each
detected group's function and meaning by clustering the content of these groups.
Reference List
Yang, J., and Leskovec, J. 2013, 'Overlapping community detection at scale: a nonnegative
matrix factorization approach', In Proceedings of the sixth ACM international conference on
Web search and data mining, ACM , pp. 587-596.
Lee, D.D. & Seung H.S. 2001, 'Algorithms for non-negative matrix factorization', In Advances
in neural information processing systems, pp. 556-562.
Gu, Q., and Zhou, J. 2009, 'Co-clustering on manifolds' In Proceedings of the 15th ACM
SIGKDD international conference on Knowledge discovery and data mining, ACM, pp. 359-368.
Yang, J., and Leskovec, J. 2012, 'Community-affiliation graph model for overlapping network
community detection' In Data Mining (ICDM), 2012 IEEE 12th International Conference on,
IEEE , pp. 1170-1175.
Mcauley, J., and Leskovec J. 2014, 'Discovering social circles in ego networks', ACM
Transactions on Knowledge Discovery from Data (TKDD) 8, no. 1, pp. 4.
Wang, H., Nie, F., Huang, H., & Makedon, F. 2011, 'Fast nonnegative matrix tri-factorization
for large-scale data co-clustering' In IJCAI Proceedings-International Joint Conference on
Artificial Intelligence, vol. 22, no. 1, pp. 1553.
Yang, J., Julian M., & Leskovec J. 2013, 'Community detection in networks with node
attributes' In Data Mining (ICDM), 2013 IEEE 13th International Conference on, IEEE, pp.
1151-1156.
MacKay, D.J.C. 2003, Information theory, inference, and learning algorithms. Vol. 7.
Cambridge university press, Cambridge.
Yang, Y.H. 2005, 'Information theory, inference, and learning algorithms', Journal of the
American Statistical Association, Vol. 100, no. 472, pp. 1461-1462.
Leskovec, J, & Julian, J.M. 2012, 'Learning to discover social circles in ego networks',
In Advances in neural information processing systems, pp. 539-547.
Cai, D., He, X. Wu, X. & Han, J. 2008, 'Non-negative matrix factorization on manifold', In Data
Mining, ICDM'08. Eighth IEEE International Conference on, IEEE, pp. 63-72.
Ding, C.H.Q., He, F. & Simon, D.H. 2005, 'On the Equivalence of Nonnegative Matrix
Factorization and Spectral Clustering' In SDM, vol. 5, pp. 606-610.
Xu, Z.Q., Ke, Y.P., Wang, Y., Cheng, H., & Cheng, J. 2012, 'A model-based approach to
attributed graph clustering' In Proceedings of the 2012 ACM SIGMOD International
Conference on Management of Data, ACM , pp. 505-516.
Yang, Z., Cheng, H., & Yu, X.J. 2009, "Graph clustering based on structural/attribute
similarities." Proceedings of the VLDB Endowment 2, no. 1, pp. 718-729.
Barabási, A.L., Natali G., & Joseph L. 2011, 'Network medicine: a network-based approach to
human disease.', Nature Reviews Genetics, no. 1, pp. 56-68.
Maier, M., Matthias, H., & Ulrike, V.L. 2007, 'Cluster identification in nearest-neighbor
graphs' In Algorithmic Learning Theory, pp. 196-210.
Kuang, D., Haesun, P., & Ding, C. 2012, 'Symmetric Nonnegative Matrix Factorization for
Graph Clustering' In SDM, vol. 12, pp. 106-117.
Reagans, R., & Bill M. 2003, 'Network structure and knowledge transfer: The effects of
cohesion and range' Administrative science quarterly, no. 2, pp. 240-267.