Download Models for Community Detection using Data Mining

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
International Journal of Computer Science and Electronics Engineering (IJCSEE) Volume 1, Issue 4 (2013) ISSN 2320-401X; EISSN 2320-4028
Models for Community Detection
using Data Mining
Sanjeev Kumar, Prabhat Kumar, and M. P. Singh
Now we talk about taxonomy of community. So in fig. 1 as
shown below, we give a taxonomy of community based on
habit of people. Basically it is divided into two major part i.e.
malicious and non-malicious. Malicious community are those
who are indulge in unsocial work like cheat, murder, fraud
practice etc. Teaser (The people who hurts the sentiment of
user like the user who post , comment, unethical contents
related to religion (contents that manipulate the thought of
children), unethical contents related to social community(cast,
creed, colour, sex), vulgar contents.) a new malicious
community is also proposed. Non-malicious community
include a large set of element but here we present a new and
different approach of community classification that is based on
habits. So we define non-malicious community with people
having habit of playing new games and application, love new
electronic gadgets, fascinated in travel and tourism, interested
in magazine reading etc.
Abstract—In the era of globalization and technological
advancement, everybody wants to remain connected through social
networking websites, blogs, emails, web pages and mobile phones as
per their motives which may be malicious or non-malicious. This
paper explores all the different methodologies, concepts and
algorithms used to identify these kinds of community on the basis of
certain pattern, properties, structure and trends in their linkage to
extract remarkable knowledge from World Wide Web. We also give a
simple community finding technique and a broad taxonomy of
community according to habit.
Keywords— Data mining, community mining, group detection,
malicious community, non-malicious community.
I. INTRODUCTION
W
EB community detection using online repositories such
as digital libraries, user generated media (e.g. blog,
forum, review, news group, BBS etc.), and content of social
networking websites (e.g. post, comment, description etc.)
become more popular with the globalization technological
advancement. And so examining these networked data have
become an important research topic. Web community helps the
users in finding friends of similar interests, habits, providing
timely help and allowing them to share interests with each
other like every researcher of opinion mining need opinion of
different people about any topic, product, and subject etc.
Hence for the opinion he/she can request to whole community
(e.g. a community of people who are working in data mining)
to help in research. One major area in investigating such
networked data is to detect prominent communities amongst
individuals.
The detection of community has large number of
applications such as understanding the social structure of
organizations, society, their habits, characteristics etc. With
this understanding we can revolutionize the world, by
modifying large-scale networks in Internet services, or by
altering motive of member of community for example if
community member belongs to a community of criminal then
by giving physiological treatment (e.g. giving motivational
thoughts) very similar to treatment done in children’s prison.
Fig. 1 Classification of Community
There are different formulations for community detection
that extract relevant knowledge from online repositories. A
networked data set is usually represented as a graph where
individuals in the network are represented by the nodes in the
graph. The nodes are tied with each other by either directed
links or undirected links, which represent the relations among
the individuals. In addition to the links that they are incident
to, nodes are often described by certain attributes, which we
Sanjeev Kumar, National Institute of Technology Patna, India,
Email:[email protected]
Prabhat Kumar, National Institute of Technology Patna, India, Email:
[email protected]
M. P. Singh, National Institute of Technology Patna, Email: India,
[email protected]
513
International Journal of Computer Science and Electronics Engineering (IJCSEE) Volume 1, Issue 4 (2013) ISSN 2320-401X; EISSN 2320-4028
refer to as contents of the nodes. For example, when it comes
to the web pages, online blogs, or scientific papers, the
contents are usually represented by histograms of keywords; in
the network of co-authorship, the contents of nodes can be the
demographic or affiliation information of researchers.
Fig.2. Different views at a data collection: the “classical” data mining view, the social network analysis and the
network data mining view.
The concept of data mining is described here by using an
example of data shown in Fig. 2 about a group of friends of a
Facebook user. The data is arranged in table containing the
following columns: name of the friend; Magazine update;
Notification of paper; Advertisement sharing; a record of
whether these friends have been sharing software update; a
record of whether the friends indulge in research; a record
about the proximity of the Academic qualifications of the
friends. As the “Name” column contains unique identifiers, it
will be ignored, and the data mining task will be to develop a
community of the friends from list of friends with respect to
the attributes “Magazine update”, ”Notification of paper”,
”Advertisement sharing”, ”Software Updates” and “Research
work”. In an unsupervised approach, the friends will be
clustered into groups and the analyst ends up with the
description of the different groups. In this case, the analyst is
interested in predicting whether a new friend will belong to
research group or not.
The network perspective with respect to a data set is
illustrated in Fig. 2. The structural component of the data set
describes explicitly some relationships between the individual
entities in the data (example in Fig. 2, the column “Academic
Qualification” explicitly represents the relation of educational
proximity between the areas in which they belong). Social
network analysis deals with this type of data analysis.
Network models, which include the topology of the network
and the characteristics of its nodes, links, and clusters of nodes
and links, attempt to explain observed phenomena through the
interactions of individual nodes or node clusters. Historically,
sociologists have been exploring the social networks between
people in different social settings. A typical social network
analysis research scenario
involves
data
collection
through questionnaires or tables, where individuals describe
their interactions with other individuals in the same social
setting (for example, a club, school, organization, across
organizations, etc.). Collected data is then used to infer a
social network model in which nodes represent individuals and
edges represent the interactions between these individuals.
Classical social network analysis studies deal with relatively
small data sets and look at the structure of individuals in the
network, measured by such indices as centrality (which
individuals have most links, can reach many others.
Fig. 3 (a) “Black-box” input-output generalization
Are in a position to exert most influence, etc.) and
connectivity (paths between individuals or clusters of
individuals through the network).
514
International Journal of Computer Science and Electronics Engineering (IJCSEE) Volume 1, Issue 4 (2013) ISSN 2320-401X; EISSN 2320-4028
can either be set of {0, 1} for showing
link from node i to j.
existence /non-existence of link or any non-negative value
, we define link to
depends on the weight of link. For
be directional i.e. from node i to node j. LI(i) ∈V is the “Link
In” [ ]to denote the link from i to other node and LO(i)∈V is
the “Link Out” []to denote the link from other node to i. So,
} &O(i) = { j|
} are two types of set of
I(i) = {j|
link associated with node i. Consider K denote the number of
community, Z i ∈ { 1, …, K } represent variable of community
created by node i & Y i = { Y i1 , … , Y ik }represent the
membership of community of node i.
III. COMMUNITY DETECTION MODELS
Fig. 3 (b) Structure of a network model
In this section, we have listed different link model and
followed by a maximum likelihood estimation method used to
estimate the unknown parameters of the proposed model using
above assumed notation.
Circumstances in identification of Community
a) It is tough to have the right characterization of the notion
of “community” that is to be detected.
b) The entities/nodes involved are distributed in real life
applications, and hence distributed means of identification
will be desired.
c) A snapshot based dataset may not be able to capture the
real picture. What are most important lies in the local
relationships (e.g. the nature and frequency of local
interactions) between the entities/nodes?
A. Hypertext-Induced Topic Selection (HITS) Model
HITS algorithm is used basically for retrieving most
relevant documents from World Wide Web. It generates a
matrix of documents that has correlation and among these
documents correlation can be calculated using eigenvector of
the matrix such that each of these matrixes forms a
“community” of equally likely documents. It implements
iterative method for identification of principal eigenvector (as
principal community) of the matrix M ij. Where M ij should be
nonzero iff document i refers document j or, if i have a
hyperlink to j. According to the scope of vector in document’s
dimension is known as “loading” of the document on the
vector. Loading of document on the principal eigenvector of
MM’ infer as its “hub” value in the community. Since it is
exploring the interrelated online document thus we can assume
it as the starting of community detection on WWW.
Research Challenges for Community Detection
a) To understand the network’s static structures (e.g.
topologies and clusters).
b) To understand the dynamic behaviour (such as growth
factors, robustness, and functional efficiency).
II. BACKGROUND
Traditionally community is decided on the basis of
extraction of buried characteristics, features specified by the
member of the community itself or on the basis of exploring
their defined habits like the member specifies that he/she
belongs to IT profession, Research scholar etc. At that time it
is very simple to categories the people but now research reach
at a level that without explicit specification of people’s
community we try to find their community using technology.
But today finding only community like IT professional or
Research scholar is not that researcher’s work finished. Today
finding community of people who indulge in the malicious
work is the real challenging task. So for these type of
community we should mine the buried pattern of their work
that is stored in the online document as their micro-blog,
comment, post or simply we say scanning their habits and
behaviour on the different social networking website since in
this era of technological advancement everyone is using these
social networking websites.
Assumptions and notations used here for creating
mathematical formulae. All the nodes are represented by V =
{1… n}. Where these nodes represent a document stored as
records the data needed to establish
online repositories. Let
B. Probability-based Hypertext-Induced Topic Selection
(PHITS) Model
PHITS model is bit more precise than HITS for finding
relationship between documents. Since it introduces
probability Pr(i|j) for the conditional link [ Directed network
comm. Det.] . Here i and j are two nodes which have
similarity. Node i produce a link that will be close to node j so
probability of similarity is more in j with respect to all other
nodes. For calculating Pr(i|j), take a community variable z i that
gives the community membership of node i, furthermore Pr(j|i,
, where
denote the
z i ) is calculated as Pr(j|i, z i ) =
probability of link for node j given by any element in z i set.
Finally we have PHITS model after integrating z i i.e. as
follows:
(1)
PLSA [5] is using PHITS as an application for networking
of data.
515
International Journal of Computer Science and Electronics Engineering (IJCSEE) Volume 1, Issue 4 (2013) ISSN 2320-401X; EISSN 2320-4028
• c i : the weight of node i in terms of deciding thecommunity
prior Pr(k) (which will be elaboratedmomentarily).
To handle scale invariance, we normalize so that
=
Generative Process We explains Equation (4) bythe
following generative process of PPL:
• Sample a community z according to a prior
distribution 1,… K, where k is computed by
C. Popularity-based Conditional Link (PCL) Model
PCL is another conditional model that introduces popularity
of hidden variable for every node which illustrates the property
of link. This model gives the probability of linking node i with
node j among the element of LO (i) more likely with respect to
the remaining element of it. For this we include a set of hidden
variable z i and a popularity variable b i (non-negative integer)
for each node i so that as the popularity of node rises as
similarly the chance of citation by other node also increases.
Hence Pr(j|i, z i ) is given as :
• Given community z, the conditional link probabilityis given
by
Where, b j represent the popularity of node j. Now on
integrating z i yield:
(5)
There are two unique features in the above
generativeprocess:
1. Prior probability is constructed asthe weighted sum of
nodemembership’sY ik ,
(2)
PHITS model can be taken as a simplest form of PCL model
if we introduce the community-dependent popularity as shown
in [9]. It is significant to note that both conditional models
only take into account one type of links, and hence it is
incomplete for directed network community detection.
Where,c i is used to weight node i in the combination. This
construction
enforces
the
consistency
between
nodemembership’sY ik and community prior
This specific construction of community priors
alsosimplifies relation between the proposed frameworkand
some existing models for community detection.
2. In Equation (5), the two ends of link i’ jare treated
differently when modellingPr(i’ , j• |z):besides the
dependence on community membershipsY ik and Y jk ,
Pr(i’ |z) and Pr(j• |z) are modelled bya i (i.e., the
productivity of node i) and b j .(i.e., the popularity of node
j), respectively, leadingto the differentiation of the roles
played by thetwo nodes.
With the joint link probability defined in Equation (4), the
log-likelihood for links can be writtenas
D. Symmetric Joint Link (SJL) Model
SJL model gives the likelihood of “link erection” in two
nodes i and j. This is done by combining link probability Pr
(i,j) for node i and j such as follows.
(3)
Where, Àk represent the previous probability for a link that
represent
is produced in community k, whereas
the conditional probabilities for node i and j such that these are
the two ends of the link. Here we also conclude that Pr(i,j) =
Pr(j,i), even then we can’t say SJL is appropriate for directed
network community detection.
(6)
Note that we use original data s ij in the joint linkmodel
rather than normalized data
E. Popularity and Productivity Link Model(PPL)
PPL models the jointlink probability Pr(i, j), i.e., how likely
there is adirected link from node I to node j. In order
toemphasize the different roles played by I and j, we writePr(i,
j) as Pr(i ’ , j • ), denoting that node I plays therole of
producing the link, and node j plays the role ofreceiving the
as
link. Following the idea of SJL, we model
follows:
Usedin conditional link models [15, 09]. Parameters, a,
b,and c can be inferred by maximizing the log-likelihoodL(a,
b, c,y).
F. Criminal Group Detection Models
Four proprietary models [24]; each model is relying one or
more features of crime data by linking similar criminals. The
first model, GDM is based on co-offending feature (e.g. who
committed crime with whom). The second model OGDM is
based on similarity of three crime features: crime location,
crime date, and modus operandi. The third model SoDM is
based on similarity of criminals’ surname and hometown
information. The fourth model ComDM is based on all of
these five features; crime location, date, modus operandi,
criminals’ surname, and hometown.
(4)
Where
• Y ik : the probability for node i to belong to communityY k .
• a i : the productivity of node i, i.e., among all thenodes, how
likely a link is produced by node i.
• b j : the popularity of node j, i.e., among all thenodes, how
likely a link is received by node j.
516
International Journal of Computer Science and Electronics Engineering (IJCSEE) Volume 1, Issue 4 (2013) ISSN 2320-401X; EISSN 2320-4028
[19] R. M. Nallapati, A. Ahmed, E. P. Xing, and W. W. Cohen. Joint latent
topic models for text and citations. In Proc. of the 14th ACM
International Conference on Knowledge Discovery and Data Mining,
2008.
[20] L. Dietz, S. Bickel, and T. Scheffer. Unsupervised prediction of citation
influences. In Proc. of the 24th International Conference on Machine
Learning, 2007.
[21] E. Erosheva, S. Fienberg, and J. Lafferty. Mixed membership models of
scientific publications. In Proceedings of the National Academy of
Science 2004.
[22] J. M. Hofman and C. H.Wiggins. A Bayesian approach to network
modularity. Phy. Rev. Letters, 100, 2008.
[23] Larson, R. R. (1996). Bibliometrics of the World Wide Web: An
exploratory analysis of the intellectual structure of cyberspace
(Technical Report TRCS96-05). Computer Science Department,
University of California, Santa Barbara.
[24] Fatih OZGUL, Murat GOK, Zeki ERDEM, Yakup OZAL, “Detecting
criminal networks: SNA models are compared to proprietary models”.
IV. CONCLUSION
Community mining is an emerging and rapidly growing area
of data mining that is focused on finding patterns in data by
exploiting and explicitly modelling the links among the data
instances. We have surveyed several models for community
detection starting from the HITS models that implements
eigenvector to detect principal community, then PHITS
models that focuses on the conditional link probability, next is
PCL that introduce the latent variable popularity for each
node, then SJL that models the link structure by the joint
probability, then PPL that implements both popularity and
productivity of link in community detection and finally there
are four crime specific group detection models.
REFERENCES
[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16]
[17]
[18]
John Galloway and Simeon J. Simoff(2007).“Network Data Mining:
Methods and Techniques for Discovering Deep Linkage between
Attributes”.
W. Ren, G. Yan, X. Liao, and L. Xiao. Simple probabilisticalgorithm
for detecting community structure.Physical Review E, 79, 2009.
T. Yang, R. Jin, Y. Chi, and S. Zhu. Combining linkand content for
community detection: A discriminativeapproach. In Proc. of the 15th
ACM InternationalConference on Knowledge Discovery and Data
Mining,2009.
K. Yu, S. Yu, and V. Tresp.Soft clustering on graphs.InProc. of
19thAdvancesin Neural Information ProcessingSystems, 2005
T. Holfmann. Probabilistic latent semantic indexing. In Proc. Of the
22nd International Conference on Research and Development in
Information Retrieval, 1999.
Bharat, K., &Henzinger, M. (1998). Improved algorithms for topic
distillations in a hyperlinked environment,. Twenty first Associations
for Computing Machinery Special Interest Group on Information
Retrieval Conference on Research and Development in Information
Retrieval.
Golub, G,.& Loan, C.V.(1989). Matrix computations. Johns Hopkins
University Press.
R. Gorsuch, R. (1983). Factor analysis. Lawrence Erlbaum Associates.
T. Yang, R. Jin, Y. Chi, and S. Zhu. Combining link and content for
community detection: A discriminative approach. In Proc. of the 15th
ACM International Conference on Knowledge Discovery and Data
Mining, 2009.
Fayyad, U. M., G. Piatetsky-Shapiro, et al. (1996): From data mining to
knowledge discovery: An overview. Advances in Knowledge Discovery
and Data Mining. U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R.
Uthurusamy. Cambridge, Massachusetts, AAAI Press/The MIT Press, 134.
Fayyad, U. M. (2003): "Editorial." ACM SIGKDD Explorations 5(2), 13.
Dunham, M. H. (2002): Data Mining: Introductory and Advanced
Topics, Prentice Hall.
Han, J. and M. Kamber (2001): Data Mining: Concepts and Techniques.
San Francisco, CA, Morgan Kaufmann Publishers.
Ramoni, M. F. and P. Sebastiani (2003): Bayesian methods for
intelligent data analysis. Intelligent Data Analysis: An Introduction. M.
Berthold and D. J. Hand. New York, NY, Springer, 131-168.
D. Cohn and H. Chang. Learning to probabilistically identify
authoritative documents. In Proc. of the 17th International Conference
on Machine Learning, 2000.
K. Yu, S. Yu, and V. Tresp. Soft clustering on graphs. In Proc. of 19th
Advances in Neural Information Processing Systems, 2005.
D. Zhou, J. Huang, and B. Sch¨olkopf. Learning from labelled and
unlabelled data on a directed graph. In Proc. of the 22nd International
Conference on Machine Learning, 2005.
E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing. Mixed
membership stochastic block models for relational data with application
to protein-protein interactions. In Proc. of the International Biometrics
Society Annual Meeting, 2006.
517