Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
International Journal of Computer Science and Electronics Engineering (IJCSEE) Volume 1, Issue 4 (2013) ISSN 2320-401X; EISSN 2320-4028 Models for Community Detection using Data Mining Sanjeev Kumar, Prabhat Kumar, and M. P. Singh Now we talk about taxonomy of community. So in fig. 1 as shown below, we give a taxonomy of community based on habit of people. Basically it is divided into two major part i.e. malicious and non-malicious. Malicious community are those who are indulge in unsocial work like cheat, murder, fraud practice etc. Teaser (The people who hurts the sentiment of user like the user who post , comment, unethical contents related to religion (contents that manipulate the thought of children), unethical contents related to social community(cast, creed, colour, sex), vulgar contents.) a new malicious community is also proposed. Non-malicious community include a large set of element but here we present a new and different approach of community classification that is based on habits. So we define non-malicious community with people having habit of playing new games and application, love new electronic gadgets, fascinated in travel and tourism, interested in magazine reading etc. Abstract—In the era of globalization and technological advancement, everybody wants to remain connected through social networking websites, blogs, emails, web pages and mobile phones as per their motives which may be malicious or non-malicious. This paper explores all the different methodologies, concepts and algorithms used to identify these kinds of community on the basis of certain pattern, properties, structure and trends in their linkage to extract remarkable knowledge from World Wide Web. We also give a simple community finding technique and a broad taxonomy of community according to habit. Keywords— Data mining, community mining, group detection, malicious community, non-malicious community. I. INTRODUCTION W EB community detection using online repositories such as digital libraries, user generated media (e.g. blog, forum, review, news group, BBS etc.), and content of social networking websites (e.g. post, comment, description etc.) become more popular with the globalization technological advancement. And so examining these networked data have become an important research topic. Web community helps the users in finding friends of similar interests, habits, providing timely help and allowing them to share interests with each other like every researcher of opinion mining need opinion of different people about any topic, product, and subject etc. Hence for the opinion he/she can request to whole community (e.g. a community of people who are working in data mining) to help in research. One major area in investigating such networked data is to detect prominent communities amongst individuals. The detection of community has large number of applications such as understanding the social structure of organizations, society, their habits, characteristics etc. With this understanding we can revolutionize the world, by modifying large-scale networks in Internet services, or by altering motive of member of community for example if community member belongs to a community of criminal then by giving physiological treatment (e.g. giving motivational thoughts) very similar to treatment done in children’s prison. Fig. 1 Classification of Community There are different formulations for community detection that extract relevant knowledge from online repositories. A networked data set is usually represented as a graph where individuals in the network are represented by the nodes in the graph. The nodes are tied with each other by either directed links or undirected links, which represent the relations among the individuals. In addition to the links that they are incident to, nodes are often described by certain attributes, which we Sanjeev Kumar, National Institute of Technology Patna, India, Email:[email protected] Prabhat Kumar, National Institute of Technology Patna, India, Email: [email protected] M. P. Singh, National Institute of Technology Patna, Email: India, [email protected] 513 International Journal of Computer Science and Electronics Engineering (IJCSEE) Volume 1, Issue 4 (2013) ISSN 2320-401X; EISSN 2320-4028 refer to as contents of the nodes. For example, when it comes to the web pages, online blogs, or scientific papers, the contents are usually represented by histograms of keywords; in the network of co-authorship, the contents of nodes can be the demographic or affiliation information of researchers. Fig.2. Different views at a data collection: the “classical” data mining view, the social network analysis and the network data mining view. The concept of data mining is described here by using an example of data shown in Fig. 2 about a group of friends of a Facebook user. The data is arranged in table containing the following columns: name of the friend; Magazine update; Notification of paper; Advertisement sharing; a record of whether these friends have been sharing software update; a record of whether the friends indulge in research; a record about the proximity of the Academic qualifications of the friends. As the “Name” column contains unique identifiers, it will be ignored, and the data mining task will be to develop a community of the friends from list of friends with respect to the attributes “Magazine update”, ”Notification of paper”, ”Advertisement sharing”, ”Software Updates” and “Research work”. In an unsupervised approach, the friends will be clustered into groups and the analyst ends up with the description of the different groups. In this case, the analyst is interested in predicting whether a new friend will belong to research group or not. The network perspective with respect to a data set is illustrated in Fig. 2. The structural component of the data set describes explicitly some relationships between the individual entities in the data (example in Fig. 2, the column “Academic Qualification” explicitly represents the relation of educational proximity between the areas in which they belong). Social network analysis deals with this type of data analysis. Network models, which include the topology of the network and the characteristics of its nodes, links, and clusters of nodes and links, attempt to explain observed phenomena through the interactions of individual nodes or node clusters. Historically, sociologists have been exploring the social networks between people in different social settings. A typical social network analysis research scenario involves data collection through questionnaires or tables, where individuals describe their interactions with other individuals in the same social setting (for example, a club, school, organization, across organizations, etc.). Collected data is then used to infer a social network model in which nodes represent individuals and edges represent the interactions between these individuals. Classical social network analysis studies deal with relatively small data sets and look at the structure of individuals in the network, measured by such indices as centrality (which individuals have most links, can reach many others. Fig. 3 (a) “Black-box” input-output generalization Are in a position to exert most influence, etc.) and connectivity (paths between individuals or clusters of individuals through the network). 514 International Journal of Computer Science and Electronics Engineering (IJCSEE) Volume 1, Issue 4 (2013) ISSN 2320-401X; EISSN 2320-4028 can either be set of {0, 1} for showing link from node i to j. existence /non-existence of link or any non-negative value , we define link to depends on the weight of link. For be directional i.e. from node i to node j. LI(i) ∈V is the “Link In” [ ]to denote the link from i to other node and LO(i)∈V is the “Link Out” []to denote the link from other node to i. So, } &O(i) = { j| } are two types of set of I(i) = {j| link associated with node i. Consider K denote the number of community, Z i ∈ { 1, …, K } represent variable of community created by node i & Y i = { Y i1 , … , Y ik }represent the membership of community of node i. III. COMMUNITY DETECTION MODELS Fig. 3 (b) Structure of a network model In this section, we have listed different link model and followed by a maximum likelihood estimation method used to estimate the unknown parameters of the proposed model using above assumed notation. Circumstances in identification of Community a) It is tough to have the right characterization of the notion of “community” that is to be detected. b) The entities/nodes involved are distributed in real life applications, and hence distributed means of identification will be desired. c) A snapshot based dataset may not be able to capture the real picture. What are most important lies in the local relationships (e.g. the nature and frequency of local interactions) between the entities/nodes? A. Hypertext-Induced Topic Selection (HITS) Model HITS algorithm is used basically for retrieving most relevant documents from World Wide Web. It generates a matrix of documents that has correlation and among these documents correlation can be calculated using eigenvector of the matrix such that each of these matrixes forms a “community” of equally likely documents. It implements iterative method for identification of principal eigenvector (as principal community) of the matrix M ij. Where M ij should be nonzero iff document i refers document j or, if i have a hyperlink to j. According to the scope of vector in document’s dimension is known as “loading” of the document on the vector. Loading of document on the principal eigenvector of MM’ infer as its “hub” value in the community. Since it is exploring the interrelated online document thus we can assume it as the starting of community detection on WWW. Research Challenges for Community Detection a) To understand the network’s static structures (e.g. topologies and clusters). b) To understand the dynamic behaviour (such as growth factors, robustness, and functional efficiency). II. BACKGROUND Traditionally community is decided on the basis of extraction of buried characteristics, features specified by the member of the community itself or on the basis of exploring their defined habits like the member specifies that he/she belongs to IT profession, Research scholar etc. At that time it is very simple to categories the people but now research reach at a level that without explicit specification of people’s community we try to find their community using technology. But today finding only community like IT professional or Research scholar is not that researcher’s work finished. Today finding community of people who indulge in the malicious work is the real challenging task. So for these type of community we should mine the buried pattern of their work that is stored in the online document as their micro-blog, comment, post or simply we say scanning their habits and behaviour on the different social networking website since in this era of technological advancement everyone is using these social networking websites. Assumptions and notations used here for creating mathematical formulae. All the nodes are represented by V = {1… n}. Where these nodes represent a document stored as records the data needed to establish online repositories. Let B. Probability-based Hypertext-Induced Topic Selection (PHITS) Model PHITS model is bit more precise than HITS for finding relationship between documents. Since it introduces probability Pr(i|j) for the conditional link [ Directed network comm. Det.] . Here i and j are two nodes which have similarity. Node i produce a link that will be close to node j so probability of similarity is more in j with respect to all other nodes. For calculating Pr(i|j), take a community variable z i that gives the community membership of node i, furthermore Pr(j|i, , where denote the z i ) is calculated as Pr(j|i, z i ) = probability of link for node j given by any element in z i set. Finally we have PHITS model after integrating z i i.e. as follows: (1) PLSA [5] is using PHITS as an application for networking of data. 515 International Journal of Computer Science and Electronics Engineering (IJCSEE) Volume 1, Issue 4 (2013) ISSN 2320-401X; EISSN 2320-4028 • c i : the weight of node i in terms of deciding thecommunity prior Pr(k) (which will be elaboratedmomentarily). To handle scale invariance, we normalize so that = Generative Process We explains Equation (4) bythe following generative process of PPL: • Sample a community z according to a prior distribution 1,… K, where k is computed by C. Popularity-based Conditional Link (PCL) Model PCL is another conditional model that introduces popularity of hidden variable for every node which illustrates the property of link. This model gives the probability of linking node i with node j among the element of LO (i) more likely with respect to the remaining element of it. For this we include a set of hidden variable z i and a popularity variable b i (non-negative integer) for each node i so that as the popularity of node rises as similarly the chance of citation by other node also increases. Hence Pr(j|i, z i ) is given as : • Given community z, the conditional link probabilityis given by Where, b j represent the popularity of node j. Now on integrating z i yield: (5) There are two unique features in the above generativeprocess: 1. Prior probability is constructed asthe weighted sum of nodemembership’sY ik , (2) PHITS model can be taken as a simplest form of PCL model if we introduce the community-dependent popularity as shown in [9]. It is significant to note that both conditional models only take into account one type of links, and hence it is incomplete for directed network community detection. Where,c i is used to weight node i in the combination. This construction enforces the consistency between nodemembership’sY ik and community prior This specific construction of community priors alsosimplifies relation between the proposed frameworkand some existing models for community detection. 2. In Equation (5), the two ends of link i’ jare treated differently when modellingPr(i’ , j• |z):besides the dependence on community membershipsY ik and Y jk , Pr(i’ |z) and Pr(j• |z) are modelled bya i (i.e., the productivity of node i) and b j .(i.e., the popularity of node j), respectively, leadingto the differentiation of the roles played by thetwo nodes. With the joint link probability defined in Equation (4), the log-likelihood for links can be writtenas D. Symmetric Joint Link (SJL) Model SJL model gives the likelihood of “link erection” in two nodes i and j. This is done by combining link probability Pr (i,j) for node i and j such as follows. (3) Where, Àk represent the previous probability for a link that represent is produced in community k, whereas the conditional probabilities for node i and j such that these are the two ends of the link. Here we also conclude that Pr(i,j) = Pr(j,i), even then we can’t say SJL is appropriate for directed network community detection. (6) Note that we use original data s ij in the joint linkmodel rather than normalized data E. Popularity and Productivity Link Model(PPL) PPL models the jointlink probability Pr(i, j), i.e., how likely there is adirected link from node I to node j. In order toemphasize the different roles played by I and j, we writePr(i, j) as Pr(i ’ , j • ), denoting that node I plays therole of producing the link, and node j plays the role ofreceiving the as link. Following the idea of SJL, we model follows: Usedin conditional link models [15, 09]. Parameters, a, b,and c can be inferred by maximizing the log-likelihoodL(a, b, c,y). F. Criminal Group Detection Models Four proprietary models [24]; each model is relying one or more features of crime data by linking similar criminals. The first model, GDM is based on co-offending feature (e.g. who committed crime with whom). The second model OGDM is based on similarity of three crime features: crime location, crime date, and modus operandi. The third model SoDM is based on similarity of criminals’ surname and hometown information. The fourth model ComDM is based on all of these five features; crime location, date, modus operandi, criminals’ surname, and hometown. (4) Where • Y ik : the probability for node i to belong to communityY k . • a i : the productivity of node i, i.e., among all thenodes, how likely a link is produced by node i. • b j : the popularity of node j, i.e., among all thenodes, how likely a link is received by node j. 516 International Journal of Computer Science and Electronics Engineering (IJCSEE) Volume 1, Issue 4 (2013) ISSN 2320-401X; EISSN 2320-4028 [19] R. M. Nallapati, A. Ahmed, E. P. Xing, and W. W. Cohen. Joint latent topic models for text and citations. In Proc. of the 14th ACM International Conference on Knowledge Discovery and Data Mining, 2008. [20] L. Dietz, S. Bickel, and T. Scheffer. Unsupervised prediction of citation influences. In Proc. of the 24th International Conference on Machine Learning, 2007. [21] E. Erosheva, S. Fienberg, and J. Lafferty. Mixed membership models of scientific publications. In Proceedings of the National Academy of Science 2004. [22] J. M. Hofman and C. H.Wiggins. A Bayesian approach to network modularity. Phy. Rev. Letters, 100, 2008. [23] Larson, R. R. (1996). Bibliometrics of the World Wide Web: An exploratory analysis of the intellectual structure of cyberspace (Technical Report TRCS96-05). Computer Science Department, University of California, Santa Barbara. [24] Fatih OZGUL, Murat GOK, Zeki ERDEM, Yakup OZAL, “Detecting criminal networks: SNA models are compared to proprietary models”. IV. CONCLUSION Community mining is an emerging and rapidly growing area of data mining that is focused on finding patterns in data by exploiting and explicitly modelling the links among the data instances. We have surveyed several models for community detection starting from the HITS models that implements eigenvector to detect principal community, then PHITS models that focuses on the conditional link probability, next is PCL that introduce the latent variable popularity for each node, then SJL that models the link structure by the joint probability, then PPL that implements both popularity and productivity of link in community detection and finally there are four crime specific group detection models. REFERENCES [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] John Galloway and Simeon J. Simoff(2007).“Network Data Mining: Methods and Techniques for Discovering Deep Linkage between Attributes”. W. Ren, G. Yan, X. Liao, and L. Xiao. Simple probabilisticalgorithm for detecting community structure.Physical Review E, 79, 2009. T. Yang, R. Jin, Y. Chi, and S. Zhu. Combining linkand content for community detection: A discriminativeapproach. In Proc. of the 15th ACM InternationalConference on Knowledge Discovery and Data Mining,2009. K. Yu, S. Yu, and V. Tresp.Soft clustering on graphs.InProc. of 19thAdvancesin Neural Information ProcessingSystems, 2005 T. Holfmann. Probabilistic latent semantic indexing. In Proc. Of the 22nd International Conference on Research and Development in Information Retrieval, 1999. Bharat, K., &Henzinger, M. (1998). Improved algorithms for topic distillations in a hyperlinked environment,. Twenty first Associations for Computing Machinery Special Interest Group on Information Retrieval Conference on Research and Development in Information Retrieval. Golub, G,.& Loan, C.V.(1989). Matrix computations. Johns Hopkins University Press. R. Gorsuch, R. (1983). Factor analysis. Lawrence Erlbaum Associates. T. Yang, R. Jin, Y. Chi, and S. Zhu. Combining link and content for community detection: A discriminative approach. In Proc. of the 15th ACM International Conference on Knowledge Discovery and Data Mining, 2009. Fayyad, U. M., G. Piatetsky-Shapiro, et al. (1996): From data mining to knowledge discovery: An overview. Advances in Knowledge Discovery and Data Mining. U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy. Cambridge, Massachusetts, AAAI Press/The MIT Press, 134. Fayyad, U. M. (2003): "Editorial." ACM SIGKDD Explorations 5(2), 13. Dunham, M. H. (2002): Data Mining: Introductory and Advanced Topics, Prentice Hall. Han, J. and M. Kamber (2001): Data Mining: Concepts and Techniques. San Francisco, CA, Morgan Kaufmann Publishers. Ramoni, M. F. and P. Sebastiani (2003): Bayesian methods for intelligent data analysis. Intelligent Data Analysis: An Introduction. M. Berthold and D. J. Hand. New York, NY, Springer, 131-168. D. Cohn and H. Chang. Learning to probabilistically identify authoritative documents. In Proc. of the 17th International Conference on Machine Learning, 2000. K. Yu, S. Yu, and V. Tresp. Soft clustering on graphs. In Proc. of 19th Advances in Neural Information Processing Systems, 2005. D. Zhou, J. Huang, and B. Sch¨olkopf. Learning from labelled and unlabelled data on a directed graph. In Proc. of the 22nd International Conference on Machine Learning, 2005. E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing. Mixed membership stochastic block models for relational data with application to protein-protein interactions. In Proc. of the International Biometrics Society Annual Meeting, 2006. 517