Eight Friends Are Enough:
Social Graph Approximation via Public Listings
Joseph Bonneau, Jonathan Anderson, Ross Anderson and Frank Stajano
University of Cambridge Computer Laboratory
{joseph.bonneau, jonathan.anderson, ross.anderson, frank.stajano}@cl.cam.ac.uk
ABSTRACT
The popular social networking website Facebook exposes a "public view" of user profiles to search engines which includes eight of the user's friendship links. We examine what interesting properties of the complete social graph can be inferred from this public view. In experiments on real social network data, we were able to accurately approximate the degree and centrality of nodes, compute small dominating sets, find short paths between users, and detect community structure. This work demonstrates that it is difficult to safely reveal limited information about a social network.

Categories and Subject Descriptors
E.1 [Data Structures]: Graphs and networks; K.6.5 [Management of Computing and Information Systems]: Security and Protection; F.2.1 [Analysis of Algorithms and Problem Complexity]: Numerical Algorithms and Problems; K.4.1 [Computers and Society]: Public Policy Issues: Privacy

General Terms
Security, Algorithms, Experimentation, Measurement, Theory, Legal Aspects

Keywords
Social networks, Privacy, Web crawling, Data breaches, Graph theory

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SNS '09 Nuremberg, Germany
Copyright 2009 ACM 978-1-60558-463-8 ...$5.00.

1. INTRODUCTION

The proliferation of online social networking services has entrusted massive silos of sensitive personal information to social network operators. Privacy concerns have attracted considerable attention from the media, privacy advocates and the research community. Most of the focus has been on personal data privacy: researchers and operators have attempted to fine-tune access control mechanisms to prevent the accidental leakage of embarrassing or incriminating information to third parties.

A less studied problem is that of social graph privacy: preventing data aggregators from reconstructing large portions of the social graph, composed of users and their friendship links. Knowing who a person's friends are is valuable information to marketers, employers, credit rating agencies, insurers, spammers, phishers, police, and intelligence agencies, but protecting the social graph is more difficult than protecting personal data. Personal data privacy can be managed individually by users, while information about a user's place in the social graph can be revealed by any of the user's friends.

1.1 Facebook and Public Listings

Facebook is the world's largest social network, claiming over 175 million active users, making it an interesting case study for privacy. Compared to other social networking platforms, Facebook is known for having relatively accurate user profiles, as people primarily use Facebook to represent their real-world persona [10]. Thus, Facebook is often at the forefront of criticism about online privacy [2, 17]. In September 2007, Facebook started making public search listings available to those not logged in to the site; an example is shown in Figure 1. These listings are designed to encourage visitors to join by showcasing that many of their friends are already members.

Originally, public listings included a user's name, photograph, and 10 friends. Showing a new random set of friends on each request is clearly a privacy issue, as this allows a web spider to repeatedly fetch a user's page
until it has viewed all of that user's friends¹. In January 2009, public listings were reduced to 8 friends, and the selection of users now appears to be a deterministic function of the requesting IP address². Facebook has also added public listings for groups, listing 8 members in a similar fashion.

¹ Retrieving n friends by repeatedly fetching a random sample of size k is an instance of the coupon collector's problem. It will take an average of n·H_n/k = Θ((n/k) log n) queries to retrieve the complete set of friends. A set of 100 friends, for example, would require 65 queries to retrieve with k = 8.

Figure 1: An author's public Facebook profile
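The coupon-collector estimate in footnote 1 is easy to check numerically. The sketch below is illustrative only; the function names and the Monte Carlo check are ours, not the paper's:

```python
import random

def expected_queries(n, k):
    """Approximate expected number of size-k random listings needed to
    observe all n friends: n * H_n / k (coupon collector's problem)."""
    h_n = sum(1.0 / i for i in range(1, n + 1))
    return n * h_n / k

def simulate(n, k, trials=1000, seed=0):
    """Monte Carlo check: draw k distinct friends per query until all n
    friends have been seen at least once, averaging the query count."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        seen, queries = set(), 0
        while len(seen) < n:
            seen.update(rng.sample(range(n), k))
            queries += 1
        total += queries
    return total / trials

print(round(expected_queries(100, 8)))  # prints 65, matching the footnote
```

The simulated average comes out slightly below the n·H_n/k figure, since each query returns k distinct friends rather than k independent draws.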
Critically, public listings are designed to be indexed by search engines, and are not defended technically against spidering. Members-only portions of social networks, in contrast, are typically defended by rate-limiting or legally by terms of use. We have experimentally confirmed the ease of collecting public listings by writing a spidering script which retrieved ∼250,000 listings per day from a desktop computer. This suggests roughly 800 machine-days of effort are required to retrieve the public listing of every user on Facebook, easily within the ability of a serious aggregator. Thus the complete collection of public listings can be considered available to motivated parties.

1.2 Privacy Implications

Our goal is to analyse the privacy implications of easily available public listings. The existence of friendship information in public listings is troublesome in that it is not mentioned in Facebook's privacy policy, which states only that: "Your name, network names, and profile picture thumbnail will be made available to third party search engines" [1]. Furthermore, public listings are shown by default for all users. Experimenting with users in the Cambridge network, we have found that fewer than 1% of users opt out. We feel this reflects a combination of user ignorance and poorly designed privacy controls, as most users don't want public listings; the primary purpose is to encourage new members to join. Users who joined prior to the deployment of public listings may be unaware that the feature exists, as it is never encountered by members of the site whose browsers store cookies for automatic log-in.

Leaking friendship information leads to obvious privacy concerns. Social phishing attacks, in which phishing emails are forged to appear to come from a victim's friend, have been shown to be significantly more effective than traditional cold phishing [12]. Private information can be inferred directly from one's friend list if, for example, it contains multiple friends with Icelandic names. A friend list may also be checked against data retrieved from other sources, such as known supporters of a fringe political party [19].

In this work, though, we evaluate how much social graph structure is leaked by public search listings. While we are motivated by the question of what data aggregators can extract from Facebook, we consider the general question of what interesting properties of a graph one can compute from a limited view.

We start with an undirected social graph G = ⟨V, E⟩, where V is the set of vertices (users) and E is the set of edges (friendships). We produce a publicly sampled graph G_k = ⟨V, E_k⟩, where E_k ⊆ E is produced by randomly choosing k outgoing friendship edges for each node v ∈ V. We then compute some function of interest f(G) and attempt to approximate it using another function f_approx(G_k). If f(G) ≈ f_approx(G_k), we say that the public view of the graph leaks the property f. As k → ∞, all information leaks, so we are most concerned with low values of k such as the current Facebook value k = 8. It is important to note that we assume E_k is a uniformly random sample of E. This may not be the case if Facebook is specifically showing friends for marketing purposes. An interesting question for future work is whether there are public display strategies which would make it harder to approximate useful functions of the graph.
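The sampling model can be made concrete with a short sketch. The adjacency-set representation and function name here are our own assumptions, not an interface from the paper:

```python
import random

def sample_public_view(adj, k, seed=0):
    """Build the sampled graph G_k from the full graph G.

    adj maps each node to its set of friends; each node 'publishes' at
    most k of its outgoing friendship edges, chosen uniformly at random.
    Nodes with fewer than k friends leak their exact degree."""
    rng = random.Random(seed)
    out_edges = {}
    for v, friends in adj.items():
        friends = sorted(friends)  # random.sample needs a sequence
        if len(friends) <= k:
            out_edges[v] = set(friends)
        else:
            out_edges[v] = set(rng.sample(friends, k))
    return out_edges

# A 5-clique sampled with k = 2: every public listing shows 2 friends.
clique = {v: {u for u in range(5) if u != v} for v in range(5)}
g_k = sample_public_view(clique, k=2)
assert all(len(edges) == 2 for edges in g_k.values())
```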
2. EXPERIMENTAL DATA

In order to obtain a complete subgraph to perform experiments on, we crawled data from a large social network using a special application to repeatedly query user data using the network's developer API. On the social network in question, we found that more than 99% of users had their information exposed to our application, making this approach superior to crawling all profiles visible to a public network, which typically exposes only 70-90% [14]. We crawled two sub-networks consisting of students from Stanford and Harvard universities, summarised in Table 1.

    Network   #Users   Mean d   Median d   Max d
    Stanford  15,043      125         90   1,246
    Harvard   18,273      116         76   1,213

Table 1: Summary of datasets used (d = degree).

We note that our crawling method is impractical for crawling significant portions of the graph, because we were subject to rate-limiting and the number of friendship queries required grew as O(n²) in n, the number of users crawled. Crawling public listings, as search engines are encouraged to do, is a much easier task.

² Using the anonymity network Tor, we were able to retrieve different sets of friends for the same user by sending requests from different IP addresses around the globe. We noticed a correlation between the set of friends shown and the geographic location of the requesting IP address. We suspect this is a marketing feature and not a security feature: showing a visitor a group of nearby people makes the site more appealing.
3. STATISTICS
We studied five common graph metrics: vertex degree, dominating sets, betweenness centrality, shortest paths, and community detection. We found that even with a limited public view, an attacker is able to approximate them all with some success.

3.1 Degree

Information about the degree d of nodes in the network is exposed in the sampled graph, particularly for low-degree nodes. This is not surprising, because nodes with d < k must be shown with fewer than k links in their public listing, leaking their precise degree. For most social networks, however, finding the set of users with few friends is less interesting than finding the most popular users in the network, who are useful for marketing purposes.

Degree information is also leaked for users with d ≥ k, because they will likely be displayed in some of their friends' listings, meaning that there will be ≥ k edges for most nodes in the sampled graph. We can consider the sampled graph to be a directed graph with edges originating from the user whose public listing they were found in. The out-degree d_out(n) of a node n is the number of edges learned from its public listing, and the in-degree d_in(n) is the number of edges learned from other public friend listings which include n. Note that we always have d_out(n) ≤ k, but for a high-degree node in the complete graph, we expect d_in(n) > k. We can make a naive approximation of n's degree in the complete graph:

    d*(n) ≈ max[ d_out(n), d_in(n) · d̄/k ]    (1)

where d̄ is the average degree of all nodes in the complete graph (we assume this can be found from public statistics). This method works because, for a node n with d(n) friends, n will show up in each friend m's listing with probability k/d(m). Since we don't know d(m), we can approximate it with the average degree for the complete graph. This approximation, however, has an intrinsic bias, as nodes with many relatively low-degree friends will have an artificially high estimated degree. In a sampled graph, it is more meaningful for a node to have in-edges from a higher-degree node, conceptually similar to the PageRank algorithm [6]. Thus, we can improve our estimates by implementing an iterative algorithm: first initialise all estimates d_0(n) to the naive estimate of equation 1, and then iteratively update:

    d_i(n) ← max[ d_out(n), Σ_{m ∈ in-sample(n)} d_{i−1}(m)/k ]    (2)

In order to keep the degree estimates bounded, it is necessary to normalise after each iteration. We found it works well to force the average of all estimated degrees to be our estimated average d̄ for the complete graph.

To evaluate the performance of this approach, we define a cumulative degree function D(x), which is the sum of the degrees of the x highest-degree nodes. This is motivated by the likely use of degree information, to locate a set of very high-degree nodes in a network. In Figure 2, there is a comparison of the growth of D(x) when selecting the highest-degree nodes given complete graph knowledge against our naive and iterated estimation algorithms. We also plot the growth of D(x) if nodes are selected uniformly at random, which is the best possible approach with no graph information.

With k = 8, the iterative estimation method works very well; for the Harvard data set, the highest-degree 1,000 nodes identified by this method have a cumulative degree of 407,746, compared with 452,886 using the complete graph, making our approximation 90% of optimal. We similarly found 89% accuracy in the Stanford network.

Figure 2: Degree estimation methods
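Equations 1 and 2 can be sketched directly. This is our own minimal reading of the estimator; the data structures and names are assumptions, and the normalisation step simply rescales the mean estimate back to d̄:

```python
def estimate_degrees(out_edges, k, d_avg, iterations=10):
    """Estimate true degrees from the sampled directed graph.

    out_edges: {node: set of at most k sampled friends}.
    Equation 1 (naive):    d*(n) = max(d_out(n), d_in(n) * d_avg / k).
    Equation 2 (iterated): each in-edge from m counts as d_{i-1}(m) / k,
    so in-edges from higher-degree nodes weigh more, as in PageRank."""
    in_sample = {v: [] for v in out_edges}
    for v, outs in out_edges.items():
        for w in outs:
            in_sample.setdefault(w, []).append(v)
    d_out = {v: len(outs) for v, outs in out_edges.items()}
    # Equation 1: initialise with the naive estimate.
    d = {v: max(d_out.get(v, 0), len(ins) * d_avg / k)
         for v, ins in in_sample.items()}
    for _ in range(iterations):
        # Equation 2: reweight each observed in-edge by d_{i-1}(m) / k.
        d_new = {v: max(d_out.get(v, 0), sum(d[m] / k for m in ins))
                 for v, ins in in_sample.items()}
        # Normalise so the mean estimate stays at the known average.
        mean = sum(d_new.values()) / len(d_new)
        d = {v: est * d_avg / mean for v, est in d_new.items()}
    return d
```

On a toy hub-and-cycle graph, the hub's estimate remains the largest by a wide margin, which is the property the cumulative-degree experiment relies on.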
3.2 Dominating Sets

A more powerful concept than simply finding high-degree nodes is to compute a minimal dominating set for the graph. A dominating set is a set of nodes N such that its dominated set N ∪ friends(N) is the complete set of users V. In the case of a social network, this is a set of people who are, collectively, friends with every person in the network. Marketers can target a small dominating set to indirectly reach the entire network. If an attacker can compromise a small dominating set of accounts, then the entire network is visible.

Even given the complete graph, computing the minimum dominating set is an NP-complete problem [13]. The simple strategy of picking the highest-degree nodes can perform poorly in social networks, as many high-degree users with overlapping groups of friends will be selected which add relatively little coverage. However, a greedy algorithm which repeatedly selects the node which brings the most new nodes into the dominated set has been shown to perform well in practice [7].

To evaluate this greedy approach, we measured the growth of the dominated set as additional nodes are added to the dominating set, with nodes selected using either the complete graph or the sampled graph. The greedy algorithm performs very well given the sampled view. In Figure 3 we compared these two selection strategies against a degree selection strategy of always picking the next highest-degree node (given complete graph knowledge). The greedy algorithm outperforms this approach even with only sampled data, demonstrating that significant network information is leaked beyond an approximation of degrees³.

Our experiments showed that very small sets exist which dominate most of the network, followed by a long tail of diminishing returns to dominate the entire network, consistent with previous research [9]. With complete graph knowledge, we found a set of 100 nodes which dominate 65.2% of the Harvard network, while we were able to find a set of 100 nodes giving 62.1% coverage using only the sampled graph, for 95% accuracy. Similarly, for Stanford, our computed 100-node dominating set had 94% of the optimal coverage.

Figure 3: Dominating set estimation

3.3 Centrality and Message Interception

Another important metric used in the analysis of social networks is centrality, which is a measure of the importance of members of the network based on their position in the graph. We use the betweenness centrality metric, for which an efficient algorithm is described in [5]. Betweenness centrality is defined as:

    C_B(v) = Σ_{s≠v≠t∈V} σ_st(v)/σ_st    (3)

where σ_st is the number of shortest paths from node s to node t and σ_st(v) is the number of such paths which contain the node v. Thus, the higher a node's betweenness centrality, the more communication paths it is a part of.

A node with high betweenness centrality, therefore, is one which could intercept many messages travelling through the network. We simulated such interception on the Stanford sub-network to determine the probability that an attacker controlling N nodes could intercept a message sent from any node to any other node in the network via shortest-path social routes. The nodes designated as compromised are selected according to one of three policies: maximum centrality, maximum centrality for a sampled graph with k = 8, or random selection. Figure 4 shows that, while randomly selecting nodes to compromise leads to a linear increase in intercepted traffic, a selective targeting of highly central nodes yields much better results. After compromising 10% of nodes in this sub-network, an attacker could intercept 15.2% of messages if she used random selection, but using centrality to direct the choice of nodes, the attacker can intercept as much as 51.9% of messages. Even if the attacker uses only the information available from a k = 8 public listing, she can still successfully intercept 49.8% of all messages, just 4% less than she could do with full centrality information.

Figure 4: Message interception
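Equation 3 is expensive to evaluate naively; the algorithm of [5] computes it in O(|V||E|) time for unweighted graphs. Below is our own compact version of that accumulation scheme; the adjacency-set input format is an assumption:

```python
from collections import deque

def betweenness(adj):
    """Betweenness centrality C_B(v) (equation 3) for an undirected,
    unweighted graph {node: set(neighbours)}. Ordered (s, t) pairs are
    counted, so undirected scores are simply doubled."""
    cb = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s, recording shortest-path counts and predecessors.
        stack, queue = [], deque([s])
        pred = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}; sigma[s] = 1
        dist = {v: -1 for v in adj}; dist[s] = 0
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    pred[w].append(v)
        # Accumulate path dependencies in reverse BFS order.
        delta = {v: 0.0 for v in adj}
        while stack:
            w = stack.pop()
            for v in pred[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                cb[w] += delta[w]
    return cb

# On the path a - b - c, only b lies between the endpoints.
path = {'a': {'b'}, 'b': {'a', 'c'}, 'c': {'b'}}
assert betweenness(path)['b'] == 2.0  # pairs (a,c) and (c,a)
```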
³ In fact, we were surprised to find that selecting the highest-degree nodes based on our naive degree estimates from the sampled graph outperformed selecting based on actual degree! This is because a bias in favor of nodes with low-degree friends is useful for finding a dominating set, as the graph's fringes are reached more quickly.
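The greedy selection used for dominating sets in Section 3.2 is short to state in code. This is a generic sketch of the heuristic from [7] with our own input format; it can be run on either the complete or the sampled adjacency sets:

```python
def greedy_dominating_set(adj, budget=None):
    """Repeatedly pick the node bringing the most new nodes into the
    dominated set N ∪ friends(N). adj: {node: set(friends)}; budget
    optionally caps the number of selected nodes (e.g. 100)."""
    undominated = set(adj)
    chosen = []
    while undominated and (budget is None or len(chosen) < budget):
        best = max(adj, key=lambda v: len(({v} | adj[v]) & undominated))
        chosen.append(best)
        undominated -= {best} | adj[best]
    return chosen

# A star plus a separate edge: the hub and one endpoint dominate all.
adj = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}, 4: {5}, 5: {4}}
print(greedy_dominating_set(adj))  # prints [0, 4]
```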
3.4 Shortest Paths
We tested the extent that the small world property is maintained in a sampled graph by computing the minimum path length between every pair of nodes, using the Floyd-Warshall shortest path algorithm [11]. As this algorithm has a complexity of O(|V|³), where V is the set of vertices in a graph, we only calculated shortest paths for a subset of our example data. Within the 2,000 most popular Stanford users, the minimum-length shortest path was a single edge and the maximum-length path was just three edges. Figure 5 shows the effect of limiting friendship knowledge on reachability of nodes in the graph. The maximum-length shortest path increases from 3 to 7 as we reduce visible friendships to 2 per person, but the graph remains fully connected. The average path length for Stanford was 1.94 edges, and with k = 8, it increased to just 3.09.

Figure 5: Reachable nodes

3.5 Community Detection

Given a complete graph, it is possible to automatically cluster the users into highly-connected subgroups which signify natural social groupings. Community structure in social graphs could be used to market products that other members of one's social clique have purchased, or to identify dissident groups, among other applications. For our purposes, we will attempt to find communities with maximal modularity [16], a clustering metric for which efficient algorithms can partition realistically-sized social graphs with thousands of nodes. Modularity is defined to be the number of edges which exist within communities beyond those that would be expected to occur randomly:

    Q = (1/2m) · Σ_{v,w} [ A_vw − d(v)d(w)/2m ]    (4)

where m is the total number of edges in the graph, and A_vw = 1 if and only if v and w are connected.

We implemented the greedy algorithm for community detection in large graphs described in [8]. The results on our sample graph were striking, as shown in Figure 6. With a sampling parameter of k = 8, we were able to divide the graph into communities nearly as well as using complete graph knowledge. Using sampled data, we divided the graph into 18,150 communities with a modularity of 0.341, 92% as high as the optimal 18,205 communities with a modularity of 0.369.

The algorithm produced a slightly different set of communities using the sampled graph, because a well-connected social network like a college campus contains many overlapping communities, but using modularity as our metric, the communities identified were almost as significant. Community detection worked nearly as well with k = 5, but performance deteriorated significantly for k = 2 and below.

Figure 6: Community Detection
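Equation 4 can be evaluated directly for a candidate partition (the restriction to pairs in the same community is implicit in "edges which exist within communities"). A naive O(|V|²) sketch with assumed data structures:

```python
def modularity(adj, community):
    """Modularity Q (equation 4) of a partition of an undirected graph.

    adj: {node: set(friends)}; community: {node: label}. Sums
    A_vw - d(v)d(w)/2m over ordered pairs in the same community."""
    m = sum(len(friends) for friends in adj.values()) / 2.0
    q = 0.0
    for v in adj:
        for w in adj:
            if community[v] != community[w]:
                continue
            a_vw = 1.0 if w in adj[v] else 0.0
            q += a_vw - len(adj[v]) * len(adj[w]) / (2.0 * m)
    return q / (2.0 * m)

# Two disjoint triangles split into their natural communities: Q = 0.5.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1},
       3: {4, 5}, 4: {3, 5}, 5: {3, 4}}
labels = {v: (0 if v < 3 else 1) for v in adj}
assert abs(modularity(adj, labels) - 0.5) < 1e-9
```

Fast greedy agglomeration as in [8] maintains this quantity incrementally rather than recomputing it from scratch at each merge.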
4. RELATED WORK

Graph theory is a well-studied mathematical field; a good overview is available in [3]. Sociologists have long been interested in applying graph theory to social networks; the definitive work is [18]. Approximating information from a social graph when presented with incomplete knowledge, however, is far less studied. The closest example we could find in the literature was a study which evaluated community detection given only the edges from a small subset of nodes [15]. This work used a different sampling strategy, however, and only aimed to classify nodes into one of two communities in a simulated graph. Another set of experiments [9] examined strategies for placing a small set of nodes under surveillance to gain coverage of a larger network, similar to our calculation of dominating sets. Related to our problem of limiting the usefulness of observable graph data is anonymising a social network for research purposes. It has been shown that, due to the unique structure of groups within a social graph, it is often easy to de-anonymise a social graph by correlating it with known data [4].

5. CONCLUSIONS

We have examined the difficulty of computing graph statistics given a random sample of k edges from each node, and found that many interesting properties can be accurately approximated. This has disturbing implications for online privacy, since leaking graph information enables transitive privacy loss: insecure friends' profiles can be correlated to a user with a private profile. Social network operators should be aware of the importance of protecting not just user profile data, but the structure of the social graph. In particular, they shouldn't assist data aggregators by giving away public listings.

Acknowledgements

We would like to thank George Danezis, Claudia Diaz, Eiko Yoneki, and Joshua Batson, as well as anonymous reviewers, for helpful discussions about this research.

6. REFERENCES

[1] Facebook privacy policy. http://www.facebook.com/policy.php (2009).
[2] Acquisti, A., and Gross, R. Imagined Communities: Awareness, Information Sharing, and Privacy on the Facebook. In Privacy Enhancing Technologies, LNCS 4258 (2006), Springer Berlin / Heidelberg, pp. 36-58.
[3] Albert, R., and Barabási, A.-L. Statistical mechanics of complex networks. Rev. Mod. Phys. 74, 1 (Jan 2002), 47-97.
[4] Backstrom, L., Dwork, C., and Kleinberg, J. Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography. In WWW '07: Proceedings of the 16th International Conference on World Wide Web (New York, NY, USA, 2007), ACM, pp. 181-190.
[5] Brandes, U. A Faster Algorithm for Betweenness Centrality. Journal of Mathematical Sociology 25, 2 (2001), 163-177.
[6] Brin, S., and Page, L. The anatomy of a large-scale hypertextual web search engine. In Seventh International World-Wide Web Conference (WWW 1998) (1998).
[7] Chvatal, V. A Greedy Heuristic for the Set-Covering Problem. Mathematics of Operations Research 4, 3 (1979), 233-235.
[8] Clauset, A., Newman, M. E. J., and Moore, C. Finding community structure in very large networks. Physical Review E 70, 6 (Dec 2004), 066111.
[9] Danezis, G., and Wittneben, B. The economics of mass surveillance and the questionable value of anonymous communications. WEIS: Workshop on the Economics of Information Security (2006).
[10] Dwyer, C., Hiltz, S. R., and Passerini, K. Trust and privacy concern within social networking sites: A comparison of Facebook and MySpace. Americas Conference on Information Systems (2007).
[11] Floyd, R. W. Algorithm 97: Shortest path. Communications of the ACM 5, 6 (1962), 345.
[12] Jagatic, T. N., Johnson, N. A., Jakobsson, M., and Menczer, F. Social phishing. Commun. ACM 50, 10 (2007), 94-100.
[13] Karp, R. Reducibility Among Combinatorial Problems. Complexity of Computer Computations (1972).
[14] Krishnamurthy, B., and Wills, C. E. Characterizing Privacy in Online Social Networks. In Workshop on Online Social Networks, WOSN 2008 (2008), pp. 37-42.
[15] Nagaraja, S. The economics of covert community detection and hiding. WEIS: Workshop on the Economics of Information Security (2008).
[16] Newman, M. E. J., and Girvan, M. Finding and evaluating community structure in networks. Physical Review E 69 (2004), 026113.
[17] Rosenblum, D. What Anyone Can Know: The Privacy Risks of Social Networking Sites. IEEE Security & Privacy Magazine 5, 3 (2007), 40.
[18] Wasserman, S., and Faust, K. Social Network Analysis. Cambridge University Press, 1994.
[19] Xu, W., Zhou, X., and Li, L. Inferring privacy information via social relations. International Conference on Data Engineering (2008).