Eight Friends Are Enough:
Social Graph Approximation via Public Listings
Joseph Bonneau, Jonathan Anderson, Ross Anderson and Frank Stajano
University of Cambridge Computer Laboratory
{joseph.bonneau, jonathan.anderson, ross.anderson, frank.stajano}@cl.cam.ac.uk
ABSTRACT
The popular social networking website Facebook exposes a "public view" of user profiles to search engines which includes eight of the user's friendship links. We examine what interesting properties of the complete social graph can be inferred from this public view. In experiments on real social network data, we were able to accurately approximate the degree and centrality of nodes, compute small dominating sets, find short paths between users, and detect community structure. This work demonstrates that it is difficult to safely reveal limited information about a social network.

Categories and Subject Descriptors
E.1 [Data Structures]: Graphs and networks; K.6.5 [Management of Computing and Information Systems]: Security and Protection; F.2.1 [Analysis of Algorithms and Problem Complexity]: Numerical Algorithms and Problems; K.4.1 [Computers and Society]: Public Policy Issues: Privacy

General Terms
Security, Algorithms, Experimentation, Measurement, Theory, Legal Aspects

Keywords
Social networks, Privacy, Web crawling, Data breaches, Graph theory

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SNS '09 Nuremberg, Germany
Copyright 2009 ACM 978-1-60558-463-8 ...$5.00.

1. INTRODUCTION

The proliferation of online social networking services has entrusted massive silos of sensitive personal information to social network operators. Privacy concerns have attracted considerable attention from the media, privacy advocates and the research community. Most of the focus has been on personal data privacy: researchers and operators have attempted to fine-tune access control mechanisms to prevent the accidental leakage of embarrassing or incriminating information to third parties.

A less studied problem is that of social graph privacy: preventing data aggregators from reconstructing large portions of the social graph, composed of users and their friendship links. Knowing who a person's friends are is valuable information to marketers, employers, credit rating agencies, insurers, spammers, phishers, police, and intelligence agencies, but protecting the social graph is more difficult than protecting personal data. Personal data privacy can be managed individually by users, while information about a user's place in the social graph can be revealed by any of the user's friends.

1.1 Facebook and Public Listings

Facebook is the world's largest social network, claiming over 175 million active users, making it an interesting case study for privacy. Compared to other social networking platforms, Facebook is known for having relatively accurate user profiles, as people primarily use Facebook to represent their real-world persona [10]. Thus, Facebook is often at the forefront of criticism about online privacy [2, 17]. In September 2007, Facebook started making public search listings available to those not logged in to the site; an example is shown in Figure 1. These listings are designed to encourage visitors to join by showcasing that many of their friends are already members.

Originally, public listings included a user's name, photograph, and 10 friends. Showing a new random set of friends on each request is clearly a privacy issue, as this allows a web spider to repeatedly fetch a user's page
until it has viewed all of that user's friends¹. In January 2009, public listings were reduced to 8 friends, and the selection of users now appears to be a deterministic function of the requesting IP address². Facebook has also added public listings for groups, listing 8 members in a similar fashion.

¹ Retrieving n friends by repeatedly fetching a random sample of size k is an instance of the coupon collector's problem. It will take an average of n·H_n/k = Θ((n/k) log n) queries to retrieve the complete set of friends. A set of 100 friends, for example, would require 65 queries to retrieve with k = 8.

Figure 1: An author's public Facebook profile
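The coupon-collector estimate in footnote 1 is easy to check numerically. The sketch below is illustrative only; the function names and the Monte Carlo check are ours, not the paper's:

```python
import random

def expected_queries(n, k):
    """Approximate expected number of size-k random listings needed to
    observe all n friends: n * H_n / k (coupon collector's problem)."""
    h_n = sum(1.0 / i for i in range(1, n + 1))
    return n * h_n / k

def simulate(n, k, trials=1000, seed=0):
    """Monte Carlo check: draw k distinct friends per query until all n
    friends have been seen at least once, averaging the query count."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        seen, queries = set(), 0
        while len(seen) < n:
            seen.update(rng.sample(range(n), k))
            queries += 1
        total += queries
    return total / trials

print(round(expected_queries(100, 8)))  # prints 65, matching the footnote
```

The simulated average comes out slightly below the n·H_n/k figure, since each query returns k distinct friends rather than k independent draws.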
Critically, public listings are designed to be indexed by search engines, and are not defended technically against spidering. Members-only portions of social networks, in contrast, are typically defended by rate-limiting or legally by terms of use. We have experimentally confirmed the ease of collecting public listings by writing a spidering script which retrieved ∼250,000 listings per day from a desktop computer. This suggests roughly 800 machine-days of effort are required to retrieve the public listing of every user on Facebook, easily within the ability of a serious aggregator. Thus the complete collection of public listings can be considered available to motivated parties.

1.2 Privacy Implications

Our goal is to analyse the privacy implications of easily available public listings. The existence of friendship information in public listings is troublesome in that it is not mentioned in Facebook's privacy policy, which states only that: "Your name, network names, and profile picture thumbnail will be made available to third party search engines" [1]. Furthermore, public listings are shown by default for all users. Experimenting with users in the Cambridge network, we have found that fewer than 1% of users opt out. We feel this reflects a combination of user ignorance and poorly designed privacy controls, as most users don't want public listings; the primary purpose is to encourage new members to join. Users who joined prior to the deployment of public listings may be unaware that the feature exists, as it is never encountered by members of the site whose browsers store cookies for automatic log-in.

Leaking friendship information leads to obvious privacy concerns. Social phishing attacks, in which phishing emails are forged to appear to come from a victim's friend, have been shown to be significantly more effective than traditional cold phishing [12]. Private information can be inferred directly from one's friend list if, for example, it contains multiple friends with Icelandic names. A friend list may also be checked against data retrieved from other sources, such as known supporters of a fringe political party [19].

In this work, though, we evaluate how much social graph structure is leaked by public search listings. While we are motivated by the question of what data aggregators can extract from Facebook, we consider the general question of what interesting properties of a graph one can compute from a limited view.

We start with an undirected social graph G = ⟨V, E⟩, where V is the set of vertices (users) and E is the set of edges (friendships). We produce a publicly sampled graph G_k = ⟨V, E_k⟩, where E_k ⊆ E is produced by randomly choosing k outgoing friendship edges for each node v ∈ V. We then compute some function of interest f(G) and attempt to approximate it using another function f_approx(G_k). If f(G) ≈ f_approx(G_k), we say that the public view of the graph leaks the property f. As k → ∞, all information leaks, so we are most concerned with low values of k such as the current Facebook value k = 8. It is important to note that we assume E_k is a uniformly random sample of E. This may not be the case if Facebook is specifically showing friends for marketing purposes. An interesting question for future work is whether there are public display strategies which would make it harder to approximate useful functions of the graph.
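The sampling model can be made concrete with a short sketch. The adjacency-set representation and function name here are our own assumptions, not an interface from the paper:

```python
import random

def sample_public_view(adj, k, seed=0):
    """Build the sampled graph G_k from the full graph G.

    adj maps each node to its set of friends; each node 'publishes' at
    most k of its outgoing friendship edges, chosen uniformly at random.
    Nodes with fewer than k friends leak their exact degree."""
    rng = random.Random(seed)
    out_edges = {}
    for v, friends in adj.items():
        friends = sorted(friends)  # random.sample needs a sequence
        if len(friends) <= k:
            out_edges[v] = set(friends)
        else:
            out_edges[v] = set(rng.sample(friends, k))
    return out_edges

# A 5-clique sampled with k = 2: every public listing shows 2 friends.
clique = {v: {u for u in range(5) if u != v} for v in range(5)}
g_k = sample_public_view(clique, k=2)
assert all(len(edges) == 2 for edges in g_k.values())
```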
2. EXPERIMENTAL DATA

In order to obtain a complete subgraph to perform experiments on, we crawled data from a large social network using a special application to repeatedly query user data using the network's developer API. On the social network in question, we found that more than 99% of users had their information exposed to our application, making this approach superior to crawling all profiles visible to a public network, which typically exposes only 70-90% [14]. We crawled two sub-networks consisting of students from Stanford and Harvard universities, summarised in Table 1.

    Network   #Users   Mean d   Median d   Max d
    Stanford  15,043      125         90   1,246
    Harvard   18,273      116         76   1,213

Table 1: Summary of datasets used (d = degree).

We note that our crawling method is impractical for crawling significant portions of the graph, because we were subject to rate-limiting and the number of friendship queries required grew as O(n²) in n, the number of users crawled. Crawling public listings, as search engines are encouraged to do, is a much easier task.

² Using the anonymity network Tor, we were able to retrieve different sets of friends for the same user by sending requests from different IP addresses around the globe. We noticed a correlation between the set of friends shown and the geographic location of the requesting IP address. We suspect this is a marketing feature and not a security feature: showing a visitor a group of nearby people makes the site more appealing.
3. STATISTICS
We studied five common graph metrics: vertex degree, dominating sets, betweenness centrality, shortest paths, and community detection. We found that even with a limited public view, an attacker is able to approximate them all with some success.

3.1 Degree

Information about the degree d of nodes in the network is exposed in the sampled graph, particularly for low-degree nodes. This is not surprising, because nodes with d < k must be shown with fewer than k links in their public listing, leaking their precise degree. For most social networks, however, finding the set of users with few friends is less interesting than finding the most popular users in the network, who are useful for marketing purposes.

Degree information is also leaked for users with d ≥ k, because they will likely be displayed in some of their friends' listings, meaning that there will be ≥ k edges for most nodes in the sampled graph. We can consider the sampled graph to be a directed graph with edges originating from the user whose public listing they were found in. The out-degree d_out(n) of a node n is the number of edges learned from its public listing, and the in-degree d_in(n) is the number of edges learned from other public friend listings which include n. Note that we always have d_out(n) ≤ k, but for a high-degree node in the complete graph, we expect d_in(n) > k. We can make a naive approximation of n's degree in the complete graph:

    d*(n) ≈ max[ d_out(n), d_in(n) · d̄/k ]    (1)

where d̄ is the average degree of all nodes in the complete graph (we assume this can be found from public statistics). This method works because, for a node n with d(n) friends, n will show up in each friend m's listing with probability k/d(m). Since we don't know d(m), we can approximate it with the average degree for the complete graph. This approximation, however, has an intrinsic bias, as nodes with many relatively low-degree friends will have an artificially high estimated degree. In a sampled graph, it is more meaningful for a node to have in-edges from a higher-degree node, conceptually similar to the PageRank algorithm [6]. Thus, we can improve our estimates by implementing an iterative algorithm: first initialise all estimates d_0(n) to the naive estimate of equation 1, and then iteratively update:

    d_i(n) ← max[ d_out(n), Σ_{m ∈ in-sample(n)} d_{i−1}(m)/k ]    (2)

In order to keep the degree estimates bounded, it is necessary to normalise after each iteration. We found it works well to force the average of all estimated degrees to be our estimated average d̄ for the complete graph.

To evaluate the performance of this approach, we define a cumulative degree function D(x), which is the sum of the degrees of the x highest-degree nodes. This is motivated by the likely use of degree information, to locate a set of very high-degree nodes in a network. In Figure 2, there is a comparison of the growth of D(x) when selecting the highest-degree nodes given complete graph knowledge against our naive and iterated estimation algorithms. We also plot the growth of D(x) if nodes are selected uniformly at random, which is the best possible approach with no graph information.

With k = 8, the iterative estimation method works very well; for the Harvard data set, the highest-degree 1,000 nodes identified by this method have a cumulative degree of 407,746, compared with 452,886 using the complete graph, making our approximation 90% of optimal. We similarly found 89% accuracy in the Stanford network.

Figure 2: Degree estimation methods
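Equations 1 and 2 can be sketched directly. This is our own minimal reading of the estimator; the data structures and names are assumptions, and the normalisation step simply rescales the mean estimate back to d̄:

```python
def estimate_degrees(out_edges, k, d_avg, iterations=10):
    """Estimate true degrees from the sampled directed graph.

    out_edges: {node: set of at most k sampled friends}.
    Equation 1 (naive):    d*(n) = max(d_out(n), d_in(n) * d_avg / k).
    Equation 2 (iterated): each in-edge from m counts as d_{i-1}(m) / k,
    so in-edges from higher-degree nodes weigh more, as in PageRank."""
    in_sample = {v: [] for v in out_edges}
    for v, outs in out_edges.items():
        for w in outs:
            in_sample.setdefault(w, []).append(v)
    d_out = {v: len(outs) for v, outs in out_edges.items()}
    # Equation 1: initialise with the naive estimate.
    d = {v: max(d_out.get(v, 0), len(ins) * d_avg / k)
         for v, ins in in_sample.items()}
    for _ in range(iterations):
        # Equation 2: reweight each observed in-edge by d_{i-1}(m) / k.
        d_new = {v: max(d_out.get(v, 0), sum(d[m] / k for m in ins))
                 for v, ins in in_sample.items()}
        # Normalise so the mean estimate stays at the known average.
        mean = sum(d_new.values()) / len(d_new)
        d = {v: est * d_avg / mean for v, est in d_new.items()}
    return d
```

On a toy hub-and-cycle graph, the hub's estimate remains the largest by a wide margin, which is the property the cumulative-degree experiment relies on.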
3.2 Dominating Sets

A more powerful concept than simply finding high-degree nodes is to compute a minimal dominating set for the graph. A dominating set is a set of nodes N such that its dominated set N ∪ friends(N) is the complete set of users V. In the case of a social network, this is a set of people who are, collectively, friends with every person in the network. Marketers can target a small dominating set to indirectly reach the entire network. If an attacker can compromise a small dominating set of accounts, then the entire network is visible.

Even given the complete graph, computing the minimum dominating set is an NP-complete problem [13]. The simple strategy of picking the highest-degree nodes can perform poorly in social networks, as many high-degree users with overlapping groups of friends will be selected which add relatively little coverage. However, a greedy algorithm which repeatedly selects the node which brings the most new nodes into the dominated set has been shown to perform well in practice [7].

To evaluate this greedy approach, we measured the growth of the dominated set as additional nodes are added to the dominating set, with nodes selected using either the complete graph or the sampled graph. The greedy algorithm performs very well given the sampled view. In Figure 3 we compared these two selection strategies against a degree selection strategy of always picking the next highest-degree node (given complete graph knowledge). The greedy algorithm outperforms this approach even with only sampled data, demonstrating that significant network information is leaked beyond an approximation of degrees³.

Our experiments showed that very small sets exist which dominate most of the network, followed by a long tail of diminishing returns to dominate the entire network, consistent with previous research [9]. With complete graph knowledge, we found a set of 100 nodes which dominate 65.2% of the Harvard network, while we were able to find a set of 100 nodes giving 62.1% coverage using only the sampled graph, for 95% accuracy. Similarly, for Stanford, our computed 100-node dominating set had 94% of the optimal coverage.

Figure 3: Dominating set estimation

3.3 Centrality and Message Interception

Another important metric used in the analysis of social networks is centrality, which is a measure of the importance of members of the network based on their position in the graph. We use the betweenness centrality metric, for which an efficient algorithm is described in [5]. Betweenness centrality is defined as:

    C_B(v) = Σ_{s≠v≠t∈V} σ_st(v)/σ_st    (3)

where σ_st is the number of shortest paths from node s to node t and σ_st(v) is the number of such paths which contain the node v. Thus, the higher a node's betweenness centrality, the more communication paths it is a part of.

A node with high betweenness centrality, therefore, is one which could intercept many messages travelling through the network. We simulated such interception on the Stanford sub-network to determine the probability that an attacker controlling N nodes could intercept a message sent from any node to any other node in the network via shortest-path social routes. The nodes designated as compromised are selected according to one of three policies: maximum centrality, maximum centrality for a sampled graph with k = 8, or random selection. Figure 4 shows that, while randomly selecting nodes to compromise leads to a linear increase in intercepted traffic, a selective targeting of highly central nodes yields much better results. After compromising 10% of nodes in this sub-network, an attacker could intercept 15.2% of messages if she used random selection, but using centrality to direct the choice of nodes, the attacker can intercept as much as 51.9% of messages. Even if the attacker uses only the information available from a k = 8 public listing, she can still successfully intercept 49.8% of all messages, just 4% less than she could do with full centrality information.

Figure 4: Message interception
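Equation 3 is expensive to evaluate naively; the algorithm of [5] computes it in O(|V||E|) time for unweighted graphs. Below is our own compact version of that accumulation scheme; the adjacency-set input format is an assumption:

```python
from collections import deque

def betweenness(adj):
    """Betweenness centrality C_B(v) (equation 3) for an undirected,
    unweighted graph {node: set(neighbours)}. Ordered (s, t) pairs are
    counted, so undirected scores are simply doubled."""
    cb = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s, recording shortest-path counts and predecessors.
        stack, queue = [], deque([s])
        pred = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}; sigma[s] = 1
        dist = {v: -1 for v in adj}; dist[s] = 0
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    pred[w].append(v)
        # Accumulate path dependencies in reverse BFS order.
        delta = {v: 0.0 for v in adj}
        while stack:
            w = stack.pop()
            for v in pred[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                cb[w] += delta[w]
    return cb

# On the path a - b - c, only b lies between the endpoints.
path = {'a': {'b'}, 'b': {'a', 'c'}, 'c': {'b'}}
assert betweenness(path)['b'] == 2.0  # pairs (a,c) and (c,a)
```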
³ In fact, we were surprised to find that selecting the highest-degree nodes based on our naive degree estimates from the sampled graph outperformed selecting based on actual degree! This is because a bias in favor of nodes with low-degree friends is useful for finding a dominating set, as the graph's fringes are reached more quickly.
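The greedy selection used for dominating sets in Section 3.2 is short to state in code. This is a generic sketch of the heuristic from [7] with our own input format; it can be run on either the complete or the sampled adjacency sets:

```python
def greedy_dominating_set(adj, budget=None):
    """Repeatedly pick the node bringing the most new nodes into the
    dominated set N ∪ friends(N). adj: {node: set(friends)}; budget
    optionally caps the number of selected nodes (e.g. 100)."""
    undominated = set(adj)
    chosen = []
    while undominated and (budget is None or len(chosen) < budget):
        best = max(adj, key=lambda v: len(({v} | adj[v]) & undominated))
        chosen.append(best)
        undominated -= {best} | adj[best]
    return chosen

# A star plus a separate edge: the hub and one endpoint dominate all.
adj = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}, 4: {5}, 5: {4}}
print(greedy_dominating_set(adj))  # prints [0, 4]
```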
3.4 Shortest Paths
We tested the extent that the small world property is maintained in a sampled graph by computing the minimum path length between every pair of nodes, using the Floyd-Warshall shortest path algorithm [11]. As this algorithm has a complexity of O(|V|³), where V is the set of vertices in a graph, we only calculated shortest paths for a subset of our example data. Within the 2,000 most popular Stanford users, the minimum-length shortest path was a single edge and the maximum-length path was just three edges. Figure 5 shows the effect of limiting friendship knowledge on reachability of nodes in the graph. The maximum-length shortest path increases from 3 to 7 as we reduce visible friendships to 2 per person, but the graph remains fully connected. The average path length for Stanford was 1.94 edges, and with k = 8, it increased to just 3.09.

Figure 5: Reachable nodes

3.5 Community Detection

Given a complete graph, it is possible to automatically cluster the users into highly-connected subgroups which signify natural social groupings. Community structure in social graphs could be used to market products that other members of one's social clique have purchased, or to identify dissident groups, among other applications. For our purposes, we will attempt to find communities with maximal modularity [16], a clustering metric for which efficient algorithms can partition realistically-sized social graphs with thousands of nodes. Modularity is defined to be the number of edges which exist within communities beyond those that would be expected to occur randomly:

    Q = (1/2m) · Σ_{v,w} [ A_vw − d(v)d(w)/2m ]    (4)

where m is the total number of edges in the graph, and A_vw = 1 if and only if v and w are connected.

We implemented the greedy algorithm for community detection in large graphs described in [8]. The results on our sample graph were striking, as shown in Figure 6. With a sampling parameter of k = 8, we were able to divide the graph into communities nearly as well as using complete graph knowledge. Using sampled data, we divided the graph into 18,150 communities with a modularity of 0.341, 92% as high as the optimal 18,205 communities with a modularity of 0.369.

The algorithm produced a slightly different set of communities using the sampled graph, because a well-connected social network like a college campus contains many overlapping communities, but using modularity as our metric, the communities identified were almost as significant. Community detection worked nearly as well with k = 5, but performance deteriorated significantly for k = 2 and below.

Figure 6: Community Detection
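Equation 4 can be evaluated directly for a candidate partition (the restriction to pairs in the same community is implicit in "edges which exist within communities"). A naive O(|V|²) sketch with assumed data structures:

```python
def modularity(adj, community):
    """Modularity Q (equation 4) of a partition of an undirected graph.

    adj: {node: set(friends)}; community: {node: label}. Sums
    A_vw - d(v)d(w)/2m over ordered pairs in the same community."""
    m = sum(len(friends) for friends in adj.values()) / 2.0
    q = 0.0
    for v in adj:
        for w in adj:
            if community[v] != community[w]:
                continue
            a_vw = 1.0 if w in adj[v] else 0.0
            q += a_vw - len(adj[v]) * len(adj[w]) / (2.0 * m)
    return q / (2.0 * m)

# Two disjoint triangles split into their natural communities: Q = 0.5.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1},
       3: {4, 5}, 4: {3, 5}, 5: {3, 4}}
labels = {v: (0 if v < 3 else 1) for v in adj}
assert abs(modularity(adj, labels) - 0.5) < 1e-9
```

Fast greedy agglomeration as in [8] maintains this quantity incrementally rather than recomputing it from scratch at each merge.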
4. RELATED WORK

Graph theory is a well-studied mathematical field; a good overview is available in [3]. Sociologists have long been interested in applying graph theory to social networks; the definitive work is [18]. Approximating information from a social graph when presented with incomplete knowledge, however, is far less studied. The closest example we could find in the literature was a study which evaluated community detection given only the edges from a small subset of nodes [15]. This work used a different sampling strategy, however, and only aimed to classify nodes into one of two communities in a simulated graph. Another set of experiments [9] examined strategies for placing a small set of nodes under surveillance to gain coverage of a larger network, similar to our calculation of dominating sets. Related to our problem of limiting the usefulness of observable graph data is anonymising a social network for research purposes. It has been shown that, due to the unique structure of groups within a social graph, it is often easy to de-anonymise a social graph by correlating it with known data [4].

5. CONCLUSIONS

We have examined the difficulty of computing graph statistics given a random sample of k edges from each node, and found that many interesting properties can be accurately approximated. This has disturbing implications for online privacy, since leaking graph information enables transitive privacy loss: insecure friends' profiles can be correlated to a user with a private profile. Social network operators should be aware of the importance of protecting not just user profile data, but the structure of the social graph. In particular, they shouldn't assist data aggregators by giving away public listings.

Acknowledgements

We would like to thank George Danezis, Claudia Diaz, Eiko Yoneki, and Joshua Batson, as well as anonymous reviewers, for helpful discussions about this research.

6. REFERENCES

[1] Facebook privacy policy. http://www.facebook.com/policy.php (2009).
[2] Acquisti, A., and Gross, R. Imagined Communities: Awareness, Information Sharing, and Privacy on the Facebook. In Privacy Enhancing Technologies, LNCS 4258 (2006), Springer Berlin / Heidelberg, pp. 36-58.
[3] Albert, R., and Barabási, A.-L. Statistical mechanics of complex networks. Rev. Mod. Phys. 74, 1 (Jan 2002), 47-97.
[4] Backstrom, L., Dwork, C., and Kleinberg, J. Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography. In WWW '07: Proceedings of the 16th International Conference on World Wide Web (New York, NY, USA, 2007), ACM, pp. 181-190.
[5] Brandes, U. A Faster Algorithm for Betweenness Centrality. Journal of Mathematical Sociology 25, 2 (2001), 163-177.
[6] Brin, S., and Page, L. The anatomy of a large-scale hypertextual web search engine. In Seventh International World-Wide Web Conference (WWW 1998) (1998).
[7] Chvatal, V. A Greedy Heuristic for the Set-Covering Problem. Mathematics of Operations Research 4, 3 (1979), 233-235.
[8] Clauset, A., Newman, M. E. J., and Moore, C. Finding community structure in very large networks. Physical Review E 70, 6 (Dec 2004), 066111.
[9] Danezis, G., and Wittneben, B. The economics of mass surveillance and the questionable value of anonymous communications. WEIS: Workshop on the Economics of Information Security (2006).
[10] Dwyer, C., Hiltz, S. R., and Passerini, K. Trust and privacy concern within social networking sites: A comparison of Facebook and MySpace. Americas Conference on Information Systems (2007).
[11] Floyd, R. W. Algorithm 97: Shortest path. Communications of the ACM 5, 6 (1962), 345.
[12] Jagatic, T. N., Johnson, N. A., Jakobsson, M., and Menczer, F. Social phishing. Commun. ACM 50, 10 (2007), 94-100.
[13] Karp, R. Reducibility Among Combinatorial Problems. Complexity of Computer Computations (1972).
[14] Krishnamurthy, B., and Wills, C. E. Characterizing Privacy in Online Social Networks. In Workshop on Online Social Networks, WOSN 2008 (2008), pp. 37-42.
[15] Nagaraja, S. The economics of covert community detection and hiding. WEIS: Workshop on the Economics of Information Security (2008).
[16] Newman, M. E. J., and Girvan, M. Finding and evaluating community structure in networks. Physical Review E 69 (2004), 026113.
[17] Rosenblum, D. What Anyone Can Know: The Privacy Risks of Social Networking Sites. IEEE Security & Privacy Magazine 5, 3 (2007), 40.
[18] Wasserman, S., and Faust, K. Social Network Analysis. Cambridge University Press, 1994.
[19] Xu, W., Zhou, X., and Li, L. Inferring privacy information via social relations. International Conference on Data Engineering (2008).