Download Challenges in Mining Social Network Data

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Cracking of wireless networks wikipedia , lookup

CAN bus wikipedia , lookup

Recursive InterNetwork Architecture (RINA) wikipedia , lookup

Computer network wikipedia , lookup

IEEE 1355 wikipedia , lookup

Network tap wikipedia , lookup

Routing in delay-tolerant networking wikipedia , lookup

Airborne Networking wikipedia , lookup

Transcript
Challenges in Mining Social Network Data
Jon Kleinberg
Cornell University
Including joint work with Lars Backstrom, Dan Cosley, Cynthia Dwork, Dan
Huttenlocher, Xiangyang Lan, Sid Suri
Jon Kleinberg
Challenges in Mining Social Network Data
Social Network Analysis
High-school dating (Bearman-Moody-Stovel 2004)
Karate club (Zachary 1977)
Social network data
Active research area in sociology, social psychology,
anthropology for the past half-century.
Today: Convergence of social and technological networks
Computing and info. systems with intrinsic social structure.
What can the different fields learn from each other?
Jon Kleinberg
Challenges in Mining Social Network Data
Mining Social Network Data
Mining social networks also has long history in social sciences.
E.g. Wayne Zachary’s Ph.D. work (1970-72): observe social
ties and rivalries in a university karate club.
Jon Kleinberg
Challenges in Mining Social Network Data
Mining Social Network Data
Mining social networks also has long history in social sciences.
E.g. Wayne Zachary’s Ph.D. work (1970-72): observe social
ties and rivalries in a university karate club.
During his observation, conflicts intensified and group split.
Jon Kleinberg
Challenges in Mining Social Network Data
Mining Social Network Data
Mining social networks also has long history in social sciences.
E.g. Wayne Zachary’s Ph.D. work (1970-72): observe social
ties and rivalries in a university karate club.
During his observation, conflicts intensified and group split.
Split could be explained by minimum cut in social network.
Jon Kleinberg
Challenges in Mining Social Network Data
A Matter of Scale
Social network data spans many orders of magnitude
436-node network of e-mail
exchange over 3 months at a
corporate research lab
(Adamic-Adar 2003)
43,553-node network of e-mail exchange over 2 years at a
large university (Kossinets-Watts 2006)
4.4-million-node network of declared friendships on blogging
community LiveJournal (Liben-Nowell et al. 2005, Backstrom
et al. 2006)
240-million-node network of all IM communication over one
month on Microsoft Instant Messenger (Leskovec-Horvitz’07)
Jon Kleinberg
Challenges in Mining Social Network Data
Not Just a Matter of Scale
How does massive
network data compare
to small-scale
studies?
Currently, massive network datasets give you both more and less:
More: can observe global phenomena that are genuine, but
literally invisible at smaller scales.
Less: Don’t really know what any one node or link means.
Easy to measure things; hard to pose nuanced questions.
Goal: Find the point where the lines of research converge.
Jon Kleinberg
Challenges in Mining Social Network Data
Outline
Several core KDD methodologies come into play:
Working with network data that is much messier than just
nodes and edges.
Algorithmic models as a basic vocabulary for expressing
complex social-science questions on complex network data.
Understanding social networks as datasets: privacy
implications and other concerns.
Plan for the talk:
Algorithmic models for search and diffusion in social networks:
Formulating some fundamental unresolved questions.
Evaluating anonymization as a standard approach for
protecting privacy in social network data.
Jon Kleinberg
Challenges in Mining Social Network Data
The Small-World Phenomenon
Milgram’s small-world experiment (1967)
Choose a target in Boston, starters in Nebraska.
A letter begins at each starter, must be passed between
personal acquaintances until target is reached.
Six steps on average −→ six degrees of separation.
Milgram experiment (Travers-Milgram 1970)
Jon Kleinberg
Microsoft IM (Leskovec-Horvitz 2007)
Challenges in Mining Social Network Data
Modeling the Search Process in Social Networks
Routing in a (social) network:
When is local information
sufficient? [Kleinberg 2000]
Variation on network model of
Watts and Strogatz [1998].
Add edges to lattice: u links to v
with probability d(u, v )−α .
Optimal exponent α = 2: yields
routing time ∼ c log2 n.
Similar effects hold in other
structures (hierarchies, set systems)
Connections to long-range
percolation in statistical physics
Jon Kleinberg
Challenges in Mining Social Network Data
Geographic Data: LiveJournal
Liben-Nowell, Kumar, Novak, Raghavan, Tomkins (2005) studied
LiveJournal, an on-line blogging community with friendship links.
Large-scale social network with geographical embedding:
500,000 members with U.S. Zip codes, 4 million links.
Analyzed how friendship probability decreases with distance.
Difficulty: non-uniform population density makes simple
lattice models hard to apply.
Jon Kleinberg
Challenges in Mining Social Network Data
LiveJournal: Rank-Based Friendship
rank 7
w
v
Rank-based friendship: rank of w with respect to v is number of
people x such that d(v , x) < d(v , w ).
Result of [LKNRT’05]: Efficient routing for (nearly) arbitrary
population density, if link probability proportional to 1/rank.
Generalization of lattice result.
Punchline: LiveJournal friendships approximate 1/rank.
Jon Kleinberg
Challenges in Mining Social Network Data
LiveJournal: Rank-Based Friendship
rank 7
w
v
Rank-based friendship: rank of w with respect to v is number of
people x such that d(v , x) < d(v , w ).
Result of [LKNRT’05]: Efficient routing for (nearly) arbitrary
population density, if link probability proportional to 1/rank.
Generalization of lattice result.
Punchline: LiveJournal friendships approximate 1/rank.
Jon Kleinberg
Challenges in Mining Social Network Data
Diffusion in Social Networks
Book recommendations (Leskovec et al 2006)
Contagion of TB (Andre et al. 2006)
Diffusion, another fundamental social processs:
Behaviors that cascade from node to node like an epidemic.
News, opinions, rumors, fads, urban legends, ...
Viral marketing [Domingos-Richardson 2001]
Public health (e.g. obesity [Christakis-Fowler 2007])
Cascading failures in financial markets
Localized collective action: riots, walkouts
Jon Kleinberg
Challenges in Mining Social Network Data
Diffusion Curves
Basis for models: Probability of adopting new behavior depends on
number of friends who have adopted.
Bass 1969; Granovetter 1978; Schelling 1978
Prob.
of
adoption
Prob.
of
adoption
k = number of friends adopting
k = number of friends adopting
Key issue: qualitative shape of the diffusion curves.
Diminishing returns? Critical mass?
Distinction has consequences for models of diffusion at
population level.
Jon Kleinberg
Challenges in Mining Social Network Data
Diffusion Curves
Probability of joining a community when k friends are already members
0.025
0.02
probability
0.015
0.01
0.005
0
0
5
10
15
20
25
k
30
35
40
45
50
Joining a LiveJournal community [Backstrom et al. ’06]
Editing a Wikipedia article [Cosley et al. ’07]
Buying a DVD [Leskovec et al. ’06]
Triadic closure in e-mail [Kossinets-Watts ’06]
Jon Kleinberg
Challenges in Mining Social Network Data
Basic Questions about Diffusion Processes
Is the recurring diminishing-returns shape a consequence of
some more basic phenomenon?
What are its implications for global structure of diffusion?
[Kempe-Kleinberg-Tardos 2003, Mossell-Roch 2007]
Diffusion of Topics
[Adar-Zhang-Adamic-Lukose 04,
Gruhl-Liben-Nowell-Guha-Tomkins 04,
Leskovec-McGlohon-Faloutsos-Glance-Hurst 07]
News stories cascade through networks of bloggers and media
How should we track stories and rank news sources?
A taxonomy of sources: discoverers, amplifiers, reshapers, ...
Building diffusion into the design of social media
Incentives to propagate interesting recommendations along
social network links [Leskovec-Adamic-Huberman 2006]
Simple markets based on question-answering and
information-seeking [Kleinberg-Raghavan 2005]
Jon Kleinberg
Challenges in Mining Social Network Data
Are Cascades Inherently Unpredictable?
Salganik-Dodds-Watts 2006
Music download site:
obscure songs, 14,000
visitors.
Users first previewed songs,
then rated, then optionally
downloaded.
Varied the information available:
Just song titles
Titles and download statistics (feedback from other users).
Version with feedback: ran 8 “parallel universes.”
Feedback produced greater inequality, and top songs varied
across parallel runs.
Jon Kleinberg
Challenges in Mining Social Network Data
Protecting Privacy in Social Network Data
Many large datasets based on communication (e-mail, IM, voice)
where users have strong privacy expectations.
Current safeguards based on anonymization: replace node
names with random IDs.
With more detailed data, anonymization has run into trouble:
Identifying on-line pseudonyms by textual analysis
[Novak-Raghavan-Tomkins 2004]
De-anonymizing Netflix ratings via time series
[Narayanan-Shmatikov 2006]
Search engine query logs: identifying users from their queries.
Our setting is much starker;
does this make things safer?
E.g. no text, time-stamps,
or node attributes
Just a graph with nodes
numbered 1, 2, 3, . . . , n.
Jon Kleinberg
Challenges in Mining Social Network Data
An Example of What Can Go Wrong
Node 32 can find himself: only node of degree 6 connected to
both leaders.
Jon Kleinberg
Challenges in Mining Social Network Data
An Example of What Can Go Wrong
Node 32 can find himself: only node of degree 6 connected to
both leaders.
Node 4 can find herself: only node of degree 5 connected to
defecting leader but not original leader.
Jon Kleinberg
Challenges in Mining Social Network Data
Attacking an Anonymized Network
What we learn from this:
Attacker may have extra power if they are part of the system.
In large e-mail/IM network, can easily add yourself to system.
But “finding yourself” when there are 100 million nodes is
going to be more subtle than when there are 34 nodes.
Template for an active attack on an anonymized network
[Backstrom-Dwork-Kleinberg 2007]
Attacker can create (before the data is released)
nodes (e.g. by registering an e-mail account)
edges incident to these nodes (by sending mail)
Privacy breach: learning whether there is an edge between
two existing nodes in the network.
Note: attacker’s actions are completely “innocuous.”
Main result: active attacks can easily compromise privacy by
creating very few additional nodes (in theory and experiment).
Jon Kleinberg
Challenges in Mining Social Network Data
An Attack
100M nodes
Scenario:
Suppose an organization were going to release an anonymized
communication graph on 100 million users.
Jon Kleinberg
Challenges in Mining Social Network Data
An Attack
100M nodes
An attacker chooses a small set of user accounts to “target”:
Goal is to learn edge relations among them.
Jon Kleinberg
Challenges in Mining Social Network Data
An Attack
100M nodes
Before dataset is released:
Create a small set of new accounts, with links among them,
forming a subgraph H.
Attach this new subgraph H to targeted accounts.
Jon Kleinberg
Challenges in Mining Social Network Data
An Attack
100M+12 nodes
When anonymized dataset is released, need to find H.
Why couldn’t there be many copies of H in the dataset?
(We don’t even know what the network will look like ... )
Why wouldn’t it be computationally hard to find H?
Jon Kleinberg
Challenges in Mining Social Network Data
An Attack
100M+12 nodes
In fact,
Theorem: small random graphs H will likely be unique and
efficiently findable.
Random graph: each edge present with prob. 1/2.
Jon Kleinberg
Challenges in Mining Social Network Data
An Attack
100M+12 nodes
Once H is found:
Can easily find the targeted nodes by following edges from H.
Jon Kleinberg
Challenges in Mining Social Network Data
Specifics of the Attack
First version of the attack:
Create random H on ∼ 2 log n nodes.
Can compromise ∼ (log n)2 targeted nodes.
In experiments on 4.4 million-node LiveJournal graph,
7-node graph H can compromise 70 targeted nodes
(and hence ∼ 2400 edge relations).
Second version of the attack:
Can begin breaching privacy with H of size ∼
√
log n.
Matching lower bound on size of H needed.
Passive attacks:
In LiveJournal graph: with reasonable probability, you and 6
of your friends chosen at random can carry out the first
attack, compromising about 10 users.
Jon Kleinberg
Challenges in Mining Social Network Data
Specifics of the Attack
targeted nodes
H
2
1
3
4
5
6
Random subgraph H: On a bit more than 2 log n nodes;
each pair linked with with probability 21 .
Link each targeted node to distinct subset of nodes in H.
Must show
H is unique in network (even after plugging in).
H is efficiently findable in unlabeled graph.
H has no internal symmetries (automorphisms); this is easy.
Jon Kleinberg
Challenges in Mining Social Network Data
Specifics of the Attack
targeted nodes
H
2
1
3
4
5
6
Random subgraph H: On a bit more than 2 log n nodes;
each pair linked with with probability 21 .
Link each targeted node to distinct subset of nodes in H.
Must show
H is unique in network (even after plugging in).
H is efficiently findable in unlabeled graph.
H has no internal symmetries (automorphisms); this is easy.
Jon Kleinberg
Challenges in Mining Social Network Data
Specifics of the Attack
targeted nodes
H
2
1
3
4
5
6
Random subgraph H: On a bit more than 2 log n nodes;
each pair linked with with probability 21 .
Link each targeted node to distinct subset of nodes in H.
Must show
H is unique in network (even after plugging in).
H is efficiently findable in unlabeled graph.
H has no internal symmetries (automorphisms); this is easy.
Jon Kleinberg
Challenges in Mining Social Network Data
Specifics of the Attack
targeted nodes
H
Random subgraph H: On a bit more than 2 log n nodes;
each pair linked with with probability 21 .
Link each targeted node to distinct subset of nodes in H.
Must show
H is unique in network (even after plugging in).
H is efficiently findable in unlabeled graph.
H has no internal symmetries (automorphisms); this is easy.
Jon Kleinberg
Challenges in Mining Social Network Data
Why is H Unique? Ideas from Ramsey Theory
Basic calculation at the foundation of
Theorem (Erdös, 1947): There exists an n-node
graph with no clique and no independent set of
size > 2 log n.
clique
Quantitative bound for Ramsey’s Theorem;
one of the earliest uses of random graphs.
independent set
The calculation:
Build random n-node graph, include each edge with prob.
1
2.
There are < nk sets of k nodes; each is a clique or
k
2
independent set with probability 2−(2) ≈ 2−k /2 .
2
Product nk · 2−k /2 upper-bounds probability of any clique or
indep. set; it drops below 1 once k exceeds ≈ 2 log n.
Jon Kleinberg
Challenges in Mining Social Network Data
Why is H Unique? Ideas from Ramsey Theory
Erdös: Graph is random, subgraph is non-random.
Our case: Subgraph (H) is random, graph is non-random.
But main calculation starts from same premise:
Almost correct: there are < nk subgraphs that could be a
2
second copy of H, and each is same as H with prob. ≈ 2−k /2 .
targeted nodes
H
2
1
Jon Kleinberg
3
4
5
6
Analysis is greatly
complicated because
H is plugged into full
graph.
New copies of H
could partly overlap
original copy of H.
Challenges in Mining Social Network Data
Why is H Unique? Ideas from Ramsey Theory
Erdös: Graph is random, subgraph is non-random.
Our case: Subgraph (H) is random, graph is non-random.
But main calculation starts from same premise:
Almost correct: there are < nk subgraphs that could be a
2
second copy of H, and each is same as H with prob. ≈ 2−k /2 .
targeted nodes
H
2
Jon Kleinberg
2
1
3
4
5
6
Analysis is greatly
complicated because
H is plugged into full
graph.
New copies of H
could partly overlap
original copy of H.
Challenges in Mining Social Network Data
Finding the subgraph H
To find H:
Can assume there is a path through
nodes 1, 2, . . . , k.
Start search at all possible nodes in G .
Prune search path at depth j if edges
back from node j don’t match, or if
degree of j doesn’t match.
Probability of a spurious path surviving to depth j is ≈ 2−j
(modulo overlap worries).
Overall size of search tree slightly more than linear in n.
Jon Kleinberg
Challenges in Mining Social Network Data
2 /2
Experiments
Probability of successful attack
1
4.4 million nodes,
77 million edges
0.8
probability
Simulated the attack on
social/blogging network
LiveJournal
d0=20, d1=60
d0=10, d1=20
0.6
0.4
0.2
0
0
2
4
6
k
8
With 7 nodes, degrees in [20, 60], success rate > 90%.
Average of 70 nodes compromised (2415 edges).
Search tree about 90,000 nodes (i.e., tiny).
7 nodes much less than 2 log n;
randomization of degrees crucial to performance.
Jon Kleinberg
Challenges in Mining Social Network Data
10
12
Extensions
Optimal attack size:
Attack variant breaches privacy with H of size ∼
Optimal up to constant factors.
√
log n:
Random H attached along “small” edge separation;
recover using Gomory-Hu cut tree.
Passive attacks:
If already in network: an attack
needing no preparation.
Probability of successful attack
1
0.9
0.8
Such subgraphs are unique with
high probability.
0.7
Probability
Form subgraph with small number
of neighbors.
0.6
0.5
0.4
0.3
0.2
Simple Algorithm, High-Degree Friends
Refined Algorithm, High-Deg Friends
Refined Algorithm, Random Friends
0.1
0
2
3
4
5
Coalition Size
6
Can compromise nodes in neigborhood of subgraph.
Jon Kleinberg
Challenges in Mining Social Network Data
7
8
Stronger Theoretical Bound
Variant on construction breaches
√
privacy with H of size ∼ log n:
Optimal up to constant factors.
Construct H as before on k nodes,
but connect to b = k3 targeted
nodes.
With high prob., min. internal cut
in H exceeds b = cut to rest of
graph.
Can find H by breaking graph
apart using Gomory-Hu cut tree.
Jon Kleinberg
Challenges in Mining Social Network Data
Passive Attacks
If node v already in network, can
recruit neighbors after data released.
Suppose subgraph H on v and
colluding neighbors is unique.
If a node w is the only one to attach
to a particular subset of N(v ), then w
is compromised.
Probability of successful attack
Average number of users compromised
50
0.9
45
0.8
40
Number Compromised
1
Probability
0.7
0.6
0.5
0.4
0.3
0.2
Simple Algorithm, High-Degree Friends
Refined Algorithm, High-Deg Friends
Refined Algorithm, Random Friends
0.1
0
2
3
4
5
Coalition Size
6
Passive
Semi-Passive
35
30
25
20
15
10
5
7
8
Jon Kleinberg
0
2
3
4
5
Coalition Size
6
7
Challenges in Mining Social Network Data
8
The Perils of Anonymized Data
General release of an anonymized social network?
Many potential dangers.
Note: earlier datasets additionally protected by
legal/contractual/IRB/employment safeguards.
Fundamental question: privacy-preserving mechanisms for
making social network data accessible.
May be difficult to obfuscate network effectively
(e.g. [Dinur-Nissim 2003, Dwork-McSherry-Talwar 2007])
Interactive mechanisms for network data may be possible
(e.g. [Dwork-McSherry-Nissim-Smith 2006])
Recent proposals specifically aimed at protecting
social networks [Hay et al ’07, Zheleva-Getoor ’07] and
query logs [Adar ’07, Kumar et al ’07, Xiong-Agichtein ’07]
Further issues
Privacy-utility trade-offs.
Even without overt attacks, increasingly refined pictures of
individuals begin to emerge.
Jon Kleinberg
Challenges in Mining Social Network Data
Final Reflections: Toward a Model of You
Further direction: from populations to individuals
Distributions over millions of people leave open several
possibilities:
Individual are highly diverse, and the distribution only appears
in aggregate, or
Each individual personally follows (a version of) the
distribution.
Recent studies suggests that sometimes the second option
may in fact be true.
Example: what is the probability
that you answer a piece of e-mail
t days after receipt (conditioned
on answering at all)?
Recent theories suggest
t −1.5 with exponential
cut-off [Barabasi 2005]
Jon Kleinberg
Challenges in Mining Social Network Data
Final Reflections: Interacting in the On-Line World
MySpace is doubly awkward because it makes public what should
be private. It doesn’t just create social networks, it anatomizes
them. It spreads them out like a digestive tract on the autopsy
table. You can see what’s connected to what, who’s connected to
whom.
– Toronto Globe and Mail, June 2006.
Social networks — implicit for millenia — are increasingly
being recorded at arbitrary resolution and browsable in our
information systems.
Your software has a trace of your activities resolved to the
second — and increasingly knows more about your behavior
than you do.
Models based on algorithmic ideas will be crucial in
understanding these developments.
Jon Kleinberg
Challenges in Mining Social Network Data