* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Challenges in Mining Social Network Data
Survey
Document related concepts
Transcript
Challenges in Mining Social Network Data Jon Kleinberg Cornell University Including joint work with Lars Backstrom, Dan Cosley, Cynthia Dwork, Dan Huttenlocher, Xiangyang Lan, Sid Suri Jon Kleinberg Challenges in Mining Social Network Data Social Network Analysis High-school dating (Bearman-Moody-Stovel 2004) Karate club (Zachary 1977) Social network data Active research area in sociology, social psychology, anthropology for the past half-century. Today: Convergence of social and technological networks Computing and info. systems with intrinsic social structure. What can the different fields learn from each other? Jon Kleinberg Challenges in Mining Social Network Data Mining Social Network Data Mining social networks also has long history in social sciences. E.g. Wayne Zachary’s Ph.D. work (1970-72): observe social ties and rivalries in a university karate club. Jon Kleinberg Challenges in Mining Social Network Data Mining Social Network Data Mining social networks also has long history in social sciences. E.g. Wayne Zachary’s Ph.D. work (1970-72): observe social ties and rivalries in a university karate club. During his observation, conflicts intensified and group split. Jon Kleinberg Challenges in Mining Social Network Data Mining Social Network Data Mining social networks also has long history in social sciences. E.g. Wayne Zachary’s Ph.D. work (1970-72): observe social ties and rivalries in a university karate club. During his observation, conflicts intensified and group split. Split could be explained by minimum cut in social network. Jon Kleinberg Challenges in Mining Social Network Data A Matter of Scale Social network data spans many orders of magnitude 436-node network of e-mail exchange over 3 months at a corporate research lab (Adamic-Adar 2003) 43,553-node network of e-mail exchange over 2 years at a large university (Kossinets-Watts 2006) 4.4-million-node network of declared friendships on blogging community LiveJournal (Liben-Nowell et al. 2005, Backstrom et al. 2006) 240-million-node network of all IM communication over one month on Microsoft Instant Messenger (Leskovec-Horvitz’07) Jon Kleinberg Challenges in Mining Social Network Data Not Just a Matter of Scale How does massive network data compare to small-scale studies? Currently, massive network datasets give you both more and less: More: can observe global phenomena that are genuine, but literally invisible at smaller scales. Less: Don’t really know what any one node or link means. Easy to measure things; hard to pose nuanced questions. Goal: Find the point where the lines of research converge. Jon Kleinberg Challenges in Mining Social Network Data Outline Several core KDD methodologies come into play: Working with network data that is much messier than just nodes and edges. Algorithmic models as a basic vocabulary for expressing complex social-science questions on complex network data. Understanding social networks as datasets: privacy implications and other concerns. Plan for the talk: Algorithmic models for search and diffusion in social networks: Formulating some fundamental unresolved questions. Evaluating anonymization as a standard approach for protecting privacy in social network data. Jon Kleinberg Challenges in Mining Social Network Data The Small-World Phenomenon Milgram’s small-world experiment (1967) Choose a target in Boston, starters in Nebraska. A letter begins at each starter, must be passed between personal acquaintances until target is reached. Six steps on average −→ six degrees of separation. Milgram experiment (Travers-Milgram 1970) Jon Kleinberg Microsoft IM (Leskovec-Horvitz 2007) Challenges in Mining Social Network Data Modeling the Search Process in Social Networks Routing in a (social) network: When is local information sufficient? [Kleinberg 2000] Variation on network model of Watts and Strogatz [1998]. Add edges to lattice: u links to v with probability d(u, v )−α . Optimal exponent α = 2: yields routing time ∼ c log2 n. Similar effects hold in other structures (hierarchies, set systems) Connections to long-range percolation in statistical physics Jon Kleinberg Challenges in Mining Social Network Data Geographic Data: LiveJournal Liben-Nowell, Kumar, Novak, Raghavan, Tomkins (2005) studied LiveJournal, an on-line blogging community with friendship links. Large-scale social network with geographical embedding: 500,000 members with U.S. Zip codes, 4 million links. Analyzed how friendship probability decreases with distance. Difficulty: non-uniform population density makes simple lattice models hard to apply. Jon Kleinberg Challenges in Mining Social Network Data LiveJournal: Rank-Based Friendship rank 7 w v Rank-based friendship: rank of w with respect to v is number of people x such that d(v , x) < d(v , w ). Result of [LKNRT’05]: Efficient routing for (nearly) arbitrary population density, if link probability proportional to 1/rank. Generalization of lattice result. Punchline: LiveJournal friendships approximate 1/rank. Jon Kleinberg Challenges in Mining Social Network Data LiveJournal: Rank-Based Friendship rank 7 w v Rank-based friendship: rank of w with respect to v is number of people x such that d(v , x) < d(v , w ). Result of [LKNRT’05]: Efficient routing for (nearly) arbitrary population density, if link probability proportional to 1/rank. Generalization of lattice result. Punchline: LiveJournal friendships approximate 1/rank. Jon Kleinberg Challenges in Mining Social Network Data Diffusion in Social Networks Book recommendations (Leskovec et al 2006) Contagion of TB (Andre et al. 2006) Diffusion, another fundamental social processs: Behaviors that cascade from node to node like an epidemic. News, opinions, rumors, fads, urban legends, ... Viral marketing [Domingos-Richardson 2001] Public health (e.g. obesity [Christakis-Fowler 2007]) Cascading failures in financial markets Localized collective action: riots, walkouts Jon Kleinberg Challenges in Mining Social Network Data Diffusion Curves Basis for models: Probability of adopting new behavior depends on number of friends who have adopted. Bass 1969; Granovetter 1978; Schelling 1978 Prob. of adoption Prob. of adoption k = number of friends adopting k = number of friends adopting Key issue: qualitative shape of the diffusion curves. Diminishing returns? Critical mass? Distinction has consequences for models of diffusion at population level. Jon Kleinberg Challenges in Mining Social Network Data Diffusion Curves Probability of joining a community when k friends are already members 0.025 0.02 probability 0.015 0.01 0.005 0 0 5 10 15 20 25 k 30 35 40 45 50 Joining a LiveJournal community [Backstrom et al. ’06] Editing a Wikipedia article [Cosley et al. ’07] Buying a DVD [Leskovec et al. ’06] Triadic closure in e-mail [Kossinets-Watts ’06] Jon Kleinberg Challenges in Mining Social Network Data Basic Questions about Diffusion Processes Is the recurring diminishing-returns shape a consequence of some more basic phenomenon? What are its implications for global structure of diffusion? [Kempe-Kleinberg-Tardos 2003, Mossell-Roch 2007] Diffusion of Topics [Adar-Zhang-Adamic-Lukose 04, Gruhl-Liben-Nowell-Guha-Tomkins 04, Leskovec-McGlohon-Faloutsos-Glance-Hurst 07] News stories cascade through networks of bloggers and media How should we track stories and rank news sources? A taxonomy of sources: discoverers, amplifiers, reshapers, ... Building diffusion into the design of social media Incentives to propagate interesting recommendations along social network links [Leskovec-Adamic-Huberman 2006] Simple markets based on question-answering and information-seeking [Kleinberg-Raghavan 2005] Jon Kleinberg Challenges in Mining Social Network Data Are Cascades Inherently Unpredictable? Salganik-Dodds-Watts 2006 Music download site: obscure songs, 14,000 visitors. Users first previewed songs, then rated, then optionally downloaded. Varied the information available: Just song titles Titles and download statistics (feedback from other users). Version with feedback: ran 8 “parallel universes.” Feedback produced greater inequality, and top songs varied across parallel runs. Jon Kleinberg Challenges in Mining Social Network Data Protecting Privacy in Social Network Data Many large datasets based on communication (e-mail, IM, voice) where users have strong privacy expectations. Current safeguards based on anonymization: replace node names with random IDs. With more detailed data, anonymization has run into trouble: Identifying on-line pseudonyms by textual analysis [Novak-Raghavan-Tomkins 2004] De-anonymizing Netflix ratings via time series [Narayanan-Shmatikov 2006] Search engine query logs: identifying users from their queries. Our setting is much starker; does this make things safer? E.g. no text, time-stamps, or node attributes Just a graph with nodes numbered 1, 2, 3, . . . , n. Jon Kleinberg Challenges in Mining Social Network Data An Example of What Can Go Wrong Node 32 can find himself: only node of degree 6 connected to both leaders. Jon Kleinberg Challenges in Mining Social Network Data An Example of What Can Go Wrong Node 32 can find himself: only node of degree 6 connected to both leaders. Node 4 can find herself: only node of degree 5 connected to defecting leader but not original leader. Jon Kleinberg Challenges in Mining Social Network Data Attacking an Anonymized Network What we learn from this: Attacker may have extra power if they are part of the system. In large e-mail/IM network, can easily add yourself to system. But “finding yourself” when there are 100 million nodes is going to be more subtle than when there are 34 nodes. Template for an active attack on an anonymized network [Backstrom-Dwork-Kleinberg 2007] Attacker can create (before the data is released) nodes (e.g. by registering an e-mail account) edges incident to these nodes (by sending mail) Privacy breach: learning whether there is an edge between two existing nodes in the network. Note: attacker’s actions are completely “innocuous.” Main result: active attacks can easily compromise privacy by creating very few additional nodes (in theory and experiment). Jon Kleinberg Challenges in Mining Social Network Data An Attack 100M nodes Scenario: Suppose an organization were going to release an anonymized communication graph on 100 million users. Jon Kleinberg Challenges in Mining Social Network Data An Attack 100M nodes An attacker chooses a small set of user accounts to “target”: Goal is to learn edge relations among them. Jon Kleinberg Challenges in Mining Social Network Data An Attack 100M nodes Before dataset is released: Create a small set of new accounts, with links among them, forming a subgraph H. Attach this new subgraph H to targeted accounts. Jon Kleinberg Challenges in Mining Social Network Data An Attack 100M+12 nodes When anonymized dataset is released, need to find H. Why couldn’t there be many copies of H in the dataset? (We don’t even know what the network will look like ... ) Why wouldn’t it be computationally hard to find H? Jon Kleinberg Challenges in Mining Social Network Data An Attack 100M+12 nodes In fact, Theorem: small random graphs H will likely be unique and efficiently findable. Random graph: each edge present with prob. 1/2. Jon Kleinberg Challenges in Mining Social Network Data An Attack 100M+12 nodes Once H is found: Can easily find the targeted nodes by following edges from H. Jon Kleinberg Challenges in Mining Social Network Data Specifics of the Attack First version of the attack: Create random H on ∼ 2 log n nodes. Can compromise ∼ (log n)2 targeted nodes. In experiments on 4.4 million-node LiveJournal graph, 7-node graph H can compromise 70 targeted nodes (and hence ∼ 2400 edge relations). Second version of the attack: Can begin breaching privacy with H of size ∼ √ log n. Matching lower bound on size of H needed. Passive attacks: In LiveJournal graph: with reasonable probability, you and 6 of your friends chosen at random can carry out the first attack, compromising about 10 users. Jon Kleinberg Challenges in Mining Social Network Data Specifics of the Attack targeted nodes H 2 1 3 4 5 6 Random subgraph H: On a bit more than 2 log n nodes; each pair linked with with probability 21 . Link each targeted node to distinct subset of nodes in H. Must show H is unique in network (even after plugging in). H is efficiently findable in unlabeled graph. H has no internal symmetries (automorphisms); this is easy. Jon Kleinberg Challenges in Mining Social Network Data Specifics of the Attack targeted nodes H 2 1 3 4 5 6 Random subgraph H: On a bit more than 2 log n nodes; each pair linked with with probability 21 . Link each targeted node to distinct subset of nodes in H. Must show H is unique in network (even after plugging in). H is efficiently findable in unlabeled graph. H has no internal symmetries (automorphisms); this is easy. Jon Kleinberg Challenges in Mining Social Network Data Specifics of the Attack targeted nodes H 2 1 3 4 5 6 Random subgraph H: On a bit more than 2 log n nodes; each pair linked with with probability 21 . Link each targeted node to distinct subset of nodes in H. Must show H is unique in network (even after plugging in). H is efficiently findable in unlabeled graph. H has no internal symmetries (automorphisms); this is easy. Jon Kleinberg Challenges in Mining Social Network Data Specifics of the Attack targeted nodes H Random subgraph H: On a bit more than 2 log n nodes; each pair linked with with probability 21 . Link each targeted node to distinct subset of nodes in H. Must show H is unique in network (even after plugging in). H is efficiently findable in unlabeled graph. H has no internal symmetries (automorphisms); this is easy. Jon Kleinberg Challenges in Mining Social Network Data Why is H Unique? Ideas from Ramsey Theory Basic calculation at the foundation of Theorem (Erdös, 1947): There exists an n-node graph with no clique and no independent set of size > 2 log n. clique Quantitative bound for Ramsey’s Theorem; one of the earliest uses of random graphs. independent set The calculation: Build random n-node graph, include each edge with prob. 1 2. There are < nk sets of k nodes; each is a clique or k 2 independent set with probability 2−(2) ≈ 2−k /2 . 2 Product nk · 2−k /2 upper-bounds probability of any clique or indep. set; it drops below 1 once k exceeds ≈ 2 log n. Jon Kleinberg Challenges in Mining Social Network Data Why is H Unique? Ideas from Ramsey Theory Erdös: Graph is random, subgraph is non-random. Our case: Subgraph (H) is random, graph is non-random. But main calculation starts from same premise: Almost correct: there are < nk subgraphs that could be a 2 second copy of H, and each is same as H with prob. ≈ 2−k /2 . targeted nodes H 2 1 Jon Kleinberg 3 4 5 6 Analysis is greatly complicated because H is plugged into full graph. New copies of H could partly overlap original copy of H. Challenges in Mining Social Network Data Why is H Unique? Ideas from Ramsey Theory Erdös: Graph is random, subgraph is non-random. Our case: Subgraph (H) is random, graph is non-random. But main calculation starts from same premise: Almost correct: there are < nk subgraphs that could be a 2 second copy of H, and each is same as H with prob. ≈ 2−k /2 . targeted nodes H 2 Jon Kleinberg 2 1 3 4 5 6 Analysis is greatly complicated because H is plugged into full graph. New copies of H could partly overlap original copy of H. Challenges in Mining Social Network Data Finding the subgraph H To find H: Can assume there is a path through nodes 1, 2, . . . , k. Start search at all possible nodes in G . Prune search path at depth j if edges back from node j don’t match, or if degree of j doesn’t match. Probability of a spurious path surviving to depth j is ≈ 2−j (modulo overlap worries). Overall size of search tree slightly more than linear in n. Jon Kleinberg Challenges in Mining Social Network Data 2 /2 Experiments Probability of successful attack 1 4.4 million nodes, 77 million edges 0.8 probability Simulated the attack on social/blogging network LiveJournal d0=20, d1=60 d0=10, d1=20 0.6 0.4 0.2 0 0 2 4 6 k 8 With 7 nodes, degrees in [20, 60], success rate > 90%. Average of 70 nodes compromised (2415 edges). Search tree about 90,000 nodes (i.e., tiny). 7 nodes much less than 2 log n; randomization of degrees crucial to performance. Jon Kleinberg Challenges in Mining Social Network Data 10 12 Extensions Optimal attack size: Attack variant breaches privacy with H of size ∼ Optimal up to constant factors. √ log n: Random H attached along “small” edge separation; recover using Gomory-Hu cut tree. Passive attacks: If already in network: an attack needing no preparation. Probability of successful attack 1 0.9 0.8 Such subgraphs are unique with high probability. 0.7 Probability Form subgraph with small number of neighbors. 0.6 0.5 0.4 0.3 0.2 Simple Algorithm, High-Degree Friends Refined Algorithm, High-Deg Friends Refined Algorithm, Random Friends 0.1 0 2 3 4 5 Coalition Size 6 Can compromise nodes in neigborhood of subgraph. Jon Kleinberg Challenges in Mining Social Network Data 7 8 Stronger Theoretical Bound Variant on construction breaches √ privacy with H of size ∼ log n: Optimal up to constant factors. Construct H as before on k nodes, but connect to b = k3 targeted nodes. With high prob., min. internal cut in H exceeds b = cut to rest of graph. Can find H by breaking graph apart using Gomory-Hu cut tree. Jon Kleinberg Challenges in Mining Social Network Data Passive Attacks If node v already in network, can recruit neighbors after data released. Suppose subgraph H on v and colluding neighbors is unique. If a node w is the only one to attach to a particular subset of N(v ), then w is compromised. Probability of successful attack Average number of users compromised 50 0.9 45 0.8 40 Number Compromised 1 Probability 0.7 0.6 0.5 0.4 0.3 0.2 Simple Algorithm, High-Degree Friends Refined Algorithm, High-Deg Friends Refined Algorithm, Random Friends 0.1 0 2 3 4 5 Coalition Size 6 Passive Semi-Passive 35 30 25 20 15 10 5 7 8 Jon Kleinberg 0 2 3 4 5 Coalition Size 6 7 Challenges in Mining Social Network Data 8 The Perils of Anonymized Data General release of an anonymized social network? Many potential dangers. Note: earlier datasets additionally protected by legal/contractual/IRB/employment safeguards. Fundamental question: privacy-preserving mechanisms for making social network data accessible. May be difficult to obfuscate network effectively (e.g. [Dinur-Nissim 2003, Dwork-McSherry-Talwar 2007]) Interactive mechanisms for network data may be possible (e.g. [Dwork-McSherry-Nissim-Smith 2006]) Recent proposals specifically aimed at protecting social networks [Hay et al ’07, Zheleva-Getoor ’07] and query logs [Adar ’07, Kumar et al ’07, Xiong-Agichtein ’07] Further issues Privacy-utility trade-offs. Even without overt attacks, increasingly refined pictures of individuals begin to emerge. Jon Kleinberg Challenges in Mining Social Network Data Final Reflections: Toward a Model of You Further direction: from populations to individuals Distributions over millions of people leave open several possibilities: Individual are highly diverse, and the distribution only appears in aggregate, or Each individual personally follows (a version of) the distribution. Recent studies suggests that sometimes the second option may in fact be true. Example: what is the probability that you answer a piece of e-mail t days after receipt (conditioned on answering at all)? Recent theories suggest t −1.5 with exponential cut-off [Barabasi 2005] Jon Kleinberg Challenges in Mining Social Network Data Final Reflections: Interacting in the On-Line World MySpace is doubly awkward because it makes public what should be private. It doesn’t just create social networks, it anatomizes them. It spreads them out like a digestive tract on the autopsy table. You can see what’s connected to what, who’s connected to whom. – Toronto Globe and Mail, June 2006. Social networks — implicit for millenia — are increasingly being recorded at arbitrary resolution and browsable in our information systems. Your software has a trace of your activities resolved to the second — and increasingly knows more about your behavior than you do. Models based on algorithmic ideas will be crucial in understanding these developments. Jon Kleinberg Challenges in Mining Social Network Data