Download Diffusion-Aware Sampling and Estimation

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Diffusion-Aware Sampling and Estimation in
Information Diffusion Networks
By: Motahareh Eslami Mehdiabadi, Hamid Reza Rabiee
and Mostafa Salehi
[email protected]
4th September, 2012
Outline
 Introduction
 Related Work
 Complete Data
 Partial Data
 Problem Formulation
 Proposed Framework
 Sampling Design
 Estimation Approach
 Experimental Evaluation
 Conclusion
 References
2
Diffusion-Aware
Sampling
and Estimation
P2P Live Video
Streaming
DML
Introduction
Information Diffusion
One of the important topics in large On-line Social Networks (OSN)
such as Facebook, Twitter, and YouTube.
Has a latent structure which makes its analysis considerably difficult
Although we usually discover the time of obtaining some information
by people, we can not find the source of information easily.
Using partially-observed network data
3
Diffusion-Aware
Sampling
and Estimation
P2P Live Video
Streaming
DML
Introduction
4
Diffusion-Aware
Sampling
and Estimation
P2P Live Video
Streaming
DML
Different Sampling Designs in Diffusion Networks
 Considering the sampling approaches to study diffusion
behaviors of social networks, apart from their topologies
Figure 1-(a): Using well-known sampling
methods such as BFS and RW without
considering the behavior of the diffusion
process
gathering redundant data +
losing parts of diffusion data + decrease the
performance of these sampling methods
5
Figure 1-(b): Working with diffusion
network directly
It is often not
feasible to work with this latent network
P2P Live Video
Streaming
Diffusion-Aware
Sampling
and Estimation
DML
Introduction
A link-tracing based sampling method which utilizes
diffusion process properties to traverse the network
more accurately.
Only uses the infection times as local information
without any knowledge about the latent structure of the
diffusion network
Proposing a Novel TwoStep Framework
(Sampling/ Estimation)
»DNS«
Extend the well-known Hansen-Hurwitz estimator to correct
the bias of sampled data by three efficient estimators of
characteristics:
(1) link-based (2) node-based (3) cascade-based
The First attempt to introduce a complete
measurement framework for the diffusion networks
6
P2P Live Video
Streaming
Diffusion-Aware
Sampling
and Estimation
DML
Related Work
Complete Data
Partial Data
Complete Data
 The most fundamental approach: Collecting the complete
diffusion data
 Following the Iraq war petitions in the format of e-mail [11],
[19]
 studying communication events between faculty and staff of a
university by e-mails [15]
 tracking the flow of information by extracting short textual
phrases [16]
missing data
8
privacy policies
diffusion network
latent structure
Diffusion-Aware
P2P Live Sampling
Video Streaming
and Estimation
Use sampling methods
to obtain partial data
DML
Partial Data
Common Sampling Methods
Breadth-First Search (BFS)
Random Walk (RW)
Using these sampling methods without any attention to diffusion paths
results in some redundant data which are not
related to the diffusion process.
Removing these unnecessary data decreases
the efficiency of sampling methods
No previous work considers diffusion process in the sampling strategy.
This paper presents a diffusion-aware
sampling and estimation methods
9
The first study to introduce a complete
measurement framework for diffusion
networks.
Diffusion-Aware
P2P Live Sampling
Video Streaming
and Estimation
DML
Problem Formulation
 Basic Notations and Definitions
 Problem Definition
Basic Notations & Definitions
11
Diffusion-Aware
P2P Live Sampling
Video Streaming
and Estimation
DML
Diffusion Characteristics Measurement
Element Set T: A set of diffusion network elements such as
nodes, links and cascades
L: a finite set of element labels
• Examples of a label: the degree of a node, the weight of a link,
or the length of a cascade.
Target function f: T⇾L: A label le is assigned to each element
e∊ T i.e. f = {(e, le)| e ∊ T, le ∊ L}
• Infection as an example
To measure diffusion network characteristics, some
aggregative function should be used.
• Average Function
12
Diffusion-Aware
P2P Live Sampling
Video Streaming
and Estimation
DML
Problem Definition
13
Diffusion-Aware
P2P Live Sampling
Video Streaming
and Estimation
DML
Proposed Framework
DNS: Diffusion Network Sampling
1) Sampling Design
2) Estimation Approach
Sampling Design
 Existing sampling methods such as BFS & RW do
not consider diffusion paths.
Computing the probability of spreading infections over the links of
underlying network
Directing the sampling design toward diffusion paths without any prior
knowledge about the diffusion network structures.
Covering a greater part of unknown diffusion network
Decreasing the redundant data such as nodes and links which do not
attend the diffusion process
15
Diffusion-Aware
P2P Live Sampling
Video Streaming
and Estimation
DML
Sampling Design (cont’d)
Each cascade c can be assigned to a time vector
 tc = {t1, t2, …, tn} which shows the infection times
of nodes by cascade c
CT = {t1, t2, …, tNc}: the set of cascades’ time vectors
Waiting time model[1] depends on Δ= tv – tu
P( )  e



Exponential Model
Ce: set of cascades which pass over link e
Pc (e)

c

C
The infection probability of link e Pe 
| Ce |
16
Diffusion-Aware
P2P Live Sampling
Video Streaming
and Estimation
DML
The Algorithm
 moving from a node u, to a
neighboring node v,
through an outgoing link
with the infection
probability
 Using the infection times
(i.e. CT) as local
information without any
prior knowledge about the
latent structure of the
diffusion network
17
Diffusion-Aware
P2P Live Sampling
Video Streaming
and Estimation
DML
Estimation Approach
Correcting the selection bias of a sampling method
by re-weighting of the measured values
Hansen-Hurwitz Estimator: elements are
weighted inversely proportional to their visiting
k 1
f ( xi )
probability

i  0  ( xi )
ˆ  k 1
1

i  0  ( xi )
Xi and п(Xi) are the visited element and its visiting
probability on the ith draw of sampling method.
18
Diffusion-Aware
P2P Live Sampling
Video Streaming
and Estimation
DML
Estimation Approach
To use this estimator, the probability of visiting each element in
the proposed sampling procedure should be computed.
19
Diffusion-Aware
P2P Live Sampling
Video Streaming
and Estimation
DML
Link-based Characteristics
Gaining some information without having any
connection to others for propagation will be not
valuable in a network.
Link Attendance: shows the amount of presence
in diffusion process for a link
Moving over links with the probability of Pe
п(e) = Pe
Only Using local knowledge to compute the visiting
probabilities of the links.
20
Diffusion-Aware
P2P Live Sampling
Video Streaming
and Estimation
DML
Node-based Characteristics
Examples: The number of “Seeds” (the beginners of an infection) [3] and
“participation” (the fraction of users involved in the information diffusion) [12]
Modeling the diffusion process as a Markov random walk over the underlying
network G [18]
Where N (u) is the set of node u’ neighbors
Calculating п = {п(0), п(1),…, п(n-1)} needs the global knowledge of a network
(the stationary distribution of the mentioned Markov chain)
Not having the global view of the network in sampling procedure
Finding the exact value of п is not possible in real systems.
21
Diffusion-Aware
P2P Live Sampling
Video Streaming
and Estimation
DML
Cascade-based Characteristics
Cascades: the building blocks of a diffusion network
Depth of a spreading phenomenon can be determined
by the length of its cascades
A cascade visiting probability depends on visiting all
of its links:
Calculating п(c) needs having all links of cascade c
which is a global knowledge.
22
Diffusion-Aware
P2P Live Sampling
Video Streaming
and Estimation
DML
Experimental Evaluation
• Setup
• Dataset
• Performance Evaluation
•Diffusion Behavior Analysis
23
Setup
Characteristic: Link-Attendance
Baseline methods: BFS and RW
⍺: Control the speed of cascade
transmission
β: control the distance through which a
cascade can propagates
Diffusion rate (δ): The fraction of the
underlying network G which is covered
by the diffusion network G*
24
Diffusion-Aware
P2P Live Sampling
Video Streaming
and Estimation
DML
Dataset
 Synthetic Networks
 Real Networks
Table 1: The network and cascade generation parameters
25
Diffusion-Aware
P2P Live Sampling
Video Streaming
and Estimation
DML
Speed of Cascade
The unknown structure of the diffusion network
structure
unavailability of the cascade speed
0.4 ≤⍺≤ 0.7
Figure 2: Speed of cascade effect of link-attendance bias
26
Diffusion-Aware
P2P Live Sampling
Video Streaming
and Estimation
DML
Performance Evaluation
Figure 3: Link Attendance characteristic evaluation in different sampling rates
27
Diffusion-Aware
P2P Live Sampling
Video Streaming
and Estimation
DML
Performance Evaluation
Decreasing the bias by 30% in average by the DNS
framework in comparison to the sampling design
without estimation approach
Table 2: The average performance difference of DNS with BFS,
RW & DNS-WoE
28
Diffusion-Aware
P2P Live Sampling
Video Streaming
and Estimation
DML
Diffusion Behavior Analysis
Figure 4: Analysis of diffusion rate over sampling frameworks
29
Diffusion-Aware
P2P Live Sampling
Video Streaming
and Estimation
DML
Contributions
Proposing a novel sampling design for gathering
data from a diffusion network by utilizing the
properties of diffusion process.
DNS
Framework
Proposing three estimators for correcting the
bias of sampled data:
link-based, node-based, and cascade-based
characteristics
Decreasing the bias of measuring link-based
characteristics compared to the other common
sampling methods
30
Diffusion
P2P LiveNetwork
Video Streaming
Extraction
DML
Conclusion & Future Work…
The proposed method outperforms BFS and RW in
terms of link-attendance by about 37% and 35%,
respectively
The proposed estimator can improve the
performance of the sampling design by about 30%
DNS can act very well even in low sampling rates
DNS leads to low bias even in low diffusion rates.
31
Diffusion
P2P LiveNetwork
Video Streaming
Extraction
DML
References
[1] M. Gomez-Rodriguez, J. Leskovec and A. Krause, Inferring networks of diffusion and influence, In proc. of KDD ’10, pages 1019-1028, 2010.
[2] Twitter Blog: ̸ = numbers. Blog.twitter.com., Retrieved 2012-01-20, http://blog.twitter.com/2011/03/numbers.html.
[3] M. Eslami, H.R. Rabiee and M.Salehi, Sampling from Information Diffusion Networks, 2012.
[4] M. Gjoka, M. Kurant, C. T. Butts and A. Markopoulou, Walking in Facebook: A Case Study of Unbiased Sampling of OSNs, Proceedings
of IEEE INFOCOM, 2010.
[5] M. Gjoka, M. Kurant, C. T. Butts and A. Markopoulou, Practical Recommendations on Crawling Online Social Networks, IEEE J. Sel.
Areas Commun, 2011.
[6] M. Salehi, H. R. Rabiee, N. Nabavi and Sh. Pooya, Characterizing Twitter with Respondent-Driven Sampling, International Workshop on Cloud and
Social Networking (CSN2011) in conjunction with SCA2011, No. 9, Vol. 29, pages 5521–5529, 2011.
[7] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel and B. Bhattachar-jee, Measurement and analysis of online social networks, Proceedings of
the ACM SIGCOMM conference on Internet measurement, pages 29–42, 2007.
[8] J. Leskovec and Ch. Faloutsos, Sampling from large graphs, Proceedings of the ACM SIGKDD conference on Knowledge discovery and data
mining, pages 631–636, 2006.
[9] M. Salehi, H. R. Rabiee, and A. Rajabi, Sampling from Complex Networks with high Community Structures, Chaos: An Interdisciplinary Journal of
Nonlinear Science , 2012.
[10] J. Leskovec, M. McGlohon, C. Faloutsos, N. S. Glance and M. Hurst, Patterns of Cascading Behavior in Large Blog Graphs, In proc. of
SDM’07, 2007.
[11] D. Liben-Nowell and J. Kleinberg, Tracing information flow on a global scale using Internet chain-letter data, Proc. of the National Academy of
Sciences, 105(12):4633-4638, 25 Mar, 2008.
[12] M. D. Choudhury, Y. Lin, H. Sundaram, K. S. Candan, L. Xie and A.Kelliher, How Does the Data Sampling Strategy Impact the Discovery of
Information Diffusion in Social Media?, Proc. of ICWSM , 2010.
32
Extracting
P2P Live
Network
Video Streaming
of Information
DML
References(cont’d)
[13] E. Sadikov, M. Medina, J. Leskovec and H. Garcia-Molina, Correcting for missing data in information cascades, WSDM, pages 55-64, 2011.
[14] D. Gruhl, R. Guha, D. Liben-Nowell and A. Tomkins, Information dif-fusion through blogspace, In proc. of of the 13th international conference
on World Wide Web, pages 491–501, 2004.
[15] G. Kossinets, J. M. Kleinberg and D.J. Watts, The structure of information pathways in a social communication network, KDD ’08,
pages 435-443. 2008.
[16] J. Leskovec, L. Backstrom and J. Kleinberg, Meme-tracking and the dynamics of the news cycle, KDD ’09: Proc. of the 15th ACM SIGKDD
international conference on Knowledge discovery and data mining, pages 497-506, 2009.
[17] M.Gomez-Rodriguez, D. Balduzzi and B.Scholkopf, Uncovering the Temporal Dynamics of Diffusion Networks, Proc. of the 28 th International
Conference on Machine Learning, Bellevue,WA, USA, 2011.
[18] M. Eslami, H. R. Rabiee and M. Salehi, DNE: A Method for Extracting Cascaded Diffusion Networks from Social Networks, IEEE Social Computing
Proceedings, 2011.
[19] F. Chierichetti, J. Kleinberg and D. Liben-Nowell, Reconstructing Pat-terns of Information Diffusion from Incomplete Observations, NIPS 2011.
[20] C.X. Lin, Q. Mei, Y.Jiang, J. Han and S. Qi, Inferring the Diffusion and Evolution of Topics in Social Communities, SNA KDD, 2011.
[21] Ch. Wilson, B. Boe, A. Sala, K. P. N Puttaswamy and B. Y. Zhao, User interactions in social networks and their implications, EuroSys ’09:
Proceedings of the 4th ACM European conference on Computer systems, pages 205–218, 2009.
[22] M. Kurant, A. Markopoulou and P. Thiran, On the bias of BFS (Breadth First Search), 22nd IEEE International Teletraffic Congress (ITC), pages
1–8, 2010.
[23] L. Lovas, Random walks on graphs: a survey, Combinatorics, 1993.
[24] M.R Henzinger, A. Heydon,M. Mitzenmacher and M. Najork, On near-uniform URL sampling, Proceedings of the World Wide Web conference
on Computer networks, pages 295-308, 2000.
[25] J. Leskovec and C. Faloutsos, Scalable modeling of real graphs using Kronecker multiplication, Proc. of ICML, pages 497-504, 2007.
33
Extracting
P2P Live
Network
Video Streaming
of Information
DML
References(cont’d)
[26] J. Leskovec, J. Kleinberg and C. Faloutsos, Graphs over Time: Densi-cation Laws, Shrinking Diameters and Possible Explanations, Proc. Of KDD ,
2005.
[27] P.Erds and A. Rnyi, On the evolution of random graphs, Publ. Math. Inst. Hung. Acad. Sci., 5: page 17, 1960.
[28] A. Clauset, C. Moore and M. E. J. Newman, Hierarchical structure and the prediction of missing links in networks, Nature, 453: pages 98-101, 2008.
[29] J. Leskovec, K.J. Lang, A. Dasgupta and M.W. Mahoney, Statistical properties of community structure in large social and information networks,
WWW, pages 695-704, 2008.
[30] L.A. Adamic and N. Glance, The political blogosphere and the 2004 US Election, Proc. of the WWW-2005 Workshop on the Weblogging Ecosystem,
2005.
[31] M. E. J. Newman, Finding community structure in networks using the eigenvectors of matrices, Preprint physics/0605087, 2006.
[32] S.A. Myers and J. Leskovec, On the Convexity of Latent Social Network Inference, Advances in Neural Infromation Processing Systems, 2010.
[33] Ch. Gkantsidis, M. Mihail and A. Saberi, Random walks in peer-to-peer networks: algorithms and evaluation, Elsevier Science Publishers B. V.,
Performance Evaluation, P2P Computing Systems, Vol 63, pages 241–263, 2006.
[34] D. Stutzbach, R. Rejaie, N. Duffield,S. Sen and W. Willinger, On Unbiased Sampling for Unstructured Peer-to-Peer Networks, Proceedings of
IMC,pages 27–40, 2008.
[35] L. Becchetti, C. Castillo, D. Donato and A.Fazzone, On the bias of BFS (Breadth First Search), LinkKDD, pages 1–8, 2006.
[36] J. Yang and J. Leskovec, Modeling Information Diffusion in Implicit Networks, ICDM, IEEE Computer Society, pages 599-608, 2010.
[37] D. Kempe, J. Kleinberg and E. Tardos, Maximizing the spread of influence through a social network, KDD ’03: Proc. of the ninth ACM
SIGKDD international conference on Knowledge discovery and data mining, ACM Press, pages 137-146, 2003.
[38] M. Hansen and W. Hurwitz, On the Theory of Sampling from Finite Populations, Annals of Mathematical Statistics, No. 3, Vol 14, 1943.
[39] E. Volz and D. Heckathorn, Probability based estimation theory for respondent-driven sampling, Official Statistics, pages 79, Vol 24, 2008.
[40] M. Girvan and M. E. J. Newman, Community structure in social and biological networks, Proc. Natl. Acad. Sci. USA 99, pages 7821-7826,
2002.
34
Extracting
P2P Live
Network
Video Streaming
of Information
DML
Q&A
Related documents