Big Data Processing in Social Networks
Ming-Syan Chen (陳銘憲)
Research Center for Information Technology Innovation, Academia Sinica
September 2, 2014
A Few Words before the Talk
• Big Data is one of the most popular topics worldwide these days
  ◦ The number of attendees at KDD doubled this year
• The talk materials come from (1) my prior talks (keynotes/invited talks at PAKDD14, WAIM13, KDD12) and (2) my recent research work, so they are probably subjective
M.-S. Chen
Outline
• Walkthrough on Big Data
• Information Extraction from a Social Network Graph
• Issues to Address
The Era of Big Data is Coming
• From "worldwide cloud fever" to "the big data era"!
• Big data is high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery, and process optimization (Gartner)
  ◦ Large amounts of heterogeneous data, accumulating rapidly
• With unclear veracity
• Source of intelligence (value)
What Happens in an Internet Minute
Source: Intel, http://www.intel.com/content/www/us/en/communications/internet-minute-infographic.html
Big data happens in every minute
• 639,800 GB of global IP data transferred
• 204 million emails sent
• Flickr
  ◦ 3,000 photos uploaded
  ◦ 20 million photo views
• YouTube
  ◦ 30 hours of video uploaded
  ◦ 1.3 million video views
• LinkedIn
  ◦ 100+ new accounts
• Twitter
  ◦ 320+ new Twitter accounts
  ◦ 100,000 new tweets
• Facebook
  ◦ 6 million views
  ◦ 277,700 logins
• Google
  ◦ 2+ million search queries
Data Amount fueled by SN Activities
• Twitter
  ◦ 150+ million members
  ◦ 50 million tweets per day
• Facebook
  ◦ One billion users
• Amazon Co-purchasing Network
  ◦ Half a million product nodes
  ◦ Several million recommendation links
• Web Pages
  ◦ Yahoo!: over one billion Web pages
(Figures from twitter.com and Amazon, from SNSP)
Example of Big Data and Social Network
Volume: thousands of people!
Velocity: fast accumulated!!
Variety: eating different food!!!
Example of Big Data and Social Network
For some gossip on this occasion, veracity is an issue and the information value could be low.
• "Mr. Lin won the lottery!"
• "Mrs. Chang just had a facelift!"
Some Views on Big Data
• Big data white paper: "Challenges and Opportunities with Big Data"
  ◦ By researchers at major universities and IT companies in the US
  ◦ http://www.cra.org/ccc/files/docs/init/bigdatawhitepaper.pdf
• McKinsey: "Big data: The next frontier for innovation, competition, and productivity"
  ◦ http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation
• NYTimes: "The age of Big Data" (potential use and cost)
  ◦ http://www.nytimes.com/2012/02/12/sunday-review/bigdatas-impact-in-the-world.html?pagewanted=all&_r=0
Views on Big Data (cont'd)
• IBM (platform, technology and applications)
  ◦ http://www-01.ibm.com/software/data/bigdata/
• Microsoft: "Perspective from the fourth paradigm for scientific discovery"
  ◦ http://research.microsoft.com/enus/collaboration/fourthparadigm/4th_paradigm_book_complete_lr.pdf
• VMware (platform and system architecture)
  ◦ http://blogs.vmware.com/vfabric/2012/08/4-key-architectureconsiderations-for-big-data-analytics.html
• More (from SAS, Intel, Oracle, etc., online)
So, is the Notion of Big Data New?
• Depends on whom you ask
• In fact, when more funds are available for big data issues, people jump out to claim to be big data people
  ◦ One term, with everyone giving their own interpretation
• If we read the Big Data white paper from the US, its scope is quite close to that of data mining
  ◦ Of course, this is not considered a consensus here
Similar Rationale behind Data Mining and Big Data
• Knowledge discovery from a huge amount of data
  ◦ Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
• In line with the technology trend!
  ◦ HW, storage, CPU MIPS, network BW, cloud, etc.
• Intelligence and personalization will be key for differentiation
Characteristics of Big Data
Knowledge discovered from big data:
• Improves decision quality, optimizes processes, and yields insights in general (tied to the domain)
• Is usually not considered an isolated business sector; not analogous to oil
• Is slightly different from traditional business intelligence (BI)
  ◦ BI: more on data with high information density
  ◦ Big data: more on data with low information density; more application oriented
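Example Big Data Applications in Various Business Sectors
• Finance and insurance: credit rating, customized financial services, credit granting, client asset management, bad-debt analysis, moral hazard analysis, adverse selection risk analysis, prospective customer list analysis (credit analysis, insurance policy, etc.)
• Retail (including e-commerce): a basis for real-time purchase decision support (via proper recommendation), plus decision support systems for integrating and allocating goods, shelf space, and logistics (e.g., 7-11) (EC is an emerging area!)
• Manufacturing: expert decision support systems for determining optimal production factors during production, plus optimized inventory control and supply chain and customer profitability analysis
• Chain stores: selecting sites for new stores and product items for each branch, a decision aid for locating logistics warehouses, and a basis for allocating logistics capacity (e.g., McDonald's, etc.)
• Healthcare: cost-driver analysis for managing medical operation costs; a source for medical analysis or personalized patient services
• Telecom: optimized network traffic allocation and customized services, real-time online customized information assistance systems, customized portals and promotion support; operation analysis (e.g., alarm system analysis) due to system scale
• Biotech: R&D platforms and the required analysis tools, accelerating the buildup of R&D capability (genome analysis)
• Education: analysis of prospective student lists, data mining for rating admission and scholarship applications, and a basis for students' course and career planning (e.g., MOOC)
• Advertising: ad click source analysis, response rate analysis, marketing strategy (augmented with LBS in mobile devices)
• Non-profit organizations: building contact lists for fundraising letters and correspondence (including SN analysis)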
Some More Words on Big Data
• Primary sources of big data:
  ◦ Social network activities
  ◦ Internet of Things (i.e., from sensor networks)
  ◦ Multimedia (mainly video)
• New methods are required to overcome the new challenges imposed by big data
  ◦ Streaming data, unstructured data, data from various sources, etc.
  ◦ Traditional RDBs cannot handle these efficiently
Tool:
source: http://www.bigdata-startups.com/open-source-tools/
Now, Big Data in a Social Network
• A social network is usually composed of millions of nodes and links (homogeneous or heterogeneous)
• The huge (volume), fast-changing (velocity), and diversified (variety) information in a social network poses very challenging issues for researchers to manage and analyze
(Figure from twitter.com)
Outline
• Walkthrough on Big Data
• Information Extraction from an SN Graph
• Issues to Address
(In this part, we use examples to illustrate the concepts. Those interested in technical details are referred to the related publications.)
Graph Extraction
• To handle complicated things with simple skills (執簡御繁)
• Application/goal-oriented data extraction
• Three levels of information extraction from SNs
  ◦ Parameter extraction (e.g., company stat.): fast calculation of closeness centrality (ICDM13)
  ◦ Feature extraction (e.g., company biz.): activity willingness optimization (VLDB14)
  ◦ Structure extraction (e.g., company org.): decomposing SN graphs (Asonam14)
[Figure: the three extraction levels, labeled parameter extraction, structure extraction, and feature extraction (regarding capability); one item is labeled "weapon".]
Outline
• Walkthrough on Big Data
• Information Extraction from an SN Graph
  ◦ Capturing key parameters (parameter extraction)
  ◦ Activity willingness optimization (feature extraction)
  ◦ Decomposing SN graphs (structure extraction)
• Issues to Address
Closeness centrality
• SN graphs have several interesting quantities, including closeness centrality, network diameter, and degree distribution.
• Closeness centrality of node v, Cc(v): the inverse of the average shortest-path distance from v to any other node in the network.
• If Cc(v) is large, v is near the center, as it requires only a few hops to reach the others.
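The definition above can be sketched with a plain breadth-first search. This is a minimal illustration, not the fast algorithm from the talk; the adjacency-dict representation, the function name `closeness_centrality`, and the 5-node path graph are assumptions made for the example, and the graph is assumed to be connected, undirected, and unweighted.

```python
from collections import deque

def closeness_centrality(adj, v):
    """Cc(v) = (|V| - 1) / (sum of shortest-path hop counts from v).

    Assumes `adj` is a connected, undirected, unweighted graph given as
    a dict mapping each node to an iterable of its neighbours.
    """
    dist = {v: 0}
    queue = deque([v])
    while queue:                      # plain breadth-first search
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return (len(adj) - 1) / sum(dist.values())

# Hypothetical 5-node path graph: 0 - 1 - 2 - 3 - 4
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
centre = closeness_centrality(path, 2)   # hop counts 2, 1, 1, 2 -> 4/6
end = closeness_centrality(path, 0)      # hop counts 1, 2, 3, 4 -> 4/10
```

The middle node scores higher than the end node, matching the intuition that a large Cc(v) means v sits near the center.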
Response to Dynamic Changes
• Edge insertions and deletions are frequent in a social network
  ◦ It is desirable to quickly update the closeness centrality of every node in response to an edge insertion/deletion.
• Example use: pick a number of people (the nodes with high CCs) who can maximize advertisement effectiveness.
Example of Closeness Centrality
Cc(v): the inverse of the average shortest-path distance from v to the other nodes
$$C_c(v) = \frac{|V| - 1}{\sum_{u \in V} |p(v,u)|}$$
$$C_c(v) = \frac{14 - 1}{1 \cdot 4 + 2 \cdot 2 + 3 \cdot 1 + 4 \cdot 1 + 5 \cdot 2 + 6 \cdot 2 + 7 \cdot 1} = \frac{13}{44}$$
$$C_c(w) = \frac{14 - 1}{1 \cdot 3 + 2 \cdot 4 + 3 \cdot 4 + 4 \cdot 2} = \frac{13}{31}$$
Thus, node w is closer to all other nodes than node v is.
[Figure: an unweighted and undirected graph G with 14 nodes and 18 edges.]
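The arithmetic in this example can be checked directly. The sketch below assumes the denominators are read as (hop distance) × (number of nodes at that distance); the helper name `closeness_from_hops` is made up for illustration.

```python
from fractions import Fraction

def closeness_from_hops(hop_counts, n_nodes):
    """hop_counts maps a hop distance to the number of nodes at that distance."""
    total = sum(d * count for d, count in hop_counts.items())
    return Fraction(n_nodes - 1, total)

# Node v: 4 nodes at 1 hop, 2 at 2 hops, 1 at 3, 1 at 4, 2 at 5, 2 at 6, 1 at 7.
cc_v = closeness_from_hops({1: 4, 2: 2, 3: 1, 4: 1, 5: 2, 6: 2, 7: 1}, 14)
# Node w: 3 nodes at 1 hop, 4 at 2 hops, 4 at 3, 2 at 4.
cc_w = closeness_from_hops({1: 3, 2: 4, 3: 4, 4: 2}, 14)
print(cc_v, cc_w, cc_w > cc_v)   # 13/44 13/31 True
```

In both cases the node counts add up to 13, the number of other nodes in the 14-node graph.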
Calculating Closeness Centrality
• Note that only some pairs of shortest paths are affected by a given edge change.
• Identify them (the unstable node pairs) for fast calculation of CC
Example
For example, with the addition of edge (a,b):
• Unchanged shortest paths
  ◦ p(b,v), p(c,t), p(r,h), etc.
• Changed shortest paths
  ◦ Before edge insertion: p(a,b)={a,d,w,b}, p(a,c)={a,d,w,r,c}, p(u,v)={u,l,o,d,w,r,s,v}, etc.
  ◦ After edge insertion (we then call these node pairs unstable): p(a,b)={a,b}, p(a,c)={a,b,c}, p(u,v)={u,x,a,b,c,v}, etc.
[Figure: (a) the original unweighted and undirected graph G; (b) G’ = G ∪ e(a,b).]
Illustration of Unstable Node Pairs
• To find V’u, the u-unstable node set: the nodes whose shortest paths to u changed after the edge addition
  ◦ That is, the nodes whose shortest distance to u changes
• Unstable node pairs: (u,b), (u,c), (u,h), (u,v), and (u,t)
• V’u = {b,c,h,v,t}
[Figure: Gu and G’u.]
(Main Theorem) After the addition of edge (a,b), every unstable node pair {v,u} (one whose shortest path changed) will have v ∈ V’a and u ∈ V’b.
[Figure: node sets V’a and V’b. Only the shortest paths between them will change after the edge addition (and need to be recalculated).]
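A brute-force check of this statement on a toy graph may help. The sketch below is not the algorithm from the talk: it treats a pair as unstable when its shortest-path length changes (a simplification of "shortest path changed"), and the helper names and the 6-node path graph are assumptions for illustration.

```python
from collections import deque
from itertools import combinations

def bfs(adj, s):
    """Hop distances from s in an unweighted graph."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist

def with_edge(adj, a, b):
    """Copy of the graph with the undirected edge (a, b) added."""
    new = {k: set(v) for k, v in adj.items()}
    new[a].add(b)
    new[b].add(a)
    return new

def unstable_set(adj, adj_new, s):
    """Nodes whose distance to s differs between the two graphs."""
    before, after = bfs(adj, s), bfs(adj_new, s)
    return {x for x in adj if before[x] != after[x]}

# Hypothetical path graph 0-1-2-3-4-5, then insert edge (a, b) = (0, 5).
g = {i: set() for i in range(6)}
for i in range(5):
    g[i].add(i + 1)
    g[i + 1].add(i)
a, b = 0, 5
g2 = with_edge(g, a, b)

va = unstable_set(g, g2, a)   # V'a
vb = unstable_set(g, g2, b)   # V'b

# Ground truth by recomputing all pairs before and after the insertion.
old = {s: bfs(g, s) for s in g}
new = {s: bfs(g2, s) for s in g2}
unstable_pairs = {(x, y) for x, y in combinations(sorted(g), 2)
                  if old[x][y] != new[x][y]}

# Every unstable pair has one endpoint in V'a and the other in V'b.
covered = all((x in va and y in vb) or (x in vb and y in va)
              for x, y in unstable_pairs)
```

Here only 3 of the 15 node pairs are unstable, and all of them lie between V’a and V’b, so two BFS runs per endpoint are enough to narrow down what must be recomputed.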
Remark
• Experiments were done with Hadoop (MapReduce) on the DBLP dataset
• With fast calculation of closeness centrality, shortest-path-preserving sparsification can be done efficiently by identifying the edges whose removal least affects CC.
• New algorithms are called for to efficiently calculate other key parameters in fast-changing social networks
Outline
• Walkthrough on Big Data
• Information Extraction from an SN Graph
  ◦ Capturing key parameters (parameter extraction)
  ◦ Activity willingness optimization (feature extraction)
  ◦ Decomposing SN graphs (structure extraction)
• Issues to Address
Evolution of Activity Formation
• Information extracted has been shown to be helpful for activity formation in social networks
• Socio-Spatial Group Query [Yang et al., KDD-12]
  ◦ Considering time, social, and spatial factors
• As more and more information can be mined from a social network, we can take user interest (i.e., willingness) into consideration when planning an activity [Shuai et al., VLDB-14]
[Figure: a schedule table for eight people (Mike Lee, Tony Wang, Peter Chen, Jack Lin, Jane Lee, Grace Yang, John Chen, Mary Fang) over time slots ts1 to ts6, with cells marked "O".]
What Can be Done Further?
Time + Social + Spatial (Heterogeneous SN)
[Figure: speech bubbles in which a user finds a restaurant with a good buy-2-get-2-free deal for lunch and wants to ask some friends to come along.]
Implementation of SSGQ
[Screenshot: query inputs are group size, activity location, and familiarity constraint.]
Implementation of SSGQ (cont'd)
[Screenshot: the selected group and the attendees' current locations.]
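To make the three query inputs concrete, here is a brute-force sketch of a group query. The slides name only the inputs (group size, activity location, familiarity constraint); the objective (minimize total distance to the activity location) and the constraint form (each member may be unacquainted with at most k other members) are assumptions made for this illustration, as are the function name and the toy data. It is not the SSGQ implementation shown in the talk.

```python
from itertools import combinations
from math import dist

def group_query(people, friends, location, size, k):
    """Return the feasible group of `size` people with the least total distance.

    people:  dict name -> (x, y) current position
    friends: dict name -> set of acquainted names
    A group is feasible if every member is unacquainted with at most k
    other members (assumed form of the familiarity constraint).
    """
    best, best_cost = None, float("inf")
    for group in combinations(sorted(people), size):
        feasible = all(
            sum(1 for q in group if q != p and q not in friends[p]) <= k
            for p in group
        )
        cost = sum(dist(people[p], location) for p in group)
        if feasible and cost < best_cost:
            best, best_cost = set(group), cost
    return best

# Hypothetical data: A, B, C are near the venue; D is far away.
people = {"A": (0, 0), "B": (1, 0), "C": (0, 1), "D": (5, 5)}
friends = {"A": {"B", "C", "D"}, "B": {"A", "D"}, "C": {"A"}, "D": {"A", "B"}}

strict = group_query(people, friends, (0, 0), size=3, k=0)
relaxed = group_query(people, friends, (0, 0), size=3, k=1)
```

With k=0 the nearest trio {A, B, C} is rejected because B and C do not know each other, so the farther but mutually acquainted {A, B, D} is chosen; relaxing to k=1 admits {A, B, C}. Enumerating all groups is exponential in the group size, so this only works for toy inputs.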
Ongoing Experiments on Facebook
(with willingness considered)
Outline
• Walkthrough on Big Data
• Information Extraction from an SN Graph
  ◦ Capturing key parameters (parameter extraction)
  ◦ Activity willingness optimization (feature extraction)
  ◦ Decomposing SN graphs (structure extraction)
• Issues to Address
Diffusion Analysis in Social Networks
• Diffusion of information can be used to model the interaction among nodes in a network, e.g.,
  ◦ Viruses spreading over the Internet
  ◦ Diseases spreading in a community
  ◦ Rumors/news spreading among people
Example Diffusion
• Information diffusion can happen in social networks such as Facebook and Twitter.
[Figure: an underlying network with nodes 0 to 3 and a path of infection through it.]
The Network is Hidden
• In some situations, the underlying network is not known (due to cost or privacy issues).
• The network inference problem (NIP) is studied to discover the underlying network
  ◦ That is, to infer the network from what happened.
Network Inference Problem
[Figure: inferring the network among nodes 0, 1, and 2.]
Clustering Cascades
• Traditionally, NIP assumes there is one underlying network, which may not always be true in reality
  ◦ e.g., sports news, political news, and entertainment news are likely to spread in different ways
• Hence, we would like to cluster cascades so that the cascades in each cluster spread in the same pattern
  ◦ An SN graph is hence decomposed into application-specific ones
Example Cascades
[Figure: six cascades over the network nodes: cascade a (Lakers news), cascade b (49ers news), cascade c (Redskins news), cascade d (Heat news), cascade e (Jets news), cascade f (Celtics news).]
To Model Inference Network
(as before)
Possible Inference Network
(obtained by traditional method)
[Figure: a single dense inferred network with edge weights 0.25, 0.5, 0.5, 0.17, 0.5, 0.67, 0.25, 0.67, 0.5, 0.17, 0.25.]
To Cluster Cascades by K-Means
Graph Decomposition
• By considering cascades {a, d, f} and cascades {b, c, e} independently (based on which nodes are infected), the original SN graph is decomposed in accordance with the information carried.
[Figure: two decomposed networks, one for cascades {b, c, e} (NFL) and one for cascades {a, d, f} (NBA), with edge weights 0.25, 0.5, 0.5, 0.17, 0.5, 0.67, 0.67, 0.5, 0.33, 0.5, 0.17.]
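The clustering step can be sketched as follows, assuming each cascade is represented by a 0/1 vector of which nodes it infected. The cascade contents, the tiny hand-written k-means, and its deterministic farthest-point initialization are all assumptions for illustration; the slides state only that K-Means is applied based on which nodes are infected.

```python
def sqdist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

def kmeans(points, k, iters=20):
    """Tiny k-means with a deterministic farthest-point initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sqdist(p, c) for c in centroids)))
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sqdist(p, centroids[j]))
                  for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            updated.append([sum(col) / len(members) for col in zip(*members)]
                           if members else centroids[j])
        if updated == centroids:   # assignments have stabilized
            break
        centroids = updated
    return labels

# Hypothetical cascades: the set of nodes (0..5) each news item infected.
cascades = {
    "a": {0, 1, 2}, "b": {3, 4, 5}, "c": {3, 4},
    "d": {0, 1},    "e": {4, 5},    "f": {0, 2},
}
names = sorted(cascades)
vectors = [[1 if n in cascades[name] else 0 for n in range(6)] for name in names]

labels = kmeans(vectors, k=2)
clusters = [[name for name, lab in zip(names, labels) if lab == j]
            for j in range(2)]
```

On this toy input the cascades split into {a, d, f} and {b, c, e}; a separate network could then be inferred from each cluster rather than one dense network from all six.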
Remark
• Traditionally, NIP results in a dense and complex network, from which it is difficult to capture knowledge.
• By properly clustering cascades, we can obtain a few concise networks that carry clearer information
  ◦ These resulting networks better match the corresponding cascades than a single dense network does.
Outline
• Walkthrough on Big Data
• Information Extraction from an SN Graph
  ◦ Capturing key parameters (parameter extraction)
  ◦ Activity willingness optimization (feature extraction)
  ◦ Decomposing SN graphs (structure extraction)
• Issues to Address
Issues to Address
• Issues which either occur uniquely, or will become more prevalent, in social networks
  ◦ To discuss those from the perspective of (1) users, (2) events, (3) time, (4) platform, and (5) data
Issues to Address (1st, on Users)
• From collaborative filtering to social filtering
  ◦ Traditional collaborative filtering (CF) is used in recommendation systems.
  ◦ Recently, with the prosperity of social network sites, social filtering (SF) has become more prevalent.
• The social network services required will be very user-dependent and human-centric
Use CF for Recommendation
[Figure: items are recommended to a user based on similar users.]
Use SF for Recommendation
(i.e., letting your friends decide)
[Figure: items are recommended to a user based on friends, e.g., a friend saying "This cake is AWESOME!"]
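The contrast between the two approaches can be sketched in a few lines. This is a deliberately simplified illustration: CF is reduced to "copy from the single most similar user by Jaccard similarity", SF to "take what friends like", and all names and data are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets of liked items."""
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend_cf(user, likes):
    """CF: suggest items liked by the most similar other user."""
    peer = max((u for u in likes if u != user),
               key=lambda u: jaccard(likes[user], likes[u]))
    return likes[peer] - likes[user]

def recommend_sf(user, likes, friends):
    """SF: suggest items liked by the user's friends."""
    liked_by_friends = set().union(*(likes[f] for f in friends[user]))
    return liked_by_friends - likes[user]

# Hypothetical data: Bob has tastes similar to Ann; Cat is Ann's friend.
likes = {
    "ann": {"tea", "scone"},
    "bob": {"tea", "scone", "jam"},
    "cat": {"cake"},
}
friends = {"ann": ["cat"], "bob": [], "cat": ["ann"]}

cf_items = recommend_cf("ann", likes)            # from similar user Bob
sf_items = recommend_sf("ann", likes, friends)   # from friend Cat
```

CF recommends jam because a stranger with similar taste likes it, while SF recommends cake because a friend does, even though the friend's tastes are otherwise dissimilar.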
Issues to Address (2nd, on Events)
• Bridging real and virtual lives
  ◦ e.g., construction of a weighted SR graph
• Mismatch in confidence level
  ◦ The confidence level of the social relationship discovered might not be high (quite subjective and ad hoc)
  ◦ e.g., reading the same book (1 pt), having lunch together (2 pts), going to a movie together (3 pts), etc.
  ◦ However, proper weighting may vary from one person to another
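The point-based weighting above can be sketched as a small graph builder. The event format, the names, and the function `build_sr_graph` are hypothetical; the point values are the ones given on the slide, and the `points` parameter reflects the remark that proper weighting may differ per person or application.

```python
from collections import defaultdict
from itertools import combinations

# Point values taken from the slide's example.
POINTS = {"same_book": 1, "lunch": 2, "movie": 3}

def build_sr_graph(events, points=POINTS):
    """Sum event points over every pair of participants.

    events: list of (event_kind, participants) tuples.
    Returns a dict mapping an alphabetically ordered pair to its weight.
    """
    weights = defaultdict(int)
    for kind, participants in events:
        for u, v in combinations(sorted(participants), 2):
            weights[(u, v)] += points[kind]
    return dict(weights)

# Hypothetical observed events.
events = [
    ("lunch", ["Tom", "Amy"]),
    ("movie", ["Tom", "Amy", "Bob"]),
    ("same_book", ["Amy", "Bob"]),
]
graph = build_sr_graph(events)
```

Passing a different `points` table changes every edge weight, which is exactly why the confidence in any single weighted graph is limited.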
Issues to Address (3rd, on Time)
• Stream mining for real-time decisions (no single snapshot)
  ◦ "Of all the martial arts in the world, only speed cannot be beaten."
• Not only summarize the social information, but also find the trend of evolution (2nd-order mining)
  ◦ Mining on summarized data
  ◦ e.g., not just discovering what Tom's favorite song is, but learning that Tom changes his favorite every 3 months
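The favorite-song example can be illustrated with a small sketch that mines the summarized data (one favorite per month) for how often it changes. The monthly snapshot format, the function name, and the song labels are assumptions for illustration.

```python
def change_intervals(monthly_favorites):
    """Lengths, in months, of each run with an unchanged favorite.

    Assumes a non-empty list with one entry per month.
    """
    runs, length = [], 1
    for prev, cur in zip(monthly_favorites, monthly_favorites[1:]):
        if cur == prev:
            length += 1
        else:
            runs.append(length)
            length = 1
    runs.append(length)
    return runs

# Hypothetical summary: Tom's favorite song in each of nine months.
favorites = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]

current = favorites[-1]                 # first-order answer: the favorite now
runs = change_intervals(favorites)      # second-order: how long favorites last
average_months = sum(runs) / len(runs)
```

The first-order result says only that the current favorite is "C"; the second-order result says a favorite lasts about three months, which supports predicting when the next change will happen.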
Issues to Address (4th, on Platform)
• With the availability of mobile devices and the paradigm shift to cloud computing, everyone will have 1Gb for comm., unlimited storage, and access to data sources worldwide
  ◦ Leading to the new era of "superman" (with different ways of thinking and doing things)
  ◦ There will be an even faster increase in the variety of social network activities, in particular those related to LBS
Issues to Address (5th, on Big Data)
• To process big data, i.e., a huge volume of fast-increasing (velocity) data of different types (variety) with unclear veracity and domain-dependent value
• To integrate different data sources
  ◦ e.g., locations where photos were shot, user purchase behavior, and the SNs he/she is involved in
• Objective: Volume, Velocity
• Subjective: Variety, Veracity, Value
Other Important Issues
• Mining-assisted social media content management
  ◦ Services with more intelligence are required
• Privacy preservation in social information processing
• …more
Conclusion
• Due to the paradigm shift to cloud computing and the fast increase in the availability of mobile devices, big data processing in social networks is having an unprecedented impact on our lives
• Key factors for the arrival of the big data era:
  ◦ Mobile, social network, and cloud
Thank you!