Big Data Processing in Social Networks
陳銘憲 (Ming-Syan Chen)
Research Center for Information Technology Innovation, Academia Sinica
September 2, 2014

A Few Words before the Talk
• Big Data is one of the most popular topics worldwide these days.
• The number of attendees at KDD doubled this year.
• The talk materials come from (1) my prior talks (keynotes and invited talks at PAKDD14, WAIM13, and KDD12) and (2) my recent research work, so they are probably subjective.
M.-S. Chen

Outline
• Walkthrough on Big Data
• Information Extraction from a Social Network Graph
• Issues to Address

The Era of Big Data is Coming
• From "the whole world going crazy for the cloud" to "the era of big data"!
• Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization (Gartner).
• Rapidly accumulating, large-volume, heterogeneous data
• With unclear veracity
• Source of intelligence (value)

What Happens in an Internet Minute
Source: Intel, http://www.intel.com/content/www/us/en/communications/internet-minute-infographic.html
Big data happens in every minute:
• Global IP traffic: 639,800 GB of data transferred
• Email: 204 million emails sent
• Flickr: 3,000 photos uploaded; 20 million photo views
• YouTube: 30 hours of video uploaded; 1.3 million video views
• LinkedIn: 100+ new accounts
• Twitter: 320+ new accounts; 100,000 new tweets
• Facebook: 6 million views; 277,700 logins
• Google: 2+ million search queries

Data Amount Fueled by SN Activities
• Twitter: 150+ million members; 50 million tweets per day (from twitter.com)
• Facebook: one billion users
• Amazon co-purchasing network: half a million product nodes; several million recommendation links (from SNSP)
• Web pages: Yahoo!, over one billion web pages

Example of Big Data and Social Network
• Volume: thousands of people!
• Velocity: fast accumulated!!
• Variety: eating different food!!!

Example of Big Data and Social Network
• For some gossip on this occasion, veracity is an issue and the information value could be low.
• Mr. Lin won the lottery! Mrs.
Chang just did a face lift!

Some Views on Big Data
• Big data white paper: "Challenges and Opportunities with Big Data", by researchers in major universities and IT companies in the US
  http://www.cra.org/ccc/files/docs/init/bigdatawhitepaper.pdf
• McKinsey: "Big data: The next frontier for innovation, competition, and productivity"
  http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation
• NYTimes: "The age of Big Data" (potential use and cost)
  http://www.nytimes.com/2012/02/12/sunday-review/bigdatas-impact-in-the-world.html?pagewanted=all&_r=0

Views on Big Data (cont'd)
• IBM (platform, technology and applications)
  http://www-01.ibm.com/software/data/bigdata/
• Microsoft: "Perspective from the fourth paradigm for scientific discovery"
  http://research.microsoft.com/enus/collaboration/fourthparadigm/4th_paradigm_book_complete_lr.pdf
• VMware (platform and system architecture)
  http://blogs.vmware.com/vfabric/2012/08/4-key-architectureconsiderations-for-big-data-analytics.html
• More (from SAS, Intel, Oracle, etc., online)

So, is the Notion of Big Data New?
• It depends on whom you ask.
• In fact, when more funds become available for big data issues, people jump out to claim to be big data people.
• One term, each with its own interpretation.
• If we read the Big Data white paper from the US, its scope is quite close to that of data mining.
• Of course, this is not considered a consensus.

Similar Rationale behind Data Mining and Big Data
• Knowledge discovery from a huge amount of data: extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
• In line with the technology trend: hardware, storage, CPU MIPS, network bandwidth, cloud, etc.
• Intelligence and personalization will be key for differentiation

Characteristics of Big Data
• Knowledge discovered from big data: improving decision quality, optimizing processes, and gaining insights in general (tied to the domain)
• Usually not considered an isolated business
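sector, not analogous to oil
• Slightly different from traditional business intelligence (BI)
  ◦ BI: more on data with high information density
  ◦ Big data: more on data with low information density; more application oriented

Example Big Data Applications in Various Business Sectors
• Finance and insurance: credit rating, customized financial services, credit granting, customer asset management, bad-debt analysis, moral hazard analysis, adverse selection risk analysis, prospective customer list analysis (credit analysis, insurance policy, etc.)
• Retail (including e-commerce): real-time support for purchase decisions (via proper recommendation), and decision-support systems for integrating and allocating goods, shelf space, and logistics (e.g., 7-11) (EC is an emerging area!)
• Manufacturing: expert decision-support systems for choosing optimal production factors during manufacturing, plus optimized inventory control and supply chain and customer profitability analysis
• Chain stores: selecting sites for new stores, selecting product items for each branch, decision support for siting logistics warehouses, and a basis for allocating logistics capacity (e.g., McDonald, etc.)
• Healthcare: cost-driver analysis for managing medical operating costs; a source for medical analysis or personalized patient services
• Telecommunications: optimized network traffic allocation and customized services; real-time online customized information systems, customized portals, and promotion support; operation analysis (e.g., alarm system analysis) due to system scale
• Biotechnology: R&D platforms and the tools needed for analysis, to accelerate the accumulation of R&D capacity (genome analysis)
• Education: analysis of prospective student lists; data mining for rating admission and scholarship applications; a basis for students' course planning and career planning (e.g., MOOC)
• Advertising: ad click source analysis, response rate analysis, marketing strategy (augmented with LBS in mobile devices)
• Non-profit organizations: contact lists for fundraising letters and correspondence (including SN analysis)

Some More Words on Big Data
• Primary sources of big data:
  ◦ Social network activities
  ◦ Internet of things (i.e., from sensor networks)
  ◦ Multimedia (mainly video)
• New methods are required to overcome the new challenges imposed by big data
  ◦ Streaming data, unstructured data, data from various sources, etc.
  ◦ Traditional relational databases cannot handle these efficiently

Tool
Source: http://www.bigdata-startups.com/open-source-tools/

Now, Big Data in a Social Network
• A social network is usually composed of millions of nodes and links (homogeneous or heterogeneous)
• The huge (volume), fast-changing (velocity), and diversified (variety) information in a social network poses very challenging management and analysis issues for researchers
From twitter.com

Outline
• Walkthrough on Big Data
• Information Extraction from an SN Graph
• Issues to Address
(In this part, we shall use examples to illustrate the concepts.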
Those who are interested in technical details are referred to related publications.)

Graph Extraction
• To handle complicated things with simple skills.
• Application/goal-oriented data extraction
• Three levels of information extraction from SNs:
  ◦ Parameter extraction (e.g., company stat.): fast calculation of closeness centrality (ICDM13)
  ◦ Feature extraction (e.g., company biz.): activity willingness optimization (VLDB14)
  ◦ Structure extraction (e.g., company org.): decomposing SN graphs (ASONAM14)

[Figure: parameter extraction, structure extraction, and feature extraction illustrated by analogy (regarding capability); one label reads "weapon"]

Outline
• Walkthrough on Big Data
• Information Extraction from an SN Graph
  ◦ Capturing key parameters (parameter extraction)
  ◦ Activity willingness optimization (feature extraction)
  ◦ Decomposing SN graphs (structure extraction)
• Issues to Address

Closeness Centrality
• SN graphs have several interesting quantities, including closeness centrality, network diameter, and degree distribution.
• Closeness centrality of node v, Cc(v): the inverse of the average shortest-path distance from v to any other node in the network.
• If Cc(v) is large, v is near the center, as it requires only a few hops to reach the others.

Response to Dynamic Changes
• Edge insertions and deletions are frequent in a social network.
• It is desirable to quickly update the closeness centrality of every node in response to an edge insertion or deletion.
• Example use: pick a number of people (the nodes with high Cc) who can maximize advertisement effectiveness.

Example of Closeness Centrality
Cc(v): the inverse of the average shortest-path distance from v to the other nodes, where |p(v, u)| is the length of the shortest path between v and u:
$$C_c(v) = \frac{|V| - 1}{\sum_{u \in V} |p(v, u)|}$$
In the 14-node example, each product in the denominator is a distance multiplied by the number of nodes at that distance:
$$C_c(v) = \frac{14 - 1}{1 \times 4 + 2 \times 2 + 3 \times 1 + 4 \times 1 + 5 \times 2 + 6 \times 2 + 7 \times 1} = \frac{13}{44}$$
$$C_c(w) = \frac{14 - 1}{1 \times 3 + 2 \times 4 + 3 \times 4 + 4 \times 2} = \frac{13}{31}$$
Thus, node w is closer to all the other nodes than node v is.
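The definition of Cc(v) above can be checked on a small graph. The following is a minimal sketch, assuming an unweighted, undirected, connected graph stored as an adjacency dictionary; the helper names and the 5-node path graph are hypothetical and are not the 14-node graph from the slides.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distance from source to every reachable node of an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def closeness_centrality(adj, v):
    """Cc(v) = (|V| - 1) / (sum of shortest-path distances from v).

    Assumes the graph is connected, so every node is reachable from v.
    """
    dist = bfs_distances(adj, v)
    return (len(adj) - 1) / sum(dist.values())


# Hypothetical path graph a-b-c-d-e, used only for illustration.
path = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}
cc_end = closeness_centrality(path, "a")     # 4 / (1 + 2 + 3 + 4)
cc_middle = closeness_centrality(path, "c")  # 4 / (1 + 1 + 2 + 2)
```

As in the slide example, the middle node has the larger value because it needs fewer hops to reach everyone else.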
[Figure: an unweighted and undirected graph G with 14 nodes and 18 edges]

Calculating Closeness Centrality
• Only some shortest paths are affected by a given edge change.
• Identify the affected node pairs (unstable node pairs) for fast calculation of Cc.

Example
For example, with the addition of edge (a,b):
• Unchanged shortest paths
  ◦ p(b,v), p(c,t), p(r,h), etc.
• Changed shortest paths
  ◦ Before edge insertion: p(a,b)={a,d,w,b}, p(a,c)={a,d,w,r,c}, p(u,v)={u,l,o,d,w,r,s,v}, etc.
  ◦ After edge insertion (we then call these node pairs unstable): p(a,b)={a,b}, p(a,c)={a,b,c}, p(u,v)={u,x,a,b,c,v}, etc.
[Figure: (a) the original unweighted and undirected graph G; (b) G' = G ∪ e(a,b)]

Illustration of Unstable Node Pairs
• Goal: find V'u, the u-unstable node set, i.e., the nodes whose shortest paths to u changed after the edge addition.
• Unstable node pairs: (u,b), (u,c), (u,h), (u,v), and (u,t), so V'u = {b,c,h,v,t}.
[Figure: Gu and G'u]

Main Theorem
• After the addition of edge (a,b), every unstable node pair {v,u} (a pair whose shortest path changed) has v ∈ V'a and u ∈ V'b.
• Only these shortest paths change after the edge addition and need to be recalculated.
[Figure: the node sets V'a and V'b]

Remark
• Experiments were done with Hadoop (MapReduce) on the DBLP dataset.
• With fast calculation of closeness centrality, shortest-path-preserving sparsification can be done efficiently by identifying the edges whose removal least affects Cc.
• New algorithms are called for to efficiently calculate other key parameters in fast-changing social networks.
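The unstable-node-set idea from the closeness centrality slides can be illustrated with a brute-force sketch. It is not the efficient ICDM13 algorithm: it simply runs a full BFS before and after the insertion and compares distances. It assumes "unstable" means that the hop distance changed, and the 8-node cycle, the inserted chord (0, 4), and all function names are hypothetical.

```python
from collections import deque
from itertools import combinations


def bfs_distances(adj, source):
    """Hop distance from source to every reachable node of an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def add_edge(adj, a, b):
    """Return a copy of adj with the undirected edge (a, b) added."""
    new_adj = {node: set(nbrs) for node, nbrs in adj.items()}
    new_adj[a].add(b)
    new_adj[b].add(a)
    return new_adj


def unstable_set(adj_before, adj_after, u):
    """Nodes whose hop distance to u differs between the two graphs."""
    before = bfs_distances(adj_before, u)
    after = bfs_distances(adj_after, u)
    return {n for n in adj_before if before[n] != after[n]}


# Hypothetical graph: the cycle 0-1-2-...-7-0, then insert the chord (0, 4).
cycle = {i: {(i - 1) % 8, (i + 1) % 8} for i in range(8)}
after = add_edge(cycle, 0, 4)

unstable_0 = unstable_set(cycle, after, 0)  # nodes whose distance to 0 changed
unstable_4 = unstable_set(cycle, after, 4)  # nodes whose distance to 4 changed

# Every node pair whose distance changed, found by exhaustive comparison.
changed_pairs = [
    (x, y)
    for x, y in combinations(cycle, 2)
    if bfs_distances(cycle, x)[y] != bfs_distances(after, x)[y]
]

# Check the property stated in the main theorem: each changed pair has one
# endpoint in each of the two unstable sets of the inserted edge's endpoints.
theorem_holds = all(
    (x in unstable_0 and y in unstable_4) or (x in unstable_4 and y in unstable_0)
    for x, y in changed_pairs
)
```

In this toy graph only the nodes in the two unstable sets would need their closeness centrality recomputed; every other node keeps all of its distances.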
Evolution of Activity Formation
• Extracted information has been shown to be helpful for activity formation in social networks.
• Socio-Spatial Group Query [Yang et al., KDD-12]: considers time, social, and spatial factors.
• As more and more information can be mined from a social network, we can take user interest (i.e., willingness) into consideration when planning an activity [Shuai et al., VLDB-14].

[Figure: a table of time slots ts1 to ts6 against eight people (Mike Lee, Tony Wang, Peter Chen, Jack Lin, Jane Lee, Grace Yang, John Chen, Mary Fang), with cells marked "O"; the individual cell positions are not recoverable. Slide dated 2017/7/29.]

What Can be Done Further?
• Time + Social + Spatial (heterogeneous SN)
• "Wow! I found a restaurant with a great buy-2-get-2-free deal! Let me ask some good friends to come for lunch."

Implementation of SSGQ
[Screenshot: query inputs for group size, activity location, and familiarity constraint]

Implementation of SSGQ (cont'd)
[Screenshot: the selected group and the attendees' current locations]

Ongoing Experiments on Facebook (with willingness considered)

Diffusion Analysis in Social Networks
• Diffusion of information can be used to model the interaction among nodes in a network, e.g.,
  ◦ Viruses spreading over the Internet
  ◦ Diseases spreading in a community
  ◦ Rumors or news spreading among people

Example Diffusion
• Information diffusion can happen in social networks such as Facebook and Twitter.
[Figure: an underlying network over nodes 0 to 3 and a path of infection]

The Network is Hidden
• In some situations, the underlying network is not known (due to cost or privacy issues).
• The network inference problem (NIP) is studied to discover the underlying network, i.e., to infer the network from what happened.

Network Inference Problem
[Figure: example network with numbered nodes]
Clustering Cascades
• Traditionally, NIP assumes that there is one underlying network, which may not always be true in reality.
  ◦ E.g., sports news, political news, and entertainment news are likely to spread in different ways.
• Hence, we would like to cluster cascades so that the cascades in each cluster spread in the same pattern.
• An SN graph is hence decomposed into application-specific ones.

Example Cascades
[Figure: six cascades over numbered nodes: cascade a (Lakers news), cascade b (49ers news), cascade c (Redskins news), cascade d (Heat news), cascade e (Jets news), cascade f (Celtics news)]

To Model Inference Network (as before)

Possible Inference Network (obtained by traditional method)
[Figure: a single dense inferred network with weighted edges (weights between 0.17 and 0.67)]

To Cluster Cascades by K-Means

Graph Decomposition
• By considering cascades {a, d, f} and cascades {b, c, e} independently (based on which nodes are infected), the original SN graph is decomposed in accordance with the information the cascades carried.
[Figure: two decomposed networks with weighted edges, one for cascades {b, c, e} (NFL) and one for cascades {a, d, f} (NBA)]

Remark
• Traditionally, NIP results in a dense and complex network from which it is difficult to capture knowledge.
• By properly clustering cascades, we obtain a few concise networks that carry clearer information.
• These resulting networks match their corresponding cascades better than a single dense network does.

Issues to Address
• Issues that either occur uniquely, or will become more prevalent, in social networks
• We discuss them from the perspectives of (1) users, (2) events, (3) time, (4) platform, and (5) data.
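Returning to the cascade-clustering step of the graph decomposition slides, the sketch below shows one way to cluster cascades with k-means "based on which nodes are infected". The slides do not give the feature representation, so encoding each cascade as a binary infection vector is an assumption, as are the six toy vectors, the choice of k = 2, and the fixed starting centroids. Only the cascade names a to f follow the slides.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def kmeans(vectors, initial_centroids, max_iter=100):
    """Plain Lloyd's algorithm with caller-supplied starting centroids."""
    centroids = [list(c) for c in initial_centroids]  # do not mutate the input
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda k: squared_distance(v, centroids[k]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[k] = mean_vector(members)
    return labels


# Hypothetical data: entry i is 1 if node i was infected by the cascade.
cascades = {
    "a": [1, 1, 1, 0, 0, 0],
    "b": [0, 0, 0, 1, 1, 1],
    "c": [0, 0, 0, 1, 1, 0],
    "d": [1, 1, 0, 0, 0, 0],
    "e": [0, 0, 1, 1, 1, 1],
    "f": [1, 1, 1, 1, 0, 0],
}
names = sorted(cascades)
vectors = [cascades[name] for name in names]

# Start from cascades a and b so the run is deterministic.
labels = kmeans(vectors, [cascades["a"], cascades["b"]])

clusters = {}
for name, label in zip(names, labels):
    clusters.setdefault(label, []).append(name)
```

With these toy vectors the two clusters are {a, d, f} and {b, c, e}, mirroring the NBA and NFL grouping on the slides. A separate network would then be inferred from each cluster.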
Issues to Address (1st, on Users)
• From collaborative filtering to social filtering
  ◦ Traditional collaborative filtering (CF) is used in recommendation systems.
  ◦ Recently, with the prosperity of social network sites, social filtering (SF) has become more prevalent.
• The social network services required will be very user-dependent and human-centric.

Use CF for Recommendation
[Figure: items are recommended to a user via similar users]

Use SF for Recommendation (i.e., letting your friends decide)
[Figure: items are recommended to a user via friends, one of whom says "This cake is AWESOME!"]

Issues to Address (2nd, on Events)
• Bridging real and virtual lives
  ◦ E.g., construction of a weighted SR graph
• Mismatch in confidence level
  ◦ The confidence level of a discovered social relationship might not be high (the weighting is quite subjective and ad hoc).
  ◦ E.g., reading the same book (1 pt), having lunch together (2 pts), going to a movie together (3 pts), etc.
  ◦ However, the proper weighting may vary from one person to another.

Issues to Address (3rd, on Time)
• Stream mining for real-time decisions (no single snapshot)
  ◦ "Of all martial arts, only speed is unbeatable."
• Not only summarize the social information, but also find the trend of its evolution (2nd-order mining)
  ◦ Mining on summarized data
  ◦ E.g., not just discovering what Tom's favorite song is, but learning that Tom changes his favorite every 3 months.

Issues to Address (4th, on Platform)
• With the availability of mobile devices and the paradigm shift to cloud computing, everyone will have 1 Gb for communication, unlimited storage, and access to data sources worldwide.
• This leads to the era of the "superman" (with different ways of thinking and doing things): the new era of the superman.
• The variety of social network activities will increase even faster, in particular those related to LBS.
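Returning to the first issue (on users), the contrast between collaborative filtering and social filtering can be made concrete with a toy sketch. The ratings, the friendship lists, and all function names are hypothetical. Cosine similarity over raw ratings is one common CF choice, used here as an assumption, and SF is simplified to "score items by what my friends rated".

```python
from math import sqrt

# Hypothetical data: user -> {item: rating}, and user -> set of friends.
ratings = {
    "ann": {"cake": 5, "pie": 4},
    "bob": {"cake": 5, "pie": 4, "tart": 5},
    "cat": {"soup": 4, "salad": 5},
    "dan": {"soup": 5},
}
friends = {"ann": {"cat"}, "bob": {"dan"}, "cat": {"ann"}, "dan": {"bob"}}


def cosine(u, v):
    """Cosine similarity between two sparse rating dictionaries."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)


def best_unseen_item(user, weights):
    """Highest-scoring item the user has not rated, given neighbour weights."""
    scores = {}
    for other, weight in weights.items():
        for item, rating in ratings[other].items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + weight * rating
    return max(scores, key=scores.get) if scores else None


def cf_recommend(user):
    """Collaborative filtering: weight every other user by taste similarity."""
    weights = {
        other: cosine(ratings[user], ratings[other])
        for other in ratings
        if other != user
    }
    return best_unseen_item(user, weights)


def sf_recommend(user):
    """Social filtering: only friends count, each with equal weight."""
    weights = {friend: 1.0 for friend in friends[user]}
    return best_unseen_item(user, weights)


cf_choice = cf_recommend("ann")
sf_choice = sf_recommend("ann")
```

For "ann", CF follows the similar user "bob" and picks "tart", while SF follows her friend "cat" and picks "salad", even though their tastes do not overlap.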
Issues to Address (5th, on Big Data)
• To process big data, i.e., a huge volume of fast-increasing (velocity) data of different types (variety) with unclear veracity and domain-dependent value
• To integrate different data sources
  ◦ E.g., the locations where photos were shot, user purchase behavior, and the SNs in which the user is involved
• Objective: Volume, Velocity
• Subjective: Variety, Veracity, Value

Other Important Issues
• Mining-assisted management of social media content
• Services with more intelligence are required
• Privacy preservation in social information processing
• … and more

Conclusion
• Due to the paradigm shift to cloud computing and the fast increase in the availability of mobile devices, big data processing in social networks is having an unprecedented impact on our lives.
• Key factors for the arrival of the big data era: mobile, social networks, and cloud

Thank you!