Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
CS 1944: Sophomore Seminar Big Data and Machine Learning B. Aditya Prakash Assistant Professor Nov 3, 2015 About me Assistant Professor, CS – Member, Discovery Analytics Center Previously – Ph.D. in Computer Science, Carnegie Mellon University – B.Tech in Computer Science and Engg, Indian Institute of Technology (IIT) – Bombay – Internships at Sprint, Yahoo, Microsoft Research Prakash 2015 2 Prakash 2015 3 Data contains value and knowledge Prakash 2015 4 Data and Business Recommended'linksA +79%''clicksA Prakash 2015 Top'SearchesA Personalized'' News'InterestsA +250%'clicksA +43%'clicksA Source: A. Machhanavajjhala 5 Data and Science 5* Data1and1science* Red:1official1numbers1from1Center1for1Disease1Control1and1Prevention;1weekly11 Black:1based1on1Google1search1logs;1daily1(potentially1instantaneously)* Detecting'influenza'epidemics'using'search' engine'query'data1 http:/ / www.nature.com/ nature/ journal/ v457/ n7232/ full/ nature07634.html Prakash 2015 6 Data and Government 6* Data1and1government* http:/ / www.washingtonpost.com/ opinions/ obama-the-big-datapresident/ 2013/ 06/ 14/ 1d71fe2e-d391-11e2b05f-3ea3f0e7bb5a_stor y.html http:/ / www.washingtonpost.com/ business/ economy/ democratspush-to-redeploy-obamas-voterdatabase/ 2012/ 11/ 20/ d14793a4-2e83-11e2-89d4-040c93 30702a_story.html http:/ / www.whitehouse.gov/ blog/ Democratizing-Data Prakash 2015 http:/ / www.theguardian.com/ world/ 2013/ jun/ 23/ edward-snowden-nsa-fi les-timeline Source: A. Machhanavajjhala 7 Data and Culture • http:/ / blogs.plos.org/ everyone/ 2013/ 03/ 20/ what-are-you-inthe-mood-for-emotionaltrends-in-20th-centur y-books/ Prakash 2015 Source: A. Machhanavajjhala 8 8* Data1and1____1☜ your1favorite1subject* Sports* Prakash 2015 Journalism* 9 Good news: Demand for Data Mining Prakash 2015 10 How to extract value from data? Manipulate Data – CS, Domain expertise Analyze Data – Math, CS, Stat… Communicate your results – CS, Domain Expertise Prakash 2015 11 Communication is important! “The"British"government"spends"" £13"billion"a"year"on"universities.”F – – http:/ / wheredoesmymoneygo.org/ " bubbletree-map.html#/ ~/ total/ education/ university “On"average,"1"in"every"15"Europeans"" is"totally"illiterate.”F – – http:/ / datajournalismhandbook.org/ 1.0/ en/ understanding_data_0.html Prakash 2015 12 What is Data Mining? Given lots of data Discover patterns and models that are: – Valid: hold on new data with some certainty – Useful: should be possible to act on the item – Unexpected: non-obvious to the system – Understandable: humans should be able to interpret the pattern Prakash 2015 13 Data Mining Tasks Descriptive methods – Find human-interpretable patterns that describe the data • Example: Clustering Predictive methods – Use some variables to predict unknown or future values of other variables • Example: Recommender systems Prakash 2015 14 Theory & Algo. Biology Physics Comp. Systems ML & Stats. Social Science Big data Econ. 15 Prakash 2015 Data at CS, VT Knowledge, Information and Data http://www.cs.vt.edu/undergraduate/tracks/k id People: Fox, Harrison, Huang, Lu (in NVA), Ramakrishnan (in NVA), Rozovskaya, Prakash Prakash 2015 16 Courses Background in some areas: – CS3414 (Numerical Methods); also prob/stat 4000 level – 4244 Internet Software Development – 4604 Database Management Systems – 4624 Capstone (Multimedia, Information Access) – 4634 Design of Information (Capstone) – 4804 AI – 4984 Computational Linguistics (Capstone) Prakash 2015 17 Discovery Analytics Center Prakash 2015 18 MY RESEARCH Prakash 2015 19 Networks are everywhere! Facebook Network [2010] Gene Regulatory Network [Decourty 2008] Human Disease Network [Barabasi 2007] The Internet [2005] Prakash 2015 20 What else do they have in common? Prakash 2015 21 High School Dating Network Bearman et. al. Am. Jnl. of Sociology, 2004. Image: Mark Newman Blue: Male Pink: Female Interesting observations? Prakash 2015 22 The Internet Skewed Degrees Robustness Prakash 2015 23 Karate Club Network Prakash 2015 24 Dynamical Processes over networks are also everywhere! Prakash 2015 25 Why do we care? Social collaboration Information Diffusion Viral Marketing Epidemiology and Public Health Cyber Security Human mobility Games and Virtual Worlds Ecology ........ Prakash 2015 26 Why do we care? (1: Epidemiology) Dynamical Processes over networks [AJPH 2007] Diseases over contact networks Prakash 2015 CDC data: Visualization of the first 35 tuberculosis (TB) patients and their 1039 contacts 27 Why do we care? (1: Epidemiology) Dynamical Processes over networks • Each circle is a hospital • ~3000 hospitals • More than 30,000 patients transferred [US-MEDICARE NETWORK 2005] Prakash 2015 Problem: Given k units of disinfectant, whom to immunize? 28 Why do we care? (1: Epidemiology) ~6x fewer! CURRENT PRACTICE [US-MEDICARE NETWORK 2005] OUR METHOD Prakash 2015 29 Hospital-acquired inf. took 99K+ lives, cost $5B+ (all per year) Why do we care? (2: Online Diffusion) > 800m users, ~$1B revenue [WSJ 2010] ~100m active users > 50m users Prakash 2015 30 Why do we care? (2: Online Diffusion) Dynamical Processes over networks Buy Versace™! Followers Celebrity Prakash 2015 Social Media Marketing 31 Social Biological Contagion Automatically learn models Prakash 2014 32 Why do we care? (3: To change the world?) Dynamical Processes over networks Social networks and Collaborative Action Prakash 2015 33 High Impact – Multiple Settings epidemic out-breaks Q. How to squash rumors faster? products/viruses Q. How do opinions spread? transmit s/w patches Q. How to market better? Prakash 2015 34 Dynamical Processes = (a lot of) Networks + (some) Time-Series Prakash 2015 35 Research Theme ANALYSIS Understanding DATA Large real-world networks & processes Prakash 2015 POLICY/ ACTION Managing 36 Research Theme – Public Health ANALYSIS Will an epidemic happen? DATA Modeling # patient transfers Prakash 2015 POLICY/ ACTION How to control out-breaks? 37 Research Theme – Social Media ANALYSIS # cascades in future? DATA Modeling Tweets spreading Prakash 2015 POLICY/ ACTION How to market better?38 A Question How many of you think your friends have more friends than you? A recent Facebook study – Examined all of FB’s users: 721 million people with 69 billion friendships. • about 10 percent of the world’s population! – Found that user’s friend count was less than the average friend count of his or her friends, 93 percent of the time. – Users had an average of 190 friends, while their friends averaged 635 friends of their own. Prakash 2015 39 Possible Reasons? You are a loner? Your friends are extroverts? There are more extroverts than introverts in the world? Prakash 2015 40 Example Average number of friends? Source: S. Strogatz, NYT 2012 Prakash 2015 41 Example Average number of friends =(1+3+2+2)/4 =2 Source: S. Strogatz, NYT 2012 Prakash 2015 42 Example Average number of friends =(1+3+2+2)/4 =2 Average number of friends of friends Source: S. Strogatz, NYT 2012 Prakash 2015 43 Example Average number of friends =(1+3+2+2)/4 =2 Average number of friends of friends = (3 + 1 + 2 + 2 + 3 + 2 + 3 + 2)/8 = ((1x1) + (3x3) + (2x2) + (2x2))/8 Source: S. Strogatz, NYT 2012 Prakash 2015 44 Example Average number of friends =(1+3+2+2)/4 =2 Average number of friends of friends = (3 + 1 + 2 + 2 + 3 + 2 + 3 + 2)/8 = ((1x1) + (3x3) + (2x2) + (2x2))/8 = 2.25! Source: S. Strogatz, NYT 2012 Prakash 2015 45 Actually it is (almost) always true! Proof? Prakash 2015 46 Actually it is (almost) always true! Proof? E[X] = å xi / N Prakash 2015 47 Actually it is (almost) always true! Proof? E[X] = å xi / N Var[X] = E[(X - E[X]) ] 2 = E[X ]- E[X] 2 Prakash 2015 2 48 Actually it is (almost) always true! Proof? E[X] = å xi / N Var[X] = E[(X - E[X]) ] 2 = E[X ]- E[X] 2 2 E[X ] Var[X] = E[X]+ E[X] E[X] 2 Prakash 2015 49 Actually it is (almost) always true! Essentially, it is true if there is any spread in # of friends (non-zero variance)! Proof? E[X] = å xi / N Var[X] = E[(X - E[X]) ] 2 = E[X ]- E[X] 2 2 E[X ] Var[X] = E[X]+ E[X] E[X] 2 Prakash 2015 50 Implications Immunization Figure 1. Network Illustrating Structural Parameters. This real network of 105 students shows variation in structural attributes and topological position. Each circle represents a person and each line represents a friendship tie. Nodes A and B have different ‘‘degree,’’ a measure that indicates the number of ties. Nodes with higher degree also tend to exhibit higher ‘‘centrality’’ (node A with six friends is more central than B and C who both only have four friends). If contagions infect people at random at the beginning of an epidemic, central individuals are likely to be infected sooner because they lie a shorter number of steps (on average) from all other individuals in the network. Finally, although nodes B and C have the same degree, they differ in ‘‘transitivity’’ (the probability that any two of one’s friends are friends with each other). Node B exhibits high transitivity with many friends that know one another. In contrast, node C’s friends are not connected to one another and therefore they offer more independent possibilities for becoming infected earlier in the epidemic. doi:10.1371/journal.pone.0012948.g001 – acquaintance immunization • Immunize friend-of-friend the variance of the degree distribution divided by m. Hence, when there is variance in degree in a population, and especially when there ishigh variance, the mean number of contacts for the friends will be greater (and potentially much greater) than the mean for the random sample. This is sometimes known as the ‘‘friendship paradox’’ (‘‘your friends have more friends than you do’’) [15–19]. While the idea of immunizing such friends of randomly chosen people has previously been explored in a stimulating theoretical paper [12], to our knowledge, a method that uses nominated friends as sensors for early detection of an outbreak has not previously been proposed, nor hasit been tested on any sort of real outbreak. To evaluate the effectiveness of nominated friends as social network sensors, we therefore monitored the spread of flu at Harvard College from September 1 to December 31, 2009. In the fall of 2009, both seasonal flu (which typically kills 41,000 Americans each year [20]) and the H1N1 strain were prevalent in the US, though the great majority of cases in 2009 have been attributed to the latter.[1] It is estimated that this H1N1 epidemic, which began roughly in April 2009, infected over 50 million Americans. Unlike seasonal flu, which typically affects individuals older than 65, H1N1 tends to affect young people. Nationally, according to the CDC, the epidemic peaked in late October 2009, and vaccination only became widely available in December 2009. Whether another outbreak of H1N1 will occur (for example, in areas and populations that have heretofore been spared) is a Early warning of outbreaks – Again, monitor friends of friends Prakash 2015 Figure 2. Theoretical expectatio ns of differences in contagion between central individuals and the populatio n as a whole. A contagious process passes through two phases, one in which the number of infected individuals exponentially increases as the contagion spreads, and one in which incidence exponentially decreases as susceptible individuals become increasingly scarce. These dynamics can be modeled by a logistic function. Central individuals lie on more paths in a network compared to the average person in a population and are therefore more likely to be infected early by a contagion that randomly infects some individuals and then spreads from person to person within the network. This shifts the Sshaped logistic cumulative incidence function forward in time for central individuals compared to the population as a whole (left panel). It also shifts the peak infection rate forward (right panel). doi:10.1371/journal.pone.0012948.g002 51 Thanks---Questions? B. Aditya Prakash 3160 F Torgersen Hall [email protected] See my homepage for more details and papers: http://www.cs.vt.edu/~badityap Prakash 2015 52