
RESEARCH OBJECTIVES. My principal research interest is in applied machine learning and large-scale data mining, focusing on the analysis and modeling of large real-world networks as the study of phenomena across the social, technological, and natural worlds. Such graphs frame numerous research problems and high-impact applications. For example, social networks on the internet generate revenue of multiple billions of dollars; detection of virus outbreaks can save lives; regulatory gene networks help us understand how our cells work; anomaly detection in large computer-traffic networks is vital for corporate and national security. My long-term research goal is to harness large-scale networks to understand, predict, and ultimately, enhance social and technological systems. I would like to create explanatory and predictive models of actions of large groups of people and societies, and biological and technological systems. Although the actions of a particular individual or component may be too difficult to model, machine learning and statistics can be applied to large groups or ensembles, which can yield effective models with ability to predict the flow of future events. Based on my recent results and research experience, I believe that the study of large networks is the promising approach to developing such understandings, as graphs capture local dependencies, and also reveal large-scale structure and phenomena arising from the multitude of local interactions. Local regularities and seemingly "random" behavior can propagate to the global macro scale where patterns emerge, e.g., power-law degree distributions and small diameters.

On the way to achieving this long-term goal, my research consists of (1) analyzing theoretical models of network structure and evolution; (2) developing statistical machine learning models and algorithms to efficiently estimate the model parameters from data; (3) working with massive datasets of gigabyte and terabyte scale, as certain behaviors and patterns are observable only when the amount of data is large enough.

CURRENT ACHIEVEMENTS. Through my work, I have addressed a number of important questions regarding the properties and patterns of large evolving networks by revealing how local behavior and structure leads to large-scale phenomena and useful applications. What does a "normal" network look like? How will it evolve over time? Is the network or a community "healthy"? How do information and viruses spread over the network? How can we identify and find influential nodes or select nodes to immunize in networks? Answers to such questions are vital to a range of application areas from the identification of illegal money-laundering rings, misconfigured routers on the Internet, viral marketing, and protein-protein interactions to disease outbreak detection.

Results of my doctoral research have been included in the curricula of several graduate classes on network analysis, advanced data mining, internet algorithms, social media and social networks across universities. For example, William Cohen, Kathleen Carley, Stephen Fienberg at CMU, Lada Adamic at University of Michigan, Jon Kleinberg at Cornell University, Nina Mishra at University of Virginia, Jiawei Han at UIUC, Constantine Dovrolis at Georgia Tech and others make our results part of their courses.

In my dissertation research, I focused on static and evolving networks, and the dynamics of processes, like virus propagation, that take place in networks. The table below gives the overall structure of my thesis research with the mapping to the sections of this document.

[Table: rows are Static networks, Dynamics of network evolution, Dynamics of processes on networks; columns are Analysis, Models, Algorithms; entries are section numbers, extracted as 6 2 2 8; 1 1 1 7 5; 4 3 8.]

My principal research interest is in large-scale data mining, focusing on the analysis of networks. Networks allow us to study phenomena across the social, technological, and natural worlds. Networks frame numerous research problems that lead to high-impact applications. For example, social networks on the Internet generate revenue of billions of dollars; detection of virus outbreaks in human networks can save lives; anomaly detection in computer-traffic networks is vital for security. My long-term research goal is to harness large-scale social and information networks to understand, predict, and ultimately, enhance social and technological systems. I aim to create explanatory and predictive models of actions of large groups of people and societies, and large technological systems.

BACKGROUND. Only a few years ago the goal of modeling large social and technological systems would have been unattainable. However, in less than a decade the World Wide Web has transformed from a large static library that people only browse, into a vast information resource where people interact with one another. Through the emergence of online social networking, social media and social gaming, daily activities of hundreds of millions of people are migrating to the Web. Today the Web is a "sensor" that captures the pulse of humanity: what we are thinking, what we are doing, and what we know. The activity of millions of humans on the Web leaves massive digital traces that come in many forms and modalities (combining text, images, and video along spatial and temporal axes) and are connected in rich network structures. Such data can naturally be represented, studied and analyzed as complex dynamic interaction networks. Networks provide enormous potential both to address long-standing scientific questions, and also to harness and inform the design of future social computing applications. Networks pose interesting challenges and questions that motivate my research: How is information in a social network created? How does it flow and mutate as it is passed from node to node? How will a community or a social network evolve in the future? And also, how do we compute and develop algorithms that scale to massive dynamic networks?

My research group strives to address the above challenges and harness the opportunities networks put forward. My group combines analysis of complex networks with large-scale data mining to develop computational models of networks. Our explorations consist of: (1) Modeling the structure and evolution of networks and online communities. (2) Developing methods for social media analytics and information diffusion. (3) Working with massive datasets, as certain behaviors and patterns are observable only when the amount of data is large enough.

INITIAL STUDIES AND CURRENT RESEARCH. Through my work, I have addressed a number of important questions regarding the properties of large networks:

(1) Structure and evolution of networks. My work had influence on thinking about fundamental structural properties of networks varying over time. Before, it was commonly believed that the average degree of networks remains constant as they grow over time and that the distances in networks slowly (logarithmically) increase with the network size. We showed that networks densify over time, as the number of edges $e(t)$ at time $t$ increases as $e(t) \propto n(t)^{\alpha}$ with the number of nodes $n(t)$. The densification exponent $\alpha$ is non-trivial, with $\alpha$ between 1.2 and 1.6. Even more surprisingly, the diameter of networks shrinks as they grow. These findings were fundamentally different from what was believed at the time and we explained them by developing a "Forest Fire" network model [KDD '05].
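As an illustration of the densification law described above (edges growing as a power of the node count), the exponent can be estimated as the slope of a least-squares line fitted to log(edges) against log(nodes). The following is a minimal sketch, not the procedure used in the cited work: the function name `densification_exponent` and the snapshot values are hypothetical, and the data are synthetic, generated from an exact power law with exponent 1.4.

```python
import math


def densification_exponent(nodes, edges):
    """Least-squares slope of log(edges) against log(nodes).

    Under a power law e(t) = c * n(t)**alpha, this slope equals alpha.
    """
    if len(nodes) != len(edges) or len(nodes) < 2:
        raise ValueError("need at least two (nodes, edges) snapshots")
    xs = [math.log(n) for n in nodes]
    ys = [math.log(e) for e in edges]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("node counts must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic snapshots (illustrative only, not real network data):
# edge counts follow e = 2 * n**1.4 exactly.
node_counts = [1_000, 2_000, 4_000, 8_000, 16_000]
edge_counts = [2 * n ** 1.4 for n in node_counts]

alpha = densification_exponent(node_counts, edge_counts)
print(f"estimated alpha = {alpha:.2f}")
```

A slope of 1 would correspond to a constant average degree; a slope above 1 means the average degree grows as the network grows, which is what densification refers to.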
Besides studying the evolution of macroscopic properties of networks, we also investigated the microscopic edge-by-edge evolution of the online social networks LinkedIn, YouTube, Yahoo Answers and Delicious [KDD '08]. We modeled the effect of triadic closure and locality on the emergence of new edges in the network. These insights led us to develop a friend recommendation engine that is currently deployed at Facebook and correctly predicts 8 out of 20 of a user's future friends [WSDM '11]. This work shows how combining fundamental insights about the structure of networks with computational models leads to high-impact applications.

(2) Social Media Analytics. To model networked systems we also need to understand how influence, trust and information spread over the edges of the underlying network. My work developed insights into how information propagation data can be used for selecting targets for advertising and marketing, finding opinion leaders, and detecting disease epidemics. For example, since August 2008 my group has been collecting a massive collection of news articles and blog posts. Currently the collection contains well over 37 billion documents, 30 TB of data, and is growing at a rate of 50 GB per day. This data gives us a near-complete picture of the online media space and allows for computational analysis of the online media landscape and its evolution. Among other things we have also developed Memetracker, which automatically tracks short textual phrases that propagate over the Web [KDD '09, TKDD '12]. Our analysis of the 2008 U.S. presidential election campaign received coverage by the New York Times. We have also worked with domain experts in the area of journalism. With the Pew Research Center's Project for Excellence in Journalism we have worked on a widely disseminated report on media coverage of the ongoing economic crisis.

(3) Stanford Network Analysis Platform (SNAP). The research community lacks proper tools and datasets for analysis of large networks.
I have developed SNAP, a freely available C++ network analysis platform that scales to massive networks. For example, using SNAP we investigated the six degrees of separation hypothesis on the "planetary scale" MSN Messenger network of 240M people and 1.3B edges [WWW '08]. I also created the Stanford Large Network Collection, the world's largest repository of network data. SNAP gets around 500 downloads per month and the datasets get 20,000 accesses per month.

VISION FOR THE FUTURE. My research group aims to build and harness models of networked systems to predict events and influence the dynamics of networks, such as large groups of people, social communities, the Web, and communication networks. Although the actions of a particular individual or component may be too difficult to model, machine learning and data mining can be applied to large groups or ensembles, which can yield effective models with the ability to predict the flow of future events. Based on my recent results, I believe that the study of large networks is the promising approach to developing such understandings, as graphs capture local dependencies, and also reveal large-scale structure and phenomena arising from the multitude of local interactions. The key is to connect local to global, complement the topology information with other types of data, and choose the right scale where micro propagates to macro.

In my current research, I have made several steps towards this goal. We now better understand microscopic and macroscopic network evolution and models that connect the two [TKDD '07]. Moreover, we can efficiently fit the network models to the data [ICML '12] and predict prior and future states of a network or a community [WSDM '12]. We also have a better understanding of how information and influence propagate in networks [TKDD '12], what the traces of propagation are [WSDM '11], and how to find influential nodes or detect disease outbreaks in networks [KDD '07].
In the future my research will focus on three dimensions, with an overarching theme of how to scale the above analyses to internet-scale data: (1) Complex analysis of social networks and social media. (2) Designing networked systems and influencing their evolution. (3) Encompassing richer types of networked data.

(1) Social networks and social media. The online world is a rich testbed for my research, as social networking sites contain detailed traces of human social activity, people's profiles, interests, etc. I want to understand how network structure and user activity determine the future of social networks. I aim to define and explore the notion of the "health" of a social network. Through our collaboration with NING.com we obtained complete interaction history data for 200,000 different social networks that independently evolve over time. This data allows us to study and compare the evolution of 200,000 different parallel "universes". We aim to develop metrics and indicators of the future "health" of a network. Metrics based on user activity have proven to be too shallow to capture the fitness of a network community. My hypothesis is that structural signatures of interaction networks and the emergence of hierarchical organization will be much better indicators of the future of a particular community and will allow us to answer questions like: Will the network die in the future? What structural indicators need to improve so that the network gets into better health and has better chances of survival?

Another important aspect of my future research lies in online media and information diffusion. Using our massive dataset of 30 billion social and news media articles I will computationally study how people consume and alter information, and how they influence information propagation. I aim to develop models and algorithms that scale to our massive collection of articles in order to analyze and build predictive models of information dynamics.
To address these questions my research will investigate the propagation of online information, focusing on the dynamics by which information is transmitted across the networks, the mechanisms by which it changes as it spreads, and the structure and dynamics of the implicit networks that serve to propagate it. These investigations have the potential to transform our understanding of how to manage real-time Web information, as well as our understanding of the evolving landscape of online news and commentary.

(2) Influencing networks. Creators of online communities have many degrees of freedom in their designs, but the dynamics and mechanics of rewards and reputation mechanisms may be especially critical. The two main incentive mechanisms that community creators use today are badges and reputation points. Reputation is usually defined as a numerical score that increases with every action a user takes on the website. Badges are awarded to users in recognition of their contributions to the community.

In our current collaboration with J. Kleinberg and D. Huttenlocher from Cornell we are analyzing the dynamics of Stack Overflow, a popular question-answering site for programmers. Stack Overflow has an active user community with millions of questions and answers. It also includes a visible reputation system. We found that the reputation mechanism on a site both provides information about a user's level of community involvement and provides incentives for effective contributions and good behavior.

Our future investigations will focus on a better understanding of incentive mechanisms for online communities. Badges provide incentives for users to be more active on the site while also steering users to perform actions that lead them towards the badges. This opens an interesting modeling problem of how user behavior changes, both in terms of engagement and in the distribution of user actions in the vicinity of a badge.
Here the idea is that the website owner has some notion of what types of user behavior are preferred on the website, and the question is what optimal set of incentives (i.e., badges and the conditions for obtaining them) keeps users engaged and promotes "good" behavior. We will study the incentive mechanisms for online communities both theoretically, at the level of models, and on real data, at the level of empirical analyses and live experiments.

(3) Richer types of network data. Traditionally, network analysis was mainly occupied with the linking structure alone, ignoring properties and attributes of nodes and edges in the network. We plan to expand the understanding of how network structure and node attributes relate to and affect each other. We will investigate models of networks with node and edge attributes. For example, we have developed the Multiplicative Attribute Graphs (MAG) model of networks with node attributes, which lends itself to mathematical analysis and is at the same time statistically interesting. Rich network data also leads to interesting problems like detecting social circles in users' social networks. We plan to study the problem of community identification and social circle detection in networks with node attributes. We plan to develop algorithms that combine network structure and node attribute information. These two data modalities will complement each other and allow for more robust detection of overlapping as well as hierarchically nested communities and social circles.

CONCLUSION. These steps represent a research framework that will allow me to tackle the challenging problems described above in a unique way. My research on networks is theoretically grounded and spans several areas of computer science as diverse as machine learning, theory and systems.
Computation over massive data is at the heart of my research, and my research has direct applications well beyond computer science, in the social sciences, physics, economics and marketing. In fact, we are successfully collaborating with journalists, communication scientists, biologists, medical doctors as well as linguists. Similarly, my research also has impact in industry: Facebook has deployed a version of our link prediction engine, Samsung and Volkswagen are evaluating our social media recommendation engine, and market research firm Ipsos (as well as the largest Chinese online advertiser Allyes) are considering our algorithms for online advertising. The collaboration with industry shows that my approaches are more than just ideas: they get implemented and solve problems today. I am excited about the influence that my research has already had within industry and academia, and look forward to continuing to make strides on both theoretical foundations and real-world applications.
