RESEARCH. My principal research interest is in applied machine learning and large-scale data mining, focusing on the analysis and modeling of large real-world networks as the study of phenomena across the social, technological, and natural worlds. Such graphs frame numerous research problems and high-impact applications. For example, social networks on the internet generate revenue of multiple billions of dollars; detection of virus outbreaks can save lives; regulatory gene networks help us understand how our cells work; anomaly detection in large computer-traffic networks is vital for corporate and national security.

My long term research goal is to harness [...] networks to understand, predict, and ultimately, enhance social and technological systems. I would like to create explanatory and predictive models of actions of large groups of people and societies, and biological and technological systems. Although the actions of a particular individual or component may be too difficult to model, machine learning and statistics can be applied to large groups or ensembles, which can yield effective models with ability to predict the flow of future events. Based on my recent results and research experience, I believe that the study of large networks is the promising approach to developing such understandings, as graphs capture local dependencies, and also reveal large-scale structure arising from the multitude of local interactions. [...] where global regularities and seemingly “random” behavior can propagate [...] emerge, e.g., power-law degree distributions and small diameters.

On the way to achieving this long-term goal, my research consists of (1) analyzing [...] models of network structure [...] efficiently estimate the model from data; (3) working with massive datasets of gigabyte [...] terabyte scale, as certain behaviors and patterns are observable only when the amount of data is large enough.

CURRENT. Through my work, I have addressed a number of questions regarding the properties and patterns of large evolving networks [...] by [...] useful applications. What does a “normal” network look like? How will it evolve over time? Is the network or a community “healthy”? How do information and viruses spread over the network? How can we identify and find influential nodes or select nodes to immunize in networks? Answers to such questions are vital to a range of application areas from the identification of illegal [...] misconfigured routers on the Internet, viral marketing, and protein-protein interactions to disease detection.

Results of my doctoral research have been included in the curricula of several graduate classes on network analysis, advanced data mining, internet and social networks across universities. For example, William Cohen at CMU, Lada Adamic at [...], Jon Kleinberg at Cornell, Dovrolis at Georgia Tech and others make our results part of their courses.

In my dissertation research, I focused on static and evolving networks, and the dynamics of processes, like virus propagation, [...] place in [...]. The table below gives the mapping of my thesis research with the [...] to the [...] of this document.

                                     Analysis   Models   Algorithms
Static networks                      2 8
Dynamics of network evolution        1 1 1
Dynamics of processes on networks    3 8

My principal research interest is in large-scale data mining, focusing on the analysis of networks. Networks allow us to study phenomena across the social, technological, and natural worlds. Networks frame numerous research problems that lead to high-impact applications. For example, social networks on the Internet generate revenue of billions of dollars; detection of virus outbreaks in human networks can save lives; anomaly detection in computer-traffic networks is vital for security. My long-term research goal is to harness large-scale social and information networks to understand, predict, and ultimately, enhance social and technological systems. I aim to create explanatory and predictive models of actions of large groups of people and societies, and large technological systems.

BACKGROUND. Only a few years ago the goal of modeling large social and technological systems would have been unattainable. However, in less than a decade the World Wide Web has transformed from a large static library that people only browse into a vast information resource where people interact with one another. Through the emergence of online social networking, social media and social gaming, daily activities of hundreds of millions of people are migrating to the Web. Today the Web is a “sensor” that captures the pulse of humanity: what we are thinking, what we are doing, and what we know.

The activity of millions of humans on the Web leaves massive digital traces that come in many forms and modalities, combining text, images, and video along spatial and temporal axes, and are connected in rich network structures. Such data can naturally be represented, studied and analyzed as complex dynamic interaction networks. Networks provide enormous potential both to address long-standing scientific questions, and also to harness and inform the design of future social computing applications. Networks pose interesting challenges and questions that motivate my research: How is information in a social network created? How does it flow and mutate as it is passed from node to node? How will a community or a social network evolve in the future? And also, how do we compute and develop algorithms that scale to massive dynamic networks?

My research group strives to address the above challenges and harness the opportunities networks put forward. My group combines analysis of complex networks with large-scale data mining to develop computational models of networks. Our explorations consist of:
(1) Modeling the structure and evolution of networks and online communities.
(2) Developing methods for social media analytics and information diffusion.
(3) Working with massive datasets, as certain behaviors and patterns are observable only when the amount of data is large enough.

INITIAL STUDIES AND CURRENT RESEARCH. Through my work, I have addressed a number of important questions regarding the properties of large networks:

(1) Structure and evolution of networks. My work had influence on thinking about fundamental structural properties of networks varying over time. Before, it was commonly believed that the average degree of networks remains constant as they grow over time and
that the distances in networks slowly (logarithmically) increase with the network size. We showed that networks densify over time: the number of edges $e(t)$ at time $t$ grows as $e(t) \propto n(t)^{\alpha}$ with the number of nodes $n(t)$. The densification exponent $\alpha$ is non-trivial, ranging from $1.2$ to $1.6$. Even more surprisingly, the diameter of networks shrinks as they grow. These findings were fundamentally different from what was believed at the time and we explained them by developing a “Forest Fire” network model [KDD ’05]. Besides studying the evolution of macroscopic properties of networks we also investigated the microscopic edge-by-edge evolution of the online social networks LinkedIn, YouTube, Yahoo Answers and Delicious [KDD ’08]. We modeled the effect of triadic closure and locality on the emergence of new edges in the network. These insights led us to develop a friend recommendation engine that is currently deployed at Facebook and correctly predicts 8 out of 20 of a user’s future friends [WSDM ’11]. This work shows how combining fundamental insights about the structure of networks with computational models leads to high-impact applications.

(2) Social Media Analytics. To model networked systems we also need to understand how influence, trust and information spread over the edges of the underlying network. My work developed insights into how information propagation data can be used for selecting targets for advertising and marketing, finding opinion leaders, and detecting disease epidemics. For example, since August 2008 my group has been gathering a massive collection of news articles and blog posts. Currently the collection contains well over 37 billion documents, 30 TB of data, and is growing at a rate of 50 GB per day. This data gives us a near-complete picture of the online media space and allows for computational analysis of the online media landscape and its evolution.
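Returning to the densification law stated under (1), $e(t) \propto n(t)^{\alpha}$: taking logarithms gives $\log e(t) = \alpha \log n(t) + c$, so the exponent can be read off as the slope of a straight-line fit on log-log axes. The following is a minimal sketch of that idea in plain Python, not the procedure used in the cited papers. The function name `densification_exponent` and the snapshot data are hypothetical and were made up for illustration.

```python
import math

def densification_exponent(nodes, edges):
    """Least-squares slope of log(edges) against log(nodes).

    nodes[i] and edges[i] are the node and edge counts of the i-th
    snapshot of a growing network (hypothetical input format).
    """
    xs = [math.log(n) for n in nodes]
    ys = [math.log(e) for e in edges]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic snapshots (assumed data): edge counts generated as n^1.3,
# so the fit should recover an exponent close to 1.3.
nodes = [1000, 2000, 4000, 8000, 16000]
edges = [n ** 1.3 for n in nodes]
alpha = densification_exponent(nodes, edges)
print(round(alpha, 3))
```

An exponent of 1 would mean a constant average degree, while values above 1 mean the average degree grows with the network, which is the sense in which the network densifies.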
Among other things we have also developed Memetracker, which automatically tracks short textual phrases that propagate over the Web [KDD ’09, TKDD ’12]. Our analysis of the 2008 U.S. presidential election campaign received coverage by the New York Times. We have also worked with domain experts in the area of journalism. With the Pew Research Center’s Project for Excellence in Journalism we have worked on a widely disseminated report on media coverage of the ongoing economic crisis.

(3) Stanford Network Analysis Platform (SNAP). The research community lacks proper tools and datasets for the analysis of large networks. I have developed SNAP, a freely available C++ network analysis platform that scales to massive networks. For example, using SNAP we investigated the six degrees of separation hypothesis on the “planetary scale” MSN Messenger network of 240M people and 1.3B edges [WWW ’08]. I also created the Stanford Large Network Collection, the world’s largest repository of network data. SNAP gets around 500 downloads per month and the datasets get 20,000 accesses per month.

VISION FOR THE FUTURE. My research group aims to build and harness models of networked systems to predict events and influence the dynamics of networks, such as large groups of people, social communities, the web, and communication networks. Although the actions of a particular individual or component may be too difficult to model, machine learning and data mining can be applied to large groups or ensembles, which can yield effective models with the ability to predict the flow of future events. Based on my recent results, I believe that the study of large networks is a promising approach to developing such understandings, as graphs capture local dependencies, and also reveal large-scale structure and phenomena arising from the multitude of local interactions.
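The six degrees of separation measurement mentioned under (3) is one concrete case of global structure emerging from local edges: the quantity of interest is the typical shortest-path distance between pairs of nodes. The sketch below illustrates the idea with breadth-first search in plain Python. It does not use SNAP or its API, the helper names `bfs_distances` and `average_separation` are hypothetical, and the toy graph is invented for illustration. Exact all-pairs search like this is only feasible on small graphs; on networks with hundreds of millions of nodes, running the search from a sample of source nodes is a common approximation.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distance from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def average_separation(adj):
    """Mean shortest-path length over all ordered, connected node pairs."""
    total = 0
    pairs = 0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0

# Toy undirected graph (assumed data): the 4-cycle a-b-c-d-a.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

print(average_separation(adj))
```

In the toy cycle every node has two neighbours at distance 1 and one node at distance 2, so the average separation is 4/3.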
The key is to connect local to global, complement the topology information with other types of data, and choose the right scale where micro propagates to macro. In my current research, I have made several steps towards this goal. We now better understand microscopic and macroscopic network evolution and models that connect the two [TKDD ’07]. Moreover, we can efficiently fit the network models to the data [ICML ’12] and predict prior and future states of a network or a community [WSDM ’12]. We also have a better understanding of how information and influence propagate in networks [TKDD ’12], what the traces of propagation are [WSDM ’11], and how to find influential nodes or detect disease outbreaks in networks [KDD ’07]. In the future my research will focus on three dimensions, with an overarching theme of how to scale the above analyses to internet-scale data: (1) Complex analysis of social networks and social media. (2) Designing networked systems and influencing their evolution. (3) Encompassing richer types of networked data.

(1) Social networks and social media. The online world is a rich testbed for my research, as social networking sites contain detailed traces of human social activity, people’s profiles, interests, etc. I want to understand how network structure and user activity determine the future of social networks. I aim to define and explore the notion of the “health” of a social network. Based on our collaboration with [...] we obtained complete interaction history data for 200,000 different social networks that independently evolve over time. This data allows us to study and compare the evolution of 200,000 different parallel “universes”. We aim to develop metrics and indicators that will be indicative of the future “health” of a network. Metrics based on user activity have proven to be too shallow to capture the fitness of a network community.
My hypothesis is that structural signatures of interaction networks and the emergence of hierarchical organization will be much better indicators of the future of a particular community and will allow us to answer questions like: Will the network die in the future? What structural indicators need to improve so that the network gets into better health and has better chances of survival?

Another important aspect of my future research lies in online media and information diffusion. Using our massive dataset of 30 billion social and news media articles I will computationally study how people consume and alter information, and how they influence information propagation. I aim to develop models and algorithms that scale to our massive collection of articles in order to analyze and build predictive models of information dynamics. To address these questions my research will investigate the propagation of on-line information, focusing on the dynamics by which information is transmitted across the networks, the mechanisms by which it changes as it spreads, and the structure and dynamics of the implicit networks that serve to propagate it. These investigations have the potential to transform our understanding of how to manage real-time Web information, as well as our understanding of the evolving landscape of on-line news and commentary.

(2) Influencing networks. Creators of on-line communities have many degrees of freedom in their designs, but the dynamics and mechanics of rewards and reputation mechanisms may be especially critical. The two main incentive mechanisms that community creators use today are badges and reputation points. Reputation is usually defined as a numerical score that increases with every action a user takes on the website. Badges are awarded to users in recognition of their contributions to the community.

In our current collaboration with J. Kleinberg and D. Huttenlocher from Cornell we are analyzing the dynamics of Stack Overflow, a popular question-answering site for programmers. Stack Overflow has an active user community with millions of questions and answers. It also includes a visible reputation system. We found that the reputation mechanism on a site both provides information about a user’s level of community involvement and provides incentives for effective contributions and good behavior. Our future investigations will focus on a better understanding of incentive mechanisms for on-line communities. Badges provide incentives for users to be more active on the site while also steering users to perform actions that lead them towards the badges. This opens an interesting modeling problem: how user behavior changes, both in terms of engagement and in the distribution of user actions, in the vicinity of a badge.
Here the idea is that the website owner has some notion of what types of user behavior are preferred on the website, and the question is which set of incentives (i.e., badges and the conditions for obtaining them) is optimal for keeping users engaged and promoting ‘good’ behavior. We will study the incentive mechanisms for online communities both theoretically, at the level of models, and on real data, at the level of empirical analyses and live experiments.

(3) Richer types of network data. Traditionally, network analysis considered mainly the linking structure while ignoring the properties and attributes of nodes and edges in the network. We plan to expand the understanding of how network structure and node attributes relate to and affect each other. We will investigate models of networks with node and edge attributes. For example, we have developed the Multiplicative Attribute Graphs (MAG) model of networks with node attributes, which lends itself to mathematical analysis and is at the same time statistically interesting. Rich network data also leads to interesting problems like detecting social circles in users’ social networks. We plan to study the problem of community identification and social circle detection in networks with node attributes. We plan to develop algorithms that combine network structure with node attribute information. These two data modalities will complement each other and allow for more robust detection of overlapping as well as hierarchically nested communities and social circles.

CONCLUSION. These steps represent a research framework that will allow me to tackle the challenging problems described above in a unique way. My research on networks is theoretically grounded and spans several areas of computer science as diverse as machine learning, theory and systems.
Computation over massive data is at the heart of my research, and my research has direct applications well beyond computer science: in the social sciences, physics, economics and marketing. In fact, we are successfully collaborating with journalists, communication scientists, biologists, medical doctors as well as linguists. Similarly, my research also has impact in industry: Facebook has deployed a version of our link prediction engine, Samsung and Volkswagen are evaluating our social media recommendation engine, and market research firm Ipsos (as well as the largest Chinese online advertiser Allyes) are considering our algorithms for online advertising. The collaboration with industry shows that my approaches are more than just ideas: they get implemented and solve problems today. I am excited about the influence that my research has already had within industry and academia, and look forward to continuing to make strides on both theoretical foundations and real-world applications.