Analyzing Microtext: Papers from the 2013 AAAI Spring Symposium
Characterizing Online Discussions in Microblogs Using Network Analysis
Veronika Strnadova¹,², David Jurgens², and Tsai-Ching Lu²
¹ University of California, Santa Barbara, Santa Barbara, California, USA
[email protected]
² HRL Laboratories, LLC, Malibu, California, USA
{dajurgens,tlu}@hrl.com
Abstract
Online discussions of a specific topic in microblogs may vary widely in their content, locality, and participants. We describe a method for analyzing microblog discussions in Twitter, using a network model in order to characterize discussions by their network properties. Building the network from participants, their messages, their locations, and shared message content, we present an analysis of six network features over four types of discussions. Our analysis reveals that the diameter of the discussion network, when normalized for size, provides a strong indicator for the level of concentration around a few items. In addition, we show that different discussion types show clear patterns in how their entities (e.g., participants, locations) vary over time in importance to the discussion. Our analysis identifies clear measures for quantifying and categorizing novel discussions in terms of their important features and expected dynamics.

1 Introduction
The large scale of microblogging activity has given rise to free-form discussions in which participants may join and leave at any time. Furthermore, for a given topic, many factors such as news stories or region-specific interest may drive new users to participate. For example, Romero, Meeder, and Kleinberg (2011) note that topical categories often have very different patterns in how information is shared in their discussions. We present a pilot study aimed at characterizing the types of online discussions based on network features. Our goal is to identify quantitative measures of driving factors behind a discussion and to discover those features which reveal significant differences in the types of discussion.

A network model of the discussion enables the linking of participants in a conversation with their messages and shared conversational features, e.g., hyperlinks. Recently, Ruiz et al. (2012) proposed a new method for constructing topical networks from Twitter conversations. They demonstrated how the properties of this network could be used to predict stock price changes. We propose that this type of network construction can also be used to categorize the discussion around a topic by observing the temporal change in the properties of the network. For example, we would expect that the conversations around companies would be different than those for sports teams, due to the participants’ focus and even geographic location. Understanding what the quantitative differences are between types of discussions can enable the categorization of a novel topic, based on the dynamics of the online conversation discussing it.

As a part of our initial study, we present an approach to categorizing online topical discussions in Twitter using a network to model the relationships between participants, locations, and discussion points. We propose a new extension to the graph building method of Ruiz et al. (2012) that incorporates a user’s location. Locations enable detecting geographic shifts in a conversation, such as when the discussion around a sports team moves between the cities in which the team plays. We provide an analysis of six network features and demonstrate how each corresponds to different discussion phenomena. The key contributions are as follows. First, we demonstrate that a normalized variant of the network diameter can track changes in the level of topical interest that aren’t necessarily reflected in frequency-based features which look for the most popular elements. Second, we present a case study of graph features that illustrates how the underlying dynamics of locations, participants, and discussion points vary significantly based on the type of topic in discussion. Notably, we find that the temporal changes in the PageRank of the most frequent node types are the strongest discriminating feature between types of discussions. Third, while Ruiz et al. (2012) found that several network features could be strongly correlated with stock market behavior, we find that their network-based features can often be reduced to frequency-based features that do not require the construction of a discussion network. Last, we show that a previously unaddressed aspect of online discussions, the physical locations of the participants, varies significantly by discussion topic and is essential in categorizing the kind of discussion.
2 Twitter Discussion Network
Twitter provides a multifaceted notion of discussion. A single message may be linked to a larger discussion by both explicit and implicit features. User mentions provide an explicit indication of a person-to-person discussion; a message may link to other users through specific mentions of the user’s name with the “@User” notation. Similarly, a user may repost another message, commonly referred to as retweeting, which acts as both a form of diffusing information and engaging in conversation (Honey and Herring 2009; Boyd, Golder, and Lotan 2010). Conversely, hashtags, e.g., “#baseball,” provide an explicit indication of a message joining a larger, potentially global conversation; hashtags enable other users to discover the message by its use of a shared tag (Huang, Thornton, and Efthimiadis 2010).

We construct a network in which individual tweets link to four other node types: users, hyperlinks, hashtags, and locations. Figure 1 illustrates the schema used to construct the network from individual tweets, and Figure 2 provides an example instantiation of the schema to illustrate connectivity. As the discussion grows, the network grows more connected to elements that are shared in common between many tweets. For example, a viral news story would generate a graph with many tweets pointing to the same hyperlink and possibly to a tweet that was shared multiple times. Importantly, this network representation of an online discussion enables modeling of shared relations between the features. The discussion may be driven by extrinsic features, such as hyperlinks with common interest, or common geographic location of the participants, which may influence their topic choice.

Figure 1: The schema used to construct a discussion graph from individual tweets, extended from Ruiz et al. (2012). Node types shown: Tweet, User, Hashtag, Url, Location. Edge labels shown: Creates, Created In, Mentions, Annotated, Cites, Re-Tweet.

Figure 2: An example demonstrating the connections between various node types in the discussion graph. Nodes shown: users Alice, Bob, Clarissa, and David; Tweets 1 through 6; hashtags #baseball, #cardinals, and #dodgers; locations St. Louis, Missouri, USA and Los Angeles, California, USA.

Following the method of Ruiz et al. (2012), to construct the graph, we select a seed keyword and build a graph from all tweets that mention that keyword. We differ slightly from Ruiz et al. (2012) in our treatment of retweeted messages. Whereas Ruiz et al. (2012) link two tweets together if the Jaccard similarity between the tweets’ message bodies is above a certain threshold, we found that this method overestimates the number of retweets, and instead used the metadata provided by Twitter to establish a retweet. Although the graph is not directly based on the context of the microtext, the URL and hashtag features provide generalized indicators of the tweet’s content.

For associating tweets with locations, we use a two-step process. First, approximately 1% of all tweets come with geocoordinates, which allows us to accurately recover the location at which a tweet originated. We use the Google Maps Reverse Geocoding service to convert these into a canonical city-level location name. Second, when geocoordinates are not available, we follow the method of Cheng, Caverlee, and Lee (2010), which looks into a user’s profile for a location and uses that as the location of the tweeted message. Location names from user profiles were further normalized to the same set of city labels. Furthermore, using the tweets with geocoordinates as ground truth, we calculated the error for profile-based location names and removed those names with a median error above 25 km, ensuring a higher precision at the expense of recall. Ultimately, approximately 32% of all messages were linked with a location.

Musicians                Sports Teams    Companies    Countries
Lady Gaga                Dodgers         Microsoft    Great Britain
Jay-Z                    Yankees         IBM          United States
Red Hot Chili Peppers    Cardinals       Intel        Uganda
Dave Matthews Band       Diamondbacks    Apple        Greece
Kanye West               Padres          Sony         Mexico
Table 1: The list of terms used to seed the discussion graphs

Copyright © 2013, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
3 Collective Discourse Analysis Methodology
Our pilot study consists of discussions for four topic categories: sports teams, popular musicians, companies, and
countries. We hypothesize that each category represents a
different type of online discussion with respect to both content and locality. Specifically, the interest around the topic
varies in terms of news content, geographic locality, and
trending popularity. Our goal is to identify a set of graph
measurements that accurately characterize these different
types of discussions in order to develop a methodology for
characterizing discussions about arbitrary terms. For example, we selected American baseball teams as examples of
discussions that are often highly local based on where the
team is located, but that are mobile as the team travels to
play in opponents’ home cities.
For each topic category, we selected five terms, listed in
Table 1, that might best exhibit different network characteristics. For each of the terms listed in Table 1 we select
all tweets containing that term from a 10% sample of all
the Tweets which appear on Twitter. For computational efficiency, if in one day a term has over 15,000 tweets, we
sample 15,000 tweets without replacement from the selected
tweets.
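The daily cap described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `cap_daily_tweets` and the `seed` parameter are assumptions introduced here, and only the 15,000 threshold comes from the text.

```python
import random


def cap_daily_tweets(tweets, cap=15_000, seed=None):
    """Return at most `cap` tweets for one day.

    If the day has more than `cap` tweets, draw `cap` of them uniformly
    at random without replacement; otherwise return them all unchanged.
    `seed` is a hypothetical convenience for reproducible runs.
    """
    tweets = list(tweets)
    if len(tweets) <= cap:
        return tweets
    rng = random.Random(seed)
    # random.Random.sample draws without replacement.
    return rng.sample(tweets, cap)
```

Sampling per day rather than per month keeps every day represented, at the cost of flattening volume differences on the busiest days.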
To analyze changes to the graphs over time, we aggregate discussions at the daily level. A daily level enables us to observe multiple discussion points per day that correspond to some variability in the network, while still being able to observe consistent properties and entities that persist between days. For network properties, we chose to observe daily changes in the graph diameter, node type composition, and the degree, PageRank, and closeness distributions of the nodes in each graph. We do not expect these properties to change significantly over the course of a day, but daily changes do help us characterize the discussion.
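As a rough sketch of how the schema of Section 2 and the daily aggregation might fit together, the following builds one undirected graph per day from tweet records. The record layout (keys `id`, `day`, `user`, `mentions`, `hashtags`, `urls`, `location`, `retweet_of`) and the function name are assumptions made for illustration; the paper does not specify a data format.

```python
from collections import defaultdict


def build_daily_graphs(tweets):
    """Build one undirected discussion graph per day.

    Each tweet is a dict with hypothetical keys 'id', 'day', 'user' and
    optional 'mentions', 'hashtags', 'urls', 'location', 'retweet_of'.
    Nodes are (type, value) tuples; a graph is a dict mapping each node
    to the set of its neighbors.
    """
    graphs = defaultdict(lambda: defaultdict(set))
    for t in tweets:
        graph = graphs[t["day"]]
        tweet_node = ("tweet", t["id"])
        # Every tweet links to its author and to any mentioned users.
        neighbors = [("user", t["user"])]
        neighbors += [("user", m) for m in t.get("mentions", [])]
        neighbors += [("hashtag", h) for h in t.get("hashtags", [])]
        neighbors += [("url", u) for u in t.get("urls", [])]
        if t.get("location"):
            neighbors.append(("location", t["location"]))
        # Retweets are taken from metadata, not from text similarity.
        if t.get("retweet_of") is not None:
            neighbors.append(("tweet", t["retweet_of"]))
        for node in neighbors:
            graph[tweet_node].add(node)
            graph[node].add(tweet_node)
    # Convert to plain dicts so later lookups cannot create nodes.
    return {day: dict(graph) for day, graph in graphs.items()}
```

Because shared entities such as a hashtag or a city become single nodes, two tweets that never reference each other still end up in the same component when they share one of those entities.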
4 Complete-Graph Metrics
Several graph metrics serve as descriptors of the graph as a whole. We refer to such measures as “complete-graph” metrics and present the effectiveness of tracking temporal changes in these metrics in order to classify online discussions. Specifically, we describe how changes in the number of connected components, the diameter, and the node type distribution can help us categorize discussions.

4.1 Node Types
We hypothesized that differences in the node type composition of the discussion network might reveal what was most important to the discussion. For example, if a discussion includes a large number of hyperlink nodes relative to other days, we may infer that there were external news stories that drove the discussion.

Figure 3 illustrates the changes in composition for sample keywords in three of our categories. For each day in July, we plot the percentage of each node type. In our broader analysis of all the graphs, we did not observe any major variation in node type composition between the categories; the composition was largely similar for all graphs with a roughly equal distribution of tweets and users, and a similarly equal distribution of hyperlinks and hashtags. We note that this result is not necessarily an artifact of the graph construction method. For example, if a discussion is generated by a moderately sized group of users that generate the bulk of the messages, then we would expect to see a greater disparity between the percentage of tweets and users.

However, despite the lack of categorically-biased distributions, several minor trends do emerge. The graph for Microsoft in Figure 3a reveals a temporal pattern in its hyperlink and hashtag nodes, which we found corresponded to the increase in percentage of hashtags on the weekend. A similar temporal trend was seen in other company-based discussions. Both the discussion for the Diamondbacks baseball team and that for Great Britain exhibit significant variation in the node type distributions. However, as the Olympic Games approach and begin on July 27th, we see the distribution become much more consistent for Great Britain, which hosted the games. This change highlights the impact that an external event can have on the dynamics of the discussion.

Figure 3: A comparison of the distribution of node types in the largest component for Great Britain, Microsoft, and the Diamondbacks baseball team highlights how the focus of a discussion in terms of hashtags, retweets, and shared hyperlinks may vary significantly on a daily basis. Panels: (a) Microsoft, (b) Diamondbacks, (c) Great Britain. Each panel plots the percentage of each node type (Location, Tweet, User, Hashtag, Url) against the date in July 2012.

4.2 Connected Components
As a discussion network grows over time through the addition of new tweets, not all tweets are expected to join the same connected component. For example, two news stories may give rise to disjoint sets of users conversing about each topic, resulting in two large disconnected components. However, our analysis revealed that all of our networks exhibited the same properties: one large connected component, and many small components. The large component was always at least an order of magnitude larger than any of the smaller components, and most of the small components were “singleton” tweets: they consisted of one tweet, one user, and zero to three of the other node types. These small components represent locations, people, and what we might call peripheral topics, not connected in any way to the main theme of the discussion.

This behavior resembles the observed component distribution of online social networks, which tend to have a giant component, and many smaller disconnected components (Kumar, Novak, and Tomkins 2010). However, the behavior is surprising given that our network is tweet-based with the majority of connections between tweets coming from intermediary edges from locations, hashtags and hyperlinks, not directly from tweets.

Ruiz et al. (2012) show that the number of connected components tends to have very high correlation with changes in traded volume for a stock. However, given our experimental results, the correlation may be due simply to the fact that more people are discussing the keyword which the graph is built around, and reflects tweet volume rather than relations among users, hashtags, and hyperlinks. For example, given a spike in public interest about a topic (e.g., a baseball game), more users will tweet messages containing the keyword. However, the increased volume creates more variety in the observed hashtags, hyperlinks, and locations, and therefore creates more nodes that are not connected to the larger component by virtue of their uniqueness.

Indeed, Ruiz et al. (2012) note that the second most predictive feature was the number of nodes, which can be computed by keeping a simple count of node entities, without the need for a graph. In our case studies, for all of a keyword’s graphs, Spearman’s ρ for the number of nodes and the number of components was above 0.9 for nearly every keyword, thereby suggesting that node count alone could be used as an effective substitute. We speculate that the correlation in tweet frequency with increased trading volume may be another example of an already established link between online user behavior with search engines and trading volume (Preis, Reith, and Stanley 2010).

Figure 4: Spikes in the normalized diameter show a decentralization of the discussion around a topic that corresponds to real-world phenomena. Panels: (a) Microsoft, (b) Yankees, (c) Great Britain. Each panel plots the diameter and the normalized diameter (diameter / number of nodes) against the date in July 2012.

4.3 Diameter
The diameter of a connected component measures the length of the longest path between two vertices in that component, a feature which is not accessible using a frequency-based analysis. We hypothesized that if a conversation is highly concentrated around a few entities, then it is likely that the diameter of the largest component will be small relative to the component size. Conversely, if a discussion is unfocused with many unrelated entities, then the diameter of the largest component will increase. Additionally, because the total volume of discussion entities varies per day, we use a version of the diameter that is normalized by the total graph size. We use the term “normalized diameter” to refer to the value

$$\frac{\max_{u,v} d(u,v)}{n_{L(G)}},$$

where $\max_{u,v} d(u,v)$ is the value of the diameter in the largest connected component, and $n_{L(G)}$ is the number of vertices in the largest connected component of graph G.
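The definition above can be illustrated with a small breadth-first-search sketch. It assumes an unweighted, undirected graph stored as a dict mapping every node to the set of its neighbors; that representation and the function names are choices made here for illustration, not details given in the paper.

```python
from collections import deque


def connected_components(adj):
    """Return the connected components of an undirected graph as sets."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    component.add(v)
                    queue.append(v)
        components.append(component)
    return components


def eccentricity(adj, source):
    """Length of the longest shortest path starting at `source`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())


def normalized_diameter(adj):
    """Diameter of the largest component divided by its vertex count."""
    if not adj:
        raise ValueError("graph has no nodes")
    largest = max(connected_components(adj), key=len)
    diameter = max(eccentricity(adj, u) for u in largest)
    return diameter / len(largest)


# A path a-b-c-d plus a separate pair x-y: the largest component has
# 4 vertices and diameter 3, so the normalized diameter is 0.75.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"},
        "x": {"y"}, "y": {"x"}}
print(normalized_diameter(path))  # 0.75
```

Running one breadth-first search per vertex gives the exact diameter but costs time proportional to vertices times edges, so graphs built from thousands of tweets per day may call for an approximate diameter instead.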
Figure 4 shows a comparison of the diameter and normalized diameter over time for the keywords “Microsoft”, “Yankees”, and “Great Britain”. We did not observe any changes in the diameter that corresponded to changes in user behavior. However, the normalized diameter does show clear patterns that correspond to real-world phenomena affecting user behavior. Despite being very different keywords, all three show changes in the normalized diameter that correspond to changes in how concentrated the discussion is. The spikes in normalized diameter in Fig. 4a correspond to weekends during which less news is published on Microsoft as a company. Similarly, the spikes in Fig. 4b correspond to days when the Yankees do not have games and discussion ranges widely. In contrast, the sharp drop in Fig. 4c corresponds to the start of the Olympic Games, where a high percentage of the discussion for Great Britain becomes highly focused around that topic.
Figure 5: A log-log plot of degree frequency for all nodes in the largest component during one day’s discussion. Series shown: Microsoft, Mets, Great Britain, Jay-Z. Axes: degree (x) and degree frequency (y).

5 Graph Node Metrics
In comparison to the complete-graph metrics in Section 4, graph node metrics track the properties of individual nodes, which may change significantly over time. For example, while the normalized diameter may indicate whether a conversation is highly concentrated around a certain topic, it does not tell us which topic. To evaluate changes in local network structure, we studied changes in distributions for three measures: degree and PageRank, used by Ruiz et al. (2012), and a new measure, closeness.

5.1 Degree Distribution
Newman, Watts, and Strogatz (2002) state that a highly skewed degree distribution which obeys a power law is one of the three distinctive features of social networks. We expected locally-focused discussions to have the highly
skewed distribution characteristic of social networks, because a topic that is specific to a small geographic region may include a substantial number of users who are socially connected. Figure 5 shows a log-log plot of node degree frequencies on a single day for four representative discussion keywords, all of which resemble a power law, consistent with our hypothesis. We note that the networks could have non-power law distributions if generated by certain types of user behaviors. For example, we originally hypothesized that company terms would generate multiple sub-discussions around their various products, which would yield a more uniform degree distribution. However, the degree distributions were indistinguishable for different discussion types. For each keyword, no significant temporal changes were observed in the degree distribution’s mean, median, variance, kurtosis, or skewness.

Ruiz et al. (2012) also found that the degree distribution was not a reliable source of information for graphs representing interactions between actors and topics related to the stock market. The authors attempted to track changes in the properties of degree distributions in their networks, but found that these properties did not correlate well with any of the stock market events which they tried to predict. Their approach considers statistics such as the quartiles, skewness, and kurtosis of the degree distributions. Their finding is consistent with our observation that the degree distributions are relatively stationary over time and therefore the statistics would not yield additional predictive power.

Figure 6: A comparison of the PageRank values for IBM, Lady Gaga, and Yankees for the ten most frequent nodes of types hashtag (top), location (middle), and user (bottom). The temporal changes in the PageRanks of the most frequent nodes reveal distinct types of discussion behaviors that are representative of trends seen for similar terms. Panels: (a) IBM, (b) Lady Gaga, (c) Yankees. Each panel plots PageRank against the date in July 2012. Hashtags in the legends: IBM: #BROCADE, #VMWare, #EMC, #SVC, #AIX, #XIV, #storage, #Linux, #ibm, #IBM; Lady Gaga: #np, #nowplaying, #CelebrityNews, #lastfm, #LadyGaGa, #TeamFollowBack, #ladygaga, #RetweetTheSongs, #NP, #RT; Yankees: #yankees, #GoRedSox, #Mariners, #NYY, #Ichiro, #GoYankees, #ASG, #MLB, #RedSox, #Yankees.

5.2 PageRank
The PageRank of a node measures how well-connected the node is in terms of having important neighbors (Brin and Page 1998). Thus, for graphs representing a topical discussion, we would expect nodes with a large PageRank to be the most important aspects of the discussion. Furthermore, analyzing the changes of the highest ranked nodes over time can reveal what drives the discussion. We illustrate this point through Figure 6, which visualizes the PageRank scores for the most frequent locations, hashtags and users in the respective keyword-graphs for three types of discussions over the course of a month. For hashtags, we see that whereas discussions of IBM and Yankees are driven by a relatively stable set of hashtags, the discussions for Lady Gaga are frequently associated with hashtags such as #RetweetThisSong that quickly spike and die down, as well as tags such as #LastFM, where a song by Lady Gaga has been played. The difference in tagging behavior between topics agrees with a recent analysis by Romero, Meeder, and Kleinberg (2011).
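A per-day PageRank series of this kind could be computed along the following lines. This is a hedged sketch: it assumes each daily graph is an undirected dict of neighbor sets with nodes stored as (type, value) tuples, treats an entity's frequency as its total degree over all days, and uses a plain power iteration with a damping factor of 0.85, none of which is specified in the paper.

```python
def pagerank(adj, damping=0.85, iterations=50):
    """Power-iteration PageRank on an undirected adjacency dict of sets."""
    n = len(adj)
    if n == 0:
        return {}
    rank = {u: 1.0 / n for u in adj}
    for _ in range(iterations):
        # Rank held by isolated nodes is spread evenly over all nodes.
        isolated = sum(rank[u] for u, nbrs in adj.items() if not nbrs)
        base = (1.0 - damping) / n + damping * isolated / n
        new_rank = {u: base for u in adj}
        for u, nbrs in adj.items():
            for v in nbrs:
                new_rank[v] += damping * rank[u] / len(nbrs)
        rank = new_rank
    return rank


def top_node_pagerank_series(daily_graphs, node_type, k=10):
    """Daily PageRank of the k most frequent nodes of one type.

    `daily_graphs` maps a sortable day label to that day's graph.
    Returns the sorted day labels and, per node, one score per day
    (0.0 on days when the node does not appear).
    """
    days = sorted(daily_graphs)
    frequency = {}
    for day in days:
        for node, nbrs in daily_graphs[day].items():
            if node[0] == node_type:
                frequency[node] = frequency.get(node, 0) + len(nbrs)
    top = sorted(frequency, key=frequency.get, reverse=True)[:k]
    series = {node: [] for node in top}
    for day in days:
        scores = pagerank(daily_graphs[day])
        for node in top:
            series[node].append(scores.get(node, 0.0))
    return days, series
```

Plotting each list in `series` against `days` gives curves comparable in spirit to those in Figure 6; a hashtag that spikes and dies down shows up as a series that is near zero on most days.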
Key differences are also seen for locations in Figure 6, where the discussions for “IBM” and “Lady Gaga” have highly populous cities in several countries as the highest weighted in their graph, with no clear trend in which city generates the most discussion. Conversely, the discussions for “Yankees” consistently have the team’s home city, New York, as the most central. Other highly ranked locations in the network correspond to other cities with baseball teams, with the locations spiking in PageRank when the Yankees play a team from that city.

The ranking of users reveals that a single user is consistently ranked the highest in the graph, which further analysis showed to be Twitter accounts associated with the company, entertainer, and sports team, respectively. In addition, both Lady Gaga and the Yankees showed cases where other non-associated users were central to the discussion. In the case of the Yankees, these users tended to be accounts associated with the teams they were playing and therefore correlated with the trends seen in location. We would expect the emergence of a single prominent user in the discussion network if that user was highly influential, a driving force behind the discussion. However, we found that the most central user according to the PageRank measure was merely the user who tweeted most frequently, and was represented by the corresponding user node with the highest degree.

Last, as a test of the importance of the PageRank score, we computed Spearman’s ρ between the relative ranking of a node according to PageRank and to its degree. For every graph in our pilot study we observed that ρ was above 0.9, when comparing the nodes in the largest component. Given the high correlation, it appears that although the PageRank provides a useful measure of importance for an entity in a discussion network, a simple frequency-based approach to measuring importance (the degree of a node in our network is simply a count of occurrences of the entity represented by the node in the Twitter dataset) can provide highly similar ratings with far lower computational costs.
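For reference, a rank correlation of this kind can be computed without any external library. The sketch below is a generic Spearman’s ρ (Pearson correlation of average ranks, so ties are handled); the function names are chosen here, and it is not taken from the authors’ implementation.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rank correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5
```

To compare two node scores, list the nodes of the largest component once, then pass the degree of each node as `xs` and its PageRank (or any other score) as `ys`, keeping both in the same node order.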
5.3 Closeness
Closeness centrality is a different measure of node importance than degree, PageRank, or betweenness centrality, measuring which nodes are near the graph’s “center”. Formally, the closeness Cl(v) of a node v ∈ V is defined as the inverse of the mean of the farness to other nodes:

$$Cl(v) = \frac{1}{\frac{1}{|V|} \sum_{u \in V} d_G(u, v)},$$

where $d_G(u, v)$ is the length of the shortest path between u and v, and |V| the number of vertices, in the graph G. We only consider the largest component of the graph when computing closeness centrality. That is, in our case G represents the largest connected component of the Twitter network, and we need not deal with cases when v is not reachable from u. If we rank all the nodes in the largest component of our graph by closeness, the highest-ranked nodes are closest in path length to all the other nodes in the graph. Thus, if we believe that an online discussion is centralized around one or very few topics, people, or locations, we might expect the nodes representing these focal points to have the best closeness scores. Examining node importance in this way might be best when considering compact, highly focused local discussions.

Figure 7: Distribution of node closeness values in the “Cardinals” graph on July 6, 2012. Axes: closeness (x) and frequency (y).

Figure 8: Closeness values of the top ten hashtags appearing in the “Cardinals” network for the month of July. Hashtags shown: #CardinalNation, #TakeJake, #Cubs, #stlcards, #MLB, #ASG, #BestTeam, #FinalVote, #FreesePlease, #Cardinals. Axes: date in July 2012 (x) and closeness (y).

Unlike PageRank and betweenness, closeness did not appear to correlate highly with degree; Spearman’s ρ between closeness and node degree was less than 0.4 in every case. Some correlation is still expected due to the network construction process: a node with high degree will be connected to other nodes in the graph with greater probability, increasing the chance that it has short path lengths to all other vertices. Unlike the degree and PageRank distributions, however, the distribution of node closeness values resembles a normal (Gaussian) shape, as shown in the distribution in
discussion elements, such as events, that are also regionspecific. In identifying what are the most important tweets in
a discussion, De Choudhury, Counts, and Czerwinski (2011)
consider a related problem of what are the most important
to return given a search query for a term. Using microtext,
social network, and discussion attributes that were selected
based on a user survey, they found that tweets which exemplified diversity in these attributes were among the best to
return according to a user assessment. Future work may consider whether a diversity ranking with respect to discussion
elements may enable better identification of the key tweets
in an evolving online discussion.
Recent work has also focused on the related problem
identifying location-specific topics (Mei et al. 2006; Wang
et al. 2007; Hong et al. 2012). Whereas our analysis asks
whether location is an important factor in an online discussion, these models identify the most likely language of interest for specific locations by leveraging microtext that has
been geocoded. Unlike our analysis, these models do not
emphasize the change in topics for a location over time. Both
Hao et al. (2010) and Yin et al. (2011) address the problem
of comparing regional topics, which is related to our question of how to compare discussions. Future analyses may
integrate these comparative approaches by examining how
typical the microtext produced in a location-driven discussion is to that region.
Last, we note that in a study of viewing trends for
YouTube, Crane and Sornette (2008) propose a model of
online viewing that identifies three classes of user behavior. Unlike our analysis, they model only viewing counts,
whereas we consider factors such as location and repeated
participants.
[Figure 9 plot. Legend: #CardinalNation, #TakeJake, #Cubs, #stlcards, #MLB, #ASG, #BestTeam, #FinalVote, #FreesePlease, #Cardinals. x-axis: date in July, 2012; y-axis: PageRank.]
Figure 9: Pagerank values of the top ten hashtags appearing
in the “Cardinals” network for the month of July.
Figure 7 of all nodes in the largest component of the “Cardinals” graph. In addition, the shape of our node closeness
distributions was not as consistent as that of the degree and
PageRank distributions. We observed a variety of distributions, ranging from nearly uniform to the normal
shape discussed above, depending on the day and keyword
being represented.
Closeness appears to be a good indicator of tiers of importance for entities of a given type. In Figure 8, #Cardinals is consistently a top-ranked hashtag, #MLB, #stlcards,
and #CardinalNation are of secondary importance, and the
more trendy hashtags, such as #takejake, only spike in importance around the dates that they are relevant to the discussion. Levels of importance also appear when we consider
other entities, such as locations or users, and for other keywords as well. Ranking hashtags by PageRank provides a
similar ordering (Fig. 9); however, the PageRank values do
not vary sufficiently to differentiate between levels of importance. Closeness values should further enable discussion
analysis by highlighting which entities are of highest importance.
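As an illustration of the closeness measure used above, the following is a minimal sketch that restricts an undirected graph to its largest connected component and computes Cl(v) as the inverse of the mean shortest-path distance from v to all nodes in that component. The function names and the toy edge list are hypothetical and are not taken from our implementation; the graph is assumed to be unweighted, so distances are found with breadth-first search.

```python
from collections import deque


def build_adjacency(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def bfs_distances(adj, source):
    """Hop distances from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def largest_component(adj):
    """Return the node set of the largest connected component."""
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        comp = set(bfs_distances(adj, start))
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return best


def closeness(adj):
    """Cl(v) = |V| / sum_u d(u, v), with V the largest component."""
    comp = largest_component(adj)
    n = len(comp)
    scores = {}
    for v in comp:
        total = sum(bfs_distances(adj, v).values())
        scores[v] = n / total if total > 0 else 0.0
    return scores


# Hypothetical toy discussion: tweets t1..t4 linked to the hashtags they contain.
edges = [
    ("t1", "#Cardinals"),
    ("t2", "#Cardinals"),
    ("t3", "#Cardinals"),
    ("t3", "#TakeJake"),
    ("t4", "#Cubs"),  # separate, smaller component; excluded from the scores
]
scores = closeness(build_adjacency(edges))
for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{node:12s} {score:.3f}")
```

In this toy graph the hashtag attached to most tweets receives the highest closeness, while a hashtag attached to a single tweet scores lower, which mirrors the tiering behavior described above. Note that some libraries normalize by |V| - 1 rather than |V|, so their values may differ slightly from this definition.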
7 Conclusion and Future Work
We presented an initial case study on using a network model
and associated statistics to analyze and characterize online
networks. Motivated by the work of Ruiz et al. (2012), we
generated a network model that incorporated nodes for five
discussion features: tweets, users, hyperlinks, hashtags, and
locations, with links between nodes based on the content of
a tweet. Using these discussion networks, we analyzed four
network statistics that were found to be highly predictive of
stock market trends: node type distribution, the number of
connected components, PageRank, and node degree. In addition, we proposed two new features, the normalized diameter
and the nodes’ closeness distribution, both of which offered
superior insight into the discussions.
As a result of our analysis, we also found that the two
highly predictive properties used by Ruiz et al. (2012), degree and PageRank, are easily reducible to simple frequency
counts. Furthermore, a secondary analysis of the distribution
of node degree and PageRank revealed that these distributions are stationary over time, and therefore aid neither
discussion analysis nor prediction.
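To make the reduction of degree to a frequency count concrete, the sketch below builds a small discussion graph and checks that the degree of every non-tweet node equals the number of tweets that mention it. It assumes, for illustration only, that each tweet node is linked directly to its user, hashtags, hyperlinks, and location; the tweet records, field names, and helper functions are hypothetical, and the sketch does not address PageRank.

```python
from collections import Counter, defaultdict

# Hypothetical tweet records; field names are assumptions for this sketch.
tweets = [
    {"id": "t1", "user": "@alice", "hashtags": ["#Cardinals", "#MLB"],
     "links": [], "location": "St. Louis"},
    {"id": "t2", "user": "@bob", "hashtags": ["#Cardinals"],
     "links": ["http://example.com/a"], "location": "St. Louis"},
    {"id": "t3", "user": "@alice", "hashtags": ["#Cardinals", "#TakeJake"],
     "links": [], "location": None},
]


def entities_of(tweet):
    """Typed entity nodes (user, hashtag, link, location) found in one tweet."""
    nodes = {("user", tweet["user"])}
    nodes.update(("hashtag", h) for h in tweet["hashtags"])
    nodes.update(("link", u) for u in tweet["links"])
    if tweet.get("location"):
        nodes.add(("location", tweet["location"]))
    return nodes


def build_graph(tweets):
    """Undirected graph in which each tweet node links to its entities."""
    adj = defaultdict(set)
    for tweet in tweets:
        tweet_node = ("tweet", tweet["id"])
        for entity in entities_of(tweet):
            adj[tweet_node].add(entity)
            adj[entity].add(tweet_node)
    return adj


def entity_frequency(tweets):
    """Number of tweets in which each entity appears, with no graph at all."""
    counts = Counter()
    for tweet in tweets:
        counts.update(entities_of(tweet))
    return counts


adj = build_graph(tweets)
freq = entity_frequency(tweets)
for node in sorted(n for n in adj if n[0] != "tweet"):
    print(node, "degree:", len(adj[node]), "frequency:", freq[node])
```

Under this linking assumption the graph adds no information to the degree statistic beyond what a plain counter already provides, which is the sense in which the feature is reducible to frequency.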
We believe that modeling online discussions as relationships between locations, people, and topics has the potential
to reveal interesting properties of online discourse; but in
order to better exploit the interdependence of ties between
6 Related Work
Several works have leveraged the structure of microblog
discussions to discover important features, such as hyperlinks (Shamma, Kennedy, and Churchill 2010), hashtags (Romero, Meeder, and Kleinberg 2011), events (Lee,
Wakamiya, and Sumiya 2011), or tweets themselves
(De Choudhury, Counts, and Czerwinski 2011). Romero,
Meeder, and Kleinberg (2011) analyze the growth and persistence of hashtags in different topic categories, demonstrating that the emergence of a popular hashtag is highly
topic dependent. Lee, Wakamiya, and Sumiya (2011) leverage geocoded microtext to discover regional events as well
as to categorize events by geographic regions. In contrast,
our analysis may incorporate those discussions that are
not regional; however, the identification of region-specific
discussions may enable additionally discovering important
these entities, a structural analysis that moves beyond basic graph metrics is needed. For example, higher-order network features such as motifs (Milo et al. 2002) may provide evidence of repeated connectivity patterns to better discriminate between discussion types. Furthermore, different
methodologies for modeling discussions as networks, e.g.,
linking users together, may provide graphs that are more
amenable to discussion analysis.
Our immediate future work will focus on two areas. First,
we plan to use the presented network features to classify
the discussions of novel terms and discover discussions that
change type over time. For example, we might expect to observe a stabilization of the discussion involving collective
organization where participants begin to use a set of common hashtags and are active in similar locations.
Second, we plan to explore the use of the discussion network itself as a forensic tool for understanding the key elements in a discussion over time. Although the frequency of
a hashtag or location may suggest the most important feature at a moment, once classified, the type of discussion can
reveal which types of entities will generate sustained discussion in days to come.
We see the potential application of our microblog analysis toward the geographic tracking of public interest in a
topic. For example, the keyword “influenza” may appear to
characterize a local discussion geographically centralized in
a small area, but the most central location to the topic may
change over time. We would thus compare the apparent interest in influenza with the spread of the illness, and look
for correlation between immunizations and the importance
of the influenza discussion topic to a geographic region. We
are also interested in associating geographic patterns in topical
discussions with social and political movements, which can
help answer questions on how widespread and connected a
movement is.
Crane, R., and Sornette, D. 2008. Robust dynamic classes revealed
by measuring the response function of a social system. Proceedings
of the National Academy of Sciences 105(41):15649–15653.
De Choudhury, M.; Counts, S.; and Czerwinski, M. 2011. Identifying relevant social media content: leveraging information diversity
and user cognition. In Proceedings of the 22nd ACM conference
on Hypertext and hypermedia, 161–170. ACM.
Hao, Q.; Cai, R.; Wang, C.; Xiao, R.; Yang, J.; Pang, Y.; and Zhang,
L. 2010. Equip tourists with knowledge mined from travelogues.
In Proceedings of the 19th international conference on World wide
web, 401–410. ACM.
Honey, C., and Herring, S. 2009. Beyond microblogging: Conversation and collaboration via twitter. In System Sciences, 2009.
HICSS’09. 42nd Hawaii International Conference on, 1–10. IEEE.
Hong, L.; Ahmed, A.; Gurumurthy, S.; Smola, A.; and Tsioutsiouliklis, K. 2012. Discovering geographical topics in the twitter stream. In Proceedings of the 21st international conference on
World Wide Web, 769–778. ACM.
Huang, J.; Thornton, K.; and Efthimiadis, E. 2010. Conversational
tagging in twitter. In Proceedings of the 21st ACM conference on
Hypertext and hypermedia, 173–178. ACM.
Kumar, R.; Novak, J.; and Tomkins, A. 2010. Structure and evolution of online social networks. Link Mining: Models, Algorithms,
and Applications 337–357.
Lee, R.; Wakamiya, S.; and Sumiya, K. 2011. Discovery of unusual
regional social activities using geo-tagged microblogs. World Wide
Web 14(4):321–349.
Mei, Q.; Liu, C.; Su, H.; and Zhai, C. 2006. A probabilistic approach to spatiotemporal theme pattern mining on weblogs. In Proceedings of the 15th international conference on World Wide Web,
533–542. ACM.
Milo, R.; Shen-Orr, S.; Itzkovitz, S.; Kashtan, N.; Chklovskii, D.;
and Alon, U. 2002. Network motifs: simple building blocks of
complex networks. Science 298(5594):824.
Newman, M.; Watts, D.; and Strogatz, S. 2002. Random graph
models of social networks. Proceedings of the National Academy
of Sciences of the United States of America 99:2566–2572.
Preis, T.; Reith, D.; and Stanley, H. 2010. Complex dynamics of our economic life on different scales: insights from
search engine query data. Philosophical Transactions of the
Royal Society A: Mathematical, Physical and Engineering Sciences 368(1933):5707–5719.
Romero, D.; Meeder, B.; and Kleinberg, J. 2011. Differences in the
mechanics of information diffusion across topics: idioms, political
hashtags, and complex contagion on twitter. In Proceedings of the
20th international conference on World wide web, 695–704. ACM.
Ruiz, E.; Hristidis, V.; Castillo, C.; Gionis, A.; and Jaimes, A. 2012.
Correlating financial time series with micro-blogging activity. In
WSDM’12.
Shamma, D.; Kennedy, L.; and Churchill, E. 2010. Tweetgeist:
Can the twitter timeline reveal the structure of broadcast events? In
CSCW 2010.
Wang, C.; Wang, J.; Xie, X.; and Ma, W. 2007. Mining geographic
knowledge using location aware topic model. In Proceedings of the
4th ACM workshop on Geographical information retrieval, 65–70.
ACM.
Yin, Z.; Cao, L.; Han, J.; Zhai, C.; and Huang, T. 2011. Geographical topic discovery and comparison. In Proceedings of the 20th
international conference on World wide web, 247–256. ACM.
Acknowledgements
Supported by the Intelligence Advanced Research Projects
Activity (IARPA) via Department of Interior National Business Center (DoI / NBC) contract number D12PC00285.
The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding
any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should
not be interpreted as necessarily representing the official
policies or endorsements, either expressed or implied, of
IARPA, DoI/NBC, or the U.S. Government.
References
Boyd, D.; Golder, S.; and Lotan, G. 2010. Tweet, tweet, retweet:
Conversational aspects of retweeting on twitter. In System Sciences
(HICSS), 2010 43rd Hawaii International Conference on, 1–10.
IEEE.
Brin, S., and Page, L. 1998. The anatomy of a large-scale hypertextual Web search engine. Computer Networks 107–117.
Cheng, Z.; Caverlee, J.; and Lee, K. 2010. You are where you
tweet: a content-based approach to geo-locating twitter users. In
Proceedings of the 19th ACM international conference on Information and knowledge management, 759–768. ACM.