A Framework: Event Extraction from the Temporal Web
Yiyang Yang1 and Zhiguo Gong2
Faculty of Science and Technology
University of Macau
Macau
[email protected], [email protected],
Abstract—Temporal-based mining is an attractive
direction that has recently emerged from the Data Mining field. By taking the time factor into account, knowledge and interesting information, such as burst events and topic durations, can be mined from data collections which are organized according to their timestamps (durations). Treating the huge web as a temporal data collection, in this paper we introduce a framework based on our current work. The main task is to find the association between two topics in different time slots (durations). Given a keyword as the main topic, we expect to find three kinds of topics which are relevant to the main topic: periodical topics, non-periodical topics and burst topics. These three types of topics can satisfy the needs of users with different requirements.
I. INTRODUCTION
The value of knowledge is acknowledged by many researchers. Most of it is hidden in masses of data, and how to discover it is a critical issue. Traditionally, data pattern detection is performed on static data. Moreover, the data are grouped according to their contents, so technologies such as document clustering, classification and summarization are applied. In other words, from the researchers' point of view, such information and documents are static, and significant information stored in subsets of the data is normally ignored. In recent years, temporal-based mining has become a hot topic; many publications focus on this direction, and we introduce them in a later section.
In industry, time-based search is also an attractive topic. For instance, Google provides a service named Time Line, which offers users temporal information about a given keyword; users can view the “heat” of a topic in different time periods. Together with the Time Line function, the Wonder Wheel function will also be introduced in a later section. These two functions are critical to our method, because our approach absorbs their ideas and advantages, and attempts to integrate them into a more interesting task.
The structure of this paper is as follows. In Section II, we introduce the related work. In Section III, we give a brief introduction to our current work, the problem and its solution. In Section IV, we design a framework which extends our current work; its main task is to mine the hot topics specific to a keyword. Section V summarizes the paper.
II. RELATED WORKS
More and more researchers pay attention to data mining that takes the time factor into account. Generally speaking, the data are organized by time; instead of analyzing the data globally, the research focuses on the information hidden within a certain time slot, period or transaction. Note that most recent research concentrates on English event detection; little of it involves a Chinese-language environment.
In [1], M. Böttcher et al. proposed a new generation of data mining, Change Mining: data evolve in terms of changes, and when changes occur it is necessary to detect them. M. Böttcher et al. describe a four-step procedure to build a Change Mining method, including goal specification, time and change modeling, and a detecting mechanism. The authors attempt to integrate the advances of incremental mining, temporal mining and stream mining into Change Mining. J.F. Roddick et al. in [2] also discussed detecting changes in patterns. They pointed out that change mining is one type of higher order mining; more and more technologies such as trend analysis, classification and clustering are being integrated in order to provide users with better services, and knowledge discovery from a temporal perspective would also benefit from the integration.
In [3], G.P.C. Fung et al. studied a new problem, hot bursty event detection, with the source data restricted to English text streams. They found that document clustering has several problems, such as inaccuracy in event modeling and an inability to handle bursty occurrences. The authors believed that representing an event by several features would highlight the bursts of the events. The English text stream is organized along the time dimension, then the bursty features are identified; by grouping the selected features, the events in certain hot periods are finally extracted.
N. Parikh and N. Sundaresan in [4] introduced an approach to detect bursts from near real-time e-commerce queries (eBay). The burst extraction is incremental, and the wavelet transformation is used to preserve the amplitude, time and frequency information of non-stationary signals. In the algorithm, the queries are treated the same as a text stream, with the time variable as one of the main identifiers.
R.M. Nallapati et al. in [5] modeled topics over time with a newly designed method. The most important feature of this method is that it can handle topic evolution at multiple time scales, so the time granularity is not fixed to one or several constants. The whole period is represented by a binary tree: the root of the tree is the whole period, its two children cut its duration in half, and so on, so the time granularity becomes smaller as the nodes go deeper. Through this representation, the time variable is extremely scalable.
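The halving scheme described above can be illustrated with a short sketch. It is not the model of [5], only the interval tree that the description implies; the function name `split_period` and the use of integer time units (for example day indices) are assumptions made for illustration.

```python
def split_period(start, end, depth):
    """Return one list of (start, end) intervals per tree level.

    Level 0 holds the whole period; each deeper level cuts every
    interval of the previous level in half (end is exclusive).
    Time is assumed to be measured in integer units.
    """
    levels = [[(start, end)]]
    for _ in range(depth):
        next_level = []
        for s, e in levels[-1]:
            mid = s + (e - s) // 2
            next_level.extend([(s, mid), (mid, e)])
        levels.append(next_level)
    return levels


# Example: an 8-unit period split twice gives four 2-unit slots.
levels = split_period(0, 8, 2)
print(levels[2])
```

Each level of the returned list corresponds to one time granularity, so a caller can pick the depth that matches the scale of interest.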
In [7], the authors also studied feature detection for events. Two attributes are used to describe the event features: periodic or aperiodic, and frequent or infrequent. Thus four different classes of features are defined in total. In that paper, DFIDF is used to measure the feature frequency, and the discrete Fourier transformation is applied to decompose the feature trends so that the original time series can be represented as a linear combination of complex sinusoids.
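As a rough illustration of how a discrete Fourier transform exposes periodicity in a feature trend, the sketch below computes a naive DFT with the standard library and reports the period of the strongest non-constant component. The function names and the toy series are assumptions for illustration; this is not the procedure of the cited paper.

```python
import cmath


def dft(series):
    """Naive O(n^2) discrete Fourier transform of a real-valued series."""
    n = len(series)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(series))
        for k in range(n)
    ]


def dominant_period(series):
    """Period (in time slots) of the strongest frequency component.

    The mean is removed first so that the constant term does not dominate;
    only frequencies up to n // 2 are inspected because the spectrum of a
    real series is symmetric.
    """
    n = len(series)
    mean = sum(series) / n
    spectrum = dft([x - mean for x in series])
    best_k = max(range(1, n // 2 + 1), key=lambda k: abs(spectrum[k]))
    return n / best_k


# Toy trend that repeats every 4 slots.
print(dominant_period([1, 2, 1, 0] * 3))
```

A feature whose spectrum has no clearly dominant component would be a candidate for the aperiodic classes.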
G.P.C. Fung et al. in [8] designed a temporal-based hierarchical event detection method. They claimed that time is the main dimension of bursty event detection: the corresponding features should be extracted based on the burst time, and the documents which are highly relevant to the bursty features are then used to form the event hierarchy. Once the features are identified, each feature may serve as a query and be related to a group of documents, so the document groups may overlap. The authors evaluate the similarity between document groups and use it to represent the relation between the corresponding features. With this relationship among features, it is easy to construct the event hierarchy.
X. Wang and A. McCallum in [9] proposed the Topics over Time (TOT) model, which extends the Latent Dirichlet Allocation (LDA) model. This model is designed to handle the continuous nature of time: it avoids discretization by associating with each topic a continuous distribution over time. The authors report that TOT performs better than LDA.
To summarize, the event extraction approaches introduced above have a main drawback: following the time sequence, the events (knowledge) are extracted statically, so it is difficult to find events for a specific topic. Some users might be interested in locally relevant news or events; more specifically, when they want to retrieve the historical events that are restricted to a certain topic, the mentioned methods only provide information which is current and globally extracted. Addressing this is one of our contributions in this paper. For a given topic, we aim to find the strongly associated events and arrange them by their significant attribute: occurring time.
III. OUR APPROACH
The web pages available on the Internet are not regular in form; they are not well organized in terms of either document content or structure. For example, our web page crawler downloaded a mass of invalid web pages which do not contain the publication date (they simply set the related attribute to 0); it is difficult even for the document publisher to tell the actual information. One possible solution is to detect newly generated web pages periodically and record the corresponding temporal information. However, it is costly to build such a system; normally only professional companies (e.g. search engine service providers) can afford it. This leads to another solution: by setting the parameters of the query and utilizing the API (almost every search engine supports one), we crawl the links which the search engine returns as results. Unfortunately, the public APIs have too many limitations; we take the Google Search Engine API as an example. It has several restrictions:
• The number of returned results is limited to 30.
• The user can only set a temporal restriction like ‘Since Date 12/01/2009’; moreover, it always returns the most recent web pages because the corresponding scores are higher.
• The temporal information of the returned results is also encrypted; there is no open standard for users to access it.
A. Google Search Engine API
On 12th Aug. 2009, Google released its next generation search engine for open testing (Beta) [10]. Besides increased speed, accuracy and efficiency, it also brings an updated feature, temporal relevance, which provides users with temporal information about the web pages Google returns. The new version of Google Search gives us the opportunity to crawl time-organized web pages:
• Specify some topics of interest (keywords).
• Choose the appropriate time period and granularity (e.g. one year and one day respectively).
• For each time granularity unit in the chosen period, form the corresponding query and “ask” the Google API.
• For the returned web pages, fix their temporal information according to the chosen time granularity and discard related attributes such as Publish Date and Last Modified Date, because the values of these attributes are normally incorrect, especially for web pages on small web sites.
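The slot-by-slot querying described above can be sketched as follows. The sketch only enumerates the (keyword, time slot) pairs that would be sent to a search service; the descriptor fields `keyword`, `from` and `to` are hypothetical and do not correspond to any real Google API parameter, and no network call is made.

```python
from datetime import date, timedelta


def time_slots(start, end, step_days=1):
    """Yield (slot_start, slot_end) pairs covering [start, end); slot_end is exclusive."""
    step = timedelta(days=step_days)
    current = start
    while current < end:
        yield current, min(current + step, end)
        current += step


def build_queries(topics, start, end, step_days=1):
    """Build one hypothetical query descriptor per (topic, time slot) pair."""
    return [
        {"keyword": topic, "from": s.isoformat(), "to": e.isoformat()}
        for topic in topics
        for s, e in time_slots(start, end, step_days)
    ]


# Three daily slots for one topic.
for query in build_queries(["Macau"], date(2009, 1, 1), date(2009, 1, 4)):
    print(query)
```

Every page returned for a descriptor would then be stamped with that descriptor's slot, which matches the idea of overriding the unreliable date attributes of the page itself.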
B. Google Option
The Google Option added two new features on 13th May 2009: Wonder Wheel (神奇罗盘) and Time Line (时光隧道).
The following is the output for the keyword “澳门” (Macau) in Wonder Wheel:
Figure 1. The Wonder Wheel function with two keywords, “Macau” and “Macau Casinos”
From Figure 1, we can see that the Google Wonder Wheel is a graphical function which roughly demonstrates the relationship between two keywords, such as “澳门” (Macau) and “澳门赌场” (Macau Casinos) in our example. From another point of view, these two keywords are relevant because they frequently appear together. Based on static analysis of data such as the query log, Google found that users who are interested in the former keyword also show interest in the latter one, and vice versa; Google therefore defines these two keywords as associated. This new function suggests that mining the relation between two topics (keywords) will be one of the hottest research topics, and the trend of recent research supports this.
Figure 2. The Time Line view function with keyword “Macau” at the month level
This function provides a global view of the topic “heat”, which is simply represented by the query frequency.
Our work could be viewed as an enhanced version which integrates the Wonder Wheel and the Time Line. The output is not produced simply by combining the results of the two functions. In the Wonder Wheel, the result is normally a super-phrase of the keyword, more strictly one which embeds the keyword as a prefix; thus few results that are related only by context are selected. The Time Line considers only the frequency of the corresponding input; the information about co-occurrence between the keyword and the results is simply ignored.
IV. APPROACH AND FRAMEWORK
Our approach aims to find the temporal-based association between two different topics. At the current stage, our research focuses on the co-occurrence of two topics, and we also design a framework which satisfies our requirements. The notations used in this paper are listed in TABLE I.
TABLE I. NOTATIONS IN THIS PAPER
Symbol     Description
K          Topic Set
T          Time Slots Set
D          Documents Set
Vk         Words Set (Vocabulary) for topic k
tfkw(t)    Term frequency of the word w for topic k during the time slot t
Nk(t)      Number of documents for topic k during the time slot t
Nk         Number of documents for topic k during the whole time period
tfkw       Term frequency of the word w for topic k during the whole time period
G          The pre-defined time granularity set
Our framework operates according to the following procedure:
1) For each topic k in K and each time slot t in T, the framework forms a query which, in human language, means “I want to get a link list including all the web pages which are related to topic k and published at time t”.
2) According to the results returned by the search engine, the framework crawls all the web pages and organizes them along the time dimension. In other words, for each topic k and time slot t, the framework crawls the corresponding document set, which contains Nk(t) documents.
3) For each document in this set, the framework performs phrase extraction in order to extract the most valuable phrases; the output is Vk as well as tfkw(t).
4) For each word w in Vk, the tf trend curve can be drawn over time.
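Steps 2 to 4 of the procedure can be sketched as below for a single topic k. The sketch assumes that the documents have already been crawled, grouped by time slot, and reduced to lists of extracted phrases; the phrase extraction itself is not specified here, and the function names are hypothetical.

```python
from collections import Counter


def term_statistics(docs_by_slot):
    """Compute tf_kw(t), N_k(t) and V_k for one topic.

    docs_by_slot maps a time slot label to a list of documents, where each
    document is a list of already-extracted phrases (an assumption).
    """
    tf = {}
    n_docs = {}
    for slot, docs in docs_by_slot.items():
        counts = Counter()
        for phrases in docs:
            counts.update(phrases)
        tf[slot] = counts          # tf_kw(t) for every word w
        n_docs[slot] = len(docs)   # N_k(t)
    vocab = set()
    for counts in tf.values():
        vocab.update(counts)       # V_k
    return tf, n_docs, vocab


def tf_trend(tf, word, slots):
    """Term frequency of one word over an ordered list of slots (0 if absent)."""
    return [tf[slot][word] for slot in slots]


docs = {
    "2009-05": [["golden week", "casino"], ["casino"]],
    "2009-06": [["casino"]],
}
tf, n_docs, vocab = term_statistics(docs)
print(tf_trend(tf, "golden week", ["2009-05", "2009-06"]))
```

The list returned by `tf_trend` is the raw series from which the trend curve of step 4 would be drawn.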
A. Motivation Case
Suppose we are interested in the topic “澳门” (Macau). We give this keyword to the Google Search Engine and crawl all the returned web pages, and these documents are organized by time. By analyzing those web pages, we expect to find different words (topics) which are related to our topic of interest (the main topic). Based on our experimental analysis, these topics can be divided into three classes:
1. non-periodical topic
2. periodical topic
3. burst topic
For the main topic “澳门” (Macau), it is easy to find representative cases for these three topic classes. In the non-periodical class, “赌博” (Gambling) and “赌场” (Casino) are good examples. These topics (words) are not affected by the temporal factor; in other words, no matter the time slot, the co-occurrences of these topics with the main topic are similar. The non-periodical topics normally match the results provided by Google Wonder Wheel, because the latter generates its results by analyzing static and global data.
For the last two topic classes, it is difficult for Google Wonder Wheel to extract the related topics without considering freshness. In the periodical topic class, most topics may not be extracted by global analysis because the corresponding co-occurrence is low; by introducing the time factor, some temporal topics may be selected as hot topics in certain months. For instance, “回归” (Reunification) could be a hot topic in December, and “黄金周” (Golden Week) should be extracted in May and October.
The burst topic differs from the periodical topic: because the latter appears regularly, it can be analyzed on a subset of the data with an approach similar to that for the non-periodical topic. There has been plenty of research on bursty event detection in recent years, but little of it considers the association between two topics: it focuses only on detecting events globally, and the association between two events (topics in this paper) is not taken into account. For burst topic detection in our approach, the basic idea is simple: check the term frequency increment within a certain period. For example, “赌权开放” (liberalization of gaming concessions) was a burst topic in 2002; although it is already a historic topic, it can be detected by analyzing the web pages from 2002 which are relevant to “澳门” (Macau). In 2009, there would definitely be two burst events: “行政长官选举” (Chief Executive Election) and “横琴校区” (Hengqin Campus). These topics do not happen regularly, but they should be extracted by analyzing their relationship with the main topic.
B. Extract topic candidates
As mentioned in the previous section, the TF (term frequency) of the word w for topic k during the time slot t, tfkw(t), is available after processing the corresponding documents. A TFIDF-like formula which measures the association between word w and keyword k in time slot t is thus defined as:
$$Assoc_{kw}(t) = \frac{tf_{kw}(t)}{N_k(t)} \times \log\left(\frac{N_k}{tf_{kw}}\right) \tag{1}$$
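Formula (1) translates directly into code. The sketch below is a minimal reading of it; the natural logarithm and the zero return value for empty slots or unseen words are assumptions, since the text does not fix either.

```python
import math


def assoc(tf_slot, n_docs_slot, n_docs_total, tf_total):
    """Association of word w with keyword k in one time slot, per formula (1).

    tf_slot      -- tf_kw(t): term frequency of w in slot t
    n_docs_slot  -- N_k(t):   number of documents in slot t
    n_docs_total -- N_k:      number of documents over the whole period
    tf_total     -- tf_kw:    term frequency of w over the whole period
    """
    if n_docs_slot == 0 or tf_total == 0:
        return 0.0  # assumption: undefined cases are scored as zero
    return (tf_slot / n_docs_slot) * math.log(n_docs_total / tf_total)


# 10 occurrences in 5 documents of the slot; 50 occurrences in 100 documents overall.
print(assoc(10, 5, 100, 50))
```

Note that with this definition the logarithm becomes negative whenever the global term frequency of w exceeds the total number of documents, so very frequent words receive a negative score.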
Then the derivative of formula (1) is expressed as formula (2); its main purpose is to evaluate the emergence of a word w:
$$Assoc'_{kw}(t, \Delta t) = \frac{\left(\frac{tf_{kw}(t+\Delta t)}{N_k(t+\Delta t)} - \frac{tf_{kw}(t)}{N_k(t)}\right) \times \log\left(\frac{N_k}{tf_{kw}}\right)}{\Delta t} \tag{2}$$
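Formula (2) is a finite difference of formula (1) over one granularity step. A minimal sketch follows; the argument names, the natural logarithm, and the handling of empty slots are assumptions made for illustration.

```python
import math


def assoc_derivative(tf_next, n_next, tf_now, n_now, n_total, tf_total, delta_t=1.0):
    """Finite-difference change of the association over one step, per formula (2).

    tf_next / n_next -- tf_kw(t + dt) and N_k(t + dt)
    tf_now  / n_now  -- tf_kw(t) and N_k(t)
    n_total, tf_total -- N_k and tf_kw over the whole period
    delta_t -- size of the time granularity, in whatever unit the caller uses
    """
    if tf_total == 0:
        return 0.0  # assumption: a word never seen has no change
    rate_next = tf_next / n_next if n_next else 0.0
    rate_now = tf_now / n_now if n_now else 0.0
    return (rate_next - rate_now) * math.log(n_total / tf_total) / delta_t


# The per-document frequency rises from 1.0 to 3.0 within one step.
print(assoc_derivative(30, 10, 10, 10, 100, 50))
```

Because the logarithmic factor does not depend on t, the sign and size of the result are driven entirely by the change in per-document term frequency.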
Formula (1) is used to evaluate the association of a word (topic) with keyword k in time slot t. Note that the Document Frequency in TFIDF is normally the number of documents that contain the word w; in our work, it is replaced by the global term frequency. Formula (2) is used to detect periodical topics and burst topics: when the time granularity (Δt) is fixed, it is convenient to measure the instantaneous “heat” of a topic. To distinguish periodical topics from burst topics, note that the temporal factor does not play exactly the same role in the two kinds of relationship. For a periodical topic, the time granularity should be set appropriately, because the heat appears regularly. For an annual celebration topic, if the time granularity is too large (e.g. one year), the topic may be treated as non-periodical because it is “hot” in every year; the time granularity should therefore be relatively small in this case. A bursty topic would be detected by formula (2) whatever the size of the time granularity, because the temperatures of the topic in different time slots differ; detection is insensitive to the granularity size as long as the size is reasonable. Nevertheless, the time granularity is still a critical issue for burst topic detection: in general, the actual duration of a burst topic is short, even if we count the duration of the topic's influence as part of it. For example, the topic “澳门奥运纪念钞” (Commemorative Olympic Banknotes in Macau) lasts about one month, and the topic “奥运圣火传递” (Olympic Torch Relay) lasts for a couple of weeks. As a result, a large time granularity is inappropriate for burst topic detection because burst topics are inherently swift and compact.
We design the following procedure to select
different topic categories; it contains three basic steps:
1) Eliminate the insignificant and meaningless
topics; the remaining topics are selected as
popular topic candidates.
2) Separate the non-periodical topics from the popular topic candidates; the remaining topics are considered the union of periodical topics and burst topics.
3) Subdivide the result of step 2 into periodical topics and burst topics.
C. Eliminate Insignificant Topics
For any main topic k, the designed framework utilizes formula (1) as well as the value tfkw(t) to select the interesting associated topics. Several filters are set to eliminate the meaningless topics:
$$\begin{cases} tf_{kw}(t_i) \ge \theta_1, & t_i \in T \\ \sum_{i=1}^{T} tf_{kw}(t_i) \ge \theta_2, & t_i \in T \\ Assoc_{kw}(t_i) \ge \theta_3, & t_i \in T \end{cases}$$
Here θ1, θ2 and θ3 are three pre-defined thresholds. The first and second requirements are fundamental; they aim to eliminate topics which are relatively insignificant, as well as noise data. The third requirement guarantees that within the time slot ti, the word w is strongly associated with keyword k. By setting these three filters, the framework can find the popular topics which appear frequently with the main topic k. Some noisy data and out-of-date topics are “removed” in this step.
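The three filters can be sketched as a single predicate over the per-slot series of one word. The sketch reads the per-slot conditions as "satisfied in at least one time slot", which is an assumption, as are the parameter names used for the three thresholds.

```python
def is_popular_candidate(tf_series, assoc_series,
                         min_peak_tf, min_total_tf, min_peak_assoc):
    """Apply the three filters to one word w for a main topic k.

    tf_series    -- tf_kw(t_i) for every slot t_i in T
    assoc_series -- Assoc_kw(t_i) for every slot t_i in T
    The three min_* arguments stand for the three pre-defined thresholds.
    Per-slot conditions are taken to hold if some slot satisfies them
    (an assumption about the intended quantifier).
    """
    if not tf_series or not assoc_series:
        return False
    return (
        max(tf_series) >= min_peak_tf            # frequent enough in some slot
        and sum(tf_series) >= min_total_tf       # frequent enough overall
        and max(assoc_series) >= min_peak_assoc  # strongly associated in some slot
    )


print(is_popular_candidate([0, 5, 1], [0.0, 1.2, 0.1], 3, 5, 1.0))
print(is_popular_candidate([1, 1, 1], [0.1, 0.1, 0.1], 3, 5, 1.0))
```

Words for which the predicate is false are dropped before the non-periodical, periodical and burst categories are separated.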
D. Select non-periodical topics
After eliminating the insignificant topics, the remaining ones form the topic set which contains the three topic categories we are interested in. In this step, we attempt to separate the non-periodical topics from this set. The main reason is that non-periodical topics normally have high term frequencies which rarely fluctuate over time; their TF trends tend to be constant because they are not affected by the temporal factor. Intuitively, formula (2) can be used here to recognize the non-periodical topics: Δt takes several predefined constants (e.g. one day, one week, one month), and for each Δt we calculate the mean absolute change of the value:
$$C_{kw}(t, \Delta t) = \frac{\sum_{t=1}^{T} \left| Assoc'_{kw}(t, \Delta t) \right|}{T} \tag{3}$$
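Formula (3) is a plain mean of absolute values. A minimal sketch, assuming the per-slot values of formula (2) have already been computed for one fixed granularity and collected in a list (the function name is hypothetical):

```python
def mean_abs_change(derivatives):
    """Mean absolute change of the association, per formula (3).

    derivatives holds the value of formula (2) for every time slot at one
    fixed time granularity; an empty list is scored as 0.0 (an assumption).
    """
    if not derivatives:
        return 0.0
    return sum(abs(d) for d in derivatives) / len(derivatives)


# Rises and falls both count as change.
print(mean_abs_change([1.0, -3.0, 2.0]))
```

Taking absolute values matters here: a topic that rises and then falls by the same amount would otherwise average to zero and look constant.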
If Ckw(t, Δt) ≤ θ4, Δt ∈ G, where θ4 is a pre-defined threshold, we define word w as a non-periodical candidate; otherwise, word w is considered a periodical topic or a burst topic and enters the further selection. The basic idea is that when the average change value of word w is small, the word has constantly occurred with keyword k, since the insignificant words were already removed in Section C. The last step of non-periodical topic selection is to eliminate the common words. Common words are phrases which are too general to be hot topics; they have attributes similar to the non-periodical topics, namely (1) high term frequency and (2) insulation from the time factor, but few people are interested in them. An example of a common word for the keyword “澳门” (Macau) would be “行政特区” (Special Administrative Region): it always appears in the form Macau SAR, and no user will pay attention to the association between Macau and Macau SAR, or even to its change, because in most cases they describe the same concept. Common word elimination will be one of our future research topics; currently, there are two ways to eliminate common words:
1. By utilizing technologies such as Machine Learning, build a model which can learn common words incrementally. The common words can then be ignored in an earlier stage of the selection during data processing.
2. Analyze the user feedback in terms of log records, select the topics which users pay more attention to, and weaken the less popular ones.
E. Separate Periodical Topics from Burst Topics
Both burst topics and periodical topics occur suddenly, and the corresponding term frequencies change over time; the main difference between the two types is regularity. Periodical topics may be detected in different time periods. For example, the topic “黄金周” (Golden Week) may be detected about every 5-6 months because of the National Day holiday and International Labor Day. Following this idea, if the same topic can be detected in different time slots, we define it as periodical. A burst event has no regular relationship with the time factor, but its temporal change varies significantly. The main objective of this step is to select the burst events and leave the remaining topics as periodical topics. In order to evaluate the irregularity of a word (topic) w, we use Ckw(t, Δt) ≥ θ5, Δt ∈ G, with θ5 > θ4, where θ4 and θ5 are pre-defined thresholds, to select the burst events. If the value is larger than θ5, word w occurs irregularly during the whole time period, which leads the change value to be extremely high. As a result, the remaining words are relevant to the periodical events because they are more regular than burst events.
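The two threshold tests on the mean change value can be combined into one classification step, sketched below. The parameter names `low` and `high` stand for the lower and upper pre-defined thresholds; requiring the lower test to hold for every granularity and the upper test for at least one granularity is an assumption, since the text does not state the quantifier explicitly.

```python
def classify_topic(change_by_granularity, low, high):
    """Assign a popular topic candidate to one of the three categories.

    change_by_granularity maps each granularity in G (e.g. "day", "week")
    to the mean change value of the word at that granularity.
    """
    if high <= low:
        raise ValueError("the upper threshold must exceed the lower one")
    values = list(change_by_granularity.values())
    if all(c <= low for c in values):    # stable at every granularity
        return "non-periodical"
    if any(c >= high for c in values):   # extreme change at some granularity
        return "burst"
    return "periodical"                  # changes, but not extremely


print(classify_topic({"day": 0.1, "week": 0.05}, low=0.2, high=1.0))
print(classify_topic({"day": 2.5, "week": 0.3}, low=0.2, high=1.0))
print(classify_topic({"day": 0.5, "week": 0.4}, low=0.2, high=1.0))
```

Common word elimination is left out of this sketch; it would be applied to the words labeled non-periodical.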
V. CONCLUSION AND FUTURE WORK
In this paper, we describe a framework to detect the popular relevant topics specific to a main topic (keyword) in certain periods. Three different kinds of relevant topics can be selected by our work: non-periodical topics, periodical topics and burst topics. By considering the power of time, it is possible to extract different topics relevant to a specific keyword at different times. Many information systems could benefit from our framework, such as SQL extensions, query suggestion and so on. Based on the extracted topics and temporal trend patterns, it is possible to predict the occurrences of some popular topics or the future duration of a current topic. For the user, our design could provide powerful and convenient functions, for example the integration of the Google Wonder Wheel and the Time Line: given a keyword, the system could show the user the relevant hot topics of certain periods.
As mentioned in the previous section, high-frequency common word elimination and temporal pattern mining will be our future research. Based on user interaction, we could build a machine learning system which helps to recognize less significant topics. For temporal pattern mining, it is necessary to construct a mechanism which can seamlessly switch among different time granularities; as a result, the framework would be more flexible in mining temporal patterns of different sizes.
REFERENCES
[1] Mirko Böttcher, Frank Höppner and Myra Spiliopoulou, “On
exploiting the power of time in data mining”, ACM SIGKDD
Explorations Newsletter, New York, NY, USA, vol. 10, pp. 3-11,
December, 2008
[2] John F. Roddick, Myra Spiliopoulou, Daniel Lister and Aaron
Ceglar, “Higher order mining”, ACM SIGKDD Explorations
Newsletter, New York, NY, USA, vol. 10, pp. 5-17, June, 2008
[3] Gabriel Pui Cheong Fung, Jeffrey Xu Yu, Philip S. Yu and
Hongjun Lu, “Parameter free bursty events detection in text streams”,
Proceedings of the 31st international conference on Very large data
bases, Trondheim, Norway, pp. 181-192, 2005
[4] Nish Parikh and Neel Sundaresan, “Scalable and near real-time
burst detection from eCommerce queries”, Proceedings of the 14th
ACM SIGKDD international conference on Knowledge discovery
and data mining, Las Vegas, Nevada, USA, pp. 972-980, 2008
[5] Ramesh M. Nallapati, Susan Ditmore, John D. Lafferty and Kin
Ung, “Multiscale topic tomography”, Proceedings of the 13th ACM
SIGKDD international conference on Knowledge discovery and data
mining, San Jose, California, USA, pp. 520-529, 2007
[6] Qi He, Kuiyu Chang and Ee-Peng Lim, “Analyzing feature
trajectories for event detection”, Proceedings of the 30th annual
international ACM SIGIR conference on Research and development
in information retrieval, Amsterdam, The Netherlands, pp. 207-214,
2007
[7] Xuanhui Wang, ChengXiang Zhai, Xiao Hu and Richard Sproat,
“Mining correlated bursty topic patterns from coordinated text
streams”, Proceedings of the 13th ACM SIGKDD international
conference on Knowledge discovery and data mining, San Jose,
California, USA, pp. 784-793, 2007
[8] Gabriel Pui Cheong Fung, Jeffrey Xu Yu, Huan Liu and Philip S.
Yu, “Time-dependent event hierarchy construction”, Proceedings of
the 13th ACM SIGKDD international conference on Knowledge
discovery and data mining, San Jose, California, USA, pp. 300-309,
2007
[9] Xuerui Wang and Andrew McCallum, “Topics over time: a non-Markov
continuous-time model of topical trends”, Proceedings of
the 12th ACM SIGKDD international conference on Knowledge
discovery and data mining, Philadelphia, PA, USA, 2006
[10] Google Caffeine, http://www2.sandbox.google.com/