Download 7630_0_report 20100601 draft

A Framework: Event Extraction from the Temporal Web

Yiyang Yang1 and Zhiguo Gong2
Faculty of Science and Technology, University of Macau, Macau
[email protected], [email protected]

Abstract: Temporal mining is an attractive direction that has recently emerged from the data mining field. By taking the time factor into account, knowledge such as burst events and topic durations can be mined from data collections that are organized by timestamp. Treating the web as a huge temporal data collection, this paper introduces a framework based on our current work. The main task is to find the association between two topics in different time slots. Given a keyword as the main topic, we expect to find three kinds of topics relevant to it: periodical topics, non-periodical topics and burst topics. These three types serve users with different requirements.

I. INTRODUCTION

The value of knowledge is acknowledged by many researchers. Most of it is hidden in masses of data, and how to discover it is a critical issue. Traditionally, pattern detection is performed on static data. The data are grouped according to their content, using technologies such as document clustering, classification and summarization. In other words, researchers treat the documents as static, so significant information stored in temporal subsets of the data is normally ignored.

In recent years, temporal mining has become a hot topic, and many publications focus on this direction; we introduce them in Section II. In industry, time-based search is also an attractive topic. For instance, Google provides a service named Time Line, which gives users temporal information about a given keyword: the user can view the “heat” of a topic in different time periods.
Together with Time Line, the Wonder Wheel function is introduced in Section III. These two functions are critical to our method, because our approach absorbs their ideas and strengths and attempts to integrate them into a more interesting task.

The rest of this paper is structured as follows. Section II introduces related work. Section III briefly introduces our current work, the problem and its solution. Section IV designs a framework that extends our current work; its main task is to mine the hot topics specific to a keyword. Section V summarizes the paper.

II. RELATED WORKS

More and more researchers pay attention to data mining that takes the time factor into account. Generally speaking, the data are organized by time, and instead of analyzing the data globally, the research focuses on the information hidden within a certain time slot, period or transaction. Note that most recent research concentrates on English event detection; little of it involves the Chinese-language environment.

In [1], M. Böttcher et al. propose a new generation of data mining, Change Mining: data evolve, and when changes occur it is necessary to detect them. The authors describe a four-step procedure for building a Change Mining method, including goal specification, time and change modeling, and a detection mechanism. They attempt to integrate the strengths of incremental mining, temporal mining and stream mining into Change Mining.

J.F. Roddick et al. in [2] also discuss detecting changes in patterns. They point out that change mining is one type of higher order mining. More and more technologies, such as trend analysis, classification and clustering, are being integrated in order to provide users with better services, and knowledge discovery from a temporal perspective would also benefit from this integration.

In [3], G.P.C. Fung et al. study a new problem, hot bursty event detection, with the source data restricted to English text streams. They find that document clustering has several problems, such as inaccurate event modeling and an inability to handle bursty occurrences. The authors believe that representing an event by several features captures its burstiness better. The English text stream is organized along the time dimension, the bursty features are identified, and by grouping the selected features the events in certain hot periods are finally extracted.

N. Parikh and N. Sundaresan in [4] introduce an approach to detect bursts in near real-time e-commerce queries (eBay). The burst extraction is incremental, and the wavelet transformation is used to preserve the amplitude, time and frequency information of non-stationary signals. In the algorithm, the queries are treated in the same way as a text stream, with the time variable as one of the main identifiers.

R.M. Nallapati et al. in [5] model topics over time with a newly designed method. Its most important feature is that it can handle topic evolution at multiple time scales, so the time granularity is not fixed to one or several constants. The whole period is represented by a binary tree: the root is the whole period, its two children cut that duration in half, and so on, so that the time granularity becomes finer as the nodes go deeper. Through this representation, the time variable is highly scalable.

In [6], the authors also study event feature detection. Two properties are used to describe a feature: periodic versus aperiodic, and frequent versus infrequent. In total, four different classes of features are thus defined.
In that work, DFIDF is used to measure the feature frequency, and the discrete Fourier transformation is applied to decompose the feature trends, so that the original time series can be represented as a linear combination of complex sinusoids.

G.P.C. Fung et al. in [8] design a temporal hierarchical event detection method. They claim that time is the main dimension of bursty event detection: the corresponding features should be extracted based on the burst time, and the documents that are highly relevant to the bursty features are then used to form the event hierarchy. Once the features are identified, each feature may serve as a query and be related to a group of documents, so the document groups may overlap. The authors evaluate the similarity between document groups and use it to represent the relation between the corresponding features. With this relationship among features, the event hierarchy is easy to construct.

X. Wang and A. McCallum in [9] propose the Topics over Time (TOT) model, which extends the Latent Dirichlet Allocation (LDA) model in order to handle the continuous nature of time. The model avoids discretization by associating with each topic a continuous distribution over time. The authors report that TOT performs better than LDA.

To summarize, the event extraction approaches introduced above share a main drawback: the events (knowledge) are extracted globally along the time sequence, so it is difficult to find the events related to a specific topic. Users may be interested in locally relevant news or events; more specifically, they may want to retrieve the historical events restricted to a certain topic, whereas the methods above only provide information that is current and globally extracted. Addressing this gap is one of the contributions of this paper.
For a given topic, we aim to find the events strongly associated with it and to arrange them by their most significant attribute: the time of occurrence.

III. OUR APPROACH

The web pages available on the internet are not regular; they are not well organized in terms of either content or structure. For example, our web page crawler downloaded a large number of malformed web pages that do not contain published date information (they simply set the related attribute to 0), and it is difficult even for the document publisher to tell the actual date. One possible solution is to detect newly generated web pages periodically and record the corresponding temporal information. However, such a system is costly to build, and normally only professional companies (e.g. search engine service providers) can afford it. This leads to another solution: by setting the parameters of the query and using the API that almost every search engine supports, we crawl the links returned by the search engine as results. Unfortunately, the public APIs have many limitations. Taking the Google Search Engine API as an example, it has several restrictions:

• The number of returned results is limited to 30.
• The user can only set a temporal restriction such as ‘Since Date 12/01/2009’; moreover, the API always returns the most recent web pages because their scores are higher.
• The temporal information of the returned results is encrypted, and there is no open standard for the user to access it.

A. Google Search Engine API

On 12th Aug. 2009, Google released its next generation search engine for open testing (Beta) [10]. Besides increased speed, accuracy and efficiency, it also brings an updated feature, temporal relevance, which gives the user temporal information about the web pages Google returns. The new version of Google Search gives us the opportunity to crawl time-coordinated web pages as follows:
• Specify the topics (keywords) of interest.
• Choose an appropriate time period and granularity (e.g. one year and one day, respectively).
• For each time slot of the chosen granularity within the period, form the corresponding query and submit it to the Google API.
• For each returned web page, set its temporal information according to the chosen time slot and discard its own date attributes, such as Publish Date and Last Modified Date, because their values are often incorrect, especially for pages on small web sites.

B. Google Option

On 13th May 2009, Google Option added two new features: Wonder Wheel (神奇罗盘) and Time Line (时光隧道). Figure 1 shows the Wonder Wheel output for the keyword “澳门” (Macau).

Figure 1. The Wonder Wheel function with the two keywords “Macau” and “Macau Casinos”.

From Figure 1, we can see that the Google Wonder Wheel is a graphical function that roughly demonstrates the relationship between two keywords, here “澳门” (Macau) and “澳门赌场” (Macau Casinos). From another point of view, these two keywords are relevant because they frequently appear together. Based on static analysis of data such as the query log, Google found that users who are interested in the former keyword also show interest in the latter, and vice versa; Google therefore defines the two keywords as associated. This new function suggests that mining the relation between two topics (keywords) will be one of the hottest research topics, and the trend of recent research confirms it.

Figure 2. The Time Line view for the keyword “Macau” at month level.

The Time Line function provides a global view of the “heat” of a topic, which is simply represented by the query frequency.

Our work can be viewed as an enhanced version that integrates Wonder Wheel and Time Line. The output is not produced simply by combining the results of the two functions. In Wonder Wheel, the result is normally a super-phrase of the keyword; more strictly, it embeds the keyword as a prefix, so few results that are related only through context are selected. Time Line considers only the frequency of the corresponding input and simply ignores the co-occurrence between the keyword and the results.

IV. APPROACH AND FRAMEWORK

Our approach aims to find the temporal association between two different topics. At the current stage, our research focuses on the co-occurrence of two topics, and we design a framework that satisfies our requirements. The notations used in this paper are listed in TABLE I.

TABLE I. NOTATIONS IN THIS PAPER

Symbol      Description
K           Topic set
T           Time slot set
D           Document set
V_k         Word set (vocabulary) for topic k
tf_kw(t)    Term frequency of the word w for topic k during the time slot t
N_k(t)      Number of documents for topic k during the time slot t
N_k         Number of documents for topic k during the whole time period
tf_kw       Term frequency of the word w for topic k during the whole time period
G           The pre-defined time granularity set

Our framework operates according to the following procedure:

1) For each topic k in K and each time slot t in T, the framework forms a query. In human language, its meaning is “I want to get a list of links to all the web pages that are related to topic k and were published at time t”.
2) According to the results returned by the search engine, the framework crawls all the web pages and organizes them along the time dimension. In other words, for each topic k and time slot t, the framework crawls the corresponding document set, whose size is N_k(t).
3) For each document in that set, the framework performs phrase extraction in order to extract the most valuable phrases; the output is V_k as well as tf_kw(t).
4) For each word w in V_k, the tf trend curve can be drawn over time.

A. Motivation Case

Suppose we are interested in the topic “澳门” (Macau). We give this keyword to the Google Search Engine and crawl all the returned web pages, which are organized by time. By analyzing those web pages, we expect to find different words (topics) that are related to our topic of interest (the main topic). Based on our experimental analysis, these topics can be divided into three classes:

1. non-periodical topic
2. periodical topic
3. burst topic

For the main topic “澳门” (Macau), it is easy to find representative cases for these three topic classes. In the non-periodical class, “赌博” (Gambling) and “赌场” (Casino) are good examples. These topics (words) are not affected by the temporal factor; in other words, their co-occurrence with the main topic is similar in every time slot. The non-periodical topics normally match the results provided by Google Wonder Wheel, because the latter generates its results by analyzing static and global data.

For the last two topic classes, it is difficult for Google Wonder Wheel to extract the related topics without considering freshness. Most periodical topics may not be extracted by global analysis because the corresponding co-occurrence is low; once the time factor is introduced, some temporal topics may be selected as hot topics in certain months. For instance, “回归” (Reunification) could be a hot topic in December, and “黄金周” (Golden Week) should be extracted in May and October. The burst topic differs from the periodical topic: because the latter appears regularly, it can be analyzed on a subset of the data with an approach similar to that used for non-periodical topics.
There has been plenty of research on bursty event detection in recent years, but little of it considers the association between two topics: it focuses on detecting events globally, and the association between two events (topics in this paper) is not taken into account. For burst topic detection in our approach, the basic idea is simple: check the term frequency increment within a certain period. For example, “赌权开放” (the liberalization of gaming licenses) was a burst topic in 2002. Although it is now a historic topic, it can be detected by analyzing the web pages relevant to “澳门” (Macau) in 2002. In 2009, there would definitely be two burst events: “行政长官选举” (Chief Executive Election) and “横琴校区” (Hengqin Campus). These topics do not happen regularly, but they should be extracted by analyzing their relationship with the main topic.

B. Extract topic candidates

As mentioned in the previous section, the term frequency (TF) of the word w for topic k during the time slot t, tf_kw(t), is available after processing the corresponding documents. A TFIDF-like formula that measures the association between word w and keyword k in time slot t is therefore defined as:

$$ Assoc_{kw}(t) = \frac{tf_{kw}(t)}{N_k(t)} \times \log\left(\frac{N_k}{tf_{kw}}\right) \tag{1} $$

The derivative of formula (1) is expressed as formula (2); its main purpose is to evaluate the emergence of a word w over time:

$$ Assoc'_{kw}(t, \Delta t) = \frac{1}{\Delta t}\left(\frac{tf_{kw}(t+\Delta t)}{N_k(t+\Delta t)} - \frac{tf_{kw}(t)}{N_k(t)}\right) \times \log\left(\frac{N_k}{tf_{kw}}\right) \tag{2} $$

Formula (1) evaluates the association of a word (topic) with keyword k in time slot t. Note that the document frequency in TFIDF is normally the number of documents that contain the word w; in our work, it is replaced by the global term frequency. Formula (2) is used to detect periodical topics and burst topics: when the time granularity Δt is fixed, it is convenient to measure the instantaneous “heat” of a topic.
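As a minimal sketch of how formulas (1) and (2) could be evaluated, assume the term frequencies and document counts are stored as per-slot lists and that Δt is measured in whole time slots. The function names `assoc` and `assoc_change`, the zero-count guards and the toy numbers are illustrative assumptions, not part of the framework described in this paper.

```python
import math


def assoc(tf_slot, n_docs_slot, n_docs_total, tf_total):
    """Formula (1): slot-normalised term frequency times log(N_k / tf_kw).

    Returning 0.0 for empty slots or unseen words is an assumption of this
    sketch; the paper does not say how such cases are handled.
    """
    if n_docs_slot == 0 or tf_total == 0:
        return 0.0
    return (tf_slot / n_docs_slot) * math.log(n_docs_total / tf_total)


def assoc_change(tf, n_docs, t, dt, n_docs_total, tf_total):
    """Formula (2): change of the normalised tf between slots t and t + dt,
    divided by dt and weighted by the same log(N_k / tf_kw) factor.

    tf and n_docs are per-slot lists; dt is a positive number of slots.
    """
    if tf_total == 0:
        return 0.0

    def rate(i):
        # tf_kw(i) / N_k(i), with empty slots treated as 0 (assumption)
        return tf[i] / n_docs[i] if n_docs[i] else 0.0

    return (rate(t + dt) - rate(t)) / dt * math.log(n_docs_total / tf_total)


# Toy data (hypothetical): one word over four slots, with a spike in slot 2.
tf = [2, 2, 10, 2]
n_docs = [100, 100, 100, 100]
N_k, tf_kw = sum(n_docs), sum(tf)

spike_assoc = assoc(tf[2], n_docs[2], N_k, tf_kw)
spike_rise = assoc_change(tf, n_docs, 1, 1, N_k, tf_kw)
```

With these toy values, the association in slot 2 is larger than in the flat slots, and the change from slot 1 to slot 2 is positive, which is the kind of signal the framework looks for when detecting periodical and burst topics.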
To distinguish periodical topics from burst topics, note that the temporal factor does not play exactly the same role in the two kinds of relationship. For a periodical topic, the time granularity should be set appropriately, because the heat appears regularly. For an annual celebration topic, if the time granularity is too large (e.g. one year), the topic may be treated as non-periodical because it is “hot” in every year; the time granularity should therefore be relatively small in this case. A burst topic is detected by formula (2) whatever the size of the time granularity, because the temperature of the topic differs between time slots; detection is insensitive to the granularity as long as its size is reasonable. Nevertheless, the time granularity is still a critical issue for burst topic detection. In general, the actual duration of a burst topic is short, even if we count the period of its influence as part of it. For example, the topic “澳门奥运纪念钞” (Commemorative Olympic Banknotes in Macau) lasts about one month, and the topic “奥运圣火传递” (Olympic Torch Relay) lasts for a couple of weeks. As a result, a large time granularity is inappropriate for burst topic detection, because burst topics are inherently swift and compact.

We design the following procedure to select the different topic categories; it contains three basic steps:

1) Eliminate the insignificant and meaningless topics; the remaining topics are selected as popular topic candidates.
2) Separate the non-periodical topics from the popular topic candidates; the remaining topics are the union of the periodical topics and the burst topics.
3) Subdivide the result of step 2 into periodical topics and burst topics.

C. Eliminate Insignificant Topics

For any main topic k, the designed framework uses formula (1) as well as the value tf_kw(t) to select the interesting associated topics.
Several filters are set to eliminate the meaningless topics:

$$ tf_{kw}(t_i) > \varepsilon_1, \quad t_i \in T $$

$$ \sum_{i=1}^{|T|} tf_{kw}(t_i) > \varepsilon_2, \quad t_i \in T $$

$$ Assoc_{kw}(t_i) > \varepsilon_3, \quad t_i \in T $$

Here ε1, ε2 and ε3 are three pre-defined thresholds. The first and second requirements are fundamental; they aim to eliminate topics that are relatively insignificant as well as noisy data. The third requirement guarantees that, within the time slot t_i, the word w is strongly associated with keyword k. By setting these three filters, the framework finds the popular topics that appear frequently with the main topic k. Noisy data and out-of-date topics are removed in this step.

D. Select non-periodical topics

After the insignificant topics are eliminated, the remaining topic set contains the three topic categories we are interested in. In this step, we attempt to separate the non-periodical topics from the topic set. The main reason is that non-periodical topics normally have high term frequencies that rarely fluctuate over time: their TF trend tends to be constant because they are not affected by the temporal factor. Intuitively, formula (2) can be used to recognize the non-periodical topics. Δt takes several predefined constant values (e.g. one day, one week, one month), and for each Δt we calculate the mean absolute change:

$$ C_{kw}(t, \Delta t) = \frac{1}{|T|} \sum_{t=1}^{|T|} \left| Assoc'_{kw}(t, \Delta t) \right| \tag{3} $$

If C_kw(t, Δt) < ε4 for every Δt in G, we define word w as a non-periodical candidate; otherwise, word w is considered a periodical topic or a burst topic and enters the further selection. The basic idea is that when the average change of word w is small, the word occurs constantly with the keyword k, since the insignificant words have already been removed in Section C.

The last step of non-periodical topic selection is to eliminate the common words.
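The three filters of Section C and the non-periodical test of formula (3) can be sketched as follows. This is only an illustration under stated assumptions: the original text does not make the quantifier over t_i explicit, so the sketch requires the first and third filters to hold in at least one slot; it averages formula (2) over the slots for which t + Δt exists; and the names `passes_filters`, `mean_abs_change` and `is_non_periodical`, together with all thresholds and data, are hypothetical.

```python
def passes_filters(tf_series, assoc_series, eps1, eps2, eps3):
    """Section C filters for one word w (assumed reading: 'in some slot')."""
    return (
        max(tf_series) > eps1          # tf_kw(t_i) > eps1 in at least one slot
        and sum(tf_series) > eps2      # total term frequency over all slots
        and max(assoc_series) > eps3   # Assoc_kw(t_i) > eps3 in at least one slot
    )


def mean_abs_change(rate_series, log_weight, dt):
    """Formula (3) for one granularity dt, measured in slots.

    rate_series[t] is tf_kw(t) / N_k(t); log_weight is log(N_k / tf_kw).
    Only slots with a valid t + dt are averaged (assumption of this sketch).
    """
    changes = [
        abs((rate_series[t + dt] - rate_series[t]) / dt * log_weight)
        for t in range(len(rate_series) - dt)
    ]
    return sum(changes) / len(changes) if changes else 0.0


def is_non_periodical(rate_series, log_weight, granularities, eps4):
    """A word is a non-periodical candidate if the mean change stays below
    eps4 for every granularity in G."""
    return all(
        mean_abs_change(rate_series, log_weight, dt) < eps4
        for dt in granularities
    )


# Toy data (hypothetical): a flat trend and a trend with a single spike.
flat = [0.05] * 6
spiky = [0.01, 0.01, 0.5, 0.01, 0.01, 0.01]

flat_is_stable = is_non_periodical(flat, 1.0, [1, 2], eps4=0.05)
spiky_is_stable = is_non_periodical(spiky, 1.0, [1, 2], eps4=0.05)
```

In this toy run the flat series is accepted as a non-periodical candidate, while the spiky series is passed on to the later periodical versus burst separation.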
Common words are phrases that are too general to be hot topics. They have attributes similar to those of non-periodical topics, (1) high term frequency and (2) insulation from the time factor, but few people are interested in them. An example of a common word for the keyword “澳门” (Macau) is “行政特区” (Special Administrative Region), which always appears in the form Macau SAR. No user will pay attention to the association between Macau and Macau SAR, or even to its change, because in most cases they describe the same concept. Common word elimination is part of our future research; currently, there are two possible ways to eliminate them:

1. Use technologies such as machine learning to build a model that learns common words incrementally, so that the common words can be ignored at an earlier stage of the selection.
2. Analyze user feedback in the form of log records, select the topics that users pay more attention to, and down-weight the less popular ones.

E. Separate Periodical Topics from Burst Topics

Both burst topics and periodical topics occur suddenly, and their term frequencies change over time; the main difference between the two types is regularity. A periodical topic may be detected in different time periods. For example, the topic “黄金周” (Golden Week) may be detected about every 5-6 months because of the National Day holiday and International Labor Day. Following this idea, if the same topic can be detected in different time slots, we define it as periodical. A burst event has no regular relationship with the time factor; however, its temporal change varies significantly. The main objective of this step is to select the burst events and leave the remaining topics as periodical topics.
In order to evaluate the irregularity of a word (topic) w, we use the following condition to select the burst events:

$$ C_{kw}(t, \Delta t) > \varepsilon_5, \quad \Delta t \in G, \quad \varepsilon_5 > \varepsilon_4 $$

If the value is larger than ε5, word w occurs irregularly during the whole time period, which drives the change value extremely high. As a result, the remaining words are those relevant to periodical events, because they are more regular than burst events.

V. CONCLUSION AND FUTURE WORK

In this paper, we describe a framework to detect the popular topics relevant to a main topic (keyword) in certain periods. Three different kinds of relevant topics can be selected by our work: non-periodical topics, periodical topics and burst topics. By considering the power of time, it is possible to extract different topics relevant to a specific keyword at different times. Many information systems could benefit from our framework, such as SQL extensions and query suggestion. Based on the extracted topics and temporal trend patterns, it is possible to predict the occurrence of some popular topics or the future duration of a current topic. For the user, our design could provide powerful and convenient functions, for example an integration of Google Wonder Wheel and Time Line: given a keyword, the system could show the user the relevant hot topics of certain periods.

As mentioned in the previous section, high-frequency common word elimination and temporal pattern mining will be our future research. Based on user interaction, we could build a machine learning system that helps to recognize less significant topics. For temporal pattern mining, it is necessary to construct a mechanism that can switch seamlessly among different time granularities, so that the framework is more flexible in mining temporal patterns of different sizes.
REFERENCES

[1] Mirko Böttcher, Frank Höppner and Myra Spiliopoulou, “On exploiting the power of time in data mining”, ACM SIGKDD Explorations Newsletter, New York, NY, USA, vol. 10, pp. 3-11, December 2008.
[2] John F. Roddick, Myra Spiliopoulou, Daniel Lister and Aaron Ceglar, “Higher order mining”, ACM SIGKDD Explorations Newsletter, New York, NY, USA, vol. 10, pp. 5-17, June 2008.
[3] Gabriel Pui Cheong Fung, Jeffrey Xu Yu, Philip S. Yu and Hongjun Lu, “Parameter free bursty events detection in text streams”, Proceedings of the 31st International Conference on Very Large Data Bases, Trondheim, Norway, pp. 181-192, 2005.
[4] Nish Parikh and Neel Sundaresan, “Scalable and near real-time burst detection from eCommerce queries”, Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, pp. 972-980, 2008.
[5] Ramesh M. Nallapati, Susan Ditmore, John D. Lafferty and Kin Ung, “Multiscale topic tomography”, Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, California, USA, pp. 520-529, 2007.
[6] Qi He, Kuiyu Chang and Ee-Peng Lim, “Analyzing feature trajectories for event detection”, Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, The Netherlands, pp. 207-214, 2007.
[7] Xuanhui Wang, ChengXiang Zhai, Xiao Hu and Richard Sproat, “Mining correlated bursty topic patterns from coordinated text streams”, Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, California, USA, pp. 784-793, 2007.
[8] Gabriel Pui Cheong Fung, Jeffrey Xu Yu, Huan Liu and Philip S. Yu, “Time-dependent event hierarchy construction”, Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, California, USA, pp. 300-309, 2007.
[9] Xuerui Wang and Andrew McCallum, “Topics over time: a non-Markov continuous-time model of topical trends”, Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, 2006.
[10] Google Caffeine, http://www2.sandbox.google.com/
