Survey
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
International Journal of Applied Environmental Sciences ISSN 0973-6077 Volume 11, Number 1 (2016), pp. 259-266 © Research India Publications http://www.ripublication.com A Comprehensive Survey of Approaches Used For Detecting Events in Twitter 1 Jaya Sree R.V., 2 Dhanalakshmi G., 3 Poonkuzhali S.M., 4 Sharmila Banu A. 1 Assistant Professor, Department of Information Technology, Panimalar Institute of Technology, Varadharajapuram, Poonamalle, Chennai, India. E-mail: [email protected] 2 Assistant Professor (Grade-1), Department of Information Technology, Panimalar Institute of Technology, Varadharajapuram, Poonamalle, Chennai, India. E-mail:[email protected] 3 Assistant Professor, Department of Computer Science and Engineering, Panimalar Institute of Technology, Varadharajapuram, Poonamalle, Chennai, India. E-mail: [email protected] 4 Assistant Professor, Department of Information Technology, Panimalar Institute of Technology, Varadharajapuram, Poonamalle, Chennai, India. E-mail: [email protected] Abstract Micro bloggers are web applications which acts as a broad cast medium in which peoples are allowed to share their statuses, information, links, images , videos and opinions in short messages. They offer a light weight, easy and fastest way of communication among them. Some of the prevalent micro blogging services are twitter, face book, Google + etc. The information posted on twitter is called tweets. These messages provide the information that varies from daily life time events to latest worldwide news and events. Analyzing such rich source of user generated data can yield unprecedentedly valuable information. Mining such valuable information helps to identify the events that occurred over space and time. Event detection from twitter data has many new challenges when compared to event detection from traditional media. This paper provides a survey of various techniques used for detecting events from twitter. Keywords: Micro blogging, event detection, event identification, analyzing social media, twitter data streams Introduction Social networking sites also called as micro bloggers are the fastest emerging in recent years. It has become a new type of real time communication channel. They 260 Jaya Sree R.V. et al are very popular because of its characteristics like immediacy, portability and easiness of use. A person around the world uses such media to communicate instantly about the real time happenings and to express their opinion on a common topic publicly through their computers or mobile phones [14]. This network allows each user to create their own identity and share their thoughts among the other users. This helps the people to build a social relationship and to identify the users with same interest and to view the content posted by other users. [15] Event detection has been a prominent research topic for long time [24]. Social networks or social media is widely used as a source of information for the detection of events like traffic congestion, incidents, natural disaster (earth quakes, storms, floods, fires, etc.), or other events. An event is defined as the real life happenings that occur over a period of time and space. [14] Actually, event detection in Twitter is used for various real-time notifications such as that necessary for help during a large-scale disaster emergency and updates of real time traffic. Twitter is one of the most prevalent micro blog in recent times, which produces more than 400 million tweets by more than 140 million users per day. Twitter allows the users to post messages called tweets which is no longer than 140 characters. Each users have several followers, who can re- tweet their post. Because of its huge users, the content they tweet also varies based on their interest and behavior. [11] Unlike other media, tweets provide timely and fine grained information about any kind of events. Twitter has also arisen as a fastest communication medium for collecting and broadcasting breaking news. [3][17][16] Tweets can be called as dynamic source of information. Mining or analyzing such rich source of data can provide unknown information. It has also become a powerful tool to monitor terrorist activity and to predict crime. [21] Event detection in traditional media has long been addressed in a research program called Topic Detection and Tracking (TDT) which determines the events from news articles[1][23]. But event detection from twitter data poses new challenge because tweets are unstructured, informal and meaningless and contain abbreviated words. Social media content may also contain a polluted and false data which affects the event detection algorithm. This article focuses on a survey of approaches and techniques for event detection. The techniques are categorized based on the type of event, detection method and detection tasks. Overview of Twitter As described in introduction, twitter is the fastest growing and popular micro blogging service. Twitter was developed by Biz Stone, Jack Dorsey, Evan Williams, and Noah Glass in March 2006 and launched by July 2006. It has been ranked as eighth most popular site. Twitter allows posting short and concise ideas of any kind of information, their thoughts and opinions of real time happenings in and around them. A Comprehensive Survey Of Approaches Used For Detecting Events In Twitter 261 Twitter allows the user to create their own profile, and connect with other users available in twitter. A user in twitter can follow any other users without getting their approval. No limitations are imposed by the twitter on the number of followers for a user, but one user must typically follow only up to 2000 users. A registered users can post or read tweets. But unregistered users can only read; however one can set privacy for their content to hide the content. User can read the timeline of other users to know their chronological streams of tweets. Twitter allows the users to use the symbols like @, # etc. Hashtags are used to group the topic in which any keyword or phrase is preceded by the symbol #. So these hashtags are used to collect tweets of different users in a same subject of interest. The symbol @ before a username creates a mention or reply link. The users can also repost the tweets of another users using RT (retweet). Twitter provides an API called twitter streaming API that allows users to access the public data and many services programmatically. The content of twitter is categorized as follows, Table 1: Categories of twitter data Category Pointless babblers Conversational Pass along value Self-promotion Spam News Percentage 40% 38% 9% 6% 4% 4% Figure 1: Architecture of Twitter streaming 262 Jaya Sree R.V. et al Twitter messages are reported in timely manner during disaster or in any emergency situation. Some examples are bomb blast in Mumbai in 2008, flooding of red river valley in US and Canada in 2009, US airway plane crash on the Hudson river in 2009. Thus twitter users are considered as information source or information seekers. Challenges of analyzing twitter messages: The twitter contains meaningless messages called pointless blabbers or rumors [5]. So the major challenge in event detection from twitter is to remove this polluted content from real world events. Types of Events Event has been classified as temporal event and spatio-temporal event. Temporal events are related to time series. Spatio- temporal events corresponds to events occurred over a period of time at a specific space. They have been also classified as retrospective events and new events [22, 2]. Retrospective events (RED) are historical events that occurred in past. New events (NED) occur in live streams. Event Detection Algorithms In Traditional Media: Event detection algorithm in traditional media can be classified as document pivot method and feature pivot method. a) Document pivot method By using document pivot method, the events are detected by clustering of similar documents. The similarity of the document is determined based on the semantic distance between the documents and its textual similarity. In traditional media, the event detection is done on news documents. The research project TDT focuses on this issue [1]. TDT (Topic Detection and Tracking) consists of three major tasks: Segmentation Detection Tracking TDT attempts to identify the important events from the newswire and present them to the users in a concise manner. The corpus of news documents is preprocessed and the data are represented in a specific format for further investigation. The preprocessing step includes stop word removal, stemming and tokenization technique. The preprocessed data are represented in term vector or bag of words format. Each term of the document are weighed and represented in tf-idf matrix (term frequency- inverse document frequency) [19]. Another form of data representation is named entity representation [12]. Sometimes, both representations are combined together into mixed vector [23, 12]. A Comprehensive Survey Of Approaches Used For Detecting Events In Twitter 263 b) Feature pivot method By using feature pivot method, distribution of words in the corpus is analyzed and the words are grouped together to detect events. An infinite state automaton is modeled to identify the bursty documents in a stream over a particular period of time [10]. The appearance of individual words is modeled using binomial distribution and the bursty words are detected by some threshold based heuristics. Such bursty words are grouped together to identify the bursty events [6]. The above two algorithm detects the events by analyzing words from time domain. The words can also be analyzed by frequency domain using discrete Fourier transformation (DFT) [8] and wavelet transformation [9, 7]. In Twitter: Events from twitter messages are generally detected using machine learning approaches. Either it can be of supervised or unsupervised learning. Events to be detected falls under any of the two types (NED or RED) mentioned in section 3. Most of the research emphasis on new event detection (NED), but in recent year’s interest in RED also increased like query based event retrieval [13]. The targeted events can either be specified or unspecified. Specifie d events include planned social events or known events (ex: Any disaster in particular region). In specified event detection all information like time, venue, and location are known. Unspecified events includes unplanned or unknown events (ex: tweets about late breaking news) [20]. Detection of events from twitter stream is done by various techniques like machine learning and data mining, NLP (natural language processing), text mining, information extraction and information retrieval. Unsupervised learning: Like traditional media, unsupervised learning for unspecified events in twitter is detected using clustering algorithms. This unsupervised learning of unspecified event is most appropriate for NED events than RED events. Clustering NED events produces accurate and efficient events. [20, 4, 17] Supervised Learning: NED for specified events relies on supervised learning. Many supervised learning algorithms have been proposed for event detection. Some of them are Naïve Bayes classification algorithm, SVM (support vector machines), decision tree [18] In some research, both supervised clustering and unsupervised classification is combined together for event detection. Relevant or important tweets are detected using classification before applying the clustering algorithm. Performance One of the major issues in event detection of twitter messages is performance evaluation. In traditional media, the performance metric includes precision, recall and F-measure. 264 Jaya Sree R.V. et al Precision is defined as fraction of detected events that are relevant. Precision=|{relevant events}∩{detected events}| |{detected events}| Recall is defined as fraction of relevant events that are detected. Recall=| {relevant events} ∩ {detected events}| |{relevant events}| F-measure is the harmonic mean of precision and recall. Recall measure is not suitable for noisy data sets. And it is infeasible for twitter messages. Thus precision metric in most of the existing research work Conclusion This paper focuses on event detection from micro-blogs. Event detection finds real life happenings that occurred over period of time and space. A micro blogging website provides valuable unknown knowledge from user generated content. However, tweets contain humdrum messages (meaningless, polluted, and useless). Thus event detection helps to determine relevant information of specific interest. This paper provides a brief survey of various approaches used for detecting events from twitter messages. We also presented the characteristics of twitter and challenges of analyzing the tweets. We discussed various types of events (NED, RED) and its suitable detection techniques. The method of detection events includes unsupervised and supervised learning which uses clustering algorithms and classification algorithms respectively. Finally, the performance metric of event detection algorithm is also discussed. References [1] [2] [3] Allan, J., J. Carbonell, G. Doddington, J. Yamron, and Y. Yang. 1998. Topic detection and tracking pilot study final report. In Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, VA, pp. 194–218. Allan, J., R. Parka, and Lavrenko V. 1998. Online new event detection and tracking. In Proceedings of the 21st Annual International ACMSIGIR Conference on Research and Development in Information Retrieval, SIGIR ’98, ACM, New York, NY, pp. 37–45. Amer-Yahia, S., S. Anjum, A. Ghenai, A. Siddique, S. Abbar, S. Madden, A. Marcus, And M.El-Haddad. 2012. Maqsa: A system for social analytics on news. In Proceedings of the 2012 ACM SIGMOD International Conference A Comprehensive Survey Of Approaches Used For Detecting Events In Twitter [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] 265 on Management of Data, SIGMOD ’12, ACM, New York, NY, pp. 653–656. Becker, H., M. Naaman, and L. Gravano. 2011a. Beyond trending topics: Real-world event identification on Twitter. In ICWSM, Barcelona, Spain. Castillo, C., M. Mendoza, and B. Poblete. 2011. Information credibility on Twitter. In Proceedings of the 20th International Conference on World Wide Web, WWW ’11, ACM, New York, NY, pp. 675–684. Fung, G. P. C., J. Xu Yu, P. S. Yu, and H. Lu. 2005. Parameter free bursty events detection in text streams. In Proceedings of the 31st International Conference on Very Large Data Bases, VLDB ’05, VLDB Endowment, pp. 181–192. Gerald Kaiser. A friendly guide to wavelets. Birkhauser Boston Inc., Cambridge, MA, USA, 1994. He, Q., K. Chang, and E.-P. Lim. 2007. Analyzing feature trajectories for event detection. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’07, ACM, New York, NY, pp. 207–214. Ingrid Daubechies. Ten lectures on wavelets. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1992. Kleinberg, J. 2002. Bursty and hierarchical structure in streams. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’02, ACM, New York, NY, pp. 91–101. Krishnamurthy, B., P. Gill, and M. Arlitt. 2008. A few chirps about Twitter. In Proceedings of the First Workshop on Online Social Networks, WOSN ’08, ACM, New York, NY, pp. 19–24. Kumaran, G., and J. Allan. 2004. Text classification and named entities for new event detection. In Research and Development in Information Retrieval, Sheffield, UK, pp. 297–304. Metzler, D., C. Cai, and E. H. Hovy. 2012. Structured event retrieval over micro blog archives. In HLTNAACL, pp. 646–655. Milstein .S, Chowdhury .A, Hochmuth. G, Lorica. B, and Magoulas. R. Twitter and the micro-messaging revolution: Communication, connections, and immediacy.140 characters at a time. O’Reilly Media, 2008. Mislove. A, Marcon.M, Gummadi K. P., Druschel.P and Bhattacharjee. B, “Measurement and analysis of online social networks,” in Proc. 7th ACM SIGCOMM Conf. Internet Meas., San Diego, CA, USA, 2007, pp. 29–42. Pak, A., and P. Paroubek. 2010. Twitter as a corpus for sentiment analysis and opinion mining. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), European Language Resources Association (ELRA), Valletta, Malta. Phuvipadawat, S., and T. Murata. 2010. Breaking news detection and tracking in Twitter. In IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), Vol. 3, Toronto, ON, pp. 120–123. 266 Jaya Sree R.V. et al [18] Popescu, A. M., M. Pennacchiotti, and D. Paranjpe. 2011. Extracting events and event descriptions from Twitter. In Proceedings of the 20th International Conference Companion on World Wide Web, WWW ’11, pp. 105–106. [19] Salton, G. 1988. Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer. Addison-Wesley: Boston. [20] Sankaranarayanan, J., H. Samet, B. E. Teitler, M. D. Lieberman, and J. Sperling. 2009. TwitterStand: News in tweets. In Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, GIS ’09, ACM, New York, NY, pp. 42–51. [21] Wang, X., M. S. Gerber, and D. E. Brown. 2012. Automatic crime prediction using events extracted from Twitter posts. In Proceedings of the 5th International Conference on Social Computing, Behavioral-Cultural Modeling and Prediction, SBP’12. Springer Verlag: Berlin, Heidelberg, pp. 231–238. [22] Yang, Y., T. Pierce, and J. Carbonell. 1998. A study of retrospective and online event detection. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’98, ACM, New York, NY, pp. 28–36. [23] Yang, Y., J. G. Carbonell, R. D. Brown, T. Pierce, B. T. Archibald, and X. Liu. 1999. Learning approaches for detecting and tracking news events. IEEE Intelligent Systems, 14(4): 32–43. [24] Yiming Yang, Tom Pierce, and Jaime Carbonell. A study of retrospective and online event detection. In SIGIR ’98: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pages 28–36, New York, NY, USA, 1998.ACM.