Proceedings of the ACM SIGKDD Workshop on CyberSecurity and Intelligence Informatics (CSI-KDD)
June 28, 2009, Paris, France
Held in conjunction with SIGKDD'09

Workshop Organizers: Hsinchun Chen, Marc Dacier, Marie-Francine Moens, Gerhard Paass, Christopher C. Yang

The Association for Computing Machinery, Inc.
2 Penn Plaza, Suite 701
New York, NY 10121-0701

Copyright © 2009 by the Association for Computing Machinery, Inc. (ACM). Permission to make digital or hard copies of portions of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers or to redistribute to lists requires prior specific permission and/or a fee. Request permission to republish from: Publications Dept., ACM, Inc., Fax +1-212-869-0481, or E-mail [email protected]. For other copying of articles that carry a code at the bottom of the first or last page, copying is permitted provided that the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.

Notice to Past Authors of ACM-Published Articles: ACM intends to create a complete electronic archive of all articles and/or other material previously published by ACM. If you have written a work that was previously published by ACM in any journal or conference proceedings prior to 1978, or in any SIG newsletter at any time, and you do NOT want this work to appear in the ACM Digital Library, please inform [email protected], stating the title of the work, the author(s), and where and when it was published.

ACM ISBN: 978-1-60558-669-4

Additional copies may be ordered prepaid from: ACM Order Department, P.O. Box 11405, Church Street Station, New York, NY 10286-1405. Phone: 1-800-342-6626 (U.S.A. and Canada), +1-212-626-0500 (all other countries). Fax: +1-212-944-1318. E-mail: [email protected].
ACM Order Number:
Printed in the U.S.A.

Preface

Computer-supported communication and infrastructure are integral parts of the modern economy. Their security is of critical importance to a wide variety of practical domains, ranging from Internet service providers to the banking industry and e-commerce, and from corporate networks to the intelligence community. The CSI-KDD workshop focuses on novel knowledge discovery methods addressing CyberSecurity and intelligence issues, as well as innovative applications demonstrating the effectiveness of data mining in solving real-world security problems. The challenge for novel methods originates from the emergence of new types of content and protocols, and only an integrated view of all modes promises optimal results. Innovative applications are essential, as IT communication and computer-supported technical and social infrastructure have an extremely complex structure and require a comprehensive approach to prevent criminal activities.

As an invited speaker we welcome André Bergholz, Fraunhofer IAIS. He will report on the lessons learnt about phishing filtering in the specific targeted research project AntiPhish, funded by the European Union. He reports on filter methodologies evaluated in a test laboratory setting, and describes the application of this technology to real-world email streams, where it is used to filter all email traffic online in real time.
In the afternoon there will be an invited talk entitled "Data Security and Integrity: Developments and Directions", given by Bhavani Thuraisingham. The talk concentrates on ensuring that only authorized individuals have access to data and that data is protected from malicious corruption.

The workshop is organized in two tracks. The first track, "Novel Knowledge Discovery Methods for the Security Domain", targets advanced data mining approaches for CyberSecurity. The second track, "Innovative Techniques and Applications in Intelligence Informatics", concentrates on large-scale security applications.

Despite the fact that various types of security mechanisms have been defined and are widely deployed to prevent malicious users from launching attacks, cyber criminals remain quite successful at misusing legitimate protocols and applications for their own benefit. Spam and phishing campaigns are routinely carried out. Drive-by download attacks are among the major threats facing ordinary users surfing the web. BGP hijacks corrupting Internet routing tables are becoming known to the public as well, to name only a few examples.

The first track considers data mining approaches that can help address security issues along at least three distinct axes. First of all, thanks to the very large amounts of application logs of various kinds that are available, data mining is a valid approach to better understand the attacks we are facing and to help design better preventive and/or detection mechanisms to respond to these attacks. Second, by analyzing traffic and other attack-related traces, data mining can help build a better picture of who is attacking us, tackling the "attack attribution problem" under the umbrella of e-forensics techniques. Finally, data mining can be employed to detect attacks by classifying content, e.g. by filtering spam and phishing messages.

Still, a number of research issues remain. Different types of content have to be analyzed (e.g. email, websites, embedded images, transmitted code, activity logs), and only an integrated view of all modes promises optimal results. This, for instance, involves mining the different media associated with an email and combining the results to improve accuracy. Spammers, hackers and producers of fraudulent content continuously change their tactics, requiring adaptive and even anticipatory mining techniques. As intelligent opponents aim at thwarting analyses, specific approaches for analysis are required. In addition, many new forms of messaging (e.g., SMS, MMS), often anchored in a mobile environment, become victims of malicious manipulation.

For the first track we accepted four very interesting oral presentations covering topics such as intrusion and malware detection, attack attribution and spam filtering. The training of intrusion detection models is often computationally expensive, hence the interest in efficient models that still assure a high predictive accuracy. Chen Yo-Shu and Chen Yi-Ming present this view in the paper "Combining Incremental Hidden Markov Model and Adaboost Algorithm for Anomaly Intrusion Detection". In the paper entitled "Addressing the Attack Attribution Problem using Knowledge Discovery and Multi-criteria Fuzzy Decision-Making", O. Thonnard, W. Mees and M. Dacier propose an analysis framework for reasoning about the root causes of attacks observed on the Internet. They apply it to some large real-world datasets and derive interesting insights into what they call armies of zombies.
"A Data Mining Framework for Malware Detection Using Statistical Analysis of Byte-Level File Content" by S. Momina Tabish, M. Zubair Shafiq and Muddassar Farooq discusses non-signature based malware detection. The approach is successful and assumes that benign files are quite distinct from malware files considering a byte-level format. A very novel and interesting approach inspired by gaming theories is presented and applied on phishing mail filtering in an adversarial setting by Gaston L’Huillier, Richard Weber and Nicolas Figueroa (Online Phishing Classification Using Adversarial Data Mining and Signaling Games). The second track is called "Innovative Techniques and Applications in Intelligence Informatics". Intelligence Informatics is concerned with the study of the development and use of advanced information technologies and systems for national, international, and societal security-related applications. The annual IEEE International Conference series on Intelligence and Security Informatics with over two hundred attendants was started in 2003. In addition, the Pacific Asia Workshop on Intelligence and Security Informatics with over eighty attendees has been started in 2006. These intelligence and security informatics events have brought together academic researchers, law enforcement and intelligence experts, information technology consultants and iv practitioners to discuss their research and practice related to various intelligence and security informatics topics. Among these research topics in intelligence security, there is a strong focus on data mining and knowledge discovery. It is the first attempt to introduce intelligence informatics to the ACM SIGKDD community. The four major topics of intelligence security include (a) information sharing and data/text/web mining, (b) infrastructure protection and emergency responses, (c) terrorism informatics, and (d) enterprise risk management and information system security. In information sharing and data/text/Web mining, we focus on criminal data mining, criminal/intelligence information sharing and visualization, cyber crime detection and analysis, authorship analysis, deception detection and analysis, and information sharing governance. We investigate how to use advanced data sharing and mining techniques to support law enforcement and intelligent experts in their investigations so that effective results can be achieved efficiently. In infrastructure protection and emergency responses, we explore several interesting infrastructure problems such as bioterrorism information infrastructure, transportation and communication infrastructure protection, cyber-infrastructure design and protection, border safety, disaster prevention, detection and management, and emergency response and management. As we can see in recent natural disasters and terror attacks, a good infrastructure protection and emergency response management will minimize damages and recover from devastation in a shorter amount of time. In terrorism informatics, we investigate several terrorism related informatics problems. For instances, we investigate terrorism related analytical methodologies and software tools, terrorism knowledge portals and databases, terrorist incident chronology databases, terrorism root cause analysis, social network analysis, forecasting and countering terrorism and measuring the effectiveness of counter-terrorism campaigns. 
In recent years, we have also included enterprise risk management and information systems security, in which we examine information security management standards, information systems security policies, fraud detection, board activism and influence, corporate sentiment surveillance, market influence analytics and media intelligence, and consumer-generated media and social media analytics.

The program committee has selected five papers in intelligence informatics for presentation. Park and Treglia developed a model and theory of intelligence information sharing through a literature review, experience, and interviews with practitioners. Yang and Tang proposed a subgraph generalization approach to share and integrate terrorist or criminal social network data between different intelligence and law enforcement units while preserving the privacy of individuals in the social networks. Such a social network sharing and integration technique improves the performance of social network analysis tasks such as centrality measurement. Thuraisingham et al. investigated the information management component of military stabilization and reconstruction operations. They developed the temporal geosocial service-oriented architecture system (TG-SOA), which utilizes the temporal geosocial semantic web to manage the lifecycle of stabilization and reconstruction operations. Senator examined the common criticisms of data mining applications for security, which argue that such applications are ineffective and threaten civil liberties. He analyzed these criticisms by modeling the phenomena and proposing alternative designs. Kwok et al. studied the security of public companies' web servers against cyber attacks. The study included ten Hong Kong Hang Seng Index companies and ten Hong Kong China Enterprises Index companies. A pyramid risk analysis tool was also proposed.

This workshop would not have been possible without the invited speakers, the authors, and the 37 members of the program committee. We express our gratitude towards them. We also thank Fraunhofer IAIS, which took care of the CSI-KDD 2009 website as well as the submission system.

Hsinchun Chen, Marc Dacier, Marie-Francine Moens, Gerhard Paass, Christopher C. Yang

CSI-KDD 2009 Organizers and Program Committee

Organizers
Hsinchun Chen, The University of Arizona, Tucson, USA
Marc Dacier, Symantec Research Labs Europe, France
Marie-Francine Moens, K.U. Leuven, Belgium
Gerhard Paass, Fraunhofer IAIS, St. Augustin, Germany (contact)
Christopher C. Yang, Drexel University, Philadelphia, USA

Program Committee
Adedeji B. Badiru, Air Force Institute of Technology, Dayton, OH, USA
Yigal Arens, USC/ISI, USA
John Aycock, University of Calgary, Canada
Antonio Badia, University of Louisville, USA
Andre Bergholz, Fraunhofer IAIS, Germany
Ulf Brefeld, MPI Saarbrücken, Germany
Patrick S. Chen, Tatung University, Taiwan
Robert W.P. Chang, Criminal Investigation Bureau, Taiwan
Domenico Dato, Tiscali Services, Italy
Yuval Elovici, Ben-Gurion University, Israel
Uwe Glaesser, Simon Fraser University, Canada
Nazli Goharian, Illinois Institute of Technology, USA
Mark Goldberg, RPI, USA
Henrik Grosskreutz, Fraunhofer IAIS, Germany
David Hicks, Aalborg University Esbjerg, Denmark
Thorsten Holz, University of Mannheim, Germany
Patrick Horkan, Symantec, Ireland
Sotiris Ioannidis, Institute of Computer Science, Greece
Latifur Khan, University of Texas at Dallas, USA
Engin Kirda, Eurecom, France
Christopher Kruegel, UC Santa Barbara, USA
Sheau-Dong Lang, University of Central Florida, USA
Ee-Peng Lim, Singapore Management University, Singapore
Evangelos Markatos, Institute of Computer Science, Greece
Robert Moskovitch, Ben-Gurion University, Israel
William Pottenger, Rutgers University, USA
Raghav Rao, State University of New York at Buffalo, USA
Elliot Rich, University at Albany, SUNY, USA
Stefan Rüping, Fraunhofer IAIS, Germany
Bracha Shapira, Ben-Gurion University, Israel
David Skillicorn, Queen's University, Canada
Randy Smith, University of Alabama, USA
Paul Thompson, Dartmouth College, USA
Cedric Ulmer, SAP Research, France
Nalini Venkatasubramanian, University of California, Irvine, USA
Zhao Xu, Fraunhofer IAIS, Germany
Urko Zurutuza, Mondragon University, Spain
Table of Contents

Preface .......... iii
Workshop Organizers and Program Committee .......... vii
Workshop Program .......... ix

Invited Talk
AntiPhish – Lessons Learnt
André Bergholz .......... 1

Track 1: Novel Knowledge Discovery Methods for the Security Domain
Combining Incremental Hidden Markov Model and Adaboost Algorithm for Anomaly Intrusion Detection
Chen Yo-Shu, Chen Yi-Ming .......... 3
Addressing the Attack Attribution Problem using Knowledge Discovery and Multi-criteria Fuzzy Decision-Making
O. Thonnard, W. Mees and M. Dacier .......... 11
A Data Mining Framework for Malware Detection Using Statistical Analysis of Byte-Level File Content
S. Momina Tabish, M. Zubair Shafiq and Muddassar Farooq .......... 23
Online Phishing Classification Using Adversarial Data Mining and Signaling Games
Gaston L'Huillier, Richard Weber and Nicolas Figueroa .......... 33

Invited Talk
Data Security and Integrity: Developments and Directions
Bhavani Thuraisingham .......... 43

Track 2: Innovative Techniques and Applications in Intelligence Informatics
Towards Trusted Intelligence Information Sharing
Joon Park and Joseph Treglia .......... 45
Social Networks Integration and Privacy Preservation using Subgraph Generalization
Christopher Yang and Xuning Tang .......... 53
Novel Knowledge Discovery Methods for the Security Domain
Bhavani Thuraisingham, Latifur Khan and Murat Kantarcioglu .......... 63
On the Efficacy of Data Mining for Security Applications
Ted Senator .......... 75
A Study of Online Service and Information Exposure of Public Companies
Sai Ho Kwok, Cheuk Tung Lai and Jason Yeung .......... 85

Workshop Program

09:00-10:00 Invited Talk
  09:00-10:00: AntiPhish – Lessons Learnt, André Bergholz
10:00-10:30 Coffee Break
10:30-12:30 Track 1: Novel Knowledge Discovery Methods for the Security Domain
  10:30-11:00: Combining Incremental Hidden Markov Model and Adaboost Algorithm for Anomaly Intrusion Detection, Chen Yo-Shu and Chen Yi-Ming
  11:00-11:30: Addressing the Attack Attribution Problem using Knowledge Discovery and Multi-criteria Fuzzy Decision-Making, O. Thonnard, W. Mees and M. Dacier
  11:30-12:00: A Data Mining Framework for Malware Detection Using Statistical Analysis of Byte-Level File Content, S. Momina Tabish, M. Zubair Shafiq and Muddassar Farooq
  12:00-12:30: Online Phishing Classification Using Adversarial Data Mining and Signaling Games, Gaston L'Huillier, Richard Weber and Nicolas Figueroa
12:30-14:00 Lunch
14:00-14:40 Invited Talk
  14:00-14:40: Data Security and Integrity: Developments and Directions, Bhavani Thuraisingham
14:40-15:30 Track 2A: Innovative Techniques and Applications in Intelligence Informatics
  14:40-15:05: Towards Trusted Intelligence Information Sharing, Joon Park and Joseph Treglia
  15:05-15:30: Social Networks Integration and Privacy Preservation using Subgraph Generalization, Christopher C. Yang and Xuning Tang
15:30-16:00 Coffee Break
16:00-17:30 Track 2B: Innovative Techniques and Applications in Intelligence Informatics
  16:00-16:30: Novel Knowledge Discovery Methods for the Security Domain, Bhavani Thuraisingham, Latifur Khan and Murat Kantarcioglu
  16:30-17:00: On the Efficacy of Data Mining for Security Applications, Ted Senator
  17:00-17:30: A Study of Online Service and Information Exposure of Public Companies, Sai Ho Kwok, Cheuk Tung Lai and Jason Yeung

AntiPhish – Lessons Learnt

André Bergholz
Fraunhofer Institute Intelligent Analysis and Information Systems (IAIS)
Schloss Birlinghoven, St. Augustin, Germany
[email protected]

ABSTRACT

Phishing emails usually contain a message from a credible-looking source requesting a user to click a link to a website where he or she is asked to enter a password or other confidential information. Most phishing emails aim at withdrawing money from financial institutions or gaining access to private information. Phishing has increased enormously over recent years and is a serious threat to global security and the economy.

There are a number of possible countermeasures to phishing. These range from communication-oriented approaches like authentication protocols over blacklisting to content-based filtering approaches [3]. We argue that the first two approaches are currently not broadly implemented or exhibit deficits. Therefore content-based phishing filters are necessary and widely used to increase communication security. A number of features are extracted that capture the content and structural properties of the email. Subsequently, a statistical classifier is trained using these features on a training set of emails labeled as ham (legitimate), spam or phishing. This classifier may then be applied to an email stream to estimate the classes of new incoming emails.
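As a concrete illustration of this pipeline, the sketch below trains such a three-class (ham/spam/phishing) classifier. It is a minimal example assuming scikit-learn is available; the tiny corpus and labels are invented for illustration, and plain TF-IDF features merely stand in for the much richer content and structural features described in the talk. It is not the AntiPhish filter itself.

```python
# Minimal sketch of the ham/spam/phishing filtering setup described above.
# NOT the AntiPhish system; the toy corpus below is invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training set: email bodies with ham/spam/phishing labels.
emails = [
    "Meeting moved to 3pm, see agenda attached",                    # ham
    "Cheapest meds online, buy now with huge discount",             # spam
    "Your account is locked, click here to verify your password",   # phishing
    "Quarterly report draft for your review",                       # ham
    "You won a free prize, claim immediately",                      # spam
    "Unusual sign-in detected, confirm your bank credentials",      # phishing
]
labels = ["ham", "spam", "phishing", "ham", "spam", "phishing"]

# TF-IDF features stand in for the richer features (topic models, link
# analysis, logo detection, hidden-salting indicators) from the abstract.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
classifier.fit(emails, labels)

# Apply the trained classifier to a stream of new incoming emails.
for msg in ["Please verify your password at this link", "Lunch tomorrow?"]:
    print(msg, "->", classifier.predict([msg])[0])
```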
AntiPhish is a specific targeted research project funded under Framework Program 6 by the European Union. It aims at developing improved anti-phishing technologies that help to protect and secure the global email communication infrastructure. The project on the one hand developed the filter methodology in a test laboratory setting, and on the other hand implemented this technology in real-world settings, where it is used to filter all email traffic online in real time.

In this talk we summarize our experience with phishing filtering on benchmark data and, in addition, on different real-life email streams. First we describe a number of novel features that are particularly well-suited to identifying phishing emails [1]. These include statistical models for low-dimensional descriptions of email topics, sequential analysis of email text and external links, the detection of embedded logos, and indicators of hidden salting [2]. Hidden salting is the intentional addition or distortion of content not perceivable by the reader. For empirical evaluation we have obtained a large realistic corpus of emails pre-labeled as spam, phishing, and ham (legitimate). In experiments with benchmark data our methods outperform other published approaches for classifying phishing emails.

The second part of the talk describes the application of these approaches to real-life email streams. On the one hand we investigate how we can identify new phishing emails arriving from a honeypot system. This allows us to spot new types of phishing mails. Subsequently, the characteristics of these new phishing emails can be used to update client-based phishing filters. A second experiment investigates the capabilities of the AntiPhish system when monitoring emails in an ISP framework. It turns out that active learning approaches are very effective at maintaining and improving filtering accuracy. We discuss the implications of these results for the practical application of this approach in the workflow of an email provider. Finally we describe a strategy for how the filters may be updated and adapted to new types of phishing.

ACKNOWLEDGMENTS

This talk is based upon work performed within the FP6-027600 project AntiPhish (http://www.antiphishresearch.org/). The authors would like to thank the European Commission for partially funding the AntiPhish project as well as all the AntiPhish project partners for their interest, support, and collaboration in this initiative.

REFERENCES

[1] André Bergholz, Jan De Beer, Sebastian Glahn, Marie-Francine Moens, Gerhard Paass, and Siehyun Strobel. 2009. New Filtering Approaches for Phishing Email. Accepted for publication in the Journal of Computer Security (JCS).
[2] André Bergholz, Gerhard Paass, Frank Reichartz, Siehyun Strobel, Marie-Francine Moens, and Brian Witten. 2008. Detecting Known and New Salting Tricks in Unwanted Emails. Fifth Conference on Email and Anti-Spam (CEAS 2008), Aug 21-22, 2008.
[3] Markus Jakobsson and Steven Myers. 2007.
Phishing and Countermeasures: Understanding the Increasing Problem of Electronic Identity Theft. Wiley, Hoboken, New Jersey.

BIOGRAPHY

André Bergholz is a senior research engineer in the text mining group at Fraunhofer IAIS. He is interested in text and data analysis and management. Prior to joining Fraunhofer, André Bergholz worked as a research engineer at Xerox Research Centre Europe in the domain of document management. André Bergholz holds a PhD and a German diploma degree, both from Humboldt University Berlin. At the time he specialized in the management of semistructured data; he also did a postdoc at Stanford University.
Combining Incremental Hidden Markov Model and Adaboost Algorithm for Anomaly Intrusion Detection

Chen Yo-Shu, Chen Yi-Ming
Department of Information Management, National Central University, Zhongli, Taiwan, R.O.C.

ABSTRACT

The traditional Hidden Markov Model (HMM) has been successfully applied to anomaly intrusion detection. The Incremental HMM (IHMM) further improves the training time of the HMM. However, both HMM and IHMM still suffer from a high false positive rate. In this paper, we propose Adaboost-IHMM, which combines IHMM and adaboost for anomaly intrusion detection. As adaboost uses many IHMMs to collectively classify samples and then decides among the samples' classifications, Adaboost-IHMM can improve classification accuracy. Experimental results with the Stide datasets show that the proposed method can significantly improve the false positive rate by 70% without decreasing the detection rate. Besides, we also propose a method to adjust the normal profile to avoid erroneous detections caused by changes in normal behavior. We perform experiments with realistic datasets extracted from the use of popular browsers. Compared with the traditional HMM method, our method improves the training time to build a new normal profile by 90%.

Categories and Subject Descriptors: D.4.6 [Operating Systems]: Security and Protection
General Terms: Security
Keywords: Anomaly Intrusion Detection, Normal Profile, IHMM, Adaboost
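To make the combination described in the abstract concrete, here is a minimal AdaBoost sketch in its spirit: each weak classifier thresholds a likelihood score, misclassified samples gain weight, and the confidence-weighted votes form the strong classifier. This is an invented illustration assuming only numpy; the random scores merely stand in for per-IHMM log-likelihoods of system-call sequences, and the thresholds and number of rounds are arbitrary. It is not the authors' Adaboost-IHMM code.

```python
# AdaBoost over likelihood-threshold weak classifiers: a toy reconstruction.
# Random numbers stand in for per-IHMM log-likelihoods; data are invented.
import numpy as np

rng = np.random.default_rng(0)

# Fake "log-likelihood" scores from 5 IHMMs for 200 sequences:
# normal sequences (label +1) tend to score higher than intrusions (-1).
X = np.vstack([rng.normal(0.0, 1.0, (100, 5)),     # normal
               rng.normal(-1.5, 1.0, (100, 5))])   # anomalous
y = np.array([1] * 100 + [-1] * 100)

w = np.full(len(y), 1.0 / len(y))    # sample weights, initially uniform
stumps, alphas = [], []
for t in range(5):                    # 5 boosting rounds (arbitrary choice)
    best = None
    for j in range(X.shape[1]):       # pick the best threshold classifier
        for theta in np.percentile(X[:, j], [25, 50, 75]):
            pred = np.where(X[:, j] >= theta, 1, -1)
            err = w[pred != y].sum()  # weighted classification error
            if best is None or err < best[0]:
                best = (err, j, theta, pred)
    err, j, theta, pred = best
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))  # classifier confidence
    w *= np.exp(-alpha * y * pred)    # re-weight: hard samples gain weight
    w /= w.sum()
    stumps.append((j, theta))
    alphas.append(alpha)

# Final strong classifier: sign of the confidence-weighted vote.
votes = sum(a * np.where(X[:, j] >= th, 1, -1) for a, (j, th) in zip(alphas, stumps))
print("training accuracy:", (np.sign(votes) == y).mean())
```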
Addressing the Attack Attribution Problem using Knowledge Discovery and Multi-criteria Fuzzy Decision-Making

Olivier Thonnard, Royal Military Academy, Polytechnic Faculty, Brussels, Belgium, [email protected]
Wim Mees, Royal Military Academy, Polytechnic Faculty, Brussels, Belgium, [email protected]
Marc Dacier, Symantec Research, Sophia Antipolis, France, [email protected]

ABSTRACT
In network traffic monitoring, and more particularly in the realm of threat intelligence, the problem of "attack attribution" refers to the process of effectively attributing new attack events to (un)known phenomena, based on some evidence or traces left on one or several monitoring platforms. Real-world attack phenomena are often largely distributed on the Internet, or can sometimes evolve quite rapidly. This makes them inherently complex and thus difficult to analyze. In general, an analyst must consider many different attack features (or criteria) in order to decide about the plausible root cause of a given attack, or to attribute it to some given phenomenon. In this paper, we introduce a global analysis method to address this problem in a systematic way. Our approach is based on a novel combination of a knowledge discovery technique with a fuzzy inference system, which somehow mimics the reasoning of an expert by implementing a multi-criteria decision-making process built on top of the previously extracted knowledge. By applying this method on attack traces, we are able to identify large-scale attack phenomena with a high degree of confidence. In most cases, the observed phenomena can be attributed to so-called zombie armies - or botnets, i.e., groups of compromised machines controlled remotely by the same entity. By means of experiments with real-world attack traces, we show how this method can effectively help us to perform a behavioral analysis of those zombie armies from a long-term, strategic viewpoint.

Keywords
Intelligence monitoring and analysis, attack attribution.

1. INTRODUCTION
In the field of threat intelligence, "attack attribution" refers to the process of effectively attributing new attack events to known or unknown phenomena by analyzing the traces they
have left on sensors or monitoring platforms deployed on the Internet. The objectives of such a process are twofold: i) to get a better understanding of the root causes of the observed attacks; and ii) to characterize emerging threats from a global viewpoint by producing a precise analysis of the modus operandi of the attackers on a longer time scale. In this paper, we introduce a global threat analysis method to address this problem in a systematic way. We present a knowledge mining framework that enables us to identify and characterize large-scale attack phenomena on the Internet, based on network traces collected with very simple and easily deployable sensors. Our approach relies on a novel combination of knowledge discovery (by means of maximum cliques) and a multi-criteria decision-making algorithm that is based on a fuzzy inference system (FIS). Interestingly, a FIS does not need any training prior to making inferences. Instead, it takes advantage of the previously extracted knowledge to make sound inferences, so as to attribute incoming attack events to a given phenomenon. A key aspect of the proposed method is the exploitation of external characteristics of malicious sources, such as their spatial distributions in terms of countries and IP subnets, or the distribution of targeted sensors. We take advantage of these statistical characteristics to group events that seem a priori unrelated, whereas most current techniques used for anomalous traffic correlation rely only on the intrinsic properties of network flows (e.g., protocol characteristics, IDS alerts or signatures, firewall logs, etc.) [1, 31]. Our research also builds on prior work in malicious traffic analysis, also referred to as Internet background radiation [17, 4]. We also acknowledge the seminal work of Yegneswaran et al. on "Internet situational awareness" [30], in which they explore ways to integrate honeypot data into daily network security monitoring. Their approach aims at providing tactical information for daily operations, whereas our approach is more focused on strategic information revealing the long-term behaviors of large-scale phenomena. Furthermore, many of these large-scale phenomena are apparently related to the ubiquitous problem of zombie armies - or botnets, i.e., groups of compromised machines that are remotely controlled and coordinated by the same entity. Still today, zombie armies and botnets admittedly constitute one of the main threats on the Internet, and they are used for different kinds of illegal activities (e.g., bulk spam sending, online fraud, denial-of-service attacks, etc.) [3, 18]. While most previous studies related to botnets have focused on understanding their inner working [23, 6, 2], or on techniques for detecting bots at the network level [8, 9], we are instead more interested in studying the global behaviors of those armies from a strategic viewpoint, i.e.: how long do they stay alive on the Internet, what is their average size, and, more importantly, how do they evolve over time with respect to different criteria such as their origins or the type of activities (or scanning) they perform.

In Section 2, we present the first component of our method, namely the extraction of cliques of attackers. This step aims at discovering knowledge by identifying meaningful correlations in a set of attack events. In Section 3, we present a multi-criteria decision-making algorithm that is based on a fuzzy inference system. The purpose of this second component consists in intelligently combining the previously extracted knowledge, so as to build sequences of attack events that can very likely be attributed to the same global phenomena. Then, in Section 4, we present our experimental results and the kind of findings we can obtain by applying this analysis method to a set of attack events. Finally, we conclude in Section 5 and suggest some future directions.
2. KNOWLEDGE DISCOVERY IN ATTACK TRACES

2.1 Introduction
We first need to introduce the notion of "attack event". Our dataset is made of network attack traces collected from a distributed set of sensors (e.g., server honeypots), which are deployed in the context of the Leurre.com Project [14, 22]. Since honeypots are systems deployed for the sole purpose of being probed or compromised, any network connection that they establish with a remote IP can be considered as malicious, or at least suspicious. We use a classical clustering algorithm to perform a first low-level classification of the traffic. Hence, each IP source observed on a sensor is attributed to a so-called attack cluster [21] according to its network characteristics, such as the number of IP addresses targeted on the sensor, the number of packets and bytes sent to each IP, the attack duration, the average inter-arrival time between packets, the associated port sequence being probed, and the packet payload (when available). Therefore, all IP sources belonging to a given attack cluster have left very similar network traces on a given sensor and, consequently, they can be considered as having the same attack profile. This leads us to the concept of attack event, which is defined as follows: an attack event refers to a subset of IP sources having the same attack profile on a given sensor, and whose coordinated activity has been observed within a specific time window.

Fig. 1 illustrates this notion by representing the time series (i.e., the number of sources per day) of three coordinated attack events observed on two different sensors in the same time interval, and targeting three different ports. The identification of those events can be easily automated by using the method presented in [20]. By doing so, we are able to extract interesting events from this spurious, non-productive traffic collected by our sensors (previously termed "Internet background radiation" in [17]), and we can focus on the most important events that might originate from coordinated phenomena.

Figure 1: Illustration of 3 attack events observed on 2 different sensors, and targeting 3 different ports. (Time series of the number of sources per day; legend: AE103 (139T) on sensor 45, AE171 (1433T) on sensor 45, AE173 (5900T) on sensor 9.)

In the rest of this Section, we show how to take advantage of different characteristics of such attack events to discover knowledge by means of an unsupervised clique-based clustering technique.
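The event-identification method itself is given in [20]; purely as a rough illustration of the data structures involved, the sketch below groups per-sensor observations that share an attack cluster into time-windowed events. The record format, the names, and the simple gap-based windowing rule are all hypothetical and far cruder than the actual technique.

    from collections import defaultdict

    def build_attack_events(observations, window=1):
        # observations: (source_ip, cluster_id, sensor_id, day) tuples.
        # Sources sharing the same attack profile (cluster) on the same
        # sensor are split into events whenever a gap larger than
        # `window` days separates consecutive sightings.
        by_profile = defaultdict(list)
        for src, cluster, sensor, day in observations:
            by_profile[(cluster, sensor)].append((day, src))

        events = []
        for (cluster, sensor), hits in by_profile.items():
            hits.sort()
            current, last_day = [], None
            for day, src in hits:
                if last_day is not None and day - last_day > window:
                    events.append({"cluster": cluster, "sensor": sensor,
                                   "sources": current})
                    current = []
                current.append(src)
                last_day = day
            if current:
                events.append({"cluster": cluster, "sensor": sensor,
                               "sources": current})
        return events

    obs = [("1.2.3.4", 17, "S45", 82), ("5.6.7.8", 17, "S45", 82),
           ("9.9.9.9", 17, "S45", 90)]
    print(build_attack_events(obs))   # two events: days {82}, day {90}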
2.2 Defining Attack Characteristics
In most knowledge discovery applications, the first step consists in selecting certain key characteristics from the dataset, i.e., salient features that may (hopefully) provide meaningful patterns [11]. We give here an overview of the different attack characteristics we have selected to perform the extraction of knowledge from our set of attack events. In this specific case, we consider these characteristics as useful to analyze the root causes of global phenomena observed on our sensors. However, we do not pretend that they are the only ones that could be used in threat monitoring, and other characteristics might certainly prove even more relevant in the future. For this reason, the framework is built such that other attack features can easily be included when necessary. The first two characteristics retained are related to the origins of the attackers, i.e., their spatial distributions. First, the geographical location can be used to identify attack activities having a specific distribution of originating countries. Such information can be important to identify, for instance, botnets that are located in a limited number of countries. It is also a way to confirm the existence, or not, of so-called safe harbors for cybercriminals or hackers. Somehow related to the geographical location, the IP network blocks also provide an interesting viewpoint on the attack phenomena. Indeed, IP subnets can give a good indication of the spatial "uncleanliness" of certain networks, i.e., the tendency for compromised hosts (e.g., zombie machines) to stay clustered within unclean networks [5]. So, for each attack event, we can create a feature vector representing either the distribution of originating countries, or of IP addresses grouped by Class A subnet (i.e., by /8 prefix). The next attack characteristic deals with the targets of the attackers, namely the distribution of sensors that have been targeted by the sources. Botmasters may indeed send commands at a given time to all zombies to instruct them to start scanning (or attacking) one or several IP subnets, which of course will create coordinated attack events on specific sensors. Therefore, it seems important to look at relationships that may exist between attack events and the sensors they have been observed on. Since attack events are defined per sensor, we decided to group all strongly correlated attack events that occurred within the same time window of existence (as explained in [20]), and we then use each group of attack events to create the feature vector representing the proportion of sensors that have been targeted. Besides the origins and the targets, the type of activity performed by the attackers also seems relevant to us. In fact, bot software is often crafted with a certain number of available exploits targeting a reduced set of TCP or UDP ports. In other words, we might think of each botnet as having its own attack capability, which means that a botmaster will normally issue scan or attack commands only for vulnerabilities that he might exploit to expand his botnet. So, it seems to make sense to take advantage of this feature to look for similarities between the sequences of ports that have been targeted by the sources of the attack events. Recall that, in our low-level classification of the network traffic [21], each source is associated with the complete sequence of ports that it has targeted on a given sensor for the whole duration of the attack session (e.g., less than 24 hours), which allows us to compute and compare the distributions of port sequences for the observed attack events. Finally, we have also decided to compute, for each pair of events, the ratio of common IP addresses. We are aware of the fact that, as time passes, some zombie machines of a given botnet might be cured while others may get infected and join the botnet.
Additionally, certain ISPs apply a quite dynamic policy of IP address allocation to residential users, which means that bot-infected machines can have different IP addresses when we observe them at different moments. Nevertheless, and according to our domain experience, it is reasonable to expect that if two distinct attack events have a high percentage of IP addresses in common, then the probability that those two events are somehow related to the same global phenomenon is increased (assuming that the time difference between the two events is not too large).

2.3 Extracting Cliques of Attackers

2.3.1 Principles
In our global threat analysis method, we have developed a knowledge discovery component that involves an unsupervised graph-theoretic correlation process. The idea consists in discovering all groups of highly similar attack events (through their corresponding feature vectors) in a reliable and consistent manner, and for each attack characteristic that can bring an interesting viewpoint on the root causes. In a clustering task, we typically consider the following steps [11]: i) feature selection and/or extraction; ii) definition of a similarity measure between pairs of patterns; iii) grouping similar patterns; iv) data abstraction (if needed), to provide a compact representation of each cluster; and v) the assessment of the clusters' quality and coherence. In the previous Section, we have already described the attack features that are of interest in this paper; so now we need to measure the similarity between two such input vectors (or distributions, in our case). Clearly, the choice of a similarity metric is very important, as it has an impact on the properties of the final clusters, such as their size, quality, and consistency. To reliably compare the kind of empirical distributions mentioned here above, we have chosen to rely on strong statistical distances. As we do not know the real underlying distribution from which the observed samples were drawn, we use non-parametric statistical tests, such as Pearson's χ2, to determine whether two one-dimensional probability distributions differ in a significant way (with a significance level of 0.05). The resulting p-value is then validated against the Jensen-Shannon divergence (JSD) [15], which derives itself from the Kullback-Leibler divergence [12]. Let p1 and p2 be, for instance, two probability distributions over a discrete space X; then the K-L divergence of p2 from p1 is defined as:

$$D_{KL}(p_1 \| p_2) = \sum_{x \in X} p_1(x) \log \frac{p_1(x)}{p_2(x)}$$

which is also called the information divergence (or relative entropy). DKL is commonly used in information theory to measure the difference between two probability distributions p1 and p2, but it is not considered as a true metric since it is not symmetric, and does not satisfy the triangle inequality. For this reason, we can also define the Jensen-Shannon divergence as:

$$JS(p_1, p_2) = \frac{D_{KL}(p_1 \| \bar{p}) + D_{KL}(p_2 \| \bar{p})}{2}$$

where $\bar{p} = (p_1 + p_2)/2$. In other words, the Jensen-Shannon divergence is the average of the KL-divergences to the average distribution. The JSD has the following notable properties: it is always bounded and non-negative; JS(p1, p2) = JS(p2, p1) (symmetric); and JS(p1, p2) = 0 when p1 = p2 (idempotent). To be a true metric, the JSD must also satisfy the triangle inequality, which does not hold for all pairs (p1, p2). Nevertheless, it can be demonstrated that the square root of the Jensen-Shannon divergence is a true metric [7], which is what we need for our application.
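A minimal sketch of these statistical distances, directly following the two formulas above (NumPy only; the two country distributions at the end are invented for illustration):

    import numpy as np

    def kl(p, q):
        # Kullback-Leibler divergence D_KL(p || q); 0 * log(0/q) is taken as 0.
        p, q = np.asarray(p, float), np.asarray(q, float)
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

    def jsd(p, q):
        # Jensen-Shannon divergence: average KL divergence to the mean distribution.
        m = (np.asarray(p, float) + np.asarray(q, float)) / 2.0
        return (kl(p, m) + kl(q, m)) / 2.0

    # Two hypothetical country distributions over (CN, US, FR, other):
    p1 = [0.6, 0.2, 0.1, 0.1]
    p2 = [0.1, 0.5, 0.2, 0.2]
    distance = np.sqrt(jsd(p1, p2))   # sqrt(JSD) is a true metric
    print(distance)

The square root is taken at the end precisely because sqrt(JSD), unlike the divergence itself, satisfies the triangle inequality.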
Finally, we take advantage of those similarity measures to group all attack events whose distributions look very similar. We simply use an unsupervised graph-based approach to formulate the problem: the vertices of the graph represent the patterns (or feature vectors) of all attack events, and the edges express the similarity relationships between those vertices, as calculated with the distance metrics described here above. Then, the clustering is performed by extracting so-called maximal cliques from the graph, where a maximal clique is defined as an induced sub-graph in which the vertices are fully connected and which is not contained within any other clique. To perform this unsupervised clustering, we use the dominant sets approach of Pavan et al. [19], which proved to be an effective method for finding maximal weighted cliques. This means that the weight of every edge (i.e., the relative similarity) is also taken into consideration by the algorithm, as it seeks to discover maximal cliques whose total weight is maximized. This generalization of the maximum clique problem (MCP) is also known as the maximum weight clique problem (MWCP). We refer the interested reader to [27, 26] for a more detailed description of this clique-based clustering technique applied to our honeynet traces.

2.3.2 Some Experimental Clique Results
Our data set comes from a 640-day attack trace obtained with the Leurre.com honeynet in the time period from September 2006 to June 2008. This trace was collected by 36 platforms located in 20 different countries and belonging to 18 different Class A subnets. We have selected only the most prevalent types of activities observed on the sensors, i.e., about 130 distinct attack profiles for which an activity involving a sufficient number of IP sources had been observed at least once on a given day during the whole period. This data set comprises in total 1,195,254 distinct sources, which have sent about 3,423,577 packets to the sensors. By using the technique described in [20], we have extracted 351 attack events that were somehow coordinated on at least two different sensors. This reduced set of attack events still accounts for 282,363 unique sources (23.6% of the data set), or 741,349 packets (21.5%). For the set of attack characteristics considered above, we applied our clique-based clustering on those attack events. Table 1 presents a high-level overview of the cliques obtained for each attack dimension separately. As we can see, a relatively high volume of sources could be classified into cliques for each dimension. The last column, with the most prevalent patterns, gives an indication of which countries or Class A subnets (e.g., originating or targeted IP subnets) are most commonly observed in the cliques that lie in the upper quartile with respect to the number of sources. Interestingly, it seems that many coordinated attack events are coming from a given IP subspace. Regarding the targeted platforms, several cliques involve a single Class A subnet. Regarding the type of activities, we can observe some commonly targeted ports (e.g., Windows ports used for SMB or RPC, or SQL and VNC ports), but also a large number of uncommon high TCP ports that are normally unused on standard (and clean) machines (such as 6769T, 50286T, 9661T, ...). A non-negligible volume of sources is also due to UDP spammers targeting the Windows Messenger popup service (ports 1026 to 1028/UDP).
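For readers who want to experiment with the graph-based grouping step of Section 2.3.1, the sketch below builds the similarity graph and enumerates maximal cliques with networkx. Note the assumptions: SciPy's jensenshannon (which already returns sqrt(JSD)), an edge threshold of 0.2 picked arbitrarily, and plain maximal-clique enumeration as a simplified stand-in for the dominant-sets algorithm of Pavan et al. [19] used by the authors.

    import itertools
    import networkx as nx
    from scipy.spatial.distance import jensenshannon  # sqrt of the JS divergence

    def clique_clusters(vectors, max_distance=0.2):
        # vectors: one empirical distribution (e.g., over countries) per attack event.
        g = nx.Graph()
        g.add_nodes_from(range(len(vectors)))
        for i, j in itertools.combinations(range(len(vectors)), 2):
            d = jensenshannon(vectors[i], vectors[j])
            if d <= max_distance:            # edge only between similar events
                g.add_edge(i, j, weight=1.0 - d)

        def total_weight(clique):
            return sum(g[u][v]["weight"]
                       for u, v in itertools.combinations(clique, 2))

        # Maximal cliques, largest total edge weight first (a crude proxy
        # for the maximum weight clique problem solved in the paper).
        return sorted(nx.find_cliques(g), key=total_weight, reverse=True)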
2.4 Consolidation of the Knowledge
In order to assess the consistency of the resulting cliques of attack events, it can be useful to see them charted on a two-dimensional map so as to i) verify the proximities among clique members (intra-clique consistency), and ii) understand potential relationships between different cliques that are somehow related (i.e., inter-clique relationships). Moreover, the statistical distances used to compute those cliques make them intrinsically coherent, which also means that certain cliques of events may be somehow related to each other, although they were separated by the clique algorithm. Since most of the feature vectors we are dealing with have a high number of variables (e.g., a geographical vector has more than 200 country variables), the structure of such a high-dimensional data set obviously cannot be displayed directly on a 2D map. Multidimensional scaling (MDS) is a set of methods that can help to address this problem. MDS is based on dimensionality reduction techniques, which aim at converting a high-dimensional dataset into a two- or three-dimensional representation that can be displayed, for example, in a scatter plot. The aim of dimensionality reduction is to preserve as much of the significant structure of the high-dimensional data as possible in the low-dimensional map. As a consequence, MDS allows an analyst to visualize how far observations are from each other for different kinds of similarity measures, which in turn can deliver insights into the underlying structure of the high-dimensional dataset.

Figure 2: Visualization of geographical cliques of attackers. The coloring refers to the different cliques and the red circles indicate their sizes on the low-D map. The superposed text labels indicate only the two top attacking countries for some of the data points.

Because of the intrinsic non-linearity of real-world data sets, we applied a recent MDS technique called t-SNE to visualize each dimension of the data set, and to assess the consistency of the clique results. t-SNE [28] is a variation of Stochastic Neighbour Embedding; it produces significantly better visualizations than other MDS techniques by reducing the tendency to crowd points together in the centre of the map. Moreover, this technique has proven to perform better in retaining both the local and global structure of real, high-dimensional datasets in a single map, in comparison to other non-linear dimensionality reduction techniques such as Sammon mapping, Isomaps or Laplacian Eigenmaps [10]. Stochastic Neighbor Embedding aims at minimizing a cost function that is based on the sum of Kullback-Leibler divergences over all datapoints, using a gradient descent method. t-SNE further improves this technique by using a Student-t distribution, rather than a Gaussian, to compute the similarity between two points in the low-dimensional space (which tends to alleviate the problem of "crowding" points in the center of the map; see [28] for a detailed explanation). Figure 2 shows the resulting two-dimensional plot obtained by mapping the geographical vectors on a 2D map using t-SNE.
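A sketch of this mapping step, using scikit-learn's TSNE on a precomputed sqrt(JSD) distance matrix as a stand-in for the authors' implementation; the perplexity and the other parameters are illustrative and assume more samples than the perplexity value:

    import numpy as np
    from scipy.spatial.distance import jensenshannon
    from sklearn.manifold import TSNE

    def tsne_map(vectors, perplexity=30.0):
        # Pairwise matrix of sqrt(JSD) distances between event distributions.
        n = len(vectors)
        dist = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                dist[i, j] = dist[j, i] = jensenshannon(vectors[i], vectors[j])
        tsne = TSNE(n_components=2, metric="precomputed", init="random",
                    perplexity=perplexity, random_state=0)
        return tsne.fit_transform(dist)   # one (x, y) point per attack event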
Each datapoint on this map represents the geographical distribution of a given attack event. The coloring refers to the clique membership of each event, as obtained previously by applying the clique-based clustering, and the dotted circles indicate the clique sizes. We could easily verify that two adjacent events on the map have highly similar geographical distributions (even from a statistical viewpoint), while two distant events have clearly nothing in common in terms of originating countries. Quite surprisingly, the resulting mapping is far from being chaotic; it presents a relatively sparse structure with clear datapoint groupings, which also means that most of those attack events present very tight relationships regarding their origins. Due to the strict statistical distances used to calculate cliques, this kind of correlation can hardly be obtained by chance only.

Table 1: Some experimental clique results obtained from a honeynet dataset collected from Sep 06 until June 08. (1) The given patterns represent the average distributions for the most prevalent cliques, i.e., the ones lying in the upper quartile in terms of number of sources. For the IP subnets (resp. targeted platforms), the numbers refer to the distributions of originating (resp. targeted) Class A subnets.

Attack Dimension | Nr of Cliques | Max. size (nr events) | Min. size (nr events) | Volume of sources (%) | Most prevalent patterns found in the cliques(1)
Geolocation | 31 | 40 | 3 | 84.4 | (CN,CA,US,FR,TW), (IT,ES,FR,SE,DE,IL), (KR,US,BR,PL,CN,CA), (US,JP,GB,DE,CA,FR,CN,KR), (US,FR,JP,CN,DE,ES,TW), (CA,CN), (PL,DE,ES,HU,FR)
IP Subnets (Class A) | 25 | 51 | 3 | 91.2 | (87,82,151,83,84,81,85,213), (222,221,60,218,58,24,124,121,219,82,220), (201,83,200,24,211,218,89,124,61,82,84), (24,60), (83,84,85,80,88), (193,195,201,202,203,216,200,61,24,84,59)
Targeted platforms | 17 | 86 | 2 | 70.1 | (202), (88,192), (195), (193), (194), (129,134,139,150), (24,213)
Port sequences | 22 | 66 | 4 | 93.2 | (I), (1433T), (I-445T), (5900T), (1026U), (135T), (50286T), (I-445T-139T-445T-139T-445T), (6769T), (1028U-1027U-1026U)

Similar "semantic mappings" can naturally be obtained for the other dimensions (e.g., subnets, platforms, etc.), so as to help assess the quality of other cliques of attackers. To conclude this Section, Figure 3 shows the same geographical mapping on which the port sequences of several attack events have been superposed on top of the datapoints.

Figure 3: Same visualization of the geographical cliques of attackers as Fig. 2, but here the superposed text labels indicate the port sequences targeted by the attackers.

This can help to visualize unobvious relationships among different types of activities and their origins, and it also leads to the natural intuition that an intelligent algorithm could potentially leverage the results of this knowledge discovery process, by combining efficiently different sets of cliques.

3. MULTI-CRITERIA DECISION-MAKING

3.1 Requirements and Motivation
The decision-support component of our method shall take advantage of the knowledge obtained via the extraction of cliques, and of the global semantic mappings obtained through dimensionality reduction.
The final objective consists in re-constructing sequences of attack events that can be attributed with high confidence to the same root phenomenon as a function of multiple criteria. In other words, we want to build an inference engine that takes as input the extracted knowledge to classify incoming attack events into "known phenomena", or otherwise to identify a new phenomenon when needed (e.g., when we observe the first attack event of a new zombie army). There certainly exist many different classification algorithms that are able to map multiple input features to multiple output classes, even for complex, non-linear mappings, such as Support Vector Machines, Artificial Neural Networks, etc. However, we are confronted with specific constraints that do not allow us to use this type of supervised machine learning technique. First, we have a priori zero knowledge of the expected output, which means that we cannot provide training samples showing the characteristics of the output we are looking for. Secondly, we want to include some domain knowledge to specify which types of combinations we expect to be promising in the root cause identification. Third, the inference system must be flexible enough to allow additional criteria to be used in the future, so as to further improve the inference capabilities. Finally, we favor the "white-box" approach having a transparent reasoning process, which allows an expert to understand the reasons (i.e., the combinations of criteria) for which the system has grouped a given set of events into the same root phenomenon. Although large-scale phenomena on the Internet are complex and dynamic, our intuition is that two consecutive attack events should be linked to the same root phenomenon if and only if they share at least two different attack characteristics. That is, we want to build a decision-making process that will attribute two attack events to the same phenomenon when the events' features are "close enough" for any combination of at least two attack dimensions out of the complete set of criteria: {origins, targets, activity, commonIP}. So, we hypothesize that real-world phenomena may well evolve over time, which means that two consecutive attack events of the same zombie army need not necessarily have all their attributes in common. For example, the bot composition of a zombie army may evolve over time because of the cleaning of infected machines and the recruitment of new bots. From our observation viewpoint, this will translate into a certain shift in the IP subnet distribution of the zombie machines for subsequent attack events of this army (and thus, most probably different cliques w.r.t. the origins). Or, a zombie army may be instructed to scan several consecutive IP subnets in a rather short interval of time, which will lead to the observation of different events having highly similar distributions of originating countries and subnets, but those events will target completely different sensors, and may eventually use different exploits (hence, targeting different port sequences).
On the other hand, we consider that only one correlated attack dimension is not sufficient to link two attack events to the same root cause, since the result might then be due to chance only (e.g., a large proportion of attacks originate from some large or popular countries, certain Windows ports are commonly targeted, etc.). However, by intelligently combining several attack viewpoints, we can considerably reduce the probability that two attack events would be attributed to the same root cause whereas they are in fact unrelated.

3.2 Fuzzy Inference Systems
We still need to formally define what the "relatedness degree" between two attack events is, particularly when they do not belong to the same clique but are somehow "close" to each other. Intuitively, attack event characteristics in the real world have unsharp boundaries, and the membership to a given phenomenon can be a matter of degree. For this reason, we have developed a decision-making process that is based on a fuzzy inference system (FIS). The mathematical concepts behind fuzzy reasoning are quite simple and intuitive; in fact, it aims at reproducing the reasoning of a human expert with very simple mathematical functions. Fuzzy inference is thus a convenient way to map an input space to an output space with a flexible and extensible system, using the codification of common sense and expert knowledge. The mapping then provides a basis from which decisions can be made. The main components of an inference system are sketched in Fig. 4.

Figure 4: Main components of a Fuzzy System. (Crisp input variables are fuzzified, all rules are evaluated with pre-defined membership functions, and all rule outputs are combined with an aggregation method.)

To map the input space to the output space, the primary mechanism is a list of if-then statements called rules, which are evaluated in parallel, so the order of the rules is unimportant. Instead of using crisp variables, all inputs are fuzzified using membership functions in order to determine the degree to which the input variables belong to each of the appropriate fuzzy sets. If the antecedent of a given rule has more than one part (i.e., multiple 'if' statements), a fuzzy logical operator is applied to obtain one number that represents the result of the antecedent for that rule. For example, the fuzzy OR operator simply selects the maximum of the two values. The results of all rules are then combined and distilled into a single, crisp value that can be used to make a decision. This aggregation process can be done in two different ways.
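Before detailing the two aggregation styles, here is a toy illustration of the rule-evaluation mechanics just described: inputs are fuzzified by membership functions and the antecedent parts are combined with fuzzy AND/OR (min/max). The triangular membership function is a generic textbook choice, not necessarily the shape used by the authors.

    def triangular(x, left, peak, right):
        # Degree of membership of x in a triangular fuzzy set.
        if x <= left or x >= right:
            return 0.0
        if x <= peak:
            return (x - left) / (peak - left)
        return (right - x) / (right - peak)

    mu_x = triangular(0.8, 0.0, 1.0, 2.0)   # "x is close"
    mu_y = triangular(3.5, 0.0, 1.0, 2.0)   # "y is close" -> 0.0 here
    fired_and = min(mu_x, mu_y)             # fuzzy AND = MIN operator
    fired_or = max(mu_x, mu_y)              # fuzzy OR  = MAX operator
    print(fired_and, fired_or)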
The most common way to calculate the final output of the system is the weighted average of all rule outputs: P i wi .zi F inal output = P i wi When it is possible to model a fuzzy system using Sugenotype inference, the defuzzification and aggregation process is thus greatly simplified and much more efficient than with Mamdani’s inferences, which is why we used a Sugeno-type system to model each attack phenomenon. Concretely, we use the knowledge obtained from the extraction of cliques to build the fuzzy rules that describe the behavior of each phenomenon. The characteristics of new incoming attack events are then used as input to the fuzzy systems that model the phenomena identified so far. In each of those fuzzy systems, the features of the most recent attack event shall define the current parameters of the membership function used to evaluate the following simple rules: if xi is close AND if yi is close then zi is related, ∀i ∈ {geo, subnets, targets, portsequence}. Fig 5 gives a graphical representation of how such a rule is evaluated for the subnets of origins of two given attack events. Since this characteristic is represented by a 2D mapping, we can see the result of evaluating the relative position of the events according to both dimensions (x, y). Each membership function is maximal within the cliques, then it decreases smoothly to take into account the fuzziness of real-world phenomena. In this case, the antecedents of the rule hold respectively 0.16 and 1.0, which results in an output of 0.16 (since a logical AND in fuzzy logic corresponds to the MIN operator). So, the membership functions referred to as “is close” in the fuzzy rules are defined by the characteristics of the cliques to which the attack events belong. The calculation of the rule output zi ∈ [0, 1] is just the intersection between the two curves, which quantifies the inter-relationship between the cliques (and hence, between the attack events). Similarly, we can evaluate the fuzzy rules for the other dimensions considered in the inference system. For the last dimension, i.e. the common IP’s, we use a static membership function whose input is the common IP ratio calculated between the two events. Fig 6 represents this static membership function, where we can see the output ZIP increasing and where F (z1 , z2 , . . . , zn ) = W1 .z1! + W2 .z2! + . . . + Wn .zn! 1 Fuzzy output 0.8 0.6 0.4 0.2 0 0 5 10 Common IP ratio (%) 15 20 Figure 6: Common IP Membership function. smoothly as the ratio of common IP addresses increases from 0 to 10%, where ZIP is then maximal. This curve is actually drawn from our knowledge, or domain experience, in monitoring malicious traffic. Note that, initially, the inference engine has no knowledge, so the first incoming attack event will create the first phenomenon. Then, each time a new event could not be attributed to an existing phenomenon, the inference engine will create a new fuzzy system to model this new emerging phenomenon. The inference engine is thus self-adaptive by design. 3.3 Multi-criteria Decision-making Having formally defined how to evaluate the output of each rule, for each phenomenon, a last problem remains regarding the weighted average that is used as aggregation function in a classical Sugeno inference system. In fact, it does not allow us to express that certain combinations of criteria (or rule outputs) must be somehow prioritized, as previously described in the requirements. 
We need thus to introduce another type of multi-criteria aggregation function that allows to model more complex requirements such as “most of”, or “at least two” criteria to be satisfied in the overall decision function. Yager has introduced in [29] a special type of operator called Ordered Weighted Aggregation (OWA), which allows to include some relationships between multiple criteria in the aggregation process. An OWA operator provides an aggregation function for criteria whose result lies between the classical “and” and “or” operators, which are in fact the two extreme cases. Assume Z1 , Z2 , . . . , Zn are n criteria of concern in our multi-criteria problem. For each criteria, Zi (x) ∈ [0, 1] indicates the degree to which x satisfies that criteria, which corresponds in our case to the rules output of a given fuzzy system. Then, we define a mapping function F : I n → I where I = [0, 1] as an OWA operator of dimension n, if associated with F is a weighting vector W = (W1 , W2 , . . . , Wn ) such that 1. Wi ∈ [0, 1] 2. P i Wi = 1 with zi! being the ith largest element in the collection z1 , ..., zn . That is, Z ! is an ordered vector composed of the elements of Z put in descending order, which means that the weights Wi are associated with a particular ordered position rather than a particular element. Yager [29] has carefully studied the mathematical foundations of OWA operators, and he demonstrated that such operators have the desired properties such as monotonicity, generalized commutativity, associativity and idempotence. To define the weights Wi to be used, Yager suggests two possible approaches: either to use some learning mechanism with sample data and a regression model, or to give some semantics or meaning to the Wi ’s by asking a decision-maker to provide directly those values. We selected the latter approach by defining the weighting vector as W = (0.1, 0.35, 0.35, 0.1, 0.1), which translates our intuition about the dynamic behaviors of large-scale phenomena. It can be interpreted as: “at least three criteria must be satisfied, but the first criteria is of less importance compared to the 2nd and 3rd ones”. These values were carefully chosen in order to avoid the grouping of unrelated events when, for example, two events are coming from popular countries and targeting common (Windows) ports in the same interval of time, but those events are in reality not related to the same phenomenon. In this worst-case scenario, we can imagine that the ordered vector of criteria (obtained from the evaluation of the fuzzy rules) could be something similar to Z = (0.3, 0.1, 0, 1, 0). That is, we have a high correlation for the targeted port sequences (z4 = 1), and we have then some weak correlation (due to chance) for the geographical origins (z1 = 0.3) and also for the subnets of origins (z2 = 0.1). By applying our weighting vector W to Z ! = (1, 0.3, 0.1, 0, 0), we get as final decision value F = 1 ∗ 0.1 + 0.3 ∗ 0.35 + 0.1 ∗ 0.35 = 0.24. By considering other scenarios, we can verify that the values of the weighting vector W work as expected, i.e. it minimizes the final output value in these cases. Moreover, these considerations enable us also to fix our decision threshold to an empirical value of about 0.25. That is, when the final output value F lies under this threshold, we will reject the attribution of the attack event under scrutiny to the current phenomenon whose fuzzy system is being evaluated. 
Finally, when several fuzzy systems provide an output value lying above the threshold, we will obviously chose the highest one to attribute the event; however, this case was rarely observed in our experiments. There exists certainly other alternatives for choosing the Wi ’s, but according to our experimental results, this choice proved to be very effective in identifying sequences of attack events having the same root cause. 4. BEHAVIORAL ANALYSIS OF GLOBAL PHENOMENA 4.1 Main Characteristics In this Section, we provide some experimental results obtained by applying our multi-criteria inference method to the same set of attack events we already introduced in Section 2.3 (clique analysis). As already mentioned, these experimental results only aim at validating the applicability and usefulness of the method proposed. They do not pre- 1 It is important to note that the sizes of the zombie armies given here only reflect the number of sources we could observe on our sensors; the actual sizes of those armies are most probably much larger, even though some churn effects (DHCP, NAT) could also affect these numbers. Total size (nr of sources) 0 1 10k 20k 30k 40k 50k 60k 70k 1 0.9 0.9 CDF size 0.8 0.8 0.7 0.7 0.6 0.6 0.5 0.5 0.4 0.4 0.3 0.3 0.2 0.2 0.1 0.1 0 0 100 200 300 Lifetime (nr of days) 400 F(x) F(x) CDF lifetime 0 600 500 Figure 7: Empirical CDF of the size and lifetime of zombie armies. % / "*$ "*( "*+ "*# =>? tend to offer a complete view of all possible phenomena observable on the Internet. At the contrary, they show that, even with a limited number of data sources, it is possible to observe and reason about a couple of interesting phenomena. Furthermore, these anecdotal, yet representative, examples show that our method helps in characterizing their root cause, i.e., in addressing the attack attribution issue. So, over the whole collection period (640 days), we found about 32 global phenomena. In total, 348 attack events (99% of our data set) could be attributed to a given largescale phenomenon. An in-depth analysis has revealed that most of those phenomena (apart from the noisy network worm W32.Rahack.H [24], also known as W32/Allaple) are quite likely related to zombie armies, i.e., groups of compromised machines belonging to the same botnet(s). We conjecture this for the following main reasons: i) the apparent coordination of the sources, both in time (i.e., coordinated events on several sensors) and in the distribution of tasks (e.g., scanners versus attackers); ii) the short durations of the attack events, typically a few days only, whereas “classical” worms tend to spread over longer, continuous periods of time; iii) the absence of known classical network worm spreading on many of the observed port sequences; and iv) the source growing rate, which has a sort of exponential shape for worms and is somehow different for botnets [13]. To illustrate the results, Table 2 on page 10 presents an overview of some global phenomena found in our dataset. Thanks to our method, we are able to characterize precisely the behaviors of the identified phenomena or zombie armies. Hence, we found that the largest army had in total 57 attack events comprising 69,884 sources, and could survive for about 112 days. The longest lifetime of a zombie army observed so far was still 586 days. Fig. 7 shows the cumulative distributions (CDF) of the lifetime and size of the identified armies. 
Those figures reveal some interesting aspects of their global behaviors: according to our observations, at least 20% of the zombie armies had in total more than ten thousand observable1 sources during their lifetime, and the same proportion of armies could survive on the Internet for at least 250 days. On average, zombie armies have a total size of about 8,500 observed sources, a mean number of 658 sources per event, and their mean survival time is 98 days. Regarding the origins, we observe some very persistent groups of IP subnets and countries of origin across many different armies. On Fig. 8, we can see the CDF of the sources involved in the zombie armies of Table 2, where the x-axis represents the first byte of the IPv4 address space. It appears clearly that malicious sources involved in those phenomena are highly unevenly distributed and form a relatively small number of tight clusters, which account for a significant number of sources and are thus responsible for a large deal of the observed malicious activities. This is consistent with other prior work on monitoring global malicious activities, in particular with previous studies related to measurements of Internet background radiation [4, 17, 31]. However, we are now able to show that there are still some notable differences in the spatial distributions of those zombie armies with respect to the average distribution over "*' 7.4@2A4 B7% B7) B7' B7# B7$ B7%" B7%% B7%& B7&" "*) "*! "*& "*% "/ !" #" $" %&" %'" %(" &%" &)" ,-.)/01234/536200/7!089:4;0< Figure 8: Empirical CDF of sources in IPv4 address space for the 9 zombie armies illustrated in Table 2. all sources (represented with the blue dashed line). In other words, certain armies of compromised machines can have very different spatial distributions, even though there is a large overlap between “zombie-friendly” IP subnets. Moreover, because of the dynamics of this kind of phenomena, we can even observe very different spatial distributions within a same army at different moments of its lifetime. This is a strong advantage of our analysis method that is more precise and enables us to distinguish individual phenomena, instead of global trends, and to follow their dynamic behavior over time. Another interesting observation on Fig. 8 is related to the subnet CDF of ZA1 (uniformly distributed in the IPv4 space, which means randomly chosen source addresses) and ZA20 (a constant distribution coming exclusively from the subnet 24.0.0.0/8). A very likely explanation is that those zombie armies have used spoofed addresses to send UDP spam messages to the Windows Messenger service. So, this indicates that IP spoofing is still possible under the current state of filtering policies implemented by certain ISP’s on the Internet. Finally, in terms of attack capability, we observe that about 50% of the armies could target at least two completely dif- 1 0.5 0 1 0.5 0 1 0.5 0 1 0.5 0 1 0.5 0 0 1400 1 2 3 4 5 6 7 8 9 10 11 12 13 1 2 3 4 5 6 7 8 9 10 11 12 13 1 2 3 4 5 6 7 8 9 10 11 12 13 1 2 3 4 5 6 7 8 9 10 11 12 13 1200 1000 Nr of sources Geo Fuzzy Common output IPs Port Seq Targets Subnets 1 0.5 0 800 600 400 1 2 3 4 5 6 7 8 9 10 11 12 13 1 2 3 4 5 6 7 8 9 10 Attack events (ordered in time) 11 12 13 200 0 100 120 140 160 180 Time (by day) 200 220 Figure 9: Output of the fuzzy inference system (zi and Figure 10: Time series of coordinated attack events for F (zi )) modeling the zombie army nr 12. zombie army ZA10 (i.e., nr of sources observed by day). 
at least two completely different ports (thus, probably at least two different exploits), and one army even had an attack capability greater than 10 (ZA4 in Table 2). At this stage, it is unclear why a zombie army would target such a large number of unusual, high TCP ports (12293T, 15264T, etc.). A recurrent misconfiguration or a P2P phenomenon is thus not excluded; but even in that case, it is very interesting to note that our method was able to attribute all those different events to the same root phenomenon, thanks to the combination of several statistical metrics.

4.2 Some Detailed Examples
In this Section, we further detail two zombie armies to illustrate some typical behaviors we could observe among the identified phenomena, e.g.: i) a move (or drift) in the origins of certain armies (both geographical and IP blocks) during their lifetime; ii) a large scan sweep by the same army targeting several consecutive Class A subnets; iii) within the same army, multiple changes in the port sequences (or exploits) used by zombies to scan or to attack; iv) a coordination between different armies. Zombie army 12 (ZA12) is an interesting case in which we can observe behaviors ii) and iii). Fig. 9 represents the output of the fuzzy system modeling this phenomenon. Each bar graph represents the fuzzy output zi for a given attack dimension, whereas the last plot shows the final aggregated output from which the decision to group those events together was made (i.e., F(zi)). We can clearly see that the targets and the activities of this army have evolved between certain attack events (e.g., when the value of zi is low). That is, this army has been scanning (at least) four consecutive Class A subnets during its lifetime (183 days), while probing at the same time three different ports on these subnetworks. Then, the largest zombie army observed by the sensors (ZA10) has shown behaviors i) and iv). On Fig. 10, we can see that this army had four waves of activity during which it was randomly scanning 5 different subnets (note the almost perfect coordination among those attack events). When inspecting the subnet distributions of those different attack waves, we could clearly observe a drift in the origins of those sources, quite likely as certain machines were infected by (resp. cleaned from) the bot software. Finally, we found another, smaller army (ZA11) that is clearly related to ZA10 (e.g., same temporal behavior, similar activity, same targets); but in this case, a different group of zombie machines (resulting in very different subnet CDFs on Fig. 8) was used to attack only specific IP addresses on our sensors, probably by taking advantage of the results given by the army of scanners (ZA10).

5. CONCLUSIONS
We have introduced a general analysis method to address the complex problem related to "attack attribution". Our approach is based on a novel combination of knowledge discovery and a multi-criteria fuzzy decision-making process. By applying this method, we have shown how apparently unrelated attack events can be attributed to the same global attack phenomenon, or to the same army of zombie machines operating in a coordinated manner. To the best of our knowledge, this is the first formal, systematic and rigorous method that enables us to identify and characterize precisely the behaviors of those large-scale attack phenomena. As future work, we envisage extending our method to other data sets, such as high-interaction (eventually client) honeypot data, or malware data sets, and including even more relevant attack features so as to further improve the inference capabilities of the system, and thus also our insights into malicious behaviors observed on the Internet.

Acknowledgments
This work has been partially supported by the European Commission through project FP7-ICT-216026-WOMBAT funded by the 7th framework program. The opinions expressed in this paper are those of the authors and do not necessarily reflect the views of the European Commission.
As future work, we envisage to extend our method to other data sets, such as high-interaction (eventually client) honeypot data, or malware data sets, and to include even more relevant attack features so as to improve further the inference capabilities of the system, and thus also our insights into malicious behaviors observed on the Internet. Acknowledgments This work has been partially supported by the European Commission through project FP7-ICT-216026-WOMBAT funded by the 7th framework program. The opinions expressed in this paper are those of the authors and do not necessarily reflect the views of the European Commission. Id Nr of events 10 Total size Lifetime (nr sources) (nr days) (Class A- subnets) 1 18,468 535 24.*,193.*,195.*,213.* Targeted sensors Attack capability Main origins 4 82 26,962 321 202.* 5 13 9,644 131 195.* 6 15 51,598 >1 year > 7 subnets 9 23 11,198 218 192.*,193.*,194.* 10 57 69,884 112 128.*,129.*,134.*,139.*,150.* 11 14 2,636 110 129.*,134.*,139.*,150.* I-445T-139T-445T-139T-445T 12 14 27,442 183 192.*,193.*,194.*,195.* 1025T,1433T,2967T 20 10 30,435 337 24.*, 129.*, 195.* (countries / subnets) 1026U 12293T,15264T,18462T,25083T,25618T,28238T,29188T, 32878T,33018T,38009T,4152T,46030T,4662T,50286T,. . . 135T,139T,1433T,2968T,5900T ICMP (W32.Rahack.H / Allaple) 2967T,2968T,5900T I-I445T 1026U,1026U1028U1027U,1027U US,JP,GB,DE,CA,FR,CN,KR,NL,IT 69,128,195,60,81,214,211,132,87,63 IT,ES,DE,FR,IL,SE,PL 87,82,83,84,151,85,81,88,80 CN,US,PL,IN,KR,JP,FR,MX,CA 218,61,222,83,195,221,202,24,219 KR,US,BR,PL,CN,CA,FR,MX,TW 201,83,200,24,211,218,89,124 US,CN,TW,FR,DE,CA,BR,IT,RU 193,200,24,71,70,213,216,66 CN,CA,US,FR,TW,IT,JP,DE 222,221,60,218,58,24,70,124 US,FR,CA,TW,IT 82,71,24,70,68,88,87 US,JP,CN,FR,TR,DE,KR,GB 218,125,88,222,24,60,220,85,82 CA,CN 24,60 Table 2: Overview of some large-scale phenomena found in a honeynet dataset collected from Sep 06 until Jun 08. 6. REFERENCES [1] Paul Barford and David Plonka. Characteristics of network traffic flow anomalies. In In Proceedings of ACM SIGCOMM Internet Measurement Workshop, 2001. [2] Paul Barford and Vinod Yegneswaran. An Inside Look at Botnets. Advances in Information Security. Springer, 2006. [3] David Barroso. Botnets - the silent threat. In European Network and Information Security Agency (ENISA), November 2007. [4] Zesheng Chen, Chuanyi Ji, and Paul Barford. Spatial-temporal characteristics of internet malicious sources. In Proceedings of INFOCOM, 2008. [5] M. P. Collins, T. J. Shimeall, S. Faber, J. Janies, R. Weaver, M. De Shon, and J. Kadane. Using uncleanliness to predict future botnet addresses. In IMC ’07: Proceedings of the 7th ACM SIGCOMM conference on Internet measurement, pages 93–104, New York, NY, USA, 2007. ACM. [6] Evan Cooke, Farnam Jahanian, and Danny McPherson. The Zombie roundup: Understanding, detecting, and disrupting botnets. In Proceedings of the Steps to Reducing Unwanted Traffic on the Internet (SRUTI 2005 Workshop), Cambridge, MA, July 2005. [7] B. Fuglede and F. Topsoe. Jensen-shannon divergence and hilbert space embedding. pages 31–, June-2 July 2004. [8] G. Gu, R. Perdisci, J. Zhang, and W. Lee. BotMiner: Clustering analysis of network traffic for protocol- and structure-independent botnet detection. In Proceedings of the 17th USENIX Security Symposium, 2008. [9] Guofei Gu, Junjie Zhang, and Wenke Lee. BotSniffer: Detecting botnet command and control channels in network traffic. In Proceedings of the 15th Annual Network and Distributed System Security Symposium (NDSS’08), February 2008. 
[10] Geoffrey Hinton and Sam Roweis. Stochastic neighbor embedding. In Advances in Neural Information Processing Systems 15, volume 15, pages 833–840, 2003. [11] A.K. Jain and R.C. Dubes. Algorithms for Clustering Data. Prentice-Hall advanced reference series, 1988. [12] S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics 22: 79-86., 1951. [13] Wenke Lee, Cliff Wang, and David Dagon, editors. Botnet Detection: Countering the Largest Security Threat, volume 36 of Advances in Information Security. Springer, 2008. [14] C. Leita, V.H. Pham, O. Thonnard, E. Ramirez-Silva, F. Pouget, E. Kirda, and Dacier M. The Leurre.com Project: Collecting Internet Threats Information Using a Worldwide Distributed Honeynet. In Proceedings of the [15] [16] [17] [18] [19] [20] [21] [22] [23] [24] [25] [26] [27] [28] WOMBAT Workshop on Information Security Threats Data Collection and Sharing, WISTDCS 2008. IEEE Computer Society press, April 2008. J. Lin. Divergence measures based on the shannon entropy. Information Theory, IEEE Transactions on, 37(1):145–151, Jan 1991. E. H. Mamdani and S. Assilian. An experiment in linguistic synthesis with a fuzzy logic controller. Int. J. Hum.-Comput. Stud., 51(2):135–147, 1999. Ruoming Pang, Vinod Yegneswaran, Paul Barford, Vern Paxson, and Larry Peterson. Characteristics of internet background radiation. In IMC ’04: Proceedings of the 4th ACM SIGCOMM conference on Internet measurement, pages 27–40, New York, NY, USA, 2004. ACM. Markus Kötter Georg Wicherski Paul Bächer, Thorsten Holz. Know your enemy: Tracking botnets. In http://www.honeynet.org/papers/bots/. M. Pavan and M. Pelillo. A new graph-theoretic approach to clustering and segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2003. V. Pham, M. Dacier, G. Urvoy Keller, and T. En Najjary. The quest for multi-headed worms. In DIMVA 2008, 5th Conference on Detection of Intrusions and Malware & Vulnerability Assessment, July, 2008, Paris, France, Jul 2008. F. Pouget and M. Dacier. Honeypot-based forensics. In AusCERT2004, AusCERT Asia Pacific Information technology Security Conference 2004, 23rd - 27th May 2004, Brisbane, Australia, 2004. The Leurre.com Project. http://www.leurrecom.org. M. Abu Rajab, J. Zarfoss, F. Monrose, and A. Terzis. A multifaceted approach to understanding the botnet phenomenon. In IMC ’06: Proceedings of the 6th ACM SIGCOMM conference on Internet measurement, pages 41–52, New York, NY, USA, 2006. ACM. Symantec Security Response. W32.rahack.h, [april 2009]. Michio Sugeno. Industrial Applications of Fuzzy Control. Elsevier Science Inc., New York, NY, USA, 1985. Olivier Thonnard and Marc Dacier. A framework for attack patterns’ discovery in honeynet data. DFRWS 2008, 8th Digital Forensics Research Conference, August 11- 13, 2008, Baltimore, USA, 2008. Olivier Thonnard and Marc Dacier. Actionable knowledge discovery for threats intelligence support using a multi-dimensional data mining methodology. In ICDM’08, 8th IEEE International Conference on Data Mining series, December 15-19, 2008, Pisa, Italy, Dec 2008. Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 9:2579–2605, November 2008. [29] Ronald R. Yager. On ordered weighted averaging aggregation operators in multicriteria decisionmaking. IEEE Trans. Syst. Man Cybern., 18(1):183–190, 1988. [30] V Yegneswaran, P Barford, and V Paxson. Using honeynets for internet situational awareness. 
Malware Detection using Statistical Analysis of Byte-Level File Content

S. Momina Tabish, M. Zubair Shafiq, Muddassar Farooq
Next Generation Intelligent Networks Research Center (nexGIN RC)
National University of Computer & Emerging Sciences (FAST-NUCES)
Islamabad, 44000, Pakistan
{momina.tabish,zubair.shafiq,muddassar.farooq}@nexginrc.org

ABSTRACT
Commercial anti-virus software is unable to provide protection against newly launched (a.k.a. "zero-day") malware. In this paper, we propose a novel malware detection technique which is based on the analysis of byte-level file content. The novelty of our approach, compared with existing content-based mining schemes, is that it does not memorize specific byte sequences or strings appearing in the actual file content. Our technique is non-signature based and therefore has the potential to detect previously unknown and zero-day malware. We compute a wide range of statistical and information-theoretic features in a block-wise manner to quantify the byte-level file content. We leverage standard data mining algorithms to classify the file content of every block as normal or potentially malicious. Finally, we correlate the block-wise classification results of a given file to categorize it as benign or malware. Since the proposed scheme operates on the byte-level file content, it does not require any a priori information about the filetype. We have tested our proposed technique using a benign dataset comprising six different filetypes — DOC, EXE, JPG, MP3, PDF and ZIP — and a malware dataset comprising six different malware types — backdoor, trojan, virus, worm, constructor and miscellaneous. We also perform a comparison with existing data mining based malware detection techniques. The results of our experiments show that the proposed non-signature based technique surpasses the existing techniques and achieves more than 90% detection accuracy.

Categories and Subject Descriptors
D.4.6 [Security and Protection]: Invasive Software

General Terms
Experimentation, Security

Keywords
Computer Malware, Data Mining, Forensics

1. INTRODUCTION
Sophisticated malware is becoming a major threat to the usability, security and privacy of computer systems and networks worldwide [1], [2]. A wide range of host-based solutions have been proposed by researchers, and a number of commercial anti-virus (AV) software products are also available in the market [5]–[21]. These techniques can broadly be classified into two types: (1) static, and (2) dynamic. Static techniques mostly operate on machine-level code and disassembled instructions. In comparison, dynamic techniques mostly monitor the behavior of a program with the help of an API call sequence generated at run-time. The application of dynamic techniques in AV products is of limited use because of the large processing overheads incurred during run-time monitoring of API calls; as a result, the performance of computer systems significantly degrades. In comparison, the processing overhead is not a serious concern for static techniques because the scanning activity can be scheduled offline in an idle time. Moreover, static techniques can also be deployed as an in-cloud network service that moves complexity from an end-point to the network cloud [28]. Almost all static malware detection techniques, including commercial AV software — whether signature-, heuristic-, or anomaly-based — use specific content signatures such as byte sequences and strings.
A major problem with content signatures is that they can easily be defeated by packing and basic code obfuscation techniques [3]. In fact, the majority of malware that appears today is a simple repacked version of old malware [4]. As a result, it effectively evades the content signatures of old malware stored in the databases of commercial AV products. To conclude, existing commercial AV products cannot even detect a simple repacked version of previously detected malware.

The security community has expended significant effort on the application of data mining techniques to discover patterns in malware content which are not easily evaded by code obfuscation techniques. The two most well-known data mining based malware detection techniques are 'strings' (proposed by Schultz et al [7]) and 'KM' (proposed by Kolter et al [8]). We take these techniques as a benchmark for a comparative study of our proposed scheme. The novelty of our proposed technique — in contrast to the existing data mining based techniques — is its purely non-signature paradigm: it does not remember exact file contents for malware detection. It is a static malware detection technique which should be, intuitively speaking, robust to the most commonly used evasion techniques. The proposed technique computes a diverse set of statistical and information-theoretic features in a block-wise manner on the byte-level file content. The generated feature vector of every block is then given as an input to standard data mining algorithms (J48 decision trees) which classify the block as normal (n) or potentially malicious (pm). Finally, the classification results of all blocks are correlated to categorize the given file as benign (B) or malware (M). If a file is split into k equally-sized blocks (b_1, b_2, b_3, ..., b_k) and n statistical features are computed for every block b_k (f_{k,1}, f_{k,2}, f_{k,3}, ..., f_{k,n}), then mathematically our scheme can be represented as:

\begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_k \end{pmatrix}
\overset{F}{\Rightarrow}
\begin{pmatrix} f_{1,1}, f_{1,2}, \cdots, f_{1,n} \\ f_{2,1}, f_{2,2}, \cdots, f_{2,n} \\ \vdots \\ f_{k,1}, f_{k,2}, \cdots, f_{k,n} \end{pmatrix}
\overset{D}{\Rightarrow}
\begin{pmatrix} n/pm \\ n/pm \\ \vdots \\ n/pm \end{pmatrix}
\overset{C}{\Rightarrow} B/M,

where F is a suitable feature set and D is a data mining algorithm for the classification of individual blocks. The file is eventually categorized as benign (B) or malware (M) by the correlation module (C). Once a suitable feature set (F) and a data mining algorithm (D) are selected, we test the accuracy of the solution using a benign dataset consisting of six filetypes — DOC, EXE, JPG, MP3, PDF and ZIP — and a malware dataset comprising six different malware types — backdoor, trojan, virus, worm, constructor and miscellaneous. The results of our experiments show that our scheme is able to provide more than 90% detection accuracy for detecting malware, which is an encouraging outcome. (Throughout this text, the terms detection accuracy and Area Under ROC Curve (AUC) are used interchangeably. The AUC (0 ≤ AUC ≤ 1) is used as a yardstick to determine the detection accuracy; higher values of AUC mean a high true positive (tp) rate and a low false positive (fp) rate [30]. At AUC = 1, tp rate = 1 and fp rate = 0.) To the best of our knowledge, this is the first purely non-signature based data mining malware detection technique using statistical analysis of the static byte-level file contents.
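To make the block-wise scheme above concrete, the following is a minimal sketch of the B, F, D and C stages under stated assumptions: the feature set is reduced to a single 1-gram entropy feature, a pre-trained scikit-learn decision tree stands in for the boosted J48 used in the paper, and all helper names (split_blocks, block_features, classify_file) are illustrative rather than taken from the original implementation.

# Minimal sketch of the B -> F -> D -> C pipeline (not the authors' code).
# Assumptions: 1 KB blocks, a single entropy feature per block, and a
# pre-trained scikit-learn decision tree standing in for boosted J48.
import math
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

BLOCK_SIZE = 1024  # bytes per block, as used in the paper

def split_blocks(data: bytes):
    """B: split the raw byte stream into fixed-sized blocks."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

def block_features(block: bytes):
    """F: compute features on the block's 1-gram histogram (entropy only here)."""
    counts = Counter(block)
    total = len(block)
    return [-sum((c / total) * math.log2(c / total) for c in counts.values())]

def classify_file(data: bytes, clf, threshold: float = 0.5):
    """D + C: label each block (1 = pm, 0 = n), then correlate the labels."""
    feats = [block_features(b) for b in split_blocks(data)]
    labels = clf.predict(feats)
    return "M" if labels.mean() > threshold else "B"

A trained classifier clf would be obtained by fitting on labeled block features, as described for the classification module in Section 3.3.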
The rest of the paper is organized as follows. In Section 2 we provide a brief overview of related work in the domain of data mining based malware detection techniques. We describe the detailed architecture of our malware detection technique in Section 3. We discuss the dataset in Section 4. We report the results of pilot studies in Section 5. In Section 6, we discuss the knowledge discovery process of the proposed technique by visualizing the learning models of the data mining algorithms used for classification. Finally, we conclude the paper with an outlook on our future work.

2. RELATED WORK
In this section we explain the details of the most relevant malware detection techniques: (1) 'strings' by Schultz et al [7], (2) 'KM' by Kolter et al [8] and (3) 'NG' by Stolfo et al [16], [17]. In our comparative study, we use 'strings' and 'KM' as benchmarks for comparison, whereas 'NG' effectively uses just one of the many statistical features used in our proposed technique.

2.1 Schultz et al — Strings
In [7], Schultz et al use several data mining techniques to distinguish between benign and malicious executables in Windows or MS-DOS format. They performed experiments on a dataset that consists of 1,001 benign and 3,265 malicious executables. Of these executables, 206 benign and 38 malicious samples are in the portable executable (PE) file format. They collected most of the benign executables from Windows 98 systems. They use three different approaches to statically extract features from executables. The first approach extracts DLL information inside PE executables. Further, the DLL information is extracted using three types of feature vectors: (1) the list of DLLs (30 boolean values), (2) the list of DLL function calls (2,229 boolean values), and (3) the number of different function calls within each DLL (30 integer values). RIPPER — an inductive rule-learning algorithm — is used on top of every feature vector for classification. These schemes based on DLL information provide overall detection accuracies of 83.62%, 88.36% and 89.07%, respectively. Not enough detail about the DLL features is provided, so we could not implement this scheme in our study. The second feature extraction approach extracts strings from the executables using the GNU strings program. A naïve Bayes classifier is used on top of the extracted strings for malware detection. This scheme provides an overall detection accuracy of 97.11%. This scheme is reported to give the best results of all, and we have implemented it for our comparative study. The third feature extraction approach uses byte sequences (n-grams) obtained with hexdump. The authors do not explicitly specify the value of n used in their study; however, from an example provided in the paper, we deduce it to be 2 (bi-grams). The Multi-Naïve Bayes algorithm is used for classification. This algorithm uses voting by a collection of individual naïve Bayes instances. This scheme provides an overall detection accuracy of 96.88%. The results of their experiments reveal that the naïve Bayes algorithm with strings is the most effective approach for detecting unseen malicious executables with reasonable processing overheads. The authors acknowledge the fact that the string features are not robust and can easily be defeated. Multi-Naïve Bayes with byte sequences also provides a relatively high detection accuracy; however, it has large processing and memory requirements. The byte-sequence technique was later improved by Kolter et al and is explained below.
2.2 Kolter et al — KM
In [8], Kolter et al use n-gram analysis and data mining approaches to detect malicious executables in the wild. They use n-gram analysis to extract features from 1,971 benign and 1,651 malicious PE files. The PE files were collected from machines running the Windows 2000 and XP operating systems. The malicious PE files are taken from an older version of the VX Heavens Virus Collection [27]. The authors evaluate their approach on two classification problems: (1) classification between benign and malicious executables, and (2) categorization of executables as a function of their payload. The authors categorized only three types — mailer, backdoor and virus — due to the limited number of malware samples. The top n-grams with the highest information gain are taken as binary features (T if present and F if absent) for every PE file. The authors conducted pilot studies to determine the size of n-grams, the size of words and the number of top n-grams to be selected as features. A smaller dataset consisting of 561 benign and 476 malicious executables was considered in this study. They used 4-grams, one-byte words and the top 500 n-grams as features. Several inductive learning methods, namely an instance-based learner, naïve Bayes, support vector machines, decision trees, and boosted versions of the instance-based learner, naïve Bayes, support vector machines and decision trees, are used for classification. The same features are provided as input to all classifiers. They report the detection accuracy as the area under an ROC curve (AUC), which is a more complete measure than plain detection accuracy [29]. The AUCs show that the boosted decision trees outperform the rest of the classifiers for both classification problems.

2.3 Stolfo et al — NG
In a seminal work, Stolfo et al used n-gram analysis for filetype identification [15] and later for malware detection [16], [17]. Their earlier work, called fileprint analysis, uses the 1-gram byte distribution of a whole file and compares it with different filetype models to determine the filetype. In their later work, they detect malware embedded in DOC and PDF files using three different models — single centroid, multi-centroid, and exemplar — of the benign byte distribution of whole files. A distance measure, called the Mahalanobis distance, is calculated between the n-gram distribution of these models and a given test file. They also use 1-gram and 2-gram distributions to test their approach on a dataset comprising 31 benign application executables, 331 benign executables from the System32 folder and 571 viruses. The experimental results have shown that their proposed technique is able to detect a considerable proportion of the malicious files. However, their proposed technique is specific to embedded malware and does not deal with the detection of stand-alone malware.
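For comparison with the statistical features introduced later, the sketch below illustrates KM-style binary n-gram feature extraction under stated assumptions: the top n-gram vocabulary is assumed to have already been selected by information gain, and the function names are hypothetical rather than from the KM implementation.

# Sketch of KM-style binary n-gram features (T if an n-gram is present,
# F if absent). The top_ngrams vocabulary is assumed to have been chosen
# beforehand by information gain; names are illustrative.
def byte_ngrams(data: bytes, n: int = 4):
    """Return the set of byte n-grams occurring in the file."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}

def binary_feature_vector(data: bytes, top_ngrams):
    """One boolean feature per selected n-gram."""
    present = byte_ngrams(data)
    return [g in present for g in top_ngrams]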
3. ARCHITECTURE OF PROPOSED TECHNIQUE
In this section, we explain our data mining based malware detection technique. It is important to emphasize that our technique does not require any a priori information about the filetype of a given file; as a result, our scheme is robust to subtle file header obfuscations crafted by an attacker. Moreover, our technique is also able to classify malware as a function of its payload, i.e. it can detect the family of a given malware sample. The architecture of the proposed technique is shown in Figure 1. It consists of four modules: (1) block generator (B), (2) feature extractor (F), (3) data miner (D), and (4) correlation (C). It can accept any type of file as an input. We now discuss the details of every module.

3.1 Block Generator Module (B)
The block generator module divides the byte-level contents of a given file into fixed-sized chunks — known as blocks. We have used blocks to reduce the processing overhead of the module. In future, we want to analyze the benefit of using variable-sized blocks as well. Remember that using a suitable block size plays a critical role in defining the accuracy of our framework because it puts a lower limit on the minimum size of malware that our framework can detect. We have to strike a trade-off between the amount of available information per block and the accuracy of the system. In this study, we have set the block size to 1024 bytes (= 1K). For instance, if the file size is 100K bytes then the file is split into 100 blocks. The frequency histograms for 1-, 2-, 3-, and 4-grams of byte values are calculated for each block. These histograms are given as input to the feature extraction module.

3.2 Feature Extraction Module (F)
The feature extraction module computes a number of statistical and information-theoretic features on the histograms of each block generated by the previous module. Overall, the feature set consists of 13 diverse features, which are separately computed on the 1-, 2-, 3-, and 4-gram frequency histograms. (In the rest of the paper, we use the generic term n-grams when we refer to all four gram sizes collectively.) This brings the total size of the feature set to 52 features per block. We now provide brief descriptions of every feature.

3.2.1 Simpson's Index
Simpson's index [31] is a measure defined in an ecosystem, which quantifies the diversity of species in a habitat. It is calculated using the following equation:

S_i = \frac{\sum n(n-1)}{N(N-1)},  (1)

where n is the frequency of byte values of consecutive n-grams and N is the total number of bytes in a block, i.e., 1000 in our case. A value of zero shows no significant difference between the frequencies of n-grams in a block. Similarly, as the value of S_i increases, the variance in the frequencies of n-grams in a block also increases.

In all subsequent feature definitions, we denote by X_j the frequency of the j-th n-gram in the i-th block, where j varies from 0-255 for 1-grams, 0-65535 for 2-grams, 0-16777215 for 3-grams, and 0-4294967295 for 4-grams.

3.2.2 Canberra Distance
This distance measures the sum of a series of fractional differences between the coordinates of a pair of objects [31]. Mathematically, we represent it as:

CA_i = \sum_{j=0}^{n} \frac{|X_j - X_{j+1}|}{|X_j| + |X_{j+1}|}  (2)

3.2.3 Minkowski Distance of Order \lambda
This is a generalized metric to measure the absolute magnitude of the differences between a pair of objects:

m_i = \sqrt[\lambda]{\sum_{j=0}^{n} |X_j - X_{j+1}|^{\lambda}},  (3)

where we have used \lambda = 3 as suggested in [31].

3.2.4 Manhattan Distance
It is a special case of the Minkowski distance [31] with \lambda = 1:

mh_i = \sum_{j=0}^{n} |X_j - X_{j+1}|  (4)
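The sketch below shows how the four features above could be computed on a block's n-gram frequency histogram, read as a dense list X indexed by n-gram value; the guard against zero denominators in the Canberra distance is our addition, not part of the original definitions.

# Sketch of equations (1)-(4) on a block histogram X = [X_0, ..., X_n].
# Consecutive entries X_j and X_{j+1} are compared, as in the text;
# the zero-denominator guard in canberra() is our addition.
def simpson_index(X):
    N = sum(X)
    return sum(n * (n - 1) for n in X) / (N * (N - 1))

def canberra(X):
    return sum(abs(a - b) / (abs(a) + abs(b))
               for a, b in zip(X, X[1:]) if a or b)

def minkowski(X, lam=3):
    return sum(abs(a - b) ** lam for a, b in zip(X, X[1:])) ** (1.0 / lam)

def manhattan(X):
    return sum(abs(a - b) for a, b in zip(X, X[1:]))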
Figure 1: Architecture of our proposed technique

3.2.5 Chebyshev Distance
The Chebyshev distance measure is also called maximum value distance. It measures the absolute magnitude of the differences between the coordinates of a pair of objects:

ch_i = \max_j |X_j - X_{j+1}|  (5)

It is a special case of the Minkowski distance [31] with \lambda = \infty.

3.2.6 Bray Curtis Distance
It is a normalized distance measure which is defined as the ratio of the absolute difference of the frequencies of n-grams and the sum of their frequencies [31]:

bc_i = \frac{\sum_{j=0}^{n} |X_j - X_{j+1}|}{\sum_{j=0}^{n} (X_j + X_{j+1})}  (6)

3.2.7 Angular Separation
This feature models the similarity of two vectors by taking the cosine of the angle between them [31]. A higher value of the angular separation between two vectors shows that they are similar:

AS_i = \frac{\sum_{j=0}^{n} X_j \cdot X_{j+1}}{\left( \sum_{j=0}^{n} X_j^2 \cdot \sum_{j=0}^{n} X_{j+1}^2 \right)^{1/2}}  (7)

3.2.8 Correlation Coefficient
The standard angular separation between two vectors, centered around the mean of their magnitude values, is called the correlation coefficient [31]. This again measures the similarity between two vectors:

CC_i = \frac{\sum_{j=0}^{n} (X_j - \bar{X}_i) \cdot (X_{j+1} - \bar{X}_i)}{\left( \sum_{j=0}^{n} (X_j - \bar{X}_i)^2 \cdot \sum_{j=0}^{n} (X_{j+1} - \bar{X}_i)^2 \right)^{1/2}},  (8)

where \bar{X}_i is the mean of the frequencies of n-grams in a given block i.

3.2.9 Entropy
Entropy measures the degree of dispersal or concentration of a distribution. In information-theoretic terms, the entropy of a probability distribution defines the minimum average number of bits that a source requires to transmit symbols according to that distribution [31]. Let R be a discrete random variable such that R = \{r_i, i \in \Delta_n\}, where \Delta_n is the image of the random variable. Then the entropy of R is defined as:

E(R) = - \sum_{i \in \Delta_n} t(r_i) \log_2 t(r_i),  (9)

where t(r_i) is the frequency of n-grams in a given block.

3.2.10 Kullback-Leibler Divergence
The KL divergence is a measure of the difference between two probability distributions [31]. It is often referred to as a distance measure between two distributions. Mathematically, it is represented as:

KL_i(X_j \,\|\, X_{j+1}) = \sum_{j=0}^{n} X_j \log \frac{X_j}{X_{j+1}}  (10)

3.2.11 Jensen-Shannon Divergence
It is a popular measure in probability theory and statistics and measures the similarity between two probability distributions [31]. It is also known as the Information Radius (IRad). Mathematically, it is represented as:

JSD_i(X_j \,\|\, X_{j+1}) = \frac{1}{2} D(X_j \,\|\, M) + \frac{1}{2} D(X_{j+1} \,\|\, M),  (11)

where M = \frac{1}{2}(X_j + X_{j+1}).

3.2.12 Itakura-Saito Divergence
The Itakura-Saito divergence is a special form of Bregman distance [31]:

B_F(X_j, X_{j+1}) = \sum_{j=0}^{n} \left( \frac{X_j}{X_{j+1}} - \log \frac{X_j}{X_{j+1}} - 1 \right),  (12)

which is generated by the convex function F_i(X_j) = -\log X_j.

3.2.13 Total Variation
It measures the largest possible difference between two probability distributions X_j and X_{j+1} [31]. It is defined as:

\delta_i(X_j, X_{j+1}) = \frac{1}{2} \sum_j |X_j - X_{j+1}|  (13)

3.3 Data Mining based Classification Module (D)
The classification module gets as input the feature vectors in the form of an arff file [26]. This feature file is then presented for classification to six sub-modules. The six sub-modules contain learnt models of the six types of malicious files: backdoor, virus, worm, trojan, constructor and miscellaneous. The feature vector file is presented to all sub-modules in parallel and they produce an output of n or pm per block. In addition, the output of the classification sub-modules provides us insights into the payload of the malicious file. A boosted decision tree is used for classifying each block as n or pm. We have used the AdaBoostM1 algorithm for boosting the decision tree (J48) [24]. We selected this classifier after extensive pilot studies which are detailed in Section 5. We provide brief explanations of the decision tree (J48) and the boosting algorithm (AdaBoostM1) below.

3.3.1 Decision Tree (J48)
We have used the C4.5 decision tree (J48) implemented in the Waikato Environment for Knowledge Analysis (WEKA) [26]. It uses the concept of information entropy to build the tree. Every feature is used to split the dataset into smaller subsets and the normalized information gain is calculated. The feature with the highest information gain is selected for decision making.

3.3.2 Boosting (AdaBoostM1)
We have used the AdaBoostM1 algorithm implemented in WEKA [24]. As the name suggests, it is a meta-algorithm which is designed to improve the performance of base learning algorithms. AdaBoostM1 keeps calling the weak classifier up to a pre-defined number of times and reweights those instances that have resulted in misclassification. In this way, it keeps adapting itself to the ongoing classification process. It is known to be sensitive to outliers and noisy data.

3.4 Correlation Module (C)
The correlation module gets the per-block classification results in the form of n or pm. It then calculates the correlation among the blocks which are labeled n or pm. Depending upon the fraction of n and pm blocks in a file, the file is classified as malicious or benign. We can also set a threshold for tweaking the final classification decision. For instance, if we set the threshold to 0.5, a file having 4 benign and 6 malicious blocks will be classified as malicious, and vice-versa.
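A rough analog of the classification and correlation modules can be sketched with scikit-learn, whose AdaBoost over a CART decision tree plays the role of WEKA's AdaBoostM1 over J48 (related, but not identical, implementations); the estimator keyword assumes a recent scikit-learn, and the 0.5 threshold follows the example in Section 3.4.

# Sketch of the D and C modules. scikit-learn's AdaBoostClassifier over a
# CART decision tree is used here as a stand-in for WEKA's AdaBoostM1
# over J48; it is a related but not identical implementation.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

def train_block_classifier(block_feature_vectors, block_labels):
    """Fit one per-block model (labels: 1 = pm, 0 = n)."""
    clf = AdaBoostClassifier(estimator=DecisionTreeClassifier())
    clf.fit(block_feature_vectors, block_labels)
    return clf

def correlate(block_labels, threshold=0.5):
    """C: call the file malicious if the fraction of pm blocks exceeds the threshold."""
    frac_pm = sum(block_labels) / len(block_labels)
    return "Malicious" if frac_pm > threshold else "Benign"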
4. DATASET
In this section, we present an overview of the dataset used in our study. We first describe the benign and then the malware dataset used in our experiments. It is important to note that in this work we are not differentiating between packed and non-packed files. Our scheme works regardless of the packed/non-packed nature of the file.

4.1 Benign Dataset
The benign dataset for our experiments consists of six different filetypes: DOC, EXE, JPG, MP3, PDF and ZIP. These filetypes encompass a broad spectrum of commonly used files ranging from compressed to redundant and from executables to document files. Each set of benign files contains 300 typical samples of the corresponding filetype, which provides us with 1,800 benign files in total. We have ensured the generality of the benign dataset by randomizing the sample sources. More specifically, we queried well-known search engines with random keywords to collect these files. In addition, typical samples were also collected from the local network of our virology lab. Some pertinent statistics of the benign dataset used in this study are tabulated in Table 1.

Table 1: Statistics of benign files used in this study
Filetype | Qty. | Avg. Size (kilo-bytes) | Min. Size (kilo-bytes) | Max. Size (kilo-bytes)
DOC | 300 | 1,015.2 | 44  | 7,706
EXE | 300 | 4,095.0 | 48  | 15,005
JPG | 300 | 1,097.8 | 3   | 1,629
MP3 | 300 | 3,384.4 | 654 | 6,210
PDF | 300 | 1,513.1 | 25  | 20,188
ZIP | 300 | 1,489.6 | 8   | 9,860

It can be observed from Table 1 that the benign files have very diverse sizes, varying from 3 KB to 20 MB, with an average file size of approximately 2 MB. The divergence in the sizes of benign files is important, as malicious programs are inherently smaller in size for ease of propagation.
Table 2: Statistics of malware used in this study
Maj. Category | Qty. | Avg. Size (kilo-bytes) | Min. Size (bytes) | Max. Size (kilo-bytes)
Backdoor      | 3,444 | 285.6 | 56  | 9,502
Constructor   | 172   | 398.5 | 371 | 5,971
Trojan        | 3,114 | 135.7 | 12  | 4,014
Virus         | 1,048 | 50.7  | 175 | 1,332
Worm          | 1,471 | 72.3  | 44  | 2,733
Miscellaneous | 1,062 | 197.7 | 371 | 14,692

4.2 Malware Dataset
We have used the 'VX Heavens Virus Collection' [27] database, which is available for free download in the public domain. Malware samples, especially recent ones, are not easily available on the Internet. Computer security corporations do have extensive malware collections, but unfortunately they do not share their malware databases for research purposes. This is a comprehensive database that contains a total of 37,420 malware samples. The samples consist of backdoors, constructors, flooders, bots, nukers, sniffers, droppers, spyware, viruses, worms, trojans, etc. We only consider Win32 based malware in the PE file format. The filtered dataset used in this study contains 10,311 Win32 malware samples. To make our study more comprehensive, we divide the malicious executables based on the function of their payload. The malicious executables are divided into six major categories: backdoor, trojan, virus, worm, constructor, and miscellaneous (malware like nuker, flooder, virtool, hacktool, etc.). We now provide a brief explanation of each of the six malware categories.

4.2.1 Backdoor
A backdoor is a program which allows bypassing of the standard authentication methods of an operating system. As a result, remote access to computer systems is possible without the explicit consent of the users. Information logging and sniffing activities are possible using the gained remote access.

4.2.2 Constructor
This category of malware mostly includes toolkits for automatically creating new malware by varying a given set of input parameters.

4.2.3 Worm
The malware in this category spreads over the network by replicating itself.

4.2.4 Trojan
A trojan is a broad term that refers to stand-alone programs which appear to perform a legitimate function but covertly carry out possibly harmful activities such as providing remote access, data destruction and corruption.

4.2.5 Virus
A virus is a program that can replicate itself and attach itself to other benign programs. It is probably the most well-known type of malware and comes in different flavors.

4.2.6 Miscellaneous
The malware in this category includes DoS (denial of service) tools, nukers, exploits, hacktools and flooders. The DoS- and nuker-based malware allow an attacker to launch malicious activities at a victim's computer system that can possibly result in a denial of service attack. These activities can result in a slow down, restart, crash or shutdown of a computer system.
Exploits and hacktools take advantage of vulnerabilities in a system's implementation, which most commonly results in buffer overflows. Flooders initiate unwanted information floods such as email, instant messaging and SMS floods.

The detailed statistics of the malware used in our study are provided in Table 2. The average malware size in this dataset is 64.2 KB. The sizes of the malware samples used in our study vary from 4 bytes to more than 14 MB. Intuitively speaking, small sized malware are harder to detect than the larger ones.

5. PILOT STUDIES
In our initial set of experiments, we conducted an extensive search of the design space to select the best feature set and classifier for our scheme. The experiments were done on 5% of the total dataset to keep the design cycle of our approach short. Recall that in Section 3 we introduced 13 different statistical and information-theoretic features. The pilot studies are aimed at convincing ourselves that we need all of them. Moreover, we have also evaluated well-known data mining algorithms on our dataset in order to find the best classification algorithm for our problem. We have used the instance-based learner (IBk) [22], decision trees (J48) [25], naïve Bayes [23] and the boosted versions of these classifiers in our pilot study. For boosting we have used the AdaBoostM1 algorithm [24]. We have utilized the implementations of these algorithms available in WEKA [26].

5.1 Discussion on Pilot Studies
The classification results of our experiments are tabulated in Table 3, which shows that the boosted decision tree (J48) significantly outperforms the other classifiers in terms of detection accuracy.

Table 3: AUCs for detecting malicious executables vs benign files. Bold entries in every column represent the best results.
Classifier | Back  | Cons  | Misc  | Troj  | Virus | Worm
Strings:
RIPPER     | 0.591 | 0.611 | 0.690 | 0.685 | 0.896 | 0.599
NB         | 0.615 | 0.610 | 0.714 | 0.751 | 0.944 | 0.557
M-NB       | 0.615 | 0.606 | 0.711 | 0.755 | 0.952 | 0.575
B-J48      | 0.625 | 0.652 | 0.765 | 0.762 | 0.946 | 0.642
KM:
SMO        | 0.720 | -     | 0.611 | 0.755 | 0.865 | 0.750
B-SMO      | 0.711 | -     | 0.611 | 0.766 | 0.931 | 0.759
NB         | 0.715 | -     | 0.610 | 0.847 | 0.947 | 0.750
B-NB       | 0.715 | -     | 0.606 | 0.821 | 0.939 | 0.760
J48        | 0.712 | -     | 0.560 | 0.805 | 0.850 | 0.817
B-J48      | 0.795 | -     | 0.652 | 0.851 | 0.921 | 0.820
IBk        | 0.752 | -     | 0.611 | 0.850 | 0.942 | 0.841
Proposed solution with 4 features:
J48        | 0.835 | 0.709 | 0.779 | 0.837 | 0.746 | 0.880
B-J48      | 0.812 | 0.716 | 0.817 | 0.909 | 0.807 | 0.884
NB         | 0.863 | 0.796 | 0.844 | 0.839 | 0.791 | 0.715
B-NB       | 0.849 | 0.709 | 0.782 | 0.933 | 0.917 | 0.707
IBk        | 0.841 | 0.722 | 0.817 | 0.748 | 0.791 | 0.812
B-IBk      | 0.883 | 0.794 | 0.844 | 0.831 | 0.918 | 0.812
Proposed solution with 52 features:
B-J48      | 0.979 | 0.965 | 0.950 | 0.985 | 0.970 | 0.932

Similarly, we also evaluate the role of the number of features and the number of n-grams on the accuracy of the proposed approach. In Table 3, we first use four features, namely Simpson's index (F1), entropy rate (F2) and Canberra distance (F3) on 1-grams, and Simpson's index on 2-grams (F4). Our scheme achieves a significantly higher detection accuracy compared with strings and KM. We then tried different combinations of features and n-grams and tabulated the results in Table 4.

Table 4: Feature analysis. AUCs for detecting virus executables vs benign files using boosted J48. F1, F2, F3 and F4 correspond to 4 different features.
Feature     | No. of Gram/feature | AUC
F1          | 1          | 0.823
F2          | 1          | 0.839
F3          | 1          | 0.866
F4          | 2          | 0.891
F1-F2       | 1, 1       | 0.940
F1-F3       | 1, 1       | 0.928
F1-F4       | 1, 2       | 0.932
F2-F4       | 1, 2       | 0.929
F1-F2-F4    | 1, 1, 2    | 0.962
F3-F2       | 1, 1       | 0.954
F3-F4       | 1, 2       | 0.913
F1-F2-F3-F4 | 1, 1, 1, 2 | 0.956

It is obvious from Table 4 that once we move from a single feature (F1) on 1-grams to (F4) on 2-grams, the detection accuracy improves from 0.823 to 0.891. Once we use combinations of features computed on 1-grams and 2-grams, the accuracy approaches 0.962 (see F1-F2-F4). This motivated us to use all 13 features on 1-, 2-, 3- and 4-grams, resulting in a total of 52 features. The ROC curve for the virus-benign classification using the F1, F2, F3 and F4 features is shown in Figure 2.

Figure 2: ROC plot (tp rate vs. fp rate) for virus-benign classification, comparing J48, NB, IBk and their boosted versions.

It is clear from Table 3 that the strings approach has a significantly higher accuracy for the virus type compared with other malware types. We analyzed the reason behind this by looking at the signatures used by this method. We observed that viruses typically carry some common strings like "Chinese Hacker", and for the strings approach these become its signatures.
Since similar strings do not appear in other malware types, the strings accuracy degrades to as low as 0.62 in the case of backdoors. Recall that KM uses 4-grams as binary features. KM follows the same pattern of higher accuracy for detecting viruses and relatively lower accuracy for other malware types. However, its accuracy results are significantly higher compared with the strings approach. Note that our proposed solution with 52 features not only provides the best detection accuracy for the virus category, but its accuracy also remains consistent across all malware types. This shows the strength of using a diverse feature set with a boosted J48 classifier.

6. RESULTS & DISCUSSION
Remember that our approach is designed to achieve a challenging objective: to distinguish malicious files of the types backdoor, virus, worm, trojan and constructor from benign files of the types DOC, EXE, JPG, MP3, PDF and ZIP just on the basis of byte-level information. We now explain our final experimental setup. The proposed scheme, as explained in Section 3.2, computes features on the n-grams of each block of a given file. We create the training samples by randomly selecting 50 malicious and 50 benign files from the malware and benign datasets respectively. We create six separate training samples, one for each malware category. We use these samples to train the boosted decision tree and consequently get six classification models, one for each malware type. For an easier understanding of the classification process, we take backdoor as an example. We selected 50 backdoor files and 50 benign files to get a training sample for classifying backdoor and benign files. We train the data mining algorithm with this sample and as a result get the training model to classify a backdoor. Moreover, we further test, using this model, all six benign filetypes and the backdoor malware files. It is important to note that this model is specifically designed to distinguish backdoor files from the six benign filetypes (one-vs-all classification), where only 50 backdoor samples are taken for training. This number is considerably small keeping in view the problem at hand. Nonetheless, the backdoor classification using a single classifier completes in seven runs: six for the benign filetypes and one for itself. In a similar fashion, one can classify the other malware types, i.e. virus, worm, trojan, constructor and miscellaneous. Once our solution is trained for each category of malware, we test it on the benign dataset of 1,800 files and the malware dataset of 10,311 files from VX Heavens. We ensure that the benign and malware files used in the training phase are not included in the testing phase, to verify our claim of zero-day malware detection. The classification results are shown in Figure 3 in the form of AUCs. It is interesting to see that viruses, as expected, are easily classified by our scheme. In comparison, trojans are programs that look and behave like benign programs but perform some illegitimate activities. As expected, the classification accuracy for trojans is 0.88 — significantly smaller compared with viruses. In order to get a better understanding, we have plotted the ROC graphs of the classification results for each malware type in Figure 3. The classification results show that malware files are quite distinct from benign files in terms of byte-level file content. The ROC plot further confirms that viruses are easily classified compared with other malware types, while trojans and backdoors are relatively difficult to classify.
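The one-vs-all setup described above can be sketched as follows; the per-category feature matrices, the 50/50 sampling and the make_classifier factory are assumed to be prepared elsewhere, and all names are illustrative rather than from the authors' testbed.

# Sketch of the one-vs-all experimental setup: one model per malware
# category, each trained on 50 malicious and 50 benign files. The
# feature matrices and make_classifier factory are assumed given.
CATEGORIES = ["backdoor", "trojan", "virus", "worm", "constructor", "miscellaneous"]

def train_one_vs_all(benign_train, malware_train_by_cat, make_classifier):
    models = {}
    for cat in CATEGORIES:
        X = malware_train_by_cat[cat] + benign_train  # 50 + 50 training files
        y = [1] * len(malware_train_by_cat[cat]) + [0] * len(benign_train)
        clf = make_classifier()
        clf.fit(X, y)
        models[cat] = clf
    return models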
Table 5 shows portions of the developed decision trees for all malware categories. As decision trees provide a simple and robust method of classification for a large dataset, it is interesting to note that malicious files are inherently different from the benign ones even at the byte-level.

Table 5: Portions of the developed decision trees

(a) between Backdoor and Benign files
KL1 > 0.022975
|  Entropy2 <= 6.822176
|  |  Manhattan1 <= 0.005411
|  |  |  Manhattan4 <= 0.000062: Malicious (20.0)
|  |  |  Manhattan4 > 0.000062: Benign (8.0/2.0)
|  |  Manhattan1 > 0.005411: Malicious (538.0/6.0)

(b) between Trojan and Benign files
CorelationCoefficient1 > 0.619523
|  Chebyshev4 <= 1.405
|  |  Itakura2 <= 87.231983: Malicious (352.0/9.0)
|  |  Itakura2 > 87.231983
|  |  |  TotalVariation1 <= 0.3415: Malicious (11.0)
|  |  |  TotalVariation1 > 0.3415: Benign (8.0)

(c) between Virus and Benign files
CorelationCoefficient4 > 0.187794
|  Simpson_Index_1 <= 0.005703
|  |  CorelationCoefficient3 <= 0.17584: Malicious (32.0)
|  |  CorelationCoefficient3 > 0.17584
|  |  |  Entropy1 <= 4.969689: Malicious (4.0/1.0)
|  |  |  Entropy1 > 4.969689: Benign (5.0)
|  Simpson_Index_1 > 0.005703: Benign (11.0)

(d) between Worm and Benign files
Entropy2 > 3.00231
|  Canberra2 <= 49.348481
|  |  Canberra1 <= 14.567909: Malicious (161.0)
|  |  Canberra1 > 14.567909
|  |  |  Itakura3 <= 126.178699: Malicious (5.0)
|  |  |  Itakura3 > 126.178699: Benign (5.0)
|  Canberra2 > 49.348481: Benign (13.0/1.0)

(e) between Constructor and Benign files
CorelationCoefficient3 <= -0.013832
|  Itakura2 <= 5.905754
|  |  Itakura3 <= 5.208592
|  |  |  CorelationCoefficient4 <= -0.155078: Malicious (7.0)
|  |  |  CorelationCoefficient4 > -0.155078: Benign (43.0/6.0)
|  |  Itakura3 > 5.208592: Benign (303.0/5.0)

(f) between Miscellaneous and Benign files
Entropy4 > 6.754364
|  KL1 <= 0.698772
|  |  Manhattan3 <= 0.001003
|  |  |  Entropy2 <= 6.411063: Malicious (29.0)
|  |  |  Entropy2 > 6.411063
|  |  |  |  KL1 <= 0.333918: Malicious (2.0)
|  |  |  |  KL1 > 0.333918: Benign (2.0)
|  |  Manhattan3 > 0.001003: Benign (3.0)

Figure 3: ROC plot (tp rate vs. fp rate) for detecting malware from benign files. Virus (AUC = 0.945), Worm (AUC = 0.919), Trojan (AUC = 0.881), Backdoor (AUC = 0.849), Constructor (AUC = 0.925), Miscellaneous (AUC = 0.903).

The novelty of our scheme lies in the way the selected features have been computed — per-block n-gram analysis — and in the correlation between the blocks classified as benign or potentially malicious. The features used in our study are taken from statistics and information theory. Many of these features have already been used by researchers in other fields for similar classification problems. The chosen set of features is not, by any means, the optimal collection. The selection of the optimal number of features remains an interesting problem which we plan to explore in our future work. Moreover, the executable dataset used in our study contained both packed and non-packed PE files. We plan to evaluate the robustness of our proposed technique on a manually crafted packed file dataset.

7. CONCLUSION & FUTURE WORK
In this paper we have proposed a non-signature based technique which analyzes the byte-level file content. We argue that such a technique provides implicit robustness against common obfuscation techniques — especially the repacking of malware to obfuscate signatures.
An outcome of our research is that malicious and benign files are inherently different even at the byte-level. The proposed scheme uses a rich feature set of 13 different statistical and information-theoretic features computed on the 1-, 2-, 3- and 4-grams of each block of a file. Once we have calculated our feature set, we give it as an input to the boosted decision tree (J48) classifier. The choice of feature set and classifier is an outcome of extensive pilot studies done to explore the design space. The pilot studies demonstrate the benefit of our approach compared with other well-known data mining techniques: the strings and KM approaches. We have tested our solution on an extensive executable dataset. The results of our experiments show that our technique achieves more than 90% detection accuracy for different malware types. Another important feature of our framework is that it can also classify the family of a given malware file, i.e. virus, trojan, etc. In future, we would like to evaluate our scheme on a larger dataset of benign and malicious executables and reverse engineer the feature set to further improve the detection accuracy. Moreover, we plan to evaluate the robustness of our proposed technique on a customized dataset containing manually packed executable files.

Acknowledgments
This work is supported by the National ICT R&D Fund, Ministry of Information Technology, Government of Pakistan. The information, data, comments, and views detailed herein may not necessarily reflect the endorsements or views of the National ICT R&D Fund. We acknowledge M.A. Maloof and J.Z. Kolter for their valuable feedback regarding the implementation of the strings and KM approaches. Their comments were of great help in establishing the experimental testbed used in our study. We also acknowledge the anonymous reviewers for their valuable suggestions pertaining to possible extensions of our study.

8. REFERENCES
[1] Symantec Internet Security Threat Reports I-XI (Jan 2002 - Jan 2008).
[2] F-Secure Corporation, "F-Secure Reports Amount of Malware Grew by 100% during 2007", Press release, 2007.
[3] A. Stepan, "Improving Proactive Detection of Packed Malware", Virus Bulletin, March 2006, available at http://www.virusbtn.com/virusbulletin/archive/2006/03/vb200603-packed.dkb
[4] R. Perdisci, A. Lanzi, W. Lee, "Classification of Packed Executables for Accurate Computer Virus Detection", Pattern Recognition Letters, 29(14), pp. 1941-1946, Elsevier, 2008.
[5] AVG Free Antivirus, available at http://free.avg.com/.
[6] Panda Antivirus, available at http://www.pandasecurity.com/.
[7] M.G. Schultz, E. Eskin, E. Zadok, S.J. Stolfo, "Data mining methods for detection of new malicious executables", IEEE Symposium on Security and Privacy, pp. 38-49, USA, IEEE Press, 2001.
[8] J.Z. Kolter, M.A. Maloof, "Learning to detect malicious executables in the wild", ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 470-478, USA, 2004.
[9] J. Kephart, G. Sorkin, W. Arnold, D. Chess, G. Tesauro, S. White, "Biologically inspired defenses against computer viruses", International Joint Conference on Artificial Intelligence (IJCAI), pp. 985-996, USA, 1995.
[10] R.W. Lo, K.N. Levitt, R.A. Olsson, "MCF: A malicious code filter", Computers & Security, 14(6):541-566, Elsevier, 1995.
[11] O. Henchiri, N. Japkowicz, "A Feature Selection and Evaluation Scheme for Computer Virus Detection", IEEE International Conference on Data Mining (ICDM), pp. 891-895, USA, IEEE Press, 2006.
[12] P. Kierski, M. Okoniewski, P.
Gawrysiak, "Automatic Classification of Executable Code for Computer Virus Detection", International Conference on Intelligent Information Systems, pp. 277-284, Springer, Poland, 2003.
[13] T. Abou-Assaleh, N. Cercone, V. Keselj, R. Sweidan, "Detection of New Malicious Code Using N-grams Signatures", International Conference on Intelligent Information Systems, pp. 193-196, Springer, Poland, 2003.
[14] J.H. Wang, P.S. Deng, "Virus Detection using Data Mining Techniques", IEEE International Carnahan Conference on Security Technology, pp. 71-76, IEEE Press, 2003.
[15] W.J. Li, K. Wang, S.J. Stolfo, B. Herzog, "Fileprints: identifying filetypes by n-gram analysis", IEEE Information Assurance Workshop, USA, IEEE Press, 2005.
[16] S.J. Stolfo, K. Wang, W.J. Li, "Towards Stealthy Malware Detection", Advances in Information Security, Vol. 27, pp. 231-249, Springer, USA, 2007.
[17] W.J. Li, S.J. Stolfo, A. Stavrou, E. Androulaki, A.D. Keromytis, "A Study of Malcode-Bearing Documents", International Conference on Detection of Intrusions & Malware, and Vulnerability Assessment (DIMVA), pp. 231-250, Springer, Switzerland, 2007.
[18] M.Z. Shafiq, S.A. Khayam, M. Farooq, "Embedded Malware Detection using Markov n-Grams", International Conference on Detection of Intrusions & Malware, and Vulnerability Assessment (DIMVA), pp. 88-107, Springer, France, 2008.
[19] M. Christodorescu, S. Jha, and C. Kruegel, "Mining Specifications of Malicious Behavior", European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2007), pp. 5-14, Croatia, 2007.
[20] Frans Veldman, "Heuristic Anti-Virus Technology", International Virus Bulletin Conference, pp. 67-76, USA, 1993, available at http://mirror.sweon.net/madchat/vxdevl/vdat/epheurs1.htm.
[21] Jay Munro, "Antivirus Research and Detection Techniques", ExtremeTech, 2002, available at http://www.extremetech.com/article2/0,2845,367051,00.asp.
[22] D.W. Aha, D. Kibler, M.K. Albert, "Instance-based learning algorithms", Journal of Machine Learning, Vol. 6, pp. 37-66, 1991.
[23] M.E. Maron, J.L. Kuhns, "On relevance, probabilistic indexing and information retrieval", Journal of the Association for Computing Machinery, 7(3), pp. 216-244, 1960.
[24] Y. Freund, R.E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting", Journal of Computer and System Sciences, No. 55, pp. 23-37, 1997.
[25] J.R. Quinlan, "C4.5: Programs for machine learning", Morgan Kaufmann, USA, 1993.
[26] I.H. Witten, E. Frank, "Data mining: Practical machine learning tools and techniques", Morgan Kaufmann, 2nd edition, USA, 2005.
[27] VX Heavens Virus Collection, VX Heavens website, available at http://vx.netlux.org
[28] J. Oberheide, E. Cooke, F. Jahanian, "CloudAV: N-Version Antivirus in the Network Cloud", USENIX Security Symposium, pp. 91-106, USA, 2008.
[29] T. Fawcett, "ROC Graphs: Notes and Practical Considerations for Researchers", TR HPL-2003-4, HP Labs, USA, 2004.
[30] S.D. Walter, "The partial area under the summary ROC curve", Statistics in Medicine, 24(13), pp. 2025-2040, 2005.
[31] T.M. Cover, J.A. Thomas, "Elements of Information Theory", Wiley-Interscience, 1991.
Online Phishing Classification Using Adversarial Data Mining and Signaling Games

Gaston L'Huillier
University of Chile
Blanco Encalada 2120
Santiago, Chile
[email protected]

Richard Weber
University of Chile
Republica 701
Santiago, Chile
[email protected]

Nicolas Figueroa
University of Chile
Republica 701
Santiago, Chile
[email protected]

ABSTRACT
In adversarial systems, the performance of a classifier decreases after it is deployed, as the adversary learns to defeat it. Recently, adversarial data mining was introduced as a solution to this problem, where the classification problem is viewed as a game mechanism between an adversary and an intelligent and adaptive classifier. Over the last years, phishing fraud through malicious email messages has been a serious threat that affects global security and economy, and traditional spam filtering techniques have shown to be ineffective against it. In this domain, using dynamic games of incomplete information, a game-theoretic data mining framework is proposed in order to build an adversary-aware classifier for phishing fraud detection. To build the classifier, an online version of the Weighted Margin Support Vector Machines with a game-theoretic prior knowledge function is proposed. In this paper, a new content-based feature extraction technique for phishing filtering is also described. Experiments show that the proposed classifier is highly competitive compared with previously proposed online classification algorithms in this adversarial environment, and promising results were obtained using traditional machine learning techniques over the extracted features.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications — Data Mining; I.5.1 [Pattern Recognition]: Design Methodology — classifier design and evaluation; K.4.4 [Computers and Society]: Electronic Commerce — Security

General Terms
Algorithms, Email Filtering, Game Theory, Data Mining

Keywords
Spam and phishing detection, Adversarial Classification, Games of Incomplete Information

1. INTRODUCTION
In security applications, modern threats are becoming more effective as adversaries adapt and evolve against current security systems. In many domains, such as fraud, phishing, spam, intrusion detection, and other malicious activities, a permanent race between adversaries and classifiers is present. The evolution of the initial problem is driven by a rational change in the adversaries' behavior. In this context, one of the major problems for a classifier is to consider the concept drift and incremental properties of security systems. In recent studies on this topic [35, 38], mainly the incremental characteristic of these applications has been considered, leaving the adversarial behavior as an open question in most of the previously mentioned domains. Nowadays, in the Cyber-Crime context, one of the most common social engineering threats is phishing fraud. This malicious activity consists of email scams, where attackers ask for personal information in order to break into any site where victims store useful private information, such as financial institutions, e-commerce sites or massive services. The phishing filtering problem is not an easy task. Even human behavior, mental models for phishing identification and security skins for graphical user interfaces have been proposed for enhancing human skills to detect phishing emails [9, 10, 28]. While client-side phishing filtering techniques have been developed by large software companies, server-side filtering techniques have been a large research focus [1, 5, 11, 4]. Most of this work is based on machine learning approaches to determine the relevant features to extract from phishing emails, and data mining techniques to determine hidden patterns in the relationships between the extracted features.
There is an important issue when using data mining to build a classifier for the phishing detection task, and many other adversarial classification tasks: it must deal with the uncertainty of classifying malicious or regular activities without information about the real intention of the message. The latter can be modeled as a Bayesian game (or incomplete information game), where the classifier must decide on a strategy without knowing the adversary's real type — whether it was malicious or just happened to be a "malicious like" regular message — using just the revealed set of features to decide. The dynamic behavior of signaling games, a special representation of dynamic games of incomplete information (dynamic Bayesian games), presents common elements with online learning theory. In both cases the final outcomes, whether a game equilibrium or an online classifier, are determined by incremental events presented in their environments. As the nature of email filtering is determined by a high-volume stream of messages, online algorithms, as well as generative learning algorithms, have been considered as a measure to minimize the computational cost, even if the predictive power is lower than that obtained by discriminative learning algorithms. The aim of this work is to present a game-theoretic data mining framework using dynamic games of incomplete information for the adversarial classification problem. A mechanism is proposed to model a signaling game between an adversary and a classifier, where the equilibrium strategies and the classifier's beliefs are used to build an online machine learning classifier to detect phishing emails.

Section 2 of this paper introduces previous work on adversarial data mining and the latest research in phishing filtering. The problem definition and the proposed mechanism are introduced in section 3. The proposed Classifier strategy, the main contribution of this paper, is presented in section 4. In section 5 the proposed phishing feature extraction and problem parameters are defined. The experimental design is presented in section 6, followed by the results shown in section 7. Finally, the main conclusions and future work are presented in section 8.

2. PREVIOUS WORK
2.1 Adversarial Machine Learning
As described by Dalvi et al. in [8], an adversarial game can be represented as a game between two players: a malicious agent whose adversarial activity brings him benefits, and a classifier whose main objective is to identify as many malicious activities as possible, maximizing its expected utility.
The malicious agent tries to avoid detection by changing its features (hence its behavior), inducing a high false-negative rate in the classifier. The adversary is aware that changing features towards a non-adversarial behavior might not increase its benefit. Considering this, the adversary might try to maximize its benefit while minimizing the cost of changing features. This framework, based on a single-shot game of complete information, was initially tested in a spam detection domain, where the adversary-aware naïve Bayes classifier had significantly fewer false positives and false negatives than the classifier's plain version. Then a repeated version of the game was tested, where results showed that the adversary-aware classifier consistently outperformed the adversary-unaware naïve Bayes classifier.

Some extensions of the adversarial classification framework were recently developed. An interesting approach, proposed by M. Kantarcioglu et al. in [19], considers an adversarial Stackelberg game model to define the interaction between the classifier and the adversary. In this setting, the subgame perfect equilibrium is determined using stochastic optimization, where Monte Carlo simulations on mixture models and linear adversarial transformations were tested over spam data sets, showing promising results. In another approach, developed by Sönmenz in [26], a two-player zero-sum game is solved using a joint Linear Support Vector Machines (L-SVM) and minimax optimization problem to find the optimal equilibrium and the optimal hyperplane simultaneously. In this work, the general-sum case is determined using Nikaido-Isoda-type functions to define the optimal hyperplanes in an L-SVM context.

Recently, several studies have been developed about the possibility that a classifier is maliciously mis-trained or that its optimal strategies could be revealed in an adaptive adversarial environment. Open questions such as "Can machine learning be secure?" or "Can the adversary manipulate a learning system to permit a specific attack?" are extensively discussed in [3]. More specifically, Nelson et al. present in [23] how to exploit a spam classifier to render it useless using a very specific attack framework, with indiscriminate and focused attacks and an optimal attacking function, all of them aware that the training model used is a naïve Bayes classifier. Another potential adversarial problem for the classifier, introduced by Lowd and Meek in [18] and proposed as adversarial learning theory, enables the adversary to reconstruct the classifier based on reasonable assumptions and reverse engineering algorithms. However, Biggio et al. in [6] present a promising alternative which randomizes the classifier's decision function in order to hide the classifier's strategy from the adversary, minimizing the adversarial learning and the possibilities to mis-train or learn from the classifier.

2.2 Phishing Classification
Spam filtering has been discussed over the last years, and many filtering techniques have been described [15]. Phishing classification is different in many aspects from the spam case, where most spam emails just want to inform about some product. In phishing there is a more complex interaction between the message and the receiver, like following malicious links, filling deceptive forms, or replying with useful information, which is relevant for the message to succeed.
Also, there is a clear difference among phishing techniques, where the two main message categories are the popularly known deceptive phishing and malware phishing. While malware phishing has been used to spread malicious software installed on victims' machines, deceptive phishing, according to [2], can be categorized into the following six categories: social engineering, mimicry, email spoofing, URL hiding, invisible content and image content. For each one of these subcategories, specific feature extraction techniques have been proposed [2] to help phishing classifiers to use the right characterization of these messages.

Among the countermeasures used against phishing, three main alternatives have been used [2]: blacklisting and whitelisting, network and encryption based countermeasures, and content based filtering. The first alternative, in general terms, consists of using public lists of malicious phishing web-sites (the black list) and lists of legitimate non-malicious web-sites (the white list). The idea is that every link in a message must be checked against both lists. The main problem of this countermeasure is that phishing web-sites do not persist long enough to be registered on time in the black list, making it difficult to keep an up-to-date list of malicious web-sites. The second alternative is based on email authentication methods, where the transaction time could impose a considerable computational cost. Besides, a special technological infrastructure is needed for this countermeasure [2].

Previous work on content-based phishing filtering [1, 5, 2, 11, 4] focused on the extraction of a large number of features and the usage of popular machine learning techniques for classification. These approaches to automatic phishing filtering have shown promising results in setting the relative importance of features. Different text mining techniques for phishing filtering have been proposed: Abu-Nimeh et al. in [1] used Logistic Regression, Support Vector Machines (SVM) and Random Forests to estimate classifiers for the correct labeling of email messages, obtaining the best results with an F-measure (the harmonic mean between precision and recall; see section 6.1) of 90%. Fette et al. in [11], using a list of improved features directly extracted from email messages, proposed an SVM based model which obtained an F-measure of 97.64% on a different phishing corpus. Bergholz et al. in [5] proposed a more sophisticated characterization of emails using a Class-Topic model, where an SVM model obtained an F-measure of 99.46% on an updated version of the previously used phishing corpus. Later, in [2], Bergholz et al. proposed an improved list of features to extract from emails that could characterize most of the phishing tactics categorized by the same authors. Using an SVM model, an F-measure of 99.89% was obtained on a new benchmark database.

3. PROBLEM DEFINITION
Consider a message arriving at time t represented by the feature vector x_t = (x_{t,1}, ..., x_{t,i}, ..., x_{t,a}), where x_{t,i} is the i-th feature of message x_t. Each message can belong to one of two classes: positive (or malicious) messages, and negative (or regular) messages. We define adversarial classification under a dynamic game of incomplete information as a signaling game between an Adversary, which attempts to defeat a Classifier by not revealing information about his real type, modifying x_i (a message of type i) into x_j (a message of type j) by using the transformation function \phi(x_i) = x_j.
Harsanyi in [16], as the tuple Γ^b = (N, (A_n)_{n∈N}, (T_n)_{n∈N}, (p_n)_{n∈N}, (U_n)_{n∈N}), where N = {1, ..., N} is the set of players and A_n is the set of possible actions for player n, ∀n ∈ N. T_n is the set of possible types of player n, ∀n ∈ N. p_n is a probability function p_n : T_n → [0, 1] which assigns to each possible player type in T_n a probability distribution over ∏_{j∈N} T_j. Finally, the utility function of player n is denoted by U_n : (∏_{j∈N} A_j) × (∏_{j∈N} T_j) → R, which gives the payoff of player n as a function of the actions of all players and their types.

Based on the previous scheme, as described in [12, 14], dynamic games of incomplete information can be modeled as signaling games. The proposed model of incomplete information for the adversarial classification between an Adversary (A) and a Classifier (C), i.e. N = {A, C}, follows this sequence of events:

1. Nature draws a type t_i for the Adversary from T = {t_{R,x_i}}_{i=1}^k ∪ {t_{M,x_i}}_{i=1}^k, which states whether the Adversary is Regular (R) or Malicious (M), and defines the initial message of type i, x_i. Nature draws according to the probability distribution p(t_i), where p(t_i) > 0, ∀i, and ∑_{i=1}^k p(t_i) = 1.

2. The Adversary observes his type t_i, which can be either t_{R,x_i} or t_{M,x_i}, and chooses a message x_j from his set of actions A_A = {φ(x_i) = x_j}_{j=1}^k, where x_i is defined by the type t_{R,x_i} or t_{M,x_i}. The function φ : R^a → R^a transforms a feature vector x_i into x_j, the message whose class the Classifier has to decide. A non-malicious adversary has no incentive to modify its behavior, so φ(x_i) = x_i when its type is t_{R,x_i}, ∀i ∈ {1, ..., k}.

3. The Classifier observes x_j (but not t_i) and chooses an action C(x_j) from its set of actions A_C = {+1, −1}. Note that the Classifier is a single-type player, so its type is common knowledge and needs no further mention.

4. Finally, payoffs are given by U_A(t_i, φ(x_i), C(φ(x_i))) and U_C(t_i, φ(x_i), C(φ(x_i))).

The extensive-form game that represents the signaling game between the Adversary and the Classifier is presented in Figure 1. In order to analyze the optimal strategies for the Classifier in the proposed mechanism, special requirements and assumptions beyond the traditional Bayesian Nash equilibrium must be considered. These requirements must be satisfied in order to invoke the perfect Bayesian equilibrium (PBE) refinement concept in the adversarial classification signaling game.

Definition 1. Sequential rationality: At each information set, the Classifier must have a belief about which node in the information set has been reached by the play of the game. Given these beliefs, the Classifier's strategies must be sequentially rational [14, 17].

The previous definition insists that the Classifier hold beliefs and act optimally given them, but further assumptions are necessary in order to discard unreasonable beliefs. In an extensive-form game, an information set is "on the equilibrium path" if it is reached with positive probability when the game is played according to the equilibrium strategies. At information sets on the equilibrium path, beliefs are determined by Bayes' rule and the players' equilibrium strategies.
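To make the four-step sequence of events above concrete, the following minimal sketch simulates one play of the signaling game. The priors, mixed strategies and payoffs here are toy placeholders chosen only for illustration; they are not the utility functions defined later in the paper.

import random

k = 2
# Types t_{R,x_i} and t_{M,x_i}; all numeric choices below are illustrative.
types = [("R", i) for i in range(k)] + [("M", i) for i in range(k)]
prior = {t: 1.0 / len(types) for t in types}          # p(t_i), sums to 1

def adversary_move(t):
    kind, i = t
    if kind == "R":
        return i                      # step 2: regular senders keep phi(x_i) = x_i
    return random.randrange(k)        # toy mixed strategy sigma_A(x_j | t_{M,x_i})

def classifier_move(j):
    return +1 if j == 1 else -1       # step 3: a toy pure strategy C(x_j)

# One play: Nature -> Adversary -> Classifier -> payoffs (step 4).
t = random.choices(types, weights=[prior[u] for u in types])[0]
x_j = adversary_move(t)
action = classifier_move(x_j)
u_C = 1 if (action == +1) == (t[0] == "M") else -1    # toy payoff U_C
print("type:", t, "message:", x_j, "action:", action, "U_C:", u_C)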
Kreps and Wilson formalized in [17] the concept of sequential rationality, under which an equilibrium no longer consists of just optimal strategies for each agent, but also includes a belief for each agent at each information set at which the agent has to move.

Definition 2. Signaling requirement 1 (S1): After observing any message x_j from A_A, the Classifier must have a belief about which types could have sent x_j. Denote this belief by the probability distribution µ(t_i|x_j), where µ(t_i|x_j) ≥ 0, ∀t_i ∈ T, and ∑_{t_i∈T} µ(t_i|x_j) = 1.

Definition 3. Signaling requirement 2 (S2C): For each x_j ∈ A_A, the Classifier's optimal strategy, defined as the probability distribution σ*_C over the Classifier's actions C(x_j) ∈ A_C, must maximize the Classifier's expected utility, given the beliefs µ(t_i|x_j) about which types could have sent x_j. That is,

∀x_j, σ*_C(·|x_j) ∈ arg max_{α_C} ∑_{t_i∈T} µ(t_i|x_j) · U_C(t_i, x_j, α_C)    (1)

where

U_C(t_i, x_j, σ_C(·|x_j)) = ∑_{C(x_j)∈A_C} σ_C(C(x_j)|x_j) · U_C(t_i, x_j, C(x_j))    (2)

Figure 1: Extensive-form representation of the signaling game between the Classifier and the Adversary. In the figure, x_{ij} is defined by φ(x_i) = x_j and I_j is the j-th information set where the Classifier has to decide C(x_j) ∈ {+1, −1}. All intermediate nodes between Nature and the information sets represent the strategy nodes of the Adversary, where φ(x_i) = x_j is decided.

Definition 4. Signaling requirement 3 (S2A): For each t_i ∈ T, the Adversary's optimal message x_j = φ(x_i), defined by the probability distribution σ*_A over the Adversary's actions x_j ∈ A_A, must maximize the Adversary's utility function, given the Classifier's strategy σ*_C. That is,

∀t_i, σ*_A(·|t_i) ∈ arg max_{α_A} U_A(t_i, α_A, σ*_C)    (3)

where

U_A(t_i, σ_A, σ_C) = ∑_{x_j∈A_A} σ_A(x_j|t_i) · U_A(t_i, x_j, σ_C(·|x_j))

and

U_A(t_i, x_j, σ_C(·|x_j)) = ∑_{C(x_j)∈A_C} σ_C(C(x_j)|x_j) · U_A(t_i, x_j, C(x_j))

Definition 5. Signaling requirement 4 (S3): For each x_j ∈ A_A, if there exists t_i ∈ T such that σ*_A(x_j|t_i) > 0, then the Classifier's belief at the information set I_j corresponding to x_j must follow from Bayes' rule and the Adversary's strategy:

µ(t_i|x_j) = σ*_A(x_j|t_i) · p(t_i) / ∑_{t_r∈T} σ*_A(x_j|t_r) · p(t_r)    (4)

If ∑_{t_r∈T} σ*_A(x_j|t_r) · p(t_r) = 0, µ(t_i|x_j) can be defined as any probability distribution.

A sequential equilibrium, a subset of the perfect Bayesian equilibria (PBE) of the adversarial signaling game, is a pair of mixed strategies σ*_A and σ*_C together with a belief µ(t_i|x_j) satisfying signaling requirements S1, S2C, S2A and S3. It is clear, by construction of the mechanism, that requirements S1 and S3 are satisfied by the adversarial classification game. Signaling requirement S2A, however, will be taken as satisfied as a first approach, a strong assumption in the development of the game. Adversarial behavior strategies, as described in [8], could be an interesting alternative to develop; this is left as an open question for future work.

Assumption 1. Signaling game refinements: The dynamic game of incomplete information between the Classifier and the Adversary can be modeled by a signaling game which satisfies the signaling requirements for sequential rationality.
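As a direct illustration of signaling requirement S3, the following minimal sketch computes the posterior beliefs of equation (4) from a prior p(t_i) and an Adversary strategy σ*_A; the prior and strategy values below are illustrative placeholders, not estimates from the paper.

def beliefs(x_j, types, prior, sigma_A):
    """mu(t | x_j) via Bayes' rule, equation (4); uniform on off-path messages."""
    denom = sum(sigma_A(x_j, t) * prior[t] for t in types)
    if denom == 0:                     # off the equilibrium path: any distribution
        return {t: 1.0 / len(types) for t in types}
    return {t: sigma_A(x_j, t) * prior[t] / denom for t in types}

types = ["tR", "tM"]
prior = {"tR": 0.7, "tM": 0.3}        # illustrative p(t_i)
table = {("x1", "tR"): 0.9, ("x1", "tM"): 0.4,
         ("x2", "tR"): 0.1, ("x2", "tM"): 0.6}
sigma_A = lambda x_j, t: table[(x_j, t)]   # illustrative sigma_A*(x_j | t)
print(beliefs("x2", types, prior, sigma_A))
# {'tR': 0.28, 'tM': 0.72}, i.e. 0.1*0.7 / (0.1*0.7 + 0.6*0.3) and its complement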
Recently, a numerical approximation to the sequential equilibrium refinement has been proposed by Turocy in [31], based on previous work in [21, 30]. It uses a transformation of the logit quantal response equilibrium (QRE) correspondence, parameterized by a scalar precision parameter; as the precision tends to infinity, a numerical approximation of the sequential equilibria is obtained. This numerical algorithm has been implemented in Gambit [20], an open-source project for computing equilibria in finite games.

4. CLASSIFIER STRATEGY

As mentioned before, the Classifier's possible actions are given by the set A_C = {+1, −1}. From signaling requirement S2C, it can be shown that the Classifier's optimal strategy C*(x_j) is given by the following conditional statement:

C*(x_j) = +1 if condition (6) is satisfied, −1 otherwise    (5)

∑_{t_i∈T_M} µ(t_i|x_j) · Δ^{t_i}U_{C,M}(x_j) > ∑_{t_i∈T_R} µ(t_i|x_j) · Δ^{t_i}U_{C,R}(x_j)    (6)

where T_M = {t_{M,x_i}}_{i=1}^k, T_R = {t_{R,x_i}}_{i=1}^k, µ(t_i|x_j) is defined by equation (4),

Δ^{t_i}U_{C,R}(x_j) = σ*_C(−1|x_j) · U_C(t_{R,x_i}, x_j, −1) − σ*_C(+1|x_j) · U_C(t_{R,x_i}, x_j, +1)

and

Δ^{t_i}U_{C,M}(x_j) = σ*_C(+1|x_j) · U_C(t_{M,x_i}, x_j, +1) − σ*_C(−1|x_j) · U_C(t_{M,x_i}, x_j, −1)

In the following, these expressions will be written as

Δ^{t_i}U_{C,M}(x_j) = ε_M · (σ*_C(+1|x_j) + σ*_C(−1|x_j) · γ) · (w^T · x_j + b)

and

Δ^{t_i}U_{C,R}(x_j) = ε_R · (σ*_C(−1|x_j) + σ*_C(+1|x_j) · γ) · (w^T · (e − x_j) + b)

where γ, ε_R and ε_M must be defined based on microeconomic assumptions on the primitives of the game, and e is a vector of ones of dimension a. The modeling intuition and the final analytical expression of the utility functions are presented in Appendix A.

The previous game-theoretic result (condition (6)) can be treated as a prior knowledge constraint in a classification problem, associated with the regularized risk minimization of the statistical learning theory proposed by Vapnik in [32]. This is formulated as the following quadratic problem:

min_{w,b,ξ} (1/2) ∑_{i=1}^a w_i² + C ∑_{i=1}^N ξ_i    (7)
s.t. y_i (w^T · x_i + b) · Ψ(x_i) ≥ 1 − ξ_i, ξ_i ≥ 0, ∀i ∈ {1, ..., N}

where

Ψ(x_i) = (1 + ψ(x_i)) / (∑_{k=1}^a w_k + 2 · b)

and

ψ(x_i) = [∑_{t_r∈T_M} µ(t_r|x_i) · ε_M · (σ*_C(+1|x_i) + γ · σ*_C(−1|x_i))] / [∑_{t_r∈T_R} µ(t_r|x_i) · ε_R · (σ*_C(−1|x_i) + γ · σ*_C(+1|x_i))]
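To make the quantities just defined concrete, the following is a minimal sketch (not the authors' implementation) of the belief-weighted decision of condition (6) and the margin scaling Ψ(x) of problem (7), using the simplified analytical forms above. The beliefs mu_M and mu_R, the strategy probabilities s_plus and s_minus, and eps_M, eps_R, gamma are all illustrative placeholders.

def decide(x, w, b, mu_M, mu_R, s_plus, s_minus, eps_M, eps_R, gamma):
    """Condition (6): +1 iff belief-weighted malicious gain exceeds regular gain."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b            # w^T x + b
    comp = sum(wi * (1 - xi) for wi, xi in zip(w, x)) + b       # w^T (e - x) + b
    lhs = sum(mu_M) * eps_M * (s_plus + gamma * s_minus) * score
    rhs = sum(mu_R) * eps_R * (s_minus + gamma * s_plus) * comp
    return +1 if lhs > rhs else -1

def Psi(x, w, b, mu_M, mu_R, s_plus, s_minus, eps_M, eps_R, gamma):
    """Margin scaling of problem (7): Psi(x) = (1 + psi(x)) / (sum_k w_k + 2 b)."""
    psi = (sum(mu_M) * eps_M * (s_plus + gamma * s_minus)) / \
          (sum(mu_R) * eps_R * (s_minus + gamma * s_plus))
    return (1.0 + psi) / (sum(w) + 2.0 * b)

x, w, b = (1, 0, 1), (0.5, -0.2, 0.8), 0.1
print(decide(x, w, b, [0.6], [0.4], 0.7, 0.3, 1.0, 1.0, 2.0))   # -> +1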
The online algorithm to solve the proposed minimization problem is based on solving its dual formulation using the Sequential Minimal Optimization (SMO) algorithm described by Platt in [24]. SMO trains SVMs by breaking the large Quadratic Programming (QP) representation of the dual into a series of small QP problems, which are solved analytically. Small changes to the SMO algorithm, such as those explained in previous work on prior knowledge inclusion in SVMs [37], were considered. The dual representation of the classification problem in equation (7) is the following:

max_α ∑_{i=1}^N α_i − (1/2) ∑_{i=1}^N ∑_{j=1}^N y_i Ψ(x_i) · y_j Ψ(x_j) · α_i α_j x_i^T x_j    (8)
s.t. C_i ≥ α_i ≥ 0, ∀i ∈ {1, ..., N}
∑_{i=1}^N y_i Ψ(x_i) · α_i = 0

Based on previous work on Online Support Vector Machine algorithms described by Gentile in [13] and later by Sculley in [25], the proposed adversary-aware classifier is implemented as an online algorithm. Algorithm 4.1 presents this online learning algorithm, the Bayesian Adversary-Aware Online SVM (BAAO-SVM).

Algorithm 4.1: Bayesian Adversary-Aware Online SVM
Data: (x_1, y_1), ..., (x_n, y_n), γ, ε_M, ε_R, m, Gp, C
Result: f(x_t) = w_t^T · x_t + b_t
1  Initialize w_0 := 0, b_0 := 0, seenData := {};
2  foreach x_t, y_t do
3    Classify x_t using f(x_t) = w_{t−1}^T · x_t + b_{t−1};
4    if y_t (w_{t−1}^T · x_t + b_{t−1}) < Ψ(x_t) then
5      Find w′, b′ with prior-knowledge SMO on seenData, with w_{t−1} and b_{t−1} as seed hypothesis, and Ψ(x_t);
6      set w_t := w′ and b_t := b′;
7    if size(seenData) > m then
8      remove oldest example from seenData;
9    if t mod Gp = 1 then
10     approximate sequential equilibrium strategies using logit QRE;
11   add x_t to seenData;
12   update p(t_i) based on observed messages in seenData;
13   update beliefs µ(t_i|x), ∀t_i ∈ T, x ∈ seenData, using signaling requirement S3;
14   update Ψ(x_i), ∀i ∈ seenData;

Based on the Classifier's beliefs and sequential equilibrium strategies, the hyperplane parameters are updated, incorporating the game-theoretic results as prior knowledge constraints. The main idea of the algorithm is that, given an incoming message x_t, a label is assigned using the classification function f(x_t) = w_{t−1}^T · x_t + b_{t−1}. If the Classifier's optimal strategy is not satisfied (equation (6)), the hyperplane parameters are updated using a modified version of the SMO algorithm over the seen messages (the seenData set). A memory parameter m sets the number of messages kept in seenData. Every Gp periods, the sequential equilibrium strategies are updated using logit QRE. Finally, x_t is added to seenData and the type probabilities are updated, hence also the beliefs and Ψ(x_i), ∀i ∈ seenData. Note that the algorithm evolves dynamically as messages are presented to the Classifier.

5. PHISHING FEATURES, STRATEGIES AND TYPES EXTRACTION

5.1 Corpus Description

The previously defined classifier will be tested over an English-language phishing and ham email corpus built from Jose Nazario's phishing corpus [22] and the SpamAssassin ham collection. The phishing corpus² consists of 4450 emails manually retrieved from November 27, 2004 to August 7, 2007. The SpamAssassin collection, from the Apache SpamAssassin Project³, consists of 6951 ham email messages. The email collection was saved in the Unix mbox email format and processed using Perl scripts.

²Available at http://monkey.org/~jose/wiki/doku.php?id=PhishingCorpus
³Available at http://spamassassin.apache.org/publiccorpus/
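The paper processed the mbox collections with Perl scripts; as a purely illustrative equivalent, the following Python sketch reads the plain-text bodies of an mbox file with the standard mailbox module. The file names are hypothetical, and multipart messages are skipped for simplicity.

import mailbox

def load_bodies(path):
    """Plain-text payloads of single-part messages in an mbox file;
    multipart messages are skipped in this simplified sketch."""
    bodies = []
    for msg in mailbox.mbox(path):
        payload = msg.get_payload(decode=True)    # None for multipart messages
        if isinstance(payload, bytes):
            bodies.append(payload.decode("latin-1", errors="replace"))
    return bodies

phishing = load_bodies("phishing.mbox")           # hypothetical file names
ham = load_bodies("ham.mbox")
print(len(phishing), len(ham))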
5.2 Basic Features

As initially described in [11] and later in [5, 2, 4], the extraction of basic content-based features provides a minimum representation of phishing emails. These features are associated with structural properties of the email, link analysis, programming elements and the output of spam filters:

• Structural properties are captured by four binary features defined by the MIME (Multipurpose Internet Mail Extensions) standard, related to the possible email formats, and state information about the total number of different body parts (all body parts, discrete body parts, composite body parts and alternative body parts).

• Link analysis provides seven binary features related to the properties of the links in an email message: the existence of links in a message, whether the number of internal links is greater than one, whether the number of external links is greater than one, whether the number of links with IP numbers is greater than one, whether the number of deceptive links (links whose real URL is different from the URL presented to the email reader) is greater than one, whether the number of links behind images is greater than one, and whether the maximum number of dots in all links of a message is greater than 10.

• Programming elements are binary features representing whether HTML, JavaScript and forms are used in a message.

• Finally, the SpamAssassin filter's output score for an email was used, indicating with a binary feature whether the score was greater than 5.0, the recommended spam threshold.

It is important to notice that all previously mentioned features (15 in total) are directly extracted from content-based properties of an email message, and each one can be considered a strategy for the Adversary to defeat the Classifier.

5.3 Word List and Clustering Features

The previously mentioned features are not sufficient for an appropriate characterization of a phishing message, and clearly not a complete representation of the adversarial strategies. Following the content-based extraction, a new list of features is proposed to characterize phishing emails, directly associated with the possible strategy set A_A of the Adversary. In the following, word-based features are described as an approach to fulfill the needed representation of phishing strategies. These features are presented as a binary variable for each word in a list of keywords, whose value is 1 if the word is used in the document and 0 otherwise. Also, a feature for each cluster of words, defined as the keyword clusters, is considered. The main idea is that phishing strategies are expressed through the words used in a message; so, for each keyword cluster (Adversary type), a list of relevant words is associated, representing a phishing strategy.

First, stop-word removal and stemming pre-processing are necessary to set up the email database. Let R be the total number of different words in the complete collection of phishing emails, and Q the total number of emails. A vectorial representation of the phishing corpus is given by M = (m_ij), i = 1, ..., R and j = 1, ..., Q, where m_ij is the weight stating whether a given word is more important than another in a document. The weights m_ij considered in this research are defined as an improvement of the tf-idf term [34] (term frequency times inverse document frequency), defined by

m_ij = f_ij (1 + sw(i)) × log(Q / n_i)    (9)

where f_ij is the frequency of the i-th word in the j-th document, sw(i) is a relevance factor of word i in a set of words, and n_i is the number of documents containing word i. In this case, sw(i) = w_i^email / TE, where w_i^email is the frequency of word i over all documents, and TE is the total number of emails. The tf-idf term is a weighted representation of the importance of a given word in a document that belongs to a collection of documents. The term frequency indicates the weight of each word in a document, while the inverse document frequency states whether the word is frequent or uncommon across documents, setting a lower or higher weight respectively.

Based on the previous tf-idf representation, a clustering technique is needed for the segmentation of the whole collection of phishing emails. K-means clustering was used, with the cosine between documents as the distance function (see equation (10)):

cos(m_i, m_j) = ∑_{k=1}^R m_ki m_kj / (√(∑_{k=1}^R m_ki²) · √(∑_{k=1}^R m_kj²))    (10)

The optimal number of clusters was determined using as stopping rules the minimization of the distance within every cluster and the maximization of the distance between clusters.
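A minimal sketch of equations (9) and (10) follows, assuming documents are already tokenized into stemmed words; the toy documents at the end are illustrative only.

import math

def weight_matrix(docs):
    """m_ij per equation (9) for tokenized documents (lists of stemmed words)."""
    Q = len(docs)                                           # Q = TE, total emails
    vocab = sorted({w for d in docs for w in d})
    n = {w: sum(w in d for d in docs) for w in vocab}       # n_i: docs containing i
    sw = {w: sum(d.count(w) for d in docs) / Q for w in vocab}   # sw(i) = w_i / TE
    M = [[d.count(w) * (1 + sw[w]) * math.log(Q / n[w]) for d in docs]
         for w in vocab]                                    # R x Q matrix
    return vocab, M

def cosine(u, v):
    """Equation (10): cosine between two document columns of M."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

docs = [["paypal", "account", "verif"], ["ebay", "account"], ["login", "paypal"]]
vocab, M = weight_matrix(docs)
col = lambda j: [row[j] for row in M]
print(round(cosine(col(0), col(1)), 3))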
Then, for each cluster the most relevant words are determined by

Cw(i) = (∏_{p∈ζ} m_ip)^{1/|ζ|}    (11)

for i ∈ {1, ..., R}, where Cw is a vector containing the geometric mean of each word's weights over the messages contained in a given cluster. Here, ζ is the set of documents in the cluster and m_ip is given by equation (9). Finally, the most important words for each cluster are determined by ordering the weights of the vector Cw. This procedure is based on previous work described in [33], except that there the application was web mining and the clustering algorithm used was Self-Organizing Feature Maps. This method showed that the optimal number of clusters is 13, where the 30 most relevant words of each cluster were considered as features (403 in total). The five most relevant words of each cluster are presented in Table 1.

Table 1: The five most relevant words for each of the 13 clusters of the phishing corpus.

Cluster | Word 1  | Word 2  | Word 3  | Word 4    | Word 5
1       | limit   | use     | credit  | card      | provid
2       | address | follow  | bill    | communiti | violat
3       | ebay    | secur   | bank    | access    | user
4       | chase   | repli   | payment | answer    | info
5       | vector  | area    | desktop | loan      | keybank
6       | account | paypal  | messag  | inform    | updat
7       | signin  | list    | partner | site      | offer
8       | amazon  | union   | never   | maintain  | world
9       | ebay    | email   | page    | polici    | help
10      | login   | respons | verif   | window    | yahoo
11      | area    | demo    | hidden  | expens    | image
12      | use     | sidebar | card    | repli     | review
13      | union   | nation  | answer  | googl     | barclay

5.4 Strategies and Types Extraction

Based on the previously mentioned features (418 in total), a feature selection algorithm is used to improve the performance of the classification algorithms, eliminating noisy features that do not represent the target value and do not give enough information about the real phenomenon observed by the game agents. This is a key step for pruning word features that were included somewhat arbitrarily as the 30 most relevant words of each cluster, giving a final list of relevant words for the phishing/ham classification problem. An information gain criterion was implemented, similar to the one used in decision trees: the information gain of each feature was calculated over the whole database, eliminating those features that did not reach a minimum threshold. A total of 153 features were eliminated, leaving a final data set of 265 features.

After the feature selection step, the Adversary's types t_i ∈ T are extracted using K-means clustering over the whole collection of emails (phishing and ham). The number of clusters over the whole set of features, K_features, represents the total number of types for the Adversary player. For each message x, represented by a vector of 265 variables, the type is determined by

t_i = arg min_{i∈{1,...,K_features}} d(x, C_i)    (12)

where C_i is the centroid of cluster i, and the function d : R^a × R^a → R is the distance between two vectors of dimension a. The distance function used in this research is the Hamming distance, the number of bits that must be changed to turn one vector into the other. Results show that K_features = 7, distributed among four predominantly phishing clusters and three predominantly ham clusters. Clusters were obtained following the same stopping rule used in the keyword clustering: minimization of the within-cluster distance and maximization of the between-cluster distance.
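The type assignment of equation (12) reduces to a nearest-centroid rule under Hamming distance; the following minimal sketch illustrates it with toy binarized centroids (the paper's centroids come from the K-means step above).

def hamming(u, v):
    """d(x, C_i): number of positions in which two binary vectors differ."""
    return sum(a != b for a, b in zip(u, v))

def assign_type(x, centroids):
    """Equation (12): index of the nearest centroid in Hamming distance."""
    return min(range(len(centroids)), key=lambda i: hamming(x, centroids[i]))

centroids = [(1, 0, 1, 0), (0, 1, 1, 1), (0, 0, 0, 1)]   # toy binarized centroids
print(assign_type((1, 0, 1, 1), centroids))               # -> 0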
As stated in Section 3 and Appendix A, strategies are directly associated with the utility function modeling for the Classifier, where the usage of any of the 265 previously stated features can increase or decrease the utility function depending on the classification strategy considered by the Classifier.

6. EXPERIMENTS

The classification of phishing emails is a natural extension of text mining, where the most promising classification algorithms are Support Vector Machines, naïve Bayes and Random Forests, among other text categorization algorithms [27]. A considerable problem is the online setting associated with the nature of an email inbox, where messages arrive from an effectively infinite stream. In this context, the experimental settings were designed, first, to benchmark the proposed feature extraction against previous results using batch-learning SVMs, and second, to compare the accuracy and effectiveness of different online classification algorithms against the proposed adversary-aware classifier.

First, a 10-times 10-fold cross-validation learning scheme using an SVM on the complete database characterized with the 265 features was run, using the libSVM library [7], and the same learning scheme was used to train a naïve Bayes model implemented in Weka [36]. Then, an incremental drift evaluation of SVMs was made using a stratified hold-out learning scheme: 20% of the database was used to train the classifier, and the remaining 80% was treated as an arriving-message test stream without any update of the support vectors, as a proxy for concept drift behavior over batch algorithms. Finally, for the online setting, the Relaxed Online SVM proposed by Sculley in [25] was used, as well as an incremental evaluation of naïve Bayes, and the proposed adversary-aware classifier (BAAO-SVM) was evaluated under the same scheme.

The adversary-aware classifier was developed using the 265 features as the Adversary's possible strategies and the set {+1, −1} as the Classifier's strategies. Types were obtained with the previously described type extraction method, where a total of 7 clusters was found. The approximation of the sequential equilibria was determined using logit QRE, implemented in the Gambit [20] command-line tool (gambit-logit). The Classifier's strategy (the adversary-aware classifier) described in Section 4 was implemented in C++, extending D. Sculley's Online SVM implementation [25] with a modified version of SMO for prior knowledge as described in [37]. The values of γ, ε_R and ε_M were set as an initial estimation over the primitives of the game; further details on the estimation of these model parameters are intentionally omitted by the authors.

6.1 Evaluation Criteria

The resulting confusion matrix of this binary classification task can be described using four possible outcomes: correctly classified phishing messages or True Positives (TP), correctly classified ham messages or True Negatives (TN), ham messages wrongly classified as phishing or False Positives (FP), and phishing messages wrongly classified as ham or False Negatives (FN). The evaluation criteria considered are common machine learning measures constructed from these classification outcomes.

• The False Positive Rate (FP-Rate) and the False Negative Rate (FN-Rate), the proportions of wrongly classified ham and phishing email messages respectively.
FP-Rate = FP / (FP + TN)    (13)

FN-Rate = FN / (FN + TP)    (14)

• Precision, which states the degree to which messages identified as phishing are indeed malicious; it can be interpreted as the classifier's safety.

Precision = TP / (TP + FP)    (15)

• Recall, which states the percentage of phishing messages that the classifier manages to classify correctly; it can be interpreted as the classifier's effectiveness.

Recall = TP / (TP + FN)    (16)

• F-measure, the harmonic mean of precision and recall.

F-measure = 2 · Precision · Recall / (Precision + Recall)    (17)

• Accuracy, the overall percentage of correctly classified email messages.

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (18)
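Putting the six measures together, the following minimal sketch computes equations (13) through (18) from the four confusion-matrix counts; the counts in the example are arbitrary and do not come from the paper's experiments.

def metrics(tp, tn, fp, fn):
    """Equations (13)-(18) from the four confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "FP-Rate": fp / (fp + tn),
        "FN-Rate": fn / (fn + tp),
        "Precision": precision,
        "Recall": recall,
        "F-measure": 2 * precision * recall / (precision + recall),
        "Accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

print(metrics(tp=4400, tn=6900, fp=51, fn=50))   # arbitrary illustrative counts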
7. RESULTS

7.1 Word List and Clustering Features

As shown in Table 3, the F-measure obtained by the 10-times 10-fold cross-validation SVM is 99.32%, and for naïve Bayes under the same learning scheme the F-measure is 94.84%. Previous results on the same email corpus (phishing and ham) reported an F-measure of 99.89%, obtained by Bergholz et al. in [5]. On some evaluation measures, this work's results are slightly worse than previously obtained results, but they are highly competitive given that most of the features remaining after the feature extraction and selection steps differ from those proposed by previous authors. This points to an interesting open question: as future work, a combined feature extraction technique could achieve better results. The False Positive Rate obtained, however, is considerably better than the previous one, with a value of 0.33% compared to 1.11%.

7.2 Online Algorithms Performance

Assessing the online behavior of learning algorithms is not an easy task. In this work, as a first approach, the applicability and accuracy of the overall proposed algorithm were tested using the classification performance measures of Section 6.1. In this context, ROSVM obtained an F-measure of 86.01% with an accuracy of 85.20%; an online version of naïve Bayes obtained an F-measure of 85.20% with an accuracy of 81.18%; and the proposed adversary-aware classifier (BAAO-SVM) obtained an F-measure of 87.69% with an accuracy of 86.63%, a better performance than the previously used online classification algorithms under these evaluation criteria.

Table 2: Experimental results for the benchmark machine learning algorithms: FP-Rate, FN-Rate and Accuracy evaluation criteria.

Model            | FP-Rate | FN-Rate | Accuracy
Bergholz's SVM   | 0.07%   | 1.11%   | 99.52%
10x10xv SVM      | 1.21%   | 0.33%   | 99.48%
Naïve Bayes      | 4.47%   | 6.60%   | 94.31%
Inc. drift SVM   | 5.33%   | 10.60%  | 91.60%
Inc. Naïve Bayes | 1.33%   | 25.66%  | 81.18%
Online SVM       | 15.45%  | 14.26%  | 85.20%
BAAO-SVM         | 14.69%  | 12.26%  | 86.63%

Table 3: Experimental results for the benchmark machine learning algorithms: Precision, Recall and F-measure evaluation criteria.

Model            | Precision | Recall | F-measure
Bergholz's SVM   | 99.89%    | 99.89% | 99.89%
10x10xv SVM      | 99.67%    | 98.97% | 99.32%
Naïve Bayes      | 93.35%    | 96.38% | 94.84%
Inc. drift SVM   | 95.90%    | 89.40% | 92.54%
Inc. Naïve Bayes | 99.78%    | 74.34% | 85.20%
Online SVM       | 85.20%    | 86.83% | 86.01%
BAAO-SVM         | 87.64%    | 87.74% | 87.69%

8. CONCLUSIONS AND FUTURE WORK

An extension of the Adversarial Classification framework for adversarial data mining was presented, considering dynamic games of incomplete information, or signaling games, as a new approach to make classifiers improve their performance in adversarial environments. This approach relies on strong assumptions about the Adversary's strategies, the utility function modeling for the Classifier, and experimental setups related to the database processing. As a first approach, interesting empirical results were obtained.

The proposed adversary-aware classifier, whose core is a Support Vector Machine model, considers a signaling game where beliefs, mixed strategies and probabilities for the message types are updated and incorporated as prior knowledge as new messages are presented. This enables the classifier to change the margin error parameter dynamically as the game evolves, embedding awareness of the adversarial environment; more specifically, this awareness enters the misclassification constraint of the SVM optimization problem. The results obtained show improvements over previous online text categorization algorithms used for email filtering.

Feature extraction is a key component defining the game strategies and types of the proposed dynamic game of incomplete information. The results showed that the proposed feature extraction and selection techniques are highly competitive in comparison with previous feature extraction work. Future work could consider a mixture of previous and present feature extraction techniques; this could yield a better strategy space for the Adversary, and therefore better Adversary types. This is an important topic that directly affects the definition of the signaling game between the Adversary and the Classifier, and hence the Classifier's performance.

Determining the actual concept drift of the game is another important open question. An experimental setup showing the impact on the classifier of including new Adversary strategies within an already defined set of strategies (features) might help to answer it. In the game modeling, the Adversary's strategies could also be estimated using a linear programming formulation, as recommended in the original Adversarial Classification framework [8]. Nevertheless, this first approach to adversarial classification with dynamic games of incomplete information showed interesting empirical and theoretical results. Theoretical aspects of the game-theoretic framework could be extended further, for example with refinements of these equilibria such as the intuitive criterion proposed by Cho and Kreps, among other refinements of the perfect Bayesian equilibrium.

9. ACKNOWLEDGMENTS

Support from the Millennium Science Institute on Complex Engineering Systems (http://www.sistemasdeingenieria.cl) and the Center for Analysis and Modeling for Security (CEAMOS) is gratefully acknowledged.

10. REFERENCES

[1] S. Abu-Nimeh, D. Nappa, X. Wang, and S. Nair. A comparison of machine learning techniques for phishing detection. In eCrime '07: Proceedings of the Anti-Phishing Working Group's 2nd Annual eCrime Researchers Summit, pages 60–69, New York, NY, USA, 2007. ACM.

[2] A. Bergholz, J. De Beer, S. Glahn, M.-F. Moens, G. Paass, and S. Strobel. New filtering approaches for phishing email. Journal of Computer Security, 2009. Accepted for publication.

[3] M. Barreno, B. Nelson, R. Sears, A. D. Joseph, and J. D. Tygar. Can machine learning be secure? In ASIACCS '06: Proceedings of the 2006 ACM Symposium on Information, Computer and Communications Security, pages 16–25, New York, NY, USA, 2006. ACM.

[4] R. Basnet, S. Mukkamala, and A. H. Sung. Detection of phishing attacks: A machine learning approach. In Studies in Fuzziness and Soft Computing, pages 373–383. Springer Berlin / Heidelberg, 2008.
[5] A. Bergholz, J.-H. Chang, G. Paass, F. Reichartz, and S. Strobel. Improved phishing detection using model-based features. In Fifth Conference on Email and Anti-Spam, CEAS 2008, 2008.

[6] B. Biggio, G. Fumera, and F. Roli. Adversarial pattern classification using multiple classifiers and randomisation. In SSPR/SPR, pages 500–509, 2008.

[7] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[8] N. Dalvi, P. Domingos, Mausam, S. Sanghai, and D. Verma. Adversarial classification. In Proceedings of the Tenth International Conference on Knowledge Discovery and Data Mining, pages 99–108, Seattle, WA, USA, 2004. ACM Press.

[9] J. S. Downs, M. Holbrook, and L. F. Cranor. Behavioral response to phishing risk. In eCrime '07: Proceedings of the Anti-Phishing Working Group's 2nd Annual eCrime Researchers Summit, pages 37–44, New York, NY, USA, 2007. ACM.

[10] J. S. Downs, M. B. Holbrook, and L. F. Cranor. Decision strategies and susceptibility to phishing. In SOUPS '06: Proceedings of the Second Symposium on Usable Privacy and Security, pages 79–90, New York, NY, USA, 2006. ACM.

[11] I. Fette, N. Sadeh, and A. Tomasic. Learning to detect phishing emails. In WWW '07: Proceedings of the 16th International Conference on World Wide Web, pages 649–656, New York, NY, USA, 2007. ACM.

[12] D. Fudenberg and J. Tirole. Game Theory. MIT Press, October 1991.

[13] C. Gentile. A new approximate maximal margin classification algorithm. Journal of Machine Learning Research, 2:213–242, December 2001.

[14] R. Gibbons. Game Theory for Applied Economists. Princeton University Press, 1992.

[15] J. Goodman, G. V. Cormack, and D. Heckerman. Spam and the ongoing battle for the inbox. Communications of the ACM, 50(2):24–33, 2007.

[16] J. C. Harsanyi. Games with incomplete information played by Bayesian players. The basic probability distribution of the game. Management Science, 14(7):486–502, 1968.

[17] D. M. Kreps and R. Wilson. Sequential equilibria. Econometrica, 50(4):863–894, July 1982.

[18] D. Lowd and C. Meek. Adversarial learning. In KDD '05: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 641–647, New York, NY, USA, 2005. ACM.

[19] M. Kantarcioglu, B. Xi, and C. Clifton. A game theoretic framework for adversarial learning. In CERIAS 9th Annual Information Security Symposium, 2008.

[20] R. D. McKelvey, A. M. McLennan, and T. L. Turocy. Gambit: Software tools for game theory, version 0.2007.01.30, 2007.

[21] R. D. McKelvey and T. R. Palfrey. Quantal response equilibria for normal form games. Games and Economic Behavior, 10:6–38, 1995.

[22] J. Nazario. Phishing corpus, 2004–2007.

[23] B. Nelson, M. Barreno, F. J. Chi, A. D. Joseph, B. I. P. Rubinstein, U. Saini, C. Sutton, J. D. Tygar, and K. Xia. Exploiting machine learning to subvert your spam filter. In LEET'08: Proceedings of the 1st USENIX Workshop on Large-Scale Exploits and Emergent Threats, pages 1–9, Berkeley, CA, USA, 2008. USENIX Association.

[24] J. Platt. Sequential minimal optimization: A fast algorithm for training support vector machines, 1998.

[25] D. Sculley and G. M. Wachman. Relaxed online SVMs for spam filtering. In SIGIR '07: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 415–422, New York, NY, USA, 2007. ACM.

[26] O. Sönmez. Learning game theoretic model parameters applied to adversarial classification.
Master's thesis, Saarland University, 2008.

[27] F. Sebastiani. Text categorization. In A. Zanasi, editor, Text Mining and its Applications to Intelligence, CRM and Knowledge Management, pages 109–129. WIT Press, Southampton, UK, 2005.

[28] R. Dhamija and J. D. Tygar. The battle against phishing: Dynamic Security Skins. In SOUPS '05: Proceedings of the 2005 Symposium on Usable Privacy and Security, pages 77–88. ACM Press, 2005.

[30] T. L. Turocy. A dynamic homotopy interpretation of the logistic quantal response equilibrium correspondence. Games and Economic Behavior, 51(2):243–263, May 2005.

[31] T. L. Turocy. Using quantal response to compute Nash and sequential equilibria. Economic Theory, 42(1), 2010.

[32] V. N. Vapnik. The Nature of Statistical Learning Theory (Information Science and Statistics). Springer, 1999.

[33] J. D. Velasquez, S. A. Rios, A. Bassi, H. Yasuda, and T. Aoki. Towards the identification of keywords in the web site text content: A methodological approach. IJWIS, 1(1):53–57, 2005.

[34] J. D. Velásquez, H. Yasuda, T. Aoki, and R. Weber. A new similarity measure to understand visitor behavior in a web site. IEICE Transactions on Information and Systems, Special Issue on Information Processing Technology for Web Utilization, E87-D(2):389–396, 2004.

[35] H. Wang, W. Fan, P. S. Yu, and J. Han. Mining concept-drifting data streams using ensemble classifiers. In KDD '03: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 226–235, New York, NY, USA, 2003. ACM.

[36] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco, 2nd edition, 2005.

[37] X. Wu and R. Srihari. Incorporating prior knowledge with weighted margin support vector machines. In KDD '04: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 326–333, New York, NY, USA, 2004. ACM.

[38] P. Zhang, X. Zhu, and Y. Shi. Categorizing and mining concept drifting data streams. In KDD '08: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 812–820, New York, NY, USA, 2008. ACM.

APPENDIX

A. CLASSIFIER'S UTILITY FUNCTION

The notion of the Classifier's utility function is presented through a simple example; a general expression for the utility function is then given, and an analytical expression for Δ^{t_i}U_{C,R}(x_j) and Δ^{t_i}U_{C,M}(x_j) is deduced.

Consider a database characterized by just one feature, x_1. This feature represents the usage of a phishing word, say the word "paypal". Therefore, if "paypal" is used in a message, then x_1 = 1, and x_1 = 0 otherwise. Suppose we have a classification function defined by y = w · x_1 + b. If y ≥ τ, for a given threshold τ, the message is classified as phishing (or malicious), and if y < τ it is classified as regular (or non-malicious). As a strong assumption, the construction and modeling of the utility function considers that every feature with a value different from 0 represents a phishing strategy.

Suppose the real type of a given message is malicious. If the malicious feature x_1 is used in the message, then the value of the classification function is y = w + b, so there is a high probability that w + b > τ and hence that the message is classified as malicious. The utility of the Classifier, if the message is classified as malicious, is proposed to be U_C = ε_M (w + b), the maximum reward to the Classifier for performing a good job.
Now, if the Classifier decides that the message is not malicious, even though x_1 = 1, it must be penalized with its maximum cost U_C = −γ ε_M (w + b), where γ > 1 is a misclassification cost parameter. If instead x_1 = 0, the Classifier is rewarded (or penalized) with a lower value, as it classifies as malicious a message without phishing properties (or misclassifies a malicious message without phishing properties, respectively). Also, to distinguish these payoffs from the case where the real type of the message is regular, an ε_M parameter is introduced. Here, C(x_1) is the prediction of the classification function. Based on the previous idea, the Classifier's payoffs for the malicious case are defined as follows:

U_C^Malicious(x_1, C(x_1)) =
  ε_M · (w + b)       if x_1 = 1, C(x_1) = +1
  ε_M · b             if x_1 = 0, C(x_1) = +1
  −γ · ε_M · (w + b)  if x_1 = 1, C(x_1) = −1
  −γ · ε_M · b        if x_1 = 0, C(x_1) = −1

In case the real type of a given message is regular, the payoffs are quite different. Here, the maximum payoff is achieved when the Classifier decides the message is not phishing while it does not use a phishing characteristic. The minimum payoff is obtained when a regular message that does not use a phishing characteristic is classified as malicious. Without loss of generality, an ε_R parameter is introduced to distinguish these payoffs from the previous case:

U_C^Regular(x_1, C(x_1)) =
  ε_R · (w + b)       if x_1 = 0, C(x_1) = −1
  ε_R · b             if x_1 = 1, C(x_1) = −1
  −γ · ε_R · (w + b)  if x_1 = 0, C(x_1) = +1
  −γ · ε_R · b        if x_1 = 1, C(x_1) = +1

For the general case with a features (a being the number of features, or strategies, revealed to the Classifier by a given Adversary), following the same idea as the simple case with a = 1, the classification function is defined by y = w^T · x + b, where dim(x) = dim(w) = a. All of this is associated with the mixed strategies of playing C(x_j) = +1 or C(x_j) = −1, represented by σ*_C(+1|x_j) and σ*_C(−1|x_j) respectively. It can be shown that

Δ^{t_i}U_{C,M}(x_j) = σ*_C(+1|x_j) · U_C(t_{M,x_i}, x_j, +1) − σ*_C(−1|x_j) · U_C(t_{M,x_i}, x_j, −1)
                    = σ*_C(+1|x_j) · ε_M · (w^T · x_j + b) − σ*_C(−1|x_j) · (ε_M · (−γ) · (w^T · x_j + b))
                    = ε_M (σ*_C(+1|x_j) + γ · σ*_C(−1|x_j)) · (w^T · x_j + b)

Following the same idea, when the message's real type is regular, it can be shown that

Δ^{t_i}U_{C,R}(x_j) = ε_R · (σ*_C(−1|x_j) + σ*_C(+1|x_j) · γ) · (w^T · (e − x_j) + b)

where e is a vector of ones of dimension a. It is important to note that the previous utility function representation is specifically designed for binary features that only capture malicious strategies.

Data Security and Integrity: Developments and Directions

Bhavani Thuraisingham
Department of Computer Science
University of Texas at Dallas
[email protected]

ABSTRACT

Data is a critical resource in numerous organizations. One of the challenging problems facing these organizations today is to ensure that only authorized individuals have access to data. Data also has to be protected from malicious corruption. Much of the early work on data security focused on multilevel secure data management systems, where users have different clearance levels, data has different sensitivity levels, and access to data is governed by the security policies. There were many efforts on securing relational, distributed and object-oriented databases. More recently, several aspects of data security are being investigated, including data confidentiality, integrity, trust and privacy.
Furthermore, securing data warehouses and the semantic web, as well as applying data mining to solve security problems, are getting a lot of attention. This presentation will review the developments in data security and integrity and discuss directions for further research and development. In particular, policy management for the semantic web, assured information sharing, privacy-preserving data mining and novel ways to build secure data management systems will be discussed.

BIOGRAPHY

Dr. Bhavani Thuraisingham joined The University of Texas at Dallas (UTD) in October 2004 as a Professor of Computer Science and Director of the Cyber Security Research Center in the Erik Jonsson School of Engineering and Computer Science. She is an elected Fellow of three professional organizations: the IEEE (Institute of Electrical and Electronics Engineers), the AAAS (American Association for the Advancement of Science) and the BCS (British Computer Society), for her work in data security. She received the IEEE Computer Society's prestigious 1997 Technical Achievement Award for "outstanding and innovative contributions to secure data management." Her research interests are in assured information sharing and the trustworthy semantic web; secure geospatial data management; and data mining for security applications. Prior to joining UTD, Thuraisingham worked for the MITRE Corporation for 16 years, which included an IPA (Intergovernmental Personnel Act) assignment at the National Science Foundation. Her work in information security and information management has resulted in over 80 journal articles, over 200 refereed conference papers and three US patents. She is the author of eight books on data management, data mining and data security and has given over 60 keynote addresses on these topics. Prof. Thuraisingham's website is http://www.utdallas.edu/~bxt043000.

Towards Trusted Intelligence Information Sharing

Joseph V. Treglia
School of Information Studies
Syracuse University
Syracuse, NY, USA
[email protected]

Joon S. Park
School of Information Studies
Syracuse University
Syracuse, NY, USA
[email protected]

ABSTRACT

While millions of dollars have been invested in information technologies to improve intelligence information sharing among law enforcement agencies at the federal, tribal, state and local levels, there remains a hesitation to share information between agencies. This lack of coordination hinders the ability to prevent and respond to crime and terrorism. Work to date has produced neither solutions nor widely accepted paradigms for understanding the problem. Therefore, to enhance the current intelligence information sharing services between government entities, in this interdisciplinary research we have identified three major areas of influence: Technical, Social, and Legal. Furthermore, we have developed a preliminary model and theory of intelligence information sharing through a literature review, experience and interviews with practitioners in the field.
This model and theory should serve as a basic conceptual framework for further academic work and lead to further investigation and clarification of the identified factors and the degree of impact they exert on the system, so that actionable solutions can be identified and implemented.

Categories and Subject Descriptors

C.2.0 [Computer-Communication Networks]: General – security and protection. H.3.5 [Information Storage and Retrieval]: Online Information Services – data sharing. K.4.1 [Computers and Society]: Public Policy Issues – transborder data flow. K.6.1 [Management of Computing and Information Systems]: Security and Protection – unauthorized access.

General Terms

Management, Reliability, Security, Theory.

Keywords

Information Sharing, Intelligence Sharing, Security.

1. INTRODUCTION

The term information sharing in law enforcement gained popularity as a result of the 9/11 Commission hearings and the report on the United States government's lack of response to information that was known about planned terrorist attacks on the New York City World Trade Towers prior to the events. This led to the enactment of several executive orders by President Bush mandating that agencies implement policies to "share information" across organizational boundaries (United States, 2007c).

Intelligence information sharing is the transfer of information obtained that relates to an actual or impending occurrence of a criminal or terrorist act. It includes suspicious activity reports regarding incidents or observations of a less obvious nature which may nonetheless support or relate to criminal or terrorist activity. An incident or activity of a suspicious nature, or outside the norm for a particular environment and circumstance, could be considered suspicious activity or intelligence information depending on the circumstances. The adjudication as to whether an event is considered and captured as suspicious activity or intelligence is subjective and left to the discretion of the officer involved in the report or observation. Where a person of average intelligence and familiarity with the normative environment would believe an act may be part of the cause or furtherance of a criminal act, it would qualify as intelligence information. An example may be helpful here: a person taking pictures of trains in a train yard may not cause someone familiar with the area to be concerned; many people collect, model, and photograph trains. However, if we add to this scenario that 1) the person is not from the surrounding area, 2) the person is taking photos of trains and the facilities, 3) the person becomes agitated when asked about her/his purpose, and 4) the person provides inaccurate information about where they are from or inconsistent versions of what they are doing, then perhaps the person is affiliated with a terrorist group or has criminal intentions.
This all may bring the incident to the level of a reportable incident, or suspicious activity report, which would become intelligence information to be shared among enforcement agencies. It is the combination of activity and circumstance that is the trigger. This will also be referred to as intelligence information, or simply intelligence. It is yet another problem, worthy of study, to identify the means by which various agencies collect and manage this type of information.

While millions of dollars have been invested in information technologies to improve information sharing capabilities among all law enforcement agencies, according to the National Security Agency (NSA) there remains a hesitation to share intelligence information between agencies (Lieberman, 2007). Information technologies for the future should provide ubiquitous and distributed computing and communication systems that deliver transparent and high-quality service, without disruption, while enabling and preserving privacy, security, trust, participation and cooperation. In this paper we identify barriers affecting effective and trusted information sharing between federal, tribal, state and local law enforcement agencies in the United States. Research into these dynamics will lead to the identification of actionable solutions for law enforcement intelligence information sharing across the federal, tribal, state and local levels.

2. SUMMARY OF RELATED WORK

Inter-organizational systems studied by management information systems researchers have primarily focused on the private sector and do not directly apply to the government sector (Lai & Mahapatra, 1997). Preliminary work involved an exploration of conditions for cooperation between emergency management agencies, where perceived information assurance of others and information sharing standards were more strongly related to information sharing than were cultural norms, in emergency contexts (Lee & Rao, 2007). Research on emergency services reported that technical environments, such as other agencies' information assurance levels and technical standards, seemed to encourage the use of information sharing systems. There is limited work available that focuses on interagency information sharing issues in the law enforcement sector; examples include studies done in agencies outside the United States (Glomseth et al., 2007; Jing & Pengzhu, 2007). Such studies may not generalize to this environment in many respects. Some studies done in the United States do involve looking at agencies sharing information at the same levels (Pardo et al., 2006a). There is initial work on cultural influences on information sharing behaviors in the public sector (Luna-Reyes et al., 2007). There is recent work on organizational capability assessment for information systems development, which involved criminal justice agencies and also included cultural considerations and system complexity issues (Cresswell et al., 2007). Further examination of the influence of perceptions of technological issues, culture, trust, and legal or policy issues in the law enforcement and public sector context is necessary.

3. KEY INFLUENCES ON TRUSTED INFORMATION SHARING

In order to enhance the current intelligence information sharing services between government entities, we have identified three major areas of influence, Technical, Social, and Legal (summarized in Figure 1), in our previous work (Treglia and Park, 2009).
In this paper, we further develop the preliminary model and theory of intelligence information sharing through a literature review, experience, and interviews with practitioners in the field. Within each area, individual factors are discussed that play roles in influencing whether or not intelligence information is ultimately shared.

Figure 1. Key Influences on Trusted Information Sharing. (Technical: Interoperability, Availability, Control; Social: Trust, Shadow Network, Criticality; Legal: Policy Conflict, Governance.)

3.1 Technical Influences

3.1.1 Interoperability

The interoperability of information systems and of the data elements captured and used was found to be a problematic issue. Tools such as XML are widely used in business development of Web services and for B2B integration and data exchange (Lampathaki et al., 2008; Fernández-Medina and Yagüe, 2008). Although 82% of non-federal law enforcement agencies in the United States use computers for internet access, unified standards for information systems have not been universally accepted by law enforcement entities (U.S. Dept. of Justice Statistics, 2006). This has led to hardware, software and network inconsistencies (Chau et al., 2002; United States, 2007b). It is also related to the definition of fields and data descriptions. With more than 19,000 law enforcement agencies in the United States, each having its own systems and hierarchy, it is no wonder that there are compatibility issues between agencies and systems when they try to collaborate or interconnect ("Bureau of Justice Statistics Law Enforcement Statistics," 2006). As a matter of fact, there are many different information systems currently being used by law enforcement agencies for data management and communication, such as COPLINK, OneDOJ, N-DEx, ALECS, LInX and others (Bulman, 2008; Chen et al., 2003; McKay, 2008).

Furthermore, the lack of interoperability in regulations hinders trust between agencies. For instance, there are no broadly accepted standards for security clearances and subsequent access across agencies. The federal government has a lengthy process for approving access to intelligence information, and there is no provision to readily accept security clearances from other federal or non-federal agencies (Whitehouse, 2007). A state police officer with a secret clearance in his agency does not carry this standard or designation to other local or federal agencies. Security clearances even across federal agencies do not automatically transfer and must be reevaluated and reassessed by the individual agency. A justifiable concern is that there are no universal standards for hiring and background checks across the various agencies. The process used to verify a person's credibility in a given agency may not be adequate for a certain level of secure access at another agency. This is a tremendous obstacle to sharing information among agencies and feeds into a perception of mistrust across agencies. This is, however, an area which can be addressed through legislative changes and changes to internal agency processes.

3.1.2 Availability

Availability means that the systems must respond in a timely manner. These systems must have a high degree of survivability and function in mission-critical environments where parts of the network may be compromised but accurate service must be continued (Park et al., 2009; Schooley, 2007).
For instance, network availability impacts acceptance and use of systems (Chan & Teo, 2007; Koroma et al., 2003). Furthermore, information must be kept up to date and in accordance with the users' interests and needs. For new systems to integrate with the variety of technologies and protocols in use, the complexity of processing and connections is increased, and system performance and reliability become taxed. Systems become more prone to delays or failures as they must incorporate legacy and other protocols into their core programming and functions. Increased overhead for security also adds to the workload and increases the potential for system delays or failure. Systems that are considered slow or non-responsive according to the expectations of the users will have a hard time being adopted.

3.1.3 Control

Control as perceived by the users is required for information sharing and systems adoption. Information-sharing systems must be capable of controlling, monitoring and managing all usage and dissemination of intelligence information for tracking purposes, to provide the assurance that is required for trust (Li et al., 2008). There is no broadly accepted set of minimum security and access control standards and protocols for intelligence information systems that has been uniformly adopted for use across federal, tribal, state and local agencies (Cresswell et al., 2007). Distributed workflow control tasks in these integrated and grid environments may increase the level of information sharing, availability and cost effectiveness but, on the flip side, they also increase complexity and control problems (Serra da Cruz et al., 2008; Park et al., 2001). Therefore, provenance and user control tasks and capabilities must be suited to these varied environments in a trusted information-sharing system.

3.2 Social Influences

3.2.1 Trust

Trust is a key influencer of sharing behavior. Here trust refers to the degree to which the person with intelligence information trusts other people in other agencies who may receive the information or have access to it. Trust has been identified as an area of concern in much of the information systems and management research (Gao, 2005; Humenn et al., 2004; Jing & Pengzhu, 2007; Koufaris & Hampton-Sosa, 2004; Lee, 2006; Lee et al., 2008; Xiong & Liu, 2004; McKnight et al., 2002; Niu, 2007; Razavi & Iverson, 2006; Ruppel et al., 2003; Schoorman et al., 2007; Zhang, 2005; Rocco, 1998; ISAC White Paper, 2004; Li et al., 2008; Ray, 2004; Chakraborty et al., 2006; Park et al., 2006; Park et al., 2007). Trust occurs at the individual and organizational levels. It includes other law enforcement officers and extends to the other staff or persons who may gain access to information were it made available to them, and it assumes that there is a means for sharing this information (Scott, 2006). Agencies treat information security differently, and there is a reality of corruption in agencies at every level (Ivkovic & Shelley, 2005). The person responsible for sharing intelligence information may have personal knowledge of individual employees whom they do not trust, or a general impression or bias, correct or not, about the security within the agency in general terms. Personal impressions do influence the decision to share information on a unified system or not. The person deciding to share the information weighs trust in this way. Trust may weigh heavily on the decision to provide information as well (Niu, 2007; Pardo et al., 2006b).
A very trusting person is more likely to freely provide information to the system than someone who is more apprehensive or who has specific concerns such as those above.

3.2.2 Shadow Network
Shadow networks involve situations where a personal or agency connection, in or outside of the workplace, creates a conflict of interest, such that the organization or individual may not act in a non-biased, objective manner. This may involve personal friendships, affiliations or family ties, and connections through other activities or interests outside the workplace. This can have positive and negative effects for organizations (Ingram & Lifschitz, 2006). Intelligence information that may negatively impact an agency or key individuals or associates may be withheld and not shared by the organization involved. The stigma or interpersonal links behind the scenes play a role in interaction and sharing decisions (Kulik et al., 2008). This is related to the organizational notion of shadow systems, described by Stacey (1996) as "the complex web of interactions in which social, covert political and psycho-dynamic systems coexist in tension with the legitimate system" (Shaw, 1997; Stacey, 1996). There is an obvious link here to personal integrity and to the social impacts of potentially damaging information that hits "too close to home." The personal integrity of the individual member with the information has an influence on whether or not they will share. Integrity is internal to the individual. Trust is focused outward, toward the individual's perception of another agency. Integrity has to do with the specific character and makeup of the person with the information. Influences such as policy, trust and personal interests, personal connection and corruption affect different individuals in different ways based on their personal integrity and interests. A person who demonstrates a high degree of respect for the rules and regulations of the agency would be considered to have a high degree of integrity and would be more likely to follow policy than someone with a record of bending or not following the rules. Integrity involves a willingness to place the organization's rules and interests above one's own.

3.2.3 Criticality
The criticality of the information itself, and its potential harmful impact if not disclosed, is a key influencer of action in sharing information. Studies by J. Lee and H. R. Rao have shown that officers are more likely to share information where there is a clear and present danger to life or property (Lee & Rao, 2007). The greater the threat, the greater the likelihood that the people involved will cooperate and share information. Preliminary work exploring possible causes and effects of inter-agency information-sharing systems adoption in the counter-terrorism and disaster management domains involved an exploration of environmental and situational conditions for cooperation between emergency management agencies. In emergency contexts, the perceived information assurance of others and having information sharing standards were more strongly related to agencies sharing than were cultural norms. This supports the assertion that during a crisis, where criticality is a factor, people are more willing to share information regardless of other influences. The timeliness of the information itself is also related to criticality. The relationship of time to the consequences or effectiveness of the information influences whether or not the information is shared.
Information obtained too late or after the fact may or may not be shared, based on what consequence it may have at that point in time. Information of questionable value may be held in waiting so that it can be verified or supported in some way before sharing. As the time draws near to where the information may become useless if not shared, the decision to share or not share is reevaluated.

3.3 Legal Influences
3.3.1 Policy Conflict and Competition
Agency policy also has influence on whether information gets shared or not. In an agency with defined policy as to what is to be shared, it is easier for staff to make the determination to follow through with information that is clearly within the guidelines. Clear and enforced rules for information sharing do lead to better sharing of this information (Carter & United States, 2004). Policies vary and are subject to interpretation. Furthermore, based on funding or evaluation policy, agencies may compete for resources, and there is a competitive element to doing the job better than other agencies that have shared interests and responsibility. For instance, funding for activities may be based on how many crimes are solved or specific incidents handled by a particular agency. An example of this would be formula grants, which are disseminated based on key reported activities handled by an agency. In fact, the Department of Justice alone distributed $2.396 billion of assistance to law enforcement and other agencies based on formula and competitive grant requests and other programs ("Federal Assistance from Department of Justice, FY 2008, summary," n.d.). This leads to competition for important cases and an interest in being the agency to close a particular case or handle a particular incident. As long as funding determinations are made in this manner, competition among agencies will likely continue to be an influencing factor. Under the present structure, many law enforcement agencies are put in a position of competing for statistics and resources with other agencies, because agencies from the federal to the local level must each justify their budgets to their constituencies and oversight entities. There is an interest in showing that one's own agency is the one doing the work, with more activity and responsibility correlating to more money and resources.

3.3.2 Governance
There is no clear and universal guide to what intelligence information can and cannot be shared across the federal, tribal, state and local levels. The laws and policies governing information security, dissemination and use vary across local, state, tribal and federal agencies. Where agencies do not have clear guidance on whether or not intelligence information may be shared, they may choose the safer path of not sharing, to protect themselves from liability. For instance, security clearances for intelligence information sharing, and recognition of legitimate rights to access intelligence information by local, state, tribal and federal agencies, remain processes that are not coordinated or acknowledged across agencies. The governance of law enforcement collaboration must be reevaluated. Law enforcement agencies in the United States share overlapping responsibilities and jurisdiction with no unitary command; this creates problems over control and authority in investigations, information sharing and access. They act independently in the interests of their constituencies as well as for the broader collective good.
Our approach and expectations for collaboration in this environment must be challenged to be effective in the future. Law enforcement in the US remains uniquely decentralized and does not operate under unitary command or control. Recent case studies on knowledge sharing within public sector inter-organizational networks confirm information-sharing difficulties across agencies (Jing & Pengzhu, 2007; Pardo et al., 2006a). Agencies overlap in jurisdictions and responsibility, each with a duty to its own constituencies.

4. FRAMEWORK AND IMPLEMENTATION
Based on the key influences on trusted information sharing analyzed in Section 3, we introduce our framework with two types of influences affecting whether or not sharing occurs: facilitators and detractors. This model is an offspring of Lewin's force field analysis, which is used here for looking at factors or forces influencing the decision of an individual or organization to share intelligence information (Thomas, 1985). Forces act as facilitators, driving movement toward information sharing, or as detractors, drawing momentum away from a choice to share intelligence information. Each of the factors under the headings given has the potential for facilitating or detracting from a choice to share intelligence information in a given context.

Figure 2. Influencing Intelligence Information Sharing

Facilitators include the positive influences that result from technical, social, and legal issues. Detractors include negative influences resulting from technical, social and legal rules, regulations, actions or perceptions (see Figure 2 above). As facilitators, technical issues such as compatible operating systems, software, hardware and data definitions, secure access, control, high usability and system availability can all work towards improving the potential for information sharing, but they do not cause information to be shared (Lee & Rao, 2007; Scott, 2006). Regarding technology, we may picture two young friends who tie two tin cans together on a string to communicate; it is not the technology of the cans that causes the two to talk across the string but their desire to share with each other that controls the use of the technology. It is therefore the social and cultural aspects of the relationship that matter more than the technology in the equation for information sharing. Today, the two kids from our example are texting.

Socially, greater trust and knowledge of the other parties involved lead to a greater tendency towards intelligence information sharing. This involves agency culture and personal ties or connections with other involved agencies, including shadow-network ties outside the workplace, such as family, friends or other associations through which one member has some contact or relationship with someone associated with another agency (Drake et al., 2004b; Marks & Sun, 2007). A ready example is family, friends or participation in clubs or activities that involve others apart from the work environment. These external contacts can have a positive influence on the likelihood of intelligence information sharing. Shared training and joint operations, such as the U.S. Marshals' joint fugitive round-up effort with state and local agencies in Florida, can have a positive effect on information sharing (Clark, 2008). Criticality, as described previously, can also be a very important factor influencing the sharing of intelligence information.
Information that is credible and that may result in some specific harm or loss is more readily shared, and the pressure to share this information increases where there is an approaching deadline (Lee & Rao, 2007). In the area of legal influence, having a clear and enforced agency policy regarding intelligence information sharing leads to a greater likelihood that information will be shared, as does increased knowledge of laws and regulations which allow for intelligence information sharing. System governance and participation by others is also expected to be a facilitator of intelligence information sharing where members and organizations have positive regard for and accept each other's roles (Cresswell et al., 2007; Park et al., 2001). People within agencies are more likely to participate in systems that they have choice, investment and control over.

As detractors, intelligence comes from the field or other sources to an agency, and the identified factors may negatively affect the degree to which this information is likely to be shared. Legal factors with a negative influence include security clearances that are not uniform or recognized across agencies, and laws regarding privacy, secrecy or sharing of information that are conflicting or not well understood by participants. Social issues here involve lack of trust, integrity or assurance, or an agency culture geared towards not sharing (Lee & Rao, 2007). Trust is reduced where agencies compete for statistics, media attention and funding. Informal or outside contacts, described as part of the shadow network, have great potential to exert a negative influence if the information may be potentially damaging to an entity or person. Criticality includes the timing of information and its potential impact, so where there is little urgency the pressure to share the intelligence is reduced. Where there is no identified time frame or deadline, the information may not be reacted to in a timely manner, reducing the pressure to share it. Lack of knowledge, or inaccurate knowledge, about what actions can be taken regarding sharing of information can hinder information sharing. Matters of jurisdiction, authority and governance, or struggles over power and influence, also work against sharing (Drake et al., 2004a). Technical factors act as detractors as well. Many agencies use different hardware and software for communication and information management, and these may not interoperate. Systems that are not responsive or show poor performance may not be adopted. Agencies with existing systems may not be financially able to change to more compatible or standardized systems. The costs of retraining onto new systems can be high as well, and costs for maintenance of the systems must be considered.

These factors serve as the basis and framework for investigation. Inter-relationships among the identified factors influence the degree to which information sharing is more or less likely to occur, and the outcome can be viewed as the balance of these forces. We are developing a formula and model based on this framework to describe and predict resulting conditions regarding information sharing behaviors based on knowledge of the influencing factors. Probable effects from modifying the influencing factors are more readily apparent and easier to identify using such a model.
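To illustrate how such a force field balance might be operationalized, the following is a minimal sketch in Python that scores a sharing decision as a weighted sum over the factors of Figure 1. The factor names come from the framework; the weights, scores, and the simple additive form are illustrative assumptions only, not the formula under development.

    # Illustrative sketch only: weights and scores are hypothetical assumptions,
    # not the formula the paper is developing.
    FACTORS = {
        # factor: (weight, score in [-1, 1]; > 0 facilitates, < 0 detracts)
        "interoperability": (0.8, -0.4),   # incompatible legacy systems
        "availability":     (0.5,  0.2),
        "control":          (0.6,  0.1),
        "trust":            (1.0,  0.5),   # strong inter-agency trust
        "shadow_network":   (0.4, -0.3),
        "criticality":      (0.9,  0.8),   # imminent threat to life or property
        "policy_conflict":  (0.7, -0.6),   # competition for formula grants
        "governance":       (0.6, -0.2),
    }

    def sharing_tendency(factors):
        """Net force toward sharing: positive favors sharing, negative opposes."""
        return sum(weight * score for weight, score in factors.values())

    if __name__ == "__main__":
        net = sharing_tendency(FACTORS)
        print(f"net force: {net:+.2f} -> {'share' if net > 0 else 'withhold'}")

In this additive form, strengthening a facilitator (e.g., raising the trust score) or weakening a detractor (e.g., reducing policy conflict) visibly shifts the net force, which is the kind of what-if reasoning the model is intended to support.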
The further investigation of these influences will inform the model as to the degree of influence each may have relative to the others, so that decision makers can pattern solutions towards desired ends.

5. CONCLUSIONS AND FUTURE WORK
Using the approach developed, we can visualize the effects of making changes in different areas of influence on the level of information sharing and observe the outcomes. We can see how adjustments in the degree of influence of the identified factors determine whether or not intelligence information is shared under the given circumstances. This leads to the theory that intelligence information sharing between law enforcement agencies is affected by technical, social, and legal factors comprising issues of interoperability, availability, control, trust, shadow networks, criticality, policy conflict, competition, and governance. We have identified the major areas of influence and posit that these factors work to facilitate or detract from the sharing of intelligence information between agencies. This model and theory should serve as a basic conceptual framework for further academic work and lead to further investigation and clarification of the identified factors and the degree of impact they exert on the system, so that actionable solutions can be identified and implemented.

6. REFERENCES
[1] Bulman, P. (2008). Communicating across state and county lines: The Piedmont Regional Voice over Internet Protocol Project. NIJ Journal, 261. Retrieved February 22, 2009, from http://www.ojp.usdoj.gov/nij/journals/261/piedmontvoip.htm.
[2] Bureau of Justice Statistics Law Enforcement Statistics. (2007, August 8). Retrieved May 9, 2008, from http://www.ojp.usdoj.gov/bjs/lawenf.htm.
[3] Carter, D. L., & United States. (2004). Law Enforcement Intelligence: A Guide for State, Local, and Tribal Law Enforcement Agencies. Washington, D.C.: U.S. Dept. of Justice, Office of Community Oriented Policing Services.
[4] Chakraborty, S., & Ray, I. (2006). TrustBAC: Integrating trust relationships into the RBAC model for access control in open systems. In Proceedings of the 11th ACM Symposium on Access Control Models and Technologies, Lake Tahoe, CA, June 2006.
[5] Chan, H. C., & Teo, H. (2007). Evaluating the boundary conditions of the technology acceptance model: An exploratory investigation. ACM Trans. Comput.-Hum. Interact., 14(2), 9. doi: 10.1145/1275511.1275515.
[6] Chau, M., Atabakhsh, H., Zeng, D., & Chen, H. (2002). Building an infrastructure for law enforcement information sharing and collaboration: Design issues and challenges. National Science Foundation. Retrieved February 4, 2008, from http://dlist.sir.arizona.edu/473/01/chau4.pdf.
[7] Chen, H., Zeng, D., Atabakhsh, H., Wyzga, W., & Schroeder, J. (2003). COPLINK: Managing law enforcement data and knowledge. Communications of the ACM, 46(1), 28-34.
[8] Clark, J. (2008, September 18). Remarks by Director John Clark at the Operation Orange Crush press conference. USMarshals.gov. Retrieved November 27, 2008, from http://www.usmarshals.gov/news/chron/2008/091808.htm.
[9] Cresswell, A. M., Pardo, T. A., & Hassan, S. (2007). Assessing capability for justice information sharing. In Proceedings of the 8th Annual International Conference on Digital Government Research: Bridging Disciplines & Domains (pp. 122-130). Philadelphia, Pennsylvania: Digital Government Society of North America.
Retrieved September 18, 2008, from http://portal.acm.org/citation.cfm?id=1248460.1248479&coll=ACM&dl=ACM&CFID=3249028&CFTOKEN=78511360.
[10] Drake, D., Steckler, N. A., & Koch, M. J. (2004). Information sharing in and across government agencies: The role and influence of scientist, politician, and bureaucrat subcultures. Social Science Computer Review, 22(1), 67-84. doi: 10.1177/0894439303259889.
[11] Federal Assistance from Department of Justice, FY 2008, summary. (n.d.). Retrieved February 22, 2009, from http://www.usaspending.gov/faads/faads.php?datype=T&detail=1&database=faads&fiscal_year=2008&maj_agency_cat=15.
[12] Fernández-Medina, E., & Yagüe, M. I. (2008). State of standards in the information systems security area. Computer Standards & Interfaces, 30(6), 339-340. doi: 10.1016/j.csi.2008.03.001.
[13] Gao, J. (2005). Information sharing, trust in automation, and cooperation for multi-operator multi-automation systems. Retrieved from http://proquest.umi.com/pqdweb?did=1079666781&Fmt=7&clientId=3739&RQT=309&VName=PQD.
[14] Glomseth, R., Gottschalk, P., & Solli-Saether, H. (2007). Occupational culture as determinant of knowledge sharing and performance in police investigations. International Journal of the Sociology of Law, 35(2), 96-107. doi: 10.1016/j.ijsl.2007.03.003.
[15] Ingram, P., & Lifschitz, A. (2006). Kinship in the shadow of the corporation: The interbuilder network in Clyde River shipbuilding, 1711-1990. American Sociological Review, 71, 334-352.
[16] ISAC White Paper. (2004). Vetting and trust for communication among ISACs and government entities. Retrieved October 23, 2008, from http://www.isaccouncil.org/pub/Vetting_and_Trust_013104.pdf.
[17] Ivkovic, S. K., & Shelley, T. O. (2005). The Bosnian police and police integrity: A continuing story. European Journal of Criminology, 2(4), 428-464. doi: 10.1177/1477370805056057.
[18] Jing, F., & Pengzhu, Z. (2007). A case study of G2G information sharing in the Chinese context. In Proceedings of the 8th Annual International Conference on Digital Government Research: Bridging Disciplines & Domains (pp. 234-235). Philadelphia, Pennsylvania: Digital Government Society of North America. Retrieved September 18, 2008, from http://portal.acm.org/citation.cfm?id=1248460.1248496&coll=ACM&dl=ACM&CFID=3249028&CFTOKEN=78511360.
[19] Koroma, J., Li, W., & Kazakos, D. (2003). A generalized model for network survivability. In Proceedings of the 2003 Conference on Diversity in Computing (pp. 47-51). Atlanta, Georgia, USA: ACM. doi: 10.1145/948542.948552.
[20] Koufaris, M., & Hampton-Sosa, W. (2004). The development of initial trust in an online company by new customers. Inf. Manage., 41(3), 377-397.
[21] Kulik, C. T., Bainbridge, H. T. J., & Cregan, C. (2008). Known by the company we keep: Stigma-by-association effects in the workplace. Academy of Management Review, 33(1), 216-230.
[22] Lai, V. S., & Mahapatra, R. K. (1997). Exploring the research in information technology implementation. Inf. Manage., 32(4), 187-201.
[23] Lampathaki, F., Mouzakitis, S., Gionis, G., Charalabidis, Y., & Askounis, D. (2008). Business to business interoperability: A current review of XML data integration standards. Computer Standards & Interfaces. doi: 10.1016/j.csi.2008.12.006.
[24] Lee, C. (2006). The role of trust in information sharing: A study of relationships of the interorganizational network of real property assessors in New York state.
Retrieved from http://proquest.umi.com/pqdweb?did=1288656571&Fmt=7&clientId=3739&RQT=309&VName=PQD.
[25] Lee, H. (2008). Cyber crime and challenges for crime investigation in the information era. In Intelligence and Security Informatics, 2008. ISI 2008. IEEE International Conference on (pp. xxv-xxvi). doi: 10.1109/ISI.2008.4565011.
[26] Lee, J., & Rao, H. R. (2007). Exploring the causes and effects of inter-agency information sharing systems adoption in the anti/counter-terrorism and disaster management domains. In Proceedings of the 8th Annual International Conference on Digital Government Research: Bridging Disciplines & Domains (pp. 155-163). Philadelphia, Pennsylvania: Digital Government Research Center.
[27] Xiong, L., & Liu, L. (2004). PeerTrust: Supporting reputation-based trust for peer-to-peer electronic communities. IEEE Transactions on Knowledge and Data Engineering, 16(7), 843-857. doi: 10.1109/TKDE.2004.1318566.
[28] Li, X., Hess, T. J., & Valacich, J. S. (2008). Why do we trust new technology? A study of initial trust formation with organizational information systems. The Journal of Strategic Information Systems, 17(1), 39-71. doi: 10.1016/j.jsis.2008.01.001.
[29] Lieberman, J. (2007). Confronting the terrorist threat to the homeland: Six years after 9/11. 342 Dirksen Senate Office Building, Washington, D.C.: Federal News Service. Retrieved May 8, 2008, from http://www.fas.org/irp/congress/2007_hr/091007transcript.pdf.
[30] Luna-Reyes, L. F., Andersen, D. F., Richardson, G. P., Pardo, T. A., & Cresswell, A. M. (2007). Emergence of the governance structure for information integration across governmental agencies: A system dynamics approach. In Proceedings of the 8th Annual International Conference on Digital Government Research: Bridging Disciplines & Domains (pp. 47-56). Philadelphia, Pennsylvania: Digital Government Society of North America. Retrieved October 16, 2008, from http://portal.acm.org/citation.cfm?id=1248460.1248468&coll=GUIDE&dl=GUIDE&type=series&idx=SERIES10714&part=series&WantType=Proceedings&title=AICPS&CFID=6749952&CFTOKEN=77254060#.
[31] Marks, D. E., & Sun, I. Y. (2007). The impact of 9/11 on organizational development among state and local law enforcement agencies. Journal of Contemporary Criminal Justice, 23(2), 159-173.
[32] McKay, J. (2008). Statement of John McKay, former United States Attorney for the Western District of Washington, before the Subcommittee on Intelligence, Information Sharing and Terrorism Risk Assessment, Committee on Homeland Security, United States House of Representatives. Washington, D.C. Retrieved October 18, 2008, from http://webdev.maxwell.syr.edu/insct/Research/IS%20Page/McKay%20Testimony.pdf.
[33] McKnight, D. H., Choudhury, V., & Kacmar, C. (2002). Developing and validating trust measures for e-commerce: An integrative typology. Info. Sys. Research, 13(3), 334-359.
[34] Niu, J. (2007). Circles of trust: A comparison of the size and composition of trust circles in Canada and in China. Retrieved from http://proquest.umi.com/pqdweb?did=1276413271&Fmt=7&clientId=3739&RQT=309&VName=PQD.
[35] Pardo, Cresswell, Thompson, & Zhang. (2006). Knowledge sharing in cross-boundary information system development in the public sector. Information Technology and Management, 7(4), 293-313. doi: 10.1007/s10799-006-0278-6.
[36] Park, J., Chandramohan, P., Suresh, A., & Giordano, J. (2009). Component survivability for mission-critical distributed systems. Journal of Automatic and Trusted Computing (JoATC). In press.
[37] Park, J., An, G., & Chandra, D. (2007). Trusted P2P computing environments with role-based access control (RBAC). IET Information Security, 1(1), 27-35.
[38] Park, J., Kang, M., & Froscher, J. (2001). A secure workflow system for dynamic cooperation. In M. Dupuy & P. Paradinas (Eds.), Trusted Information: The New Decade Challenge (pp. 167-182). Kluwer Academic Publishers. Proceedings of the 16th IFIP TC11 International Conference on Information Security (IFIP/SEC), Paris, France, June 11-13.
[39] Park, J., Sandhu, R., & Ahn, G. (2001). Role-based access control on the Web. ACM Transactions on Information and System Security (TISSEC), 4(1), 37-71.
[40] Park, J., Suresh, A., An, G., & Giordano, J. (2006). A framework of multiple-aspect component-testing for trusted collaboration in mission-critical systems. In Proceedings of the IEEE Workshop on Trusted Collaboration (TrustCol), Atlanta, Georgia, November 17-20. IEEE Computer Society.
[41] Ray, I., & Chakraborty, S. (2004). A vector model of trust for developing trustworthy systems. In P. Samarati et al. (Eds.), Computer Security - ESORICS 2004, Proceedings of the 9th European Symposium on Research in Computer Security, September 13-15, 2004, Sophia Antipolis, France. LNCS 3193. Springer.
[42] Razavi, M. N., & Iverson, L. (2006). A grounded theory of information sharing behavior in a personal learning space. In Proceedings of the 2006 20th Anniversary Conference on Computer Supported Cooperative Work (pp. 459-468). Banff, Alberta, Canada: ACM. doi: 10.1145/1180875.1180946.
[43] Rocco, E. (1998). Trust breaks down in electronic contexts but can be repaired by some initial face-to-face contact. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 496-502). Los Angeles, California, United States: ACM Press/Addison-Wesley Publishing Co.
[44] Ruppel, C., Underwood-Queen, L., & Harrington, S. J. (2003). e-Commerce: The roles of trust, security, and type of e-commerce involvement. e-Service Journal, 2(2), 25-45.
[45] Schooley, B. L. (2007). Inter-organizational systems analysis to improve time-critical public services: The case of mobile emergency medical services. Retrieved from http://proquest.umi.com/pqdweb?did=1390309161&Fmt=7&clientId=3739&RQT=309&VName=PQD.
[46] Schoorman, F. D., Mayer, R. C., & Davis, J. H. (2007). An integrative model of organizational trust: Past, present, and future. Academy of Management Review, 32(2), 344-354.
[47] Scott, E. D. (2006). Factors influencing user-level success in police information sharing: An examination of Florida's FINDER system. Retrieved from http://proquest.umi.com/pqdweb?did=1251886251&Fmt=7&clientId=3739&RQT=309&VName=PQD.
[48] Serra da Cruz, S., Chirigati, F., Dahis, R., Campos, M., & Mattoso, M. (2008). Using explicit control processes in distributed workflows to gather provenance. In Provenance and Annotation of Data and Processes (pp. 186-199). Retrieved February 22, 2009, from http://dx.doi.org/10.1007/978-3-540-89965-5_20.
[49] Shaw, P. (1997). Intervening in the shadow systems of organizations: Consulting from a complexity perspective. Journal of Organizational Change Management, 10(3), 235.
[50] Stacey, R. (1996). Complexity and Creativity in Organizations (1st ed., p. 312). Berrett-Koehler Publishers.
[51] Thomas, J. (1985). Force field analysis: A new way to evaluate your strategy. Long Range Planning, 18(6), 54-59. doi: 10.1016/0024-6301(85)90064-0.
[52] United States. (2007b). Building the Information Sharing Environment: Addressing the Challenges of Implementation: Hearing Before the Subcommittee on Intelligence, Information Sharing, and Terrorism Risk Assessment of the Committee on Homeland Security, U.S. House of Representatives, One Hundred Ninth Congress, Second Session, May 10, 2006 (p. 27). Washington: U.S. G.P.O.
[53] United States. (2007c). Federal Support for Homeland Security Information Sharing: Role of the Information Sharing Program Manager: Hearing Before the Subcommittee on Intelligence, Information Sharing, and Terrorism Risk Assessment of the Committee on Homeland Security, House of Representatives, One Hundred Ninth Congress, First Session, November 8, 2005 (p. 58). Washington: U.S. G.P.O. Retrieved from http://www.gpoaccess.gov/congress/index.html.
[54] U.S. Dept. of Justice, Bureau of Justice Statistics. (2006). Law Enforcement Management and Administrative Statistics (LEMAS): 2003 Sample Survey of Law Enforcement Agencies [Computer file]. ICPSR04411-v1. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [producer and distributor].
[55] Treglia, J., & Park, J. (2009). Technical, social & legal barriers to effective information sharing among sensitive organizations. In Proceedings of iConference, Chapel Hill, North Carolina, February 8-11. Poster.
[56] Whitehouse. (2007). Report of the Security Clearance Oversight Group consistent with Title III of the Intelligence Reform and Terrorism Prevention Act of 2004. Retrieved March 16, 2008, from http://www.whitehouse.gov/omb/pubpress/2007/sc_report_to_congress.pdf.

Social Networks Integration and Privacy Preservation using Subgraph Generalization
Christopher C. Yang and Xuning Tang
College of Information Science and Technology, Drexel University, 3141 Chestnut Street, Philadelphia, PA 19104
[email protected]

ABSTRACT
Intelligence and law enforcement forces make use of terrorist and criminal social networks to support their investigations, such as identifying suspects, terrorist or criminal subgroups, and their communication patterns. Social networks are valuable resources, but it is not easy to obtain the information needed to create a complete terrorist or criminal social network. Missing information in a terrorist or criminal social network always diminishes the effectiveness of investigation. An individual agency only has a partial terrorist or criminal social network due to its limited information sources. Sharing and integration of social networks between different agencies increases the effectiveness of social network analysis. Unfortunately, information sharing is usually forbidden due to the concern of privacy preservation. In this paper, we introduce a KNN algorithm for subgraph generation and a mechanism to integrate the generalized information to conduct social network analysis. Generalized information, such as the lengths of the shortest paths, the number of nodes on the boundary, and the total number of nodes, is constructed for each generalized subgraph. By utilizing the generalized information shared from other sources, an estimation of the distance between nodes is developed to compute closeness centrality. Two experiments have been conducted with random graphs and the Global Salafi Jihad terrorist social network.
The results show that the proposed technique improves the accuracy of closeness centrality measures substantially while protecting the sensitive data.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - information filtering, selection process; H.3.5 [Information Storage and Retrieval]: Online Information Services - web-based services; H.5.4 [Information Interfaces and Presentation]: Hypertext/Hypermedia - navigation

General Terms
Algorithms, Management, Performance, Experimentation.

1. INTRODUCTION
Social network analysis techniques have been widely used in intelligence and security informatics to support intelligence and law enforcement forces in identifying suspects and gateways and in extracting communication patterns of terrorist or criminal organizations. In our previous work [20],[22], we have shown how social network analysis and visualization techniques are useful in knowledge discovery of terrorist social networks. However, using partial data of a terrorist or criminal social network, social network analysis techniques are not able to extract the essential knowledge; in some cases, an inaccurate result can be obtained. For example, each law enforcement unit has its own criminal social network. Mining an incomplete criminal social network may not be able to identify the bridge between two criminal subgroups. Unfortunately, limited by privacy policy, different organizations are only allowed to share insensitive information but not their social networks. As a result, an accurate social network analysis cannot be conducted unless an integration of the social networks owned by different organizations can be made. Assuming organization P (OP) and organization Q (OQ) own the social networks GP and GQ respectively, as shown in Figure 1, without integrating GP and GQ, OP will never discover the connection between the subgroup of D, E and F and the subgroup of G and H. Similarly, OQ will never discover the connection between the subgroup of A, B and C and the subgroup of G and H. After integrating GP and GQ to obtain G, both OP and OQ will identify the connections.

[Figure 1 shows GP and GQ, each a partial network over the nodes A to H, and the integrated network G.]
Figure 1. Illustration of social networks integration
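This motivating example can be made concrete in a few lines of code. The sketch below uses hypothetical edge sets chosen to be consistent with the description of Figure 1 (the figure's exact edges are not reproduced here): GP cannot connect the {D, E, F} subgroup to {G, H}, GQ cannot connect {A, B, C} to {G, H}, and only the integrated network G reveals both connections.

    # Hypothetical edge sets consistent with the Figure 1 description;
    # the actual edges in the paper's figure may differ.
    G_P = {("A", "B"), ("B", "C"), ("C", "G"), ("G", "H"), ("D", "E"), ("E", "F")}
    G_Q = {("A", "B"), ("B", "C"), ("D", "E"), ("E", "F"), ("F", "H"), ("G", "H")}
    G = G_P | G_Q  # integration is simply the union of the partial networks

    def connected(edges, src, dst):
        """Breadth-first check whether src can reach dst over undirected edges."""
        adj = {}
        for u, v in edges:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
        frontier, seen = [src], {src}
        while frontier:
            node = frontier.pop()
            if node == dst:
                return True
            for nxt in adj.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
        return False

    print(connected(G_P, "D", "G"))  # False: O_P alone misses this connection
    print(connected(G_Q, "A", "G"))  # False: O_Q alone misses this connection
    print(connected(G, "D", "G"), connected(G, "A", "G"))  # True True after integration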
The research problem is defined as follows. Given two or more social networks (G1, G2, ...) from different organizations (O1, O2, ...), the objective is sharing the necessary information between these social networks to achieve a more accurate social network analysis and mining result while preserving the sensitive information at the same time. Each organization Oi has a piece of the social network, which is part of the whole picture: a social network G constructed by integrating all Gi. Conducting the social network analysis task on G, one can obtain the exact result. However, conducting the social network analysis task on any Gi alone, one can never achieve the exact social network analysis and mining (SNAM) result because of the missing information. By integrating Gi and some generalized information of Gj, Oi should be able to achieve a more accurate social network analysis and mining result. In this paper, we propose algorithms for social network data sharing and integration. The proposed information sharing and integration of social networks has three major components: (i) constructing the generalized subgraph, (ii) creating generalized information for sharing, and (iii) social networks integration and analysis.

1.1 Related Work
Thuraisingham [15],[16] defined assured information sharing as enforcing security and integrity policies during information sharing between organizations so that the data is integrated and mined to extract nuggets. Members or partners in a coalition conduct data sharing in a dynamic environment where they can join and leave the coalition in accordance with the policies and procedures [15]. Baird et al. [2],[3] first discussed several aspects of coalition data sharing in the Markle report. Thuraisingham [15] has further discussed these aspects, including confidentiality, privacy, trust, integrity, dissemination and others. In this work, we focus on social network data sharing and integration. Sensitive information should be protected while insensitive information can be shared. Sensitive information can also be used to generate generalized information so that privacy can be protected.

In recent years, a number of approaches for preserving the privacy of relational data have been developed. The application is mainly data publishing, so that sensitive personal information can be protected while organizations release data such as medical records, census data, customer transactions, and voter registrations. These techniques include k-anonymity [11],[13], l-diversity [8], t-closeness [6], m-invariance [19], and δ-presence [9]. A naïve approach to preserving privacy is removing attributes that uniquely identify a person, such as names and identification numbers. However, a trivial linking attack using the innocuous sets of attributes known as quasi-identifiers across multiple databases can easily identify a person. k-anonymity [11],[13] is the first attempt at privacy preservation of relational data; it ensures that, with respect to every set of quasi-identifier attributes, at least k records are indistinguishable. However, k-anonymity fails when there is a lack of diversity or other background knowledge. l-diversity [8] ensures that there are at least l well-represented values of the attributes for every set of quasi-identifier attributes. The weakness is that one can still estimate the probability of a particular sensitive value. m-invariance [19] ensures that each set of quasi-identifier attributes has at least m tuples, each with a unique set of sensitive values; there is at most 1/m confidence in determining the sensitive values. Other enhanced techniques of k-anonymity and l-diversity with personalization, such as personalized anonymity [18] and (α, k)-anonymity [17], allow users to specify the degree of privacy protection or specify a threshold α on the relative frequency of the sensitive data. In this work, we focus on the privacy preservation of social network data rather than relational data. The techniques for privacy preservation of relational data are not directly applicable to social network data because the data representations are not the same.
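To make the tabular notion concrete, the following minimal sketch checks k-anonymity over a set of quasi-identifier columns; the toy records and column names are illustrative assumptions.

    from collections import Counter

    def is_k_anonymous(records, quasi_identifiers, k):
        """k-anonymity: every quasi-identifier value combination occurs in >= k records."""
        combos = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
        return all(count >= k for count in combos.values())

    # Hypothetical generalized records: zip and age are quasi-identifiers,
    # diagnosis is the sensitive attribute.
    records = [
        {"zip": "191**", "age": "20-29", "diagnosis": "flu"},
        {"zip": "191**", "age": "20-29", "diagnosis": "asthma"},
        {"zip": "190**", "age": "30-39", "diagnosis": "flu"},
        {"zip": "190**", "age": "30-39", "diagnosis": "diabetes"},
    ]
    print(is_k_anonymous(records, ["zip", "age"], k=2))  # True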
There is relatively little research on privacy preservation of social network data (or graphs). A naïve approach is removing the identities of all nodes and revealing only the edges of a social network. In this case, the global network properties are preserved for other research applications, assuming that the identities of nodes are not of interest in those applications. This assumption holds when the objective is publishing social network data for studying the global network structure, but it is not necessarily true in the case of the social network analysis discussed in this paper. Moreover, Backstrom et al. [1] proved that it is possible to discover whether edges between specific targeted pairs of nodes exist by active or passive attacks: based on the uniqueness of small random subgraphs embedded in a social network, one can infer the identities of nodes by solving a set of restricted isomorphism problems. In order to tackle active and passive attacks and preserve the privacy of node identities in a social network, several anonymization models have been proposed in the recent literature: k-candidate anonymity [6], k-degree anonymity [8], and k-anonymity [24]. Such anonymization models are proposed to increase the difficulty of attack, based on the notion of k-anonymity in tabular data. k-candidate anonymity [6] requires that there are at least k candidates in a graph G that satisfy a given query Q. k-degree anonymity [8] requires that, for every node v in a graph G, there are at least k-1 other nodes in G that have the same degree as v. k-anonymity [24] has the strictest constraint: for every node v in a graph G, there must be at least k-1 other nodes in G whose anonymized neighborhoods are isomorphic to that of v. The technique used to achieve the above anonymities is edge or node perturbation [6],[8],[24]. By adding and/or deleting edges and/or nodes, a perturbed graph is generated to satisfy the anonymity requirement. Adversaries can only have a confidence of 1/k in discovering the identity of a node by neighborhood attacks. Since the current research on privacy preservation of social network data focuses on preserving node identities in data publishing, the anonymized social network can only be used to study global network properties and may not be applicable to the other social network analysis tasks which are the focus of this work. In addition, the sets of nodes and edges in a perturbed social network are different from those in the original social network. As reported by Zhou and Pei [24], the number of edges added can be as high as 6% of the original number of edges in a social network. A recent study [23] has investigated how edge and node perturbation can change certain network properties. Such distortion may cause significant errors in certain SNAM tasks, such as centrality measurement, even though the global properties can be maintained.
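As an illustration of the graph-side analogue, the sketch below checks the k-degree anonymity condition described above (every node's degree shared by at least k-1 other nodes); the adjacency-dictionary representation is an implementation assumption.

    from collections import Counter

    def is_k_degree_anonymous(adj, k):
        """k-degree anonymity: each degree value is shared by at least k nodes,
        i.e., every node has at least k-1 other nodes with the same degree."""
        degree_counts = Counter(len(neighbors) for neighbors in adj.values())
        return all(count >= k for count in degree_counts.values())

    # Toy graph: a 4-cycle, in which every node has degree 2.
    adj = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
    print(is_k_degree_anonymous(adj, k=2))  # True
    print(is_k_degree_anonymous(adj, k=5))  # False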
2. A Framework for Integrating Social Networks with Privacy Preservation
Assuming organization P (OP) and organization Q (OQ) have social networks GP and GQ respectively, OP needs to conduct a social network analysis and mining (SNAM) task, but GP is only a partial social network for the SNAM task. If there were no privacy concerns, one could integrate GP and GQ to generate an integrated G and obtain a better SNAM result. Due to privacy concerns, OQ cannot release GQ to OP but only shares some data with OP according to the agreed privacy policy. At the same time, OP does not need all data from OQ but only those that are critical for the SNAM task. The objectives are maximizing the information sharing that is needed for the SNAM task while conforming to the privacy policy, so that sensitive information is preserved and a more accurate SNAM result can be obtained. The information shared can be sensitive or insensitive, and useful or not useful, for the SNAM task to be conducted by the information requesting party. When we integrate social networks, we need to maximize the information sharing that is insensitive and useful for the SNAM tasks. The shared information should not include any sensitive information; however, it must be useful for improving the performance of the SNAM task conducted by the information requesting party.

[Figure 2 depicts OP, with its information needs NP and social network GP, requesting data from OQ, whose privacy policy weighs the trust degree of P against its own social network GQ; the shared data, a generalized social network of GQ (GQ'), feeds the social networks integration and the SNAM task.]
Figure 2. Framework of Social Networks Integration

We develop the privacy policy by considering the information needs based on the SNAM task, the trust degree of the information requesting party, and the information available in the organization's own social network. The privacy policy determines what data can be shared. Thuraisingham [15],[16] discussed a coalition of dynamic relational data sharing in which security and integrity policies are enforced. When we perform social network data sharing, we need to consider what kinds of nodes and edges are needed to accomplish a particular SNAM task by analyzing the network structure. The trust degree of the information requesting party determines which sensitive data is to be protected and which insensitive data is to be shared. At the same time, the information available for sharing may not yield the exact SNAM result, but our objective is to maximize accuracy so that a better result is obtained compared with the SNAM result from the original network without sharing. Using subgraph generalization, a generalized social network, GQ', is created from GQ in conformance with the privacy policy. The generalized social network only contains generalized information of GQ, without releasing any sensitive information. For instance, the generalized information can be the maximum or minimum length of the shortest paths between two subgroups, the degree of an insensitive node, the radius of a subgroup, etc. The generalized social network GQ' is then integrated with GP to support a social network analysis and mining task. Given the generalized information from GQ, it is expected that better social network analysis and mining can be achieved than by conducting the analysis and mining on GP alone.

2.1 Subgraph Generalization
Given the insensitive data in GQ, we propose a subgraph generalization approach to create a generalized social network GQ' for sharing with OP. Subgraph generalization creates a generalized version of a social network, in which a connected subgraph is transformed into a generalized node and only generalized information is presented in the generalized node. Any edge that links a node elsewhere in the network to any node of the subgraph is connected to the generalized node.
The generalized social network protects all sensitive information while releasing the crucial, non-sensitive information to the information requesting party for social network integration and the intended SNAM task. A mechanism is needed to (i) identify the subgraphs for generalization, (ii) determine the connectivity between the set of generalized nodes in the generalized social network, and (iii) construct the generalized information to be shared.

A subgraph of G = (V, E) is denoted as Gi = (Vi, Ei), where Vi ⊆ V, Ei ⊆ E, and Ei ⊆ Vi × Vi. Gi is a connected subgraph if there is a path for each pair of nodes in Gi. We only consider connected subgraphs when we conduct subgraph generalization. The generated subgraphs should be mutually exclusive and exhaustive: a node v can be part of only one subgraph, and the union of the nodes from all subgraphs should be equal to V, the original set of nodes in G.

To construct the subgraphs for generalization, we propose the K-nearest neighbor (KNN) method. Given a social network G = (V, E) with n nodes, |V| = n, K of these nodes are insensitive nodes. We divide G into K (or fewer) subgraphs Gi = (Vi, Ei), where each subgraph has at least one insensitive node. Each subgraph will also be known as a generalized node in the generalized graph G'. V = ∪i=1..K Vi. viC corresponds to the center of a subgraph Gi, and viC must also be an insensitive node in Gi. Let SPD(v, viC) be the distance of the shortest path between v and viC. When v is assigned to the subgraph Gi in subgraph generation, SPD(v, viC) must be shorter than or equal to SPD(v, vjC), where j = 1, 2, ..., K and j ≠ i. An edge exists between two generalized nodes Gi and Gj in the generalized graph G' if and only if there is an edge in G between two nodes, one from each of Gi and Gj. G' has a set of generalized nodes, G1, G2, ..., GK, and a set of edges, E'.

For simplicity, we use the graphs in Figure 3 to illustrate subgraph generation by the KNN method. G has eight nodes, including v1 and v2. If we take v1 and v2 as the insensitive nodes and create 2 subgraphs by the 2NN method, all other nodes will be assigned to one of the two subgraphs depending on their shortest distances to v1 and v2. Two subgraphs, G1 and G2, are generated as illustrated in Figure 3. This illustration is made only for simplicity; a real social network will have significantly more nodes and edges.

[Figure 3 shows the eight-node graph G and the two subgraphs G1 (around v1) and G2 (around v2) generated by 2NN.]
Figure 3. Illustration of generating subgraphs

The KNN subgraph generation algorithm is presented below:

    Input:  G = (V, E); insensitive nodes v1C, v2C, ..., vKC
    Output: subgraphs G1, ..., GK; edges E' between subgraphs
    For each node v in V:
        assign v to the subgraph Gi whose center viC minimizes SPD(v, viC)
    For each edge (u, w) in E:
        If u belongs to Gi and w belongs to Gj with i ≠ j:
            add an edge between Gi and Gj to E' (if not already present)

The KNN subgraph generation algorithm creates K subgraphs G1, G2, ..., GK from G. Each subgraph Gi has a set of nodes, Vi, and a set of edges, Ei. Edges between subgraphs, E', are also created.
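A runnable sketch of this generation step is given below. It applies the assignment rule stated above, sending each node to the insensitive center with the smallest shortest-path distance, and then collects the cross-subgraph edges into E'. The adjacency-dictionary representation and the tie-breaking by center order are implementation assumptions.

    from collections import deque

    def bfs_distances(adj, source):
        """Hop-count shortest-path distances from source to all reachable nodes."""
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return dist

    def knn_subgraphs(adj, centers):
        """Assign each node to its nearest insensitive center and build E'."""
        dists = {c: bfs_distances(adj, c) for c in centers}
        assign = {
            v: min(centers, key=lambda c: dists[c].get(v, float("inf")))
            for v in adj
        }
        generalized_edges = {
            frozenset((assign[u], assign[v]))
            for u in adj for v in adj[u]
            if assign[u] != assign[v]
        }
        return assign, generalized_edges

    # A 2NN run on a hypothetical eight-node graph in the spirit of Figure 3:
    adj = {"v1": {"a", "b"}, "a": {"v1", "c"}, "b": {"v1"}, "c": {"a", "d"},
           "d": {"c", "v2"}, "v2": {"d", "e", "f"}, "e": {"v2"}, "f": {"v2"}}
    assignment, e_prime = knn_subgraphs(adj, ["v1", "v2"])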
A generalized graph, G', is then constructed in which each generalized node corresponds to a subgraph Gi and is labeled by its insensitive node, viC. For the example in Figure 3, the resulting generalized graph is presented in Figure 4.

[Figure 4 shows the generalized graph with two generalized nodes, labeled v1 and v2, joined by an edge.]
Figure 4. Illustration of a generalized graph

When an organization shares its social network with another organization, the generalized graph is shared, but not all information within each generalized node is shared; only generalized subgraph information is shared, so that sensitive information is preserved. At the same time, the generalized information will support integration and social network analysis. In the next section, we describe the generalized subgraph information for sharing.

2.2 Generalized Subgraph Information
Given a generalized node Gi, generated by KNN subgraph generation, and its center (insensitive node) viC, we construct generalized subgraph information for the generalized node which should not reveal sensitive information, including sensitive identities and sensitive relationships. The generalized information, however, should provide useful information for social network analysis after integration. Since the social network analysis we are investigating is closeness centrality, the lengths of the shortest paths in the generalized node are useful for estimating distances between nodes within the same generalized node or across different generalized nodes. In this paper, we propose to create generalized subgraph information such as the longest and shortest lengths of the shortest paths in a subgraph, as well as the number of nodes in a subgraph and the number of nodes adjacent to another subgraph.

Let vp and vq be any two nodes in Gi, and let the length of the shortest path between vp and vq be SPD(vp, vq, Gi). By considering all the shortest paths between any two nodes in Gi, SPD(vp, vq, Gi) for all vp, vq ∈ Vi, we compute the longest length of the shortest paths between any two nodes in Gi, denoted as L_SPD(Gi), and the shortest length of the shortest paths between any two nodes in Gi, denoted as S_SPD(Gi):

L_SPD(Gi) = max{SPD(vp, vq, Gi) | vp, vq ∈ Vi}
S_SPD(Gi) = min{SPD(vp, vq, Gi) | vp, vq ∈ Vi}

The length λ of any shortest path in Gi must be smaller than or equal to L_SPD(Gi) and larger than or equal to S_SPD(Gi): S_SPD(Gi) ≤ λ ≤ L_SPD(Gi). We can also compute the probability that the length of the shortest path between any two nodes in Gi equals λ, denoted as Prob(SPD(Gi) = λ), where 0 ≤ Prob(SPD(Gi) = λ) ≤ 1.

Similarly, let the length of the shortest path between vp and the center of Gi, viC, be SPD(vp, viC, Gi). By considering all the shortest paths between any node in Gi and viC, we compute the longest and the shortest lengths of the shortest paths between any node and viC, denoted as L_SPD(viC, Gi) and S_SPD(viC, Gi):

L_SPD(viC, Gi) = max{SPD(vp, viC, Gi) | vp ∈ Vi}
S_SPD(viC, Gi) = min{SPD(vp, viC, Gi) | vp ∈ Vi}

The probability that the length of the shortest path between any node and viC equals λ, denoted as Prob(SPD(viC, Gi) = λ), can also be computed, where S_SPD(viC, Gi) ≤ λ ≤ L_SPD(viC, Gi) and 0 ≤ Prob(SPD(viC, Gi) = λ) ≤ 1.
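These statistics can be computed directly from the subgraph's shortest paths. The sketch below derives L_SPD, S_SPD and the empirical Prob distributions with breadth-first search, assuming a connected subgraph stored as an adjacency dictionary (both assumptions follow the setting above).

    from collections import Counter, deque
    from itertools import combinations

    def bfs_distances(adj, source):
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return dist

    def length_stats(lengths):
        """L_SPD, S_SPD and the empirical distribution Prob(SPD = lambda)."""
        counts = Counter(lengths)
        total = len(lengths)
        return {
            "L_SPD": max(lengths),
            "S_SPD": min(lengths),
            "Prob": {lam: cnt / total for lam, cnt in counts.items()},
        }

    def generalized_info(adj, center):
        """Shared statistics for one generalized node: pairwise and center-based."""
        dist = {v: bfs_distances(adj, v) for v in adj}  # assumes G_i is connected
        pair_lengths = [dist[u][v] for u, v in combinations(adj, 2)]
        center_lengths = [dist[center][v] for v in adj if v != center]
        return {"pairs": length_stats(pair_lengths),
                "center": length_stats(center_lengths)}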
We denote Num(Gi) as the number of nodes in Gi and Num(Gi, Gj) as the number of nodes in Gi that are adjacent to another subgraph Gj. For example, Figure 5 illustrates two subgraphs Gi and Gj for which Num(Gi) is 7 and Num(Gi, Gj) is 3.

[Figure 5 shows two adjacent subgraphs Gi and Gj; three of the seven nodes in Gi have edges crossing to Gj.]
Figure 5. Example of Num(Gi) and Num(Gi, Gj)

In summary, the generalized subgraph information for sharing includes: L_SPD(Gi), S_SPD(Gi), Prob(SPD(Gi) = λ), L_SPD(viC, Gi), S_SPD(viC, Gi), Prob(SPD(viC, Gi) = λ), Num(Gi), and Num(Gi, Gj). This generalized information does not publish the identities of any sensitive nodes but only the insensitive nodes, viC. It also provides the number of nodes in a subgraph and the number of nodes that are adjacent to other subgraphs. It does not publish any information about edges between any two sensitive nodes or between any sensitive and insensitive nodes. It only provides information about the lengths of the shortest paths based on the edges in a subgraph.

2.3 Generalized Graph Integration and Social Network Analysis
Given the generalized graph GQ' and the generalized subgraph information of GQ' from OQ, OP wants to integrate GQ' with its own graph GP to compute more accurate closeness centrality. In order to achieve this, we need to estimate the distance between any two nodes vi and vj in GP by integrating the generalized subgraph information of GQ'. To estimate the distance between two nodes vi and vj in GP, we identify the two closest insensitive nodes for each of vi and vj in GP and use the generalized information from GQ' to estimate the distances between the insensitive nodes. The shortest path between vi and vj may go through the subgraph of the closest insensitive node, the subgraph of the second closest insensitive node, or the subgraph of another, less close insensitive node. However, the chance of going through a further insensitive node is lower, and therefore we set a higher weight on the insensitive node that is closer. These insensitive nodes are also the centers of subgraphs in the generalized graph GQ' shared by OQ. In this paper, we only consider the two closest insensitive nodes, but the formulation can easily be modified to consider other insensitive nodes that are further away.

Let the closest insensitive node to vi in GP be vAC, and the second closest insensitive node to vi in GP be vA'C. We set the weights ωA and ωA' so that they are normalized and the weight for the closer insensitive node is higher. Similarly, let the closest insensitive node to vj in GP be vBC and the second closest insensitive node to vj in GP be vB'C; we set the weights ωB and ωB' in the same way. In GQ, vAC, vA'C, vBC, and vB'C are the centers of the generalized subgraphs GA, GA', GB, and GB', respectively. We estimate the distance between vi and vj, d(vi, vj), by integrating the estimated distances of the four possible paths going through these insensitive nodes in a linear combination with weights ωaωb. Let dab denote the estimated distance between vi and vj for the path going through vaC and vbC, where a can be A or A' and b can be B or B'. When a ≠ b, dab is estimated from D(Ga, vi), D(Gb, vj), and the portions contributed by the generalized nodes Gk on the shortest path between Ga and Gb in the generalized graph GQ'; when a = b, dab is estimated by D(vi, vj). D(Ga, vi) and D(Gb, vj) correspond to the portions of the estimated distance within Ga and Gb respectively, while D(Gk) is the portion of the estimated distance in each subgraph Gk that the shortest path between vi and vj passes through in the generalized graph GQ'.
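The exact weight and per-path expressions appear as equations in the original paper and are not recoverable here, so the sketch below only mirrors the stated structure: a linear combination over the four (a, b) center pairs with normalized weights that favor the closer center. The inverse-distance weighting and the `path_estimate` callback are stand-in assumptions, not the authors' formulas.

    def estimate_distance(v_i, v_j, two_closest, spd, path_estimate):
        """d(v_i, v_j) as a weighted combination over the four center pairs.

        two_closest[v]      -> (closest_center, second_closest_center) of v in G_P
        spd(v, c)           -> shortest-path distance from v to center c in G_P
        path_estimate(a, b) -> estimated length of the path through centers a and b,
                               standing in for D(G_a, v_i) + sum of D(G_k) + D(G_b, v_j)
        """
        def weights(v, c1, c2):
            # Assumed inverse-distance weights, normalized so the closer
            # center receives the higher weight.
            w1, w2 = 1.0 / (1 + spd(v, c1)), 1.0 / (1 + spd(v, c2))
            return w1 / (w1 + w2), w2 / (w1 + w2)

        (a, a2), (b, b2) = two_closest[v_i], two_closest[v_j]
        (wa, wa2), (wb, wb2) = weights(v_i, a, a2), weights(v_j, b, b2)
        return sum(
            w_a * w_b * path_estimate(ca, cb)
            for ca, w_a in ((a, wa), (a2, wa2))
            for cb, w_b in ((b, wb), (b2, wb2))
        )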
If vi is not the same as vaC, D(Ga, vi) is computed from the shared path-length probabilities and the percentage of nodes in Ga that are adjacent to the subgraph immediately following Ga in the shortest path between vi and vj in the generalized graph GQ'. If vi is the same as vaC, D(Ga, vi) is computed from the probabilities Prob(SPD(vaC, Ga) = λ). Computation of D(Gb, vj) is done similarly. The portion D(Gk) is computed from the probabilities Prob(SPD(Gk) = λ) and the percentage of nodes in Ga acting as gatekeepers, that is, adjacent to Gk, where Gk is the subgraph immediately following Ga in the shortest path between vi and vj in the generalized graph GQ'. D(vi, vj) corresponds to the estimated distance between vi and vj when both vi and vj are nodes of the same subgraph.

Using the estimated distance between any two nodes vi and vj, with the shared information from GQ', we compute the closeness centrality as follows:

closeness_centrality(vi) = (n - 1) / Σ(j=1..n, j≠i) d(vi, vj)

where n is the total number of nodes in GP.

3. Experiment
We have conducted two experiments to evaluate the effectiveness of our proposed techniques. The objective is conducting social network analysis based on the incomplete information of one organization (GP) and the shared information from another organization (GQ).

3.1 Experiment 1 - Random Graphs
In this experiment, we assume GQ has the complete information and evaluate how integrating the generalized information from GQ with GP can help us conduct social network analysis as accurately as if we had the complete information.

3.1.1 Datasets
We generated a series of random graphs as our datasets in this experiment. As discussed in [19],[21], terrorist and criminal social networks usually consist of some clusters, which represent potential gangs. In other words, a terrorist or criminal social network is not a graph with nodes randomly connected to one another with the same probability. To simulate real-world data, we constructed random graphs by generating a random number of clusters and connecting these clusters randomly. Each cluster is generated by creating an edge randomly between any two nodes with a probability of p. To control the size of the clusters, we set the upper limit for each cluster size at 40% of the random graph size. The generated random graphs are considered social networks with complete information (GQ). GP is then generated by randomly removing edges from GQ, since GP only has partial information of the social network. In the experiment, we create a generalized graph from GQ and conduct social network analysis by integrating GP and the generalized graph of GQ.

3.1.2 Evaluation
In this experiment, closeness centrality is utilized as the social network analysis task. The closeness centrality computed from GQ is taken as the benchmark, since GQ has the complete information. The closeness centrality computed from GP is considered the worst case, since GP only has partial information. The closeness centrality computed by our proposed subgraph generalization and social network integration techniques is compared with the benchmark. It is also compared with the worst case to determine how much the proposed techniques reduce the errors. The error is measured by the difference between the closeness centrality obtained from our proposed technique (or from GP) and the benchmark, divided by the closeness centrality obtained from GQ.
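A direct transcription of this closeness measure, together with the relative error used for evaluation, is sketched below. `dist` may be the true shortest-path distance or the estimate from Section 2.3; taking the absolute difference in the error is an assumption about the intended sign convention.

    def closeness_centrality(nodes, dist, v_i):
        """closeness(v_i) = (n - 1) / sum over j != i of d(v_i, v_j)."""
        total = sum(dist(v_i, v_j) for v_j in nodes if v_j != v_i)
        return (len(nodes) - 1) / total

    def relative_error(measured, benchmark):
        """Error of E or G_P against the benchmark closeness computed from G_Q."""
        return abs(measured - benchmark) / benchmark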
The parameters are Size (the size of G_Q in terms of number of nodes) and Similarity (the similarity between G_Q and G_P). The cluster size (the upper limit of cluster size) is set at 40% of Size. The average closeness centrality and average error reported in the experiment are computed from ten sets of random graphs.

3.1.3 Experimental Results

Table 1 and Figure 6 present the average closeness centrality computed from the estimation based on the proposed subgraph generalization and social network integration techniques (E), from G_Q (benchmark), and from G_P (worst case) for different Size. Table 2 and Figure 7 present the average error of E and G_P for different Size. Similarity is set at 50% when we evaluate the impact of Size.

Table 1. Average closeness centrality of E, G_Q and G_P with different Size (Similarity = 50%)

  Size    200     100     50      20      10
  E       0.365   0.353   0.276   0.31    0.349
  G_Q     0.402   0.39    0.32    0.315   0.379
  G_P     0.315   0.286   0.207   0.248   0.33

Figure 6. Average closeness centrality of E, G_Q and G_P with different Size

It is found that the error increases when Size decreases from 200 to 50 and then decreases when Size decreases from 50 to 10, for both E and G_P. However, the error of E is consistently and substantially lower than that of G_P. The largest difference in error, 0.213, is observed when Size is 50. When Size is only 10, the difference in error is 0.061.

Table 2. Average error in closeness centrality of E and G_P with different Size (Similarity = 50%)

  Size    200     100     50      20      10
  E       0.086   0.096   0.135   0.08    0.07
  G_P     0.216   0.267   0.356   0.209   0.131

Figure 7. Average error in closeness centrality of E and G_P with different Size

Table 3 and Figure 8 present the average closeness centrality of E, G_Q, and G_P for different Similarity. Table 4 and Figure 9 present the average error of E and G_P for different Similarity. Size is set at 50 when we evaluate the impact of Similarity.

Table 3. Average closeness centrality of E, G_Q and G_P with different Similarity (Size = 50)

  Similarity  40%     50%     60%     70%     80%     90%
  E           0.29    0.276   0.278   0.287   0.287   0.292
  G_Q         0.343   0.32    0.327   0.334   0.339   0.33
  G_P         0.203   0.207   0.245   0.275   0.296   0.318

Figure 8. Average closeness centrality of E, G_Q and G_P with different Similarity

It is found that the error decreases consistently from 0.426 to 0.044 for G_P when Similarity increases from 40% to 90%. Since G_P becomes more similar to G_Q as Similarity increases, the error should reduce to 0 when Similarity is 100%. For E, however, the error stays at around 0.15 as Similarity increases from 40% to 90% and does not vary considerably. This means the proposed technique of integrating generalized information produces reasonable performance, with an error of 0.163, even when Similarity is as low as 40%; on the other hand, the performance does not improve much as Similarity increases. Because the proposed technique still uses the generalized information rather than the actual data as Similarity increases, its performance eventually becomes worse than measuring closeness centrality with G_P alone. However, this only happens when Similarity is over 80%, which is very unlikely in reality.
Table 4. Average error in closeness centrality of E and G_P with different Similarity (Size = 50)

  Similarity  40%     50%     60%     70%     80%     90%
  E           0.163   0.135   0.148   0.144   0.152   0.159
  G_P         0.426   0.356   0.249   0.183   0.129   0.044

Figure 9. Average error in closeness centrality of E and G_P with different Similarity

3.2 Experiment 2: Global Salafi Jihad Terrorist Social Network

3.2.1 Dataset

In this experiment, we used the Global Salafi Jihad terrorist social network [11], [22] as our data source to create the incomplete social networks G_P and G_Q. The Global Salafi Jihad terrorist social network has 366 nodes and 1,275 links. There are four major clusters: Central Staff of al Qaeda (CSQ), Core Arab (CA), Southeast Asia (SA), and Maghreb Arab (MA). These clusters are connected in the Global Salafi Jihad terrorist social network. To simulate the real-world situation in which every agency is well informed about one or two terrorist clusters but less familiar with the others, we randomly removed nodes from each cluster with different percentages and then randomly removed edges from the remaining subgraph. First, we generated G_P by randomly removing 30%, 30%, 70%, and 70% of the nodes from CSQ, CA, SA, and MA respectively. Similarly, we generated G_Q by randomly removing 70%, 70%, 30%, and 30% of the nodes from CSQ, CA, SA, and MA respectively. An edge with both of its end nodes removed was also removed. We further removed K% of the edges from G_P and G_Q. Ten pairs of G_P and G_Q were generated for each K.

3.2.2 Evaluation

In this experiment, we tested the impact of K on the performance of the proposed techniques. We compared the average closeness centrality and average error obtained from the proposed techniques (E) with those obtained from the complete Global Salafi Jihad terrorist social network (G), and with those obtained only from G_P (the worst case).

3.2.3 Experimental Results

Table 5 and Figure 10 present the average closeness centrality of E, G, and G_P for different K. Table 6 and Figure 11 present the average error in closeness centrality of E and G_P for different K. It is found that the error for E decreases from 0.302 to 0.206 when K decreases from 50 to 30. The error for G_P decreases from 0.378 to 0.274 when K decreases from 50 to 30. The average error of E is consistently lower than that of G_P by a substantial amount. When K = 50, 40, and 30, the error of E is reduced from the error of G_P by 20%, 23%, and 25% respectively. This shows that more accurate closeness centrality can be obtained by integrating the generalized information from G_Q with G_P.

Table 5. Average closeness centrality of E, G and G_P with different K

  K      50      40      30
  E      0.168   0.183   0.191
  G      0.241   0.241   0.241
  G_P    0.149   0.166   0.175

Figure 10. Average closeness centrality of E, G and G_P with different K

Table 6. Average error in closeness centrality of E and G_P with different K

  K      50      40      30
  E      0.302   0.240   0.206
  G_P    0.378   0.312   0.274

Figure 11. Average error in closeness centrality of E and G_P with different K
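As a rough illustration of the Experiment 2 data preparation described in Section 3.2.1, the sketch below removes a per-cluster fraction of nodes, keeps only edges whose endpoints both survive, and then removes a further K% of the surviving edges. The function and variable names are ours, not the authors'.

    import random

    def make_partial_graph(nodes_by_cluster, edges, removal_frac, k_percent):
        # e.g. removal_frac = {"CSQ": 0.3, "CA": 0.3, "SA": 0.7, "MA": 0.7}
        # when building G_P from the complete network.
        removed = set()
        for cluster, frac in removal_frac.items():
            members = nodes_by_cluster[cluster]
            removed.update(random.sample(members, int(frac * len(members))))
        # Keep only edges whose endpoints both survive the node removal.
        surviving = [e for e in edges
                     if e[0] not in removed and e[1] not in removed]
        # Remove a further K% of the surviving edges.
        keep = int(len(surviving) * (1 - k_percent / 100.0))
        return random.sample(surviving, keep)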
4. Conclusion

Social network analysis is very useful in investigating terrorist and criminal communication patterns and the structure of their organizations. However, most law enforcement and intelligence agents only have a small piece of the information, without integration with the information from other agents. Due to privacy issues, information sharing is not always possible. In this paper, we propose constructing generalized graphs before sharing a social network with other parties. The generalized graph is then integrated with the social network owned by the agent to conduct social network analysis such as closeness centrality. Our experiments show that our proposed technique improves the closeness centrality measurements substantially. In future work, we shall investigate other subgraph generation and integration techniques and conduct further experiments with terrorist social networks.

5. REFERENCES
[1] L. Backstrom, C. Dwork, and J. Kleinberg, "Wherefore Art Thou R3579X? Anonymized Social Networks, Hidden Patterns, and Structural Steganography," in WWW'07, Banff, Alberta, Canada, 2007.
[2] Z. Baird, J. Barksdale, and M. Vatis, Creating a Trusted Network for Homeland Security, Markle Foundation, 2003.
[3] Z. Baird and J. Barksdale, Mobilizing Information to Prevent Terrorism: Accelerating Development of a Trusted Information Sharing Environment, Markle Foundation, 2006.
[4] K. Caruson, S. A. Macmanus, M. Khoen, and T. A. Watson, "Homeland Security Preparedness: The Rebirth of Regionalism," Publius, 35(1), 2005, pp. 143-189.
[5] R. R. Friedmann and W. J. Cannon, "Homeland Security and Community Policing: Competing or Complementing Public Safety Policies," Journal of Homeland Security and Emergency Management, 4(4), 2005, pp. 1-20.
[6] M. Hay, G. Miklau, D. Jensen, P. Weis, and S. Srivastava, "Anonymizing Social Networks," Technical Report 07-19, University of Massachusetts, Amherst, 2007.
[7] N. Li and T. Li, "t-Closeness: Privacy Beyond k-Anonymity and l-Diversity," in ICDE'07, 2007.
[8] K. Liu and E. Terzi, "Towards Identity Anonymization on Graphs," in ACM SIGMOD'08, Vancouver, BC, Canada: ACM Press, 2008.
[9] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam, "l-Diversity: Privacy Beyond k-Anonymity," Proceedings of the 22nd International Conference on Data Engineering, 2006.
[10] M. E. Nergiz, M. Atzori, and C. Clifton, "Hiding the Presence of Individuals from Shared Databases," in SIGMOD'07, 2007.
[11] M. Sageman, Understanding Terror Networks, University of Pennsylvania Press, 2004.
[12] P. Samarati, "Protecting Respondents' Identities in Microdata Release," IEEE Transactions on Knowledge and Data Engineering, 2001.
[13] L. Sweeney, "k-Anonymity: A Model for Protecting Privacy," International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5), 2002, pp. 557-570.
[14] D. Thacher, "The Local Role in Homeland Security," Law & Society, 39(3), 2005, pp. 557-570.
[15] B. Thuraisingham, "Security Issues for Federated Database Systems," Computers and Security, North Holland, December 1994.
[16] B. Thuraisingham, "Assured Information Sharing: Technologies, Challenges and Directions," Chapter 1 in Intelligence and Security Informatics: Applications and Techniques, H. Chen and C. C. Yang (Eds.), Springer-Verlag, to appear in 2008.
[17] R. C. Wong, J. Li, A. Fu, and K. Wang, "(α,k)-Anonymity: An Enhanced k-Anonymity Model for Privacy-Preserving Data Publishing," Proceedings of SIGKDD, August 20-23, Philadelphia, Pennsylvania, US, 2006.
[18] X. Xiao and Y. Tao, "Personalized Privacy Preservation," Proceedings of SIGMOD, June 27-29, Chicago, Illinois, 2006.
[19] X. Xiao and Y.
Tao, "m-invariance: Towards Privacy Preserving Re-publication of Dynamic Datasets," in ACM SIGMO D '07: ACM Press, 2007. [20] LH%LH%^1.0$%ZH%V,?$%1.+%_H%B10-*1.$%@O.1=<2,.0%/5-% E-(()(,3/%B)8,1=%Z-/U)(N3%U,/5%M,3?1=,21/,).%E))=3$G% Proceedings of the IEEE International Conference on Intelligence and Security Informatics, San Diego, CA, US, May 23 ! 24, 2006. [21] LH%LH%^1.0$%@Q.')(*1/,).%B51(,.0%1.+%C(,918<%C()/-8/,).%)'% E-(()(,3/%)(%L(,*,.1=%B)8,1=%Z-/U)(N3$G% Proceeding of IE E E International Conference on Intelligence and Security Informatics, Taipei, Taiwan, 2008. [22] LH%LH%^1.0%1.+%_H%B10-*1.$%@O.1=<3,3%)'%E-(()(,3/%B)8,1=% Z-/U)(N3%U,/5%I(18/1=%M,-U3$G%Journal of Information Science, accepted for publication. [23] X. Ying and X. Wu, "Randomizing Social Networks: A Spectrum Preserving Approach," in SIAM International Conference on Data Mining (S DM'08) Atlanta, GA, 2008 [24] B. Zhou and J. Pei, "Preserving Privacy in Social Networks against Neighborhood Attacks," in IE E E International Conference on Data Engineering, 2008. DESIGN OF A TEMPORAL GEOSOCIAL SEMANTIC WEB FOR MILITARY STABILIZATION AND RECONSTRUCTION OPERATIONS BHAVANI THURAISINGHAM bhavani.thuraisingha [email protected] LATIFUR KHAN [email protected] ABSTRACT The United States and its Allied Forces have had tremendous success in combat operations. This includes combat in Germany, Japan and more recently in Iraq and Afghanistan. However not all of our stabilization and reconstruction operations (SARO) have been as successful. Recently several studies have been carried out on SARO by National Defense University as well as for the Army Science and Technology. One of the major conclusions is that we need to plan for SARO while we are planning for combat. That is, we cannot start planning for SARO after the enemy regime has fallen. In addition, the studies have shown that security, power and jobs are key ingredients for success during SARO. It is important to give positions to some of the power players from the fallen regime provided they are trustworthy. It is critical that investments are made to stimulate the local economies. The studies have also analyzed the various technologies that are needed for successfully carrying out SARO which includes sensors, robotics and information management. In this project we will focus on the information management component for SARO. As stated in the work by the Naval Postgraduate School, we need to determine the social, political and economic relationships between the local communities as well as determine who the important people are. This work has also identified the 5Ws (Who, When, What, Where and Why) and the (H). To address the key technical challenges for SARO, we are defining a Life cycle for SARO and subsequently developing a Temporal Geosocial Service Oriented Architecture System (TGSSOA) that utilizes Temporal Geosocial Semantic Web (TGS-SW) technologies for managing this lifecycle. We are developing techniques for representing temporal geosocial information and relationships, integrating such information and relationships, querying such information and relationships and finally reasoning about such information and relationships so that the commander can answer questions related to the 5Ws and H. 
To our knowledge, this is the first attempt to develop TGS-SW technologies as well as lifecycle management for SARO.

Categories and Subject Descriptors
H.2.0 [Information Systems]: General – Security, integrity, and protection.
General Terms
Security.
Keywords
SARO, SAROL

1. INTRODUCTION
According to a GAO Report published in October 2007 [13], “DOD has taken several positive steps to improve its ability to conduct stability operations but faces challenges in developing capabilities and measures of effectiveness, integrating the contributions of non-DOD agencies into military contingency plans, and incorporating lessons learned from past operations into future plans. These challenges, if not addressed, may hinder DOD’s ability to fully coordinate and integrate stabilization and reconstruction activities with other agencies or to develop the full range of capabilities those operations may require.” Around the same time, the Center for Technology and National Security Policy at NDU [4] and the Naval Postgraduate School [16] identified some key technologies crucial for the military stabilization and reconstruction processes in Iraq and Afghanistan. These technologies include those in electronics, sensors, and medicine, as well as in information management. As illustrated in Figure 1-1 (duplicated from [16]), NPS has identified three types of reconstruction efforts: the first, which they classify as easy, includes activities such as building bridges and schools. The second, which they identify as sensitive, is developing policies for governance. The third, which they identify as hard, is understanding the cultures of the people and the warlords, engaging in negotiation and getting their buy-in, as well as sustaining long-term security. In addition, the military needs to get information about the locations of the fiefdoms of the warlords, the security breaches (e.g., IEDs) at the various locations, as well as associations between the different groups. We are addressing the difficult challenges of military stabilization (as stated in Figure 1-1) by developing innovative information management technologies.

[Figure 1-1 groups SARO activities from easy (build schools, wells, clinics; distribute medical supplies; build agriculture systems) through sensitive (counter narcotics; support elections; support DDR; implement job programs) to difficult (influence warlords; mitigate conflict; support ANA and ANP; improve human rights; foster a sustainable economy; promote stable democracy), progressing from “Reconstruction and Development” through “Rule of Law”/improved governance to “Enduring Security” and building self-sufficiency.]
Figure 1-1.
Stabilization and Reconstruction Operations, duplicated from (Guttieri, 2007a)

In particular, we are designing and developing a temporal geosocial semantic web that will integrate heterogeneous information sources, identify the relationships between the different groups of people that evolve over time and location, and facilitate sharing of the different types of information to support SARO. Our goal is to get the right information to the decision maker so that he/she can make decisions in the midst of uncertain and unanticipated situations. The organization of this paper is as follows. Some of the unique challenges will be discussed in Section 2. Supporting technologies we are utilizing are discussed in Section 3. Our approach to designing and developing the system for SARO is discussed in Section 4. The paper is concluded in Section 5.

2. UNIQUE CHALLENGES FOR A SUCCESSFUL SARO
This section discusses the technical challenges, our capabilities related to those challenges, and our strategy to solve the problem.

2.1 Ingredients for a Successful SARO
Recently several studies on SARO have emerged. Notable among them is the study carried out by NDU [4]. In this study the authors give examples of several successful and failed SAROs. The authors state that the Iraq war was hugely successful in that the US and Allied forces were able to defeat the enemy in record time. However, the US and the Allied forces were not prepared for SARO and, subsequently, nation building. For SARO to be successful, its planning should be carried out concurrently with the planning of the war. This means that as soon as the war ends, plans are already in place to carry out SARO. Sometimes the latter part of the war may be carried out in conjunction with SARO. The authors discuss the various SAROs that the US has engaged in, including in Germany, Japan, Somalia and the Balkans, and describe the successes of SAROs like Germany and Japan and the failure of the SARO in Somalia. The authors also discuss why Field Marshall Montgomery’s mission for regime change in Egypt in 1956 failed. This was because the then Prime Minister Anthony Eden did not plan for what happens after the regime change. As a result the operation was a failure. However, overthrowing communism in Malaya in the 1950s was a huge success for Field Marshall Sir Gerald Templer because he got the buy-in of the locals – the Malayans, Chinese and Indians. He also gave them the impression that Britain was not intending to stay long in Malaya. As a result the locals gave him their support, and together they were able to overthrow communism. Based on the above examples, the authors state that four concurrent tasks have to be carried out in parallel. They are the following: (i) Security: Ensure that those who attempt to destroy the emergence of a new society are suppressed. This includes identifying who the troublemakers or terrorists are and destroying their capabilities. (ii) Law and order: Military and police skills are combined to ensure that there are no malicious efforts to disturb the peace. (iii) Repair infrastructure: Utilize the expertise of engineers and geographers, both from allied countries and local people, to build the infrastructure. (iv) Establish an interim government effectively: Understand the cultures of the local people, their religious beliefs and their political connections, and establish a government. The authors also state that there are three key elements to success: (i) security, (ii) power and (iii) jobs. We have already explained the importance of security. Power has to be given to key people.
For example, those who were in powerful positions in the fallen regime may make alliances with the terrorists. Therefore such people have to be carefully studied and given power only if appropriate. Usually after a regime change people are left homeless and without jobs. Therefore it is important to give incentives to foreign nations to invest in the local country and create jobs. This means forming partnerships with the locals as well as with foreign investors. To end this section we quote General Anthony Zinni, USMC (Ret.), former Commander, US Central Command: “What I need to understand is how these societies function. What makes them tick? Who makes the decisions? What is it about their society that is so remarkably different in their values and the way they think in my western white man mentality?” Essentially he states that what is crucial is getting cultural intelligence. In fact our project proposes solutions to precisely capture the cultural and political relationships and information about the locals, model these relationships, and exploit these relationships for SARO.

2.2 Unique Technology Challenges for SARO
A technology analysis study carried out by Army Science and Technology states that many of the technologies for SARO are the same as those for combat operations [5]. However, the study also states that there are several unique challenges for SARO and elaborates as follows: “The nature of S&R operations also demands a wider focus for ISR. In addition to a continuing need for enemy information (e.g., early detection of an insurgency), S&R operations require a broad range of essential information that is not emphasized in combat operations. A broader, more specific set of information needs must be specified in the Commander’s Critical Information Requirements (CCIR) portfolio. CCIR about areas to be stabilized include the nature of pre-conflict policing and crime, influential social networks, religious groups, and political affiliations, financial and economic systems as well as key institutions and how they work are other critical elements of information. Mapping these systems for specific urban areas will be tedious though much data is already resident in disparate databases.” The challenges involved in computational modeling of culturally infused social networks for SARO are discussed in [28].

3. SUPPORTING TECHNOLOGIES FOR SARO
We are integrating several of our existing technologies and building new technologies for SARO. We describe the supporting (i.e., existing) technologies we are using in this section and the new technologies in the next.

3.1 Geospatial Semantic Web, Police Blotter, Knowledge Discovery
We have developed a geospatial semantic web and information integration and mining technologies. In particular, we have developed techniques for (i) matching or aligning ontologies, (ii) a police blotter prototype demonstrated at GEOINT, and (iii) GRDF and geospatial ontologies for emergency response.

Matching or Aligning Ontologies: In open and evolving systems various parties continue to adopt differing ontologies [20], with the result that instead of reducing heterogeneity, the problem becomes compounded. Matching, or aligning, ontologies is a key interoperability enabler for the semantic web. In this case, ontologies are taken as input and correspondences between the semantically related entities of these ontologies are determined as output, along with any transformations required for alignment. It is helpful to have the ontologies that need to be matched refer to the same upper ontology or conform to the same reference ontology [12, 25]. A significant number of works have been done from a database perspective [9, 27, 29] and from a machine learning perspective [10, 11, 19, 29]. Our approach falls into the latter perspective. However, it focuses not only on traditional data (i.e., text); it also goes beyond text to geospatial data. The complex nature of this geospatial data poses further challenges and offers additional clues in the matching process. Given a set of data sources, S1, S2, …, each of which is represented by data model ontologies O1, O2, …, the goal is to find similar concepts between two ontologies (namely, O1 and O2) by examining their respective structural properties and instances, if available. For the purposes of this problem, O1 and O2 may belong to different domains drawn from any existing knowledge domain. Additionally, these ontologies may vary in breadth, depth, and the types of relationships between their constituent concepts. The challenges involved in the alignment of these ontologies, assuming that they have already been constructed, include the proper assignment of instances to concepts in their identifying ontology. Quantifying the similarity between a concept from the O1 ontology (called concept A) and a concept from the O2 ontology (called concept B) involves computing measures taking into account three separate types of concept similarity: name similarity, relationship similarity and content similarity. Matching is accomplished through determining the weighted similarity between the name similarity and content similarity along with their associated instances. Name similarity between A and B exploits name matching with the help of WordNet and the Jaro-Winkler string metric. Relationship similarity between A and B takes into account the number of equivalent spatial relationships, along with their sibling similarity and parent similarity.

Content Similarity: We describe two algorithms we have developed for content similarity. The first is an extension of the ideas presented in [8] regarding the use of N-grams extracted from the values of the compared attributes. Despite the utility of this algorithm, there are situations when this approach produces deceiving results. To resolve these difficulties, we present a second instance matching algorithm, called K-means with Normalized Google Distance (KM-NGD), which determines the semantic similarity between the values of the compared attributes by leveraging K-medoid clustering and a measure known as the Normalized Google Distance.

Content Similarity Using N-grams: Instance matching between two concepts involves measuring the similarity between the instance values across all pairs of compared attributes. This is accomplished by extracting instance values from the compared attributes, subsequently extracting a characteristic set of N-grams from these instances, and finally comparing the respective N-grams for each attribute. In the following, we use the term “value type” to refer to a unique value of an attribute involved in the comparison. We extract distinct 2-grams from the instances and consider each unique 2-gram extracted as a value type. As an example, for the string "Locust Grove Dr." that might appear under an attribute named Street for a given concept, some N-grams that would be extracted are 'Lo', 'oc', 'cu', 'st', 't ', 'ov', 'Dr' and 'r.'
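A one-line sketch of this 2-gram extraction (ours, not the authors' code):

    def two_grams(value):
        # "Locust Grove Dr." -> ['Lo', 'oc', 'cu', 'st', 't ', ' G', ...]
        return [value[i:i + 2] for i in range(len(value) - 1)]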
N-gram similarity is based on a comparison between the concepts of entropy and conditional entropy known as the Entropy Based Distribution (EBD):

EBD = H(C | T) / H(C)    (1)

In this equation, C and T are random variables, where C indicates the union of the attribute types C1 and C2 involved in the comparison (C indicates "column", which we use synonymously with the term “attribute”) and T indicates the value type, which in this case is a distinct N-gram. EBD is a normalized value with a range from 0 to 1. Our experiments involve 1:1 comparisons between attributes of compared concepts, so the value of C is simply C1 U C2. H(C) represents the entropy of a group of value types for a particular column (or attribute), while H(C|T) indicates the conditional entropy of a group of identical value types. Intuitively, an attribute has high entropy if it is impure; that is, the ratios of the value types making up the attribute values are similar to one another. On the other hand, low entropy in an attribute exists when one value type occurs at a much higher ratio than any other type. Conditional entropy is similar to entropy in the sense that ratios of value types are being compared. However, the difference is that we are computing the ratio between identical value types extracted from different attributes. Figures 2-1a and 2-1b provide examples to help visualize the concept. In both examples, crosses indicate value types originating from C1, while squares indicate value types originating from C2. The collection of a given value type is represented as a cluster (larger circle). In Figure 2-1a, the total number of crosses is 10 and the total number of squares is 11, which implies that the entropy is very high. The conditional entropy is also high, since the ratios of crosses to squares within two of the clusters are equal and nearly equal within the other. Thus, the ratio of conditional entropy to entropy will be very close to 1, since the ratio of crosses to squares is nearly the same from an overall value type perspective and from an individual value type perspective. Figure 2-1b portrays a different situation: while the entropy is 1.0, the ratio of crosses to squares within each individual cluster varies considerably. One cluster features all crosses and no squares, while another cluster features a 3:1 ratio of squares to crosses. The EBD value for this example will consequently be lower than the EBD for the first example, because H(C|T) will be lower.

Figures 2-1a and 2-1b. Distribution of different value types when EBD is high (2-1a) and low (2-1b). H(C) is similar to H(C|T) in 2-1a but dissimilar in 2-1b.
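The EBD of Eq. (1) can be computed directly from the two bags of value types. The sketch below treats each 2-gram occurrence as an observation labeled with its source column and returns H(C|T)/H(C); this is our reading of the definition, not the authors' implementation.

    from collections import Counter
    from math import log2

    def ebd(grams_c1, grams_c2):
        # Each observation is (value type, source column); EBD = H(C|T)/H(C).
        pairs = [(t, 1) for t in grams_c1] + [(t, 2) for t in grams_c2]
        n = len(pairs)

        def entropy(counts, total):
            return -sum((k / total) * log2(k / total) for k in counts if k)

        # H(C): entropy of the column label over all observations
        h_c = entropy(Counter(c for _, c in pairs).values(), n)
        # H(C|T): column entropy within each value type, weighted by frequency
        by_type = Counter(t for t, _ in pairs)
        h_c_t = 0.0
        for t, nt in by_type.items():
            within = Counter(c for tt, c in pairs if tt == t)
            h_c_t += (nt / n) * entropy(within.values(), nt)
        return h_c_t / h_c if h_c else 0.0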
Problems of the N-Gram Approach for Instance Matching: Despite the utility of the aforementioned method, it is susceptible to misleading results. For example, if an attribute named 'City' associated with a concept from O1 is compared against an attribute named 'ctyName' associated with a concept from O2, the attribute values for both concepts might consist of city names from different parts of the world. 'City' might contain the names of North American cities, all of which use English and other Western languages as their basis language, while 'ctyName' might describe East Asian cities, all of which use languages that are fundamentally different from English or any Western language. According to human intuition, it is obvious that the comparison occurs between two semantically similar attributes. However, because of the tendency for languages to emphasize certain sounds and letters over others, the extracted sets of 2-grams from each attribute would very likely be quite different from one another. For example, some values of 'City' might be "Dallas", "Houston" and "Halifax", while values of 'ctyName' might be "Shanghai", "Beijing" and "Tokyo". Based on these values alone, there is virtually no overlap of N-grams. Because most of the 2-grams belong specifically to one attribute or the other, the calculated EBD value would be low. This would most likely be a problem every time global data needed to be compared for similarity.

Using Clustering and Semantic Distance for Content Similarity: To overcome the problems of the N-gram approach, we have developed a method that is free from the syntactic requirements of N-grams and uses the keywords in the data to extract relevant semantic differences between compared attributes. This method, known as KM-NGD, extracts distinct keywords from the compared attributes and places them into distinct semantic clusters via the K-medoid algorithm, where the distance metric between each pair of distinct data points in a given cluster (a data point is represented as an occurrence of one of the distinct keywords) is known as the Normalized Google Distance (NGD). The EBD is then calculated by comparing the words contained in each cluster, where a cluster is considered a distinct value type.

Normalized Google Distance: Before describing the process in detail, NGD must be formally defined:

NGD(x, y) = (max{log f(x), log f(y)} − log f(x, y)) / (log M − min{log f(x), log f(y)})    (2)

In this formula, f(x) is the number of Google hits for search term x, f(y) is the number of Google hits for search term y, f(x, y) is the number of Google hits for the tuple of search terms xy, and M is the number of web pages indexed by Google. NGD(x, y) is a measure of the symmetric conditional probability of co-occurrence of x and y. In other words, given that term x appears on a web page, NGD(x, y) yields a value indicating the probability that term y also appears on that same web page. Conversely, given that term y appears on a web page, NGD(x, y) yields a value indicating the probability that term x also appears on that page. Once the keyword list for a given attribute comparison has been created, all related keywords are grouped into distinct semantic clusters. From here, we calculate the conditional entropy of each cluster using the number of occurrences of each keyword in the cluster, which is subsequently used in the final EBD calculation between the two attributes. The clustering algorithm used is the K-medoid algorithm.
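Eq. (2) translates directly into code once the hit counts are known; the sketch below assumes the counts f(x), f(y), f(x,y) and the index size M are supplied by the caller (the paper obtains them from Google search).

    from math import log

    def ngd(f_x, f_y, f_xy, m):
        # Eq. (2): (max(log f(x), log f(y)) - log f(x,y)) /
        #          (log M - min(log f(x), log f(y)))
        lx, ly = log(f_x), log(f_y)
        return (max(lx, ly) - log(f_xy)) / (log(m) - min(lx, ly))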
Geospatial Police Blotter: We have developed a toolkit that (a) facilitates the integration of various police blotters into a unified representation and (b) supports semantic search with various levels of abstraction along with spatial and temporal views. Software has been developed for the integration of various police blotters along with semantic search capability. We are particularly interested in police blotter crime analysis. A police blotter is the daily written record of events (such as arrests) in a police station, which is released by every police station periodically. These records are available publicly on the web, which provides a wealth of information for analyzing crime patterns across multiple jurisdictions. The blotters may come in different data formats: structured, semi-structured (HTML), and unstructured (natural language text). In addition, many environmental criminology techniques assume that data are locally maintained and that the dataset is homogeneous as well as certain. This assumption is not realistic, as data are often managed by different jurisdictions; therefore, an analyst may have to spend an unusually large amount of time linking related events across different jurisdictions (e.g., the sniper shootings across Washington DC, Virginia and Maryland in October 2002). There are major challenges a police officer faces when he wants to analyze different police blotters to study a pattern (e.g., a spatial-temporal activity pattern) or a trail of events. There is no way a police officer can pose a query that is handled by considering more than one distributed police blotter on the fly. There is no cohesive tool for the police officer to view the blotters from different counties, interact with and visualize the trail of crimes, and generate analysis reports. The blotters can currently be searched only by keyword through existing tools, which do not allow conceptual search and fail to identify spatial-temporal patterns and connect the various dots/pieces. Therefore, we need a tool that will integrate multiple distributed police blotters, extract semantic information from a police blotter, and provide a seamless framework for queries with multiple granularities.

With regard to integration, structured, semi-structured and unstructured data are mapped into relational tables and stored in Oracle 10g2. For information extraction from unstructured text we used LingPipe (http://www.alias-i.com/lingpipe/). During the extraction and mapping process, we exploited a Crime Event ontology similar to the NIBRS Group A and Group B offences. During the extraction process, we extracted crime type, offender sex, and offender race (if available). This ontology is multi-level with depth 4.

We have facilitated basic and advanced queries to perform semantic search (Figure 2-2). A basic query allows querying by crime type from the ontology along with a date filter. An advanced query extends the basic query facility by augmenting it with address fields (block, street, city). The ontology allows users to search at various levels of abstraction. Results are shown in two forms of view, spatial and temporal. Furthermore, in each view results are shown either in individual crime form or in aggregated (number of crimes) form. In the temporal view, results are shown on a weekly, bi-weekly, monthly, quarterly or yearly basis. Clicking on a crime location in the spatial view shows the details of the crime along with the URL of the original source.

Figure 2-2. Basic and advanced query interface to perform semantic search using the Google Maps API. Details of a crime are shown in a popup window.

To identify correlations of crimes occurring in multiple jurisdictions, an SQL query is submitted to the database. After fetching the relevant tuples, subsequent correlation analysis is performed in main memory (MM). Correlation analysis is accomplished by calculating the pairwise similarity of these tuples and constructing a directed graph from the results. Nodes in the graph represent tuples and edges represent similarity values between tuples. If the similarity of two tuples falls below a certain threshold, we remove the corresponding edge from the graph. Finally, a set of paths in the graph demonstrates correlations of crimes across multiple jurisdictions. By clicking on a path, all relevant tuples similar to each other are shown in a popup window (see Figure 2-3).

Figure 2-3. Demonstration of correlation of crimes across multiple jurisdictions using the Google Maps API. By clicking on the path (blue line), all relevant crime records similar to each other are shown in popup windows.
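A compact sketch of the correlation step just described, with a hypothetical pairwise similarity scorer; edges below the threshold are simply never added, which is equivalent to removing them from the graph.

    def correlation_graph(records, similarity, threshold):
        # Directed graph over fetched tuples; an edge (i, j) survives only
        # if the pairwise similarity meets the threshold. Paths in the
        # result indicate cross-jurisdiction trails of related crimes.
        edges = {}
        for i, r1 in enumerate(records):
            for j, r2 in enumerate(records):
                if i != j:
                    s = similarity(r1, r2)
                    if s >= threshold:
                        edges[(i, j)] = s
        return edges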
For implementation, we have developed our application as an applet in a portlet. To show addresses on a map, we have exploited Yahoo's geocoder, which converts a street address into latitude/longitude. To display the map, both OpenMap and the Google Maps API are used. We used the following data set in our demonstration and will examine this data set for our SBIR project: police blotters for Dallas County, available online at http://www.dallasnews.com/sharedcontent/dws/news/city/collin/blotter/vitindex.html.

GRDF and Geospatial Ontologies for Emergency Response: We have developed a system called DAGIS which utilizes GRDF (our version of geospatial RDF [1]) and associated geospatial ontologies. DAGIS is an SOA-based system for geospatial data, and we use a SPARQL-like query language to query the data sources. Furthermore, we have also implemented access control for DAGIS. We are currently investigating how some of the temporal concepts we developed for Police Blotter can be incorporated into GRDF and DAGIS [30]. We have also presented this research to the OGC (Open Geospatial Consortium) [31]. In addition to this effort, we have developed a geospatial emergency response system to detect chemical spills using commercial products such as ArcGIS [7]. We will build on our extensive experience in geospatial technology, as well as our systems DAGIS/GRDF, Police Blotter, the ontology matching tools, and the geospatial emergency response system, to develop our SARO prototype for the military.

3.2 Social Networking for Fighting against Bioterrorism
Our own model for analyzing various bioterrorism scenarios is a hybridization of social interactions on a household scale, situation intervention, and the simplicity of the SIR approach [24]. The system arose out of a need for a deterministic model that can balance a desire for accuracy in representing a potential scenario with computational resources and time. Recent work has suggested that more detailed models of social networks have a diminished effect on the results of the spread of an epidemic [24]. We believe we can generalize complex interactions into a much more concise simulation without adversely affecting accuracy. The ultimate goal of our research is to integrate a model for biological warfare with a system that can evaluate multiple attacks with respect to passive and active defenses. As a result, we have created a simulation that serves as an approximation of the impact of a biological attack with speed in mind, allowing us to explore a large search space in a relatively short amount of time compared to existing detailed models. The base component of our model is the home unit. A home can range in size from a single individual to a large household. Within this unit, the probable states of the individuals are tracked via a single vector of susceptibility, infection, and recovery. Given a population distribution of a region and basic statistical data, we can easily create a series of family units that represent the basic social components, from a rural community to a major metropolitan area.
A single home unit with no interaction is essentially a basic representation of the SIR model. Interaction occurs within what we call social network theaters. A theater is essentially any gathering area at which two or more members of a home unit meet. The probability of interaction depends on the type of location and the social interaction possible at it. To capture this, separate infection rates are assignable to each theater. In the event of a life-threatening scenario such as a bioterrorist attack, we assume a civil authority will act at some point to prevent a full-scale epidemic. We model such an entity by providing means in our models to affect social theaters and the probabilities associated with state transitions. For simplicity at this point, we do not consider resource constraints, nor do we model how an event is detected. The recognition of an attack is simulated using a variable delay; after this delay has passed, the infection is officially recognized. The most basic form of prevention is inoculating the population against an expected contagion. Several options exist at this level, ranging from key personnel to entire cities. Anyone inoculated is automatically considered recovered. Second, a quarantine strategy can be used to isolate the infected population from the susceptible population. This requires the explicit removal of individuals from home units to appropriate facilities, and can be simulated on a fractional basis representing the probability of removal, with varying levels of accuracy. Third, the infection and recovery rates can be altered, through such means as allocating more resources to medical personnel and educating the general public on ways to avoid infection. Finally, a potentially controversial but interesting option is the isolation of communities by temporarily eliminating social gathering areas. For example, public schools could be closed, or martial law could be declared. The motivating factor is finding ways to force the population at risk to remain at home. Such methods could reduce the number of vectors over which an infection could spread.
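A minimal sketch of one discrete-time update of this model, assuming a per-home (S, I, R) probability vector and theater-specific infection rates; the functional forms and parameter values here are illustrative assumptions, not the authors' calibration.

    def sir_step(s, i, r, lam, gamma=0.1):
        # One discrete-time update of a home unit's (S, I, R) vector;
        # lam is the force of infection, gamma the recovery rate.
        new_infections = lam * s
        recoveries = gamma * i
        return s - new_infections, i + new_infections - recoveries, r + recoveries

    def force_of_infection(home_id, theaters, homes):
        # Sum each attended theater's infection rate times the average
        # infection level of the other homes meeting there. Closing a
        # theater (an intervention above) means dropping it from `theaters`.
        lam = 0.0
        for rate, members in theaters:
            if home_id in members:
                others = [m for m in members if m != home_id]
                if others:
                    lam += rate * sum(homes[m][1] for m in others) / len(others)
        return min(lam, 1.0)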
Based on the above model, we simulated various bioterrorism scenarios. A powerful factor that we saw in several of our results in the epidemiology models was the small world phenomenon. The small world effect is when the average number of hops on even the largest of networks tends to be very small. In several cases, the infection spread to enough individuals within 4 days to pose a serious threat that could not be easily contained. The results from closing social theaters made this particularly clear, as many closings beyond the third day did little to slow the advance of many epidemics. However, not all intervention methods are available in every country. It is important to understand how local governmental powers, traditions and ethics can impact the options available in a simulation. In some countries, a government may be able to force citizens to be vaccinated, while others may have no power at all and must rely on the desire for protection to motivate action. In other situations, closing any social theater may be an explicit power of the state, in contrast to governing entities that may have varying abilities to do the same but will not consider it due to a severe social backlash. The impact on society must be carefully considered beyond economic cost in any course of action, and there is rarely a clear choice. These answers are outside the scope of our work, and are better suited to political and philosophical viewpoints. However, our model helps governing bodies consider these efforts carefully in light of public safety and the expenditure of available resources. A major goal of our research is to provide a means to further our understanding of how to provide a higher level of security against malicious activities. This work is a culmination of years of research into the application of social sciences to computer science in the realm of modeling and simulation. With detailed demographic data and knowledge of an impending biological attack, this model provides the means to both anticipate the impact on a population and potentially prevent a serious epidemic. An emphasis on cost-benefit analysis of the results could potentially save both lives and resources that can be invested in further refining security for a vulnerable population.

3.3 Assured Information Sharing, Incentives, Risks
Our current research is focusing extensively on incentive-based information sharing, which is a major component of the SARO system. In particular, we are working on building mechanisms that give individuals/organizations incentives for information sharing. Once such mechanisms are built, we can use concepts from the theory of contracts [22] to determine appropriate rewards such as ranking or, in the case of certain foreign partners, monetary benefits. Currently, we are exploring how to leverage secure distributed audit logs to rank individual organizations among trustworthy partners. To handle situations where it is not possible to carry out auditing, we are developing game-theoretic strategies for extracting information from the partners. The impact of behavioral approaches to sharing is also currently being considered. Finally, we are conducting studies based on economic theories and integrating relevant results into incentivized assured information sharing as well as collaboration.

Auditing System. One motivation for sharing information and behaving truthfully is the liability imposed on the responsible partners if the appropriate information is not shared when needed [32]. For example, an agency may be more willing to share information if it is held liable. We are devising an auditing system that securely logs all the queries and responses exchanged between agencies. For example, Agency B’s query and the summary of the result given by Agency A (e.g., number of documents, document ids, their hash values) could be digitally signed and stored by both agencies. Also, we may create logs for subscriber-based information services to ensure that correct and relevant information is pushed to intended users. Our mechanism needs to be distributed, secure and efficient. Such an audit system could be used as a basis for creating information sharing incentives. First, using such a distributed audit system, it may be possible to find out whether an agency is truthful by conducting an audit using the audit logs and the documents stored by the agency. Also, an agency may publish accurate statistics about each user’s history using the audit system. For example, Agency B could publish the number of queries it sent to Agency A that resulted in positive responses, the quantity of documents that were transmitted, and how useful those documents were according to scoring metrics. The audit logs and aggregate statistics could be used to set proper incentives for information sharing. For example, agencies that are deemed to provide useful information could be rewarded. At the same time, agencies that do not provide useful information or withhold information could be punished.
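As an illustration of such a record, the sketch below logs a query, the returned document ids, and the documents' hash values, then authenticates the record; a shared-key HMAC stands in here for the digital signatures both agencies would apply in practice, and the field names are our assumptions.

    import hashlib, hmac, json, time

    def audit_record(query, doc_ids, doc_bodies, key):
        # key: shared secret (bytes). Log the query, the returned document
        # ids and the document hashes, then authenticate the whole record.
        record = {
            "time": time.time(),
            "query": query,
            "doc_ids": doc_ids,
            "doc_hashes": [hashlib.sha256(b.encode()).hexdigest()
                           for b in doc_bodies],
        }
        payload = json.dumps(record, sort_keys=True).encode()
        record["mac"] = hmac.new(key, payload, hashlib.sha256).hexdigest()
        return record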
An issue in audit systems is determining how the parties involved evaluate the signals produced by the audit. For example, in public auditing systems, simplicity and transparency are required for the audit to have the necessary political support [2]. Since the required transparency could be provided mainly among trusted partners, such an audit framework is suitable for trustworthy coalition partners. Currently, we are exploring the effect of various alternative parameter choices for our audit system on the Nash equilibrium [26] in our information sharing game. Other research challenges in incentivizing information sharing are also currently being explored. First, in some cases, due to legal restrictions or security reasons, an agency may not be able to satisfactorily answer the required query. This implies that our audit mechanisms, rating systems and incentive setup should take the existing security policies into account. Second, we need to ensure that subjective statistics such as ratings cannot be used to game the incentive system; that is, we need to ensure that partners do not have incentives to falsify rating information. For example, to get better rankings, agencies may try to collude and provide fake ratings. To detect such situations and discourage collusion, we are working on various social analysis techniques. In addition, we will develop tools to securely analyze the underlying distributed audit logs by leveraging our previous work on accountable secure multi-party computation protocols [18].

Behavioral Aspects of Assured Incentivized Information Sharing. A risk in reducing the complex issues of real-world information sharing to formal analysis is making unrealistic assumptions. By drawing on insights from psychology and related complementary decision sciences, we are considering a wider range of behavioral hypotheses. The system we are building seeks to integrate numerous sources of information and provide a variety of quantitative output to help monitor the system’s performance, most importantly sending negative alerts when the probability that information is being misused rises above preset thresholds. The quality of the system’s overall performance will ultimately depend on how human beings wind up using it. The field of behavioral economics emerged in recent decades, borrowing from psychology to build models with more empirical realism in the fundamental assumptions about the way in which decision makers arrive at inferences and take actions. For example, Nobel Prize winner Kahneman’s work focused primarily on describing how actual human behavior deviates from how it is typically described in economics textbooks. The emerging field of normative behavioral economics now focuses on how insights from psychology can be used to better design institutions [3]. A case in point is the design of incentive mechanisms that motivate a network of users to properly share information. One way in which psychology can systematically change the shape of the utility function in an information-sharing context concerns relative outcomes, interpersonal comparisons, and status considerations in economic decision making.
For example, is it irrational or uneconomical to prefer a payoff of 60 in a group where the average payoff is 40, over a payoff of 80 in a group where the average among others is over 100? While nothing in economic theory ever ruled these sorts of preferences out, their inclusion in formal economic analysis was quite rare until recent years. We are trying to augment the formal analysis of the incentivized information sharing component of our work with a wider consideration of motivations, including interpersonal comparisons, as factors that systematically shape behavioral outcomes and, consequently, the performance of information-sharing systems. Another theme to emerge in behavioral economics is the importance of simplicity [14] and the paradox of too much choice. According to psychologists and evolutionary biologists, simplicity is often a benefit worth pursuing in its own right, paying off in terms of improved prediction, faster decision times, and higher satisfaction of users. We are considering a wide range of information configurations that examine environments in which more information is helpful, and those in which less is more. With intense concern for the interface between real-world human decision makers and the systems, we will provide practical hints derived from theory to be deployed by the design team in the field.

4. OUR APPROACH TO BUILDING A SARO SYSTEM
Our approach is described in the following eight subsections. In particular, we will describe SAROL and our approach to designing and developing SAROL through a temporal geosocial semantic web.

4.1 Overview
In our MURI project, together with the team, we are developing what is called AISL (assured information sharing lifecycle). AISL focuses on policy enforcement as well as incentive-based information sharing. Semantic web technology is utilized as the glue. AISL’s goal is to implement DoD’s Information Sharing strategy set forth by Hon. John Grimes [15]. However, AISL neither captures geospatial information nor tracks social networks based on location (note that the main focus of AISL is to enforce confidentiality for information sharing). Our design of the SARO system is influenced by our work for the MURI project with our partners. Our goal is to develop technologies that will capture not only information but also social/political relationships, map the individuals in the network to their locations, reason about the relationships and information, and determine how the nuggets can be used by the commander for stabilization and reconstruction. In doing so, the commander also has to determine potential conflicts, terrorist activities, and any operation that could hinder or suspend stabilization and reconstruction. In order to reason about the relationships and to map the individuals of a network to locations, we will utilize the extensive research we have carried out both on social network models (Section 3.2) and on geospatial semantic web and integration technologies (Section 3.1). The first task in this project will be to develop scenarios based on use cases that are discussed in a study carried out for Army Science and Technology [6], as well as by interviewing experts (e.g., Chait et al.). Furthermore, in her work Dr. Karen Guttieri states that the Human Terrain is a crucial aspect and that we need hyperlinks to People, Places, Things and Events to answer questions such as: Which people are where? Where are their centers and boundaries? Who are their leaders? Who is who in the zoo?
What are their issues and needs? What is the news and reporting? Essentially, the human domain associations build relationships between the who, what, where, when and why (5W) [17]. We are designing what we call SAROL (Stabilization and Reconstruction Operations Lifecycle). SAROL will consist of multiple phases and will discover relationships and information, model and integrate the relationships and information, and exploit the relationships and information for decision support. The system that implements SAROL will utilize geosocial semantic web technologies, a novel semantic web that we will develop. The basic infrastructure that will glue together the various phases of SAROL will be based on the SOA paradigm. We will utilize the technologies that we have developed in social networking and geospatial semantic web to develop temporal geosocial semantic web technologies for SAROL. It should be noted that our initial design focuses on the basic concepts for SAROL, which will involve the development of TGS-SW. This will include capturing the social relationships and mapping them to the geolocations. In our advanced design we will include advanced techniques such as knowledge discovery and risk-based trust management for the information and relationships.

4.2 Scenario for a SARO
In the study carried out for the Army, the authors discuss the surveys they carried out and elaborate on the use cases and scenarios they developed [6]. For example, at a high level, a “City Rebuild” use case is the following: “A brigade moves into a section of a city and is totally responsible for all S&R operations in that area, which includes the reconstruction of an airfield in its area of responsibility.” They further state that to move from A to B in a foreign country the commander should consider various aspects, including the following: “Augmenting its convoy security with local security forces (political and military considerations)” and “Avoid emphasizing a military presence near political or religious sites (political and social considerations).” The authors then go on to explain how a general situation can be elaborated for a particular situation, say in Iraq. They also discuss the actions to be taken and provide the results of their analysis. This work is influencing our development of scenarios and use cases. The use case analysis will then guide us in the development of SAROL.

4.3 SARO Lifecycle (SAROL)
SAROL consists of three major phases, shown in Figure 3-1: (1) information and relationship discovery and acquisition, (2) information and relationship modeling and integration, and (3) information and relationship exploitation.

Figure 3-1. SAROL (a cycle of relationship/information discovery, relationship/information modeling and integration, and relationship/information exploitation)

During the discovery and acquisition phase, commanders and key people discover the information and relationships, based on those advertised as well as those obtained through inference. During the modeling and integration phase, the information and the relationships have to be modeled, additional information and relationships inferred, and the information and relationships integrated. During the exploitation phase, the commanders and those with authority exploit the information, make decisions, and take effective actions. SAROL is highly dynamic, as relationships and information change over time, and can rapidly react to situations. The above three phases are executed multiple times by several processes.
For example, during a follow-on cycle, new relationships and information, say political in nature, could be discovered, modeled, integrated and exploited. Figure 3-2 illustrates the various modules that will implement SAROL. The glue consists of the temporal geosocial service oriented architecture (TGS-SOA) that supports web services and utilizes temporal geosocial semantic web technologies (TGS-SW). The high-level web services include social networking, geospatial information management, incentive management and federation management. 4.4 TGS-SOA Our architecture is based on services, and we will design and prototype a TGS-SOA for managing SAROL. Our TGS-SOA will utilize temporal geosocial semantic web (TGS-SW) technologies. Through this infrastructure, information and relationships will be made visible, accessible and understandable to authorized DoD commanders, allied forces and external partners. As discussed by Karen Guttieri, it is important that the partners have a common operational picture, a common information picture and a common relationship picture. Based on the use-case scenarios, we will capture the various relationships, extract additional relationships and also locate the individuals in the various networks. This will involve designing the various services, such as geospatial integration services and social network management services, as illustrated in Figure 3-2. The glue that connects these services is based on the SOA (service oriented architecture) paradigm. However, such an architecture should support temporal geosocial relationships. We call such an architecture a Temporal Geosocial SOA (TGS-SOA). It is important that our system captures the semantics of information and relationships. Therefore we are developing semantic web technologies for representing, managing and deducing temporal geosocial relationships. Our current work in geospatial information management and social networks is exploring the use of FOAF, GRDF and SNRDF. Therefore we will incorporate the temporal element into these representations and subsequently develop appropriate representation schemes based on RDF and OWL. We call such a semantic web the Temporal Geosocial Semantic Web (TGS-SW) (Figure 3-3). [Figure 3-3. Temporal Geosocial Semantic Web Technologies: a layered stack consisting of SPARQL for TGS-SW; ontologies (OWL) for TGS-SW; TRDF+GRDF+SNRDF; TML+GML+SNML; and URI, Unicode.] We are using commercial tools and standards as much as possible in our work, including the Web Services Description Language (WSDL) and SPARQL. Capturing temporal relationships and information, e.g., evolving spatial relationships and changes to geospatial information, is key for our system. Temporal Social Networking Models. Temporal social network models represent and reason about social networks that evolve over time. Note that in countries like Iraq and Afghanistan the social and political relationships may be continually changing due to security, power and jobs, the three ingredients for a successful SARO. Therefore it is important to capture the evolution of the relationships. In the first phase, our design and development of temporal social networks will focus on two major topics, namely, semantic modeling of temporal social networks and fundamental social network analysis.
For semantic modeling of temporal social networks, we will extend existing semantic web social networking technologies such as the Friend-of-a-Friend (FOAF) ontology (FOAF09) to include various important aspects, such as relationship history, that are not represented in current social network ontologies. For example, we will include features to model the strength and trust of relationships among individuals based on the frequency and the context of the relationship. In addition, we will include features to model relationship history (e.g., when the relationship started and how it has evolved over time) and relationship roles (e.g., in the leader/follower relationship, one individual plays the role of the leader and one individual plays the role of the follower). In essence, by this modeling we intend to create an advanced version of a social network such as Facebook, specifically designed for SARO objectives. Note that there are XML-based languages for representing social network data (SNML) and temporal data (TML). However, we need a semantic language based on RDF and OWL for representing semantic relationships between the various individuals. We are exploring RDF for representing social relationships (SNRDF and extended FOAF). Representing temporal relationships (the When) is an area that needs investigation for RDF- and OWL-based social networks. We are using social network analysis to identify important properties of the underlying network and to address some of the 5W (who, what, when, where and why) and 1H (how) queries. To address queries for determining who to communicate with, we plan to use various centrality measures, such as degree centrality and betweenness centrality [24], to measure the importance of certain individuals in a given social network. Based on such measures developed for social network analysis, we can test which of the centrality measures could be more appropriate for finding influential individuals in social networks. In particular, we will test these measures on available social networks such as Facebook (it is possible to download information about individuals in your own network on Facebook). For example, if a centrality measure is a good indicator, then we may expect individuals with high centrality values to have more posts on their Facebook Walls or to be tagged in more pictures. To answer the queries for determining what information is needed, we are examining the use of relational naïve Bayes models to predict which attributes of an individual are more reliable indicators for predicting friendliness to the military. Since such military data are not openly available, we are testing our relational naïve Bayes models on Facebook data, identifying the attributes that are the most important indicators of an individual's political affiliation. To address queries for determining when to approach individuals, we are using various domain knowledge rules to first address when not to approach them. For example, in Iraq, it may not be a good idea to approach Muslim individuals during Friday prayers. Later on, we will try to build knowledge discovery models to predict the best times for approaching certain individuals based on their profile features (e.g., religion, social affiliation, etc.). In addition, we plan to use similar knowledge discovery models to answer the queries for understanding how those individuals may interact with military personnel.
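Returning to the centrality measures above, the following is a minimal sketch using the open-source networkx library; the toy graph, names, and ranking are illustrative, not real data or our actual analysis code.

    import networkx as nx

    # Toy social network; the individuals and ties are made up.
    G = nx.Graph()
    G.add_edges_from([
        ("A", "B"), ("A", "C"), ("A", "D"),
        ("B", "C"), ("D", "E"), ("E", "F"),
    ])

    # Degree centrality: how directly connected an individual is.
    deg = nx.degree_centrality(G)
    # Betweenness centrality: how often an individual lies on shortest
    # paths between others, a common proxy for brokerage/influence.
    btw = nx.betweenness_centrality(G)

    # Rank candidates for "who to communicate with."
    for person in sorted(G, key=lambda p: -btw[p]):
        print(person, round(deg[person], 2), round(btw[person], 2))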
In order to test our knowledge discovery models, we are analyzing the e-mail logs of our group members to see whether our models can predict the best times to send an e-mail to an individual so as to get the shortest response time and a positive answer. To address queries for determining why certain individuals' support is vital, we are examining community structure mining techniques, especially cluster analysis, to see which group a certain individual belongs to and how the homophily between individuals in the group affects the link structures. Again, Facebook data could be used to test some of these hypotheses. 4.5 Temporal Geospatial Information Management As part of our MURI research we are developing novel techniques for information quality management and validation, information search and integration, and information discovery and analysis that make "information a force multiplier through sharing." [Figure 3-4. System Architecture for Geosocial Information Management.] Our objective is to get the right information at the right time to the decision maker so that he/she can make the right decisions to support the SARO in the midst of uncertain and unanticipated events. Note that while our MURI project focuses mainly on information sharing of structured and text data, the work discussed in this paper focuses on managing and sharing temporal geosocial information. The system architecture is illustrated in Figure 3-4. It shows how the various geospatial information management components are integrated with the social networking components. Information management and search. Our system, like most other current data and information systems, collects and stores huge amounts of data and information. Such a system is usually distributed across multiple locations and is heterogeneous in data/storage structures and contents. It is crucial to organize data and information systematically and build infrastructures for efficient, intelligent, secure and reliable access, while maintaining the benefits of autonomy and flexibility of distribution. Although such a system is built on top of database system and Web information system technology, it is still necessary to investigate several crucial issues to ensure that our system has high scalability, efficiency, fast or real-time response, high quality and relevance of the answers to users' queries and requests, and high reliability and security. We are exploring ways to develop, test, and refine new methods for effective and reliable information management and search. In particular, the following issues are being explored: (1) indexing, clustering, and organizing data and information in a structured way to facilitate not only efficient but also trustworthy search and analysis; (2) relevance analysis to ensure the return of highly relevant and ranked answers; and (3) aggregation and summarization over time windows. For example, a user may ask for information involving weapons and insurgents at a particular location over a particular time frame. Our design and prototype of the police blotter system is being leveraged for the SARO system. Information Integration. We are examining scalable integration techniques for handling heterogeneous geosocial data, utilizing our techniques for ontology matching and alignment. Moreover, to ensure that data/information from multiple heterogeneous sources can be integrated smoothly, we are exploring data/information conversion and transformation rules and identifying redundancy and inconsistency.
In our recent research on geospatial information integration, we have developed knowledge discovery methods that resolve semantic disparities among distinct ontologies by considering instance alignment techniques [21]. Each ontological concept is associated with a set of instances, and using these, one concept from each ontology is compared for similarity. We examine the instance values of each concept and apply a widely popular matching strategy utilizing the N-grams present in the data. However, this method often fails because it relies on shared syntactic data to determine semantic similarity. Our approach resolves these issues by leveraging K-medoid clustering and a semantic distance measure applied to distinct keywords gleaned from the instances, resulting in distinct semantic clusters. We claim that our algorithm outperforms N-gram matching over large classes of data, and we have justified this with a series of experimental results that demonstrate the efficacy of our algorithm on highly variable data. We are exploiting not only instance matching but also name and structural matching in our project, for geosocial data that evolves over time. Information analysis and knowledge discovery. The tools we have developed for information analysis and knowledge discovery are being exploited for the SARO system in this project [21]. We are exploring research in the following directions, which extend our prior research. First, information warehousing, mining and knowledge discovery are being applied to distributed information sources to extract summary/aggregate information, as well as frequency, discrimination, and correlation measures, in multidimensional space for items and item sets. Note that the 5W and 1H questions (who, what, why, when, where and how) are important for SARO. For example, "when to approach them" might be used where a need exists to meet with road construction crews in an area that has experienced sporadic ethnic violence, mostly during specific times of the day. For example: "find all times in a given day along major roads in Baghdad less than 10 miles from Diyala Governorate where violent activity has not occurred in at least 2 years." We are extending our geospatial and social network analysis techniques to address such questions. Our work is focusing on the following aspects: (1) scalable algorithms that can handle large volumes of geosocial information (information management and search) to facilitate who, where and when issues; (2) information integration and analysis algorithms to address where and who issues; and (3) knowledge discovery techniques that can address why and how issues. 4.6 Temporal Geosocial Semantic Web We are integrating our work on modeling and reasoning about geospatial information with our work on social network information. First, we are utilizing concepts from SNML (social network markup language), GML (geospatial markup language) and TML (temporal markup language). However, to capture semantics and make meaningful inferences we need something beyond a syntactic representation. As we have stated, we have developed GRDF (Geospatial RDF), which basically integrates RDF and GML. One approach is to integrate SNRDF (the social network RDF that we are examining) with GRDF and also incorporate the temporal element. Another option is to make extensions to FOAF to represent temporal geosocial information.
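To illustrate the kind of FOAF extension and temporal querying just described, here is a minimal sketch using the open-source rdflib library; the tgs: namespace, property names, and data are hypothetical, not a published ontology or our actual design.

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import FOAF, XSD

    TGS = Namespace("http://example.org/tgs#")  # hypothetical namespace
    EX = Namespace("http://example.org/")

    g = Graph()
    g.add((EX.alice, FOAF.knows, EX.bob))
    # Reify the relationship so that history, strength and role --
    # aspects FOAF alone does not capture -- can be attached to it.
    rel = EX["rel/alice-bob"]
    g.add((rel, TGS.involves, EX.alice))
    g.add((rel, TGS.involves, EX.bob))
    g.add((rel, TGS.since, Literal("2008-06-01", datatype=XSD.date)))
    g.add((rel, TGS.strength, Literal(0.8)))
    g.add((rel, TGS.role, Literal("leader/follower")))

    # A SPARQL query with a temporal filter, of the sort a TGS-SW
    # query facility might support.
    q = """
    PREFIX tgs: <http://example.org/tgs#>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
    SELECT ?rel ?who WHERE {
        ?rel tgs:involves ?who ;
             tgs:since ?start .
        FILTER (?start >= "2008-01-01"^^xsd:date)
    }
    """
    for row in g.query(q):
        print(row.rel, row.who)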
Next, we are integrating the geosocial information across multiple sites so that the commanders and allied forces, as well as partners in the local communities, can form a common picture. We have developed tools for integrating social relationships as well as geospatial data using ontology matching and alignment. However, we are exploring ways of extending our tools to handle possibly heterogeneous geosocial data in databases across multiple sites. We are also exploring appropriate query languages to query geosocial data. There are query languages such as SPARQL and RQL being developed for RDF databases. We have adapted SPARQL to query geospatial databases. We are also exploring query languages for social network data. Query languages for temporal databases have been developed. Therefore our challenge in this subtask is to determine the constructs that are needed to extend languages like SPARQL to query geosocial data across multiple sites. It is also crucial to reason about the information and relationships to extract useful nuggets. Here we are developing ontologies for temporal geospatial information. We have developed ontologies and reasoning tools for geospatial data and social network relationships based on OWL and OWL-S. We are incorporating social relationships and temporal data and reasoning about the data to uncover new information. For example, if some event occurs at a particular time at a particular location for two consecutive days and involves the same group of people, then it will likely occur on the third day at the same time and location, involving the same group of people. We can go on to explore why such an event has occurred; perhaps this group of people belongs to a cult and has to carry out its activities in such a manner. Therefore we are developing reasoning tools for geosocial data and relationships. 5. CONCLUSION This paper has described the challenges of developing a system for SARO. In particular, we are designing a temporal geospatial social semantic web that can be utilized by military personnel, decision makers, and local/government personnel to reconstruct after a major combat operation. We have essentially developed a lifecycle for SARO. We will utilize the technologies we have developed, including the geospatial semantic web and social network systems, as well as build new technologies, to develop SAROL. We believe that this is the first attempt at building such a system for SARO. There are several areas that need to be included in our research. One is security and privacy. We need to develop appropriate policies for SAROL. These policies may include confidentiality policies, privacy policies and trust policies. Only certain geospatial and social relationships may be visible to certain parties. Furthermore, the privacy of the individuals involved has to be protected. Different parties may place different levels of trust in each other. We believe that building a TGS-SW is a challenge. Furthermore, incorporating security will make it even more complex. Nevertheless, security has to be considered at the beginning of the design and not as an afterthought. Our future work will also include handling dynamic situations where some parties may be trustworthy at one time and less trustworthy at another. We believe that the approach we have outlined is just the beginning of building a successful system for SARO.
6. ACKNOWLEDGEMENTS This paper is based on the keynote presentation given by the authors at PAISI (Pacific Asia Intelligence and Security Informatics) on April 27, 2009. The research is supported by Texas Enterprise Funds. The authors are grateful to the AFOSR MURI Project as well as the IARPA KDD Project for influencing the research described in this paper. We thank our student Pankil Doshi for reviewing this paper. 7. REFERENCES [1] A. Alam et al. GRDF and Secure GRDF. IEEE ICDE Workshop on Secure Semantic Web, Cancun, Mexico, April 2008. (A version is also to appear in Computer Standards and Interfaces Journal.) [2] N. Berg. A Simple Bayesian Procedure for Sample Size Determination in an Audit of Property Value Appraisals. Real Estate Economics, 34(1):133-155, 2006. [3] N. Berg. Normative Behavioral Economics. Journal of Socio-Economics, 32:411-423, 2003. [4] H. Binnendijk and S. Johnson. Transformation for Stabilization and Reconstruction Operations. Center for Technology and National Security Policy, National Defense University Press, 2004. [5] R. Chait, A. Sciarretta, and D. Shorts. Army Science and Technology Analysis for Stabilization and Reconstruction Operations. Center for Technology and National Security Policy, NDU Press, October 2006. [6] R. Chait, A. Sciarretta, J. Lyons, C. Barry, D. Shorts, and D. Long. A Further Look at Technologies and Capabilities for Stabilization and Reconstruction Operations. Center for Technology and National Security Policy, NDU Press, September 2007. [7] P. Chittumala et al. Emergency Response System to Handle Chemical Spills. IEEE Internet Computing, November 2007. [8] B. T. Dai, N. Koudas, D. Srivastava, A. K. H. Tung, and S. Venkatasubramanian. Validating Multi-column Schema Matchings by Type. In 24th International Conference on Data Engineering (ICDE), pp. 120-129, 2008. [9] A. Doan and A. Halevy. Semantic Integration Research in the Database Community: A Brief Survey. AI Magazine, Special Issue on Semantic Integration, Spring 2005. [10] A. Doan, J. Madhavan, P. Domingos, and A. Y. Halevy. Ontology Matching: A Machine Learning Approach. In Handbook on Ontologies, pp. 385-404, 2004. [11] E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching. The VLDB Journal, 10(4):334-350, 2001. [12] M. Ehrig, S. Staab, and Y. Sure. Bootstrapping Ontology Alignment Methods with APFEL. In Proceedings of the 4th International Semantic Web Conference (ISWC 2005), Galway, Ireland, November 6-10, 2005, LNCS vol. 3729, pp. 186-200. Springer, 2005. [13] GAO Report. Stabilization and Reconstruction: Actions Needed to Improve Governmentwide Planning and Capabilities for Future Operations. Statement of Joseph A. Christoff, Director, International Affairs and Trade, and Janet A. St. Laurent, Director, Defense Capabilities and Management, October 2007. [14] G. Gigerenzer and P. M. Todd. Simple Heuristics That Make Us Smart. Oxford University Press, 1999. [15] J. Grimes. DoD Information Sharing Strategy, 2007. [16] K. Guttieri. Integrated Education and Training Workshop. Peacekeeping and Stability Operations Institute, Naval Postgraduate School, September 2007. [17] K. Guttieri. Stability, Security, Transition and Reconstruction: Transformation for Peace. Quarterly Meeting of Transformation Chairs, Naval Postgraduate School, February 2007. [18] W. Jiang, C. Clifton, and M. Kantarcioglu. Transforming Semi-Honest Protocols to Ensure Accountability. Data and Knowledge Engineering (DKE), Elsevier, 2007. [19] Y. Kalfoglou and M. Schorlemmer. IF-Map: an ontology mapping method based on information flow theory. Journal on Data Semantics, 1(1):98-127, October 2003. [20] L. Khan, D. McLeod, and E. H. Hovy. Retrieval effectiveness of an ontology-based model for information selection. The VLDB Journal, 13(1):71-85, 2004. [21] L. Khan et al. Geospatial Data Mining for National Security. In Proceedings of the Intelligence and Security Informatics Conference, New Brunswick, NJ, May 2007. [22] J. Laffont and D. Martimort. The Theory of Incentives: The Principal-Agent Model. Princeton University Press, 2001. [23] J. Madhavan. Corpus-based Schema Matching. ICDE 2005. [24] M. E. J. Newman. The structure and function of complex networks. arXiv, 2003. [25] N. F. Noy and M. A. Musen. The PROMPT suite: Interactive tools for ontology merging and mapping. International Journal of Human-Computer Studies, 59(6):983-1024, 2003. [26] M. Osborne and A. Rubinstein. A Course in Game Theory. MIT Press, 1999. [27] E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching. The VLDB Journal, 10(4):334-350, 2001. [28] Santos. Computational Modeling: Culturally-Infused Social Networks. Office of Naval Research (ONR) Workshop on Social, Cultural, and Computational Science, and the DoD Workshop on New Mission Areas: Peace Operations, Security, Stabilization, and Reconstruction, Arlington, VA, 2007. [29] G. Stumme and A. Mädche. FCA-Merge: Bottom-up merging of ontologies. In Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI '01), pp. 225-230, Seattle, WA, 2001. [30] G. Subbiah. DAGIS System. MS Thesis, The University of Texas at Dallas. [31] B. Thuraisingham. Geospatial data management research at The University of Texas at Dallas. Presented at the OGC meeting for university members and OGC Interoperability Day, Tysons Corner, VA, October 2006. [32] B. Thuraisingham. Assured Information Sharing. Book chapter in Intelligence and Security Informatics (C. Yang, ed.), Springer, 2008.

On the Efficacy of Data Mining for Security Applications Ted E. Senator SAIC1 3811 N. Fairfax Drive, Suite 850, Arlington, VA 22203, USA 1-703-469-3422 [email protected] (1 Affiliation is provided solely for identification purposes. The views and opinions expressed herein are not necessarily those of SAIC, any of its clients, or of any other organization with which the author has been or is affiliated. This work has been prepared independently of the author's duties and responsibilities as an employee of SAIC.) ABSTRACT Data mining applications for security have been proposed, developed, used, and criticized frequently in the recent past. This paper examines several of the more common criticisms and analyzes some factors that bear on whether the criticisms are valid and/or can be overcome by appropriate design and use of the data mining application. Categories and Subject Descriptors H.2.8 [Database Management]: Database Applications – data mining. I.5.2 [Pattern Recognition]: Design Methodology – classifier design and evaluation, feature evaluation and selection, pattern analysis. I.6.3 [Simulation and Modeling]: Applications. J.7.2 [Computers in Other Systems]. K.4.1 [Computers and Society]: Public Policy Issues – human safety, privacy, use/abuse of power. General Terms Management, Measurement, Design, Economics, Security, Legal Aspects. Keywords Data mining, security, applications, pattern matching. 1. INTRODUCTION Many data mining applications for solving security problems have been proposed and designed in the recent past, especially since September 11, 2001. [12] Some of these applications have been developed and deployed and are in use today. Others have been cancelled before they were deployed.
[16] The reasons for cancellation have typically been concerns over effectiveness and/or concerns over societal impact, particularly with respect to privacy and civil liberties. Sometimes, the concerns have been expressed as a tradeoff between the benefits in terms of the amount of protection afforded and the costs in terms of the societal impact of using the system. Some critics of the use of data mining for security applications have simultaneously criticized systems both for being ineffective and for threatening civil liberties. Often – especially in the political and policy communities – the discussion of these issues is based on less than thorough analyses of the way these systems actually operate or the way they could operate if they were designed effectively. This paper analyzes several of the criticisms directed at the effectiveness of various data mining applications for security by (1) clearly defining the distinct activities that are performed as part of data mining projects for security applications and (2) analyzing the criticisms in the context of these activities, while proposing alternative designs to those assumed by the critics. This paper does not address the vital issues of privacy and civil liberties that are raised by these systems, not because these issues are not important – they most certainly are fundamental considerations in the design and adoption of any system involving data mining for security – but simply because that is a separate topic that requires far more thorough analysis and discussion than has been addressed in the research and analysis reported here. The main purpose of this paper is not to argue for any particular position with respect to the societal costs and benefits of using data mining for security applications; rather, it is to suggest ideas that would be part of a more thorough and principled framework within which to understand the inherent design issues, impacts, tradeoffs, and possibilities, in the hope that such a framework and understanding can be used to support rational and informed societal choices leading to effective security systems that respect privacy and civil liberties. This paper is offered in the spirit of [2] – to contribute to informed public debates and sound policy making that provide appropriate security and maintain civil liberties, informed by careful analyses of alternatives and possibilities. It is hoped that the discussion and analysis in this paper will provide more of a mutual understanding between the technical and policy communities, at least in terms of the ability to communicate, discuss, and debate issues with a common understanding of what different terms mean and what alternative solutions may imply. The paper is based on actual and proposed data mining applications in the U.S. with which the author is familiar;
however, the general ideas discussed should apply equally well to non-U.S. applications. What may differ across countries is not the scientific and engineering principles on which such systems are based, but rather the values on which societal judgments are based and the corresponding legal, political, regulatory, and policy environments in which decisions are made regarding the benefits and costs of such systems. The paper begins with a review of various definitions of data mining. It identifies several distinct but related activities that fall within these definitions. Next, it defines a model of data mining systems for security applications. This model forms the basis for the analysis that follows. The model is used to analyze several criticisms that have been levied at security applications of data mining. The paper then discusses metrics for the evaluation of security applications of data mining and finally concludes. 2. WHAT IS A SECURITY APPLICATION OF DATA MINING? This section discusses what is meant by an application of data mining for security. As will be seen from the various examples that are cited, there is often much confusion about this very issue, and this confusion is a large contributor to misunderstandings about the effectiveness of data mining applications. The section begins with a comparison of definitions of data mining as used in the technical and the political/policy communities. It continues with a discussion of what data miners do and suggests a framework and terminology for distinct but related tasks. It then uses this framework to understand security applications of data mining and concludes with a discussion of the sources and role of patterns in security applications. 2.1 Definitions of Data Mining There are a number of well-accepted definitions of data mining in the scientific community. Most of them center on the idea of pattern discovery. The most widely used definition, from [3], is that data mining is "the non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data." A newer definition from Jonas and Harper [5] defines data mining as "the process of searching data for previously unknown patterns and often using these patterns to predict future outcomes." Note how these definitions, cited in the same order as they were proposed, both focus on the discovery of patterns, while the newer definition adds an emphasis on the use of these discovered patterns, especially for prediction.2 (2 Prediction is actually used in two senses here. Prediction can mean "to infer some value about another entity in the database," or it can mean "to suggest something that might occur in the future based on an analysis of the past." As the physicist Niels Bohr said, "Prediction is very difficult, especially about the future.") Note also how neither of these definitions mentions data collection, data aggregation or linking, or particular applications. In contrast to the definitions used in the scientific community, politicians have defined data mining both more broadly and more narrowly. These definitions are broader in so far as they include search and collection of data, and they are narrower in that they typically refer to security applications the purpose of which is to
prevent terrorism. "Data Mining is a broad search of public and non-public databases in the absence of a particularized suspicion about a person, place or thing. Data mining looks for relations between things and people without any regard for particularized suspicion," according to U.S. Senator Russ Feingold on January 16, 2003. The U.S. Department of Defense Technology and Privacy Advisory Committee in March 2004 defined data mining as "searches of one or more electronic databases of information concerning U.S. persons by or on behalf of an agency or employee of the government." Senator Feingold's proposed amendment to HR 5441 defined data mining as "a query or search or other analysis of 1 or more electronic databases, whereas – (A) at least 1 of the databases was obtained from or remains under the control of a non-Federal entity, or the information was acquired initially by another department or agency of the Federal Government for purposes other than intelligence or law enforcement; (B) a department or agency of the Federal Government or a non-Federal entity acting on behalf of the Federal Government is conducting the query or search or other analysis to find a predictive pattern indicating terrorist or criminal activity; and (C) the search does not use a specific individual's personal identifiers to acquire information concerning that individual." Senator Patrick Leahy, opening the Senate Judiciary Committee hearing on "Balancing Privacy and Security: The Privacy Implications of Government Data Mining Programs" on January 10, 2007, defined data mining as "the collection and monitoring of large volumes of sensitive personal data to identify patterns or relationships." It is important to note how these definitions of data mining differ from the scientific definitions. First, they assume a particular purpose, namely, security applications. Second, they include the concepts of data collection, monitoring, and search. And third, they assume that pattern-based searches are being conducted to identify specific individuals who fit the patterns. The focus is not so much on pattern discovery, but rather on pattern matching; the end product is not the patterns themselves, as in the scientific definitions, but rather the matches of the patterns to people (and perhaps also places, things, events, etc.) to predict something of interest having to do with security. Note that neither of these sets of definitions covers the primary activity of data mining researchers, i.e., developing new algorithms for pattern discovery. In the U.S., the Federal Agency Data Mining Reporting Act of 2007 ("Data Mining Reporting Act") requires the "head of each department or agency of the Federal Government" that is engaged in activities defined as "data mining" to report annually on such activities to Congress.
The Data Mining Reporting Act defines data mining as "a program involving pattern-based queries, searches, or analyses of 1 or more electronic databases" in order to "discover or locate a predictive pattern or anomaly indicative of terrorist or criminal activity." According to [7] and [8], "the limitation to predictive 'pattern-based' data mining is significant because analysis performed … for counterterrorism and similar purposes is often performed using various types of link analysis tools. These tools start with a known or suspected terrorist or other subject of foreign intelligence interest and use various methods to uncover links between that known subject and potential associates or other persons with whom that subject is or has been in contact. The Data Mining Reporting Act does not include such analyses within its definition of 'data mining' because such analyses are not 'pattern-based.' Rather, these analyses rely on inputting the 'personal identifiers of a specific individual, or inputs associated with a specific individual or group of individuals,' which is excluded from the definition of the act." [Figure 1. Data Mining Activities: 1. Data Mining Research – data sources feed research that produces an algorithm; 2. Data Mining – data sources feed pattern discovery that produces patterns; 3. DM Application – data feed pattern detection that produces predictions/inferences; 4. Security Application – linking, targeting, investigating, etc., leading to follow-on actions.] 2.2 What Data Miners Do Based on the above definitions, it appears that data miners engage in three distinct but related activities: (1) data mining research, the primary focus of which is algorithm development, (2) data mining itself, whose primary focus is pattern discovery, and (3) data mining applications, whose primary focus is predicting or inferring the value of a feature for some purpose. (In the case of security applications of data mining, the feature is typically a likelihood that a particular person is high-risk.) Finally, the data mining application developer may also engage in a fourth activity: (4) the design and development of other aspects of an end-to-end system that makes use of the predicted feature for some particular purpose. This end-to-end system can involve a variety of data sources and analytical and investigatory techniques; may result in many alternative downstream analyses, decisions and actions; and typically involves human analysts and other actors. Figure 2 of [15] provides an example of how this end-to-end process may occur in the context of law enforcement investigations. (It is important to note that the result of a positive match to a pattern that may be indicative of increased risk is usually and appropriately a more thorough analysis by a human analyst; rarely if ever is a pattern match relied on for any consequential action, nor should it be.) Each of the activities performed by data miners has distinct data requirements and distinct products, as depicted in Figure 1. The figure is intended to depict not only the distinct activities, but also the different data needs and uses for each activity. Data mining researchers typically identify a set of databases that have a characteristic that has not been previously exploited for effective pattern discovery. They acquire several – or as many as are readily available – databases that share this characteristic and develop an algorithm that takes advantage of this characteristic and results in the discovery of more effective patterns.
(Note that the effectiveness of the patterns is defined with respect to some particular application task.) The databases the data mining researchers use need not have information that identifies the entities in the databases, although they often do need to maintain unique identifiers for some classes of patterns. The databases need not be complete or even close to complete with respect to the actual populations, although some degree of representativeness is highly desirable. Multiple databases that are about a diverse set of domains are strongly preferred in order to demonstrate the widespread applicability and utility of the newly developed algorithm. The second activity, the actual mining of data, uses various algorithms to develop models, or, synonymously, to discover patterns. This step is sometimes called knowledge discovery, and the resulting patterns or models are referred to as knowledge. This activity typically is based on a single database, or at least a single "virtual database" in so far as the analysis is concerned.3 (3 Note that for purposes of simplicity, we ignore the field of distributed data mining. It is recognized that distributed data mining techniques can be used to discover local patterns that can later be merged; the relevant question in this field is not how databases can be split vertically for pattern discovery, but rather how they can be split horizontally and the patterns combined.) All fields of a database are relevant here, as the purpose of pattern discovery is to determine which of the fields are relevant and which are not for the detection of the phenomena being modeled; however, it does not necessarily require records referring to all elements of the population, just a large enough sample with enough examples of the phenomena of interest. This activity of data mining may more generally be termed "data analysis" – it could use, for example, statistical or other techniques. The result of this mining of data is a set of patterns that have predictive value. This is the activity that conforms to the widely accepted definition of data mining in the technical community. Before continuing the discussion, however, two other points are worth mentioning. First, while this discussion of data mining activities is depicted in terms of propositional data, the basic ideas apply to relational data as well. Second, we note that these four distinct activities are often conflated not only by policymakers, end users, and other stakeholders, but also by data miners themselves. In particular, data mining researchers tend to view every activity other than data mining research as "applications," while those responsible for end-user applications tend to group activities one and two as "research."4 (4 In fact, this confusion often manifests itself not only in policy debates, but also in acceptance criteria for data mining conferences and journals.) The next section of this paper explores end-to-end security applications in more depth. 2.3 Security Applications of Data Mining The third activity in which data miners engage is the actual prediction itself. Predictions are made using the patterns discovered by the second activity on new data elements that had not been used in the pattern discovery. This activity would typically be widely applied to all members of the population of interest, but would require only those fields or attributes that have been determined to be relevant during the actual data mining analysis. Each record in the database is matched against the pattern and an inference or prediction is made. These inferences or predictions may then be used, typically in conjunction with additional information that has been collected based on knowing the identity of the individual whose particular records matched the pattern, to make a further determination of interest and take appropriate actions, outside the scope of but resulting from the data mining application. An example of such inference might be the assignment of a credit score to a new applicant, based on a match to patterns that were discovered during the second activity and were determined to be useful in predicting credit risk. With these four activities in mind, we see that what is typically referred to as a security application of data mining may combine aspects of several activities but usually emphasizes some combination of the third and fourth – the detection of entities in the database that match particular patterns of interest and their use in an end-to-end application process for evaluating risk, initiating and/or conducting investigations, and taking appropriate actions. Such an application could involve no automated pattern matching at all, or it could be totally dependent on such automated pattern matching.
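To make the distinction between the second activity (pattern discovery) and the third (applying discovered patterns to new records) concrete, here is a minimal sketch using the open-source scikit-learn library; the features, labels, and model choice are synthetic and purely illustrative.

    import numpy as np
    from sklearn.naive_bayes import BernoulliNB

    rng = np.random.default_rng(0)

    # Activity 2 (pattern discovery): learn a model from labeled history.
    X_train = rng.integers(0, 2, size=(1000, 5))  # five binary attributes
    y_train = rng.integers(0, 2, size=1000)       # known outcomes
    model = BernoulliNB().fit(X_train, y_train)

    # Activity 3 (application): score new records not used in discovery,
    # matching each record against the learned patterns.
    X_new = rng.integers(0, 2, size=(3, 5))
    scores = model.predict_proba(X_new)[:, 1]     # inferred likelihoods
    print(scores)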
Many real applications, such as the one described in [14], combine these aspects. While the use of the application for its intended purpose may not include any algorithm development or pattern discovery, its development process may benefit from these activities. The application may also include such activities to enable continuous updating of the patterns to account for changes in behavior. It may also occasionally take advantage of the first activity, if improved algorithms can result in the discovery of more effective patterns. The fourth activity type is the actual end-to-end security application. This activity is not performed by scientists or engineers, but by some organization with an operational mission. The organization may be responsible for screening applicants for some purpose, or in a non-security context it might be responsible for marketing a particular product or service. Such an organization does not care about the source of the knowledge used in its systems; it cares simply about their effectiveness. This knowledge may be the result of patterns discovered by data mining, or it may arise from other sources, such as a deep understanding of the domain, a formal set of regulations, etc. Often, these domain-specific applications do not incorporate any data mining algorithms or any patterns that were discovered using such algorithms; why they do not is an issue both for application developers, who could potentially build more effective systems by taking advantage of data mining techniques and results, and for data mining researchers, who could potentially provide more useful technology for real applications. Key issues in the design of such applications are (1) what is the purpose of the application?, (2) what data sources are available, appropriate, and useful for the intended purpose?, (3) what techniques will best accomplish these purposes with the available data and patterns?, (4) what additional justifications are required to acquire additional data?, (5) what records are kept after an analysis is performed?, and (6) what follow-on actions are allowed as a result of the application? The purpose of the application is determined by some need, having to do with an organization's mission and independent of the consideration of any use of data mining. The application may use data from which useful patterns have been mined, or it may not, depending on whether such sources and patterns exist, whether it is appropriate to use such sources for the intended purpose, and whether other sources are more useful for the intended purpose. In the context of security applications, data that have been collected for security purposes (e.g., existing law enforcement and intelligence databases) are often useful and appropriate sources; other data sources such as commercial transactions are often neither useful nor appropriate. The selection bias that results in inclusion in such a security database in the first place may, in fact, be viewed as a prima facie indicator of a high-risk entity; this is why techniques that start from known risky individuals and "connect the dots" are often most effective for security applications. Other pre-screening techniques, such as observations of suspicious behavior or setting off alarms for carrying potentially dangerous material, provide a similar selection bias that may be at least as effective as more abstract pattern matching for purposes of a particular security application.
Techniques that are useful for security applications may include pattern matching, link analysis, anomaly detection, and others; often, some combination of these techniques is most effective. Often the security application includes additional data collection about entities for whom more information is justified based on initial indicia of risk or suspicion; this additional data collection may involve additional entities who are somehow "connected" to known entities, or it may involve collection of additional types of information through a subject-based query on a known entity to enable an accurate determination of that entity's status. (Such additional collection based on subject-based queries is depicted by the bi-directional arrows in activity 4 in Figure 1.) The application may store or discard various intermediate results about a particular entity; these data retention issues are crucial because of the potential long-term effect on an individual. If an individual is determined to be low-risk only after an extensive analysis of additional data, pertinent data about that individual could be discarded at a cost of having to repeat the analysis in the future; however, if such data are retained, then it would be essential to prevent the use of such additional data for any purpose other than avoiding a more detailed analysis of such an individual in the future. Finally, the application exists in the context of some business process and some set of authorizations and authorities, both of which determine what follow-on actions may result from use of the application. It is typically this last issue that is of most societal concern, for this is when consequences can occur. It is important to note that these consequences are not the direct result of the use of data mining to discover patterns; rather, they are the result of policies and procedures that are adopted by a user organization with regard to the results of a security application. A key feature of all security applications is that they are multi-stage processes. Each stage passes along the riskiest entities to the subsequent stage for more detailed analysis and discards the low-risk entities.
The more detailed analysis may incorporate additional data sources – data sources that are more expensive to obtain or data sources the use of which is restricted until some additional justification exists. The more detailed analysis may, and often does, result in a conclusion that the entity under consideration is, in fact, not as risky as determined by the previous stage and, therefore, cause the entity to be removed from the risky category. A crucial issue in the design of any security application is the source and role of patterns. Pattern detection may be an effective technique, especially when applied to existing law enforcement and intelligence data and used to detect low-level activities and combine such low-level activities into higher-level plans or organizations. It may be less useful when applied to screening of individuals. An important point is that the utility of any pattern must be established and verified empirically before such a pattern is used as a component of a security application. Patterns may come from data mining, but also from other sources. For example, patterns used in [13] resulted from an analysis of market regulations and hypotheses about possible schemes to engage in improper market behavior, and these patterns were deployed only after a rigorous and iterative process of modification and validation. Patterns may also come from external sources; for example, knowledge of a newly confirmed or suspected adversarial technique could result in the development and use of a pattern for its detection. Patterns may also arise from anomaly detection techniques; in this case, normal patterns of activity are removed from the data and what remains is considered unusual and potentially suspicious. We consider two examples of security applications to illustrate these ideas. First, imagine an application for screening passengers at airports. In such an application, there is no a priori reason to suspect that people who choose to fly on airplanes are more likely than the general population to be dangerous. Rather, the concern is with the possibility of any particular, dangerous person being aboard an aircraft. In such an application, initial screening might be, and often is, based on a physical inspection, close behavioral observations, detailed questioning (in the case of El Al Israel Airlines), or some combination, rather than on pattern matching. Additional data might be considered for people who somehow appear suspicious on one of these tests. This situation contrasts with an application that might be used to determine where to focus investigatory resources based on people who appear in lawfully collected intelligence databases. In such an intelligence application, the initial indicia of risk come directly from the fact that a person is included in the database itself; hence, pattern-based analysis is likely to be a useful tool. Finally, it must be noted that a security application will almost always and should always include strict audit functions, controls on use, and review mechanisms to ensure that the application is being used solely for its intended purpose and is not being abused in any way. In fact, data mining techniques independently applied to the audit logs are themselves one method to detect, deter, and guard against abuses of security applications themselves.
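As a small illustration of the anomaly-detection idea mentioned above (remove normal patterns of activity and treat what remains as unusual), consider the following sketch; the data, the standardization approach, and the threshold are synthetic assumptions, not a deployed method.

    import numpy as np

    rng = np.random.default_rng(1)
    daily_activity = rng.normal(100.0, 10.0, size=365)  # synthetic baseline
    daily_activity[200] = 180.0                         # one unusual day

    # "Remove" normal patterns by standardizing against the baseline;
    # large residuals are flagged as potentially suspicious.
    z = (daily_activity - daily_activity.mean()) / daily_activity.std()
    flagged = np.where(np.abs(z) > 4)[0]
    print(flagged)  # day 200 stands out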
3. CRITICISMS OF SECURITY APPLICATIONS OF DATA MINING Security applications of data mining that have received the most criticism include Total Information Awareness (TIA), the Computer-Assisted Passenger Prescreening System (CAPPS II), and the Multistate Anti-Terrorism Information Exchange (MATRIX). [12] These systems/projects were all cancelled, after expenditures of millions of U.S. dollars, because of concerns both about privacy and civil liberties and about their effectiveness. [16] Secure Flight, a follow-on to CAPPS-II, and the Department of Homeland Security's Analysis, Dissemination, Visualization, Insight and Semantic Enhancement (ADVISE) system were also cancelled, due to security vulnerabilities and privacy concerns, respectively. [16] Even research programs that incorporated a full measure of privacy protection and had the sole purpose of determining whether a particular algorithm, technique or approach could develop patterns that indicate terrorist activity were reported, although they did not meet the requirements of the Data Mining Reporting Act [7], and were later cancelled according to [8]. These research programs were an example of the first activity depicted in Figure 1. Criticisms of the effectiveness of data mining security applications appear in [1], [5], [9], [10], and [11]. Analyses of some of these criticisms are contained in [4], [6], [14], and [15]. This section of the paper summarizes the criticisms and analyzes how they may be addressed in security applications of data mining, using the model presented in section 2 as the basis for distinguishing separate and different activities. 3.1 Too Many False Positives The simplest criticism of security applications of data mining is frequently expressed as "too many false positives." In particular, it is accurately noted that for events that occur far less frequently than a classifier's error rate, most positive results will be false positives. This criticism is addressed in detail in [14]; a multi-stage classification architecture preceded by a high-risk population selection and followed by link analysis is shown to be one method of mitigating this problem of too many false positives. For example, a 99.9 percent accurate classifier applied to a population of 300 million entities containing only 3,000 true positives, i.e., 0.001 percent, would by itself yield over 100 times more false positives than true positives. However, with multi-stage classification techniques consisting of two independent stages at 99 percent and 99.9 percent accuracy, and assuming 5 percent of the population in a high-risk group that was 10 times more likely to be positive, almost all groups of reasonable size would be detected. The "false-positive" criticism is also addressed in [4] in the context of relational data, ranking classifiers, and multi-pass inference. The flaw in the false-positive criticism is that it assumes a single-stage classifier. As should be clear from the discussion in section 2.3 of this paper, no serious security application would be this simplistic, if only because any credible application designer would be aware that such an approach would not work. An initial classifier might be used in some applications as a first-level screener to manage a large workload; such a classifier would have to be tuned to minimize false negatives. The application would rely on subsequent stages to rule out false positives. These subsequent stages would likely employ a combination of techniques.
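A back-of-envelope check of the arithmetic in this subsection, treating the quoted accuracies as one minus the false-positive rate and independence across stages; both are simplifications of the text's assumptions.

    population = 300_000_000
    true_positives = 3_000
    accuracy = 0.999  # single-stage classifier

    # Single stage over the whole population.
    false_positives = (population - true_positives) * (1 - accuracy)
    print(false_positives, false_positives / true_positives)  # ~300,000; ~100x

    # Two independent stages (99% and 99.9%) applied only to a
    # pre-selected high-risk group of 5% of the population.
    high_risk = 0.05 * population
    fp_multi = high_risk * (1 - 0.99) * (1 - 0.999)
    print(fp_multi)  # roughly 150 false positives remain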
3.2 Nobody Does That Anymore This criticism suggests that matching known patterns is not useful. The flaw in this criticism is that even past threats are still dangerous if they can be executed again. It is important to prevent instances of known attack patterns; otherwise they can be reused by attackers. There is no a priori reason to assume that a known attack pattern will not be reused; in fact, if something is successful, human nature suggests trying it again. Active detection of indicators of past patterns – and publicizing the ability to do so, although not the details of how it is done – will not only detect such patterns, but also deter individuals from trying them again. This is why we still have to take off our shoes at airports even though there have not been any publicized accounts of attempted use of shoe bombs in quite a while; if we did not have to have our shoes inspected, then shoe bombing might return, as it is a proven and low-cost attack method. Further, there may be many potential adversaries who are capable of executing only a single type of attack; preventing them from using that method removes them from the potential population of adversaries. And finally, detecting known attack patterns prevents potential increases in the population capable of using and motivated to use that attack type by avoiding the possibility of copycat attacks. In the context of section 2.3, this criticism relates to the choice of patterns to use in the security application. These patterns are easy to specify precisely because they are known, and therefore pattern detection is a useful technique for this security measure. 3.3 It Won't Be Perfect Some systems are criticized because they will not be perfect – there is no way at an acceptable cost to prevent all potential attacks. This criticism is often explained in terms of the cost of a false negative – if even one terrorist attack occurs because it is not detected, the cost to society would be astronomical. (This situation is frequently contrasted with the cost of a false negative in a marketing or fraud detection application, in which case the right thing to do is minimize the cost across a large number of cases, in contrast to security applications where the goal is to prevent all false negatives.) What this criticism ignores is the fact that no system is or can ever be perfect; rather, the goal is to maximize effectiveness at a fixed or minimal cost (in terms of effort to develop and use the system, in terms of disruptions to normal functions, and in terms of the impact on privacy and civil liberties). Comparing alternative resource allocations to maximize effectiveness is the subject of [6]. The right question to ask is not "Is this system perfect?" but rather "How does this system increase our overall security in the context of all our other systems?" An effective security application will be part of a layered defense that uses a multitude of techniques with uncorrelated errors; such a design will be most effective at providing maximum security for a fixed resource allocation. 3.4 It Will Just Make Them Try Something Else Many bad guys are intelligent adversaries. They can be very creative in devising attacks of different types. This criticism typically suggests that there is no point in preventing one type of attack because another equally costly attack can easily be devised and substituted. However, this criticism ignores the fact that not every bad guy is capable of creating a new attack method.
Preventing known attacks forces adversaries to spend time developing new attack methods, acquiring new capabilities and resources, and training new attackers. This prevention of known attack types, therefore, has a real cost to adversaries. And not doing so would have a huge cost in morale to those being attacked repeatedly by the same methods with no effective response. One technique for increasing security in the face of potential new attack types is red-teaming potential new attack types and incorporating such patterns into the security application. A second, related technique is to use patterns corresponding to variants of known attack types, based on the assumption that variants of previous attacks are likely to be tried by an intelligent adversary because they involve minimal change. A third technique to detect new attack types is to decompose the known attacks into their required constituent activities, and then create new patterns based on novel recombinations of these lower-level activities (a toy sketch of this recombination appears at the end of section 3.6).

3.5 It Will Make Them Try Something More Complicated and Serious

This criticism suggests that prevention of low-consequence attacks will result in more devastating attacks as adversaries creatively invent new methods. This is a variant of the previously discussed criticism – not only will prevention of some attack types cause other attack types to be used, but the new attack types will be more serious than those that have been prevented. What this criticism ignores is that more serious attacks are typically more complicated; they require far more planning, capabilities, training, and resources than less serious, simpler attacks. This additional complexity typically involves a longer time to plan and prepare for the attack, the involvement of more people in the plan, and, perhaps most important, more interactions with non-conspirators. All of these factors make it easier to detect the more complicated and serious attack before it is executed – only one starting point is needed, and many more are available. Not only is there a cost of something new, but there is an additional cost of something more complicated.

3.6 It Will Make Them Try Something New That You Haven't Thought Of

A further criticism is that effective detection of known attack patterns ignores detection of new attack patterns that have not yet been conceptualized by those responsible for security applications. This criticism is countered by several observations: (1) even novel attack patterns involve low-level activities that arouse suspicion (think of the flight training prior to 9/11), (2) starting from known subjects can lead to other bad guys (this is the essence of link analysis), and (3) novel attacks are difficult and expensive to devise. It is this last observation that is key – by detecting previously used attack patterns, those responsible for security are forcing the bad guys to adapt constantly. Every attack they try is new, and is being tried for the first time. This greatly increases the probability that an attack will not be successful – who gets everything right on the first try? Forcing adversaries to invent more complicated and novel attacks makes their tasks as difficult as possible. It also forces adversaries to test components of a new attack, which converts these component activities from novel actions into repeated ones and makes them amenable to detection techniques that rely on automated pattern discovery to detect repeated sequences of related activities.
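As a toy illustration of the recombination technique referenced at the end of section 3.4, the sketch below decomposes known attack types into constituent activities and enumerates novel combinations; the activity names are invented placeholders, not real indicators:

    # Toy sketch: recombine constituent activities of known attack types
    # into candidate patterns for attacks not yet observed.
    from itertools import combinations

    known_attacks = {
        "attack_a": frozenset({"financing", "recruitment", "materiel_acquisition"}),
        "attack_b": frozenset({"financing", "training", "communication"}),
    }

    # Pool of lower-level activities seen across all known attack types.
    activities = frozenset().union(*known_attacks.values())

    # Candidate new patterns: three-activity combinations not yet observed.
    candidates = [
        frozenset(combo)
        for combo in combinations(sorted(activities), 3)
        if frozenset(combo) not in known_attacks.values()
    ]
    for pattern in candidates:
        print(sorted(pattern))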
3.7 There's Not Enough Training Data

Often, data mining applications for security are compared to applications in credit card fraud detection. The discussion typically has an advocate for data mining applications, who cites the high effectiveness of scoring credit card transactions in real time, and a critic, who points out that there are a multitude of examples from which a system can learn the indicators or patterns of fraud in credit cards, compared to few examples of terrorist attacks (see footnote 5). Jonas and Harper [5] make this argument quite effectively, pointing out that there are "a relatively small number of attempts every year and only one or two major terrorist incidents every few years – each one distinct in terms of planning and execution – that there are no meaningful patterns that show what behavior indicates planning or preparation for terrorism." There are certainly and fortunately a small number of examples of successful terrorist attacks and known disrupted attacks and, presumably, a larger but not extremely large number of unknown disrupted attacks, but nowhere near the amount needed for statistically valid pattern discovery. However, as in many types of fraud detection applications, the components of such attacks are similar. They all involve financing, acquisition of material, recruitment of participants, communication between the participants, etc. While these activities occur frequently and predominantly for legitimate reasons, when combined in particular contexts they can potentially provide enough cause for further information collection and analysis, enabling the type of link analysis that Jonas and Harper advocate.

Footnote 5: This is similar to a scene from "Fiddler on the Roof." In the scene, Tevye hears an argument between his neighbors Perchik and Mordcha, and after hearing each of their positions, says "you are right" and "you are also right." Another character, Avram, says, "He's right and he's right? They can't both be right." Tevye replies, "You know, you are also right." As in this scene, who is really right in this situation?

Improvements in data mining algorithms that would enable the learning of usefully discriminative patterns from minimal training data are a challenge for the research community; while pattern-based data mining may be inadequate at present and even for the foreseeable future, new techniques may prove to be useful at some point. Before they would be deployed or even considered for inclusion in a security application, such techniques would have to be subject to a rigorous cost-benefit analysis, including considerations of data use and privacy implications. In all likelihood, such techniques would be useful only in combination with link-analysis techniques, referred to as "subject-based data analysis" and contrasted with "pattern-based data analysis" by Jonas and Harper. As the first step in a mass screening system, such predictive data mining is unlikely to be useful for the reasons pointed out by Jonas and Harper. However, despite their use of the term "predictive data mining" to describe what would be ineffective, they really are arguing against only a particular design choice rather than against the entire set of data mining techniques described in section 2 of this paper.

3.8 They Can Reverse Engineer the System

The Carnival Booth algorithm has been proposed as a way that bad guys can reverse engineer a security system [1]. This is a serious criticism, and it deserves a serious and thorough analysis.
The Carnival Booth algorithm is developed and analyzed in the context of the CAPPS system. The conclusion is that selecting individuals for increased scrutiny can actually decrease security, because the individuals can probe the selection algorithm to determine who is likely to be selected and who is not. The analysis assumes that a fixed percentage of people can be subject to secondary screening at airports, due to a fixed amount of screening resources and the need to keep passengers flowing through the system at a reasonable rate. Using the Carnival Booth algorithm, a terrorist group can determine who is not likely to be selected for increased scrutiny and then use that person to execute an attack. The analysis suggests that this algorithm would be most effective when there is a diverse population of potential attackers, and it presents anecdotal information that this is indeed the case.

Essentially, the Carnival Booth algorithm works if a terrorist can determine that his chance of being selected for secondary screening is less than average; for example (using the numbers from [1]), if 8 percent of passengers are subject to secondary screening and 2 percent are selected randomly, then a terrorist has to reduce his chance of being selected for enhanced screening to less than 6 percent. While an individual potential attacker cannot change his chance of being selected under this model, a terrorist group leader could use a population of potential attackers and select those who do not get selected on a large number of probing flights. A potential attacker is not reducing his actual chance of being selected; rather, he is decreasing the uncertainty in his estimate of his chance of being selected by repeatedly probing the system. Once some potential attacker is determined to have a lower-than-average chance of being selected for increased scrutiny, he is given the mission to execute an attack.

What are the flaws in this strategy? For one, it requires multiple recruits rather than a single recruit for each position on the attack team. While there may be one recruit whose profile causes him to be less likely than average to be selected for increased scrutiny, it is unlikely that, on average, the recruits will be less likely than average to meet the selection criteria. In fact, one can make an argument that people who are subject to terrorist recruitment are actually, on average, more likely to be subject to increased scrutiny, especially if those designing the selection criteria have insights into what makes someone susceptible to terrorist recruitment. The Carnival Booth algorithm also assumes that even if a recruit is selected for increased scrutiny, he will not be arrested when he is not actually on an attack mission, because he will not be in possession of any suspicious materiel. This assumption ignores the fact that the recruit knows he is on a probing mission for the terrorist group and may behave in a way that arouses increased suspicion. Even if he is allowed to fly, his behavior may result in his becoming the subject of additional information collection – i.e., the starting point for a link analysis. The Carnival Booth algorithm, therefore, shares some characteristics with the classic gambler's strategy of doubling every losing bet – without an infinite amount of resources, it will eventually fail.
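A minimal Monte Carlo sketch of the probing strategy makes this concrete; the recruit selection probabilities below are invented for illustration and are not taken from [1]:

    # Sketch of Carnival Booth probing: each recruit has a fixed (unknown)
    # probability of selection for secondary screening; the group flies its
    # recruits repeatedly and keeps those who are never selected.
    import random

    random.seed(0)

    def never_selected(p_select, flights):
        # True if the recruit is not selected on any of the probing flights.
        return all(random.random() >= p_select for _ in range(flights))

    # Recruits' true selection probabilities cluster around the 8 percent
    # average; 2 percent of screening is random and cannot be probed away.
    recruits = [0.02 + random.uniform(0.0, 0.12) for _ in range(20)]

    clean = [p for p in recruits if never_selected(p, flights=10)]
    print(f"{len(clean)} of {len(recruits)} recruits survived 10 probing flights")
    # Probing only narrows the group's estimate of each recruit's fixed
    # probability; every probe is also a fresh chance to arouse suspicion
    # or to trigger adaptation of the selection profiles.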
The fact that there must be a reasonably large number of recruits for the Carnival Booth algorithm to yield one recruit with a lower-than-average selection probability creates additional risk of mistakes or exposure for the terrorist group. And the probing activity of the recruits could result in adaptation of the profiles used for selection, which would defeat the supposed advantage of the Carnival Booth algorithm, especially if this adaptation occurred as frequently as the probes.

3.9 You Can't Catch the Lone Wolf

This criticism is the most serious of all that have been proposed. A capable individual acting alone, who devises and executes a serious new attack scheme, will likely be able to evade detection. Reducing the possibility that this scenario can occur would seem to be a critical aspect of providing increased security. Tighter controls on dangerous materials, separating components needed to create weapons, and reducing the motivations of people to engage in terrorist activities would seem to be the most effective strategies for this difficult problem. In a sense, this criticism says that a security application won't be able to detect someone who manages to avoid all of its data sources and analytical techniques. Such an interpretation is obviously true but not particularly insightful.

4. METRICS

Any paper on the efficacy of data mining applications for any purpose whatsoever is incomplete without at least a brief discussion of metrics. This paper is no different. We briefly present a number of metrics that are either implicit or explicit in the evaluation of data mining applications for security. Ultimately, it is not analyses of the type discussed herein, but rather rigorous metrics-based experiments, that will establish the efficacy of alternative designs and techniques for security applications.

Because of the high cost of a terrorist attack, the typical metric for a security application is the number (or probability) of false negatives, i.e., failures to prevent an attack. This is typically traded off against the probability of a false positive. Because of the vastly unequal costs of false negatives (extremely high) compared to false positives, and the vastly different numbers of true positives (extremely low) compared to true negatives in the population, the likelihoods of misclassifications must be weighted by these costs and frequencies to determine the overall costs of a security application. The benefits of the security application are expressed in terms of threats averted. Because the benefits of a security application depend on the assumed distribution of threats in a population, it is desirable to have a metric that illustrates its effectiveness independent of this distribution. A metric with this property is discussed in [13]. Other metrics that may be used to evaluate alternative system designs include the distribution of costs (e.g., is it better or worse to inconvenience one person a lot or many people a little?) and minimizing the worst-case outcome (i.e., maximizing the likelihood of preventing the most serious threats even if this means increasing the chances that more instances of less serious threats will occur). Another class of metrics relates to the various criticisms. How quickly can new patterns be discovered, validated, and deployed? What is the value in preventing previous attack patterns compared to detecting new ones? How can security forces cause maximum disruption to attackers while minimizing costs to those whom they are protecting?
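As a schematic illustration of this cost-and-frequency weighting, consider the sketch below; the rates and unit costs are arbitrary placeholders chosen only to show the structure of the computation, not measured values:

    # Misclassification likelihoods weighted by frequencies and costs.
    def expected_cost(n_pos, n_neg, fnr, fpr, cost_fn, cost_fp):
        return n_pos * fnr * cost_fn + n_neg * fpr * cost_fp

    # Rare threats with extremely costly misses versus numerous, individually
    # cheap false alarms: with these placeholder figures the rare misses
    # dominate, but shifting the rates or unit costs shifts the balance.
    total = expected_cost(n_pos=3_000, n_neg=299_997_000,
                          fnr=0.01, fpr=0.001,
                          cost_fn=1e9, cost_fp=1e2)
    print(f"expected cost: {total:,.0f}")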
5. CONCLUSIONS

What can we conclude? Is data mining useful for security or not? What aspects of data mining are likely to be useful, and what aspects are likely to be ineffective? What criticisms are valid because of the requirements of security applications, and what criticisms really just point out ineffective designs? Where might additional research yield useful new techniques, and where is it unlikely to do so?

In some areas, the jury is still out. Data mining algorithms have not yet resulted in the ability to discover patterns that can predict terrorism or other security threats that manifest themselves rarely and as a complex set of related events. They have been effective at discovering patterns that can detect common events that occur more frequently, such as cellular telephone or credit card fraud. A challenge for the research community is to design algorithms that can extend the range of feasible applications. Even as this range is extended, it is extremely unlikely that completely automated pattern discovery will be useful by itself for the detection of terrorist events. However, automated pattern discovery tools may be able to aid in the discovery of patterns of activity that are components of such threats and that can be incorporated into security applications. These security applications would have to include other techniques as well in order to be useful for their specific purposes. So while data mining will not be an entire solution, it can be a useful component of such a solution.

The hardest threat to detect is that of a capable, intelligent, adaptive adversary acting alone. Therefore, the most effective strategy is one that makes this threat increasingly unlikely. The other threats – from less capable adversaries, non-adaptive adversaries, and less intelligent adversaries – can be effectively countered by appropriately designed and deployed data mining applications as a key part of a multi-layered prevention and detection system. Data mining can be one technique for pattern discovery, but it is only a part of the design and deployment of an effective security application. Other techniques, such as starting from known subjects and performing link analyses, as well as detection of dangerous materials and discouraging terrorist recruitment, are at least as important. While patterns may be useful to guide a search, following connections from known risky subjects matters more.

Finally, it is important once again to note that effective security applications are complex systems that must have a clearly defined purpose; clearly specified authorities and authorizations; appropriate, available and useful data; and clear and manageable business procedures, in addition to effective technologies, if they are to succeed. And they must respect all aspects of privacy, civil liberties, and other considerations regarding the use and retention of data for specific purposes. Even with all these constraints, it is possible to design security applications that are useful and to continue research into how to do so.

6. ACKNOWLEDGMENTS

I thank the many colleagues with whom I have had the opportunity to discuss and refine many of the ideas in this paper over the years. In particular, I thank Henry Goldberg for helping to develop many of the ideas discussed in the paper and David Jensen for helping to develop the model used in figure 2 as well as for much useful discussion and feedback. Responsibility for the ideas in the paper is, of course, solely that of the author.
7. REFERENCES

[1] Chakrabarti, S. and Strauss, A. "Carnival Booth: An Algorithm for Defeating the Computer-Aided Passenger Screening System," First Monday, Vol. 7, No. 10, 7 October 2002. http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/992/913

[2] Executive Committee of the ACM Special Interest Group on Knowledge Discovery and Data Mining. "Data Mining" Is NOT Against Civil Liberties. June 30, 2003 (revised July 28, 2003). http://www.sigkdd.org/civil-liberties.pdf

[3] Fayyad, U.M., Piatetsky-Shapiro, G., and Smyth, P. From Data Mining to Knowledge Discovery: An Overview. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy, 1-30. Menlo Park, CA: AAAI Press, 1996.

[4] Jensen, D., Rattigan, M., and Blau, H. Information Awareness: A Prospective Technical Assessment. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003) (Washington, DC, USA, August 24-27, 2003). ACM Press, New York, NY, 2003, 378-387.

[5] Jonas, J. and Harper, J. Effective Counterterrorism and the Limited Role of Predictive Data Mining, Policy Analysis No. 584. Cato Institute (December 11, 2006). http://www.cato.org/pubs/pas/pa584.pdf

[6] McLay, L.A., Jacobson, S.H., and Kobza, J.E. Making Skies Safer: Applying Operations Research to Aviation Passenger Prescreening Systems. OR/MS Today, October 2005.

[7] Office of the Director of National Intelligence. Data Mining Report. February 15, 2008. http://www.fbiic.gov/public/2008/feb/ODNI_Data_Mining_Report.pdf

[8] Office of the Director of National Intelligence. Data Mining Report. January 31, 2009. http://www.dni.gov/electronic_reading_room/ODNI_Data_Mining_Report_09.pdf

[9] Paulos, J. Do the Math: Rooting Out Terrorists is Tricky Business. Los Angeles Times, January 23, 2003.

[10] Schneier, B. and Hawley, K. Interview with Kip Hawley (July 30, 2007). http://www.schneier.com/interviewhawley.html

[11] Scientific American (editorial). Total Information Overload. Scientific American, March 2003, 12.

[12] Seifert, J.W. Data Mining and Homeland Security: An Overview. Congressional Research Service (Order Code RL31798), updated January 27, 2006. http://www.au.af.mil/au/awc/awcgate/crs/rl31798.pdf

[13] Senator, T.E. Ongoing Management and Application of Discovered Knowledge in a Large Regulatory Organization: A Case Study of the Use and Impact of NASD Regulation's Advanced Detection System (ADS). In KDD-2000, 44-53.

[14] Senator, T.E. Multi-Stage Classification. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM '05) (Houston, TX, November 27-30, 2005).

[15] Taipale, K. Data Mining and Domestic Security: Connecting the Dots to Make Sense of Data. Columbia Science and Technology Law Review, Vol. 5, No. 2 (Dec 2003). Available at SSRN: http://ssrn.com/abstract=546782

[16] Vijayan, J. House Committee Chair Wants Info on Cancelled DHS Data-Mining Programs: Millions Have Been Spent on Work That Was Eventually Abandoned. ComputerWorld, September 18, 2007. http://www.computerworld.com/action/article.do?command=viewArticleBasic&articleId=9037319

A Study of Online Service and Information Exposure of Public Companies

S. H. Kwok, Department of ISOM, School of Business Administration, HKUST ([email protected])
Anthony C.T. Lai, Department of ISOM, School of Business Administration, HKUST ([email protected])
Jason C.K. Yeung, Handshake Networking, Hong Kong ([email protected])
ABSTRACT

It is believed that public companies should have put substantial effort and resources into designing and implementing effective security policies for their daily information processing and management against potential cyber attacks. A company web server, accessible by the general public and by attackers, is usually a common entry point for cyber attacks. This paper studies and reports the security problems in web servers of public companies. We applied several commonly used tools and systems to collect information from publicly accessible web servers of selected public companies, and studied some known security aspects of those companies. Our findings provide an insight into the effectiveness of the web servers of public companies against cyber attacks. This paper also proposes a risk analysis tool for cyber attacks, known as the pyramid risk analysis tool.

Categories and Subject Descriptors

K.6.5 [Security and Protection]: Unauthorized access (e.g., hacking, phreaking).

Keywords

Cyber security, public company, ports, web server, public server, malicious hackers.

1. INTRODUCTION

Hackers, criminals and spies have increased their attacks on information systems and databases in public companies that contain sensitive information. Targets in public companies include financial systems, operations systems, management systems, and so on. An easy and common target is the public web server, because most public companies enable people and users to access their products and services via the Internet. This paper addresses the research question of the effectiveness of cyber security protections in public companies.

An extended investigation would include an exploration of the network diagram of a public company, including its sub-networks, domain addresses, and the technology and systems being used in the company. This requires access to internal networks and confidential information. This paper is primarily focused on the security issues and problems in web servers of public companies. Findings in our paper address the security problems of (1) Common and Necessary Open Ports, (2) Allowed Open DNS, (3) Allowed Open Mail Relay, (4) Web Server Banner and Version Exposure, (5) SPAM Mail Black List, (6) SPF Support, (7) Sensitive Administrative Console and Server Information Exposure, (8) Opened Vulnerable Network Service Ports, and (9) Online Internal-use-only Services. Descriptions of the above security problems are presented in Table 1.

(1) Common and Necessary Open Ports: Disclose internal server information and layout. In the current research, our survey only included TCP port scanning.
(2) Allowed Open DNS: Allow an external party to query the internal mapping between IP addresses and server hosts. More information can be obtained at http://en.wikipedia.org/wiki/Domain_name_system.
(3) Allowed Open Mail Relay: Allow unauthorized parties to send non-legitimate emails.
(4) Web Server Banner and Version Exposure: The information could be useful for malicious hackers and attackers to carry out further vulnerability exploitation.
(5) SPAM Mail Black List: Disrupts business operations if mail is blocked once the company's domains are put on SPAM mail black lists.
(6) SPF Support (Sender Policy Framework): Restricts other people from sending emails with one's domain by using SPF; in other words, it prevents sender address forgery. More information can be obtained at http://www.openspf.org/Introduction.
(7) Sensitive Administrative Console and Server Information Exposure: Enables malicious hackers and attackers to access internal applications.
(8) Opened Vulnerable Network Service Ports: Apart from the ports necessary for web services, there are other common top vulnerable ports which are easy targets for malicious hackers and attackers.
(9) Online Internal-use-only Services: Other online portals/entries accessible by the public may increase threats from hackers and attackers.

Table 1: Security Check Areas.
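As an illustration of the kind of check behind surveyed area (6), the sketch below looks up a domain's published SPF policy in its DNS TXT records; it assumes the third-party dnspython package, and the domain is a placeholder:

    # Illustrative SPF lookup (surveyed area 6) via DNS TXT records.
    # Requires dnspython: pip install dnspython
    import dns.exception
    import dns.resolver

    def spf_record(domain):
        # Return the domain's SPF policy string, or None if none is published.
        try:
            answers = dns.resolver.resolve(domain, "TXT")
        except dns.exception.DNSException:
            return None
        for rdata in answers:
            text = b"".join(rdata.strings).decode(errors="replace")
            if text.lower().startswith("v=spf1"):
                return text
        return None

    print(spf_record("example.com") or "no SPF record published")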
In this paper, we randomly selected ten public companies from the Hong Kong Hang Seng Index (HSI) and ten public companies from the Hong Kong China Enterprises Index (CEI). We applied several commonly used approaches for data collection. The techniques include DNSstuff (http://www.dnsstuff.com), Google (http://johnny.ihackstuff.com/ghdb/), Maltego (http://www.paterva.com), and Nmap (http://nmap.org/). They are commonly used in penetration tests in IS auditing disciplines and practices. With those techniques, we also investigated several common attacks, including (1) web site defacement, (2) phishing attacks, (3) unauthorized access to internal system information and databases, and (4) denial of service and vulnerability exploitation where systems are not patched immediately (these vulnerabilities could result in negative impacts on business operations once they are exploited and manipulated).

2. Web Sites and Data Collection Tools

2.1 Selected Web Sites

We have selected ten HSI and ten CEI publicly listed companies. The HSI companies comprise industries in finance, banking, property, and utilities, while the CEI companies are composed of industries in transportation facilities, oil, railways, manufacturing, telecommunications, and banking and finance [1]. Our surveyed information could be obtained from Google and various network service provider companies. In addition, to protect the surveyed companies, the figures and comments are presented in a way that does not identify any individual company.

2.2 Data Collection Tools

The data collection tools used in this paper include DNSstuff, Google [2], Maltego and Nmap. Table 2 lists the common online services checked via Google and their possible impacts.

- Remote Desktop (Microsoft): Allows malicious hackers to take over the machine by remotely connecting to the server or workstation.
- OWA (Microsoft Outlook Web Access): Allows malicious hackers to gain email access.
- Citrix Metaframe Login: Allows malicious hackers to access internal networks.
- Lotus Notes: Allows malicious hackers to gain email access.
- System/source code/database backup files, server information and configuration files: Expose system, internal and business information.
- Various administrative consoles: Allow malicious hackers to access internal applications.
- Restrict search engine with robots.txt: Allows malicious hackers to gather information about folders that may contain sensitive information through the robots.txt file.
- Directory traversal: Allows malicious hackers to gather sensitive information about the folders and files of a company.

Table 2: Selected Google checking criteria for our study.
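The sketch below gives a flavor of the simplest of these checks, open-port probing and web server banner collection (surveyed areas (1) and (4)), using only the Python standard library; the hostname is a placeholder, and such probes should only ever be run against systems one is authorized to test:

    # Minimal TCP port probe and HTTP Server-header grab (areas 1 and 4).
    import socket

    COMMON_PORTS = [21, 25, 80, 443, 445, 3389]

    def open_ports(host, ports=COMMON_PORTS, timeout=2.0):
        # Return the subset of ports accepting TCP connections.
        found = []
        for port in ports:
            try:
                with socket.create_connection((host, port), timeout=timeout):
                    found.append(port)
            except OSError:
                pass
        return found

    def server_banner(host, timeout=2.0):
        # Return the HTTP Server header, if the web server exposes one.
        with socket.create_connection((host, 80), timeout=timeout) as s:
            s.sendall(b"HEAD / HTTP/1.0\r\nHost: " + host.encode() + b"\r\n\r\n")
            reply = s.recv(4096).decode(errors="replace")
        for line in reply.splitlines():
            if line.lower().startswith("server:"):
                return line.strip()
        return "no Server header exposed"

    host = "www.example.com"   # placeholder target
    print(open_ports(host))
    print(server_banner(host))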
3. Results and Data Analysis

3.1 Risk Rating Definitions

Based on the industry experience of the authors, we have established the risk-rating table shown in Table 3 to measure the risk exposure level of various services. (Similar research work may be found in the annual Data Breach Report by Verizon Business, http://www.verizonbusiness.com/products/security/risk/databreach.) Each risk item is marked in a color representing its risk rating, as follows.

- Critical: The vulnerability could be used to compromise the application or infrastructure, resulting in a severe business impact.
- High: A medium to high level of technical knowledge is required for an attacker to gain unauthorized access to a system or data from a single vulnerability, or the vulnerability gives an attacker the ability to harm the professional image of the corporation.
- Medium: A vulnerability that by itself will not allow unauthorized access to systems or data. However, two or more Medium-rated vulnerabilities used in conjunction may allow an attacker unauthorized access to systems or data.
- Low: Information disclosure. The information gleaned from these vulnerabilities will not allow an attacker to gain direct access to systems or data. It may, however, be used to escalate a separate vulnerability.
- Observation and Unknown: Software/system that is considered vulnerable, but the team was unable to complete testing.

Table 3: Risk Rating Definitions.

3.2 Data Analysis

Ten HSI Companies

We have summarized the results of the surveyed areas and the risk analysis on the HSI companies in Table 4; the table also shows the corresponding analysis for the CEI companies, discussed below.

(1) Common and Necessary Open Ports. HSI and CEI: For most of the web servers, only the necessary ports (TCP 80 and 443) are open, which is considered good practice.
(2) Allowed Open DNS. HSI and CEI: Recursive lookups are enabled on most of the DNS servers. This may cause an excessive load on the DNS server but is not considered a serious risk.
(3) Allowed Open Mail Relay. HSI and CEI: No open mail relay was found at the surveyed companies, which is good practice against SPAM and non-legitimate mail delivery.
(4) Web Server Banner and Version Exposure. HSI and CEI: Most of the web servers, especially Microsoft IIS, are configured with the default IIS banner. If the service banners contain useful information such as the software name and its version number, then hackers can better target their exploits. Script-kiddies often scan whole IP blocks for a known vulnerability and only attack those hosts that return a banner showing that they run the vulnerable service.
(5) SPAM Mail Black List. HSI: None of the surveyed companies appears on SPAM mail black lists. CEI: Eight servers are placed on popular SPAM blacklists; mail sent from these servers is blocked, and service availability is damaged.
(6) SPF Support. HSI and CEI: Two public companies from the banking industry have adopted SPF to prevent forged email sending.
(7) Sensitive Administrative Console and Server Information Exposure. HSI: All scanned web servers are well protected against exposing an administrative console, implying that it is well hidden and probably available only in an intranet environment. CEI: Most of the scanned web servers are similarly well protected. However, one highway company exposed its content management system administrative console, which could be targeted by malicious hackers to carry out account guessing.
(8) Opened Vulnerable Network Service Ports. HSI: One company exposed its FTP port 21, which may be unnecessary and should not be available to the public. CEI: On most of the web servers, many unnecessary ports (TCP 21, 445 and 3389) are open that are not expected to be accessible from the Internet. Some of them are even public service ports (possibly FTP and Windows Remote Desktop services). Once such a port is exploited, the hacker or attacker may gain full access to the system.
(9) Online Internal-use-only Services. HSI and CEI: Generally acceptable, but using insecure protocols like FTP (ftp.domain.com) and self-describing naming schemes (secure.domain.com) is not recommended. An attacker might easily identify all sensitive important components by a brute-force attack with a list of keywords (such as secure.domain.com, vpn.domain.com, ssl.domain.com, etc.).

Table 4: Summary of risk analysis on the ten HSI and ten CEI companies.

In our analysis of the ten HSI companies, the general controls over DNS and mail servers are satisfactory. They do not allow open mail relay, so external parties cannot send non-legitimate emails through their mail gateways. In addition, very few of them are placed on SPAM mail black lists. Two international banks in Hong Kong have actually adopted SPF. However, 50% of the surveyed companies allow open DNS queries, so external parties could query their internal IP address mappings, which could expose internal system infrastructure information. We have rated these items as Medium risk.

The state of open ports at the surveyed HSI companies is satisfactory. The surveyed companies enabled only necessary ports and did not disclose further information about their server services.
However, over 80% have exposed their web server banners and versions, and 40% of them exposed their internal-use-only services, including FTP, webmail and password-protected intranet sites. These services should not be available to, or discoverable by, the public. We have rated these items as Medium risk.

From the Google hacking, the results are satisfactory: no company disclosed sensitive files or enabled administrative consoles. Nevertheless, one famous local property company exposed its server information via a phpinfo() page, which reveals its IP address, internal system path mapping, the database management systems in use (whether LDAP and MySQL are running or not) and the versions of its web servers. Even though this does not lead to immediate risk, we have rated it High for that property company, as it implies that its control over change management, review and server hardening is insufficient. Another company even exposed its content management administrative console to the public, without a secure channel or restrictions on which clients may connect. These services should not be searchable by or available to the public. We have rated these items as High risk.

Ten CEI Companies

We have summarized the results of the surveyed areas and the risk analysis on the CEI companies in Table 4. Our analysis shows that the general controls over DNS and mail servers in the selected CEI companies are satisfactory. They did not allow open mail relay for external parties to send non-legitimate emails through their mail gateways. In addition, very few of them had been placed on SPAM mail black lists. However, 70% of the surveyed companies allowed open DNS queries, so external parties could query their internal IP address mappings, which could expose internal system infrastructure information. We have rated these items as Medium (amber) risk.

50% of the surveyed CEI companies opened unnecessary ports and disclosed further information about their server services. In the worst case, two companies had opened nearly all vulnerable ports on their web servers. Those vulnerable ports include remote desktop connection, NetBIOS, Remote Procedure Call (RPC), database connections, etc. They are considered easy targets for worms and determined hackers to compromise the company services.
We rate this risk area for these two CEI companies as Critical. Over 80% exposed their web server banners and versions, and 80% of them exposed their internal-use-only services, including FTP, webmail and password-protected intranet sites. In addition, 50% of the web servers disclosed their version information; such versions are proven vulnerable to existing attacks [3], including Cross-Site Scripting (XSS), which allows malicious attackers to execute their own code through those web servers. The Google hacking results are satisfactory: no company disclosed sensitive files or enabled administrative consoles. Nevertheless, one property company exposed its content management administrative console; we have rated this item as Medium risk. Within these ten CEI companies, the railway and oil refinement companies, which have government support, in general have stronger controls and server hardening than the other, privately owned companies.

4. Pyramid Risk Analysis

Based on the top vulnerable ports published in the SANS Top 20 in 2007 [5], we may identify the common vulnerable services and ports targeted by malicious hackers. Using the risk ratings defined in Section 3.1, we can rate those vulnerable services and ports, and the results are presented in Table 5.

- Web server services (Low): HTTP (80/tcp, 8000/tcp, 8080/tcp, 8888/tcp).
- Naming services (Medium): DNS (53/udp) on machines which are not DNS servers; DNS zone transfers (53/tcp); LDAP (389/tcp and 389/udp).
- Mail (Medium): SMTP (25/tcp); POP (109/tcp and 110/tcp); IMAP (143/tcp).
- Login services (High): telnet (23/tcp); FTP (21/tcp); NetBIOS (139/tcp); rlogin et al. (512/tcp to 514/tcp).
- RPC and NFS (High): Portmap/rpcbind (111/tcp and 111/udp); NFS (2049/tcp and 2049/udp); lockd (4045/tcp and 4045/udp).
- NetBIOS in Windows NT and XP (High): for Windows NT, 135 (tcp and udp), 137 (udp), 138 (udp), 139 (tcp); for Windows 2000 and later, the earlier ports plus 445 (tcp and udp).
- X Windows (High): 6000/tcp through 6255/tcp.
- Miscellaneous (High): TFTP (69/udp); finger (79/tcp); NNTP (119/tcp); NTP (123/udp); LPD (515/tcp); Syslog (514/udp); SNMP (161/tcp and 161/udp, 162/tcp and 162/udp); BGP (179/tcp); SOCKS (1080/tcp).
- Database (Critical): Microsoft SQL Server (1433/tcp and 1434/udp); Oracle (1521/tcp); IBM DB2 (ports 523 and 50000 up); IBM Informix (9088/tcp and 9099/tcp); Sybase (4100/tcp or 2025/tcp); MySQL (3306/tcp); PostgreSQL (5432/tcp).
- Backup servers (Critical): Symantec Veritas Backup Exec (TCP/10000, TCP/8099, TCP/6106, TCP/13701, TCP/13721 and TCP/13724); CA BrightStor ARCServe Backup Agent (TCP/6050, UDP/6051, TCP/6070, TCP/6503, TCP/41523, UDP/41524); Sun and EMC Legato Networker (TCP/7937-9936).

Table 5: Relationships between services and ports, and defined risk ratings.

This paper proposes a pyramid risk analysis tool to illustrate these relationships, as shown in Figure 1. The pyramid contains all the vulnerable ports and services addressed by SANS, with risk ratings inserted to indicate their risk levels. The apex of the pyramid refers to the most common ports that must necessarily be accessible by the public, while the bottom of the pyramid refers to the ports and services that provide sensitive and confidential information and should not be accessible by external parties. If a company has opened, for example, database and backup service ports, this is considered dangerous and the company is classified at the Critical risk level. If a company has opened only web services, it is at Low risk. This pyramid risk tool and model is established according to the authors' industry experience in auditing, risk assessment and penetration testing, incorporating the SANS Top 20 vulnerabilities [5]. The idea behind the pyramid risk tool is to present the relationship between risk levels and the services accessible externally and internally; the bottom of the pyramid represents the most critical and important systems/components, since they can contain sensitive enterprise data/information, and once they are compromised, business operations could be severely impacted and interrupted.

Figure 1: Pyramid risk tool.
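A minimal sketch of how the pyramid's port-to-rating mapping can be applied is shown below; the port table follows Table 5 (plus 443 and 3389 from the earlier discussion), and the example port lists are invented:

    # Rate a company by the most severe Table 5 rating among its open ports.
    PORT_RISK = {
        80: "Low", 443: "Low", 8000: "Low", 8080: "Low", 8888: "Low",  # web
        53: "Medium", 389: "Medium", 25: "Medium", 110: "Medium",      # DNS/mail
        23: "High", 21: "High", 139: "High", 445: "High", 3389: "High",# login/remote
        1433: "Critical", 3306: "Critical", 5432: "Critical",          # database
        10000: "Critical",                                             # backup
    }
    SEVERITY = ["Low", "Medium", "High", "Critical"]

    def company_risk(ports):
        # Highest rating among the open ports, if any are listed.
        ratings = [PORT_RISK[p] for p in ports if p in PORT_RISK]
        return max(ratings, key=SEVERITY.index) if ratings else "Observation and Unknown"

    print(company_risk([80, 443]))          # Low: only necessary web ports
    print(company_risk([80, 21, 3389]))     # High: exposed login/remote services
    print(company_risk([80, 1433, 10000]))  # Critical: database and backup ports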
For example, when a company web site is defaced, the legal and operational impact by itself may be limited; however, attackers can work their way from the listed open and vulnerable services at the top of the pyramid down to the bottom, and eventually steal confidential company or customer information. In addition, according to the 2009 Data Breach Investigations Report published by Verizon Business [6], covering 90 confirmed data breach cases from 2008, database servers with online data accounted for 75% of the total number of compromised data records among company assets (including POS systems, application servers, kiosk systems, web servers, file servers and workstations), while appearing in 30% of breaches; this shows that the database server is a major target for malicious hackers to break into.

We applied the pyramid risk tool to evaluate the risk levels of the ten HSI companies and ten CEI companies. The results are presented in Figure 2. They show that the risk level of the ten HSI companies is lower than that of the ten CEI companies. The situation is particularly dangerous for the two CEI companies that opened their database and backup service ports; they will attract attention from bots and worms for future attacks. By contrast, the ten HSI companies are regarded as safer because their servers enabled only the fewest necessary services to be accessible by the public.

Figure 2: Risk levels of the selected ten HSI and ten CEI companies.

From the 2009 Data Breach Investigations Report published by Verizon Business [6], remote access and management and web applications are the most popular and easiest attack pathways, contributing 75% of the total number of breach cases. This readily raises concerns over how an enterprise should protect its online assets. Most of the HSI companies are local companies with foreign investors, and their awareness and knowledge of web server hardening and server protection are far better than those of the surveyed CEI companies. In particular, the banks' online server security could be rated the most secure among all the surveyed industries, because the monetary authority in Hong Kong has imposed strict compliance requirements for technology risk control [4].

To close the security gaps among the CEI companies, relevant security training and awareness programs should be given to operational and administrative staff. In addition, the government should impose compliance requirements for online security and data privacy protection on companies, similar to those in the banking and finance industry. Furthermore, regular security audits and penetration tests should be conducted to detect potential risks and flaws earlier. We also suggest that companies put more effort into preparing detailed security policies and standards for daily operations and management. For more detailed recommendations on in-depth control implementation, readers should refer to the SANS Top 20 [5] and the 2009 Data Breach Investigations Report [6].

The proposed pyramid risk analysis tool is used to rate the risk level of a company. Our results indicate that the risk level of the CEI companies is higher than that of the HSI companies because the CEI companies open high-risk ports and services to the public. Those high-risk ports and services are dangerous because hackers and attackers are adept at exploiting them and may cause significant impacts to the company being attacked.
5. Conclusions

In this paper, we collected data from twenty public companies (ten HSI companies and ten CEI companies) and analyzed their risk levels. In general, the web servers and other online services of the CEI companies are rather open and vulnerable to attacks. In addition, some CEI companies do not own their servers but host their services via third-party service providers, which implies that they do not have adequate controls over outsourcing vendors. Generally speaking, server hardening is weak in the CEI companies, as the exposed banners and server versions provide important information for malicious attackers to carry out further attacks. Moreover, some published and exposed services have existing exploits and have not been upgraded to the latest versions, increasing the risk of attacks. The pyramid risk analysis tool can be used extensively in surveying more public and private companies to understand the risk level of companies in a particular geographical location, a particular industry, or a particular nation.

6. ACKNOWLEDGMENTS

Our thanks to the security researchers from Valkyrie X Research Lab, including C.K. Huen, Tony Miu, Alan Chung, William Cheung, Jason So, Jason Yeung, Eddie Lau and Leng Lee, for their help in data collection.

7. REFERENCES

[1] Hang Seng Indexes. http://www.hsi.com.hk/HSI-Net/

[2] Long, J. Google Hacking for Penetration Testers, Volume 2, 2008.

[3] Vulnerable Apache web server versions. http://httpd.apache.org/security/vulnerabilities_13.html

[4] Technology Risk Management, imposed by the Hong Kong Monetary Authority. http://www.info.gov.hk/hkma/eng/bank/spma/index.htm

[5] SANS Top 20, 2008. http://www.sans.org/top20/

[6] 2009 Data Breach Investigations Report, May 2009. http://www.verizonbusiness.com/resources/security/reports/2009_databreach_rp.pdf