Proceedings of the ACM SIGKDD Workshop on
CyberSecurity and Intelligence Informatics
(CSI-KDD)
June 28, 2009, Paris, France
held in conjunction with
SIGKDD’09
Workshop Organizers
Hsinchun Chen
Marc Dacier
Marie-Francine Moens
Gerhard Paass
Christopher C. Yang
The Association for Computing Machinery, Inc.
2 Penn Plaza, Suite 701
New York, NY 10121-0701
Copyright © 2009 by the Association for Computing Machinery, Inc (ACM). Permission to make
digital or hard copies of portions of this work for personal or classroom use is granted without
fee provided that the copies are not made or distributed for profit or commercial advantage and
that copies bear this notice and the full citation on the first page. Copyrights for components
of this work owned by others than ACM must be honored. Abstracting with credit is
permitted.
To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior
specific permission and/or a fee. Request permission to republish from: Publications Dept.
ACM, Inc. Fax +1-212-869-0481 or E-mail [email protected].
For other copying of articles that carry a code at the bottom of the first or last page, copying is
permitted provided that the per-copy fee indicated in the code is paid through the Copyright
Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.
Notice to Past Authors of ACM-Published Articles
ACM intends to create a complete electronic archive of all articles and/or other material
previously
published by ACM. If you have written a work that was previously published by ACM in any
journal or conference proceedings prior to 1978, or any SIG Newsletter at any time, and you do
NOT want this work to appear in the ACM Digital Library, please inform [email protected],
stating the title of the work, the author(s), and where and when published.
ACM ISBN: 978-1-60558-669-4
Additional copies may be ordered prepaid
from:
ACM Order Department
P.O. BOX 11405
Church Street Station
New York, NY 10286-1405
Phone: 1-800-342-6626 (U.S.A. and Canada)
+1-212-626-0500 (All other countries)
Fax: +1-212-944-1318
E-mail: [email protected]
ACM Order Number:
Printed in the U.S.A.
Preface
Computer-supported communication and infrastructure are integral parts of the modern economy.
Their security is of critical importance to a wide variety of practical domains, ranging from
Internet service providers to the banking industry and e-commerce, and from corporate networks
to the intelligence community.
The CSI-KDD workshop focuses on novel knowledge discovery methods addressing
CyberSecurity and intelligence issues, as well as innovative applications demonstrating the
effectiveness of data mining in solving real-world security problems. The challenge for novel
methods originates from the emergence of new types of content and protocols, and only an
integrated view of all modes promises optimal results. Innovative applications are essential, as
IT communication and computer-supported technical and social infrastructure have an
extremely complex structure and require a comprehensive approach to prevent criminal activities.
As an invited speaker we welcome André Bergholz, Fraunhofer IAIS. He will report on the
lessons learnt about phishing filtering in the EU-funded specific targeted research project
AntiPhish. He reports on filter methodologies evaluated in a test laboratory setting, and
describes the application of this technology to real-world email streams, where it is used to
filter all email traffic online in real time. In the afternoon there will be an invited talk entitled
“Data Security and Integrity: Developments and Directions” given by Bhavani Thuraisingham.
The talk concentrates on ensuring that only authorized individuals have access to data and that
data is protected from malicious corruption.
The workshop is organized in two tracks. The first track, "Novel Knowledge Discovery
Methods for the Security Domain", targets advanced data mining approaches for
CyberSecurity. The second track, "Innovative Techniques and Applications in Intelligence
Informatics", concentrates on large-scale security applications.
Despite the fact that various types of security mechanisms have been defined and are widely
deployed to prevent malicious users from launching attacks, cyber criminals remain quite
successful in misusing useful protocols and applications for their own benefit. Spam and
phishing campaigns are routinely carried out, drive-by download attacks are among the major
threats facing normal users surfing the web, and BGP hijacks corrupting Internet routing tables
are becoming known to the public as well.
The first track considers data mining approaches that can help address security issues
along at least three distinct axes. First, thanks to the very large amount of application
logs of various kinds available, data mining is a valid approach to better understand the attacks
we are facing and to help design better preventive and/or detection mechanisms in order to
respond to these attacks. Second, by analyzing traffic and other attack-related traces, data
mining can be helpful in getting a better picture of who is attacking us, tackling the “attack
attribution problem” under the umbrella of e-forensics techniques. Finally, data mining can be
employed to detect attacks by classifying content, e.g. by filtering spam and phishing messages.
Still, a number of research issues remain. Different types of content have to be analyzed (e.g.
email, websites, embedded images, transmitted code, activity logs), and only an integrated view
of all modes promises optimal results. This, for instance, involves mining the different media
associated with an email and combining the results to improve accuracy.
Spammers, hackers and producers of fraudulent content continuously change their tactics,
requiring adaptive and even anticipatory mining techniques. As intelligent opponents aim at
thwarting analyses, specific approaches for analysis are required. In addition, many new forms of
messaging (e.g., SMS, MMS), often anchored in a mobile environment, become victims of
malicious manipulation.
For the first track we accepted four very interesting oral presentations covering topics such as
intrusion and malware detection, attack attribution and spam filtering. Training intrusion
detection models is often computationally expensive, hence the interest in efficient models
that still assure high predictive accuracy. Yu-Shuo Chen and Yi-Ming Chen present this view
in the paper "Combining Incremental Hidden Markov Model and Adaboost Algorithm for
Anomaly Intrusion Detection".
In the paper entitled “Addressing the Attack Attribution Problem using Knowledge Discovery
and Multi-criteria Fuzzy Decision-Making", O. Thonnard, W. Mees and M. Dacier propose an
analysis framework to reason about the root causes of attacks observed on the Internet. They
apply it to some large real-world datasets and derive interesting insights into what they call
armies of zombies.
"A Data Mining Framework for Malware Detection Using Statistical Analysis of Byte-Level File
Content" by S. Momina Tabish, M. Zubair Shafiq and Muddassar Farooq discusses non-signature
based malware detection. The approach is successful and assumes that benign files are quite
distinct from malware files considering a byte-level format. A very novel and interesting
approach inspired by gaming theories is presented and applied on phishing mail filtering in an
adversarial setting by Gaston L’Huillier, Richard Weber and Nicolas Figueroa (Online Phishing
Classification Using Adversarial Data Mining and Signaling Games).
The second track is called "Innovative Techniques and Applications in Intelligence Informatics".
Intelligence Informatics is concerned with the study of the development and use of advanced
information technologies and systems for national, international, and societal security-related
applications. The annual IEEE International Conference series on Intelligence and Security
Informatics, with over two hundred attendees, was started in 2003. In addition, the Pacific Asia
Workshop on Intelligence and Security Informatics, with over eighty attendees, was started in
2006. These intelligence and security informatics events have brought together academic
researchers, law enforcement and intelligence experts, information technology consultants and
practitioners to discuss their research and practice related to various intelligence and security
informatics topics. Among these research topics, there is a strong focus on data mining and
knowledge discovery. This workshop is the first attempt to introduce intelligence informatics
to the ACM SIGKDD community. The four major topics of intelligence security include (a)
information sharing and data/text/web mining, (b) infrastructure protection and emergency
responses, (c) terrorism informatics, and (d) enterprise risk management and information system
security.
In information sharing and data/text/Web mining, we focus on criminal data mining,
criminal/intelligence information sharing and visualization, cyber crime detection and analysis,
authorship analysis, deception detection and analysis, and information sharing governance. We
investigate how to use advanced data sharing and mining techniques to support law enforcement
and intelligence experts in their investigations so that effective results can be achieved efficiently.
In infrastructure protection and emergency responses, we explore several interesting
infrastructure problems such as bioterrorism information infrastructure, transportation and
communication infrastructure protection, cyber-infrastructure design and protection, border
safety, disaster prevention, detection and management, and emergency response and
management. As recent natural disasters and terror attacks have shown, good infrastructure
protection and emergency response management minimizes damage and allows recovery from
devastation in a shorter amount of time.
In terrorism informatics, we investigate several terrorism-related informatics problems. For
instance, we investigate terrorism-related analytical methodologies and software tools, terrorism
knowledge portals and databases, terrorist incident chronology databases, terrorism root cause
analysis, social network analysis, forecasting and countering terrorism, and measuring the
effectiveness of counter-terrorism campaigns.
In recent years, we have also included enterprise risk management and information systems
security, in which we examine information security management standards, information systems
security policies, fraud detection, board activism and influence, corporate sentiment surveillance,
market influence analytics and media intelligence, and consumer-generated media and social
media analytics.
The program committee has selected five papers in intelligence informatics for presentation.
Park and Treglia developed a model and theory of intelligence information sharing through a
literature review, experience and interviews with practitioners. Yang and Tang proposed a
subgraph generalization approach to share and integrate terrorist or criminal social network data
between different intelligence and law enforcement units while preserving the privacy of
individuals in the social networks. Such a social network sharing and integration technique
improves the performance of social network analysis tasks such as centrality measurement.
Thuraisingham et al. investigated the information management component of military
stabilization and reconstruction operations. They developed a temporal service-oriented
architecture system (TG-SOA) that utilizes the temporal geosocial semantic web to manage the
lifecycle of stabilization and reconstruction operations. Senator examined the common criticisms
of data mining applications for security, which argue that such applications are ineffective and
threaten civil liberties. He analyzed these criticisms by modeling the phenomena and proposing
alternative designs. Kwok et al. studied the security of public companies’ web servers against
cyber attacks. The study included ten Hong Kong Hang Seng Index companies and ten Hong
Kong China Enterprises Index companies. A pyramid risk analysis tool was also proposed.
This workshop would not be possible without the invited speakers, the authors and the 37
members of the program committee. We express our gratitude towards them. We also thank
Fraunhofer IAIS, which took care of the CSI-KDD 2009 website as well as the submission
system.
Hsinchun Chen
Marc Dacier
Marie-Francine Moens
Gerhard Paass
Christopher C. Yang
CSI-KDD 2009 Organizers and Program Committee
Organizers
Hsinchun Chen, The University of Arizona, Tucson, USA
Marc Dacier, Symantec Research Labs Europe, France
Marie-Francine Moens, K.U. Leuven, Belgium
Gerhard Paass, Fraunhofer IAIS, St. Augustin, Germany (contact)
Christopher C. Yang, Drexel University, Philadelphia, USA
Program Committee
Adedeji B. Badiru, Air Force Institute of Technology, Dayton, OH, USA
Yigal Arens, USC/ISI, USA
John Aycock, University of Calgary, Canada
Antonio Badia, University of Louisville, USA
Andre Bergholz, Fraunhofer IAIS, Germany
Ulf Brefeld, MPI Saarbrücken, Germany
Patrick S. Chen, Tatung University, Taiwan
Robert W.P. Chang, Criminal Investigation Bureau, Taiwan
Domenico Dato, Tiscali Services, Italy
Yuval Elovici, Ben-Gurion University, Israel
Uwe Glaesser, Simon Fraser University, Canada
Nazli Goharian, Illinois Institute of Technology, USA
Mark Goldberg, RPI, USA
Henrik Grosskreutz, Fraunhofer IAIS, Germany
David Hicks, Aalborg University Esbjerg, Denmark
Thorsten Holz, University of Mannheim, Germany
Patrick Horkan, Symantec, Ireland
Sotiris Ioannidis, Institute of Computer Science, Greece
Latifur Khan, University of Texas at Dallas, USA
Engin Kirda, Eurecom, France
Christopher Kruegel, UC Santa Barbara, USA
Sheau-Dong Lang, University of Central Florida, USA
Ee-Peng Lim, Singapore Management University, Singapore
Evangelos Markatos, Institute of Computer Science, Greece
Robert Moskovitch, Ben-Gurion University, Israel
William Pottenger, Rutgers University, USA
Raghav Rao, State University of New York at Buffalo, USA
Elliot Rich, University at Albany, SUNY, USA
Stefan Rüping, Fraunhofer IAIS, Germany
Bracha Shapira, Ben-Gurion University, Israel
David Skillicorn, Queen's University, Canada
Randy Smith, University of Alabama, USA
Paul Thompson, Dartmouth College, USA
Cedric Ulmer, SAP Research, France
Nalini Venkatasubramanian, University of California, Irvine, USA
Zhao Xu, Fraunhofer IAIS, Germany
Urko Zurutuza, Mondragon University, Spain
Table of Contents
Preface.......................................................................................................................................... iii
Workshop Organizers and Program Committee ......................................................................... vii
Workshop Program ...................................................................................................................... ix
Invited Talk
AntiPhish – Lessons Learnt
André Bergholz .................................................................................................................. 1
Track 1: Novel Knowledge Discovery Methods for the Security Domain
Combining Incremental Hidden Markov Model and Adaboost Algorithm for Anomaly Intrusion
Detection
Yu-Shuo Chen and Yi-Ming Chen ......................................................................... 3
Addressing the Attack Attribution Problem using Knowledge Discovery and Multi-criteria Fuzzy
Decision-Making
O. Thonnard, W. Mees and M. Dacier ................................................................................ 11
A Data Mining Framework for Malware Detection Using Statistical Analysis of Byte-Level File
Content
S. Momina Tabish, M. Zubair Shafiq and Muddassar Farooq............................................ 23
Online Phishing Classification Using Adversarial Data Mining and Signaling Games
Gaston L’Huillier, Richard Weber and Nicolas Figueroa................................................... 33
Invited Talk
Data Security and Integrity: Developments and Directions
Bhavani Thuraisingham ..................................................................................................... 43
Track 2: Innovative Techniques and Applications in Intelligence Informatics
Towards Trusted Intelligence Information Sharing
Joon Park and Joseph Treglia.............................................................................. 45
Social Networks Integration and Privacy Preservation using Subgraph Generalization
Christopher Yang and Xuning Tang ................................................................... 53
Novel Knowledge Discovery Methods for the Security Domain
Bhavani Thuraisingham, Latifur Khan and Murat Kantarcioglu ........................ 63
On the Efficacy of Data Mining for Security Applications
Ted Senator ......................................................................................................................... 75
A Study of Online Service and Information Exposure of Public Companies
Sai Ho Kwok, Cheuk Tung Lai and Jason Yeung............................................................... 85
Workshop Program
09:00-10:00 Invited Talk
    AntiPhish – Lessons Learnt
    André Bergholz
10:00-10:30 Coffee Break
10:30-12:30 Track 1: Novel Knowledge Discovery Methods for the Security Domain
    10:30-11:00 Combining Incremental Hidden Markov Model and Adaboost Algorithm
    for Anomaly Intrusion Detection
        Yu-Shuo Chen and Yi-Ming Chen
    11:00-11:30 Addressing the Attack Attribution Problem using Knowledge Discovery
    and Multi-criteria Fuzzy Decision-Making
        O. Thonnard, W. Mees and M. Dacier
    11:30-12:00 A Data Mining Framework for Malware Detection Using Statistical
    Analysis of Byte-Level File Content
        S. Momina Tabish, M. Zubair Shafiq and Muddassar Farooq
    12:00-12:30 Online Phishing Classification Using Adversarial Data Mining and
    Signaling Games
        Gaston L’Huillier, Richard Weber and Nicolas Figueroa
12:30-14:00 Lunch
14:00-14:40 Invited Talk
    Data Security and Integrity: Developments and Directions
    Bhavani Thuraisingham
14:40-15:30 Track 2A: Innovative Techniques and Applications in Intelligence Informatics
    14:40-15:05 Towards Trusted Intelligence Information Sharing
        Joon Park and Joseph Treglia
    15:05-15:30 Social Networks Integration and Privacy Preservation using Subgraph
    Generalization
        Christopher C. Yang and Xuning Tang
15:30-16:00 Coffee Break
16:00-17:30 Track 2B: Innovative Techniques and Applications in Intelligence Informatics
    16:00-16:30 Novel Knowledge Discovery Methods for the Security Domain
        Bhavani Thuraisingham, Latifur Khan and Murat Kantarcioglu
    16:30-17:00 On the Efficacy of Data Mining for Security Applications
        Ted Senator
    17:00-17:30 A Study of Online Service and Information Exposure of Public Companies
        Sai Ho Kwok, Cheuk Tung Lai and Jason Yeung
AntiPhish – Lessons Learnt
André Bergholz
Fraunhofer Institute for Intelligent Analysis
and Information Systems (IAIS)
Schloss Birlinghoven
St. Augustin, Germany
[email protected]
ABSTRACT
Phishing emails usually contain a message from a credible
looking source requesting a user to click a link to a website where
she/he is asked to enter a password or other confidential
information. Most phishing emails aim at withdrawing money
from financial institutions or getting access to private information.
Phishing has increased enormously over recent years and is a
serious threat to global security and the economy. There are a
number of possible countermeasures to phishing, ranging from
communication-oriented approaches like authentication protocols,
through blacklisting, to content-based filtering approaches [3].
We argue that the first two approaches are currently not
broadly implemented or exhibit deficits. Therefore content-based
phishing filters are necessary and widely used to increase
communication security. A number of features are extracted
capturing the content and structural properties of the email.
Subsequently a statistical classifier is trained using these features
on a training set of emails labeled as ham (legitimate), spam or
phishing. This classifier may then be applied to an email stream to
estimate the classes of new incoming emails.
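The pipeline just described can be made concrete with a minimal sketch: extract content
and structural features from each message, train a statistical classifier on ham/spam/phishing
labels, and apply it to incoming mail. The feature set, the toy training emails and the use of
scikit-learn below are illustrative assumptions, not the AntiPhish implementation.

import re
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def extract_features(text):
    # Toy content/structural features; real filters use far richer ones
    # (topic models, link analysis, embedded-logo and hidden-salting cues).
    links = re.findall(r"https?://\S+", text)
    return {
        "num_links": len(links),
        "has_ip_link": int(any(re.match(r"https?://\d{1,3}(\.\d{1,3}){3}", u) for u in links)),
        "mentions_password": int("password" in text.lower()),
        "length": len(text),
    }

# Tiny invented training set labeled ham / spam / phishing (hypothetical data).
emails = [
    ("Meeting moved to 3pm, agenda attached.", "ham"),
    ("Cheap meds!!! Order now at http://pills.example", "spam"),
    ("Your account is locked. Reset your password at http://192.0.2.7/login", "phishing"),
    ("Lunch tomorrow?", "ham"),
]
vectorizer = DictVectorizer()
X = vectorizer.fit_transform([extract_features(text) for text, _ in emails])
y = [label for _, label in emails]
classifier = LogisticRegression(max_iter=1000).fit(X, y)

# Score a new incoming message from the email stream.
incoming = "Urgent: verify your password at http://203.0.113.9/secure"
print(classifier.predict(vectorizer.transform([extract_features(incoming)]))[0])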
AntiPhish is a specific targeted research project funded under
Framework Programme 6 by the European Union. It aims at
developing improved anti-phishing technologies that help to
protect and secure the global email communication infrastructure.
The project on the one hand developed the filter methodology in a
test laboratory setting, and on the other hand implemented this
technology in real-world settings, where it is used to filter all
email traffic online in real time. In this talk we summarize our
experience with phishing filtering on benchmark data and, in
addition, on different real-life email streams.
First we describe a number of novel features that are
particularly well-suited to identify phishing emails [1]. These
include statistical models for the low-dimensional descriptions of
email topics, sequential analysis of email text and external links,
the detection of embedded logos as well as indicators for hidden
salting [2]. Hidden salting is the intentional addition or distortion
of content not perceivable by the reader. For empirical evaluation
we have obtained a large realistic corpus of emails pre-labeled as
spam, phishing, and ham (legitimate). In experiments with
benchmark data our methods outperform other published
approaches for classifying phishing emails.

Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
CSI-KDD'09, June 28, 2009, Paris, France.
Copyright 2009 ACM 978-1-60558-669-4…$5.00.
The second part of the talk describes the application of these
approaches to real-life email streams. On the one hand we
investigate how we can identify new phishing emails arriving
from a honeypot system. This allows us to spot new types of
phishing emails. Subsequently, the characteristics of these new
phishing emails can be used to update client-based phishing
filters. A second experiment investigates the capabilities of the
AntiPhish system when monitoring emails in an ISP framework.
It turns out that active learning approaches are very efficient at
maintaining and improving filtering accuracy.
We discuss the implications of these results for the practical
application of this approach in the workflow of an email provider.
Finally we describe a strategy for how the filters may be updated
and adapted to new types of phishing.
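The active learning step can be illustrated with a small uncertainty-sampling sketch: the
current filter scores a stream of unlabeled messages and only the least-confident ones are
sent to a human labeler before retraining. The synthetic feature vectors and the scikit-learn
classifier below are assumptions for illustration, not the AntiPhish system.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for feature vectors of already-labeled training emails
# (hypothetical data; in practice these come from a feature extractor).
X_train = rng.normal(size=(100, 5))
y_train = rng.integers(0, 3, size=100)      # 0 = ham, 1 = spam, 2 = phishing
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Unlabeled incoming stream: pick the messages the filter is least sure of.
X_stream = rng.normal(size=(1000, 5))
confidence = clf.predict_proba(X_stream).max(axis=1)
to_label = np.argsort(confidence)[:10]      # 10 least-confident messages

# A human labels only these; appending them to the training set and
# refitting keeps the filter current at a fraction of the labeling cost.
print(to_label)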
ACKNOWLEDGMENTS
This talk is based upon work performed within the FP6-027600 project AntiPhish (http://www.antiphishresearch.org/).
The authors would like to thank the European Commission for
partially funding the AntiPhish project as well as all the AntiPhish
project partners for their interest, support, and collaboration in
this initiative.
REFERENCES
[1] André Bergholz, Jan De Beer, Sebastian Glahn, Marie-Francine
Moens, Gerhard Paass, Siehyun Strobel 2009.
New Filtering Approaches for Phishing Email. Accepted for
publication in the Journal of Computer Security (JCS).
[2] André Bergholz, Gerhard Paass, Frank Reichartz, Siehyun
Strobel, Marie-Francine Moens and Brian Witten 2008.
Detecting Known and New Salting Tricks in Unwanted
Emails. Fifth Conference on Email and Anti-Spam, CEAS
2008, Aug 21-22, 2008.
[3] Markus Jakobsson and Steven Myers 2007. Phishing and
Countermeasures - Understanding the Increasing Problem of
Electronic Identity Theft. Wiley, Hoboken, New Jersey.
BIOGRAPHY
André Bergholz is a senior research engineer in the text
mining group at Fraunhofer IAIS. He is interested in text and data
analysis and management. Prior to joining Fraunhofer, André
Bergholz worked as a research engineer at Xerox Research Centre
Europe in the domain of document management. André Bergholz
holds a PhD and a German diploma degree, both from Humboldt
University Berlin, where he specialized in the management of
semistructured data. He later did a postdoc at Stanford University.
!"#$%&%&'()&*+,#,&-./(0%11,&(2.+3"4(2"1,/(.&1(
51.$""6-(5/'"+%-7#(8"+(5&"#./9()&-+:6%"&(;,-,*-%"&(
!
!"#$%"&'(%)*'
!4#54*6&'(%)*'
+),-./0)*/'12'3*21.0-/41*'5-*-6)0)*/'
7-/41*-8'()*/.-8'9*4:).;4/<'
=>>&'?%1*6@-'A@B&'?%1*684&'C-4D-*&'=E>FG&'ABHB(B'
+),-./0)*/'12'3*21.0-/41*'5-*-6)0)*/'
7-/41*-8'()*/.-8'9*4:).;4/<'
=>>&'?%1*6@-'A@B&'?%1*684&'C-4D-*&'=E>FG&'ABHB(B'
IJGE>=>EEKLLB*L"B)@"B/D'
L<0KLLB*L"B)@"B/D'
'
'
!"#$%!&$!
+--!7(#!$)(:$*8!&)'#53&()!%,',6'&();!
"#$%&'&()$*!+&%%,)!-$#.(/!-(%,*!0+--1!2$3!4,,)!3566,3375**8!
$99*&,%! '(! $)(:$*8! &)'#53&()! %,',6'&();! <)6#,:,)'$*! +--!
0<+--1! 75#'2,#! &:9#(/,3! '2,! '#$&)&)=! '&:,! (7! +--;! +(>,/,#?!
4('2! +--! $)%! <+--! 3'&**! 2$/,! '2,! 9#(4*,:! (7! 2&=2! 7$*3,!
9(3&'&/,! #$',;! <)! '2&3! 9$9,#?! >,! 9#(9(3,! $)! @%$4((3'A<+--! '(!
6(:4&),!<+--!$)%!$%$4((3'!7(#!$)(:$*8!&)'#53&()!%,',6'&();!@3!
$%$4((3'! 7&#3'*8! 53,3! :$)8! <+--3! '(! 6(**,6'&/,*8! 6*$33&78!
3$:9*,3! '2,)! %,6&%,3! '2,! #,35*'3! (7! 3$:9*,3!! 6*$33&7&6$'&()3?! '2,!
@%$4((3'A<+--!6$)!&:9#(/,!'2,!$665#$',!#$',!(7!6*$33&7&6$'&()3;!
BC9,#&:,)'$*! #,35*'3! >&'2! D'&%,! %$'$3,'3! 32(>! '2$'! '2,! 9#(9(3,%!
:,'2(%! 6$)! 3&=)&7&6$)'*8! &:9#(/,! '2,! 7$*3,! 9(3&'&/,! #$',! 48! EFG!
>&'2(5'! %,6#,$3&)=! %,',6'&()! #$',;! H,3&%,3?! >,! $*3(! 9#(9(3,! $!
:,'2(%! '(! $%I53'! '2,! )(#:$*! 9#(7&*,! 7(#! $/(&%&)=! ,##(),(53!
%,',6'&()!6$53,%!48!62$)=,3!(7!)(#:$*!4,2$/&(#;!J,!9,#7(#:!>&'2!
,C9,#&:,)'3! >&'2! #,$*&3'&6! %$'$3,'3! ,C'#$6',%! 7#(:! '2,! 53,! (7!
9(95*$#! 4#(>3,#3;! K(:9$#,%! >&'2! '#$%&'&()$*! +--! :,'2(%?! (5#!
:,'2(%! 6$)! &:9#(/,! '2,! '#$&)&)=! '&:,! 48! LFG! '(! 45&*%! $! ),>!
)(#:$*!9#(7&*,;!!
J$##,)%,#! PEQ! 7&#3'! $99*&,%! +--! '(! :(%,*! 383',:! 6$**!
3,[5,)6,3!(7!)(#:$*!9#(6,33!7(#!$)(:$*8!&)'#53&()!%,',6'&()!$)%!
=('! $! #,:$#.$4*,! %,',6'&()! #,35*';! +(>,/,#?! +--! 2$3! 9#(4*,:3!
(7!>$3'&)=!$:(5)'!(7!'2,!'#$&)&)=!'&:,!$)%!2&=2!7$*3,!9(3&'&/,!#$',!
PVQ;!"2,!*$#=,!$:(5)'!(7!'#$&)&)=!'&:,!7(#!+--!6(:,3!7#(:!'2$'!
'2,! +--! $%(9'3! '2,! :$C&:5:! *&.,*&2((%! 9#&)6&9*,! (7! H$5:A
J,*62! $*=(#&'2:! '(! '#$&)! $! :(%,*;! "2,! '#$&)&)=! 6(:9*,C&'8! &3!
\0TU"1?! >2,#,! T! &3! '2,! )5:4,#! (7! 2&%%,)! 3'$',3! $)%! "! &3! '2,!
(43,#/$4*,!3,[5,)6,!*,)='2;!!
&'()*+,-)./'01/#234)5(/6).5,-7(+,.!
M;N;O!P87),'(-0*/#9.():.QR!D,65#&'8!$)%!S#(',6'&()!
;)0),'</$),:.!
D,65#&'8!
=)9>+,1.!
@)(:$*8!<)'#53&()!M,',6'&()?!T(#:$*!S#(7&*,?!<+--?!@%$4((3'!
?@! AB$%86C&$A8B/
@66(#%&)=! '(! '2,! UFFV! #,9(#'! 7#(:! W$39,#3.8! PXQ?! '2,! =*(4$*!
:$*>$#,3! $)%! &)'#53&()3! =#(>! 32$#9*8Y! 2,)6,! &'! &3! &:9(#'$)'! '(!
%,/,*(9! ,77,6'&/,! &)'#53&()! %,',6'&()! 383',:3! 0<MD31! '(! 3'(9! '2,!
%,3'#56'&()3! 6$53,%! 48! '2,3,! :$*>$#,3;! <MD3! %,',#:&),! >2,'2,#!
'2,! 65##,)'! 383',:! &3! &)65##,%! &)'#53&()!48! $)$*8Z&)=! 383',:! 6$**!
3,[5,)6,3?! 383',:! *(=3! (#! ),'>(#.! 9$6.,'3;! @**! (7! '2,3,! %$'$!
&)6*5%,! '2,! '&:,! 3,#&,3! ,/,)'3;! +--! 2$3! '2,! =#,$'! 6$9$4&*&'8! '(!
%,36#&4,!'2,!'&:,!3,#&,3!%$'$?!3(!:$)8!#,3,$#62,3!PUQAPOQ!$99*8!'2,!
!
S,#:&33&()!'(!:$.,!%&=&'$*!(#!2$#%!6(9&,3!(7!$**!(#!9$#'!(7!'2&3!>(#.!7(#!
9,#3()$*!(#!6*$33#((:!53,!&3!=#$)',%!>&'2(5'!7,,!9#(/&%,%!'2$'!6(9&,3!$#,!
)('! :$%,! (#! %&3'#&45',%! 7(#! 9#(7&'! (#! 6(::,#6&$*! $%/$)'$=,! $)%! '2$'!
6(9&,3! 4,$#! '2&3! )('&6,! $)%! '2,! 75**! 6&'$'&()! ()! '2,! 7&#3'! 9$=,;! "(! 6(98!
('2,#>&3,?! (#! #,954*&32?! '(! 9(3'! ()! 3,#/,#3! (#! '(! #,%&3'#&45',! '(! *&3'3?!
#,[5&#,3!9#&(#!39,6&7&6!9,#:&33&()!$)%a(#!$!7,,;!
"#$%&''()*?!b5),!UV?!UFFL?!S$#&3?!]#$)6,;!!
K(98#&=2'!UFFL!@K-!LEVAXAOF``VAOOLAN;;;c`;FF;!
!
+(>,/,#?! >2,)! '2,! (43,#/$4*,! 3,[5,)6,! *,)='2! 4,6(:,3! /,#8!
*()="# $%&# '# ()*+&# ,-# $%&# .)/01)23# 425/&3+2&# 56# 7)+8AJ,*62!
$*=(#&'2:! 6$)! 4,! $99#(C&:$',%! >&'2! *&''*,! ,77,6'3! '(! :(%,*!
6(##,6'),33;! J&'2! '2&3! (43,#/$'&()?! ]*(#,Z! PLQ! 9#(9(3,%! $)!
<)6#,:,)'$*!+--!0<+--1!'(!#,%56,!'2,!'#$&)&)=!(7!'!6(:9*,C&'8!
'(!\0TU1!7(#!32(#',)&)=!'2,!'#$&)&)=!'&:,;!^)7(#'5)$',*8?!$*'2(5=2!
'2,! <+--! #,%56,3! '2,! '#$&)&)=! '&:,?! &'! 3'&**! 2$3! '2,! &)2,#,)'!
9#(4*,:!(7!2&=2!7$*3,!9(3&'&/,!#$',;!\)!'2,!('2,#!2$)%?! $%$4((3'?!
32(#'! 7(#! $%$9'&/,! 4((3'&)=?! &3! $)! ,)3,:4*,! :$62&),! *,$#)&)=!
$*=(#&'2:!PXFQ!PXXQ!7(#!95#35&)=!'2,!:&)&:5:!6*$33&7&6$'&()!,##(#;!
@%$4((3'!7(653,3!:(#,!()!2$#%!3$:9*,3!'(!#,&)7(#6,!*,$#)&)=!$)%!
7&)$**8!6(:4&),3!:$)8!>,$.!6*$33&7&,#3!'(!$!3'#()=!6*$33&7&,#;!D(!&'!
6$)! %,6#,$3,! '2,! 7$*3,! 9(3&'&/,! #$',! (7! <+--! 48! &)',=#$'&)=! '2,!
$%$4((3'!'(!'2,!<+--;!J,!9#(9(3,!'(!'$.,!'2,!$%/$)'$=,!(7!'2&3!
&)',=#$'&()! &)! $)(:$*8! <MD! $)%! )$:,! '2&3! &)',=#$'&()! $3!
@%$4((3'A<+--;!
@99*&6$'&()! 4,2$/&(#! 62$)=,3! $#,! 6(::();! 0,;=;! 9#(=#$:!
59%$',Y!'2,!62$)=,3!(7!53,#!4,2$/&(#1;!"2,#,7(#,?!&)!$)(:$*8!<MD?!
&$!9# ,8452$)-$# $5# )3:+9$# -528)*# 4256,*&# 42584$*;<# =&# 9#(9(3,! '(!
)354$# >?)!9# 5-A*&),! $%$4((3'! PXUQ! PX_Q! &)! (5#! :,'2(%! $)%! 53,!
,C9,#&:,)'3!'(!/$*&%$',!'2$'!'2&3!$%(9'&()!6$)!&)%,,%!&:9#(/,!'2,!
%,',6'&()!#$',!$3!>,**!$3!%,6#,$3,!'2,!7$*3,!9(3&'&/,!#$',;!
"2,!#,:$&)%,#!(7!'2&3!9$9,#!&3!(#=$)&Z,%!$3!7(**(>3;!<)!D,6'&()!
U?!>,!=&/,!$!4#&,7!(/,#/&,>!(7!'2,!#,*$',%!>(#.;!<)!D,6'&()!_?!>,!
9#(9(3,! (5#! :,'2(%! $)%! ,C9*$&)! 2(>! '(! 6(:4&),! <+--! $)%!
$%$4((3'!7(#!$)(:$*8!&)'#53&()!%,',6'&();!J,!$*3(!9#(9(3,!$)!()A
*&),! )(#:$*! 9#(7&*,! $%I53':,)'! :,'2(%! &)! '2&3! 3,6'&();!
BC9,#&:,)'$*! #,35*'3! $#,! 32(>)! &)! D,6'&()! N;! ]&)$**8?! D,6'&()! `!
=&/,3!(5#!6()6*53&()3!$)%!'2,!%&#,6'&()!(7!75'5#,!>(#.;!
D@! %EF!$E6/G8%=/
D@?! A05,):)0('</HII/JAHIIK/
+--! PXNQ! PX`Q! &3! $! 7&)&',! 3'$',! $5'(:$'&()! >&'2! %(54*8!
3'(62$3'&6! 9#(6,33;! <'! &3! $! 9#(4$4&*&'8! 75)6'&()!(7! -$#.(/! 62$&)3;!
J&'2! '2,! =#,$'! %,36#&9'&/,! 6$9$4&*&'8! 7(#! '&:,! 3,#&,3! %$'$?! &'! &3!
9(95*$#*8! $99*&,%! &)! $6(53'&6! #,6(=)&'&()?! &:$=,! #,6(=)&'&()?!
3&=)$*!9#(6,33&)=!(#!4&(*(=&6$*!&)7(#:$'&()?!,'6;!
@!+--!&)6*5%,3!$!2&%%,)!3'$',!*$8,#!$)%!$)!(43,#/$4*,!(5'95'!
*$8,#;! +&%%,)! 3'$',! *$8,#! &3! $! 3'$4*,! -$#.(/! 62$&)3;! <'3! 3'$',!
9#(4$4&*&'8! $)%! 3'$',! '#$)3&'&()! 9#(4$4&*&'8! $#,! %,6&%,%! 7#(:! '2,!
&)&'&$*!3'$',!9#5.).,*,$;#(&/$52#@#)-3#'2,!3'$',!'#$)3&'&()!9#(4$4&*&'8!
:$'#&C! @;! \43,#/$4*,! (5'95'! *$8,#! &3! %,6&%,%! 7#(:! '2,! (43,#/,%!
38:4(*3!9#(4$4&*&'8!:$'#&C!H!>2&62!&3!%,#&/,%!7#(:!'2,!(43,#/,%!
38:4(*3! (7! ,$62! 2&%%,)! 3'$',;! d,'! T?! -! #,9#,3,)'! '2,! )5:4,#! (7!
2&%%,)!3'$',3!$)%!'2,!)5:4,#!(7!(43,#/,%!38:4(*3!#,39,6'&/,*8?!$!
+--! :(%,*! &3! 535$**8! %,36#&4,%! $3! ABC@"# D"# 7E?! >2,#,! '2,!
8&)-,-F9#56##@"#D!$)%!H!$#,!$3!7(**(>3R!!
@ R! &)&'&$*! 3'$',! 9#(4$4&*&'8! &)6*5%&)=! !"
>2,#,!X!&!T;!
H,3&%,3?! $%$4((3'! 4,*()=3! '(! 4$'62! '#$&)&)=;! "2$'! :,$)3! &7! $!
),>!%$'5:!,)',#3?!'2,!(#&=&)$*!'#$&),%!:(%,*!32(5*%!4,!%&36$#%,%!
$)%! '2,! '#$&)&)=! 4,! #,3'$#',%! $=$&);! "2,! 4$'62! '#$&)&)=! >$3',3!
$:(5)'!(7!'&:,?!,39,6&$**8! >2,)!%$'$!62$)=,%!7#,[5,)'*8!(#!%$'$!
6()'&)5(53*8! ,C9$)%3;! "(! 3(*/,! '2&3! 9#(4*,:?! \Z$! 9#(9(3,%! $)!
()A*&),! $%$4((3'! :,'2(%! PXUQ! PX_Q;! <)! '2&3! $99#($62?! ,$62! ),>!
3$:9*,! &3! 3,)'! '(! 3543,[5,)'! >,$.! 6*$33&7&,#3?! $)%! '2,)! ,$62!
6*$33&7&,#! $%I53'3! '2,! >,&=2'! (7! '2,! ),>! 3$:9*,! $)%! &'3! (>)!
6()7&%,)6,! #$',! $66(#%&)=! '(! '2,! 6*$33&7&6$'&()! #,35*'! (7! '2,! ),>!
3$:9*,;! H('2! '2,! ),>! 3$:9*,! $)%! '2,! >,&=2'! (7! '2,! $%I53',%!
!,*,:,)'3?!!!!!!!!!!!!!!!!
3$:9*,!$#,!3,)'!'(!'2,!),C'!6*$33&7&,#;!
@R! 3'$',! '#$)3&'&()! 9#(4$4&*&'8! :$'#&C! &)6*5%&)=! !"# !,*,:,)'3?!
>2,#,!X!&?!I!T; !
!
HR! (43,#/,%! 38:4(*3! 9#(4$4&*&'8! :$'#&C! &)6*5%&)=! 4I0.1!
,*,:,)'3?!>2,#,!X!!I!T!$)%!X!!.!-;!
e&/,)! $)! (43,#/,%! 3,G+&-/&# 5# )-3# )# 853&*# A?! >,! 6$)! 45&*%! $!
9&G+&-/&# 2&*)$,5-# 853&*# .&$1&&-# 5# )-3# A<# H%&# H$5:AJ,*62!
$*=(#&'2:! PX`Q?! >2&62! &3! 4$3,%! ()! '2,! :$C&:5:! *&.,*&2((%!
9#&)6&9*,?! 6()'&)5(53*8! *,$#)3! '(!$%$9'! :(%,*!9$#$:,',#3! (7! @"# D!
$)%! H! $5# 8)0&# IC5# J# AE# 8)K,8+8<# H%,9# 425/&99# ,9# /)**&3# $%&#
'#$&)&)=!(7!+--!:(%,*;!
+(>,/,#?!'2,!6(:9*,C&'8!(7!H$5:AJ,*62!$*=(#&'2:!&3!\0TU"1;!
"2,! *()=,#! '2,! (43,#/$4*,! 3,[5,)6,! *,)='2! 4,6(:,3?! '2,! :(#,!
'&:,! >,! ),,%! '(! '#$&)! $! +--! :(%,*;! +,)6,?! ]*(#,Z! 9#(9(3,%!
<+--! PLQ! 48! $%(9'&)=! $)! &)6#,:,)'$*! H$5:AJ,*62! $*=(#&'2:;!
f,6$**! '2$'! &)! '2,! 4$6.>$#%! 9$#'! (7! H$5:AJ,*62! $*=(#&'2:?! '2,!
/$#&$4*, $#$ %"& % '%($&' ( ($&) ( ) ( (* $*$+%$& % ," & ?! >2,#,! +%$& !&3!
'2,!383',:!2&%%,)!3'$',$," !$'!'!'2!'&:,3',9;!+,!#,9*$6,%!'2,!(#&=&)$*!
'# ()*+&# 56# LG<CME# .;# $%&# )4425K,8)$&3# '# ()*+&# ,-# LG<CNE"! >2,#,!
X!&?!I!T!$)%!X!'!"AX;!"2,!&)&'&$*!'" %&&gX;!
1
#$ %"& % - ."/ 0/ %($&' &#$&' %/&!
0X1!
/ %'
1
#$ %"& % - ."/ 0/ %($&' &$
:$)8!6*$33&7&,#3?!3(!&'!:$.,3!6*$33&7&,%!#,35*'3!:(#,!$665#$',!'2$)!
5)&[5,!6*$33&7&,#;!+(>,/,#?!$%$4((3'!&3!3,)3&'&/,!'(!)(&38!%$'$?!3(!
&'!&)6*&),3!'(!&)'#(%56,!(/,#7&''&)=!9#(4*,:;!-$)8!#,3,$#62,3!PXOQ!
PXEQ!PXVQ!2$/,!4,,)!%,/(',%!'(!3(*/,!'2&3!9#(4*,:;!
0U1!
/ %'
"2,!'#$&)&)=!6(:9*,C&'8!(7!'!&)!4$6.>$#%!9#(6,%5#,!(7!H$5:A
J,*62! $*=(#&'2:! &3! '253! #,%56,%! '(! \0TU1! 48! '2,! ),>!
)4425K,8)$&3#&9$,8)$&3#'#()*+&<!!"2,!)$:,!(7!<+--!6(:,3!7#(:!
'2$'!,$62!'&:,!$!),>!(43,#/$4*,!3,[5,)6,!&3!3,)'!'(!'2,!:(%,*?!'2,!
+--!9$#$:,',#3!$#,!#,A,3'&:$',%;!D(!'2,!*,$#)&)=!7(#!'2,!:(%,*!
&3!&)6#,:,)'$*;!
D@D! !1'3++.(/
@%$4((3'?! 9#(9(3,%! 48! ]#,5)%! $)%! K2$9&#,! PXFQ! PXXQ?! &3!
9(95*$#*8!$99*&,%!&)!&:$=,!#,6(=)&'&();!@%$4((3'!&3!$!359,#/&3,%!
*,$#)&)=! $*=(#&'2:;! <'! 6(:95',3! 6()7&%,)6,! #$',3! (7! >,$.!
6*$33&7&,#3!$)%!$%I53'3!'2,!>,&=2'!(7!,$62!3$:9*,?!3(!'2$'!'2,!),C'!
'#$&)&)=! 7(653,3! ()! '2,! 2$#%! 3$:9*,3;! <'! 7&)$**8! 6(:4&),3! ,$62!
>,$.! 6*$33&7&,#! $)%! '2,! 6()7&%,)6,! #$',! (7! ,$62! >,$.! 6*$33&7&,#;!
"2,!7,$'5#,!(7!$%$4((3'!&3!'2$'!&'3!6*$33&7&,%!#,35*'3!$#,!%,6&%,%!48!
D@L! !1'3++.(/!HII/
]((! PXLQ! 6(:4&),%! +--! $)%! $%$4((3'! '(! 4,! $99*&,%! '(!
$5'(:$'&6! 39,,62! #,6(=)&'&();! D9,,62! 3$:9*,3! $#,! %&/&%,%! &)'(!
:$)8!6*$33,3?!$)%!$%$4((3'A+--!6*$33&7&,#3!$#,!'#$&),%!'(!6(/,#!
%&77,#,)'! =#(593! (7! '#$&)&)=! 3$:9*,3;! @3! $%$4((3'! '#$&)3! :$)8!
>,$.! 6*$33&7&,#3! 48! $%I53'&)=! '2,! >,&=2'! %&3'#&45'&()! (7! 3$:9*,3;!
]((! #,9*$6,%! '#$%&'&()$*! H$5:AJ,*62! $*=(#&'2:! >&'2! H&$3,%!
H$5:AJ,*62! $*=(#&'2:! '(! 6(:95',! 6*$33&7&,#3;! H8! $%I53'&)=! '2,!
>,&=2'3! (7! 3$:9*,3?! 2&3! :,'2(%! 6$)! =,'! +--3! >&'2! B[;0_1! $)%!
B[;0N1;!!
.2"/ %
3+5%'
45 *5 :' $5
$5
5 %/&
3
6 %"&."/ 0/ 78$&'
9#$&'
'5 $%' $
!
4
5 :'
5 %/&
3+5%' 5 3*$%'
6$$5 %"&#$&'
'5
3+5%'
02/ %(; & %
45
5 %/&#5 %/&
3
$
5 :' 6$
'5 $%'<*
5
4
3+5%' 5
'5
,,$,8$ %8;
*5 :' $5
3$%'
6$ %/&#$5 %/&
$
0_1!
0N1!
<)! B[;0_1! $)%! B[;0N1?!45 !3'$)%3! 7(#! '2,! >,&=2'! (7! ,$62! 3$:9*,;!
"2,! $5'2(#3! &)! PUFQ! 9#(/,%! '2$'! '2,! >,&=2'3! (7! '#$&)&)=! 3$:9*,3!
35-!$#(,5*)$&#$%&#/5-(&2F&-/&#4254&2$;#56#$2,!:$C&:5:!*&.,*&2((%!
'#$&)&)=;!@3!'2&3!$99#($62!2$3!'(!'#$&)!:$)8!+--3!'(!=,'!4,'',#!
#,35*'3?!&'!),,%3!:562!:(#,!'&:,!'(!'#$&)!'2,!:(%,*!'2$)!'#$%&'&()$*!
+--;!"2&3!32(#'6(:&)=!:('&/$',3!53!'(!53,!<+--!PLQ!'(!#,9*$6,!
+--!7(#!3$/&)=!'#$&)&)=!'&:,;!H5'!(5#!:,'2(%!&3!$!*&''*,!%&77,#,)'!
7#(:!]((!3;!]((!2$)%*,3!'2,!9#(4*,:!(7! :5*'&A6*$33&7&6$'&()?!$)%!
2&3! '#$&)&)=! %$'$! &)6*5%,3! :5*'&A6*$33! *$4,*3;! \5#! :,'2(%! %,$*3!
>&'2!'2,!4&)$#8!6*$33&7&6$'&()!9#(4*,:?!&)!>2&62!'2,!'#$&)&)=!%$'$!
()*8!2$3!)(#:$*!6*$33!*$4,*;!+(>!'(!&)',=#$',!<+--!$)%!$%$4((3'!
>&**!4,!&)'#(%56,%!&)!),C'!3,6'&();!
L@! !6!"88#$MAHII/
L@?! "'.-5/!1'3++.(MAHII/
H,6$53,! $%$4((3'! $*=(#&'2:! ),,%3! '(! 6(**,6'! :$)8! >,$.!
6*$33&7&,#3!'(!6*$33&78!3$:9*,3?!'(!&:9#(/,!'2,!'#$&)&)=!6(3'!(7!'2,!
+--?! >,! #,9*$6,! '2,! +--! >&'2! '2,! <+--;! D(! >,! %,/,*(9! $!
4&$3,%! &)6#,:,)'$*! H$5:AJ,*62! $*=(#&'2:! 48! 6(:4&)&)=! H&$3,%!
H$5:AJ,*62! $*=(#&'2:! PXLQ! >&'2! <)6#,:,)'$*! H$5:AJ,*62!
$*=(#&'2:!PLQ;!H5'!&'!&3!)(',%!'2$'!<+--!&3!$!9#(4$4&*&'8!75)6'&();!
"2$'!:,$)3!>2,)!$!3$:9*,!&3!95'!&)'(!'2,!<+--?!&'3!(5'95'!&3!'2,!
9#(4$4&*&'8!(7!'2,!3$:9*,;!H,6$53,!>,!()*8!2$/,!)(#:$*!4,2$/&(#!
3$:9*,3! '(! :(%,*! $! )(#:$*! 9#(7&*,! &)! <MD! 9#(4*,:?! >,! 2$/,! '(!
'#$)37(#:! '2,! <+--! '(! 2$)%*,! 4&)$#8! 6*$33&7&6$'&()! 9#(4*,:;! <7!
'2&3! '#$)37(#:$'&()! 3566,33,3?! >2,)! >,! &)95'! $! 3$:9*,! '(! '2,!
$%$4((3'?! >,! =,'! $)! (5'95'! >2&62! &3! $! :$'62! (7! )(#:$*! 9#(7&*,!
4,2$/&(#!6*$33!(#!$!:&3:$'62!6*$33;!
D(! '2,#,! ,C&3'3! $! 9#(4*,:! ()! 2(>! '(! '#$)37(#:! '2,! <+--3! '(!
%,$*! >&'2! ,$62! >,$.! 6*$33&7&,#! (7! $%$4((3'h! +,#,! >,! 3,'! $!
'2#,32(*%! /$*5,! =' !'(! 3(*/,! '2&3! 9#(4*,:;! <7! '2,! 3$:9*,3!
9#(4$4&*&'8!(7!'2,!<+--!&3!*,33!'2$)!'2#,32(*%$=' ?!&'!&3!%,',#:&),%!
$3!$!:&3:$'62!6*$33Y!('2,#>&3,?!&'!&3!$!:$'62!6*$33;!0D,,!B[;0`1;1!
'( "A$5B'%?" $*$C$ & D E' $ F
>$ %?" & % @
!
:'( "A$5B'%?" $*$C$ & - E' $
0`1!
!
"2,! 9#(6,%5#,3! (7! 6(:4&)&)=! <+--3! $)%! $%$4((3'! '(! '#$&)! $!
)(#:$*!9#(7&*,!$#,!%,36#&4,%!$3!7(**(>3R!
!
X;! D,'!)(#:$*!3$:9*,!39$6,!. % %/' ( ) ( /" ( ) ( /0 &$$)%!'2,!&)&'&$*!
>,&=2'!1' %"& % '20!(7!,$62!3$:9*,;!
'(
,"QB%?& % @
:'(
^3,! '2,! 4&$3,%! &)6#,:,)'$*! H$5:AJ,*62! $*=(#&'2:!
'(! '#$&)! '2,! )(#:$*! 3$:9*,3! 0D,,! B[;0U1?! B[;0_1! $)%!
B[;0N11!'(!=,'!G3$ 45!<+--;!
4;!
^3,! B[;0`1! '(! 45&*%! $! >,$.! 6*$33&7&,#!63 !'(! 6*$33&78!
3$:9*,3!'(!ijX?AXk!$66(#%&)=!'(!'2#,32(*%!=' ,!
6;!
^3,!B[;0O1!'(!6(:95',!'2,!,##(#!6*$33&7&6$'&()!#$',!(7!
'2,!)(#:$*!3$:9*,3!6*$33&7&,%!48!'2,!>,$.!6*$33&7&,#;!
<7! ,##(#! #$',! H3 7 8,9$(#!H3 % 8( !4#,$.! *((9;! B[;0E1!
6(:95',3!'2,!6()7&%,)6,!#$',!(7!'2,!>,$.!6*$33&7&,#;!
B
I$ %
4$ %"&$$!
"%'
$> $ %? " &%:'
'
' : K$
6$ % 5B J
L$
)
K$
%;!
0O1!
\5#! #,3,$#62! $%(9'3! \Z$!3! ()A*&),! $%$4((3'! '(! $%I53'! )(#:$*!
9#(7&*,;!+(>,/,#?!&)!\Z$!3!9#(6,%5#,3!(7!()A*&),!$%$4((3'?!$!),>!
3$:9*,!>&**!4,!'#$&),%!:(#,!'2$)!(),!'&:,3!&)!,$62!>,$.!6*$33&7&,#!
'(! :$.,! '2&3! 3$:9*,! 4,! 6(##,6'*8! 6*$33&7&,%?! 3(! &'! :$8! 2$/,! '2,!
9#(4*,:! (7! :(%,*! (/,#A'#$&)&)=;! "(! $/(&%! '2&3! 9#(4*,:?! &)! (5#!
#,3,$#62?!,$62!),>!3$:9*,!&3!()*8!'#$&),%!()6,;!
d,'!;3< !#,9#,3,)'! '2,! )5:4,#! (7! 6(##,6'! 6*$33&7&,%! 3$:9*,3! $)%!
;3= $4,!'2,!)5:4,#!(7!,##(#!6*$33&7&,%!3$:9*,3;!"2,!/$*5,!(7!'2,3,!
'>(! )5:4,#3! >&**! 4,! $77,6',%! 48! '2,! ,##(#! 6*$33&7&6$'&()! #$',! (7!
,$62!>,$.!6*$33&7&,#!$)%!'2,!'('$*!)5:4,#!(7!3$:9*,3?!>,!9#(9(3,!
B[;0XX1! '(! #,9#,3,)'! '2,&#! #,*$'&()3;! "2,! &)&'&$*! >,&=2'! (7! ),>!
3$:9*,!&3!X?!3(!'2$'!&'!6$)!$%I53'!)(#:$*!9#(7&*,;!
1$U % %' : I$ & > *>M$1V;0MW$8A$*8$.5$X.;N5M,
!
1$Y % I$ > *>M$1V;0MW$8A$*8$.5$X.;N5M,
!
"2,! 9#(6,%5#,3! (7! ()A*&),! )(#:$*! 9#(7&*,! $%I53':,)'! $#,!
%,36#&4,%!$3!7(**(>3R!
U;!
D,'!&)&'&$*!>,&=2'!=" !(7!,$62!),>!3$:9*,!/" !$3!X!!
_;!
\)A*&),!59%$',?!!](#!,$62!>,$.!6*$33&7&,#!63 ?!!%(!*((9!
0E1!
0V1!
O$ % - 4$ %"& M?N7:6$ >$ %?" &9$
0L1!
*
P%?& % ,"QB R- 6$ >$ %?&S!
0XF1!
$%'
!!!!!!!!>2,#,!3&=)!#,9#,3,)'3!3&=)5:!75)6'&();! "2,!3&=)5:!75)6'&()!
(7!$!#,$*!)5:4,#!C!&3!%,7&),%!$3!7(**(>3R!
<7!63 %/" & % '?!'2,)!6$*65*$',!B[;0XU1!
1$U % 1$U & Y"
1$Y
I$ % U
1$ & 1$Y !
'
Y" % Y" J
L
)%' : I$ &
4;!
N;!
0XU1!
B*3,?!6$*65*$',!B[;0X_1!
1$Y % 1$Y & Y"
1$Y
I$ % U
1$ & 1$Y !
'
Y" % Y" J L
)I$
"%'
J&'2! '2,! ,)%! (7! '2,! *((9?! >,! 6(:4&),! "! <+--3! '(! =,'! $!
3'#()=!6*$33&7&,#!+?!3,,!B[;0XF1;!
=
^3,! B[;0XX1! '(! 6(:95',!;3 !$)%!;3 $ 7#(:! '2,! )5:4,#3! (7!
'('$*!3$:9*,3;!
B
_;!
<
X;!
$;!
4$ %"&M?N7:6$ >$ %?" &9
!
O$
0XX1!
<)! =,),#$*?! '2,! $%I53':,)'! (7! )(#:$*! 9#(7&*,! &3! 7&#3'*8! 3,)%&)=!
'2,! ),>! 3$:9*,! '(! ,$62! >,$.! 6*$33&7&,#! $)%! ,$62! >,$.! 6*$33&7&,#!
#,9,$',%*8! $%I53'3! '2,! >,&=2'! (7! '2,! 3$:9*,! $)%! '2,! 6()7&%,)6,!
#$',!(7!&'3!(>)?!'2,)!'2,!3$:9*,!$)%!&'3!$%I53',%!>,&=2'!$#,!3,)'!'(!
),C'!6*$33&7&,#;!
^3,! B[;0V1! '(! $%I53'! '2,! >,&=2'3! (7! '2,! 3$:9*,3!
>2&62!2$/,!4,,)!6*$33&7&,%?!:3 !&3!$!)(#:$*&Z,%!7$6'(#?!
3,,!B[;0L1;!
4$&' %"& %
!
L@D! !1'3++.(MAHII/>-(N/80M<-0)/!142.(-0*/
&'7'3-<-(9/
U;! O52C$BM"N"PHE?!*((9!
$;!
"A$? 7 8 F
!
"A$? T 8
0X_1!
@'! '2,! ,)%! (7! '2,! *((9?! '2,! )(#:$*! 9#(7&*,! &3! 6(:9*,',*8!
$%I53',%;!^3,!B[;0E1!'(!6(:95',!'2,!6()7&%,)6,!#$',!(7!!63 !?!
$)%!'2,!7&)$*!6*$33&7&,#!+!48!B[;0XF1;!
O@! EPQE%AIEB$#/
@)(:$*8! &)'#53&()! %,',6'&()! &3! %&/&%,%! &)'(! '>(! 3'$=,3R! '2,!
)(#:$*! 4,2$/&(#! '#$&)&)=! 3'$=,! $)%! '2,! $)(:$*8! &)'#53&()!
%,',6'&()! 3'$=,;! "2,! 7(#:,#! '#$&)3! '2,! )(#:$*! 9#(7&*,! 48! 53&)=!
)(#:$*!'#$&)&)=!%$'$!$)%!'2,!*$'',#!%,',#:&),3!>2,'2,#!'2,!9#(6,33!
&3! $)(:$*8;! J2,)! '2,! )(#:$*! 4,2$/&(#! &3! 62$)=,%?! '2,! )(#:$*!
9#(7&*,!),,%3!'(!4,!$%I53',%!'(!#,7*,6'!'2,3,!62$)=,3;!
H,7(#,! >,!9,#7(#:!(5#!,C9,#&:,)'3?!>,!2$/,!'(!%,',#:&),!'2,!
/$*5,!(7!3(:,!7#,,!9$#$:,',#3!(7!@%$4((3'A<+--R!'2,!)5:4,#!(7!
2&%%,)! 3'$',3! (7! <+--?!'2,! '#$&)&)=! '&:,3! (7! 4&$3,%! &)6#,:,)'$*!
H$5:AJ,*62!$*=(#&'2:?!'2,!3,''&)=!(7!'2#,32(*%!(7!B[;0`1?!$)%!'2,!
)5:4,#!(7!*((93!&)!$%$4((3'!$*=(#&'2:;!D(:,!(7!'2,3,!9$#$:,',#!
/$*5,3!%,9,)%!()!'2,!6$3,!7(#!>2&62!>,!),,%!'(!,/$*5$',;!"$.,!'2,!
^T-!%$'$3,'3!PUXQ?!>2&62!6()'$&)!$!*$#=,!/(*5:,!(7!383',:!6$**3!
&)!)(#:$*!$)%!$4)(#:$*!9#(6,33,3!$)%!$#,!7$:(53!4,)62:$#.!7(#!
&)'#53&()!%,',6'&()?!$3!$)!,C$:9*,;!"2,!)5:4,#!(7!2&%%,)!3'$',3!&3!
'2,! )5:4,#! (7! 5)&[5,! 383',:! 6$**3! 53,%! 48! '2,! 9#(6,33! PEQ! PUUQ!
PU_Q;! "2,! )5:4,#! (7! (43,#/,%! 38:4(*3! &3! '2,! )5:4,#! (7! 383',:!
6$**3!48!(9,#$'&)=!383',:;!J&'2!#,39,6'!'(!'2,!('2,#!3,''&)=3!(7!'2,!
9$#$:,',#3!/$*5,3?!'2,!'#$&)&)=!'&:,3!(7!4&$3,%!&)6#,:,)'$*!H$5:A
J,*62! $*=(#&'2:! &3! 3,'! '(! `;! "2,! '2#,32(*%! $=' $ (7! B[;0`1! &3!
%,',#:&),%!48!3(#'&)=!9#(4$4&*&'&,3!(7!'#$&)&)=!3$:9*,3!6(5)',%!48!
<+--! $)%! 3,*,6'&)=! '2,! 3$:9*,! 9#(4$4&*&'8! 7#(:! `G! '(! X`G!
'#$&)&)=!3$:9*,3;!J,!3,'!'2,!)5:4,#!(7!*((93!&)!$%$4((3'!'(!4,!`!
48! '2,! #,$3()! (7! 3$/&)=! 6(:95'$'&()! '&:,! $3! >,**! $3! '2,!
(43,#/$'&()! '2$'! '2,! ,##(#! #$',! (7! 6*$33&7&6$'&()! &)! ,$62! *((9!
%,6#,$3,3;!!
J,! 7&#3'! 53,! D'&%,! $)%! D,)%:$&*! 383',:! 6$**! %$'$3,'3! 7#(:!
^T-! PUXQ! $3! (5#! ,C9,#&:,)'! %$'$;! "2,3,! %$'$3,'3! &)6*5%,! :$)8!
)(#:$*! $)%! &)'#53&()! '#$6,3;! B$62! '#$6,! &3! $! *&3'! (7! 383',:! 6$**3!
&335,%!48!$!3&)=*,!9#(6,33!7#(:!&'3!4,=&))&)=!(7!,C,65'&()!'(!'2,!
,)%!(7!,C,65'&();!d&.,!'2,!%$'$!9#,9#(6,33&)=!&)!PU_Q!PUNQ?!>,!53,!
$!3*&%&)=!>&)%(>!'(!6()3'#56'!383',:!6$**!3,[5,)6,3;!J,!3,'!'>(!
'2#,32(*%3! =' !$)%! $=) ;! =' !%,',#:&),3! >2,'2,#! '2,! 383',:! 6$**!
3,[5,)6,3! :$'62! '2,! )(#:$*! 9#(7&*,;!=) !%,',#:&),3! >2,'2,#! ,$62!
9#(6,33! &3! $)(:$*8;! <7! :&3:$'62! #$',! 0(#! 6$**,%! $)(:$*8! #$',1! (7!
383',:! 6$**! 3,[5,)6,3! (7! '2,! 9#(6,33! ,C6,,%3!=) ?! '2&3! 9#(6,33! &3!
3$&%!'(!4,!$)!$)(:$*8!9#(6,33;!<)!$%%&'&()!'(!'2,!^T-!%$'$3,'3?!'(!
/$*&%$',!'2,!$99*&6$'&()!(7!(5#!:,'2(%!&)!$!:(#,!#,$*&3'&6!6$3,?!>,!
$99*8!(5#!:,'2(%!'(!-&6#(3(7'! J&)%(>3!$99*&6$'&()3!'(!/$*&%$',!
'2,!$%I53'&)=!)(#:$*!9#(7&*,!:,'2(%;!
%,',6'&()! #$',;! J,! 6$)! 3,,! '2$'! (5#! :,'2(%! 2$3! '2,! *(>,#! 7$*3,!
9(3&'&/,! #$',! '2$)! '2,! ('2,#3! 5)%,#! '2,! 3$:,! %,',6'&()! #$',;! ](#!
,C$:9*,?! &)! D'&%,! >&)%(>! 3&Z,! O?! (5#! :,'2(%! &:9#(/,3! '2,! 7$*3,!
9(3&'&/,! #$',! 48! OVG! $'! '2,! 3$:,! %,',6'&()! #$',! (7! LFG;! ]&=;U!
32(>3!'2,!3&:&*$#!#,35*'3;!](#!,C$:9*,?!&)!D'&%,!>&)%(>!3&Z,!XX?!
(5#! :,'2(%! &:9#(/,3! '2,! 7$*3,! 9(3&'&/,! #$',! 48! EFG! 6(:9$#,%!
>&'2!'2,!('2,#3!$'!'2,!3$:,!%,',6'&()!#$',!(7!LFG;!
!
!
T-*2,)/?@/%8&/52,U)./5+:7',)1/>-(N/G'0*/)(/'<@/'01/////
T<+,)V/)(/'<@/201),/#(-1)/>-01+>/.-V)/W@/
/
O@?! #(-1)/!0+:'<9/6)()5(-+0/
<)! '2,! 7&#3'! ,C9,#&:,)'?! >,! 7(653! ()! '2,! 9,#7(#:$)6,! (7!
%,',6'&()! $)%! 6(:9$#,! &'! >&'2! '2,! 9,#7(#:$)6,! (7! '2,! +--!
9#(9(3,%! 48! PU_Q! $)%! '2,! <+--! 9#(9(3,%! 48! PLQ;! J,! 3,'! '2,!
*,)='2!(7!3*&%&)=!>&)%(>!'(!4,!O!$)%!XX?!$)%!',3'!4('2!6$3,3;!"2,!
)5:4,#! (7! 2&%%,)! 3'$',3! &3! XL! $)%! '2,! )5:4,#! (7! (43,#/,%!
38:4(*3! &3! XVU;! "$4*,! X! &3! '2,! 6()7&=5#$'&()! (7! '2,! 7&#3'!
,C9,#&:,)';!
$'3<)/?@/#(-1)/)R7),-:)0(/5+0S-*2,'(-+0@/
#(-1)/6'('.)(.
B2:3),/+S/Q,+5)..).
!
T-*2,)/D@/%8&/52,U)./5+:7',)1/>-(N/G'0*/)(/'<@/'01//////
T<+,)V/)(/'<@/201),/#(-1)/>-01+>/.-V)/??@/
B2:3),/+S/#9.():/&'<<.
B+,:'</$,'-0-0*/6'('
LO
DX?YW
/
B+,:'</$).(/6'('
WWZ
[[\OZ\
A0,2.-+0/$).(/6'('
?Y\
DY\XL\
]&=;X!$)%!]&=;U!32(>!'2$'!'2,!f\K!65#/,3!(7!'2,!<+--!$)%!'2,!
+--! $#,! #(5=2*8! 3&:&*$#;! "2,! #,$3()! &3! '2$'! '2,! <+--! :$&)*8!
&:9#(/,3! '2,! '#$&)&)=! '&:,! (7! +--?! 45'! )('! &:9#(/&)=!
9,#7(#:$)6,?!3(!&'!2$3!'2,!3&:&*$#!#,35*'3!>&'2!+--;!<)!$%%&'&()?!
>,! 6$)! 3,,! '2$'! '2,! %&77,#,)'! >&)%(>! 3&Z,3! >&**! &)7*5,)6,! '2,!
9,#7(#:$)6,! (7! %,',6'&();! "2,! #,$3()! &3! '2$'! 32(#',#! 3,[5,)6,3!
(665#!:562!:(#,!7#,[5,)'*8!'2$)!'2,!*()=,#!3,[5,)6,3?!45'!(7',)!
$#,!)('!$3!$665#$',!$3!*()=,#!3,[5,)6,3;!D(!'2,!32(#',#!3,[5,)6,3!
!
!
J,!$)$*8Z,!(5#!,C9,#&:,)'$*!#,35*'!>&'2!('2,#!#,*$',%!>(#.!48!
/$#8&)=! '2#,32(*%!=) !7#(:! F;FFF`! '(! X;! "2,! f,6,&/,#! (9,#$'&)=!
62$#$6',#&3'&6!0f\K1!(7!,$62!:,'2(%!&3!32(>)!&)!]&=;X!$)%!]&=;U;!
<)! ]&=;X?! CA$C&3! &3! '2,! 7$*3,! 9(3&'&/,! #$',! $)%! 8A$C&3! &3! '2,!
$#,! 2$#%! '(! %&77,#,)'&$',! '2,! )(#:$*! 9#(6,33! 7#(:! '2,! &)'#53&()!
9#(6,33;!
J,!$*3(!,/$*5$',!'2,!'#$&)&)=!'&:,!(7!:(%,*&)=!$!)(#:$*!9#(7&*,;!
"2,!,/$*5$'&()!#,35*'3!$#,!32(>)!&)!"$4*,!U;!J,!6$)!3,,!'2$'!(5#!
:,'2(%! 39,)%3! _l`! '&:,3! 6(:9$#,%! >&'2! '2,! ('2,#! >(#.3;! "2,!
#,$3()! &3! '2$'! '2,! *((9! (7! $%$4((3'! &3! `! $)%! &'! ),,%3! '(! '#$&)! `!
<+--3?!&'!'2,#,7(#,!39,)%3!2&=2,#!'#$&)&)=!6(3'!'2$)!'#$&)&)=!'2,!
5)&[5,!+--!(7!('2,#!#,*$',%!>(#.;!
$'3<)/D@/#(-1)/)R7),-:)0(/(,'-0-0*/(-:)@/
G'0*/)(/'<@
JHIIK
T<+,)V/)(/'<@
JAHIIK
Q,+7+.)1/:)(N+1
J!1'3++.(MAHIIK
G-01+>/#-V)??
?OW@W\X.
?OD@DDX.
OO[@W\O.
G-01+>/#-V)/W
Z\@[WO.
ZO@L\.
L[W@[[O.
!
O@D! #)01:'-</!0+:'<9/6)()5(-+0//
J,!53,!'2,!D,)%:$&*!%$'$3,'3!(7!PU_Q!'(!9#(/,!'2$'!(5#!:,'2(%!
6$)! 6*,$#*8! %&36#&:&)$',! '2,! )(#:$*! $)%! &)'#53&()! 9#(6,33,3;!
T(#:$*! 9#(6,33,3! 2$/,! *(>,#! $)(:$*8! #$',! $)%! &)'#53&()!
9#(6,33,3! 2$/,! 2&=2,#! $)(:$*8! #$',;! "$4*,! _! &3! '2,! 6()7&=5#$'&()!
(7!D,)%:$&*;!<)'#53&()!',3'!9#(6,33,3!$#,!D83*(=Af,:(',X?!D83*(=A
f,:(',U?! D83*(=Ad(6$*X?! D83*(=Ad(6$*U?! 3:`O`$! $)%! 3:`C;! "2,!
)5:4,#! (7! 2&%%,)! 3'$',3! &3! `_! $)%! '2,! )5:4,#! (7! (43,#/,%!
38:4(*3!&3!XVU;!
"$4*,! N! 32(>3! '2$'! (5#! :,'2(%! 2$3! 6*,$#,#! 3,9$#$'&()! (7!
$)(:$*8!#$',3!4,'>,,)!)(#:$*!$)%!$4)(#:$*!9#(6,33,3!'2$)!('2,#!
>(#.3;! ](#! ,C$:9*,?! &)! (5#! :,'2(%! '2,! $)(:$*8! #$',! (7! )(#:$*!
',3'!%$'$!&3!F;VOG!>2&*,!'2,! :&)&:5:! $)(:$*8!#$',!(7!$4)(#:$*!
9#(6,33! &3! XN;NXG;! "2,! %&77,#,)6,! &3! XE! '&:,3;! D(! &'! 6$)! 6*,$#*8!
%&36#&:&)$',!)(#:$*!$)%!&)'#53&()!9#(6,33,3!;!
O@L! !0+:'<9/6)()5(-+0/+S/AE/A0(,2.-+0@/
"(! ,/$*5$',! '2,! 9,#7(#:$)6,! (7! @%$4((3'A<+--! &)! #,$*&3'&6!
6$3,?!>,!'$.,!<BO!0<)',#),'!BC9*(#,#!/,#3&()!O1!$3!',3'!'$#=,';!J,!
7&#3'!,/$*5$',!>2,'2,#!'2,!@%$4((3'A<+--!6$)!,77,6'&/,*8!%,',6'!
$)!&)'#53&()!(665##,%!&)!<BO;!"2,)!>,!6(:9$#,!'2,!$)(:$*8!#$',3!
%,',6',%! 48! 95#,! <+--! $)%! @%$4((3'A<+--;! "2,! <BO! )(#:$*!
%$'$! :,$)3! '2,! 383',:! 6$**3! =,),#$',%! 48! '2,! <BO! >2&62! 2$3! )(!
&)'#53&();! "2,! &)'#53&()! %$'$3,'3! (7! -DFOAFFX?! -DFOAFX_! $)%!
-DFEAFFN! $#,! (4'$&),%! 7#(:! '2,! ,C9*(&'! (7! /5*),#$4&*&'8! &)!
=#$92&63! #,)%,#&)=! ,)=&),?! '2,! ,C9*(&'! (7! /5*),#$4&*&'8! &)!
6#,$',',C'#$)=,!4577,#!(/,#7*(>!$)%!'2,!,C9*(&'!(7!/5*),#$4&*&'8!&)!
/,6'(#! :$#.59! *$)=5$=,! (7! <BO! #,39,6'&/,*8;! "2,! >&)%(>! 3&Z,! &3!
3,'!'(!4,!XX?!'2,!)5:4,#!(7!2&%%,)!3'$',3!&3!NF!$)%!'2,!)5:4,#!(7!
(43,#/,%! 38:4(*3! &3! UVN;! "2,! <BO! '#$&)&)=! %$'$! 6()'$&)3! UE?VUV!
383',:! 6$**3;! "2,! ('2,#! ',3'! %$'$! $)%! ,C9,#&:,)'$*! #,35*'3! $#,!
32(>)!&)!"$4*,!`;!
$'3<)/\@/$N)/'0+:'<9/,'()./+S/AEW/-0(,2.-+0/1)()5()1/39/
!1'3++.(MAHII@/
$).(/6'('
B2:3),/+S
#9.():/&'<<.
!0+:'<9/%'()JbK
B2:3),/+S
I-.:'(5N
#)`2)05).c$+('<
B2:3),/+S
#)`2)05).
AEW/B+,':</6'('
L\dYXL
D@\Z
XY?cL\Y[L
I#YWMYY?
/A0(,2.-+0/6'('
DLdL\L
Z[@L\
?[DXYcDLLOL
I#YWMY?L
/A0(,2.-+0/6'('
?DdY[O
Z?@X[
[WX?c?DYZO
I#YZMYYO
/A0(,2.-+0/6'('
DWdW\O
?L@\X
LWDDcDWWOO
$'3<)/L@/#)01:'-</)R7),-:)0(/5+0S-*2,'(-+0@/
6).5,-7(-+0/+S/(N)/1'('.)(./2.)1/J#':)/'./G'0*/)(/'<@/^DL_K
?Y\/(,'5)./+S/(N)/<'((),/1'('/-0
#)01:'-<@-0-
$,'-0-0*/6'('
?D/(,'5)./+S/(N)/S+,:),/1'('/-0
#)01:'-<@1'):+0@-0OD/(,'5)./+S/(N)/S+,:),/1'('/-0
#)01:'-<@-0-
B+,:'<
\/(,'5)./+S/(N)/<'((),/1'('/-0
$).(/6'('
#)01:'-<@1'):+0@-0O/.9.<+*/'(('5]./'01
!30+,:'<
D/20.255)..S2</'(('5].
/
/
J,! 6(:9$#,! (5#! ,C9,#&:,)'$*! #,35*'3! >&'2! ('2,#! #,*$',%! >(#.!
$)%! ,/$*5$',! '2,! $)(:$*8! #$',3! (7! ,$62! >(#.;! "$4*,! N! 32(>3! '2,!
,C9,#&:,)'$*!#,35*'3;!
!
<)!"$4*,!`?!'2,!%&77,#,)6,!&)!$)(:$*8!#$',3!4,'>,,)!)(#:$*!',3'!
%$'$! $)%! &)'#53&()! %$'$! &3! `;_l_F;`! '&:,3?! 3(! &'! 6$)! 6*,$#*8!
%&36#&:&)$',!>2,'2,#!$!9#(6,33!&3!$)(:$*8!(#!)(';!"$4*,!O!32(>3!
'2$'! @%$4((3'A<+--! 6$)! %&36#&:&)$',! )(#:$*! 9#(6,33,3! 7#(:!
&)'#53&()!9#(6,33,3!:(#,!6*,$#*8!'2$)!'2,!%,',6'&()!#$',!(7!<+--!
>2&62!&3!(4'$&),%!7(#:!'2,!#,35*'!(7!D2&)!PU`Q;!
$'3<)/W@/!0+:'<9/,'()./1)()5()1/39/AHII/'01/////////
!1'3++.(MAHII/
$'3<)/O@/$N)/'0+:'<9/,'()./5+:7',)1/>-(N/+(N),.@/
#9.():/&'<<
#)`2)05).
JG-01+>/#-V)a/??K
T+,,).(/)(/'<@ F))/)(/'<@ G'0*/)(/'<@
J#(-1)K
J%AQQE%K JHIIK
JbK
JbK
JbK
Q,+7+.)1/:)(N+1
J!1'3++.(MAHIIK
JbK
#9.<+*M%):+()?
\@?
?L@X
?O@LD
DY@W
#9.<+*M%):+()D
?@Z
?Y@X
??@ZL
?O@O?
#9.<+*MF+5'<?
O
Z@D
W@L?
?\@LD
#9.<+*MF+5'<D
\@L
X
W@LX
?Z@WO
.:\W\'
Y@W
X@O
O@X?
DD@YO
.:\R
D@Z
?Y@?
D@Z\
?X@\\
B+,:'</$).(/6'('
Y
?@D
Y
Y@[W
!
!
A0(,2.-+0/$).(/6'('
I#YWMYY?/A0(,2.-+0
6'('
I#YWMY?L/A0(,2.-+0
6'('
I#YZMYYO/A0(,2.-+0
6'('
G)0MT2/#N-0
JAHIIK
JbK
Q,+7+.)1/:)(N+1
J!1'3++.(MAHIIK
JbK
W@OW
Z[@L\
?[@[?
Z?@X[
?L@OW
?L@\X
!
!
/
"2,! $)(:$*8! #$',3! (7! &)'#53&()! %$'$! &)! (5#! :,'2(%! $#,! 2&=2,#!
'2$)!'2$'!(7!PU`Q?!,C6,9'!7(#!-DFEAFFN!&)'#53&()!%$'$!()*8!2&=2,#!
F;X_?! '2,! ,*3,! (7! &)'#53&()! %$'$! (7! (5#! :,'2(%! %&77,#! NlXU! '&:,3!
7#(:! PU`Q;! "2,#,7(#,?! (5#! :,'2(%! 2$3! 4,'',#! 9,#7(#:$)6,! &)!
&)'#53&()!%,',6'&();!
O@O! 80M<-0)/!142.(-0*/B+,:'</Q,+S-<)//
J2,)! '2,! )(#:$*! 4,2$/&(#! >$3! 62$)=,%?! '2,! )(#:$*! 9#(7&*,!
32(5*%! 4,! $%I53',%! 35&'$4*8! '(! $/(&%! ,##(),(53! %,',6'&()! 6$53,%!
48!'2,!62$)=,;! J,!,C',)%!<BO!,C9,#&:,)'!$)%!%&/&%,!&'!&)'(!'>(!
3&'5$'&()3R! 0X1! ^9=#$%,! '2,! 9#(=#$:! /,#3&();! <7! '2,! <BO! >$3!
59=#$%,%! '(! <BE! (#! <BV?! '2,! )(#:$*! 9#(7&*,! (7! <BO! 32(5*%! 4,!
$%I53',%;!0U1!"2,!62$)=,3!(7!53,#!4,2$/&(#;!](#!,C$:9*,?!>2,)!$!
53,#!#,9*$6,3!'2,!<BO!>&'2!'2,!]&#,7(C!$3!2&3a2,#!4#(>3,#?!)(#:$*!
9#(7&*,!(7!<BO!32(5*%!4,!$%I53',%!'((;!<)!'2&3!,C9,#&:,)'?!>,!53,!
<BO?!<BE?!<BV?!e((=*,!K2#(:,?!]&#,7(C!$)%!\9,#$!$3!,C9,#&:,)'!
(4I,6'3!$)%!>,!53,!'2,!'#$&)&)=!%$'$!(7!<BO!'(!',3'!'2,!%$'$!(7!('2,#!
4#(>3,#3;!"2,!<BO!'#$&)&)=!%$'$!6()'$&)3!UE?VUV!383',:!6$**3;!"2,!
',3'!%$'$!$)%!,C9,#&:,)'$*!#,35*'3!$#,!32(>)!&)!"$4*,!E;!
J,!$*3(!'$.,!'2,!\9,#$!$3!$)('2,#!,C$:9*,!'(!32(>!'2,!,77,6'3!
(7!$%I53'&)=!'2,!)(#:$*!9#(7&*,;! "2,!\9,#$!'#$&)&)=!%$'$!6()'$&)3!
XL?`VN! 383',:! 6$**3;! "2,! ',3'! %$'$! $)%! ,C9,#&:,)'$*! #,35*'3! $#,!
32(>)! &)! "$4*,! L;! @7',#! $%I53'&)=! <BO! )(#:$*! 9#(7&*,?! \9,#$!
$)(:$*8! #$',! &3! %(>)! 7#(:! NF;F_G! '(! F;XVG;! <'! 9#(/,3! '2$'! (5#!
:,'2(%! 6$)! %,6#,$3,! $)(:$*8! 6$53,%! 48! '2,! 62$)=,3! (7! 53,#!
4,2$/&(#;!
$'3<)/X@/$N)/'142.(-0*/)R7),-:)0(//
+0/(N)/5N'0*)./+S/2.),/3)N'U-+,@/
$'3<)/Z@/$N)/'0+:'<9/,'()./+S/AEW/5+:7',)1/////////////////////////
>-(N/+(N),/3,+>.),.@!
!
$).(/6'('
B2:3),/+S
#9.():/&'<<.
!0+:'<9/%'()JbK
AEW/
L\dYXL
D@\Z
B2:3),/+S
I-.:'(5N
#)`2)05).c$+('<
B2:3),/+S
#)`2)05).
XY?cL\Y[L
AEZ/
[ZdO[X
?O@O
?DWYYc[ZOZX
AE[/
DYd[[O
?O@LZ
DXXZcDY[\O
;++*<)/&N,+:)
?Y?dOOD
LW@YO
LW\\[c?Y?OLY
T-,)S+R
[\d[W?
D[@L\
DOLLZc[\[\?
87),'
LXdY[L
OY@YL
?\WODcLXYZL
B2:3),/+S/#9.():
&'<<.
!0+:'<9/%'()JbK
B2:3),/+S/I-.:'(5N
#)`2)05).c/$+('<
B2:3),//+S/#)`2)05).
")S+,)/!142.(-0* 87),'//6'('
LXdY[L
OY@YL
?\WODcLXYZL
!S(),/!142.(-0* 87),'//6'('
LXdY[L
Y@?[
ZYcLXYZL
$).(/6'('
!
]&)$**8?!^3,!\9,#$!$3!$)!,C$:9*,?!>,!,/$*5$',!'2,!'#$&)&)=!6(3'!
(7!(5#!:,'2(%!6(:9$#,%!>&'2!+--!PU_Q!$)%!<+--!PLQ;!D,,!]&=;!
_!7(#!'2,!#,35*'3;!
!
!
"$4*,! E! 32(>3! '2$'! '2,! $)(:$*8! #$',! (7! <BO!%&77,#3! 7#(:! ('2,#!
4#(>3,#3! 48! `lXO! '&:,3;! @3! <BO?! <BE?! <BV! $#,! $**! %,/,*(9,%! 48!
-&6#(3(7'?! '2,! $)(:$*8! #$',3! (7! <BE! $)%! <BV! $#,! *(>,#! '2$)! '2,!
('2,#3;! "2,! $)(:$*8! #$',! (7! \9,#$! &3! '2,! 2&=2,3'! 0NF;F_G1;! <'!
&:9*&,3! '2,! 4,2$/&(#!(7! \9,#$! &3! 3&=)&7&6$)'*8! %&77,#,)'! 7#(:! '2$'!
(7!<BO;!
]#(:! "$4*,! E?! >,! 6$)! 3,,! '2$'! &7! >,! %()!'! $%I53'! '2,! )(#:$*!
9#(7&*,! (7! 9#(6,33! 4,2$/&(#! 9#(:9'*8?! '2,! <MD! >&**! :&3'$.,)*8!
%,',#:&),! $! )(#:$*! 9#(6,33?! 3562! $3! \9,#$?! $3! $)! &)'#53&()!
9#(6,33;! D(! &)! '2,! 3,6()%! 92$3,! (7! '2&3! ,C9,#&:,)'?! >,! 53,! ),>!
'#$&)&)=! %$'$! (7! <BV! '(! $%I53'! <BO! )(#:$*! 9#(7&*,! $)%! ,/$*5$',!
$)(:$*8!#$',!(7!<BV!',3'!%$'$!$7',#!$%I53',%;!"2,!<BV!'#$&)&)=!%$'$!
6()'$&)3! X`?_NU! 383',:! 6$**3;! "2,! ',3'! %$'$! $)%! ,C9,#&:,)'$*!
#,35*'3!$#,!32(>)!&)!"$4*,!V;!@7',#!$%I53'&)=!<BO!)(#:$*!9#(7&*,?!
<BV!$)(:$*8!#$',!&3!%(>)!7#(:!XN;_EG!'(!U;OG;!"2&3!9#(/,3!'2$'!
(5#! :,'2(%! 6$)! %,6#,$3,! $)(:$*8! 6$53,%! 48! 59%$'&)=! 9#(=#$:!
/,#3&();!
$'3<)/[@/$N)/'142.(-0*/)R7),-:)0(/+0/27*,'1-0*/AEW/(+/AE[@/
$).(/6'('
B2:3),/+S/#9.():
!0+:'<9/%'()JbK
&'<<.
B2:3),/+S/I-.:'(5N
#)`2)05).c/$+('<
B2:3),//+S/#)`2)05).
")S+,)/!142.(-0* AE[/6'('
DYd[[O
?O@LZ
DXXZcDY[\O
!S(),/!142.(-0* AE[/6'('
DYd[[O
D@W
\OLcDY[\O
!
T-*2,)/L@/$,'-0-0*/'/0)>/0+,:'</7,+S-<)/+S/87),'@/
!
]#(:!]&=;!_?!$*'2(5=2!'2,!<+--!#,%56,!'2,!6(:95'$'&()! 6(3'!
(7! '! 7#(:! \0TU"1! '(! \0TU1?! &'! 3'&**! ),,%3! '(! 6(:95',! Q! &)! '2,!
7(#>$#%!9#(6,%5#,;!D(!6(:9$#,%!>&'2!'2,!+--?!&'3!9,#7(#:$)6,!
()*8!&:9#(/,3!$!*&''*,;!+(>,/,#?!(5#!:,'2(%!()*8!),,%3!'(!$%I53'!
'2,!>,&=2'!(7!,$62!>,$.!6*$33&7&,#!$)%!'2,!>,&=2'3!(7!3$:9*,3;!D(!
6(:9$#,%! >&'2! ('2,#! >(#.3?! (5#! :,'2(%! &:9#(/,3! '2,! '#$&)&)=!
6(3'!48!LFG!&)!#,=$#%&)=!'(!'#$&)&)=!'2,!),>!)(#:$*!4,2$/&(#;!
\@! &+05<2.-+0./
<)! '2&3! 9$9,#?! >,! 2$/,! 9#(9(3,%! $)! @%$4((3'A<+--! 7(#!
$)(:$*8! &)'#53&()! %,',6'&()! $)%! 9#(9(3,%! $! :,62$)&3:! 7(#! ()A
*&),!$%I53'&)=!)(#:$*!9#(7&*,;!
"2,#,!$#,!'>(!6()'#&45'&()3!(7!(5#!#,3,$#62R!
!
X;!
D566,3375**8! $%(9'! @%$4((3'A<+--! '(! %,6#,$3,! '2,! 7$*3,!
9(3&'&/,! #$',! (7! $)(:$*8! &)'#53&()! %,',6'&();! <)! D'&%,!
,C9,#&:,)'?!>,!&:9#(/,!'2,!7$*3,!9(3&'&/,!#$',!48!EFG!>&'2!
)(!*(33!(7!%,',6'&()!#$',;!
U;!
\)A*&),! $%I53'&)=! )(#:$*! 9#(7&*,! >2,)! )(#:$*! 4,2$/&(#!
>$3! 62$)=,%;! BC9,#&:,)'3! 32(>! '2$'! 9#(:9'*8! $%I53'&)=!
)(#:$*!9#(7&*,!6$)!%,6#,$3,!'2,!9(33&4&*&'8!(7!7$*3,!9(3&'&/,!
$)%! (5#! ()A*&),! $%I53':,)'! :,'2(%! 6$)! #,%56,! '2,! #,A
'#$&)&)=!'&:,!(7!$!),>!)(#:$*!9#(7&*,!48!LFG!&)!'2,!6$3,!(7!
\9,#$!4#(>3,#;!!
!
!!]#(:! (5#! ,C9,#&:,)'3?! >,! 7&)%! '2$'! '2,! 3,''&)=! (7! '2#,32(*%!
$=' $$ &)! B[;0`1! &3! 3,)3&'&/,?! 4,6$53,! &'! >&**! $77,6'! '2,! %,',6'&()!
9,#7(#:$)6,;!<7!'2,!'2#,32(*%!/$*5,!&3!*(>?!4('2!'2,!%,',6'&()!#$',!
$)%! '2,! 7$*3,! 9(3&'&/,! #$',! $#,! #,*$'&/,*8! *(>;! K5##,)'*8?! '2&3!
'2#,32(*%!&3!3,'!48!'#&$*!$)%!,##(#;!"253?!2(>!'(!3,'!$)!$66,9'$4*,!
'2#,32(*%! ),,%!'(!4,!75#'2,#!#,3,$#62,%;!H,3&%,3?!$3! >&)%(>!3&Z,!
&3!'2,!*,)='2!(7!3,[5,)6,?!&'!>&**!$*3(!$77,6'!'2,!%,',6'&();!D(!2(>!
'(!62((3,!'2,!4,'',#!>&)%(>!3&Z,!'(!%,',6'!&)'#53&()!,77,6'&/,*8!&3!
$)('2,#!#,3,$#62!%&#,6'&();!
W@! !&=B8GFE6;IEB$#/
J,!/,#8!$99#,6&$',!'2,!$)()8:(53!#,/&,>,#3!7(#!'2,&#!/$*5$4*,!
6(::,)'3!$)%!35==,3'&()3;!"2&3!#,3,$#62!>$3!3599(#',%!&)!9$#'!48!
'2,! "$&>$)! T$'&()$*! D6&,)6,! K(5)6&*! 5)%,#! 6()'#$6'3! TDK! L`A
UUUXABAFFV!AF`L!$)%!TDK!LOAUOUVABAFFV!AFFVA-m_;!
Z@! %ETE%EB&E#/
PXQ! W$39,#3.8!D,65#&'8!H5**,'&)R!-$*>$#,!B/(*5'&()!UFFV;!
2''9Raa>>>;/&#53*&3';6(:a,)a$)$*83&3h?!@66,33,%!()!-$#62!
FU?!UFFL;!
PUQ! m()=Z2()=!d&?!m$)=!e,?!n5!b&)=?!o2$(!H(;!@!),>!&)'#53&()!
%,',6'&()!:,'2(%!4$3,%!()!75ZZ8!+--;!<)%53'#&$*!
B*,6'#()&63!$)%!@99*&6$'&()3?!UFFV;!
P_Q! f$25*!W2$))$?!+5$9&)=!d&5;!K()'#(*!'2,(#,'&6!$99#($62!'(!
&)'#53&()!%,',6'&()!53&)=!$!%&3'#&45',%!2&%%,)!:$#.(/!:(%,*;!
J&#,*,33!K(::5)&6$'&()3?!<BBB!UFFV;!
PNQ! WI,'&*!+$3*5:?!-$#&,!B;!e;!-(,!$)%!D/,&)!b;!W)$93.(=;!
f,$*A'&:,!&)'#53&()!9#,/,)'&()!$)%!3,65#&'8!$)$*83&3!(7!
),'>(#.3!53&)=!+--3;!d(6$*!K(:95',#!T,'>(#.3?!UFFV;!
P`Q! WI,'&*!+$3*5:?!@I&'2!@4#$2$:!$)%!D/,&)!W)$93.(=;!]5ZZ8!
()*&),!#&3.!$33,33:,)'!7(#!%&3'#&45',%!&)'#53&()!9#,%&6'&()!
$)%!9#,/,)'&()!383',:3;!K(:95',#!-(%,*&)=!$)%!D&:5*$'&()?!
UFFV;!
POQ! K25)!m$)=?!],&[&!M,)=?!+$&%()=!m$)=;!@)!5)359,#/&3,%!
$)(:$*8!%,',6'&()!$99#($62!53&)=!354'#$6'&/,!6*53',#&)=!$)%!
2&%%,)!:$#.(/!:(%,*;!K(::5)&6$'&()3!$)%!T,'>(#.&)=!&)!
K2&)$?!UFFE;!
PEQ! K;!J$##,)%,#?!D;!](##,3'?!H;!S,$#*:5'',#;!M,',6'&)=!
&)'#53&()3!53&)=!383',:!6$**3R!$*',#)$'&/,!%$'$!:(%,*3;!<)!
S#(6,,%&)=3!(7!'2,!XLLL!<BBB!D8:9(3&5:!()!D,65#&'8!$)%!
S#&/$68?!9$=,3!X__AX`U?!\$.*$)%?!K$*&7(#)&$?!XLLL;!
PVQ! D;!K2(?!D;!+$);!">(!3(92&3'&6$',%!',62)&[5,3!'(!&:9#(/,!
2::A4$3,%!&)'#53&()!%,',6'&()!383',:3;!S#(6,,%&)=3!(7!
<)',#)$'&()$*!D8:9(3&5:!()!f,6,)'!@%/$)6,3!&)!<)'#53&()!
M,',6'&()?!UFF_;!
PLQ! e,#:$)!]*(#,ZAd$##$2()%(?!D53$)!H#&%=,3!$)%!B#&6!@;!
+$)3,);!<)6#,:,)'$*!,3'&:$'&()!(7!%&36#,',!2&%%,)!:$#.(/!
:(%,*3!4$3,%!()!$!),>!4$6.>$#%!9#(6,%5#,;!<)!S#(6,,%&)=3!
(7!'2,!">,)'&,'2!T$'&()$*!K()7,#,)6,!()!@#'&7&6&$*!
<)',**&=,)6,?!UFF`;!
PXFQ!m($/!]#,5)%!$)%!f(4,#'!B;!D62$9&#,;!@!%,6&3&()A'2,(#,'&6!
=,),#$*&Z$'&()!(7!()A*&),!*,$#)&)=!$)%!$)!$99*&6$'&()!'(!
4((3'&)=;!b(5#)$*!(7!K(:95',#!$)%!D83',:!D6&,)6,3!``?!XXLA
X_L?!XLLE;!
PXXQ!m($/!]#,5)%!$)%!f(4,#'!B;!D62$9&#,;!@!32(#'!&)'#(%56'&()!'(!
4((3'&)=;!b(5#)$*!(7!b$9$),3,!D(6&,'8!7(#!@#'&7&6&$*!
<)',**&=,)6,?!XN0`1REEXAEVF?!D,9',:4,#?!XLLL;!
PXUQ!T&.5)I!K;!\Z$!$)%!D'5$#'!f533,**;!\)*&),!4$==&)=!$)%!
4((3'&)=;!<)!@#'&7&6&$*!<)',**&=,)6,!$)%!D'$'&3'&63!UFFX?!W,8!
J,3'?!]d?!^D@?!99;!XF`AXXU;!b$)5$#8!UFFX;!
PX_Q!T&.5)I!K;!\Z$;!\)*&),!,)3,:4*,!*,$#)&)=;!M,9$#':,)'!(7!
B*,6'#&6$*!B)=&),,#&)=!$)%!K(:95',#!D6&,)6,?!^)&/,#3&'8!(7!
K$*&7(#)&$?!H,#.,*,8?!UFFX;!
PXNQ!d;!f;!f$4&),#?!H;!+;!b5$)=;!@)!&)'#(%56'&()!'(!2&%%,)!
:$#.(/!:(%,*3;!<BBB!@DDS!-$=$Z&),?!b$)5$#8!XLVO;!
PX`Q!d;!f;!f$4&),#;!@!'5'(#&$*!()!2&%%,)!:$#.(/!:(%,*3!$)%!
3,*,6',%!$99*&A6$'&()3!&)!39,,62!#,6(=)&'&();!S#(6;!<BBB?!/(*;!
EE?!99;!U`ERUVO?!],4!XLVL;!
PXOQ!"(:!H8*$)%,#!$)%!d&3$!"$',;!^3&)=!/$*&%$'&()!3,'3!'(!$/(&%!
(/,#7&''&)=!&)!$%$4((3';!@:,#&6$)!@33(6&$'&()!7(#!@#'&7&6&$*!
<)',**&=,)6,?!UFFO;!
PXEQ!@*,C$)%,#!p,Z2),/,'3!$)%!\*=$!H$#&)(/$;!@/(&%&)=!
4((3'&)=!(/,#7&''&)=!48!#,:(/&)=!6()753&)=!3$:9*,3;!
D9#&)=,#Ap,#*$=!H,#*&)!+,&%,*4,#=!UFFE;!
PXVQ!J,&:&)=!+5?!J,&!+5?!$)%!D',/,!-$84$).;!@%$4((3'A4$3,%!
$*=(#&'2:!7(#!),'>(#.!&)'#53&()!%,',6'&();!<BBB!"#$)3$6'&()3!
()!D83',:3?!-$)?!$)%!K84,#),'&63S9$#'!HR!K84,#),'&63?!/(*;!
_V?!T\;!U?!@9#&*!UFFV;!
PXLQ!D$8!J,&!]((?!m()=!d&$)?!$)%!d&$)=!M()=;!f,6(=)&'&()!(7!
/&35$*!39,,62!,*,:,)'3!53&)=!$%$9'&/,*8!4((3',%!2&%%,)!
:$#.(/!:(%,*3;!<BBB!"#$)3$6'&()3!()!K&#65&'3!$)%!D83',:3!
7(#!p&%,(!",62)(*(=8?!/(*;!XN?!T\;!`?!-$8!UFFN;!
PUFQ!d,/,)'!-;!@#3*$)?!$)%!b(2)!+;!d;!+$)3,);!D,*,6'&/,!'#$&)&)=!
7(#!2&%%,)!:$#.(/!:(%,*3!>&'2!$99*&6$'&()3!'(!39,,62!
6*$33&7&6$'&();!<BBB!"#$)3$6'&()3!()!D9,,62!$)%!@5%&(!
S#(6,33&)=?!/(*;E?!T\;X?!b$)5$#8!XLLL;!
PUXQ!^T-!D83',:!K$**!M$'$3,'3;!
2''9Raa>>>;63;5):;,%5al&::3,6a383',:6$**3;2':;!
PUUQ!n&$(Aq5)=!o2$)=r?!o2()=Ad*$)=!o!+!^;!K(:4&)&)=!'2,!
2::!$)%!'2,!),5#$*!),'>(#.!:[%,*3!'(!#,6(=)&Z,!&)'#53&()3;!
S#(6,,%&)=3!(7!'2,!"2&#%!<)',#)$'&()$*!K()7,#,)6,!()!
-$62&),!d$:&)=!$)%!K84,#),'&63?!D2$)=2$&?!UOAUL!@5=53'!
UFFN;!
PU_Q!J;!J$)=?!n;+;!e5$)?!n;d;!o2$)=;!-(%,*&)=!9#(=#$:!
4,2$/&(#3!48!2&%%,)!:$#.(/!:(%,*3!7(#!&)'#53&()!%,',6'&();!
<)!S#(6,,%&)=3!(7!UFFN!<)',#)$'&()$*!K()7,#,)6,!()!-$62&),!
d,$#)&)=!$)%!K84,#),'&63?!@5=!UFFN;!
PUNQ!B;B3.&)?!J;d,,?!$)%!D;b;D'(*7(;!-(%,*&)=!383',:!6$**3!7(#!
&)'#53&()!%,',6'&()!>&'2!%8)$:&6!>&)%(>!3&Z,3;!<)!
S#(6,,%&)=3!(7!M@fS@!<)7(#:$'&()!D5#/&/$4&*&'8!
T5-6&2&-/&#U#LK459,$,5-#VV"NWWM<XVYTLZ!FX?!b5),!UFFX;!
PU`Q!J,)A]5!D2&);!@)!$%$9'&/,!$)(:$*8!%,',6'&()!:,'2(%!4$3,%!
()!&)6#,:,)'$*!2&%%,)!:$#.(/!:(%,*!$)%!>&)%(>3!)$'&/,!
@S<;!M,9$#':,)'!(7!<)7(#:$'&()!-$)$=,:,)'?!T$'&()$*!
K,)'#$*!^)&/,#3&'8?!"$&>$);!-$3',#!'2,3&3!UFFE;!0&)!K2&),3,;1!
Addressing the Attack Attribution Problem
using Knowledge Discovery and Multi-criteria
Fuzzy Decision-Making
Olivier Thonnard
Royal Military Academy
Polytechnic Faculty
Brussels, Belgium
Wim Mees
Royal Military Academy
Polytechnic Faculty
Brussels, Belgium
[email protected] [email protected]
ABSTRACT
In network traffic monitoring, and more particularly in the
realm of threat intelligence, the problem of “attack attribution” refers to the process of effectively attributing new
attack events to (un)-known phenomena, based on some evidence or traces left on one or several monitoring platforms.
Real-world attack phenomena are often largely distributed
on the Internet, or can sometimes evolve quite rapidly. This
makes them inherently complex and thus difficult to analyze.
In general, an analyst must consider many different attack
features (or criteria) in order to decide about the plausible root cause of a given attack, or to attribute it to some
given phenomenon. In this paper, we introduce a global
analysis method to address this problem in a systematic
way. Our approach is based on a novel combination of a
knowledge discovery technique with a fuzzy inference system, which somehow mimics the reasoning of an expert by
implementing a multi-criteria decision-making process built
on top of the previously extracted knowledge. By applying
this method on attack traces, we are able to identify large-scale attack phenomena with a high degree of confidence.
In most cases, the observed phenomena can be attributed
to so-called zombie armies, or botnets, i.e. groups of compromised machines controlled remotely by the same entity.
By means of experiments with real-world attack traces, we
show how this method can effectively help us to perform a
behavioral analysis of those zombie armies from a long-term,
strategic viewpoint.
Keywords
Intelligence monitoring and analysis, attack attribution.
1. INTRODUCTION
In the field of threat intelligence, “attack attribution” refers
to the process of effectively attributing new attack events to
known or unknown phenomena by analyzing the traces they
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
CSI-KDD ’09, June 28, Paris, France
Copyright 2009 ACM 978-1-60558-669-4 ...$5.00.
Marc Dacier
Symantec Research
Sophia Antipolis
France
[email protected]
have left on sensors or monitoring platforms deployed on
the Internet. The objectives of such a process are twofold:
i) to get a better understanding of the root causes of the
observed attacks; and ii) to characterize emerging threats
from a global viewpoint by producing a precise analysis of
the modus operandi of the attackers on a longer time scale.
In this paper, we introduce a global threat analysis method
to address this problem in a systematic way. We present a
knowledge mining framework that enables us to identify and
characterize large-scale attack phenomena on the Internet,
based on network traces collected with very simple and easily
deployable sensors. Our approach relies on a novel combination of knowledge discovery (by means of maximum cliques)
and a multi-criteria decision-making algorithm that is based
on a fuzzy inference system (FIS). Interestingly, a FIS does
not need any training prior to making inferences. Instead, it
takes advantage of the previously extracted knowledge to
make sound inferences, so as to attribute incoming attack
events to a given phenomenon.
A key aspect of the proposed method is the exploitation
of external characteristics of malicious sources, such as their
spatial distributions in terms of countries and IP subnets,
or the distribution of targeted sensors. We take advantage
of these statistical characteristics to group events that seem
a priori unrelated, whereas most current techniques used
for anomalous traffic correlation rely only on the intrinsic
properties of network flows (e.g., protocol characteristics,
IDS alerts or signatures, firewall logs, etc) [1, 31].
Our research also builds on prior work in malicious traffic analysis, also referred to as Internet background radiation [17, 4]. We also acknowledge the seminal work of Yegneswaran et al. on “Internet situational awareness” [30], in
which they explore ways to integrate honeypot data into
daily network security monitoring. Their approach aims at
providing tactical information, for daily operations, whereas
our approach is more focused on strategic information revealing the long-term behaviors of large-scale phenomena.
Furthermore, many of these large-scale phenomena are apparently related to the ubiquitous problem of zombie armies, or botnets, i.e. groups of compromised machines that are remotely controlled and coordinated by the same entity. Still
today, zombie armies and botnets constitute, admittedly,
one of the main threats on the Internet, and they are used
for different kinds of illegal activities (e.g., bulk spam sending, online fraud, denial of service attack, etc) [3, 18]. While
most previous studies related to botnets have focused on understanding their inner working [23, 6, 2], or on techniques for detecting bots at the network level [8, 9], we are instead more interested in studying the global behaviors of those armies from a strategic viewpoint, i.e.: how long do they stay alive on the Internet, what is their average size, and more importantly, how do they evolve over time with respect to different criteria such as their origins, or the type of activities (or scanning) they perform.
In Section 2, we present the first component of our method,
namely the extraction of cliques of attackers. This step aims
at discovering knowledge by identifying meaningful correlations in a set of attack events. In Section 3, we present a
multi-criteria decision-making algorithm that is based on a
fuzzy inference system. The purpose of this second component consists in combining intelligently the previously extracted knowledge, so as to build sequences of attack events
that can be very likely attributed to the same global phenomena. Then, in Section 4, we present our experimental
results and the kind of findings we can obtain by applying
this analysis method to a set of attack events. Finally, we
conclude in Section 5 and we suggest some future directions.
2. KNOWLEDGE DISCOVERY IN ATTACK TRACES
2.1 Introduction
We need first to introduce the notion of “attack event”.
Our dataset is made of network attack traces collected from
a distributed set of sensors (e.g., server honeypots), which
are deployed in the context of the Leurre.com Project [14,
22]. Since honeypots are systems deployed for the sole purpose of being probed or compromised, any network connection that they establish with a remote IP can be considered
as malicious, or at least suspicious. We use a classical clustering algorithm to perform a first low-level classification of
the traffic. Hence, each IP source observed on a sensor is
attributed to a so-called attack cluster [21] according to its
network characteristics, such as the number of IP addresses
targeted on the sensor, the number of packets and bytes sent
to each IP, the attack duration, the average inter-arrival time
between packets, the associated port sequence being probed,
and the packet payload (when available). Therefore, all IP
sources belonging to a given attack cluster have left very
similar network traces on a given sensor and consequently,
they can be considered as having the same attack profile.
This leads us then to the concept of attack event, which is
defined as follows:
An attack event refers to a subset of IP sources
having the same attack profile on a given sensor, and whose coordinated activity has been observed within a specific time window.
Fig. 1 illustrates this notion by representing the time series (i.e., the number of sources per day) of three coordinated
attack events observed on two different sensors in the same
time interval, and targeting three different ports. The identification of those events can be easily automated by using
the method presented in [20]. By doing so, we are able to
extract interesting events from this spurious, nonproductive
traffic collected by our sensors (previously termed “Internet
background radiation” in [17]), and we can focus on the most
important events that might originate from coordinated phenomena.

Figure 1: Illustration of 3 attack events observed on 2 different sensors, and targeting 3 different ports (AE103 on port 139T and AE171 on port 1433T, both on sensor 45; AE173 on port 5900T on sensor 9); the y-axis gives the number of sources per day.

In the rest of this Section, we show how to take
advantage of different characteristics of such attack events
to discover knowledge by means of an unsupervised clique-based clustering technique.
2.2 Defining Attack Characteristics
In most knowledge discovery applications, the first step
consists in selecting certain key characteristics from the dataset,
i.e., salient features that may (hopefully) provide meaningful
patterns [11]. We give here an overview of different attack
characteristics we have selected to perform the extraction
of knowledge from our set of attack events. In this specific
case, we consider these characteristics as useful to analyze
the root causes of global phenomena observed on our sensors.
However, we do not pretend that they are the only ones that
could be used in threat monitoring, and other characteristics
might certainly prove even more relevant in the future. For
this reason, the framework is built such that other attack
features could be easily included when necessary.
So, the first two characteristics retained are related to
the origins of the attackers, i.e. their spatial distributions.
First, the geographical location can be used to identify attack activities having a specific distribution of originating
countries. Such information can be important to identify,
for instance, botnets that are located in a limited number of
countries. It is also a way to confirm the existence, or not, of
so-called safe harbors for cybercriminals or hackers. Somehow related to the geographical location, the IP network
blocks also provide an interesting viewpoint on the attack
phenomena. Indeed, IP subnets can give a good indication
of the spatial “uncleanliness” of certain networks, i.e., the
tendency for compromised hosts (e.g., zombie machines) to
stay clustered within unclean networks [5]. So, for each attack event, we can create a feature vector representing either
the distribution of originating countries, or of IP addresses
grouped by Class A-subnet (i.e., by /8 prefix).
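To make this concrete, the sketch below (our own illustrative Python, not the authors' implementation; the function name and input format are hypothetical) builds such a /8 feature vector from a list of observed source addresses:

```python
from collections import Counter

def subnet_distribution(source_ips):
    """Normalized distribution of sources over Class A (/8) prefixes.
    source_ips: iterable of dotted-quad strings such as '192.0.2.1'."""
    counts = Counter(int(ip.split(".")[0]) for ip in source_ips)
    total = sum(counts.values())
    # 256-dimensional feature vector indexed by the first IPv4 byte.
    return [counts.get(b, 0) / total for b in range(256)]

# Example: two sources in 24.0.0.0/8, one source in 222.0.0.0/8.
vec = subnet_distribution(["24.10.1.2", "24.200.3.4", "222.1.2.3"])
```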
The next attack characteristic deals with the targets of the
attackers, namely the distribution of sensors that have been
targeted by the sources. Botmasters may indeed send commands at a given time to all zombies to instruct them to start
scanning (or attacking) one or several IP subnets, which of
course will create coordinated attack events on specific sensors. Therefore, it seems important to look at relationships
that may exist between attack events and the sensors they
have been observed on. Since attack events are defined per
sensor, we decided to group all strongly correlated attack
events that occurred within the same time window of existence (as explained in [20]), and we then use each group of
attack events to create the feature vector representing the
proportion of sensors that have been targeted.
Besides the origins and the targets, the type of activity
performed by the attackers seems also relevant to us. In
fact, bot software is often crafted with a certain number of
available exploits targeting a reduced set of TCP or UDP
ports. In other words, we might think of each botnet having
its own attack capability, which means that a botmaster will
normally issue scan or attack commands only for vulnerabilities that he might exploit to expand his botnet. So, it seems
to make sense to take advantage of this feature to look for
similarities between the sequences of ports that have been
targeted by the sources of the attack events. Let us recall
that, in our low-level classification of the network traffic [21],
each source is associated with the complete sequence of ports
that it has targeted on a given sensor for the whole duration of the attack session (e.g., less than 24 hours), which
allows us to compute and compare the distributions of port
sequences for the observed attack events.
Finally, we have also decided to compute, for each pair
of events, the ratio of common IP addresses. We are aware
of the fact that, as time passes, some zombie machines of a
given botnet might be cured while others may get infected
and join the botnet. Additionally, certain ISPs apply a quite
dynamic policy of IP address allocation to residential users,
which means that bot-infected machines can have different
IP addresses when we observe them at different moments.
Nevertheless, and according to our domain experience, it
is reasonable to expect that if two distinct attack events
have a high percentage of IP addresses in common, then
the probability that those two events are somehow related
to the same global phenomenon is increased (assuming that
the time difference between the two events is not too large).
2.3 Extracting Cliques of Attackers
2.3.1 Principles
In our global threat analysis method, we have developed
a knowledge discovery component that involves an unsupervised graph-theoretic correlation process. The idea consists in discovering all groups of highly similar attack events
(through their corresponding feature vectors) in a reliable
and consistent manner, and for each attack characteristic
that can bring an interesting viewpoint on the root causes.
In a clustering task, we typically consider the following
steps [11]: i) feature selection and/or extraction; ii) definition of a similarity measure between pairs of patterns; iii)
grouping similar patterns; iv) data abstraction (if needed),
to provide a compact representation of each cluster; and v)
the assessment of the clusters quality and coherence.
In the previous Section, we have already described the attack features that are of interest in this paper; so now we
need to measure the similarity between two such input vectors (or distributions, in our case). Clearly, the choice of
a similarity metric is very important, as it has an impact
on the properties of the final clusters, such as their size,
quality, and consistency. To reliably compare the kind of
empirical distributions mentioned here above, we have chosen to rely on strong statistical distances. As we do not
know the real underlying distribution from which the observed samples were drawn, we use non-parametric statistical tests, such as Pearson’s χ2 , to determine whether two
one-dimensional probability distributions differ in a significant way (with a significance level of 0.05). The resulting
p-value is then validated against the Jensen-Shannon divergence (JSD) [15], which is derived from the Kullback–Leibler divergence [12]. Let p1 and p2 be for instance two
probability distributions over a discrete space X, then the
K-L divergence of p2 from p1 is defined as:
$$D_{KL}(p_1 \| p_2) = \sum_{x} p_1(x) \log \frac{p_1(x)}{p_2(x)}$$
which is also called the information divergence (or relative
entropy). DKL is commonly used in information theory to
measure the difference between two probability distributions
p1 and p2 , but it is not considered as a true metric since it
is not symmetric, and does not satisfy the triangle inequality. For this reason, we can also define the Jensen-Shannon
divergence as:
$$JS(p_1, p_2) = \frac{D_{KL}(p_1 \| \bar{p}) + D_{KL}(p_2 \| \bar{p})}{2}$$
where p̄ = (p1 + p2 )/2. In other words, the Jensen-Shannon
divergence is the average of the KL-divergences to the average distribution. The JSD has the following notable properties: it is always bounded and non-negative; JS(p1 , p2 ) =
JS(p2 , p1 ) (symmetric), and JS(p1 , p2 ) = 0 when p1 = p2
(idempotent). To be a true metric, the JSD must also satisfy the triangle inequality, which is not true for all cases
of (p1 , p2 ). Nevertheless, it can be demonstrated that the
square root of the Jensen-Shannon divergence is a true metric [7], which is what we need for our application.
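As an illustration, here is a minimal Python sketch of this distance computation under the definitions above (our own code, not the authors' implementation; the function names are hypothetical):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions; 0*log(0/q) is taken as 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js_distance(p1, p2):
    """Square root of the Jensen-Shannon divergence (a true metric)."""
    p1 = np.asarray(p1, dtype=float) / np.sum(p1)
    p2 = np.asarray(p2, dtype=float) / np.sum(p2)
    p_bar = (p1 + p2) / 2.0
    return float(np.sqrt((kl_divergence(p1, p_bar)
                          + kl_divergence(p2, p_bar)) / 2.0))

# Two toy country distributions observed for two attack events:
print(js_distance([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]))  # small -> similar
print(js_distance([0.9, 0.1, 0.0], [0.0, 0.1, 0.9]))  # large -> unrelated
```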
Finally, we take advantage of those similarity measures to
group all attack events whose distributions look very similar. We simply use an unsupervised graph-based approach to
formulate the problem: the vertices of the graph represent
the patterns (or feature vectors) of all attack events, and
the edges express the similarity relationships between those
vertices, as calculated with the distance metrics described
here above. Then, the clustering is performed by extracting
so-called maximal cliques from the graph, where a maximal clique is defined as an induced sub-graph in which the
vertices are fully connected and it is not contained within
any other clique. To perform this unsupervised clustering,
we use the dominant sets approach of Pavan et al. [19],
which proved to be an effective method for finding maximal
weighted cliques. This means that the weight of every edge
(i.e., the relative similarity) is also taken into consideration
by the algorithm, as it seeks to discover maximal cliques
whose total weight is maximized. This generalization of the
MCP is also known as the maximum weight clique problem (MWCP). We refer the interested reader to [27, 26] for
a more detailed description of this clique-based clustering
technique applied to our honeynet traces.
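The dominant-sets algorithm of Pavan et al. is more involved, but a simplified stand-in conveys the idea: build a graph whose edges link attack events with high similarity (e.g., one minus the Jensen-Shannon distance, above a threshold) and enumerate plain maximal cliques with NetworkX. Note that this sketch ignores edge weights, unlike the weighted formulation actually used in the paper:

```python
import networkx as nx

def clique_clusters(events, similarity, threshold=0.8):
    """Build a graph linking attack events whose pairwise similarity
    exceeds a threshold, then enumerate maximal cliques (largest first)."""
    g = nx.Graph()
    g.add_nodes_from(range(len(events)))
    for i in range(len(events)):
        for j in range(i + 1, len(events)):
            s = similarity(events[i], events[j])
            if s >= threshold:
                g.add_edge(i, j, weight=s)
    cliques = [c for c in nx.find_cliques(g) if len(c) > 1]
    return sorted(cliques, key=len, reverse=True)

# Usage: similarity could be, e.g., lambda p, q: 1 - js_distance(p, q).
```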
2.3.2 Some Experimental Clique Results
Our data set comes from a 640-day attack trace obtained
with the Leurre.com honeynet in the time period from September 2006 to June 2008. This trace was collected by 36 platforms located in 20 different countries and belonging to 18
different class A-subnets. We have selected only the most
prevalent types of activities observed on the sensors, i.e.
about 130 distinct attack profiles for which an activity involving a sufficient number of IP sources had been observed
at least once on a given day during the whole period. This
data set comprises totally 1,195,254 distinct sources, which
have sent about 3,423,577 packets to the sensors. By using
the technique described in [20], we have extracted 351 attack events that were somehow coordinated on at least two
different sensors. This reduced set of attack events still accounts for 282,363 unique sources (23.6% of the data set),
or 741,349 packets (21.5%).
For the set of attack characteristics considered above, we
applied our clique-based clustering on those attack events.
Table 1 on page 5 presents a high-level overview of the
cliques obtained for each attack dimension separately. As we
can see, a relatively high volume of sources could be classified
into cliques for each dimension. The last column with the most
prevalent patterns gives an indication of which countries or
class A-subnets (e.g., originating or targeted IP subnets)
are most commonly observed in the cliques that lie in the
upper quartile with respect to the number of sources. Interestingly, it seems that many coordinated attack events are
coming from a given IP subspace. Regarding the targeted
platforms, several cliques involve a single class A-subnet.
About the type of activities, we can observe some commonly targeted ports (e.g., Windows ports used for SMB
or RPC, or SQL and VNC ports), but also a large number of uncommon high TCP ports that are normally unused
on standard (and clean) machines (such as 6769T, 50286T,
9661T, . . . ). A non-negligible volume of sources is also due
to UDP spammers targeting Windows Messenger popup service (ports 1026 to 1028/UDP).
2.4 Consolidation of the Knowledge
In order to assess the consistency of the resulting cliques of
attack events, it can be useful to see them charted on a two-dimensional map so as to i) verify the proximities among
clique members (intra-clique consistency), and ii) understand potential relationships between different cliques that
are somehow related (i.e. inter-clique relationships). Moreover, the statistical distances used to compute those cliques
make them intrinsically coherent, which means also that certain cliques of events may be somehow related to each other,
although they were separated by the clique algorithm.
Since most of the feature vectors we are dealing with have
a high number of variables (e.g., a geographical vector has
more than 200 country variables), obviously the structure of
such high-dimensional data set cannot be displayed directly
on a 2D map. Multidimensional scaling (MDS) is a set of
methods that can help to address this problem. MDS is
based on dimensionality reduction techniques, which aim at
converting a high-dimensional dataset into a two- or three-dimensional representation that can be displayed, for example, in a scatter plot. The aim of dimensionality reduction is
to preserve as much of the significant structure of the high-dimensional data as possible in the low-dimensional map.
As a consequence, MDS allows an analyst to visualize how
far observations are from each other for different kinds of
similarity measures, which in turn can deliver insights into
!"
%
&'(&)
*+(&)
&)(*+
&)(*+
&)(0122
&)(&'
&)(,#"
,-(&)
./(,&)(:4
&)(*+
&)(*+
*+(&)
*+(:4
$"
67('.
67(.-
"
!$"
!#"
$;
,-(0122
*+(&'
$"
,-(&'
*+(45 *+(33-(9.
*+(&'
*+(&)
*+(3*+(&)3-(&)
7+(9.
7+(9. 3-(9.
9.(95
9.(7+
9.(7+
="
*+(67
*+(0122
4,(,-
,-(&'
45(67
45(38*(45
9.(7+
9.(+7
<;
<"
;
9.(67
&'
!!" %
!!"
!#"
!$"
"
$"
#"
!"
"
Figure 2: Visualization of geographical cliques of attackers. The coloring refers to the different cliques and the
red circles indicate their sizes on the low-D map. The
superposed text labels indicate only the two top attacking
countries for some of the data points.
the underlying structure of the high-dimensional dataset.
Because of the intrinsic non-linearity of real-world data
sets, we applied a recent MDS technique called t-SNE to
visualize each dimension of the data set, and to assess the
consistency of the cliques results. t-SNE [28] is a variation
of Stochastic Neighbour Embedding; it produces significantly
better visualizations than other MDS techniques by reducing
the tendency to crowd points together in the centre of the
map. Moreover, this technique has proven to perform better in retaining both the local and global structure of real,
high-dimensional datasets in a single map, in comparison to
other non-linear dimensionality reduction techniques such
as Sammon mapping, Isomaps or Laplacian Eigenmaps [10].
Stochastic Neighbor Embedding aims at minimizing a cost
function that is based on the sum of Kullback-Leibler divergences over all datapoints using a gradient descent method.
t-SNE further improves this technique by using an initial
Student-t distribution, rather than a Gaussian, to compute
the similarity between two points in the low-dimensional
space (which tends to alleviate the problem of “crowding”
points in the center of the map, see [28] for a detailed explanation).
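A hedged sketch of this visualization step, using scikit-learn's off-the-shelf t-SNE implementation on stand-in data (the paper does not specify which implementation or parameters were used, so everything below is an assumption for illustration):

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in data: one 200-dimensional country distribution per attack event
# (random here; in practice these are the real feature vectors).
rng = np.random.default_rng(0)
feature_vectors = rng.dirichlet(np.ones(200), size=351)

# Map the 351 events onto a 2D plane while preserving local structure.
coords = TSNE(n_components=2, perplexity=30.0, init="random",
              random_state=0).fit_transform(feature_vectors)
# coords has shape (351, 2): one (x, y) point per attack event, ready to
# be colored by clique membership in a scatter plot such as Figure 2.
```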
Figure 2 shows the resulting two-dimensional plot obtained by mapping the geographical vectors on a 2D map
using t-SNE. Each datapoint on this map represents the geographical distribution of a given attack event. The coloring
refers to the clique membership of each event, as obtained
previously by applying the clique-based clustering, and the
dotted circles indicate the clique sizes. We could easily verify
that two adjacent events on the map have highly similar geographical distributions (even from a statistical viewpoint),
while two distant events have clearly nothing in common
in terms of originating countries. Quite surprisingly, the
resulting mapping is far from being chaotic; it presents a
relatively sparse structure with clear datapoint groupings,
which means also that most of those attack events present
very tight relationships regarding their origins.
Attack Dimension | Nr of Cliques | Max. size (nr events) | Min. size (nr events) | Volume of sources (%) | Most prevalent patterns found in the cliques (1)
Geolocation | 31 | 40 | 3 | 84.4 | ⟨CN,CA,US,FR,TW⟩, ⟨IT,ES,FR,SE,DE,IL⟩, ⟨KR,US,BR,PL,CN,CA⟩, ⟨US,JP,GB,DE,CA,FR,CN,KR⟩, ⟨US,FR,JP,CN,DE,ES,TW⟩, ⟨CA,CN⟩, ⟨PL,DE,ES,HU,FR⟩
IP Subnets (Class A) | 25 | 51 | 3 | 91.2 | ⟨87,82,151,83,84,81,85,213⟩, ⟨222,221,60,218,58,24,124,121,219,82,220⟩, ⟨201,83,200,24,211,218,89,124,61,82,84⟩, ⟨24,60⟩, ⟨83,84,85,80,88⟩, ⟨193,195,201,202,203,216,200,61,24,84,59⟩
Targeted platforms | 17 | 86 | 2 | 70.1 | ⟨202⟩, ⟨88,192⟩, ⟨195⟩, ⟨193⟩, ⟨194⟩, ⟨129,134,139,150⟩, ⟨24,213⟩
Port sequences | 22 | 66 | 4 | 93.2 | ⟨I⟩, ⟨1433T⟩, ⟨I-445T⟩, ⟨5900T⟩, ⟨1026U⟩, ⟨135T⟩, ⟨50286T⟩, ⟨I-445T-139T-445T-139T-445T⟩, ⟨6769T⟩, ⟨1028U-1027U-1026U⟩

Table 1: Some experimental clique results obtained from a honeynet dataset collected from Sep 06 until June 08. (1) The given patterns represent the average distributions for the most prevalent cliques, i.e. the ones lying in the upper quartile in terms of number of sources. For the IP subnets (resp. targeted platforms), the numbers refer to the distributions of originating (resp. targeted) class A-subnets.
!"
%
&##'* &
&
&&
&##'*
'(!)*
#"
!!##*
#!!$*
+',!#*
$"
',""*
',""*
(#++*
($$,+*
!$"
!#"
$#!'+*
#!!$*
#!!$*
!$((*
'+)#$* !.!,*
$,())*
$'
& &&
&
$"
',""*
("$'*
',""*
"
+"
##'*
',""*
$,!.*##'*
(+'*
&##'*(+,*##'*--$,!.* $,!)*
$,!.*
$,!)*
("$!/
(#++*
('
("
(+'*
',""*
!(+#*
'
("$./("$!/("$)/
!!" %
!!"
!#"
!$"
"
$"
#"
!"
"
Figure 3: Same visualization of the geographical cliques
of attackers as Fig. 2, but here the superposed text labels
indicate the port sequences targeted by the attackers.
Due to the strict statistical distances used to calculate cliques, this kind
of correlation can hardly be obtained by chance only.
Similar “semantic mapping” can naturally be obtained for
the other dimensions (e.g., subnets, platforms, etc), so as to
help assess the quality of other cliques of attackers. To
conclude this Section, Figure 3 shows the same geographical mapping on which the port sequences of several attack
events have been superposed on top of the datapoints. This
can help to visualize unobvious relationships among different types of activities and their origins, and it leads also
to the natural intuition that an intelligent algorithm could
potentially leverage the results of this knowledge discovery
process, by combining efficiently different sets of cliques.
3. MULTI-CRITERIA DECISION-MAKING
3.1 Requirements and Motivation
The decision-support component of our method shall take
advantage of the knowledge obtained via the extraction of
cliques, and of the global semantic mappings obtained through
dimensionality reduction. The final objective consists in
re-constructing sequences of attack events that can be attributed with a high confidence to the same root phenomenon
as a function of multiple criteria. In other words, we want
to build an inference engine that takes as input the extracted knowledge to classify incoming attack events into
either “known phenomena”, or otherwise to identify a new
phenomenon when needed (e.g., when we observe the first
attack event of a new zombie army). There certainly exist
many different classification algorithms that are able to map
multiple input features to multiple output classes, even for
complex, non-linear mappings, such as Support Vector Machines, Artificial Neural Networks, etc. However, we are confronted with specific constraints that do not allow us to use this
type of supervised machine learning techniques. First, we
have a priori zero-knowledge of the expected output, which
means that we cannot provide training samples showing the
characteristics of the output we are looking for. Secondly,
we want to include some domain knowledge to specify which
type of combinations we expect to be promising in the root
cause identification. Third, the inference system must be
flexible enough to allow additional criteria to be used in
the future, so as to further improve the inference capabilities. Finally, we favor the “white-box” approach having a
transparent reasoning process, which allows an expert to understand the reasons (i.e., the combinations of criteria) for
which the system has grouped a given set of events into the
same root phenomenon.
Although large-scale phenomena on the Internet are complex and dynamic, our intuition is that two consecutive attack events should be linked to the same root phenomenon
if and only if they share at least two different attack characteristics. That is, we want to build a decision-making
process that will attribute two attack events to the same
phenomenon when the events features are “close enough” for
any combination of at least two attack dimensions out of the
complete set of criteria: {origins, targets, activity, commonIP }.
So, we hypothesize that real-world phenomena may perfectly
evolve over time, which means that two consecutive attack
events of the same zombie army must not necessarily have
all their attributes in common. For example, the bots composition of a zombie army may evolve over time because
of the cleaning of infected machines and the recruitment of
new bots. From our observation viewpoint, this will translate into a certain shift in the IP subnet distribution of the
zombie machines for subsequent attack events of this army
(and thus, most probably different cliques w.r.t. the origins). Or, a zombie army may be instructed to scan several consecutive IP subnets in a rather short interval of time, which will lead to the observation of different events having highly similar distributions of originating countries and subnets, but those events will target completely different sensors, and may possibly use different exploits (hence, targeting different port sequences).

Figure 4: Main components of a Fuzzy System. (Diagram: the inputs are crisp numbers; input variables are fuzzified and all rules are evaluated with pre-defined membership functions; all rule outputs zi are combined with an aggregation method.)
On the other hand, we consider that only one correlated
attack dimension is not sufficient to link two attack events
to the same root cause, since the result might then be due
to chance only (e.g., a large proportion of attacks originate
from some large or popular countries, certain Windows ports
are commonly targeted, etc). However, by combining intelligently several attack viewpoints, we can reduce considerably
the probability that two attack events would be attributed
to the same root cause whereas they are in fact unrelated.
3.2 Fuzzy Inference Systems
We still need to formally define what the “relatedness degree” between two attack events is, especially when they do not belong to the same clique but are somehow “close” to each
other. Intuitively, attack events characteristics in the real
world have unsharp boundaries, and the membership to a
given phenomenon can be a matter of degree. For this reason, we have developed a decision-making process that is
based on a fuzzy inference system (FIS). The mathematical concepts behind fuzzy reasoning are quite simple and
intuitive; in fact, it aims at reproducing the reasoning of
a human expert with very simple mathematical functions.
Fuzzy inference is thus a convenient way to map an input
space to an output space with a flexible and extensible system, and using the codification of common sense and expert
knowledge. The mapping then provides a basis from which
decisions can be made.
The main components of an inference system are sketched
in Fig. 4. To map the input space to the output space,
the primary mechanism is a list of if-then statements called
rules, which are evaluated in parallel, so the order of the
rules is unimportant. Instead of using crisp variables, all
inputs are fuzzified using membership functions in order to
determine the degree to which the input variables belong to
each of the appropriate fuzzy sets. If the antecedent of a
given rule has more than one part (i.e., multiple ’if’ statements), a fuzzy logical operator is applied to obtain one
number that represents the result of the antecedent for that
rule. For example, the fuzzy OR operator simply selects the
maximum of the two values. The results of all rules are then
combined and distilled into a single, crisp value that can be
used to make a decision. This aggregation process can be
done in two different ways. Mamdani’s inference [16] expects
"%$
"%#
"%!
"2
!!"
2
;*'
;*!
'
'
"%&
678*297:17:2
!
2
;*'
;*!
'
"%&
)*+,*-./01
!"#$!%&!&'(()!*'+#,-!#./.0!!
!1'+#!20!3&!!"!3,!#!456!$"!3,!%!$7#5!&"!3,!'!!
!!!1'+#!80!3&!!(!3,!#)!$7#5!&(!3,!')!!
!!!!
!!!1'+#!50!9!
!!
)*+,*-./01
:5;'$,!
<=2-)2>!
=8!
<=?-)?>!
9!
"%$
"%#
"%!
"
!"
('
#"
$"
"2
"%&
"%$
"%#
"%!
!#"
!3"
!!"
!'"
"
"
5
4'
Figure 5: Fuzzy rule evaluation.
the output membership functions to be also fuzzy sets. After
the aggregation process, there is a fuzzy set for each output
variable that needs defuzzification, for instance by computing the centroid of the output function. In a Sugeno-type inference system [25], by contrast, the output membership functions
are either linear or constant. The general form of a rule in a
Sugeno fuzzy model is: if Input1 is x and Input2 is y then
Output is z = a.x + b.y + c. For a zero-order Sugeno model,
the output level z is a constant (a=b=0). The output level
zi of each rule is weighted by the firing strength wi of the
rule. The most common way to calculate the final output of
the system is the weighted average of all rule outputs:
$$\text{Final output} = \frac{\sum_i w_i z_i}{\sum_i w_i}$$
When it is possible to model a fuzzy system using Sugeno-type inference, the defuzzification and aggregation process
is thus greatly simplified and much more efficient than with
Mamdani’s inferences, which is why we used a Sugeno-type
system to model each attack phenomenon.
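A minimal sketch of this zero-order Sugeno aggregation (illustrative Python with made-up firing strengths, not the authors' implementation):

```python
def sugeno_output(firing_strengths, output_levels):
    """Zero-order Sugeno aggregation: weighted average of the constant
    rule output levels z_i, weighted by the firing strengths w_i."""
    num = sum(w * z for w, z in zip(firing_strengths, output_levels))
    den = sum(firing_strengths)
    return num / den if den > 0 else 0.0

# Two rules firing with strengths 0.16 and 0.8 toward levels 1.0 and 0.2:
print(sugeno_output([0.16, 0.8], [1.0, 0.2]))  # ~0.33
```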
Concretely, we use the knowledge obtained from the extraction of cliques to build the fuzzy rules that describe the
behavior of each phenomenon. The characteristics of new
incoming attack events are then used as input to the fuzzy
systems that model the phenomena identified so far. In
each of those fuzzy systems, the features of the most recent attack event shall define the current parameters of the
membership function used to evaluate the following simple
rules: if xi is close AND if yi is close then zi is related,
∀i ∈ {geo, subnets, targets, portsequence}. Fig. 5 gives a
graphical representation of how such a rule is evaluated for
the subnets of origins of two given attack events. Since this
characteristic is represented by a 2D mapping, we can see the
result of evaluating the relative position of the events according to both dimensions (x, y). Each membership function is
maximal within the cliques, then it decreases smoothly to
take into account the fuzziness of real-world phenomena. In
this case, the antecedents of the rule hold respectively 0.16
and 1.0, which results in an output of 0.16 (since a logical
AND in fuzzy logic corresponds to the MIN operator).

Figure 5: Fuzzy rule evaluation.
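In code, evaluating such a rule antecedent reduces to taking a minimum; a tiny sketch with the membership degrees from the example above:

```python
# Membership degrees returned by the two "is close" functions of rule i
# (0.16 and 1.0, as in the example above; the values are illustrative):
mu_x, mu_y = 0.16, 1.0
z_i = min(mu_x, mu_y)  # fuzzy AND = MIN operator, so z_i = 0.16
```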
So, the membership functions referred to as “is close”
in the fuzzy rules are defined by the characteristics of the
cliques to which the attack events belong. The calculation
of the rule output zi ∈ [0, 1] is just the intersection between
the two curves, which quantifies the inter-relationship between the cliques (and hence, between the attack events).
Similarly, we can evaluate the fuzzy rules for the other dimensions considered in the inference system. For the last
dimension, i.e. the common IP’s, we use a static membership function whose input is the common IP ratio calculated
between the two events. Fig. 6 represents this static membership function, where we can see the output Z_IP increasing smoothly as the ratio of common IP addresses increases from 0 to 10%, where Z_IP is then maximal. This curve is actually drawn from our knowledge, or domain experience, in monitoring malicious traffic.

Figure 6: Common IP membership function.
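A piecewise-linear approximation of this curve is easy to sketch (the actual function in Fig. 6 rises smoothly, so the linear ramp below is only an assumption for illustration):

```python
def common_ip_membership(common_ip_ratio_percent):
    """Membership for the common-IP criterion: 0 at a 0% ratio, rising to
    the maximum of 1 at 10% and staying maximal beyond. A linear ramp is
    used here; the actual curve of Fig. 6 rises smoothly."""
    return min(max(common_ip_ratio_percent, 0.0) / 10.0, 1.0)
```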
Note that, initially, the inference engine has no knowledge, so the first incoming attack event will create the first
phenomenon. Then, each time a new event could not be
attributed to an existing phenomenon, the inference engine
will create a new fuzzy system to model this new emerging
phenomenon. The inference engine is thus self-adaptive by
design.
3.3 Multi-criteria Decision-making
Having formally defined how to evaluate the output of
each rule, for each phenomenon, a last problem remains regarding the weighted average that is used as aggregation
function in a classical Sugeno inference system. In fact, it
does not allow us to express that certain combinations of
criteria (or rule outputs) must be somehow prioritized, as
previously described in the requirements. We thus need to introduce another type of multi-criteria aggregation function that allows us to model more complex requirements such
as “most of”, or “at least two” criteria to be satisfied in the
overall decision function. Yager has introduced in [29] a special type of operator called Ordered Weighted Aggregation
(OWA), which allows one to include some relationships between
multiple criteria in the aggregation process. An OWA operator provides an aggregation function for criteria whose result
lies between the classical “and” and “or” operators, which are
in fact the two extreme cases. Assume Z1 , Z2 , . . . , Zn are n
criteria of concern in our multi-criteria problem. For each criterion, Zi(x) ∈ [0, 1] indicates the degree to which x satisfies that criterion, which corresponds in our case to the rule
output of a given fuzzy system. Then, we define a mapping
function F : I n → I where I = [0, 1] as an OWA operator
of dimension n, if associated with F is a weighting vector
W = (W1 , W2 , . . . , Wn ) such that
1. $W_i \in [0, 1]$
2. $\sum_i W_i = 1$
and where
$$F(z_1, z_2, \ldots, z_n) = W_1 z_1' + W_2 z_2' + \ldots + W_n z_n'$$
with $z_i'$ being the $i$-th largest element in the collection $z_1, \ldots, z_n$. That is, $Z'$ is an ordered vector composed of the elements
of Z put in descending order, which means that the weights
Wi are associated with a particular ordered position rather
than a particular element. Yager [29] has carefully studied
the mathematical foundations of OWA operators, and he
demonstrated that such operators have the desired properties such as monotonicity, generalized commutativity, associativity and idempotence. To define the weights Wi to be
used, Yager suggests two possible approaches: either to use
some learning mechanism with sample data and a regression
model, or to give some semantics or meaning to the Wi ’s by
asking a decision-maker to provide directly those values. We
selected the latter approach by defining the weighting vector
as W = (0.1, 0.35, 0.35, 0.1, 0.1), which translates our intuition about the dynamic behaviors of large-scale phenomena.
It can be interpreted as: “at least three criteria must be satisfied, but the first criterion is of less importance compared
to the 2nd and 3rd ones”. These values were carefully chosen in order to avoid the grouping of unrelated events when,
for example, two events are coming from popular countries
and targeting common (Windows) ports in the same interval of time, but those events are in reality not related to
the same phenomenon. In this worst-case scenario, we can
imagine that the ordered vector of criteria (obtained from
the evaluation of the fuzzy rules) could be something similar to Z = (0.3, 0.1, 0, 1, 0). That is, we have a high correlation for the targeted port sequences (z4 = 1), and we
have then some weak correlation (due to chance) for the
geographical origins (z1 = 0.3) and also for the subnets
of origins (z2 = 0.1). By applying our weighting vector
W to Z′ = (1, 0.3, 0.1, 0, 0), we get as final decision value
F = 1 ∗ 0.1 + 0.3 ∗ 0.35 + 0.1 ∗ 0.35 = 0.24. By considering
other scenarios, we can verify that the values of the weighting vector W work as expected, i.e. they minimize the final
output value in these cases. Moreover, these considerations
enable us also to fix our decision threshold to an empirical
value of about 0.25. That is, when the final output value
F lies under this threshold, we will reject the attribution of
the attack event under scrutiny to the current phenomenon
whose fuzzy system is being evaluated. Finally, when several fuzzy systems provide an output value lying above the
threshold, we will obviously choose the highest one to attribute the event; however, this case was rarely observed in
our experiments. There certainly exist other alternatives
for choosing the Wi ’s, but according to our experimental results, this choice proved to be very effective in identifying
sequences of attack events having the same root cause.
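The whole decision step is compact enough to sketch end-to-end; the code below (illustrative Python, not the authors' implementation) reproduces the worst-case example above with the weighting vector W = (0.1, 0.35, 0.35, 0.1, 0.1) and the empirical threshold of 0.25:

```python
def owa(z, w):
    """Ordered Weighted Aggregation: sort the criteria scores in
    descending order, then take the dot product with the weights w."""
    z_sorted = sorted(z, reverse=True)
    return sum(wi * zi for wi, zi in zip(w, z_sorted))

W = (0.1, 0.35, 0.35, 0.1, 0.1)  # weighting vector from the text
Z = (0.3, 0.1, 0.0, 1.0, 0.0)    # worst-case scenario from the text
F = owa(Z, W)                    # = 1*0.1 + 0.3*0.35 + 0.1*0.35 = 0.24
print(F, F < 0.25)               # 0.24 True -> attribution rejected
```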
4. BEHAVIORAL ANALYSIS OF GLOBAL PHENOMENA
4.1 Main Characteristics
In this Section, we provide some experimental results obtained by applying our multi-criteria inference method to
the same set of attack events we already introduced in Section 2.3 (clique analysis). As already mentioned, these experimental results only aim at validating the applicability
and usefulness of the method proposed. They do not pretend to offer a complete view of all possible phenomena observable on the Internet. On the contrary, they show that, even with a limited number of data sources, it is possible to observe and reason about a couple of interesting phenomena. Furthermore, these anecdotal, yet representative, examples show that our method helps in characterizing their root cause, i.e., in addressing the attack attribution issue.

Figure 7: Empirical CDF of the size and lifetime of zombie armies.

¹ It is important to note that the sizes of the zombie armies given here only reflect the number of sources we could observe on our sensors; the actual sizes of those armies are most probably much larger, even though some churn effects (DHCP, NAT) could also affect these numbers.
So, over the whole collection period (640 days), we found
about 32 global phenomena. In total, 348 attack events
(99% of our data set) could be attributed to a given large-scale phenomenon. An in-depth analysis has revealed that
most of those phenomena (apart from the noisy network
worm W32.Rahack.H [24], also known as W32/Allaple) are
quite likely related to zombie armies, i.e., groups of compromised machines belonging to the same botnet(s). We conjecture this for the following main reasons: i) the apparent
coordination of the sources, both in time (i.e., coordinated
events on several sensors) and in the distribution of tasks
(e.g., scanners versus attackers); ii) the short durations of
the attack events, typically a few days only, whereas “classical” worms tend to spread over longer, continuous periods
of time; iii) the absence of known classical network worm
spreading on many of the observed port sequences; and iv)
the source growing rate, which has a sort of exponential
shape for worms and is somehow different for botnets [13].
To illustrate the results, Table 2 on page 10 presents an
overview of some global phenomena found in our dataset.
Thanks to our method, we are able to characterize precisely
the behaviors of the identified phenomena or zombie armies.
Hence, we found that the largest army had in total 57 attack events comprising 69,884 sources, and could survive for
about 112 days. The longest lifetime of a zombie army observed so far reached 586 days. Fig. 7 shows the cumulative
distributions (CDF) of the lifetime and size of the identified armies. Those figures reveal some interesting aspects
of their global behaviors: according to our observations, at
least 20% of the zombie armies had in total more than ten
thousand observable¹ sources during their lifetime, and the
same proportion of armies could survive on the Internet for
at least 250 days. On average, zombie armies have a total
size of about 8,500 observed sources, a mean number of 658
sources per event, and their mean survival time is 98 days.
Regarding the origins, we observe some very persistent
groups of IP subnets and countries of origin across many
different armies. On Fig. 8, we can see the CDF of the
sources involved in the zombie armies of Table 2, where the
x-axis represents the first byte of the IPv4 address space.
It appears clearly that malicious sources involved in those
phenomena are highly unevenly distributed and form a relatively small number of tight clusters, which account for a
significant number of sources and are thus responsible for
a great deal of the observed malicious activities. This is
consistent with other prior work on monitoring global malicious activities, in particular with previous studies related
to measurements of Internet background radiation [4, 17,
31]. However, we are now able to show that there are still
some notable differences in the spatial distributions of those
zombie armies with respect to the average distribution over
"*'
7.4@2A4
B7%
B7)
B7'
B7#
B7$
B7%"
B7%%
B7%&
B7&"
"*)
"*!
"*&
"*%
"/
!"
#"
$"
%&"
%'"
%("
&%"
&)"
,-.)/01234/536200/7!089:4;0<
Figure 8: Empirical CDF of sources in IPv4 address
space for the 9 zombie armies illustrated in Table 2.
all sources (represented with the blue dashed line). In other
words, certain armies of compromised machines can have
very different spatial distributions, even though there is a
large overlap between “zombie-friendly” IP subnets. Moreover, because of the dynamics of this kind of phenomena, we
can even observe very different spatial distributions within
a same army at different moments of its lifetime. This is a
strong advantage of our analysis method that is more precise
and enables us to distinguish individual phenomena, instead
of global trends, and to follow their dynamic behavior over
time.
Another interesting observation on Fig. 8 is related to
the subnet CDF of ZA1 (uniformly distributed in the IPv4
space, which means randomly chosen source addresses) and
ZA20 (a constant distribution coming exclusively from the
subnet 24.0.0.0/8). A very likely explanation is that those
zombie armies have used spoofed addresses to send UDP
spam messages to the Windows Messenger service. So, this
indicates that IP spoofing is still possible under the current
state of filtering policies implemented by certain ISP’s on
the Internet.
Figure 9: Output of the fuzzy inference system (zi and F(zi)) modeling the zombie army nr 12.

Figure 10: Time series of coordinated attack events for zombie army ZA10 (i.e., nr of sources observed by day).

Finally, in terms of attack capability, we observe that about 50% of the armies could target at least two completely different ports (thus, probably two different exploits, at least),
and one army had even an attack capability greater than 10
(ZA4 in Table 2). At this stage, it is unclear why a zombie army would target such a large number of unusual, high
TCP ports (12293T, 15264T, etc). A recurrent misconfiguration or P2P phenomenon is thus not excluded; but even in
that case, it is very interesting to note that our method was
able to attribute all those different events to the same root
phenomenon, thanks to the combination of several statistical
metrics.
we can see that this army had four waves of activity during
which it was randomly scanning 5 different subnets (note
the almost perfect coordination among those attack events).
When inspecting the subnet distributions of those different
attack waves, we could clearly observe a drift in the origins
of those sources, quite likely as certain machines were infected by (resp. cleaned from) the bot software. Finally, we
found another smaller army (ZA11) that is clearly related to
ZA10 (e.g., same temporal behavior, similar activity, same
targets); but in this case, a different group of zombie machines (resulting in very different subnet CDFs on Fig. 8)
was used to attack only specific IP addresses on our sensors,
probably by taking advantage of the results given by the
army of scanners (ZA10).
4.2 Some Detailed Examples
In this Section, we further detail two zombie armies to
illustrate some typical behaviors we could observe among
the identified phenomena, e.g.:
i) a move (or drift) in the origins of certain armies (both
geographical and IP blocks) during their lifetime;
ii) a large scan sweep by the same army targeting several
consecutive class A-subnets;
iii) within a same army, multiple changes in the port sequences (or exploits) used by zombies to scan or to
attack;
iv) a coordination between different armies.
Zombie army 12 (ZA12) is an interesting case in which we
can observe the behaviors ii) and iii). Fig. 9 represents the
output of the fuzzy system modeling this phenomenon. Each
bar graph represents the fuzzy output zi for a given attack
dimension, whereas the last plot shows the final aggregated
output from which the decision to group those events together was made (i.e., F (zi )). We can clearly see that the
targets and the activities of this army have evolved between
certain attack events (e.g., when the value of zi is low).
That is, this army has been scanning (at least) four consecutive class A-subnets during its lifetime (still 183 days),
while probing at the same time three different ports on these
subnetworks.
Then, the largest zombie army observed by the sensors
(ZA10) has showed the behaviors i) and iv). On Fig. 10,
5. CONCLUSIONS
We have introduced a general analysis method to address
the complex problem related to “attack attribution”. Our
approach is based on a novel combination of knowledge discovery and a multi-criteria fuzzy decision-making process.
By applying this method, we have shown how apparently
unrelated attack events could be attributed to the same
global attack phenomenon, or to the same army of zombie
machines operating in a coordinated manner. To the best of
our knowledge, this is the first formal, systematic and rigorous method that enables us to identify and characterize
precisely the behaviors of those large-scale attack phenomena. As future work, we envisage extending our method to
other data sets, such as high-interaction (possibly client)
honeypot data, or malware data sets, and to include even
more relevant attack features so as to improve further the inference capabilities of the system, and thus also our insights
into malicious behaviors observed on the Internet.
Acknowledgments
This work has been partially supported by the European Commission through project FP7-ICT-216026-WOMBAT funded by
the 7th framework program. The opinions expressed in this paper
are those of the authors and do not necessarily reflect the views
of the European Commission.
Id | Nr of events | Total size (nr sources) | Lifetime (nr days) | Targeted sensors (Class A-subnets) | Attack capability | Main origins (countries / subnets)
1 | 10 | 18,468 | 535 | 24.*,193.*,195.*,213.* | 1026U | US,JP,GB,DE,CA,FR,CN,KR,NL,IT / 69,128,195,60,81,214,211,132,87,63
4 | 82 | 26,962 | 321 | 202.* | 12293T,15264T,18462T,25083T,25618T,28238T,29188T,32878T,33018T,38009T,4152T,46030T,4662T,50286T,. . . | IT,ES,DE,FR,IL,SE,PL / 87,82,83,84,151,85,81,88,80
5 | 13 | 9,644 | 131 | 195.* | 135T,139T,1433T,2968T,5900T | CN,US,PL,IN,KR,JP,FR,MX,CA / 218,61,222,83,195,221,202,24,219
6 | 15 | 51,598 | >1 year | >7 subnets | ICMP (W32.Rahack.H / Allaple) | KR,US,BR,PL,CN,CA,FR,MX,TW / 201,83,200,24,211,218,89,124
9 | 23 | 11,198 | 218 | 192.*,193.*,194.* | 2967T,2968T,5900T | US,CN,TW,FR,DE,CA,BR,IT,RU / 193,200,24,71,70,213,216,66
10 | 57 | 69,884 | 112 | 128.*,129.*,134.*,139.*,150.* | I-I445T | CN,CA,US,FR,TW,IT,JP,DE / 222,221,60,218,58,24,70,124
11 | 14 | 2,636 | 110 | 129.*,134.*,139.*,150.* | I-445T-139T-445T-139T-445T | US,FR,CA,TW,IT / 82,71,24,70,68,88,87
12 | 14 | 27,442 | 183 | 192.*,193.*,194.*,195.* | 1025T,1433T,2967T | US,JP,CN,FR,TR,DE,KR,GB / 218,125,88,222,24,60,220,85,82
20 | 10 | 30,435 | 337 | 24.*,129.*,195.* | 1026U,1026U1028U1027U,1027U | CA,CN / 24,60

Table 2: Overview of some large-scale phenomena found in a honeynet dataset collected from Sep 06 until Jun 08.
6. REFERENCES
[1] Paul Barford and David Plonka. Characteristics of network
traffic flow anomalies. In In Proceedings of ACM
SIGCOMM Internet Measurement Workshop, 2001.
[2] Paul Barford and Vinod Yegneswaran. An Inside Look at
Botnets. Advances in Information Security. Springer, 2006.
[3] David Barroso. Botnets - the silent threat. In European
Network and Information Security Agency (ENISA),
November 2007.
[4] Zesheng Chen, Chuanyi Ji, and Paul Barford.
Spatial-temporal characteristics of internet malicious
sources. In Proceedings of INFOCOM, 2008.
[5] M. P. Collins, T. J. Shimeall, S. Faber, J. Janies,
R. Weaver, M. De Shon, and J. Kadane. Using
uncleanliness to predict future botnet addresses. In IMC
’07: Proceedings of the 7th ACM SIGCOMM conference on
Internet measurement, pages 93–104, New York, NY, USA,
2007. ACM.
[6] Evan Cooke, Farnam Jahanian, and Danny McPherson.
The Zombie roundup: Understanding, detecting, and
disrupting botnets. In Proceedings of the Steps to Reducing
Unwanted Traffic on the Internet (SRUTI 2005
Workshop), Cambridge, MA, July 2005.
[7] B. Fuglede and F. Topsoe. Jensen-Shannon divergence and Hilbert space embedding. In Proceedings of the IEEE International Symposium on Information Theory, page 31, June 2004.
[8] G. Gu, R. Perdisci, J. Zhang, and W. Lee. BotMiner:
Clustering analysis of network traffic for protocol- and
structure-independent botnet detection. In Proceedings of
the 17th USENIX Security Symposium, 2008.
[9] Guofei Gu, Junjie Zhang, and Wenke Lee. BotSniffer:
Detecting botnet command and control channels in network
traffic. In Proceedings of the 15th Annual Network and
Distributed System Security Symposium (NDSS’08),
February 2008.
[10] Geoffrey Hinton and Sam Roweis. Stochastic neighbor
embedding. In Advances in Neural Information Processing
Systems 15, volume 15, pages 833–840, 2003.
[11] A.K. Jain and R.C. Dubes. Algorithms for Clustering Data.
Prentice-Hall advanced reference series, 1988.
[12] S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22:79–86, 1951.
[13] Wenke Lee, Cliff Wang, and David Dagon, editors. Botnet
Detection: Countering the Largest Security Threat,
volume 36 of Advances in Information Security. Springer,
2008.
[14] C. Leita, V.H. Pham, O. Thonnard, E. Ramirez-Silva, F. Pouget, E. Kirda, and M. Dacier. The Leurre.com Project: Collecting Internet Threats Information Using a Worldwide Distributed Honeynet. In Proceedings of the WOMBAT Workshop on Information Security Threats Data Collection and Sharing, WISTDCS 2008. IEEE Computer Society Press, April 2008.
[15] J. Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, Jan 1991.
[16] E. H. Mamdani and S. Assilian. An experiment in linguistic synthesis with a fuzzy logic controller. Int. J. Hum.-Comput. Stud., 51(2):135–147, 1999.
[17] Ruoming Pang, Vinod Yegneswaran, Paul Barford, Vern Paxson, and Larry Peterson. Characteristics of internet background radiation. In IMC ’04: Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement, pages 27–40, New York, NY, USA, 2004. ACM.
[18] Paul Bächer, Thorsten Holz, Markus Kötter, and Georg Wicherski. Know your enemy: Tracking botnets. http://www.honeynet.org/papers/bots/.
[19] M. Pavan and M. Pelillo. A new graph-theoretic approach to clustering and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2003.
[20] V. Pham, M. Dacier, G. Urvoy-Keller, and T. En-Najjary. The quest for multi-headed worms. In DIMVA 2008, 5th Conference on Detection of Intrusions and Malware & Vulnerability Assessment, Paris, France, July 2008.
[21] F. Pouget and M. Dacier. Honeypot-based forensics. In AusCERT2004, AusCERT Asia Pacific Information Technology Security Conference, Brisbane, Australia, May 2004.
[22] The Leurre.com Project. http://www.leurrecom.org.
[23] M. Abu Rajab, J. Zarfoss, F. Monrose, and A. Terzis. A multifaceted approach to understanding the botnet phenomenon. In IMC ’06: Proceedings of the 6th ACM SIGCOMM Conference on Internet Measurement, pages 41–52, New York, NY, USA, 2006. ACM.
[24] Symantec Security Response. W32.Rahack.H, April 2009.
[25] Michio Sugeno. Industrial Applications of Fuzzy Control. Elsevier Science Inc., New York, NY, USA, 1985.
[26] Olivier Thonnard and Marc Dacier. A framework for attack patterns’ discovery in honeynet data. In DFRWS 2008, 8th Digital Forensics Research Conference, Baltimore, MD, USA, August 2008.
[27] Olivier Thonnard and Marc Dacier. Actionable knowledge discovery for threats intelligence support using a multi-dimensional data mining methodology. In ICDM’08, 8th IEEE International Conference on Data Mining, Pisa, Italy, December 2008.
[28] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, November 2008.
[29] Ronald R. Yager. On ordered weighted averaging
aggregation operators in multicriteria decisionmaking.
IEEE Trans. Syst. Man Cybern., 18(1):183–190, 1988.
[30] V Yegneswaran, P Barford, and V Paxson. Using
honeynets for internet situational awareness. In Fourth
ACM Sigcomm Workshop on Hot Topics in Networking
(Hotnets IV), 2005.
[31] Vinod Yegneswaran, Paul Barford, and Johannes Ullrich.
Internet intrusions: global characteristics and prevalence.
In SIGMETRICS, pages 138–147, 2003.
Malware Detection using Statistical Analysis
of Byte-Level File Content
S. Momina Tabish, M. Zubair Shafiq, Muddassar Farooq
Next Generation Intelligent Networks Research Center (nexGIN RC)
National University of Computer & Emerging Sciences (FAST-NUCES)
Islamabad, 44000, Pakistan
{momina.tabish,zubair.shafiq,muddassar.farooq}@nexginrc.org
ABSTRACT
Commercial anti-virus software is unable to provide protection against newly launched (a.k.a. “zero-day”) malware. In this paper, we propose a novel malware detection technique which is based on the analysis of byte-level file content. The novelty of our approach, compared with existing content-based mining schemes, is that it does not memorize specific byte sequences or strings appearing in the actual file content. Our technique is non-signature based and therefore has the potential to detect previously unknown and zero-day malware. We compute a wide range of statistical and information-theoretic features in a block-wise manner to quantify the byte-level file content. We leverage standard data mining algorithms to classify the file content of every block as normal or potentially malicious. Finally, we correlate the block-wise classification results of a given file to categorize it as benign or malware. Since the proposed scheme operates on byte-level file content, it does not require any a priori information about the filetype. We have tested our proposed technique using a benign dataset comprising six different filetypes — DOC, EXE, JPG, MP3, PDF and ZIP — and a malware dataset comprising six different malware types — backdoor, trojan, virus, worm, constructor and miscellaneous. We also perform a comparison with existing data mining based malware detection techniques. The results of our experiments show that the proposed non-signature based technique surpasses the existing techniques and achieves more than 90% detection accuracy.
Categories and Subject Descriptors
D.4.6 [Security and Protection]: Invasive Software

General Terms
Experimentation, Security

Keywords
Computer Malware, Data Mining, Forensics

1. INTRODUCTION
Sophisticated malware is becoming a major threat to the usability, security and privacy of computer systems and networks worldwide [1], [2]. A wide range of host-based solutions have been proposed by researchers, and a number of commercial anti-virus (AV) products are also available in the market [5]–[21]. These techniques can broadly be classified into two types: (1) static and (2) dynamic. Static techniques mostly operate on machine-level code and disassembled instructions. In comparison, dynamic techniques mostly monitor the behavior of a program with the help of an API call sequence generated at run-time. The application of dynamic techniques in AV products is of limited use because of the large processing overheads incurred during run-time monitoring of API calls; as a result, the performance of computer systems significantly degrades. In comparison, the processing overhead is not a serious concern for static techniques because the scanning activity can be scheduled offline in idle time. Moreover, static techniques can also be deployed as an in-cloud network service that moves complexity from an end-point to the network cloud [28].
Almost all static malware detection techniques, including commercial AV software — either signature-, heuristic-, or anomaly-based — use specific content signatures such as byte sequences and strings. A major problem with content signatures is that they can easily be defeated by packing and basic code obfuscation techniques [3]. In fact, the majority of malware that appears today is a simple repacked version of old malware [4]. As a result, it effectively evades the content signatures of old malware stored in the databases of commercial AV products. To conclude, existing commercial AV products cannot even detect a simple repacked version of previously detected malware.
The security community has expended significant effort in the application of data mining techniques to discover patterns in malware content which are not easily evaded by code obfuscation techniques. The two most well-known data mining based malware detection techniques are ‘strings’ (proposed by Schultz et al [7]) and ‘KM’ (proposed by Kolter et al [8]). We take these techniques as a benchmark for the comparative study of our proposed scheme.
The novelty of our proposed technique — in contrast to the existing data mining based techniques — is its purely non-signature paradigm: it does not remember exact file contents for malware detection. It is a static malware detection technique which should be, intuitively speaking, robust to the most commonly used evasion techniques. The proposed technique computes a diverse set of statistical
and information-theoretic features in a block-wise manner
on the byte-level file content. The generated feature vector of every block is then given as an input to standard
data mining algorithms (J48 decision trees) which classify
the block as normal (n) or potentially malicious (pm). Finally, the classification results of all blocks are correlated
to categorize the given file as benign (B) or malware (M). If
a file is split into k equally-sized blocks $(b_1, b_2, b_3, \cdots, b_k)$ and n statistical features are computed for every block $b_k$ $(f_{k,1}, f_{k,2}, f_{k,3}, \cdots, f_{k,n})$, then mathematically our scheme can be represented as:

$$\begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_k \end{pmatrix} \overset{F}{\Rightarrow} \begin{pmatrix} f_{1,1}, f_{1,2}, \cdots, f_{1,n} \\ f_{2,1}, f_{2,2}, \cdots, f_{2,n} \\ \vdots \\ f_{k,1}, f_{k,2}, \cdots, f_{k,n} \end{pmatrix} \overset{D}{\Rightarrow} \begin{pmatrix} n/pm \\ n/pm \\ \vdots \\ n/pm \end{pmatrix} \overset{C}{\Rightarrow} B/M$$
where (F) is a suitable feature set and (D) is a data mining
algorithm for classification of individual blocks. The file
is eventually categorized as benign (B) or malware (M) by
the correlation module (C). Once a suitable feature set (F)
and a data mining algorithm (D) are selected, we test the
accuracy of the solution using a benign dataset consisting of
six filetypes: DOC, EXE, JPG, MP3, PDF and ZIP; and a malware
dataset comprising of six different malware types: backdoor,
trojan, virus, worm, constructor and miscellaneous. The
results of our experiments show that our scheme is able to
provide more than 90% detection accuracy¹ for detecting
malware, which is an encouraging outcome. To the best
of our knowledge, this is the first pure non-signature based
data mining malware detection technique using statistical
analysis of the static byte-level file contents.
The rest of the paper is organized as follows. In Section
2 we provide a brief overview of related work in the domain
of data mining malware detection techniques. We describe
the detailed architecture of our malware detection technique
in Section 3. We discuss the dataset in Section 4. We report the results of pilot studies in Section 5. In Section 6,
we discuss the knowledge discovery process of the proposed
technique by visualizing the learning models of data mining
algorithms used for classification. Finally, we conclude the
paper with an outlook to our future work.
¹Throughout this text, the terms detection accuracy and Area Under ROC Curve (AUC) are used interchangeably. The AUC (0 ≤ AUC ≤ 1) is used as a yardstick to determine the detection accuracy. Higher values of AUC mean a high true positive (tp) rate and a low false positive (fp) rate [30]. At AUC = 1, tp rate = 1 and fp rate = 0.

2. RELATED WORK
In this section we explain the details of the most relevant malware detection techniques: (1) ‘strings’ by Schultz et al [7], (2) ‘KM’ by Kolter et al [8] and (3) ‘NG’ by Stolfo et al [16], [17]. In our comparative study, we use ‘strings’ and ‘KM’ as benchmarks for comparison, whereas ‘NG’ effectively uses just one of the many statistical features used in our proposed technique.

2.1 Schultz et al — Strings
In [7], Schultz et al use several data mining techniques to distinguish between benign and malicious executables in Windows or MS-DOS format. They have done experiments on a dataset that consists of 1,001 benign and 3,265 malicious executables, of which 206 benign and 38 malicious samples are in the portable executable (PE) file format. They have collected most of the benign executables from Windows 98 systems. They use three different approaches to statically extract features from executables.
The first approach extracts DLL information inside PE executables. The DLL information is extracted using three types of feature vectors: (1) the list of DLLs (30 boolean values), (2) the list of DLL function calls (2,229 boolean values), and (3) the number of different function calls within each DLL (30 integer values). RIPPER — an inductive rule-learning algorithm — is used on top of every feature vector for classification. These schemes based on DLL information provide overall detection accuracies of 83.62%, 88.36% and 89.07%, respectively. Enough details about the DLLs are not provided, so we could not implement this scheme in our study.
The second feature extraction approach extracts strings from the executables using the GNU strings program. A Naïve Bayes classifier is used on top of the extracted strings for malware detection. This scheme provides an overall detection accuracy of 97.11%. This scheme is reported to give the best results amongst all, and we have implemented it for our comparative study.
The third feature extraction approach uses byte sequences (n-grams) extracted using hexdump. The authors do not explicitly specify the value of n used in their study; however, from an example provided in the paper, we deduce it to be 2 (bi-grams). The Multi-Naïve Bayes algorithm is used for classification. This algorithm uses voting by a collection of individual Naïve Bayes instances. This scheme provides an overall detection accuracy of 96.88%.
The results of their experiments reveal that the Naïve Bayes algorithm with strings is the most effective approach for detecting unseen malicious executables with reasonable processing overheads. The authors acknowledge the fact that the string features are not robust and can be easily defeated. Multi-Naïve Bayes with byte sequences also provides a relatively high detection accuracy; however, it has large processing and memory requirements. The byte sequence technique was later improved by Kolter et al and is explained below.

2.2 Kolter et al — KM
In [8], Kolter et al use n-gram analysis and data mining approaches to detect malicious executables in the wild.
They use n-gram analysis to extract features from 1,971 benign and 1,651 malicious PE files. The PE files have been
collected from machines running Windows 2000 and XP operating systems. The malicious PE files are taken from an
older version of the VX Heavens Virus Collection [27].
The authors evaluate their approach for two classification
problems: (1) classification between the benign and malicious executables, and (2) categorization of executables as
a function of their payload. The authors have categorized
only three types — mailer, backdoor and virus — due to the
limited number of malware samples.
Top n-grams with the highest information gain are taken
as binary features (T if present and F if absent) for every
PE file. The authors have done pilot studies to determine
the size of n-grams, the size of words and the number of
top n-grams to be selected as features. A smaller dataset
consisting of 561 benign and 476 malicious executables is
considered in this study. They have used 4-grams, one byte
word and top 500 n-grams as features.
Several inductive learning methods, namely instance-based
learner, Naïve Bayes, support vector machines, decision trees, and boosted versions of the instance-based learner, Naïve Bayes, support vector machines and decision trees, are used for classification. The same features are provided as input to all
classifiers. They report the detection accuracy as the area under the ROC curve (AUC), which is a more complete measure than raw detection accuracy [29]. The AUCs show that the boosted decision trees outperform the rest of the classifiers for both classification problems.
2.3 Stolfo et al — NG
In a seminal work, Stolfo et al have used n-gram analysis
for filetype identification [15] and later for malware detection
[16], [17]. Their earlier work, called fileprint analysis, uses
1-gram byte distribution of a whole file and compares it with
different filetype models to determine the filetype. In their
later work, they detect malware embedded in DOC and PDF
files using three different models single centroid, multi centroid, and exemplar of benign byte distribution of the whole
files. A distance measure, called Mahanalobis Distance, is
calculated between the n-gram distribution of these models
and a given test file. They also use 1-gram and 2-gram distributions to test their approach on a dataset comprising of
31 benign application executables, 331 benign executables
from the System32 folder and 571 viruses. The experimental results have shown that their proposed technique is able
to detect a considerable proportion of the malicious files.
However, their proposed technique is specific to embedded
malware and does not deal with detection of stand-alone
malware.
3. ARCHITECTURE OF PROPOSED TECHNIQUE
In this section, we explain our data mining based malware detection technique. It is important to emphasize that our technique
does not require any a priori information about the filetype
of a given file; as a result, our scheme is robust to subtle file
header obfuscations crafted by an attacker. Moreover, our
technique is also able to classify the malware as a function
of their payload, i.e. it can detect the family of a given
malware.
The architecture of the proposed technique is shown in Figure
1. It consists of four modules: (1) block generator (B), (2)
feature extractor (F), (3) data miner (D), and (4) correlation (C). It can accept any type of file as an input. We now
discuss the details of every module.
3.1 Block Generator Module (B)
The block generator module divides the byte-level contents of a given file into fixed-sized chunks — known as
blocks. We have used blocks for reducing the processing
overhead of the module. In future, we want to analyze the
benefit of using variable-sized blocks as well. Remember
that using a suitable block size plays a critical role in defining the accuracy of our framework because it puts a lower
limit on the minimum size of malware that our framework
can detect. We have to compromise a trade-off between the
amount of available information per block and the accuracy
of the system. In this study, we have set the block size to
1024 bytes (= 1K). For instance, if the file size is 100K
bytes then the file is split into 100 blocks. The frequency
histograms for 1-, 2-, 3-, and 4-gram of byte values are calculated for each block. These histograms are given as input
to the feature extraction module.
3.2 Feature Extraction Module (F)
The feature extraction module computes a number of statistical and information-theoretic features on the histograms for each block generated by the previous module. Overall, the features’ set consists of 13 diverse features, which are separately computed on the 1-, 2-, 3-, and 4-gram frequency histograms.² This brings the total size of the features’ set to 52
features per block. We now provide brief descriptions of
every feature.
3.2.1 Simpson’s Index
Simpson’s index [31] is a measure defined for an ecosystem, which quantifies the diversity of species in a habitat. It is calculated using the following equation:

$$S_i = \frac{n(n-1)}{N(N-1)} \qquad (1)$$
where $n$ is the frequency of the byte values of consecutive n-grams and $N$ is the total number of bytes in a block, i.e., 1,000 in our case. A value of zero shows no significant difference
between the frequencies of n-grams in a block. Similarly, as the value of $S_i$ increases, the variance in the frequency of n-grams in a block also increases.
In all subsequent feature definitions, we denote by $X_j$ the frequency of the $j$-th n-gram in the $i$-th block, where $j$ ranges from 0 to 255 for 1-grams, 0 to 65535 for 2-grams, 0 to 16777215 for 3-grams, and 0 to 4294967295 for 4-grams.
3.2.2 Canberra Distance
The Canberra distance measures the sum of a series of fractional differences between the coordinates of a pair of objects [31]. Mathematically, we represent it as:
$$CA(i) = \sum_{j=0}^{n} \frac{|X_j - X_{j+1}|}{|X_j| + |X_{j+1}|} \qquad (2)$$

3.2.3 Minkowski Distance of Order λ
This is a generalized metric that measures the absolute magnitude of the differences between a pair of objects:
$$m_i = \left( \sum_{j=0}^{n} |X_j - X_{j+1}|^{\lambda} \right)^{1/\lambda} \qquad (3)$$
where we have used λ = 3 as suggested in [31].
3.2.4 Manhattan Distance
It is a special case of Minkowski distance [31] with λ = 1.
$$mh_i = \sum_{j=0}^{n} |X_j - X_{j+1}| \qquad (4)$$
²In the rest of the paper, we use the generic term n-grams unless we need to refer to the four gram sizes separately.
Figure 1: Architecture of our proposed technique
3.2.5 Chebyshev Distance
The Chebyshev distance measure is also called the maximum value distance. It measures the absolute magnitude of the differences between the coordinates of a pair of objects:

$$ch_i = \max_j |X_j - X_{j+1}| \qquad (5)$$

It is a special case of the Minkowski distance [31] with $\lambda = \infty$.

3.2.6 Bray Curtis Distance
It is a normalized distance measure which is defined as the ratio of the absolute difference of the frequencies of n-grams and the sum of their frequencies [31]:

$$bc_i = \frac{\sum_{j=0}^{n} |X_j - X_{j+1}|}{\sum_{j=0}^{n} (X_j + X_{j+1})} \qquad (6)$$

3.2.7 Angular Separation
This feature models the similarity of two vectors by taking the cosine of the angle between them [31]. A higher value of angular separation between two vectors shows that they are similar:

$$AS_i = \frac{\sum_{j=0}^{n} X_j \cdot X_{j+1}}{\left( \sum_{j=0}^{n} X_j^2 \cdot \sum_{j=0}^{n} X_{j+1}^2 \right)^{1/2}} \qquad (7)$$

3.2.8 Correlation Coefficient
The standard angular separation between two vectors, centered around the mean of their magnitude values, is called the correlation coefficient [31]. This again measures the similarity between two vectors:

$$CC_i = \frac{\sum_{j=0}^{n} (X_j - \bar{X}_i)(X_{j+1} - \bar{X}_i)}{\left( \sum_{j=0}^{n} (X_j - \bar{X}_i)^2 \cdot \sum_{j=0}^{n} (X_{j+1} - \bar{X}_i)^2 \right)^{1/2}} \qquad (8)$$

where $\bar{X}_i$ is the mean of the frequencies of n-grams in a given block $i$.

3.2.9 Entropy
Entropy measures the degree of dispersal or concentration of a distribution. In information-theoretic terms, the entropy of a probability distribution defines the minimum average number of bits that a source requires to transmit symbols according to that distribution [31]. Let $R$ be a discrete random variable such that $R = \{r_i, i \in \Delta_n\}$, where $\Delta_n$ is the image of the random variable. Then the entropy of $R$ is defined as:

$$E(R) = -\sum_{i \in \Delta_n} t(r_i) \log_2 t(r_i) \qquad (9)$$

where $t(r_i)$ is the frequency of n-grams in a given block.
3.2.10 Kullback-Leibler Divergence
The KL divergence is a measure of the difference between two probability distributions [31]. It is often referred to as a distance measure between two distributions. Mathematically, it is represented as:

$$KL_i(X_j \,\|\, X_{j+1}) = \sum_{j=0}^{n} X_j \log \frac{X_j}{X_{j+1}} \qquad (10)$$

3.2.11 Jensen-Shannon Divergence
It is a popular measure in probability theory and statistics and measures the similarity between two probability distributions [31]. It is also known as the Information Radius (IRad). Mathematically, it is represented as:

$$JSD_i(X_j \,\|\, X_{j+1}) = \frac{1}{2} D(X_j \,\|\, M) + \frac{1}{2} D(X_{j+1} \,\|\, M) \qquad (11)$$

where $M = \frac{1}{2}(X_j + X_{j+1})$.

3.2.12 Itakura-Saito Divergence
The Itakura-Saito divergence is a special form of Bregman distance [31]:

$$BF(X_j, X_{j+1}) = \sum_{j=0}^{n} \left( \frac{X_j}{X_{j+1}} - \log \frac{X_j}{X_{j+1}} - 1 \right) \qquad (12)$$

which is generated by the convex function $F_i(X_j) = -\log X_j$.

3.2.13 Total Variation
It measures the largest possible difference between two probability distributions $X_j$ and $X_{j+1}$ [31]. It is defined as:

$$\delta_i(X_j, X_{j+1}) = \frac{1}{2} \sum_j |X_j - X_{j+1}| \qquad (13)$$
3.3 Data Mining based Classification Module (D)
The classification module receives as input the feature vectors in the form of an ARFF file [26]. This feature file is then presented for classification to six sub-modules, which contain the learnt models of the six types of malicious files: backdoor, virus, worm, trojan, constructor and miscellaneous. The feature vector file is presented to all sub-modules in parallel, and they produce an output of n or pm per block. In addition, the output of the classification sub-modules provides insights into the payload of the malicious file.
Boosted decision tree is used for classifying each block as
n or pm. We have used the AdaBoostM1 algorithm for boosting the decision tree (J48) [24]. We have selected this classifier
after extensive pilot studies which are detailed in Section
5. We provide brief explanations of decision tree (J48) and
boosting algorithm (AdaBoostM1) below.
Table 1: Statistics of benign files used in this study

Filetype | Qty. | Avg. Size (KB) | Min. Size (KB) | Max. Size (KB)
DOC | 300 | 1,015.2 | 44  | 7,706
EXE | 300 | 4,095.0 | 48  | 15,005
JPG | 300 | 1,097.8 | 3   | 1,629
MP3 | 300 | 3,384.4 | 654 | 6,210
PDF | 300 | 1,513.1 | 25  | 20,188
ZIP | 300 | 1,489.6 | 8   | 9,860
3.3.1 Decision Tree (J48)
We have used the C4.5 decision tree (J48) implemented in the Waikato Environment for Knowledge Analysis (WEKA)
[26]. It uses the concept of information entropy to build the
tree. Every feature is used to split the dataset into smaller
subsets and the normalized information gain is calculated.
The feature with highest information gain is selected for
decision making.
3.3.2 Boosting (AdaBoostM1)
We have used AdaBoostM1 algorithm implemented in WEKA
[24]. As the name suggests, it is a meta algorithm which is
designed to improve the performance of base learning algorithms. AdaBoostM1 repeatedly invokes the weak classifier a pre-defined number of times, reweighting the instances that were misclassified. In this way, it keeps adapting to the ongoing classification process. It
is known to be sensitive to outliers and noisy data.
3.4 Correlation Module (C)
The correlation module gets the per block classification
results in the form of n or pm. It then calculates the correlation among the blocks which are labeled as n or pm. Depending upon the fraction of n and pm blocks in a file, the
file is classified as malicious or benign. We can also set a
threshold for tweaking the final classification decision. For
instance, if we set the threshold to 0.5, a file having 4 benign
and 6 malicious blocks will be classified as malicious, and
vice-versa.
4. DATASET
In this section, we present an overview of the dataset used
in our study. We first describe the benign and then the
malware dataset used in our experiments. It is important
to note that in this work we are not differentiating between
packed and non-packed files. Our scheme works regardless
of the packed/non-packed nature of the file.
4.1 Benign Dataset
The benign dataset for our experiments consists of six different filetypes: DOC, EXE, JPG, MP3, PDF and ZIP. These filetypes encompass a broad spectrum of commonly used files
ranging from compressed to redundant and from executables to document files. Each set of benign files contains 300
typical samples of the corresponding filetype, which provide
us with 1,800 benign files in total. We have ensured the
generality of the benign dataset by randomizing the sample sources. More specifically, we queried well-known search
engines with random keywords to collect these files. In addition, typical samples are also collected from the local network of our virology lab.
Some pertinent statistics of the benign dataset used in
this study are tabulated in Table 1. It can be observed
from Table 1 that the benign files have very diverse sizes
varying from 3 KB to 20 MB, with an average file size of
approximately 2 MB. The divergence in sizes of benign files
is important as malicious programs are inherently smaller in
size for ease of propagation.
Table 2: Statistics of malware used in this study

Maj. Category | Qty. | Avg. Size (KB) | Min. Size (bytes) | Max. Size (KB)
Backdoor      | 3,444 | 285.6 | 56  | 9,502
Constructor   | 172   | 398.5 | 371 | 5,971
Trojan        | 3,114 | 135.7 | 12  | 4,014
Virus         | 1,048 | 50.7  | 175 | 1,332
Worm          | 1,471 | 72.3  | 44  | 2,733
Miscellaneous | 1,062 | 197.7 | 371 | 14,692

4.2 Malware Dataset
We have used the ‘VX Heavens Virus Collection’ [27] database, which is freely available in the public domain. Malware samples, especially recent ones, are not easily available on the Internet; computer security corporations do have extensive malware collections, but unfortunately they do not share their malware databases for research purposes. This is a comprehensive database that contains a total of 37,420 malware samples. The samples
consist of backdoors, constructors, flooders, bots, nukers,
sniffers, droppers, spyware, viruses, worms and trojans etc.
We only consider Win32 based malware in PE file format.
The filtered dataset used in this study contains 10,311 Win32
malware samples. To make our study more comprehensive,
we divide the malicious executables based on the function
of their payload. The malicious executables are divided into six major categories: backdoor, trojan, virus, worm, constructor, and miscellaneous (malware like nuker, flooder, virtool, hacktool, etc.). We now provide a brief explanation
of each of the six malware categories.
4.2.1 Backdoor
A backdoor is a program which allows bypassing of standard authentication methods of an operating system. As a
result, remote access to computer systems is possible without explicit consent of the users. Information logging and
sniffing activities are possible using the gained remote access.
4.2.2 Constructor
This category of malware mostly includes toolkits for automatically creating new malware by varying a given set of
input parameters.
4.2.3 Worm
The malware in this category spreads over the network by
replicating itself.
4.2.4 Trojan
A trojan is a broad term that refers to stand alone programs which appear to perform a legitimate function but
covertly do possibly harmful activities such as providing remote access, data destruction and corruption.
4.2.5 Virus
A virus is a program that can replicate itself and attach
itself with other benign programs. It is probably the most
well-known type of malware and has different flavors.
4.2.6 Miscellaneous
The malware in this category includes DoS (denial of service), nuker, exploit, hacktool and flooder malware.
Table 3: AUCs for detecting malicious executables vs benign files. Bold entries in every column represent the best results.

Strings
Classifier | Back  | Cons  | Misc  | Troj  | Virus | Worm
RIPPER     | 0.591 | 0.611 | 0.690 | 0.685 | 0.896 | 0.599
NB         | 0.615 | 0.610 | 0.714 | 0.751 | 0.944 | 0.557
M-NB       | 0.615 | 0.606 | 0.711 | 0.755 | 0.952 | 0.575
B-J48      | 0.625 | 0.652 | 0.765 | 0.762 | 0.946 | 0.642

KM
Classifier | Back  | Cons | Misc  | Troj  | Virus | Worm
SMO        | 0.720 | −    | 0.611 | 0.755 | 0.865 | 0.750
B-SMO      | 0.711 | −    | 0.611 | 0.766 | 0.931 | 0.759
NB         | 0.715 | −    | 0.610 | 0.847 | 0.947 | 0.750
B-NB       | 0.715 | −    | 0.606 | 0.821 | 0.939 | 0.760
J48        | 0.712 | −    | 0.560 | 0.805 | 0.850 | 0.817
B-J48      | 0.795 | −    | 0.652 | 0.851 | 0.921 | 0.820
IBk        | 0.752 | −    | 0.611 | 0.850 | 0.942 | 0.841

Proposed solution with 4 features
Classifier | Back  | Cons  | Misc  | Troj  | Virus | Worm
J48        | 0.835 | 0.709 | 0.779 | 0.837 | 0.909 | 0.880
B-J48      | 0.812 | 0.716 | 0.817 | 0.839 | 0.933 | 0.884
NB         | 0.863 | 0.796 | 0.844 | 0.748 | 0.831 | 0.715
B-NB       | 0.849 | 0.709 | 0.782 | 0.746 | 0.807 | 0.707
IBk        | 0.841 | 0.722 | 0.817 | 0.791 | 0.917 | 0.812
B-IBk      | 0.883 | 0.794 | 0.844 | 0.791 | 0.918 | 0.812

Proposed solution with 52 features
Classifier | Back  | Cons  | Misc  | Troj  | Virus | Worm
B-J48      | 0.979 | 0.965 | 0.950 | 0.985 | 0.970 | 0.932

The DoS and
nuker based malware allow an attacker to launch malicious
activities at a victim’s computer system that can possibly
result in a denial of service attack. These activities can
result in slowdown, restart, crash or shutdown of a computer system. Exploit and hacktool malware take advantage of vulnerabilities in a system’s implementation, most commonly through buffer overflows. Flooders initiate unwanted information floods such as email, instant messaging and SMS floods.
The detailed statistics of the malware used in our study
are provided in Table 2. The average malware size in this
dataset is 64.2 KB. The sizes of malware samples used in
our study vary from 4 bytes to more than 14 MB. Intuitively
speaking, small sized malware are harder to detect than the
larger ones.
5. PILOT STUDIES
In our initial set of experiments, we have conducted an extensive search in the design space to select the best features’
set and classifier for our scheme. The experiments are done on 5% of the total dataset to keep the design cycle of our approach short. Recall that in Section 3 we have introduced 13 different statistical and information-theoretic features. The pilot studies are aimed at convincing ourselves that we need all of them. Moreover, we have also evaluated well-known
data mining algorithms on our dataset in order to find the
best classification algorithm for our problem. We have used
Instance based (IBk) [22], Decision Trees (J48) [25], Naïve
Bayes [23] and the boosted versions of these classifiers in
our pilot study. For boosting we have used AdaBoostM1
algorithm [24]. We have utilized implementations of these
algorithms available in WEKA [26].
5.1 Discussion on Pilot Studies
The classification results of our experiments are tabulated
in Table 3 which show that the boosted Decision Tree (J48)
significantly outperforms other classifiers in terms of detection accuracy.

Table 4: Feature Analysis. AUCs for detecting virus executables vs benign files using boosted J48. F1, F2, F3 and F4 correspond to 4 different features.

Feature      | No. of grams/feature | AUC
F1           | 1          | 0.823
F2           | 1          | 0.839
F3           | 1          | 0.866
F4           | 2          | 0.891
F1-F2        | 1, 1       | 0.940
F1-F3        | 1, 1       | 0.928
F1-F4        | 1, 2       | 0.932
F2-F4        | 1, 2       | 0.929
F1-F2-F4     | 1, 1, 2    | 0.962
F3-F2        | 1, 1       | 0.954
F3-F4        | 1, 2       | 0.913
F1-F2-F3-F4  | 1, 1, 1, 2 | 0.956

Figure 2: ROC plot for virus-benign classification (tp rate vs. fp rate for J48, NB, IBK and their boosted versions).

Similarly, we also evaluate the role of the number of features and the number of n-grams in the accuracy of the proposed approach. In Table 3, we first use four features, namely Simpson’s index (F1), entropy rate (F2) and Canberra distance (F3) on 1-grams, and Simpson’s index on 2-grams (F4). Our scheme achieves a significantly higher detection accuracy compared with strings and KM. We then tried different combinations of features and n-grams and tabulated the results in Table 4. It is obvious from Table 4 that once we move from a single feature (F1) on 1-grams to (F4) on 2-grams, the detection accuracy improves from 0.823 to 0.891. Once we use combinations of features computed on 1-grams and 2-grams, the accuracy approaches 0.962 (see F1-F2-F4). This motivated us to use all 13 features on 1-, 2-, 3- and 4-grams, resulting in a total of 52 features. The ROC curve for the virus-benign classification using the F1, F2, F3 and F4 features is shown in Figure 2.
It is clear from Table 3 that the strings approach has a significantly higher accuracy for the virus type compared with other malware types. We analyzed the reason behind this by looking at the signatures used by this method. We observed that viruses typically carry some common strings, like “Chinese Hacker”, which become signatures for the strings approach. Since similar strings do not appear in other malware types, the accuracy of strings degrades to as low as 0.62 in the case of backdoors.
Recall that KM uses 4-grams as binary features. KM
follows the same pattern of higher accuracy for detecting
viruses and relatively lower accuracy for other malware types.
However, its accuracy results are significantly higher than those of the strings approach. Note that our proposed solution with 52 features not only provides the best detection accuracy for the virus category, but its accuracy also remains consistent across all malware types. This shows the strength of using a diverse features’ set with a boosted J48 classifier.
6. RESULTS & DISCUSSION
Remember that our approach is designed to achieve a challenging objective: to distinguish malicious files of type backdoor, virus, worm, trojan and constructor from benign files of types DOC, EXE, JPG, MP3, PDF and ZIP just on the basis of byte-level information.
We now explain our final experimental setup. The proposed scheme, as explained in Section 3.2, computes features
on n-grams of each block of a given file. We create the training samples by randomly selecting 50 malicious and benign
files each from malware and benign datasets respectively.
We create six separate training samples for each malware
category. We use these samples to train boosted decision
tree and consequently get 6 classification models for each
malware type.
For an easier understanding of the classification process,
we take backdoor as an example. We have selected 50 backdoor files and 50 benign files to get a training sample for classifying backdoor and benign files. We train the data mining
algorithm with this sample and as a result get the training
model to classify a backdoor. Moreover, we further test,
using this model, all six benign filetypes and backdoor malware files. It is important to note that this model is specifically designed to distinguish backdoor files from six benign
filetypes (one-vs-all classification), where only 50 backdoor
samples are taken in training. This number is considerably
small considering the problem at hand. Nonetheless, the
backdoor classification using a single classifier completes in
seven runs: six for benign filetypes and one for itself. In
a similar fashion, one can classify malware types i.e., virus,
worm, trojan, constructor and miscellaneous. Once our solution is trained for each category of malware, we test it
on the benign dataset of 1800 files and the malware dataset
of 10,311 files from VX Heavens. We ensure that the benign and malware files used in the training phase are not
included in the testing phase to verify our claim of zero-day
malware detection. The classification results are shown in
Figure 3 in the form of AUCs. It is interesting to see that
viruses, as expected, are easily classified by our scheme. In
comparison, trojans are programs that look and behave like benign programs, but perform some illegitimate activities. As expected, the classification accuracy for trojans is 0.88, significantly lower than that for viruses.
In order to get a better understanding, we have plotted the
ROC graphs of classification results for each malware type in
Figure 3. The classification results show that malware files
are quite distinct from benign files in terms of byte-level
file content. The ROC plot further confirms that viruses are easily classified compared with other malware types, while trojans and backdoors are relatively difficult to classify. Table 5 shows portions of the developed decision trees for all
malware categories. As decision trees provide a simple and
robust method of classification for a large dataset, it is interesting to note that malicious files are inherently different
from the benign ones even at the byte-level.
Figure 3: ROC plot for detecting malware from benign files. AUCs: Virus = 0.945, Worm = 0.919, Trojan = 0.881, Backdoor = 0.849, Constructor = 0.925, Miscellaneous = 0.903.

Table 5: Portions of developed decision trees

KL1 > 0.022975
|   Entropy2 <= 6.822176
|   |   Manhattan1 <= 0.005411
|   |   |   Manhattan4 <= 0.000062: Malicious (20.0)
|   |   |   Manhattan4 > 0.000062: Benign (8.0/2.0)
|   |   Manhattan1 > 0.005411: Malicious (538.0/6.0)
(a) between Backdoor and Benign files

CorelationCoefficient1 > 0.619523
|   Chebyshev4 <= 1.405
|   |   Itakura2 <= 87.231983: Malicious (352.0/9.0)
|   |   Itakura2 > 87.231983
|   |   |   TotalVariation1 <= 0.3415: Malicious (11.0)
|   |   |   TotalVariation1 > 0.3415: Benign (8.0)
(b) between Trojan and Benign files
CorelationCoefficient4 > 0.187794
|   Simpson_Index_1 <= 0.005703
|   |   CorelationCoefficient3 <= 0.17584: Malicious (32.0)
|   |   CorelationCoefficient3 > 0.17584
|   |   |   Entropy1 <= 4.969689: Malicious (4.0/1.0)
|   |   |   Entropy1 > 4.969689: Benign (5.0)
|   Simpson_Index_1 > 0.005703: Benign (11.0)
(c) between Virus and Benign files
Entropy2 > 3.00231
|   Canberra2 <= 49.348481
|   |   Canberra1 <= 14.567909: Malicious (161.0)
|   |   Canberra1 > 14.567909
|   |   |   Itakura3 <= 126.178699: Malicious (5.0)
|   |   |   Itakura3 > 126.178699: Benign (5.0)
|   Canberra2 > 49.348481: Benign (13.0/1.0)
(d) between Worm and Benign files
CorelationCoefficient3 <= -0.013832
|   Itakura2 <= 5.905754
|   |   Itakura3 <= 5.208592
|   |   |   CorelationCoefficient4 <= -0.155078: Malicious (7.0)
|   |   |   CorelationCoefficient4 > -0.155078: Benign (43.0/6.0)
|   |   Itakura3 > 5.208592: Benign (303.0/5.0)
(e) between Constructor and Benign files
Entropy4 > 6.754364
|   KL1 <= 0.698772
|   |   Manhattan3 <= 0.001003
|   |   |   Entropy2 <= 6.411063: Malicious (29.0)
|   |   |   Entropy2 > 6.411063
|   |   |   |   KL1 <= 0.333918: Malicious (2.0)
|   |   |   |   KL1 > 0.333918: Benign (2.0)
|   |   Manhattan3 > 0.001003: Benign (3.0)
(f) between Miscellaneous and Benign files
The novelty of our scheme lies in the way the selected
features have been computed — per block n-gram analysis,
and the correlation between the blocks classified as benign
or potentially malicious. The features used in our study are
taken from statistics and information theory. Many of these
features have already been used by researchers in other fields
for similar classification problems. The chosen set of features
is not, by any means, the optimal collection. The selection of
the optimal number of features remains an interesting problem
which we plan to explore in our future work. Moreover, the
executable dataset used in our study contained both packed
and non-packed PE files. We plan to evaluate the robustness
of our proposed technique on a manually crafted packed-file dataset.
7. CONCLUSION & FUTURE WORK
In this paper we have proposed a non-signature based
technique which analyzes the byte-level file content. We
argue that such a technique provides implicit robustness against common obfuscation techniques, especially the repacking of malware to evade signatures. An outcome of our research is that malicious and benign files are inherently different even at the byte level.
The proposed scheme uses a rich features’ set of 13 different statistical and information-theoretic features computed
on 1-, 2-, 3- and 4-grams of each block of a file. Once we
have calculated our features’ set, we give it as an input
to the boosted decision tree (J48) classifier. The choice of
features’ set and classifier is an outcome of extensive pilot
studies done to explore the design space. The pilot studies demonstrate the benefit of our approach compared with
other well-known data mining techniques: the strings and KM approaches. We have tested our solution on an extensive executable dataset. The results of our experiments show that our technique achieves more than 90% detection accuracy for different malware types. Another important feature of our framework is that it can also classify the family of a given malware file, i.e., virus, trojan, etc.
In future, we would like to evaluate our scheme on a larger
dataset of benign and malicious executables and reverse engineer the features’ set for further improving the detection
accuracy. Moreover, we plan to evaluate the robustness of
our proposed technique on a customized dataset containing
manually packed executable files.
Acknowledgments
This work is supported by the National ICT R&D Fund,
Ministry of Information Technology, Government of Pakistan. The information, data, comments, and views detailed
herein may not necessarily reflect the endorsements or views
of the National ICT R&D Fund.
We acknowledge M.A. Maloof and J.Z. Kolter for their
valuable feedback regarding the implementation of strings
and KM approaches. Their comments were of great help in
establishing the experimental testbed used in our study. We
also acknowledge the anonymous reviewers for their valuable
suggestions pertaining to possible extensions of our study.
8. REFERENCES
[1] Symantec Internet Security Threat Reports I-XI (Jan
2002—Jan 2008).
[2] F-Secure Corporation, “F-Secure Reports Amount of
Malware Grew by 100% during 2007”, Press release,
2007.
[3] A. Stepan, “Improving Proactive Detection of Packed
Malware”, Virus Bulletin, March 2006, available at
http://www.virusbtn.com/virusbulletin/
archive/2006/03/vb200603-packed.dkb
[4] R. Perdisci, A. Lanzi, W. Lee, “Classification of Packed
Executables for Accurate Computer Virus Detection”,
Pattern Recognition Letters, 29(14), pp. 1941-1946,
Elsevier, 2008.
[5] AVG Free Antivirus, available at
http://free.avg.com/.
[6] Panda Antivirus, available at
http://www.pandasecurity.com/.
[7] M.G. Schultz, E. Eskin, E. Zadok, S.J. Stolfo, “Data
mining methods for detection of new malicious
executables”, IEEE Symposium on Security and
Privacy, pp. 38-49, USA, IEEE Press, 2001.
[8] J.Z. Kolter, M.A. Maloof, “Learning to detect malicious
executables in the wild”, ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining,
pp. 470-478, USA, 2004.
[9] J. Kephart, G. Sorkin, W. Arnold, D. Chess, G.
Tesauro, S. White, “Biologically inspired defenses
against computer viruses”, International Joint
Conference on Artificial Intelligence (IJCAI), pp.
985-996, USA, 1995.
[10] R.W. Lo, K.N. Levitt, R.A. Olsson, “MCF: A
malicious code filter”, Computers & Security,
14(6):541-566, Elsevier, 1995.
[11] O. Henchiri, N. Japkowicz, “A Feature Selection and
Evaluation Scheme for Computer Virus Detection”,
IEEE International Conference on Data Mining
(ICDM), pp. 891-895, USA, IEEE Press, 2006.
[12] P. Kierski, M. Okoniewski, P. Gawrysiak, “Automatic
Classification of Executable Code for Computer Virus
Detection”, International Conference on Intelligent
Information Systems, pp. 277-284, Springer, Poland,
2003.
[13] T. Abou-Assaleh, N. Cercone, V. Keselj, R. Sweidan.
“Detection of New Malicious Code Using N-grams
Signatures”, International Conference on Intelligent
Information Systems, pp. 193-196, Springer, Poland,
2003.
[14] J.H. Wang, P.S. Deng, “Virus Detection using Data
Mining Techniques”, IEEE International Carnahan
Conference on Security Technology, pp. 71-76, IEEE
Press, 2003.
[15] W.J. Li, K. Wang, S.J. Stolfo, B. Herzog, “Fileprints:
identifying filetypes by n-gram analysis”, IEEE
Information Assurance Workshop, USA, IEEE Press,
2005.
[16] S.J. Stolfo, K. Wang, W.J. Li, “Towards Stealthy
Malware Detection”, Advances in Information Security,
Vol. 27, pp. 231-249, Springer, USA, 2007.
[17] W.J. Li, S.J. Stolfo, A. Stavrou, E. Androulaki, A.D.
Keromytis, “A Study of Malcode-Bearing Documents”,
International Conference on Detection of Intrusions &
Malware, and Vulnerability Assessment (DIMVA), pp.
231-250, Springer, Switzerland, 2007.
[18] M.Z. Shafiq, S.A. Khayam, M. Farooq, “Embedded
Malware Detection using Markov n-Grams”,
International Conference on Detection of Intrusions &
Malware, and Vulnerability Assessment (DIMVA), pp.
88-107, Springer, France, 2008.
[19] M. Christodorescu, S. Jha, and C. Kruegal, “Mining
Specifications of Malicious Behavior”, European
Software Engineering Conference and the ACM
SIGSOFT Symposium on the Foundations of Software
Engineering (ESEC/FSE 2007), pp. 5-14, Croatia, 2007.
[20] Frans Veldman, “Heuristic Anti-Virus Technology”,
International Virus Bulletin Conference, pp. 67-76,
USA, 1993, available at http://mirror.sweon.net/
madchat/vxdevl/vdat/epheurs1.htm.
[21] Jay Munro, “Antivirus Research and Detection
Techniques”, Antivirus Research and Detection
Techniques, ExtremeTech, 2002, available at
http://www.extremetech.com/article2/0,2845,
367051,00.asp.
[22] D.W. Aha, D. Kibler, M.K. Albert, “Instance-based
learning algorithms”, Journal of Machine Learning, Vol.
6, pp. 37-66, 1991.
[23] M.E. Maron, J.L. Kuhns, “On relevance, probabilistic
indexing and information retrieval”, Journal of the
Association for Computing Machinery, 7(3), pp. 216-244,
1960.
[24] Y. Freund, R. E. Schapire, “A decision-theoretic
generalization of on-line learning and an application to
boosting”, Journal of Computer and System Sciences,
No. 55, pp. 23-37, 1997.
[25] J.R. Quinlan, “C4.5: Programs for machine learning”,
Morgan Kaufmann, USA, 1993.
[26] I.H. Witten, E. Frank, “Data mining: Practical
machine learning tools and techniques”, Morgan
Kaufmann, 2nd edition, USA, 2005.
[27] VX Heavens Virus Collection, VX Heavens website,
available at http://vx.netlux.org
[28] J. Oberheide, E. Cooke, F. Jahanian. “CloudAV:
N-Version Antivirus in the Network Cloud”, USENIX
Security Symposium, pp. 91-106, USA, 2008.
[29] T. Fawcett, “ROC Graphs: Notes and Practical
Considerations for Researchers”, TR HPL-2003-4, HP
Labs, USA, 2004.
[30] S.D. Walter, “The partial area under the summary
ROC curve”, Statistics in Medicine, 24(13), pp.
2025-2040, 2005.
[31] T.M. Cover, J.A. Thomas, “Elements of Information
Theory”, Wiley-Interscience, 1991.
Online Phishing Classification Using Adversarial Data
Mining and Signaling Games
Gaston L’Huillier
University of Chile
Blanco Encalada 2120
Santiago, Chile
[email protected]
Richard Weber
University of Chile
Republica 701
Santiago, Chile
[email protected]

Nicolas Figueroa
University of Chile
Republica 701
Santiago, Chile
[email protected]
ABSTRACT
In adversarial systems, the performance of a classifier decreases after it is deployed, as the adversary learns to defeat it. Recently, adversarial data mining was introduced as a solution to this problem, where the classification problem is viewed as a game mechanism between an adversary and an intelligent and adaptive classifier. In recent years, phishing fraud through malicious email messages has been a serious threat that affects global security and economy, and traditional spam filtering techniques have shown to be ineffective against it. In this domain, using dynamic games of incomplete information, a game-theoretic data mining framework is proposed in order to build an adversary-aware classifier for phishing fraud detection. To build the classifier, an online version of Weighted Margin Support Vector Machines with a game-theoretic prior knowledge function is proposed. In this paper, a new content-based feature extraction technique for phishing filtering is also described. Experiments show that the proposed classifier is highly competitive compared with previously proposed online classification algorithms in this adversarial environment, and promising results were obtained using traditional machine learning techniques over the extracted features.
Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications—Data Mining; I.5.1 [Pattern Recognition]: Design Methodology—classifier design and evaluation; K.4.4 [Computers and Society]: Electronic Commerce—Security

General Terms
Algorithms, Email Filtering, Game Theory, Data Mining

Keywords
Spam and phishing detection, Adversarial Classification, Games of Incomplete Information

1. INTRODUCTION
In security applications, modern threats are becoming more effective as adversaries adapt and evolve against current security systems. In many domains, such as fraud, phishing, spam, intrusion detection, and other malicious activities, there is a permanent race between adversaries and classifiers. The evolution of the initial problem is driven by a rational change in the adversaries’ behavior.
In this context, one of the major problems for a classifier is to consider the concept drift and incremental properties of security systems. Recent studies on this topic [35, 38] have mainly considered the incremental characteristic of these applications, leaving the adversarial behavior as an open question in most of the previously mentioned domains.
Nowadays, in the cyber-crime context, one of the most common social engineering threats is phishing fraud. This malicious activity consists of email scams in which attackers ask for personal information in order to break into any site where victims store useful private information, such as financial institutions, e-commerce or massive services. The phishing filtering problem is not an easy task. Even human behavior, mental models for phishing identification, and security skins for graphical user interfaces have been proposed for enhancing human skills to detect phishing emails [9, 10, 28]. While client-side phishing filtering techniques have been developed by large software companies, server-side filtering techniques have been a large research focus [1, 5, 11, 4]. Most of this work is based on machine learning approaches to determine the relevant features to extract from phishing emails, and on data mining techniques to determine hidden patterns in the relationships between the extracted features.
There is an important issue when using data mining to build a classifier for the phishing detection task, and many other adversarial classification tasks: it must deal with the uncertainty of classifying malicious or regular activities without information about the real intention of the message. The latter can be modeled as a Bayesian game (or incomplete information game), where the classifier must decide on a strategy without knowing the adversary’s real type — whether the message was malicious or just happened to be a “malicious-like” regular message — using just the revealed set of features to decide.
The dynamic behavior of signaling games, a special representation of dynamic games of incomplete information (dynamic Bayesian games), presents common elements with online learning theory. In both cases the final outcomes, whether a game equilibrium or an online classifier, are determined by incremental events presented in their environments. As the nature of email filtering is determined by a
high stream of messages, online algorithms, as well as generative learning algorithms have been considered as a measure
to minimize the computational cost, even if the predictive
power is lower than that obtained by discriminative learning
algorithms.
The aim of this work is to present a game-theoretic data
mining framework using dynamic games of incomplete information for the adversarial classification problem. A mechanism is proposed to model a signaling game between an adversary and a classifier, where equilibrium strategies and the
classifier beliefs are used to build an online machine learning
classifier to detect phishing emails.
Section 2 of this paper introduces previous work on adversarial data mining and the latest research in phishing filtering. The problem definition and the proposed mechanism are introduced in Section 3. The proposed classifier strategy, the main contribution of this paper, is presented in Section 4. In Section 5 the proposed phishing feature extraction and problem parameters are defined. The experimental design is presented in Section 6, followed by results in Section 7. Finally, the main conclusions and future work are presented in Section 8.
2. PREVIOUS WORK
2.1 Adversarial Machine Learning
As described by Dalvi et al. in [8], an adversarial game can be represented as a game between two players: a malicious agent whose adversarial activity brings it benefits, and a classifier whose main objective is to identify as many malicious activities as possible, maximizing its expected utility. The malicious agent tries to avoid detection by changing its features (hence its behavior), inducing a high false-negative rate in the classifier. The adversary is aware that changing features towards non-adversarial behavior might not increase its benefit. Considering this, the adversary might try to maximize its benefit while minimizing the cost of changing features. This framework, based on a single-shot game of complete information, was initially tested in a spam detection domain, where the adversary-aware naïve Bayes classifier had significantly fewer false positives and false negatives than the classifier’s plain version. Then a repeated version of the game was tested, where results showed that the adversary-aware classifier consistently outperformed the adversary-unaware naïve Bayes classifier.
Several extensions of the adversarial classification framework have recently been developed. An interesting approach, proposed by Kantarcioglu et al. in [19], considers an adversarial Stackelberg game model to define the interaction between the classifier and the adversary. In this setting, the subgame perfect equilibrium is determined using stochastic optimization, where Monte Carlo simulations on mixture models and linear adversarial transformations were tested over spam data sets, showing promising results. In another approach, developed by Sönmez in [26], a two-player zero-sum game is solved by jointly formulating a Linear Support Vector Machine (L-SVM) and a minimax optimization problem to find the optimal equilibrium and the optimal hyperplane simultaneously. In that work, the general-sum case is handled using Nikaido-Isoda-type functions to define the optimal hyperplanes in an L-SVM context.
Recently, several studies have addressed the possibility that a classifier is maliciously mis-trained or that its optimal strategies could be revealed in an adaptive adversarial environment. Open questions such as "Can machine learning be secure?" or "Can the adversary manipulate a learning system to permit a specific attack?" are extensively discussed in [3]. More specifically, Nelson et al. present in [23] how to exploit a spam classifier to render it useless using a very specific attack framework, combining indiscriminate attacks, focused attacks and an optimal attack function, all of them aware that the training model used was a naïve Bayes classifier.
Another potential adversarial problem for the classifier, introduced by Lowd and Meek in [18] as adversarial learning theory, enables the adversary to reconstruct the classifier based on reasonable assumptions and reverse engineering algorithms. However, Biggio et al. in [6] present a promising alternative that randomizes the classifier's decision function in order to hide the classifier's strategy from the adversary, minimizing adversarial learning and the possibilities of mis-training or learning from the classifier.
2.2 Phishing Classification
Spam filtering has been widely discussed over the last years, and many filtering techniques have been described [15]. Phishing classification differs in many aspects from the spam case, where most spam emails merely advertise some product. In phishing there is a more complex interaction between the message and the receiver, such as following malicious links, filling in deceptive forms, or replying with useful information, all of which are relevant for the message to succeed. Also, there is a clear difference among phishing techniques, where the two main message categories are the popularly known deceptive phishing and malware phishing. While malware phishing has been used to spread malicious software installed on victims' machines, deceptive phishing, according to [2], can be categorized into the following six subcategories: social engineering, mimicry, email spoofing, URL hiding, invisible content and image content. For each of these subcategories, specific feature extraction techniques have been proposed [2] to help phishing classifiers use the right characterization of these messages.
Among the countermeasures used against phishing, three main alternatives have been applied [2]: blacklisting and whitelisting, network- and encryption-based countermeasures, and content-based filtering. The first alternative, in general terms, consists in using public lists of malicious phishing web sites (the blacklist) and lists of legitimate non-malicious web sites (the whitelist); every link in a message must be checked against both lists. The main problem with this countermeasure is that phishing web sites do not persist long enough to be registered on time in the blacklist, making it difficult to keep an up-to-date list of malicious web sites. The second alternative is based on email authentication methods, where the transaction time can imply a considerable computational cost; moreover, a special technological infrastructure is needed for this countermeasure [2]. Previous work on content-based phishing filtering [1, 5, 2, 11, 4] focused on the extraction of a large number of features and the usage of popular machine learning techniques for classification. These approaches to automatic phishing filtering have shown promising results in setting the relative importance of features.
Different text mining techniques for phishing filtering have been proposed. Abu-Nimeh et al. in [1] used Logistic Regression, Support Vector Machines (SVM) and Random Forests to estimate classifiers for the correct labeling of email messages, obtaining their best results with an F-measure¹ of 90%. Fette et al. in [11], using a list of improved features directly extracted from email messages, proposed an SVM-based model which obtained an F-measure of 97.64% on a different phishing corpus. Bergholz et al. in [5] proposed a more sophisticated characterization of emails using a Class-Topic model, with which an SVM model obtained an F-measure of 99.46% on an updated version of the previously used phishing corpus. Later, in [2], Bergholz et al. proposed an improved list of features to extract from emails, which could characterize most of the phishing tactics identified by the same authors. Using an SVM model, an F-measure of 99.89% was obtained on a new benchmark database.

¹ The F-measure is a machine learning quality measure given by the harmonic mean of precision and recall; see Section 6.1.
3. PROBLEM DEFINITION
Consider a message arriving at time $t$, represented by the feature vector $x_t = (x_{t,1}, ..., x_{t,i}, ..., x_{t,a})$, where $x_{t,i}$ is the $i$th feature of message $x_t$. Each message can belong to one of two classes: positive (or malicious) messages and negative (or regular) messages. We define adversarial classification under a dynamic game of incomplete information as a signaling game between an Adversary, which attempts to defeat a Classifier by not revealing information about his real type, modifying $x_i$ (a message of type $i$) into $x_j$ (a message of type $j$) by means of the transformation function $\phi(x_i) = x_j$.
Consider the incomplete information game, as defined by J. Harsanyi in [16], as the tuple

$$\Gamma_b = (\mathcal{N}, (A_n)_{n \in \mathcal{N}}, (T_n)_{n \in \mathcal{N}}, (p_n)_{n \in \mathcal{N}}, (U_n)_{n \in \mathcal{N}})$$

where $\mathcal{N} = \{1, ..., N\}$ is the set of players and $A_n$ is the set of possible actions for player $n$, $\forall n \in \mathcal{N}$. $T_n$ is the set of possible types of the $n$th player, $\forall n \in \mathcal{N}$. $p_n$ is a probability function $p_n : T_n \to [0, 1]$ which assigns a probability distribution over $\times_{j \in \mathcal{N}} T_j$ to each possible player type in $T_n$, $\forall n \in \mathcal{N}$. Finally, the utility function of player $n$ is denoted by $U_n : (\times_{j \in \mathcal{N}} A_j) \times (\times_{j \in \mathcal{N}} T_j) \to \mathbb{R}$, which corresponds to the payoff of player $n$ as a function of the actions of all players and their types.

Based on the previous scheme, as described in [12, 14], dynamic games of incomplete information can be modeled as a signaling game. The proposed model of incomplete information for the adversarial classification between an Adversary (A) and a Classifier (C), i.e. $\mathcal{N} = \{A, C\}$, behaves as the following sequence of events:
1. Nature draws a type $t_i$ for the Adversary from $T = \{t_{R,x_i}\}_{i=1}^{k} \cup \{t_{M,x_i}\}_{i=1}^{k}$, which states whether the adversary is Regular (R) or Malicious (M), and defines the initial optional message of type $i$, $x_i$. Nature draws according to the probability distribution $p(t_i)$, where $p(t_i) > 0 \;\forall i$ and $\sum_{i=1}^{k} p(t_i) = 1$.

2. The Adversary observes his type $t_i$, which can be either $t_{R,x_i}$ or $t_{M,x_i}$, and chooses a message $x_j$ from his set of actions $A_A = \{\phi(x_i) = x_j\}_{j=1}^{k}$, where $x_i$ is defined by the type $t_{R,x_i}$ or $t_{M,x_i}$. The function $\phi : \mathbb{R}^a \to \mathbb{R}^a$ transforms the feature vector $x_i$ into $x_j$, the message whose class the Classifier has to decide. A non-malicious adversary has no incentive to modify its behavior, so $\phi(x_i) = x_i$ when its type is $t_{R,x_i}$, $\forall i \in \{1, ..., k\}$.

3. The Classifier observes $x_j$ (but not $t_i$) and chooses an action $C(x_j)$ from its set of actions $A_C = \{+1, -1\}$. It is important to notice that the Classifier is a single-type player, so its type is common knowledge and need not be mentioned further.

4. Finally, payoffs are given by $U_A(t_i, \phi(x_i), C(\phi(x_i)))$ and $U_C(t_i, \phi(x_i), C(\phi(x_i)))$.
The extensive-form game that represents the signaling game between the Adversary and the Classifier is presented in Figure 1.
In order to analyze the optimal strategies for the Classifier in the proposed mechanism, special requirements and assumptions beyond the traditional Bayesian Nash equilibrium must be considered. These requirements must be satisfied in order to apply the perfect Bayesian equilibrium (PBE) refinement concept to the adversarial classification signaling game.

Definition 1. Sequential rationality: At each information set, the Classifier must have a belief about which node in the information set has been reached by the play of the game. Given these beliefs, the Classifier's strategies must be sequentially rational [14, 17].
The previous definition insists that the Classifier have beliefs and act optimally given these beliefs, but assumptions are necessary in order to discard unreasonable beliefs. In an extensive-form game, information sets are "on the equilibrium path" if they are reached with positive probability when the game is played according to the equilibrium strategies. At on-the-equilibrium-path information sets, beliefs are determined by Bayes' rule and the players' equilibrium strategies. Kreps and Wilson formalized in [17] the concept of sequential rationality, where it is stated that the equilibrium no longer consists of just optimal strategies for each agent, but also includes a belief for each agent at each information set at which the agent has to move.
Definition 2. Signaling requirement 1 (S1): After observing any message $x_j$ from $A_A$, the Classifier must have a belief about which types could have sent $x_j$. Denote this belief by the probability distribution $\mu(t_i | x_j)$, where $\mu(t_i | x_j) \geq 0, \forall t_i \in T$ and $\sum_{t_i \in T} \mu(t_i | x_j) = 1$.
Definition 3. Signaling requirement 2 (S2C): For each $x_j \in A_A$, the Classifier's optimal strategy, defined as the probability distribution $\sigma_C^*$ over the Classifier's actions $C(x_j) \in A_C$, must maximize the Classifier's expected utility, given the beliefs $\mu(t_i | x_j)$ about which types could have sent $x_j$. That is,

$$\forall x_j, \quad \sigma_C^*(\cdot | x_j) \in \arg\max_{\alpha_C} \sum_{t_i \in T} \mu(t_i | x_j) \cdot U_C(t_i, x_j, \alpha_C) \quad (1)$$

where

$$U_C(t_i, x_j, \sigma_C(\cdot | x_j)) = \sum_{C(x_j) \in A_C} \sigma_C(C(x_j) | x_j) \, U_C(t_i, x_j, C(x_j)) \quad (2)$$
Figure 1: Extensive-form representation of the signaling game between the Classifier and the Adversary. In the figure, $x_{ij}$ is defined by $\phi(x_i) = x_j$ and $I_j$ is the $j$th information set where the Classifier has to decide $C(x_j) \in \{+1, -1\}$. All intermediate nodes between Nature and the information sets represent the strategy nodes for the Adversary, where $\phi(x_i) = x_j$ is decided.
Definition 4. Signaling requirement 3 (S2A): For each $t_i \in T$, the Adversary's optimal message $x_j = \phi(x_i)$, defined by the probability distribution $\sigma_A^*$ over the Adversary's actions $x_j \in A_A$, must maximize the Adversary's utility function, given the Classifier's strategy $\sigma_C^*$. That is,

$$\forall t_i, \quad \sigma_A(\cdot | t_i) \in \arg\max_{\alpha_A} U_A(t_i, \alpha_A, \sigma_C^*) \quad (3)$$

where

$$U_A(t_i, \sigma_A, \sigma_C) = \sum_{x_j \in A_A} \sigma_A(x_j | t_i) \, U_A(t_i, x_j, \sigma_C(\cdot | x_j))$$

and

$$U_A(t_i, x_j, \sigma_C(\cdot | x_j)) = \sum_{C(x_j) \in A_C} \sigma_C(C(x_j) | x_j) \, U_A(t_i, x_j, C(x_j))$$
Definition 5. Signaling requirement 4 (S3): For each $x_j \in A_A$, if there exists $t_i \in T$ such that $\sigma_A^*(x_j | t_i) > 0$, then the Classifier's belief at the information set $I_j$ corresponding to $x_j$ must follow from Bayes' rule and the Adversary's strategy:

$$\mu(t_i | x_j) = \frac{\sigma_A^*(x_j | t_i) \cdot p(t_i)}{\sum_{t_r \in T} \sigma_A^*(x_j | t_r) \cdot p(t_r)} \quad (4)$$

If $\sum_{t_r \in T} \sigma_A^*(x_j | t_r) \cdot p(t_r) = 0$, then $\mu(t_i | x_j)$ can be defined as any probability distribution.
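As an illustrative sketch only (not part of the paper's implementation), the Bayes-rule update of equation (4) can be computed directly from the type prior $p(t_i)$ and the Adversary's mixed strategy; all names below are hypothetical:

import numpy as np

def update_beliefs(sigma_A, p_types):
    """Signaling requirement S3: returns mu[i, j] = mu(t_i | x_j).

    sigma_A[i, j]: probability that a type-i Adversary sends message x_j.
    p_types[i]:    prior probability p(t_i) of type t_i."""
    sigma_A = np.asarray(sigma_A, dtype=float)
    p_types = np.asarray(p_types, dtype=float)
    joint = sigma_A * p_types[:, None]            # sigma_A*(x_j | t_i) * p(t_i)
    marginal = joint.sum(axis=0, keepdims=True)   # sum over types t_r
    # where the marginal is zero, any distribution is allowed; use uniform
    uniform = np.full_like(joint, 1.0 / len(p_types))
    safe = np.where(marginal > 0, marginal, 1.0)
    return np.where(marginal > 0, joint / safe, uniform)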
A sequential equilibrium, a subset of the perfect Bayesian equilibria (PBE) in the adversarial signaling game, is a pair of mixed strategies $\sigma_A^*$ and $\sigma_C^*$ and a belief $\mu(t_i | x_j)$ satisfying signaling requirements S1, S2C, S2A and S3. By construction of the mechanism, requirements S1 and S3 are clearly satisfied by the adversarial classification game. Signaling requirement S2A, however, will be considered satisfied as a first approach and as a strong assumption on the development of the game. Adversarial behavior strategies, as described in [8], could be an interesting alternative to develop; this is left as an open question for future work.
Assumption 1. Signaling game refinements: The dynamic game of incomplete information between the Classifier and the Adversary can be modeled by a signaling game which satisfies the signaling requirements for sequential rationality.
Recently, a numerical approximation of the sequential equilibrium refinement was proposed by Turocy in [31], based on previous work in [21, 30]. It uses a transformation of the logit quantal response equilibrium (QRE) correspondence, parameterized by a scalar precision parameter; as this parameter tends to infinity, a numerical approximation of the sequential equilibrium is obtained. This numerical algorithm is implemented in Gambit [20], an open-source project for computing equilibria of finite games.
4. CLASSIFIER STRATEGY
As mentioned before, the Classifier's optimal strategies are defined over the set $A_C = \{+1, -1\}$. From signaling requirement S2C, it can be shown that the Classifier's optimal strategy $C^*(x_j)$ can be solved by the following conditional statement:

$$C^*(x_j) = \begin{cases} +1 & \text{if condition (6) is satisfied} \\ -1 & \text{otherwise} \end{cases} \quad (5)$$

$$\sum_{t_i \in T_M} \mu(t_i | x_j) \, \Delta U_{C,M}^{t_i}(x_j) > \sum_{t_i \in T_R} \mu(t_i | x_j) \, \Delta U_{C,R}^{t_i}(x_j) \quad (6)$$

where $T_M = \{t_{M,x_i}\}_{i=1}^{k}$, $T_R = \{t_{R,x_i}\}_{i=1}^{k}$, $\mu(t_i | x_j)$ is defined by equation (4),

$$\Delta U_{C,R}^{t_i}(x_j) = \sigma_C^*(-1|x_j) U_C(t_{R,x_i}, x_j, -1) - \sigma_C^*(+1|x_j) U_C(t_{R,x_i}, x_j, +1)$$

and

$$\Delta U_{C,M}^{t_i}(x_j) = \sigma_C^*(+1|x_j) U_C(t_{M,x_i}, x_j, +1) - \sigma_C^*(-1|x_j) U_C(t_{M,x_i}, x_j, -1)$$

In what follows, these expressions will be taken as

$$\Delta U_{C,M}^{t_i}(x_j) = \epsilon_M \cdot (\sigma_C^*(+1|x_j) + \sigma_C^*(-1|x_j) \cdot \gamma) \cdot (w^T \cdot x_j + b)$$

and

$$\Delta U_{C,R}^{t_i}(x_j) = \epsilon_R \cdot (\sigma_C^*(-1|x_j) + \sigma_C^*(+1|x_j) \cdot \gamma) \cdot (w^T \cdot (e - x_j) + b)$$

where $\gamma$, $\epsilon_R$ and $\epsilon_M$ must be defined based on microeconomic assumptions on the primitives of the game, and $e$ is a vector of ones of dimension $a$. The modeling intuition and the final analytical expressions of the utility functions are presented in Appendix A.
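A minimal sketch (hypothetical names, assuming the $\epsilon_M$, $\epsilon_R$ and $\gamma$ parameterization above) of how condition (6) could be evaluated for a single message:

import numpy as np

def classifier_decision(x, w, b, mu_M, mu_R, sigma_plus, eps_M, eps_R, gamma):
    """Evaluate condition (6) for one message x; return C*(x) in {+1, -1}.

    mu_M, mu_R : belief mass mu(t_i | x) summed over malicious / regular types.
    sigma_plus : Classifier's current mixed-strategy probability of playing +1."""
    sigma_minus = 1.0 - sigma_plus
    e = np.ones_like(x)
    dU_M = eps_M * (sigma_plus + sigma_minus * gamma) * (w @ x + b)
    dU_R = eps_R * (sigma_minus + sigma_plus * gamma) * (w @ (e - x) + b)
    return +1 if mu_M * dU_M > mu_R * dU_R else -1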
The previous game-theoretic result (condition 6) can be considered as a prior knowledge constraint in a classification problem, associated with the regularized risk minimization of the statistical learning theory proposed by Vapnik in [32]. All this is formulated as the following quadratic problem:

$$\min_{w,b,\xi} \; \frac{1}{2} \sum_{i=1}^{a} w_i^2 + C \sum_{i=1}^{N} \xi_i \quad (7)$$
$$\text{s.t.} \quad y_i \left( w^T \cdot x_i + b \right) \cdot \Psi(x_i) \geq 1 - \xi_i \quad \forall i \in \{1, .., N\}$$
$$\xi_i \geq 0 \quad \forall i \in \{1, .., N\}$$

where

$$\Psi(x_i) = \frac{1 + \psi(x_i)}{\sum_{k=1}^{a} w_k + 2 \cdot b}$$

and

$$\psi(x_i) = \frac{\sum_{t_r \in T_M} \mu(t_r | x_i) \cdot \epsilon_M \cdot (\sigma_C^*(+1|x_i) + \gamma \cdot \sigma_C^*(-1|x_i))}{\sum_{t_r \in T_R} \mu(t_r | x_i) \cdot \epsilon_R \cdot (\sigma_C^*(-1|x_i) + \gamma \cdot \sigma_C^*(+1|x_i))}$$

The online algorithm that solves the proposed minimization problem is based on solving its dual formulation using the Sequential Minimal Optimization (SMO) algorithm described by Platt in [24]. The SMO algorithm trains SVMs by breaking up the large Quadratic Programming (QP) representation of the dual into a series of small QP problems, which are solved analytically by the algorithm. Small changes to the SMO algorithm, such as those explained in previous work on prior knowledge inclusion in SVMs [37], were considered.

A dual representation of the classification problem in equation (7) is the following:

$$\max_{\alpha} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} y_i \Psi(x_i) \cdot y_j \Psi(x_j) \cdot \alpha_i \alpha_j \, x_i^T x_j \quad (8)$$
$$\text{s.t.} \quad C_i \geq \alpha_i \geq 0 \quad \forall i \in \{1, .., N\}$$
$$\sum_{i=1}^{N} y_i \Psi(x_i) \cdot \alpha_i = 0$$

Based on previous work on Online Support Vector Machine algorithms described by Gentile in [13] and later by Sculley in [25], the proposed adversary-aware classifier is implemented as an online learning algorithm.

Algorithm 4.1: Bayesian Adversary Aware Online SVM
Data: (x_1, y_1), ..., (x_n, y_n), γ, ε_M, ε_R, m, Gp, C
Result: f(x_t) = w_t^T · x_t + b_t
1   Initialize w_0 := 0, b_0 := 0, seenData := {};
2   foreach x_t, y_t do
3       Classify x_t using f(x_t) = w_{t-1}^T · x_t + b_{t-1};
4       if y_t (w_{t-1}^T · x_t + b_{t-1}) < Ψ(x_t) then
5           Find w', b' with prior knowledge SMO on seenData, with w_{t-1} and b_{t-1} as seed hypothesis, and Ψ(x_t);
6           set w_t := w' and b_t := b';
7       if size(seenData) > m then
8           remove oldest example from seenData;
9       if t mod Gp = 1 then
10          Approximate sequential equilibrium strategies using logit QRE;
11      add x_t to seenData;
12      update p(t_i) based on observed messages in seenData;
13      update beliefs µ(t_i|x), ∀t_i ∈ T, x ∈ seenData, using signaling requirement S3;
14      update Ψ(x_i), ∀i ∈ seenData;
15  return f;

Algorithm 4.1 presents the online learning algorithm, the Bayesian Adversary-Aware Online SVM (BAAO-SVM). Based on the Classifier's beliefs and sequential equilibrium strategies, the hyperplane parameters are updated, incorporating the game-theoretic results as prior knowledge constraints. The main idea of the algorithm is that, given an incoming message x_t, a label is assigned using the classification function f(x_t) = w_{t-1}^T · x_t + b_{t-1}. If the Classifier's optimal strategy is not satisfied (equation 6), the hyperplane parameters are updated using a modified version of the SMO algorithm over the seen messages (the seenData set). A memory parameter m sets the number of messages kept in seenData. Then, every Gp periods, the sequential equilibrium strategies are updated using logit QRE. Finally, x_t is added to seenData and the type probabilities are updated, hence also the beliefs and Ψ(x_i), ∀i ∈ seenData. It is important to notice that the algorithm evolves dynamically as messages are presented to the Classifier.
5. PHISHING FEATURES, STRATEGIES AND TYPES EXTRACTION

5.1 Corpus Description
The previously defined classifier will be tested on an English-language phishing and ham email corpus built from Jose Nazario's phishing corpus [22] and the SpamAssassin ham collection. The phishing corpus² consists of 4450 emails manually retrieved from November 27, 2004 to August 7, 2007. The SpamAssassin collection, from the Apache SpamAssassin Project³, consists of 6951 ham email messages. The email collection was saved in the Unix mbox email format and processed using Perl scripts.

² Available at http://monkey.org/~jose/wiki/doku.php?id=PhishingCorpus
³ Available at http://spamassassin.apache.org/publiccorpus/
5.2 Basic Features
As initially described in [11] and then in [5, 2, 4], the extraction of basic content-based features is needed for a minimal representation of phishing emails. These features are associated with structural properties of the email, link analysis, programming elements and the output of spam filters:

• Structural properties are captured as four binary features defined by the MIME (Multipurpose Internet Mail Extensions) standard, related to the possible number of email formats, stating information about the total number of different body parts (all body parts, discrete body parts, composite body parts and alternative body parts).

• Link analysis provides seven binary features related to the properties of every link in an email message: the existence of links in a message, whether the number of internal links is greater than one, whether the number of external links is greater than one, whether the number of links with IP numbers is greater than one, whether the number of deceptive links⁴ is greater than one, whether the number of links behind images is greater than one, and whether the maximum number of dots in any link of the message is greater than 10.

• Programming elements are defined as binary features representing whether HTML, JavaScript and forms are used in a message.

• Finally, the SpamAssassin filter's output score for an email was used, indicating with a binary feature whether the score was greater than 5.0, the recommended spam threshold.

⁴ Links whose real URL is different from the URL presented to the email reader.
It is important to notice that all previously mentioned features (15 in total) are directly extracted from content-based properties of an email message, and each one can be considered a strategy for the Adversary to defeat the Classifier.
5.3 Word List and Clustering Features
The previously mentioned features are not sufficient for an appropriate characterization of a phishing message, and are clearly not a complete representation of the adversarial strategies. Following the content-based extraction, a new list of features is proposed to characterize phishing emails, which is directly associated with the possible strategy set $A_A$ of the Adversary.

In the following, word-based features are described as an approach to complete the needed representation of phishing strategies. These features are represented as a binary variable for each word in a list of keywords, whose value is 1 if the word is used in the document and 0 otherwise. Also, a feature for each cluster of words, defined as the keyword clusters, is considered. The main idea is that phishing strategies are defined by the lists of words used in a message; so, for each keyword cluster (Adversary type), a list of relevant words is associated, representing a phishing strategy.
First, stop-word removal and stemming pre-processing are necessary to set up the email database. Let $R$ be the total number of different words in the complete collection of phishing emails, and $Q$ the total number of emails. A vectorial representation of the phishing corpus is given by $M = (m_{ij})$, $i = 1, ..., R$, $j = 1, ..., Q$, where $m_{ij}$ is the weight expressing how important a given word is in a document relative to the others. The $m_{ij}$ weights considered in this research are defined as an improvement of the tf-idf term [34] (term frequency times inverse document frequency), defined by

$$m_{ij} = f_{ij} (1 + sw(i)) \times \log\left(\frac{Q}{n_i}\right) \quad (9)$$

where $f_{ij}$ is the frequency of the $i$th word in the $j$th document, $sw(i)$ is a relevance factor associated with word $i$ within a set of words, and $n_i$ is the number of documents containing word $i$. In this case, $sw(i) = w^i_{email} / TE$, where $w^i_{email}$ is the frequency of word $i$ over all documents and $TE$ is the total number of emails. The tf-idf term is a weighted representation of the importance of a given word in a document belonging to a collection of documents: the term frequency indicates the weight of each word in a document, while the inverse document frequency states whether the word is frequent or uncommon across the collection, assigning it a lower or higher weight respectively.
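A small illustrative sketch (hypothetical names) of the modified tf-idf weighting in equation (9):

import math
from collections import Counter

def modified_tfidf(docs):
    """docs: list of tokenized documents (lists of stemmed words).
    Returns {(word, doc_index): m_ij} following equation (9)."""
    Q = len(docs)
    n = Counter()                       # n_i: number of documents containing word i
    total = Counter()                   # overall frequency of word i
    for doc in docs:
        total.update(doc)
        n.update(set(doc))
    TE = Q                              # total number of emails
    weights = {}
    for j, doc in enumerate(docs):
        f = Counter(doc)                # f_ij: frequency of word i in document j
        for word, fij in f.items():
            sw = total[word] / TE       # sw(i) = w_email^i / TE
            weights[(word, j)] = fij * (1 + sw) * math.log(Q / n[word])
    return weights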
Based on the previous tf-idf representation, a clustering technique must be considered for the segmentation of the whole collection of phishing emails. K-means clustering was used, with the cosine between documents as the distance function (see equation 10):

$$\cos(m_i, m_j) = \frac{\sum_{k=1}^{R} m_{ki} m_{kj}}{\sqrt{\sum_{k=1}^{R} m_{ki}^2} \sqrt{\sum_{k=1}^{R} m_{kj}^2}} \quad (10)$$
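For reference, equation (10) in a couple of lines (illustrative only):

import numpy as np

def cosine(mi, mj):
    """Cosine similarity between two document weight vectors (equation 10)."""
    return float(mi @ mj / (np.linalg.norm(mi) * np.linalg.norm(mj)))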
The optimal number of clusters was determined using as stopping rules the minimization of the within-cluster distance and the maximization of the between-cluster distance. Then, for each cluster, the most relevant words are determined by

$$Cw(i) = \sqrt[|\zeta|]{\prod_{p \in \zeta} m_{ip}} \quad (11)$$

for $i \in \{1, .., R\}$, where $Cw$ is a vector containing the geometric mean of each word's weights over the messages contained in a given cluster. Here, $\zeta$ is the set of documents in the cluster and $m_{ip}$ is given by equation (9). Finally, the most important words for each cluster are determined by ordering the weights of the vector $Cw$. This procedure is based on previous work described in [33], except that there the application was web mining and the clustering algorithm used was Self-Organizing Feature Maps. Results of this method showed that the optimal number of clusters is 13; the 30 most relevant words of each cluster were considered as features (403 in total). The five most relevant words of each cluster are presented in Table 1.
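A sketch of the per-cluster keyword ranking of equation (11), computed in log space for numerical stability (hypothetical names):

import numpy as np

def cluster_relevant_words(M, cluster_cols, top_k=30):
    """Top-k words of one cluster by geometric-mean weight (equation 11).

    M: R x Q matrix of m_ij weights; cluster_cols: document indices in zeta."""
    sub = M[:, cluster_cols]
    log_cw = np.log(np.maximum(sub, 1e-12)).mean(axis=1)  # log of geometric mean
    return np.argsort(log_cw)[::-1][:top_k]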
5.4 Strategies and Types Extraction
Based on the previously mentioned features (418 in total), a feature selection algorithm is used to improve the performance of the classification algorithms, eliminating noisy features that do not represent the target value at all and do not give enough information about the real phenomenon observed by the game agents. This is a key step for pruning word features that were somewhat arbitrarily selected as the 30 most relevant words of each cluster, giving a final list of relevant words for the phishing/ham classification problem.
Table 1: The five most relevant words for each of the 13 clusters of the phishing corpus.

Cluster  Word 1   Word 2    Word 3    Word 4      Word 5
1        limit    use       credit    card        provid
2        address  follow    bill      communiti   violat
3        ebay     secur     bank      access      user
4        chase    repli     payment   answer      info
5        vector   area      desktop   loan        keybank
6        account  paypal    messag    inform      updat
7        signin   list      partner   site        offer
8        amazon   union     never     maintain    world
9        ebay     email     page      polici      help
10       login    respons   verif     window      yahoo
11       area     demo      hidden    expens      image
12       use      sidebar   card      repli       review
13       union    nation    answer    googl       barclay
An information gain algorithm was implemented, similar to the one used in decision trees: the information gain of each feature was calculated over the whole database, eliminating those features that did not reach a minimum threshold. A total of 153 features were eliminated, yielding a final data set of 265 features.
After the feature selection step, the Adversary's types $t_i \in T$ are extracted using K-means clustering over the whole collection of emails (phishing and ham). The number of clusters over the whole set of features, $K_{features}$, therefore represents the total number of types for the Adversary player. For each message $x$, represented by a vector of 265 variables, the type is determined by

$$t_i = \arg\min_{i} d(x, C_i), \quad \forall i \in \{1, .., K_{features}\} \quad (12)$$

where $C_i$ is the centroid of cluster $i$ and the function $d : \mathbb{R}^a \times \mathbb{R}^a \to \mathbb{R}$ represents the distance between two vectors of dimension $a$. The distance function used in this research is the Hamming distance, given by the number of bits that must be changed to transform one vector into the other.
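A sketch of the type assignment of equation (12), assuming binarized centroids so that the Hamming distance applies (hypothetical names):

import numpy as np

def assign_type(x, centroids):
    """Equation (12): index of the nearest centroid under the Hamming distance.

    x: binary feature vector of length a; centroids: K x a binary matrix."""
    distances = (centroids != x).sum(axis=1)  # Hamming distance to each centroid
    return int(np.argmin(distances))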
Results show that $K_{features} = 7$, distributed among four predominantly phishing clusters and three predominantly ham clusters. Clusters were obtained following the same stopping rule used in the keyword clustering: the minimization of the within-cluster distance and the maximization of the between-cluster distance.

As stated in Section 3 and Appendix A, strategies are directly associated with the utility function modeling for the Classifier, where the usage of any of the 265 previously stated features can increase or decrease the utility function depending on the classification strategy considered by the Classifier.
6. EXPERIMENTS
The classification of phishing emails is a natural extension of text mining, where the most promising classification algorithms are Support Vector Machines, naïve Bayes and Random Forests, among other text categorization algorithms [27]. A considerable problem is the online setting associated with the nature of an email inbox, where messages arrive from an effectively infinite stream. In this context, the following experimental settings were defined to provide proper benchmarks for the proposed feature extraction against previous results and batch-learning SVMs. A further objective of the experimental setting is to show the accuracy and effectiveness of different online classification algorithms compared with the proposed adversary-aware classifier.
First, a 10 times 10-fold cross-validation learning scheme using an SVM was run on the complete database characterized with 265 features, using the libSVM library [7], and the same learning scheme was used to train a naïve Bayes model implemented in Weka [36].

Then, an incremental drift evaluation of SVMs was made using a stratified hold-out learning scheme: 20% of the database was used to train the classifier, and the remaining 80% was treated as a stream of arriving test messages without any update of the support vectors, as a proxy for the concept drift behavior of batch algorithms.

Finally, for the online setting, the Relaxed Online SVM proposed by Sculley in [25] was used, as well as an incremental evaluation of naïve Bayes, and the proposed adversary-aware classifier (BAAO-SVM) was evaluated under the same scheme.
The adversary-aware classifier was built using the 265 features as the Adversary's possible strategies and the set $\{+1, -1\}$ as the Classifier's strategies. Types were obtained with the type extraction method described previously, yielding a total of 7 clusters. The approximation of the sequential equilibria was determined using logit QRE, as implemented in the Gambit [20] command-line tool (gambit-logit). The Classifier's strategy (the adversary-aware classifier) described in Section 4 was implemented in C++, extending D. Sculley's Online SVM implementation [25] with a modified version of SMO for prior knowledge as described in [37].

The values of $\gamma$, $\epsilon_R$ and $\epsilon_M$ were set as an initial estimate over the primitives of the game. Further details on the selection of these model parameters are intentionally omitted by the authors.
6.1 Evaluation Criteria
The resulting confusion matrix of this binary classification task can be described using four possible outcomes: correctly classified phishing messages or True Positives (TP), correctly classified ham messages or True Negatives (TN), ham messages wrongly classified as phishing or False Positives (FP), and phishing messages wrongly classified as ham or False Negatives (FN). The evaluation criteria considered are common machine learning measures constructed from these classification outcomes.
• The False Positive Rate (FP-Rate) and the False Negative Rate (FN-Rate): the proportions of wrongly classified ham and phishing email messages, respectively.

$$\text{FP-Rate} = \frac{FP}{FP + TN} \quad (13)$$

$$\text{FN-Rate} = \frac{FN}{FN + TP} \quad (14)$$

• Precision: the degree to which messages identified as phishing are indeed malicious. It can be interpreted as the classifier's safety.

$$\text{Precision} = \frac{TP}{TP + FP} \quad (15)$$

• Recall: the percentage of phishing messages that the classifier manages to classify correctly. It can be interpreted as the classifier's effectiveness.

$$\text{Recall} = \frac{TP}{TP + FN} \quad (16)$$

• F-measure: the harmonic mean of precision and recall.

$$\text{F-measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \quad (17)$$

• Accuracy: the overall percentage of correctly classified email messages.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (18)$$
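These criteria translate directly into code; a small sketch (hypothetical names):

def confusion_metrics(tp, tn, fp, fn):
    """Evaluation criteria (13)-(18) from the confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "FP-Rate": fp / (fp + tn),
        "FN-Rate": fn / (fn + tp),
        "Precision": precision,
        "Recall": recall,
        "F-measure": 2 * precision * recall / (precision + recall),
        "Accuracy": (tp + tn) / (tp + tn + fp + fn),
    }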
7. RESULTS

7.1 Word List and Clustering Features

As shown in Tables 2 and 3, the F-measure obtained for the 10 times 10-fold cross-validation SVM is 99.32%, and for naïve Bayes under the same learning scheme the F-measure is 94.84%. Previous results for the same email corpus (phishing and ham) reported an F-measure of 99.89%, obtained by Bergholz et al. in [5]. On some evaluation measures this work's results are slightly worse than those previously obtained, but they remain highly competitive, since most of the features considered after the feature extraction and selection algorithms differ from those proposed by previous authors. This points out an interesting open question: as future work, a combined feature extraction technique could achieve better results. The False Positive Rate, however, is considerably better than previously obtained, with a value of 0.33% compared to 1.11%.

7.2 Online Algorithms Performance

Identifying the online property of learning algorithms is not an easy task. In this work, as a first approach, the applicability and accuracy of the overall proposed algorithm were tested using the classification performance measures of Section 6.1. In this context, ROSVM obtained an F-measure of 86.01% with an accuracy of 85.20%; an online version of naïve Bayes obtained an F-measure of 85.20% with an accuracy of 81.18%; and the proposed adversary-aware classifier (BAAO-SVM⁵) obtained an F-measure of 87.69% with an accuracy of 86.63%, a better performance than the other online classification algorithms on these evaluation criteria.

⁵ BAAO-SVM: Bayesian Adversary Aware Online SVM.

Table 2: Experimental results for the benchmark machine learning algorithms: FP-Rate, FN-Rate and Accuracy evaluation criteria.

Model              FP-Rate   FN-Rate   Accuracy
Bergholz's SVM     0.07%     1.11%     99.52%
10x10 CV SVM       1.21%     0.33%     99.48%
Naïve Bayes        4.47%     6.60%     94.31%
Inc. drift SVM     5.33%     10.60%    91.60%
Inc. Naïve Bayes   1.33%     25.66%    81.18%
Online SVM         15.45%    14.26%    85.20%
BAAO-SVM           14.69%    12.26%    86.63%

Table 3: Experimental results for the benchmark machine learning algorithms: Precision, Recall and F-measure evaluation criteria.

Model              Precision   Recall   F-measure
Bergholz's SVM     99.89%      99.89%   99.89%
10x10 CV SVM       99.67%      98.97%   99.32%
Naïve Bayes        93.35%      96.38%   94.84%
Inc. drift SVM     95.90%      89.40%   92.54%
Inc. Naïve Bayes   99.78%      74.34%   85.20%
Online SVM         85.20%      86.83%   86.01%
BAAO-SVM           87.64%      87.74%   87.69%

8. CONCLUSIONS AND FUTURE WORK

An extension of the Adversarial Classification framework for adversarial data mining was presented, considering dynamic games of incomplete information, or signaling games, as a new approach to making classifiers improve their performance in adversarial environments. This approach relied on strong assumptions about the Adversary's strategies, the utility function modeling for the Classifier, and experimental setups related to the database processing. As a first approach, interesting empirical results were obtained.
The proposed adversary-aware classifier, whose core is the Support Vector Machine model, considers a signaling game in which beliefs, mixed strategies and the probabilities of the message types are updated and incorporated as prior knowledge as new messages are presented. This enables the classifier to change the margin error parameter dynamically as the game evolves, giving it an embedded awareness of the adversarial environment. More specifically, this is reflected in the misclassification constraint of the SVM optimization problem. The results obtained were promising compared with previous online text categorization algorithms used for email filtering.
Feature extraction is a key component of the game strategies and types of the proposed dynamic game of incomplete information. Results showed that the proposed feature extraction and selection techniques are highly competitive in comparison with previous feature extraction work. Future work could consider a mixture of previous and present feature extraction techniques; this could yield a better estimate of the strategy space for the Adversary, and therefore of the Adversary types. This is an important topic that directly affects the definition of the signaling game between the Adversary and the Classifier, and hence the Classifier's performance.

Determining the actual concept drift of the game is an important open question. An experimental setup to show the impact on the classifier of including new Adversary strategies within an already defined set of strategies (features) might help to answer this question.
In the game modeling, the Adversary's strategies could be estimated using a linear programming problem, as previous authors recommended in the original Adversarial Classification framework [8]. Even so, this first approach to adversarial classification with dynamic games of incomplete information showed interesting empirical and theoretical results. Another direction is an extension of the theoretical aspects of the game theory framework, such as refinements of these equilibria, using for example the intuitive criterion proposed by Cho and Kreps, among other special refinements of the perfect Bayesian equilibrium.
9. ACKNOWLEDGMENTS

Support from the Millennium Science Institute on Complex Engineering Systems (http://www.sistemasdeingenieria.cl) and the Center for Analysis and Modeling for Security (CEAMOS) is gratefully acknowledged.
10. REFERENCES

[1] S. Abu-Nimeh, D. Nappa, X. Wang, and S. Nair. A comparison of machine learning techniques for phishing detection. In eCrime '07: Proceedings of the Anti-Phishing Working Group's 2nd Annual eCrime Researchers Summit, pages 60–69, New York, NY, USA, 2007. ACM.
[2] A. Bergholz, J. De Beer, S. Glahn, M.-F. Moens, G. Paass, and S. Strobel. New filtering approaches for phishing email. Journal of Computer Security, 2009. Accepted for publication.
[3] M. Barreno, B. Nelson, R. Sears, A. D. Joseph, and J. D. Tygar. Can machine learning be secure? In ASIACCS '06: Proceedings of the 2006 ACM Symposium on Information, Computer and Communications Security, pages 16–25, New York, NY, USA, 2006. ACM.
[4] R. Basnet, S. Mukkamala, and A. H. Sung. Detection of phishing attacks: A machine learning approach. In Studies in Fuzziness and Soft Computing, pages 373–383. Springer Berlin / Heidelberg, 2008.
[5] A. Bergholz, J.-H. Chang, G. Paass, F. Reichartz, and S. Strobel. Improved phishing detection using model-based features. In Fifth Conference on Email and Anti-Spam, CEAS 2008, 2008.
[6] B. Biggio, G. Fumera, and F. Roli. Adversarial pattern classification using multiple classifiers and randomisation. In SSPR/SPR, pages 500–509, 2008.
[7] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[8] N. Dalvi, P. Domingos, Mausam, S. Sanghai, and D. Verma. Adversarial classification. In Proceedings of the Tenth International Conference on Knowledge Discovery and Data Mining, pages 99–108, Seattle, WA, USA, 2004. ACM Press.
[9] J. S. Downs, M. Holbrook, and L. F. Cranor. Behavioral response to phishing risk. In eCrime '07: Proceedings of the Anti-Phishing Working Group's 2nd Annual eCrime Researchers Summit, pages 37–44, New York, NY, USA, 2007. ACM.
[10] J. S. Downs, M. B. Holbrook, and L. F. Cranor. Decision strategies and susceptibility to phishing. In SOUPS '06: Proceedings of the Second Symposium on Usable Privacy and Security, pages 79–90, New York, NY, USA, 2006. ACM.
[11] I. Fette, N. Sadeh, and A. Tomasic. Learning to detect phishing emails. In WWW '07: Proceedings of the 16th International Conference on World Wide Web, pages 649–656, New York, NY, USA, 2007. ACM.
[12] D. Fudenberg and J. Tirole. Game Theory. MIT Press, October 1991.
[13] C. Gentile. A new approximate maximal margin classification algorithm. Journal of Machine Learning Research, 2:213–242, December 2001.
[14] R. Gibbons. Game Theory for Applied Economists. Princeton University Press, 1992.
[15] J. Goodman, G. V. Cormack, and D. Heckerman. Spam and the ongoing battle for the inbox. Communications of the ACM, 50(2):24–33, 2007.
[16] J. C. Harsanyi. Games with incomplete information played by Bayesian players. The basic probability distribution of the game. Management Science, 14(7):486–502, 1968.
[17] D. M. Kreps and R. Wilson. Sequential equilibria. Econometrica, 50(4):863–894, July 1982.
[18] D. Lowd and C. Meek. Adversarial learning. In KDD '05: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 641–647, New York, NY, USA, 2005. ACM.
[19] M. Kantarcioglu, B. Xi, and C. Clifton. A game theoretic framework for adversarial learning. In CERIAS 9th Annual Information Security Symposium, 2008.
[20] R. D. McKelvey, A. M. McLennan, and T. L. Turocy. Gambit: Software tools for game theory, version 0.2007.01.30, 2007.
[21] R. D. McKelvey and T. R. Palfrey. Quantal response equilibria for normal form games. Games and Economic Behavior, 10:6–38, 1995.
[22] J. Nazario. Phishing corpus, 2004–2007.
[23] B. Nelson, M. Barreno, F. J. Chi, A. D. Joseph, B. I. P. Rubinstein, U. Saini, C. Sutton, J. D. Tygar, and K. Xia. Exploiting machine learning to subvert your spam filter. In LEET '08: Proceedings of the 1st USENIX Workshop on Large-Scale Exploits and Emergent Threats, pages 1–9, Berkeley, CA, USA, 2008. USENIX Association.
[24] J. Platt. Sequential minimal optimization: A fast algorithm for training support vector machines, 1998.
[25] D. Sculley and G. M. Wachman. Relaxed online SVMs for spam filtering. In SIGIR '07: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 415–422, New York, NY, USA, 2007. ACM.
[26] O. Sönmez. Learning game theoretic model parameters applied to adversarial classification. Master's thesis, Saarland University, 2008.
[27] F. Sebastiani. Text categorization. In A. Zanasi, editor, Text Mining and its Applications to Intelligence, CRM and Knowledge Management, pages 109–129. WIT Press, Southampton, UK, 2005.
[28] R. Dhamija and J. D. Tygar. The battle against phishing: Dynamic Security Skins. In SOUPS '05: Proceedings of the 2005 Symposium on Usable Privacy and Security, pages 77–88. ACM Press, 2005.
[29] T. L. Turocy. A dynamic homotopy interpretation of the logistic quantal response equilibrium correspondence. Games and Economic Behavior, 51(2):243–263, May 2005.
[30] T. L. Turocy. Using quantal response to compute Nash and sequential equilibria. Economic Theory, 42(1), 2010.
[31] V. N. Vapnik. The Nature of Statistical Learning Theory (Information Science and Statistics). Springer, 1999.
[32] J. D. Velásquez, S. A. Rios, A. Bassi, H. Yasuda, and T. Aoki. Towards the identification of keywords in the web site text content: A methodological approach. IJWIS, 1(1):53–57, 2005.
[33] J. D. Velásquez, H. Yasuda, T. Aoki, and R. Weber. A new similarity measure to understand visitor behavior in a web site. IEICE Transactions on Information and Systems, Special Issue on Information Processing Technology for Web Utilization, E87-D(2):389–396, 2004.
[34] H. Wang, W. Fan, P. S. Yu, and J. Han. Mining concept-drifting data streams using ensemble classifiers. In KDD '03: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 226–235, New York, NY, USA, 2003. ACM.
[35] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco, 2nd edition, 2005.
[36] X. Wu and R. Srihari. Incorporating prior knowledge with weighted margin support vector machines. In KDD '04: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 326–333, New York, NY, USA, 2004. ACM.
[37] P. Zhang, X. Zhu, and Y. Shi. Categorizing and mining concept drifting data streams. In KDD '08: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 812–820, New York, NY, USA, 2008. ACM.
APPENDIX
A. CLASSIFIER’S UTILITY FUNCTION
Based on a simple example, the notion of the utility function for the Classifier is presented; then a general expression for the utility function is given, and analytical expressions for $\Delta U_{C,R}^{t_i}(x_j)$ and $\Delta U_{C,M}^{t_i}(x_j)$ are deduced.

Consider a database characterized by just one feature, $x_1$. This feature represents the usage of a phishing word, say the word "paypal". Therefore, if "paypal" is used in a message, then $x_1 = 1$, and $x_1 = 0$ otherwise. Suppose we have a classification function defined by $y = w \cdot x_1 + b$. If $y \geq \tau$, for a given threshold $\tau$, the message is classified as phishing (or malicious), and if $y < \tau$ it is classified as regular (or non-malicious). As a strong assumption, the construction and modeling of the utility function considers that every feature with a value different from 0 represents a phishing strategy.

Suppose the real type of a given message is malicious. If the malicious feature $x_1$ is used in the message, then the value of the classification function is $y = w + b$, so there is a high probability that $w + b > \tau$, and therefore that the message is classified as malicious. The utility of the Classifier, if the message is classified as malicious, is proposed to be $U_C = \epsilon_M (w + b)$, the maximum reward to the Classifier for performing a good job. If instead the Classifier decides that the message is not malicious, even though $x_1 = 1$, it must be penalized with its maximum cost $U_C = -\gamma \epsilon_M (w + b)$, where $\gamma > 1$ is a misclassification cost parameter. Now, if $x_1 = 0$, the Classifier is rewarded (or penalized) with a lower value, as it either manages to classify a message without phishing properties as malicious, or misclassifies a malicious message without phishing properties, respectively. Also, to differentiate these payoffs from the case where the message's real type is regular, an $\epsilon_M$ parameter is introduced. Here, $C(x_1)$ is the prediction of the classification function.
Based on the previous idea, the Classifier's payoffs for the malicious and regular cases are defined as follows:

$$U_C^{Malicious}(x_1, C(x_1)) = \begin{cases} \epsilon_M \cdot (w + b) & \text{if } x_1 = 1,\; C(x_1) = +1 \\ \epsilon_M \cdot b & \text{if } x_1 = 0,\; C(x_1) = +1 \\ -\gamma \cdot \epsilon_M \cdot (w + b) & \text{if } x_1 = 1,\; C(x_1) = -1 \\ -\gamma \cdot \epsilon_M \cdot b & \text{if } x_1 = 0,\; C(x_1) = -1 \end{cases}$$
If the real type of a given message is regular, the payoffs are quite different. Here, the maximum payoff is achieved when the Classifier decides the message is not phishing and the message does not use a phishing characteristic. The minimum payoff occurs when the regular message does not use a phishing characteristic but is classified as malicious. Without loss of generality, an $\epsilon_R$ parameter is introduced to differentiate these payoffs from the previous case.

$R · (w + b)



$ · b
R
Regular
UC
(x1 , C(x1 )) =
−γ · $R · (w + b)



−γ · $R · b
if
if
if
if
x1
x1
x1
x1
= 0,
= 1,
= 0,
= 1,
C(x1 ) = −1
C(x1 ) = −1
C(x1 ) = +1
C(x1 ) = +1
For the general case with $a$ features⁶, following the same idea as in the simple case where $a = 1$, the classification function is defined by $y = w^T \cdot x + b$, where $\dim(x) = \dim(w) = a$. All this is associated with the mixed strategies of playing $C(x_j) = +1$ or $C(x_j) = -1$, represented by $\sigma_C^*(+1|x_j)$ and $\sigma_C^*(-1|x_j)$ respectively. It can be shown that

$$\begin{aligned} \Delta U_{C,M}^{t_i}(x_j) &= \sigma_C^*(+1|x_j) \, U_C(t_{M,x_i}, x_j, +1) - \sigma_C^*(-1|x_j) \, U_C(t_{M,x_i}, x_j, -1) \\ &= \sigma_C^*(+1|x_j) \, \epsilon_M \cdot (w^T \cdot x_j + b) - \sigma_C^*(-1|x_j) \left( \epsilon_M \cdot (-\gamma) \cdot (w^T \cdot x_j + b) \right) \\ &= \epsilon_M \left( \sigma_C^*(+1|x_j) + \gamma \cdot \sigma_C^*(-1|x_j) \right) \cdot (w^T \cdot x_j + b) \end{aligned}$$

Following the same idea, when the message's real type is regular, it can be shown that

$$\Delta U_{C,R}^{t_i}(x_j) = \epsilon_R \cdot \left( \sigma_C^*(-1|x_j) + \sigma_C^*(+1|x_j) \, \gamma \right) \cdot (w^T \cdot (e - x_j) + b)$$

where $e$ is a vector of ones of dimension $a$. It is important to note that this utility function representation is specially designed for binary features that only capture malicious strategies.

⁶ $a$ is the number of features, or strategies, revealed to the Classifier by a given Adversary.
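For concreteness, the single-feature payoff tables above can be written as one function (an illustrative sketch with hypothetical names):

def payoff_classifier(is_malicious, x1, c, w, b, eps_M, eps_R, gamma):
    """Single-feature Classifier payoff from Appendix A.

    is_malicious: true type of the message; x1: binary feature value;
    c: Classifier action in {+1, -1}; gamma > 1 penalizes misclassification."""
    if is_malicious:
        base = eps_M * ((w + b) if x1 == 1 else b)
        return base if c == +1 else -gamma * base
    base = eps_R * ((w + b) if x1 == 0 else b)
    return base if c == -1 else -gamma * base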
Data Security and Integrity: Developments and Directions
Bhavani Thuraisingham
Department of Computer Science
University of Texas at Dallas
[email protected]
ABSTRACT
Data is a critical resource in numerous organizations. One of the
challenging problems facing these organizations today is to
ensure that only authorized individuals have access to data.
Data also has to be protected from malicious corruption.
Much of the early work on data security focused on multilevel
secure data management systems where users have different
clearance levels and data has different sensitivity levels and
access to data is governed by the security policies. There were
many efforts on securing relational, distributed and object-oriented databases. More recently, several aspects of data
security are being investigated including data confidentiality,
integrity, trust and privacy. Furthermore, securing data
warehouses and the semantic web, as well as applying data mining to solving security problems, are getting a lot of attention.
This presentation will review the developments in data security
and integrity as well as discuss directions for further research
and development. In particular, policy management for the
semantic web, assured information sharing, privacy preserving
data mining and novel ways to build secure data management
systems will be discussed.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
KDD-WS'09, June 28, 2009, Paris, France.
Copyright 2009 ACM 978-1-60558-669-4...$5.00
BIOGRAPHY
Dr. Bhavani Thuraisingham joined The University of Texas at
Dallas (UTD) in October 2004 as a Professor of Computer
Science and Director of the Cyber Security Research Center in
the Erik Jonsson School of Engineering and Computer Science.
She is an elected Fellow of three professional organizations: the
IEEE (Institute for Electrical and Electronics Engineers), the
AAAS (American Association for the Advancement of Science)
and the BCS (British Computer Society) for her work in data
security. She received the IEEE Computer Society’s prestigious
1997 Technical Achievement Award for “outstanding and
innovative contributions to secure data management.” Her
research interests are in Assured information sharing and
trustworthy semantic web; secure geospatial data management;
and Data mining for security applications. Prior to joining UTD,
Thuraisingham worked for the MITRE Corporation for 16 years
which included an IPA (Intergovernmental Personnel Act) at
the National Science Foundation. Her work in information
security and information management has resulted in over 80
journal articles, over 200 refereed conference papers and three US patents. She is the author of eight books in data management, data mining and data security and has given over 60 keynote
addresses on these topics. Prof. Thuraisingham’s website is
http://www.utdallas.edu/~bxt043000.
Towards Trusted Intelligence Information Sharing
Joseph V. Treglia
Joon S. Park
School of Information Studies
Syracuse University
Syracuse, NY, USA
School of Information Studies
Syracuse University
Syracuse, NY, USA
jvtregli@syr.edu
[email protected]
ABSTRACT
While millions of dollars have been invested in information
technologies to improve intelligence information sharing among
law enforcement agencies at the Federal, Tribal, State and Local
levels, there remains a hesitation to share information between
agencies. This lack of coordination hinders the ability to prevent
and respond to crime and terrorism. Work to date has not
produced solutions nor widely accepted paradigms for
understanding the problem. Therefore, to enhance the current
intelligence information sharing services between government
entities, in this interdisciplinary research, we have identified three
major areas of influence: Technical, Social, and Legal.
Furthermore, we have developed a preliminary model and theory
of intelligence information sharing through a literature review,
experience and interviews with practitioners in the field. This
model and theory should serve as a basic conceptual framework
for further academic work and lead to further investigation and
clarification of the identified factors and the degree of impact they
exert on the system so that actionable solutions can be identified
and implemented.
Categories and Subject Descriptors
C.2.0 [Computer-Communication Networks]: General—security and protection.
H.3.5 [Information Storage and Retrieval]: Online Information Services—data sharing.
K.4.1 [Computers and Society]: Public Policy Issues—transborder data flow.
K.6.1 [Management of Computing and Information Systems]: Security and Protection—unauthorized access.

General Terms
Management, Reliability, Security, Theory.

Keywords
Information Sharing, Intelligence Sharing, Security.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
KDD-WS'09, June 28, 2009, Paris, France.
Copyright 2009 ACM 978-1-60558-669-4...$5.00
1. INTRODUCTION
The term information sharing in law enforcement gained
popularity as a result of the 9/11 Commission Hearings and report
of the United States government's lack of response to information
that was known about planned terrorist attacks on the New York
City World Trade Towers prior to the events. This led to the
enactment of several executive orders by President Bush
mandating agencies implement policies to "share information"
across organizational boundaries (United States, 2007c).
Intelligence information sharing is the transfer of information
obtained that relates to an actual or impending occurrence of a
criminal or terrorist act. It includes suspicious activity reports
regarding incidents or observations which are of a less obvious
nature but which may be supportive of or related to criminal or terrorist-related activity.
An incident or activity of suspicious nature or that is outside of
the norm for a particular environment and circumstance could be
considered suspicious activity or intelligence information
depending upon the circumstances. The adjudication as to
whether an event is considered and captured as suspicious activity
or intelligence is subjective and left to the discretion of the officer
involved in the report or observation. Where a person of average
intelligence and familiarity with the normative environment
would believe an act may be a part of the cause or furtherance of
a criminal act, it would qualify as intelligence information. An
example may be helpful here: a person taking pictures of trains in
a train yard may not cause someone familiar with the area to be
concerned. Many people collect, model, and photograph trains.
However, if we add to this scenario that 1) the person is not from the surrounding area, 2) the person is taking photos of trains and the facilities, 3) the person becomes agitated when asked about her/his purpose, and 4) the person provides inaccurate information about where they are from or inconsistent versions of what they are doing, then perhaps the person is affiliated with a terrorist group or has criminal intentions. This all may bring the incident
to the level of a reportable incident, or suspicious activity report,
which would become intelligence information to be shared among
enforcement agencies. It is the combination of activity and
circumstance that is the trigger. This will also be referred to as
intelligence information or intelligence. It is yet another problem,
worthy of study, to identify the means by which various agencies
collect and manage this type of information.
While millions of dollars have been invested in information
technologies to improve information sharing capabilities among
all law enforcement agencies, according to the National Security
Agency (NSA) there remains a hesitation to share intelligence
information between agencies (Lieberman, 2007). Information
technologies for the future should provide for ubiquitous and
distributed computing and communication systems that deliver
transparent and high quality service, without disruption, and while
enabling and preserving privacy, security, trust, participation and
cooperation. In this paper we identify barriers affecting effective
and trusted information sharing between federal, tribal, state and
local law enforcement agencies in the United States. Research in
understanding these dynamics will lead to identification of
actionable solutions to law enforcement intelligence information
sharing across the federal, tribal, state and local levels.
2. SUMMARY OF RELATED WORK
Inter-organizational systems studied by management information
systems researchers have primarily focused on the private sector
and do not directly apply to the government sector (Lai &
Mahapatra, 1997). Preliminary work involved an exploration of
conditions for cooperation between emergency management
agencies where perceived information assurance of others and
information sharing standards were more strongly related to
information sharing than were cultural norms, in emergency
contexts (Lee & Rao, 2007). Research on emergency services
reported that technical environments, such as other agencies'
information assurance level and technical standards, seemed to
encourage information sharing systems use. There is limited work
available that focuses on interagency information sharing issues
in the law enforcement sector; examples include studies done in
agencies outside the United States (Glomseth et al., 2007; Jing
& Pengzhu, 2007). Such studies may not generalize to this
environment in many respects. Some studies done in the United
States do involve looking at agencies sharing information at the
same levels (Pardo et al., 2006a). There is initial work on cultural
influences on information sharing behaviors in the public sector
(Luna-Reyes et al., 2007). There is recent work on organizational
capability assessment for information systems development,
which involved criminal justice agencies and also included
cultural considerations and system complexity issues
(Cresswell et al., 2007). Further examination of the influence of
perception of technological issues, culture, trust, and legal or
policy issues in the law enforcement and public sector context is
necessary.
3. KEY INFLUENCES ON TRUSTED INFORMATION SHARING
In order to enhance the current intelligence information sharing
services between government entities, we identified three major
areas of influence: Technical, Social, and Legal (summarized in
Figure 1) in our previous work (Treglia and Park, 2009).
In this paper, we further develop the preliminary model and
theory of intelligence information sharing through a literature
review, experience, and interviews with practitioners in the field.
Within each area, we identify and discuss the individual factors
that influence whether or not intelligence information is
ultimately shared.
Figure 1. Key Influences on Trusted Information Sharing. (The figure groups the influences on Intelligence Information Sharing as Technical: Interoperability, Availability, Control; Social: Trust, Shadow Network, Criticality; Legal: Policy Conflict, Governance.)
3.1 Technical Influences
3.1.1 Interoperability
The interoperability of information systems and the data elements
captured and used was found to be a problematic issue. Tools
such as XML are widely used in business development of Web
services and for B2B integration and data exchange (Lampathaki
et al., 2008; Fernández-Medina and Yagüe, 2008). Although 82%
of non-federal law enforcement agencies in the United States use
computers for Internet access, unified standards for information
systems have not been universally accepted by law enforcement
entities (U.S. Dept. of Justice Statistics, 2006). This has led to
hardware, software, and network inconsistencies (Chau et al.,
2002; United States, 2007b). It is also related to the definition of
fields and data descriptions. With more than 19,000 law
fields and data descriptions. With more than 19,000 law
enforcement agencies in the United States, each having their own
systems and hierarchy, it is no wonder that there are issues with
compatibility between agencies and systems when you try to
collaborate or interconnect ("Bureau of Justice Statistics Law
Enforcement Statistics," 2007). As a matter of fact, there are
many different information systems currently being used by law
enforcement agencies for data management and communication,
such as COPLINK, OneDOJ, N-DEx, ALECS, LInX and others
(Bulman, 2008; Chen et al., 2003; McKay, 2008).
Furthermore, the lack of interoperability in regulations hinders trust
between agencies. For instance, there are no broadly accepted
standards for security clearances and subsequent access across
agencies. The federal government has a lengthy process for
approving access to intelligence information and there is no
provision to readily accept security clearances from other federal
or non-federal agencies (Whitehouse, 2007). A state police
officer with secret clearance in his agency does not carry this
standard or designation with other local or federal agencies.
Security clearances even across federal agencies do not
automatically transfer and must be reevaluated and reassessed by
the individual agency. A justifiable concern is that there are not
universal standards for hiring and background checks across the
various agencies. The process used to verify a person's credibility
in a given agency may not be adequate for a certain level of
secure access at another agency. This is a tremendous obstacle to
sharing information among agencies and feeds into a perception
of mistrust across agencies. This is, however, an area which can
be addressed through legislative changes and changes to internal
agency processes.
3.1.2 Availability
Availability means that the systems must respond in a timely
manner. These systems must have a high degree of survivability
and function in mission critical environments where parts of the
network may be compromised but accurate service must be
continued (Park et al., 2009; Schooley, 2007). For instance,
network availability impacts acceptance and use of systems (Chan
& Teo, 2007; Koroma et al., 2003). Furthermore, information
must be kept up to date and in accordance with the users' interests
and needs. As new systems integrate with the variety of
technologies and protocols in use, the complexity of processing
and connections increases, and system performance and reliability
become taxed. Systems become more prone to delays or failures
as they must incorporate legacy and other protocols into their core
programming and functions. Increases in security overhead also
add to the workload and increase the potential for system delays
or failure. Systems that are considered slow or non-responsive
according to the expectations of the users will have a hard time
being adopted.
3.1.3 Control
Control as perceived by the users is required for information
sharing and systems adoption. Information-sharing systems must
be capable of controlling, monitoring and managing all usage and
dissemination of intelligence information for tracking purposes to
provide assurance, which is required for trust (Li et al., 2008).
There is no broadly accepted set of minimum security and access
control standards and protocols for intelligence information
systems that have been uniformly adopted for use across federal,
tribal, state and local agencies (Cresswell et al., 2007).
Distributed workflow control tasks in these integrated and grid
environments may increase the level of information sharing,
availability, and cost effectiveness, but, on the flip side, they also
increase the complexity and control problems (Serra da Cruz et
al., 2008; Park et al., 2001). Therefore, provenance and user
control tasks and capabilities must be suited to these varied
environs in a trusted information-sharing system.
3.2 Social Influences
3.2.1 Trust
Trust is a key influencer of sharing behavior. Here trust refers to
the degree to which the person with intelligence information trusts
other people in other agencies who may receive information or
have access to it. Trust has been identified as an area of concern
in much of the information systems and management research
(Gao, 2005; Humenn et al., 2004; Jing & Pengzhu, 2007; Koufaris
& Hampton-Sosa, 2004; Lee, 2006; Lee et al., 2008; Xiong &
Liu, 2004; McKnight et al., 2002; Niu, 2007; Razavi & Iverson,
2006; Ruppel et al., 2003; Schoorman et al., 2007; Zhang, 2005;
Rocco, 1998; ISAC White Paper, 2004; Li et al., 2008; Ray,
2004; Chakraborty et al., 2006; Park et al., 2006; Park et al.,
2007).
Trust occurs at the individual and organizational levels. It includes
other law enforcement officers and extends to other staff or
persons who might gain access to information were it made
available to them, and it assumes that there is a means for sharing
this information (Scott, 2006). Agencies treat information security
differently and there is a reality to corruption in agencies at any
level (Ivkovic & Shelley, 2005). The person responsible for
sharing intelligence information may have personal knowledge of
individual employees whom they do not trust, or a general
impression or bias, correct or not, about the security within the
agency. Personal impressions do influence the decision to share
information on a unified system or not. The person deciding to
share the information weighs trust in this way.
Trust may weigh heavily on the decision to provide information
as well (Niu, 2007; Pardo et al., 2006b). A very trusting person is
more likely to freely provide information to the system than
someone who is more apprehensive or who has specific concerns
such as those above.
3.2.2 Shadow Network
Shadow networks involve the situation where a personal or
agency connection, in or outside of the workplace, creates a
conflict of interest and the organization or individual may not act
in a non-biased, objective manner. This may involve personal
friendships, affiliations or family ties and connections through
other activities or interests outside the workplace. This can have
positive and negative effects for organizations (Ingram &
Lifschitz, 2006). Intelligence information that may negatively
impact an agency or key individuals or associates may be
withheld and not shared by the organization involved. The stigma
or interpersonal links behind the scenes play a role in interaction
and sharing decisions (Kulik et al., 2008). This is related to the
organizational notion of shadow systems, which are described by
Stacey (1996) as "the complex web of interactions in which social,
covert political and psycho-dynamic systems coexist in tension
with the legitimate system (Shaw, 1997; Stacey, 1996)." There is
an obvious link here to personal integrity and to the social impacts
of potentially damaging information that hits "too close to home."
The personal integrity of the individual member with the
information has an influence on whether or not they will share.
Integrity is internal to the individual. Trust is focused outward to
the perception of another agency by the individual. Integrity has
to do with the specific character and makeup of the person with
the information. Influences such as policy, trust and personal
interests, personal connection and corruption affect different
individuals in different ways based on their personal integrity and
interests. A person who demonstrates a high degree of respect for
the rules and regulations of the agency would be considered to
have a high degree of integrity and would be more likely to
follow policy than someone with a record of bending or not
following the rules. Integrity involves a willingness to place the
organization's rules and interests above one's own.
3.2.3 Criticality
Criticality of the information itself and its potential harmful
impact if not disclosed is a key influencer of action in sharing
information. Studies by J. Lee and H.R. Rao have shown that
officers are more likely to share information where there is a clear
and present danger to life or property (Lee & Rao, 2007). The
greater the threat the greater the likelihood that the people
involved will cooperate and share information. Preliminary work
exploring possible causes and effects of inter-agency information-
sharing systems adoption in the counter-terrorism and disaster
management domains involved an exploration of environmental
and situational conditions for cooperation between emergency
management agencies. The perceived information assurance of
others and having information sharing standards were more
strongly related to agencies sharing than were cultural norms, in
emergency contexts. This supports the assertion that during a
crisis, where criticality is present as a factor, people are more
willing to share information regardless of other influences. The
timeliness of the information itself is also related to criticality.
The relationship of time to the consequences or effectiveness of
the information influences whether or not it is shared. Information
obtained too late or after the fact may or may not be shared,
depending on what consequence it may have at that point in time.
Information of questionable value may be held in waiting so that
it can be verified or supported in some way before sharing. As the
time approaches at which the information may become useless if
not shared, the decision to share or not share is reevaluated.
3.3 Legal Influences
3.3.1 Policy Conflict and Competition
Agency policy also influences whether information gets shared.
In an agency with a defined policy on what is to be shared, it is
easier for staff to make the determination to follow through with
information that is clearly within the guidelines. Clear and
enforced rules for information sharing do lead to better sharing of
this information (Carter & United States, 2004). Policies vary and
are subject to interpretation.
Furthermore, based on the funding or evaluation policy, agencies
may compete for resources and there is a competitive element to
doing the job better than other agencies that have shared interests
and responsibility. For instance, funding for activities may be
based on how many crimes are solved or specific incidents
handled by a particular agency. An example of this would be
formula grants, which are disseminated based on key reported
activities handled by an agency. In fact, the Department of
Justice alone distributed $2.396 billion of assistance to law
enforcement and other agencies based on formula and
competitive grant requests and other programs ("Federal
Assistance from Department of Justice, FY 2008, summary,"
n.d.). This leads to competition for important cases and an interest
in being the agency to close a particular case or handle a
particular incident. As long as funding determinations are made in
this manner, competition among agencies will likely continue to
be an influencing factor. Under the present structure many law
enforcement agencies are put in a position of being in competition
for statistics and resources with other agencies because agencies
from the Federal to Local levels each must justify their budgets to
their constituencies and oversight entities. There is an interest in
showing that your agency is the one doing the work, with more
activity and responsibility correlating with more money and
resources.
3.3.2 Governance
There is not a clear and universal guide to what intelligence
information can and cannot be shared across the federal, tribal,
state and local levels. The laws and policies governing
information security, dissemination, and use vary across local,
state, tribal, and federal agencies. Where agencies do not have
clear guidance on whether or not intelligence information may be
shared, they may choose the safer path of not sharing to protect
themselves from liability. For instance, security clearances for
intelligence information sharing and recognition of legitimate
rights to access intelligence information by local, state, tribal, and
federal agencies remain processes that are not coordinated or
acknowledged across agencies.
The governance regarding law enforcement collaboration must be
reevaluated. Law enforcement agencies in the United States share
overlapping responsibilities and jurisdiction with no one unitary
command; this creates problems over control and authority in
investigations, information sharing and access. They
independently act in the interests of their constituencies as well as
for the broader collective good. Our approach and expectations
for collaboration in this environment must be challenged to be
effective in the future. Law enforcement in the US remains
uniquely decentralized and does not operate under unitary
command or control. Recent case studies on knowledge sharing
within public sector inter-organizational networks confirm
information-sharing difficulties across agencies (Jing & Pengzhu,
2007; Pardo et al., 2006a). Agencies overlap in jurisdiction and
responsibility, each with a duty to its own constituencies.
4. FRAMEWORK AND IMPLEMENTATION
Based on the key influences on trusted information sharing we
analyzed in Section 3, we introduce our framework with two
types of influences affecting whether or not sharing occurs:
facilitators and detractors. This model is an offspring of Lewin's
force field analysis, which is used here for looking at factors or
forces influencing the decision of an individual or organization to
share intelligence information (Thomas, 1985). Forces act as
facilitators, driving movement toward information sharing, or
detractors drawing momentum away from a choice to share
intelligence information. Each of the factors under the headings
given has a potential for facilitating or detracting from a choice to
share intelligence information in a given context.
Figure 2. Influencing Intelligence Information Sharing
Facilitators include the positive influences that result from
technical, social, and legal issues. Detractors include negative
influences resulting from technical, social and legal rules,
regulations, actions or perceptions (see Figure 2 below).
As facilitators, technical issues such as compatible operating
systems, software, hardware, and data definitions, secure access,
control, high usability, and system availability can all work
towards improving the potential for information sharing, but do
not cause information to be shared (Lee & Rao, 2007; Scott,
2006). Regarding technology, we may picture two young friends
who tie two tin cans together with a string to communicate; it is
not the technology of the cans that causes the two to talk across
the string but their desire to share with each other that drives the
use of the technology. It is therefore the social and cultural aspects
of the relationship that matter more than the technology in the
equation for information sharing. Today, the two kids from our
example are texting.
Socially, greater trust in and knowledge of the other parties
involved lead to a greater tendency towards intelligence
information sharing. This involves agency culture and personal
ties or connections with other involved agencies, including
shadow-network ties outside the workplace, such as family,
friend, or other associations in which one member has some
contact or relationship with someone associated with another
agency (Drake et al., 2004b; Marks & Sun, 2007). A ready example is
family, friends or participation in clubs or activities that involve
others apart from the work environment. These external contacts
can have a positive influence on the likelihood of intelligence
information sharing. Shared training and joint operations such as
the U.S. Marshals joint fugitive round up effort with state and
local agencies in Florida can have a positive effect on information
sharing (Clark, 2008). Importance, as described previously, can
be a very critical factor influencing the sharing of intelligence
information as well. Information that is credible and which may
result in some specific harm or loss is more readily shared and the
pressure to share this information increased where there may be
an approaching deadline (Lee & Rao, 2007).
In the area of legal influence, having a clear and enforced agency
policy regarding intelligence information sharing will lead to the
greater likelihood that information will be shared as will increased
knowledge of laws and regulations which allow for intelligence
information sharing. System governance and participation by
others are also expected to be facilitators of intelligence
information sharing where members and organizations have
positive regard for and accept each other's roles (Cresswell et al.,
2007; Park et al., 2001). People within agencies are more likely to
participate in systems over which they have choice, investment,
and control.
On the detractor side, intelligence comes from the field or other
sources to an agency, and the identified factors may negatively
affect the degree to which this information is likely to be shared.
Legal factors with a negative influence include security
clearances that are not uniform or recognized across agencies,
and laws regarding privacy, secrecy, or sharing of information
that are conflicting or not well understood by participants. Social
issues involve lack of trust, integrity, or assurance, or an agency
culture geared towards not sharing (Lee & Rao, 2007). Trust is
reduced where agencies compete for statistics,
media attention and funding. Informal or outside contacts which
are described as part of the shadow network have great potential
to provide a negative influence if the information may be
potentially damaging to an entity or person. Criticality includes
the timing of information and its potential impact, so where there
is little urgency the pressure to share the intelligence is reduced.
Where there is no identified time frame or deadline, the
information may not be reacted to in a timely manner, further
reducing the pressure to share. Lack of knowledge, or inaccurate
knowledge, about what actions can be taken regarding sharing of
information can also hinder information sharing. Matters of
jurisdiction, authority, and governance or control over power or
influence likewise work against sharing (Drake et al., 2004a).
Technical factors act as detractors as well. Many agencies use
different hardware and software for communication and
information management, and these may not interoperate.
Systems that are not responsive or show poor performance may
not be adopted. Agencies with existing systems may not be
financially able to change to more compatible or standardized
systems. The costs of retraining staff on new systems can be high
as well, and costs for maintenance of the systems must be
considered.
These factors serve as the basis and framework for investigation.
Inter-relationships among the identified factors influence the
degree to which information sharing is more or less likely to
occur; the outcome can be viewed as the balance of these forces.
We are developing a formula and model based on this framework
to describe and to predict resulting conditions regarding
information sharing behaviors based on knowledge of the
influencing factors. Probable effects from modifying the
influencing factors are more readily apparent and easier to
identify using such a model. The further investigation of these
influences will inform the model as to the degree of influence
each may have relative to the other so that decision makers can
pattern solutions towards desired ends.
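To make the force-field framing concrete, the following is a minimal sketch in Python, assuming a simple additive weighting of facilitators and detractors; this is our illustration of the style of model described above, not the authors' formula, and the factor names and weights are hypothetical.

def sharing_balance(facilitators, detractors):
    # Toy force-field balance: facilitator weights push toward sharing,
    # detractor weights push away; a positive balance suggests that
    # conditions favor sharing. All weights are analyst-assigned.
    return sum(facilitators.values()) - sum(detractors.values())

# Hypothetical strengths for factors named in Section 3.
facilitators = {"clear policy": 0.8, "joint training": 0.5, "criticality": 0.9}
detractors = {"clearance mismatch": 0.7, "funding competition": 0.6, "low trust": 0.4}
print(sharing_balance(facilitators, detractors))  # 0.5 > 0: sharing favored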
5. CONCLUSIONS AND FUTURE WORK
By using the approach developed we can visualize the effects of
making changes in different areas of influence on the level of
information sharing and observe the outcomes. We can see how
adjustments in degree of influence for the influencing factors
identified determine whether or not intelligence information is
shared under the given circumstances. This leads to the theory
that intelligence information sharing between law enforcement
agencies is affected by technical, social, and legal factors
comprising issues of interoperability, availability, control, trust,
shadow networks, criticality, policy conflict, competition, and
governance. We have identified the major areas of influence
and posit that these factors work to facilitate or detract from the
sharing of intelligence information between agencies. This model
and theory should serve as a basic conceptual framework for
further academic work and lead to further investigation and
clarification of the identified factors and the degree of impact they
exert on the system so that actionable solutions can be identified
and implemented.
6. REFERENCES
[1] Bulman, P. (2008). Communicating Across State and County
Lines: The Piedmont Regional Voice over Internet Protocol
Project. NIJ Journal, 261. Retrieved February 22, 2009, from
http://www.ojp.usdoj.gov/nij/journals/261/piedmontvoip.htm.
[2] Bureau of Justice Statistics Law Enforcement Statistics.
(2007, August 8). Retrieved May 9, 2008, from
http://www.ojp.usdoj.gov/bjs/lawenf.htm.
[3] Carter, D. L., & United States. (2004). Law Enforcement
Intelligence a Guide for State, Local, and Tribal Law
Enforcement Agencies. Washington, D.C.: U.S. Dept. of
Justice, Office of Community Oriented Policing Services.
[4] Chakraborty, Sudip and Ray, Indrajit (2006). TrustBAC:
Integrating trust relationships into the RBAC model for
access control in open systems. In Proceedings of the 11th
ACM Symposium on Access Control Models and
Technologies, Lake Tahoe, CA, June 2006.
[5] Chan, H. C., & Teo, H. (2007). Evaluating the boundary
conditions of the technology acceptance model: An
exploratory investigation. ACM Trans. Comput.-Hum.
Interact., 14(2), 9. doi: 10.1145/1275511.1275515.
[6] Chau, M., Atababhsh, H., Zeng, D., & Chen, H. (2002).
Building an Infrastructure for Law Enforcement Information
Sharing and Collaboration: Design Issues and Challenges.
National Science Foundation. Retrieved February 4, 2008,
from http://dlist.sir.arizona.edu/473/01/chau4.pdf.
[7] Chen, H., Zeng, D., Atabakhsh, H., Wyzga, W., &
Schroeder, J. (2003). COPLINK: managing law enforcement
data and knowledge. Communications of the ACM, 46(1),
28-34.
[8] Clark, J. (2008, September 18). Remarks by Director John
Clark at the Operation Orange Crush Press Conference.
USMarshals.gov. Government Agency. Retrieved November
27, 2008, from
http://www.usmarshals.gov/news/chron/2008/091808.htm.
[9] Cresswell, A. M., Pardo, T. A., & Hassan, S. (2007).
Assessing capability for justice information sharing. In
Proceedings of the 8th annual international conference on
Digital government research: bridging disciplines & domains
(pp. 122-130). Philadelphia, Pennsylvania: Digital
Government Society of North America. Retrieved September
18, 2008, from
http://portal.acm.org/citation.cfm?id=1248460.1248479&col
l=ACM&dl=ACM&CFID=3249028&CFTOKEN=78511360
[10] Drake, D., Steckler, N. A., & Koch, M. J. (2004).
Information Sharing in and Across Government Agencies:
The Role and Influence of Scientist, Politician, and
Bureaucrat Subcultures. Social Science Computer Review,
22(1), 67-84. doi: 10.1177/0894439303259889.
[11] Federal Assistance from Department of Justice, FY 2008,
summary. (n.d.). Retrieved February 22, 2009, from
http://www.usaspending.gov/faads/faads.php?datype=T&det
ail=1&database=faads&fiscal_year=2008&maj_agency_cat=15.
[12] Eduardo Fernández-Medina and Mariemma I. Yagüe, "State
of standards in the information systems security area,"
Computer Standards & Interfaces 30, no. 6 (August 2008):
339-340, doi:10.1016/j.csi.2008.03.001.
[13] Gao, J. (2005). Information sharing, trust in automation, and
cooperation for multi-operator multi-automation systems.
Retrieved from
http://proquest.umi.com/pqdweb?did=1079666781&Fmt=7&
clientId=3739&RQT=309&VName=PQD.
[14] Glomseth, R., Gottschalk, P., & Solli-Saether, H. (2007).
Occupational culture as determinant of knowledge sharing
and performance in police investigations. International
Journal of the Sociology of Law, 35(2), 96-107. doi:
10.1016/j.ijsl.2007.03.003.
[15] Ingram, P., & Lifschitz, A. (2006). Kinship in the Shadow of
the Corporation: The Interbuilder Network in Clyde River
Shipbuilding, 1711-1990. American Sociological Review, 71,
334-352.
[16] ISAC White Paper. (2004). Vetting and trust for
communication among ISACs and government entities.
Retrieved October 23, 2008, from
http://www.isaccouncil.org/pub/Vetting_and_Trust_013104.
pdf.
[17] Ivkovic, S. K., & Shelley, T. O. (2005). The Bosnian Police
and Police Integrity: A Continuing Story. European Journal
of Criminology, 2(4), 428-464. doi:
10.1177/1477370805056057.
[18] Jing, F., & Pengzhu, Z. (2007). A case study of G2G
information sharing in the Chinese context. In Proceedings
of the 8th annual international conference on Digital
government research: bridging disciplines & domains (pp.
234-235). Philadelphia, Pennsylvania: Digital Government
Society of North America. Retrieved September 18, 2008,
from
http://portal.acm.org/citation.cfm?id=1248460.1248496&col
l=ACM&dl=ACM&CFID=3249028&CFTOKEN=78511360
[19] Koroma, J., Li, W., & Kazakos, D. (2003). A generalized
model for network survivability. In Proceedings of the 2003
conference on Diversity in computing (pp. 47-51). Atlanta,
Georgia, USA: ACM. doi: 10.1145/948542.948552.
[20] Koufaris, M., & Hampton-Sosa, W. (2004). The
development of initial trust in an online company by new
customers. Inf. Manage., 41(3), 377-397.
[21] Kulik, C. T., Bainbridge, H. T. J., & Cregan, C. (2008).
Known by the Company We Keep: Stigma-by-Association
Effects in the Workplace. Academy of Management Review,
33(1), 216-230.
[22] Lai, V. S., & Mahapatra, R. K. (1997). Exploring the
research in information technology implementation. Inf.
Manage., 32(4), 187-201.
[23] Lampathaki, F., Mouzakitis, S., Gionis, G., Charalabidis,
Y., and Askounis, D. (2008). "Business to business
interoperability: A current review of XML data integration
standards," Computer Standards & Interfaces,
doi:10.1016/j.csi.2008.12.006.
[24] Lee, C. (2006). The role of trust in information sharing: A
study of relationships of the interorganizational network of
real property assessors in New York state. Retrieved from
http://proquest.umi.com/pqdweb?did=1288656571&Fmt=7&
clientId=3739&RQT=309&VName=PQD.
[25] Lee, H. (2008). Cyber crime and challenges for crime
investigation in the information era. In Intelligence and
Security Informatics, 2008. ISI 2008. IEEE International
Conference on (pp. xxv-xxvi). doi:
10.1109/ISI.2008.4565011.
[34] Niu, J. (2007). Circles of trust: A comparison of the size and
composition of trust circles in Canada and in China.
Retrieved from
http://proquest.umi.com/pqdweb?did=1276413271&Fmt=7&
clientId=3739&RQT=309&VName=PQD.
[26] Lee, J., & Rao, H. R. (2007). Exploring the causes and
effects of inter-agency information sharing systems adoption
in the anti/counter-terrorism and disaster management
domains. In Proceedings of the 8th annual international
conference on Digital government research: bridging
disciplines & domains (pp. 155-163). Philadelphia,
Pennsylvania: Digital Government Research Center.
[35] Pardo, Cresswell, Thompson, & Zhang. (2006). Knowledge
sharing in cross-boundary information system development
in the public sector. Information Technology and
Management, 7(4), 293-313. doi: 10.1007/s10799-006-0278-6.
[27] Xiong, L., & Liu, L. (2004). PeerTrust: supporting
reputation-based trust for peer-to-peer electronic
communities. Knowledge and Data Engineering, IEEE
Transactions on, 16(7), 843-857. doi:
10.1109/TKDE.2004.1318566.
[28] Li, X., Hess, T. J., & Valacich, J. S. (2008). Why do we trust
new technology? A study of initial trust formation with
organizational information systems. The Journal of Strategic
Information Systems, 17(1), 39-71. doi:
10.1016/j.jsis.2008.01.001.
[29] Lieberman, J. (2007). Confronting the Terrorist Threat to the
Homeland: Six Years After 9/11. 342 DIRKSEN SENATE
OFFICE BUILDING, WASHINGTON, D.C.: Federal News
Service. Retrieved May 8, 2008, from
http://www.fas.org/irp/congress/2007_hr/091007transcript.pd
f.
[30] Luna-Reyes, L. F., Andersen, D. F., Richardson, G. P.,
Pardo, T. A., & Cresswell, A. M. (2007). Emergence of the
governance structure for information integration across
governmental agencies: a system dynamics approach. In
Proceedings of the 8th annual international conference on
Digital government research: bridging disciplines &
domains (pp. 47-56). Philadelphia, Pennsylvania: Digital
Government Society of North America. Retrieved October
16, 2008, from
http://portal.acm.org/citation.cfm?id=1248460.1248468&col
l=GUIDE&dl=GUIDE&type=series&idx=SERIES10714&p
art=series&WantType=Proceedings&title=AICPS&CFID=6
749952&CFTOKEN=77254060#.
[31] Marks, D. E., & Sun, I. Y. (2007). The impact of 9/11 on
organizational development among state and local law
enforcement agencies. Journal of Contemporary Criminal
Justice, 23(2), 159-173.
[36] Park, J., Chandramohan, P., Suresh, A., & Giordano, J.
(2009). Component survivability for mission-critical
distributed systems. Journal of Automatic and Trusted
Computing (JoATC). In press.
[37] Park, J., An, G., and Chandra D. (2007). Trusted P2P
computing environments with role-based access control
(RBAC). IET (The Institution of Engineering and
Technology, formerly IEE) Information Security, 1(1):27-35.
[38] Park, J., Kang, M., and Froscher, J. (2001). A secure
workflow system for dynamic cooperation. In Michel Dupuy
and Pierre Paradinas, editors, Trusted Information: The New
Decade Challenge, pages 167-182. Kluwer Academic
Publishers, 2001. Proceedings of the 16th IFIP TC11
International Conference on Information Security
(IFIP/SEC), Paris, France, June 11-13.
[39] Park, J., Sandhu, R., and Ahn, G. (2001). Role-based access
control on the Web. ACM Transactions on Information and
System Security (TISSEC), 4(1):37-71.
[40] Park, J., Suresh, A., An, G., and Giordano, J. (2006). A
framework of multiple-aspect component-testing for trusted
collaboration in mission-critical systems. In Proceedings of
the IEEE Workshop on Trusted Collaboration (TrustCol),
Atlanta, Georgia, November 17-20. IEEE Computer Society.
[41] Ray, Indrajit and Chakraborty, Sudip (2004). A vector model
of trust for developing trustworthy systems. In P. Samarati et
al. editors, Computer Security- ESORICS, Proceedings of
the 9th European Symposium on Research in Computer
Security, September 13-15, 2004, Sophia Antipolis, France.
LNCS 3193, Springer.
[42] Razavi, M. N., & Iverson, L. (2006). A grounded theory of
information sharing behavior in a personal learning space. In
Proceedings of the 2006 20th anniversary conference on
Computer supported cooperative work (pp. 459-468). Banff,
Alberta, Canada: ACM. doi: 10.1145/1180875.1180946.
[32] McKay, J. (2008). Statement of John McKay, Former United
States Attorney For the Western District of Washington,
Before the Subcommittee on Intelligence, Information
Sharing And Terrorism Risk Assessment Committee on
Homeland Security United States House of Representatives
(Washington, D.C., 2008), Retrieved October 18, 2008, from
http://webdev.maxwell.syr.edu/insct/Research/IS%20Page/M
cKay%20Testimony.pdf.
[43] Rocco, E. (1998). Trust breaks down in electronic contexts
but can be repaired by some initial face-to-face contact. In
Proceedings of the SIGCHI conference on Human factors in
computing systems (pp. 496-502). Los Angeles, California,
United States: ACM Press/Addison-Wesley Publishing Co.
[33] McKnight, D. H., Choudhury, V., & Kacmar, C. (2002).
Developing and Validating Trust Measures for e-Commerce:
An Integrative Typology. Info. Sys. Research, 13(3), 334-359.
[45] Schooley, B. L. (2007). Inter-organizational systems analysis
to improve time-critical public services: The case of mobile
emergency medical services. Retrieved from
http://proquest.umi.com/pqdweb?did=1390309161&Fmt=7&
clientId=3739&RQT=309&VName=PQD.
[44] Ruppel, C., Underwood-Queen, L., & Harrington, S. J.
(2003). e-Commerce: The Roles of Trust, Security, and Type
of e-Commerce Involvement. e-Service Journal, 2(2), 25-45.
[46] Schoorman, F. D., Mayer, R. C., & Davis, J. H. (2007). An
Integrative Model of Organizational Trust: Past, Present,
and Future. Academy of Management Review, 32(2),
344-354.
[47] Scott, E. D. (2006). Factors influencing user-level success in
police information sharing: An examination of Florida's
FINDER system. Retrieved from
http://proquest.umi.com/pqdweb?did=1251886251&Fmt=7&
clientId=3739&RQT=309&VName=PQD.
[48] Serra da Cruz, S., Chirigati, F., Dahis, R., Campos, M., &
Mattoso, M. (2008). Using Explicit Control Processes in
Distributed Workflows to Gather Provenance. In Provenance
and Annotation of Data and Processes (pp. 186-199).
Retrieved February 22, 2009, from
http://dx.doi.org/10.1007/978-3-540-89965-5_20.
[49] Shaw, P. (1997). Intervening in the shadow systems of
organizations: Consulting from a complexity perspective.
Journal of Organizational Change Management, 10(3), 235.
[50] Stacey, R. (1996). Complexity and Creativity in
Organizations (1st ed., p. 312). Berrett-Koehler Publishers.
[51] Thomas, J. (1985). Force field analysis: A new way to
evaluate your strategy. Long Range Planning, 18(6), 54-59.
doi: 10.1016/0024-6301(85)90064-0.
[52] United States. (2007b). Building the Information Sharing
Environment: Addressing the Challenges of Implementation:
Hearing Before the Subcommittee on Intelligence,
Information Sharing, and Terrorism Risk Assessment of the
Committee on Homeland Security, U.S. House of
Representatives, One Hundred Ninth Congress, Second
Session, May 10, 2006 (p. 27). Washington: U.S. G.P.O.
[53] United States. (2007c). Federal Support for Homeland
Security Information Sharing: Role of the Information
Sharing Program Manager: Hearing Before the
Subcommittee on Intelligence, Information Sharing, and
Terrorism Risk Assessment of the Committee on Homeland
Security, House of Representatives, One Hundred Ninth
Congress, First Session, November 8, 2005 (p. 58).
Washington: U.S. G.P.O. Retrieved from
http://www.gpoaccess.gov/congress/index.html.
[54] U.S. Dept. of Justice, Bureau of Justice Statistics (2006).
Law Enforcement Management and Administrative Statistics
(LEMAS): 2003 Sample Survey of Law Enforcement
Agencies [Computer file]. ICPSR04411-v1. Ann Arbor, MI:
Inter-university Consortium for Political and Social Research
[producer and distributor], 2006.
[55] Treglia, J. and Park J. (2009). Technical, Social & Legal
Barriers to Effective Information Sharing Among Sensitive
Organizations. In Proceedings of the iConference, Chapel Hill,
North Carolina, February 8-11. Poster.
[56] Whitehouse. (2007). Report of the Security Clearance
Oversight Group Consistent with Title III of the Intelligence
Reform and Terrorism Prevention Act of 2004. Retrieved
March 16, 2008, from
http://www.whitehouse.gov/omb/pubpress/2007/sc_report_to
_congress.pdf.
Social Networks Integration and Privacy Preservation
using Subgraph Generalization
Christopher C. Yang
Xuning Tang
College of Information Science and Technology
Drexel University
3141 Chestnut Street
Philadelphia, PA 19104
[email protected]
ABSTRACT
Intelligence and law enforcement forces make use of terrorist and
criminal social networks to support their investigations, such as
identifying suspects, terrorist or criminal subgroups, and their
communication patterns. Social networks are valuable resources,
but it is not easy to obtain the information needed to create a
complete terrorist or criminal social network. Missing information
in a terrorist or criminal social network always diminishes the
effectiveness of an investigation. An individual agency only has a
partial terrorist or criminal social network due to its limited
information sources. Sharing and integration of social networks
between different agencies increase the effectiveness of social
network analysis. Unfortunately, information sharing is usually
forbidden due to the concern of privacy preservation. In this
paper, we introduce the KNN algorithm for subgraph generation
and a mechanism to integrate the generalized information to
conduct social network analysis. Generalized information, such as
the lengths of the shortest paths, the number of nodes on the
boundary, and the total number of nodes, is constructed for each
generalized subgraph. By utilizing the generalized information
shared from other sources, an estimation of the distance between
nodes is developed to compute closeness centrality. Two
experiments have been conducted with random graphs and the
Global Salafi Jihad terrorist social network. The results show that
the proposed technique improves the accuracy of closeness
centrality measures substantially while protecting the sensitive data.
Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search
and Retrieval - information filtering, selection process; H.3.5
[Information Storage and Retrieval]: Online Information Services
- web-based services; H.5.4 [Information Interfaces and
Presentation]: Hypertext/Hypermedia - navigation
General Terms
Algorithms, Management, Performance, Experimentation.
1. INTRODUCTION
Social network analysis techniques have been widely used in
intelligence and security informatics to support intelligence and
law enforcement forces in identifying suspects and gateways and
in extracting the communication patterns of terrorist or criminal
organizations. In our previous work [20][22], we have shown
how social network analysis and visualization techniques are
useful in knowledge discovery on terrorist social networks.
However, using partial data of a terrorist or criminal social
network, social network analysis techniques are not able to
extract the essential knowledge. In some cases, an inaccurate
result is obtained. For example, each law enforcement unit
has its own criminal social network. Mining on an incomplete
criminal social network may not be able to identify the bridge
between two criminal subgroups. Unfortunately, limited by the
privacy policy, different organizations are only allowed to share
the insensitive information but not their social networks. As a
result, an accurate social network analysis cannot be conducted
unless an integration of the social networks owned by different
organizations can be made.
Assuming organization P ( O P) and organization Q ( O Q ) own the
social networks G P and G Q respectively as shown in Figure 1,
without integrating G P and G Q , O P will never discover the
connection between the subgroup of D , E and F and the subgroup
of G and H. Similarly, O Q will never discover the connection
between the subgroup of A, B and C and the subgroup of G and H .
After integrating G P and G Q to obtain G , both O P and O Q will
identify the connections.
Figure 1. Illustrations of social networks integration. (The figure shows the partial networks G P and G Q over nodes A-H and the integrated network G that reveals the connections between their subgroups.)
The research problem is defined as follows. Given two or more
social networks (G 1, G 2, …) from different organizations (O 1, O 2,
…), the objective is sharing the necessary information between
these social networks to achieve a more accurate social network
analysis and mining result while preserving the sensitive
information at the same time. Each organization O i has a piece of
the social network, which is part of the whole picture: a social
network G constructed by integrating all G i. Conducting the
social network analysis task on G, one can obtain the exact result.
However, conducting the social network analysis task on any G i
alone, one can never achieve the exact SNAM result because of
the missing information. By integrating G i and some generalized
information of G j, O i should be able to achieve a more accurate
social network analysis and mining result.
In this paper, we propose algorithms for social network data
sharing and integration. The proposed information sharing and
integration of social networks has three major components: (i)
constructing generalized subgraphs, (ii) creating generalized
information for sharing, and (iii) social network integration and
analysis.
1.1 Related Work
Thuraisingham [15][16] defined assured information sharing as
enforcing security and integrity policies during information
sharing between organizations so that the data is integrated and
mined to extract nuggets. Members or partners in a coalition
conduct data sharing in a dynamic environment where they can
join and leave the coalition in accordance with the policies and
procedures [15]. Baird et al. [2][3] first discussed several aspects
of coalition data sharing in the Markle report. Thuraisingham [15]
has further discussed these aspects including confidentiality,
privacy, trust, integrity, dissemination and others. In this work,
we focus on social network data sharing and integration.
Sensitive information should be protected while insensitive
information can be shared. Sensitive information can also be used
to generate generalized information so that privacy can be
protected.
In recent years, a number of approaches for preserving the privacy
of relational data have been developed. The main application is
data publishing, so that sensitive personal information can be
protected while organizations release data such as medical,
census, customer transaction, and voter registration records.
These techniques include k-anonymity [11],[13], l-diversity [8],
t-closeness [6], m-invariance [19], and δ-presence [9]. A naïve
approach to preserving privacy is removing attributes that
uniquely identify a person, such as names and identification
numbers. However, a trivial linking attack using the innocuous
sets of attributes known as quasi-identifiers across multiple
databases can easily identify a person. k-anonymity [11],[13] is
the first attempt at privacy preservation of relational data,
ensuring that at least k records are indistinguishable with respect
to every set of quasi-identifier attributes. However, k-anonymity
fails when there is a lack of diversity or other background
knowledge. l-diversity [8] ensures that there are at least l
well-represented values of the attributes for every set of
quasi-identifier attributes. The weakness is that one can still
estimate the probability of a particular sensitive value.
m-invariance [19] ensures that each set of quasi-identifier
attributes has at least m tuples, each with a unique set of sensitive
values. There is at most 1/m confidence in determining the
sensitive values. Other enhanced techniques of k-anonymity and
l-diversity with personalization, such as personalized anonymity
[18] and (α,k)-anonymity [17], allow users to specify the degree
of privacy protection or specify a threshold α on the relative
frequency of the sensitive data. In this work, we focus on the
privacy preservation of social network data rather than relational
data. The techniques for privacy preservation of relational data
are not directly applicable to social network data because the data
representations are not the same.
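As a point of reference, the tabular k-anonymity condition just described can be checked with a few lines of Python; this sketch (with a function name of our choosing) only tests the property on a list of records and is not a generalization algorithm that enforces it.

from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    # Every combination of quasi-identifier values must occur in at
    # least k records for the table to be k-anonymous.
    combos = Counter(tuple(row[a] for a in quasi_identifiers) for row in rows)
    return all(count >= k for count in combos.values())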
There is relatively little research work on the privacy preservation
of social network data (or graphs). A naïve approach is removing
the identities of all nodes while revealing only the edges of a
social network. In this case, the global network properties are
preserved for other research applications, assuming that the
identities of nodes are not of interest in those applications. This
assumption holds when the objective is publishing social network
data for studying the global network structure, but it is not
necessarily true in the case of the social network analysis
discussed in this paper. Moreover, Backstrom et al. [1] proved that
it is possible to discover whether edges between specific targeted
pairs of nodes exist by active or passive attacks. Based on the
uniqueness of small random subgraphs embedded in a social
network, one can infer the identities of nodes by solving a set of
restricted isomorphism problems.
In order to tackle active and passive attacks and preserve the
privacy of node identities in a social network, several
anonymization models have been proposed in the recent
literature: k-candidate anonymity [6], k-degree anonymity [8],
and k-anonymity [24]. Such anonymization models are proposed
to increase the difficulty of attack, based on the notion of
k-anonymity in tabular data. k-candidate anonymity [6] defines
that there are at least k candidates in a graph G that satisfy a
given query Q. k-degree anonymity [8] defines that, for every
node v in a graph G, there are at least k-1 other nodes in G that
have the same degree as v. k-anonymity [24] has the strictest
constraint. It defines that, for every node v in a graph G, there are
at least k-1 other nodes in G such that their anonymized
neighborhoods are isomorphic. The technique to achieve the
above anonymities is edge or node perturbation [6],[8],[24]. By
adding and/or deleting edges and/or nodes, a perturbed graph is
generated to satisfy the anonymity requirement. Adversaries have
a confidence of only 1/k in discovering the identity of a node by
neighborhood attacks.
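For comparison with the tabular case, the k-degree anonymity condition cited above can be checked directly on a graph; the sketch below (our own helper, written against the networkx library) tests the property but does not perform the edge or node perturbation that enforces it.

from collections import Counter
import networkx as nx

def is_k_degree_anonymous(G, k):
    # k-degree anonymity: every degree value occurring in G must be
    # shared by at least k nodes.
    degree_counts = Counter(d for _, d in G.degree())
    return all(count >= k for count in degree_counts.values())

# Example: in a 5-node cycle every node has degree 2, so k = 5 holds.
print(is_k_degree_anonymous(nx.cycle_graph(5), 5))  # True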
Since the current research on privacy preservation of social
network data focuses on preserving node identities in data
publishing, the anonymized social network can only be used to
study global network properties and may not be applicable to
other social network analysis tasks, which are the focus of this
work. In addition, the set of nodes and edges in a perturbed
social network differs from the set of nodes and edges in the
original social network. As reported by Zhou and Pei [24], the
number of edges added can be as high as 6% of the original
number of edges in a social network. A recent study [23] has
investigated how edge and node perturbation can change certain
network properties. Such distortion may cause significant errors
in certain SNAM tasks, such as centrality measurement, even
though the global properties can be maintained.
2. A Framework for Integrating Social Networks with Privacy Preservation
Assuming organization P (O P) and organization Q (O Q) have
social networks G P and G Q respectively, O P needs to conduct a
social network analysis and mining (SNAM) task, but G P is only a
partial social network for the SNAM task. If there were no
privacy concerns, one could integrate G P and G Q to generate an
integrated G and obtain a better SNAM result. Due to privacy
concerns, O Q cannot release G Q to O P but only shares some data
with O P according to the agreed privacy policy. At the same time,
O P does not need all data from O Q, only the data that are critical
for the SNAM task. The objectives are maximizing the
information sharing that is needed for the SNAM task while
conforming to the privacy policy, so that sensitive information is
preserved and a more accurate SNAM result can be obtained.
The information we share can be sensitive or insensitive, and
useful or not useful, for the SNAM task to be conducted by the
information requesting party. When we integrate social networks,
we need to maximize the sharing of information that is insensitive
and useful for the SNAM task. The shared information should
not include any sensitive information; however, it must be useful
for improving the performance of the SNAM task conducted by
the information requesting party.
Figure 2. Framework of Social Networks Integration. (The figure shows organization O P, with information needs N P and social network G P, requesting data from organization O Q; O Q applies its privacy policy, based on the trust degree of P, to derive a generalized social network G Q′ from its social network G Q, which is shared and integrated with G P to support the social network analysis and mining task.)
We develop the privacy policy by considering the information
needs based on the SNAM task, the trust degree of the
information requesting party, and the information available in its
own social network. The privacy policy determines what data can
be shared. Thuraisingham [15],[16] discussed a coalition of
dynamic relational data sharing, in which security and integrity
policies are enforced. When we perform social network data
sharing, we need to consider what kinds of nodes and edges are
needed to accomplish a particular SNAM task by analyzing the
network structure. The trust degree of the information requesting
party will determine certain sensitive data to be protected and
certain insensitive data to be shared. At the same time, the
information available for sharing may not allow the exact SNAM
result to be obtained; our objective is maximizing accuracy so that
a better result is obtained than the SNAM result computed from
the original network without sharing.
Using subgraph generalization, a generalized social network, G Q′,
will be created from G Q in conformance with the privacy policy.
The generalized social network contains only generalized
information of G Q, without releasing any sensitive information.
For instance, the generalized information can be the maximum or
minimum length of the shortest paths between two subgroups, the
degree of an insensitive node, the radius of a subgroup, etc. The
generalized social network G Q′ will then be integrated with G P to
support a social network analysis and mining task. Given the
generalized information from G Q, better social network analysis
and mining is expected than from conducting the analysis and
mining on G P alone.
2.1 Subgraph Generalization
Given the insensitive data in G Q, we propose a subgraph
generalization approach to create a generalized social network
G Q′ for sharing with O P. Subgraph generalization creates a
generalized version of a social network in which a connected
subgraph is transformed into a generalized node, and only
generalized information is presented in the generalized node.
Edges that link other nodes in the network to any node of the
subgraph are connected to the generalized node. The generalized
social network protects all sensitive information while releasing
the crucial, non-sensitive information to the information
requesting party for social network integration and the intended
SNAM task. A mechanism is needed to (i) identify the subgraphs
for generalization, (ii) determine the connectivity between the set
of generalized nodes in the generalized social network, and (iii)
construct the generalized information to be shared.
A subgraph of G = (V, E) is denoted as G i = (V i, E i), where
V i ⊆ V, E i ⊆ E, and E i ⊆ V i × V i. G i is a connected subgraph if
there is a path for each pair of nodes in G i. We only consider
connected subgraphs when we conduct subgraph generalization.
The generated subgraphs should be mutually exclusive and
exhaustive: a node v can be part of only one subgraph, and the
union of the nodes from all subgraphs should be equal to V, the
original set of nodes in G.
To construct the subgraphs for generalization, we propose the
K-nearest neighbor (KNN) method. Given a social network
G = (V, E) with n nodes, |V| = n, K of these nodes are insensitive
nodes. We divide G into K (or fewer) subgraphs G i = (V i, E i),
where each subgraph has at least one insensitive node, so that
V = ∪ i=1 to K V i. Each subgraph will also be known as a
generalized node in the generalized graph G′. v i C corresponds to
the center of a subgraph G i, and v i C must be an insensitive node
in G i. Let SP D (v, v i C) be the distance of the shortest path
between v and v i C. When v is assigned to the subgraph G i in
subgraph generation, SP D (v, v i C) must be shorter than or equal
to SP D (v, v j C) for j = 1, 2, .., K and j ≠ i. An edge exists between
two generalized nodes G i and G j in the generalized graph G′ if
and only if there is an edge between two nodes in G, one from
each generalized node. G′ has a set of generalized nodes, G 1,
G 2, …, G K, and a set of edges, E′.
For simplicity, we use the graphs in Figure 3 to illustrate
subgraph generation by the KNN method. G has eight nodes,
including v1 and v2. If we take v1 and v2 as the insensitive nodes
and create two subgraphs by the 2NN method, every other node
is assigned to one of the two subgraphs depending on its shortest
distance to v1 and v2. The two subgraphs G 1 and G 2 generated
are illustrated in Figure 3. This illustration is kept small for
simplicity; a real social network will have a significantly larger
number of nodes and edges.
Figure 3. Illustrations of generating subgraphs. (The figure shows G with eight nodes, including v1 and v2, divided into the two subgraphs G 1 and G 2 by the 2NN method.)
Figure 4. Illustrations of a generalized graph. (The figure shows the generalized graph in which G 1 and G 2 appear as generalized nodes labeled v1 and v2.)
The KNN subgraph generation algorithm is presented below:

KNN Subgraph Generation
Input: social network G = (V, E); insensitive center nodes v1C, ..., vKC
Output: subgraphs G1 = (V1, E1), ..., GK = (VK, EK); generalized edges E'
For each node v in V
    Find the center viC minimizing SPD(v, viC)
    Vi = Vi + {v}
End For
For each edge (v, w) in E
    If v and w belong to the same subgraph Gi
        Ei = Ei + {(v, w)}
    Else
        Create an edge in E' between the generalized nodes containing v and w
End For
The KNN subgraph generation algorithm creates K subgraphs G1,
G2, ..., GK from G. Each subgraph Gi has a set of nodes, Vi, and
a set of edges, Ei. Edges between subgraphs, E', are also created.
A generalized graph, G', is constructed in which each generalized
node corresponds to a subgraph Gi and is labeled by its insensitive
node, viC. For the example of Figure 3, the resulting generalized
graph is presented in Figure 4. When an organization shares its
social network with another organization, the generalized graph is
shared, but not all information within each generalized node; only
generalized subgraph information is shared, so that sensitive
information is preserved. At the same time, the generalized
information supports integration and social network analysis.
In the next section, we describe the generalized subgraph
information for sharing.
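To make the procedure concrete, the following is a minimal Python sketch of the KNN subgraph generation described above. It assumes an unweighted, undirected graph given as node and edge lists, and it breaks ties between equally close centers arbitrarily, which the algorithm above does not specify.

from collections import deque, defaultdict

def bfs_distances(adj, source):
    # Single-source shortest-path distances in an unweighted graph (BFS).
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def knn_subgraph_generation(nodes, edges, centers):
    # Build an adjacency list for G = (V, E).
    adj = defaultdict(set)
    for u, w in edges:
        adj[u].add(w)
        adj[w].add(u)
    # SPD(v, viC) from every insensitive center viC.
    dist = {c: bfs_distances(adj, c) for c in centers}
    # Assign each node to the subgraph of its nearest center (ties arbitrary).
    assignment = {v: min(centers, key=lambda c: dist[c].get(v, float("inf")))
                  for v in nodes}
    # An edge between generalized nodes Gi and Gj exists if and only if G has
    # an edge with one endpoint in each.
    gen_edges = {tuple(sorted((assignment[u], assignment[w])))
                 for u, w in edges if assignment[u] != assignment[w]}
    return assignment, gen_edges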
2.2 Generalized Subgraph Information
Given a generalized node Gi, generated by the KNN subgraph
generation, and its center (insensitive node) viC, we construct
generalized subgraph information for the generalized node that
should not reveal sensitive information, including sensitive
identities and sensitive relationships. The generalized
information, however, should provide useful information for
social network analysis after integration. Since the social network
analysis we are investigating is closeness centrality, the lengths
of the shortest paths in the generalized node are useful for
estimating distances between nodes within the same generalized
node or across different generalized nodes.
In this paper, we propose to create generalized subgraph
information such as the longest and shortest lengths of the
shortest paths in a subgraph, as well as the number of nodes in a
subgraph and the number of nodes adjacent to another subgraph.
Let vp and vq be any two nodes in Gi and the length of the shortest
path between vp and vq be SPD(vp, vq, Gi). By considering all the
shortest paths between any two nodes in Gi, SPD(vp, vq, Gi) for all
vp, vq ∈ Vi, we compute the longest length of the shortest paths
between any two nodes in Gi, denoted as L_SPD(Gi), and the shortest
length of the shortest paths between any two nodes in Gi, denoted
as S_SPD(Gi):

L_SPD(Gi) = {SPD(vm, vn, Gi) | ∀ vp, vq ∈ Vi, SPD(vm, vn, Gi) ≥ SPD(vp, vq, Gi)}

and

S_SPD(Gi) = {SPD(vm, vn, Gi) | ∀ vp, vq ∈ Vi, SPD(vm, vn, Gi) ≤ SPD(vp, vq, Gi)}

The length l of any shortest path in Gi must be smaller than or
equal to L_SPD(Gi) and larger than or equal to S_SPD(Gi):

S_SPD(Gi) ≤ l ≤ L_SPD(Gi)

We can also compute the probability that the shortest path between
any two nodes in Gi has length l, denoted as Prob(SPD(Gi) = l),
with 0 ≤ Prob(SPD(Gi) = l) ≤ 1.
Similarly, let the length of the shortest path between vp and the
center of Gi, viC, be SPD(vp, viC, Gi). By considering all the
shortest paths between any node in Gi and viC, we compute the
longest and the shortest lengths of the shortest paths between any
node and viC, denoted as L_SPD(viC, Gi) and S_SPD(viC, Gi):

L_SPD(viC, Gi) = {SPD(vm, viC, Gi) | ∀ vp ∈ Vi, SPD(vm, viC, Gi) ≥ SPD(vp, viC, Gi)}

and

S_SPD(viC, Gi) = {SPD(vm, viC, Gi) | ∀ vp ∈ Vi, SPD(vm, viC, Gi) ≤ SPD(vp, viC, Gi)}

The probability that the shortest path between any node and viC has
length l, denoted as Prob(SPD(viC, Gi) = l), can also be computed,
with S_SPD(viC, Gi) ≤ l ≤ L_SPD(viC, Gi) and
0 ≤ Prob(SPD(viC, Gi) = l) ≤ 1.
Figure 5. Example of Num(Gi) and Num(Gi, Gj)
We denote Num(Gi) as the number of nodes in Gi and Num(Gi, Gj)
as the number of nodes in Gi that are adjacent to another subgraph
Gj. For example, Figure 5 illustrates two subgraphs Gi and Gj,
where Num(Gi) is 7 and Num(Gi, Gj) is 3.
In summary, the generalized subgraph information for sharing
includes:
L_SPD(Gi)
S_SPD(Gi)
Prob(SPD(Gi) = l)
L_SPD(viC, Gi)
S_SPD(viC, Gi)
Prob(SPD(viC, Gi) = l)
Num(Gi)
Num(Gi, Gj)
This generalized information does not publish the identities of
any sensitive nodes, only those of the insensitive nodes, viC. It
also provides the number of nodes in a subgraph and the number of
nodes that are adjacent to other subgraphs. It does not publish any
information about edges between any two sensitive nodes or between
any sensitive and insensitive nodes. It only provides information
about the lengths of the shortest paths based on the edges in a
subgraph.
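As an illustration, here is a minimal sketch that computes these statistics for a single subgraph, reusing the bfs_distances helper from the earlier sketch. Num(Gi, Gj) is omitted because it needs the cross-subgraph edges rather than the subgraph alone, and the subgraph is assumed to be connected with at least two nodes.

from collections import defaultdict
from itertools import combinations

def generalized_info(sub_nodes, sub_edges, center):
    # Shortest-path lengths are computed over the subgraph's own edges only.
    adj = defaultdict(set)
    for u, w in sub_edges:
        adj[u].add(w)
        adj[w].add(u)
    pair_lengths = []
    for u, w in combinations(sub_nodes, 2):
        d = bfs_distances(adj, u).get(w)
        if d is not None:
            pair_lengths.append(d)
    center_lengths = [d for v, d in bfs_distances(adj, center).items() if v != center]
    def dist_prob(lengths):
        # Empirical distribution Prob(SPD = l) over the observed lengths.
        return {l: lengths.count(l) / len(lengths) for l in set(lengths)}
    return {
        "L_SPD": max(pair_lengths),               # longest shortest-path length
        "S_SPD": min(pair_lengths),               # shortest shortest-path length
        "Prob_SPD": dist_prob(pair_lengths),      # Prob(SPD(Gi) = l)
        "L_SPD_c": max(center_lengths),           # longest length to the center
        "S_SPD_c": min(center_lengths),           # shortest length to the center
        "Prob_SPD_c": dist_prob(center_lengths),  # Prob(SPD(viC, Gi) = l)
        "Num": len(sub_nodes),                    # Num(Gi)
    }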
2.3 Generalized Graph Integration and Social Network Analysis
Given the generalized graph GQ' and the generalized subgraph
information of GQ' from OQ, OP wants to integrate GQ' with its
own graph GP to conduct more accurate closeness centrality
analysis. To achieve this purpose, we need to estimate the
distance between any two nodes vi and vj in GP by integrating the
generalized subgraph information of GQ'.
To estimate the distance between two nodes vi and vj in GP, we
identify the two closest insensitive nodes for each of vi and vj in
GP and use the generalized information from GQ' to estimate the
distances between the insensitive nodes. The shortest path between
vi and vj may go through the subgraph of the closest insensitive
node, the subgraph of the second closest insensitive node, or the
subgraph of a less close insensitive node. However, the chance of
going through a further insensitive node is lower, and therefore we
set a higher weight on the insensitive nodes that are closer. These
insensitive nodes are also the centers of subgraphs in the
generalized graph GQ' shared by OQ. In this paper, we only consider
the two closest insensitive nodes, but the formulation can easily
be modified to consider other insensitive nodes that are further
away. Let the closest insensitive node to vi in GP be vAC, and the
second closest insensitive node to vi in GP be vA'C. We set the
weights wA and wA' such that the weight for the closest insensitive
node is higher. Similarly, let the closest insensitive node to vj
in GP be vBC, and the second closest insensitive node to vj in GP
be vB'C; we set the weights wB and wB' such that the weight for the
closest insensitive node is higher.
In GQ, vAC, vA'C, vBC, and vB'C are the centers of the generalized
subgraphs GA, GA', GB, and GB', respectively. We estimate the
distance between vi and vj, d(vi, vj), by integrating the estimated
distances of the four possible paths going through these
insensitive nodes in a linear combination weighted by the
corresponding wa and wb. da,b(vi, vj) is the estimated distance
between vi and vj for the path going through vaC and vbC, where a
can be A or A' and b can be B or B'.
Here Gk is a generalized node on the shortest path between vaC and
vbC in the generalized graph GQ'. da,b(vi, vj) is estimated from
D(Ga, vi), D(Gb, vj), and D(Gk) if a ≠ b, and from D(vi, vj) if
a = b. D(Ga, vi) and D(Gb, vj) correspond to the portions of the
estimated distance within Ga and Gb respectively, while D(Gk) is
the portion of the estimated distance in each subgraph Gk that the
shortest path between vi and vj goes through in the generalized
graph GQ'. If vi is not the same as vaC, D(Ga, vi) is computed from
the probabilities of the shortest path lengths in Ga and the
percentage of nodes in Ga that are adjacent to the subgraph
immediately following Ga on the shortest path between vi and vj in
the generalized graph GQ'. If vi is the same as vaC, D(Ga, vi) is
computed from the probabilities Prob(SPD(vaC, Ga) = l). D(Gk) is
computed from the probabilities Prob(SPD(Gk) = l). The computation
of D(Gb, vj) is done similarly.
The percentage of nodes in Ga acting as gatekeepers, i.e., nodes
adjacent to Gk, is given by Num(Ga, Gk)/Num(Ga), where Gk is the
subgraph immediately following Ga on the shortest path between vi
and vj in the generalized graph GQ'.
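The shape of this estimation can be sketched as follows. The original weighting formulas are not reproduced above, so combining the weights as a product is an assumption of this sketch, as are the container names.

def estimate_distance(d_ab, w):
    # d_ab[(a, b)]: estimated distance between vi and vj for the path through
    # centers a and b, with a in {A, A'} and b in {B, B'}.
    # w[a]: weight favoring the closer insensitive node; treating the combined
    # weight as the product w[a] * w[b] is an assumption of this sketch.
    return sum(w[a] * w[b] * d for (a, b), d in d_ab.items())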
D(vi, vj) corresponds to the estimated distance between vi and vj
when both vi and vj are nodes of the same subgraph. By using the
estimated distances between any two nodes vi and vj, computed with
the shared information from GQ', we compute the closeness
centrality as follows:

closeness centrality(vi) = (n − 1) / Σ_{j=1, j≠i}^{n} d(vi, vj)

where n is the total number of nodes in GP.
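This formula transcribes directly into code, where dist is whatever estimated-distance function the integration step provides:

def closeness_centrality(v, nodes, dist):
    # dist(v, u): estimated distance d(v, u) between v and u.
    total = sum(dist(v, u) for u in nodes if u != v)
    return (len(nodes) - 1) / total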
3. Experiment
We have conducted two experiments to evaluate the effectiveness
of our proposed techniques. The objective is to conduct social
network analysis based on the incomplete information of one
organization (GP) and the shared information from another
organization (GQ).
3.1 Experiment 1: Random Graphs
In this experiment, we assume GQ has the complete information
and evaluate how integrating the generalized information from GQ
with GP can help us conduct social network analysis as accurately
as if we had the complete information.
3.1.1 Datasets
We generated a series of random graphs as our datasets in this
experiment. As discussed in [19, 21], terrorist and criminal social
networks usually consist of clusters, which represent potential
gangs. In other words, a terrorist or criminal social network is
not a graph whose nodes are randomly connected to one another with
the same probability. To simulate real-world data, we constructed
random graphs by generating a random number of clusters and
connecting these clusters randomly. Each cluster is generated by
creating an edge between any two nodes with probability p. To
control the size of clusters, we set the upper limit for each
cluster size to 40% of the random graph size.
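A minimal sketch of such a generator is shown below. The cluster-size distribution and the single bridge edge between consecutive clusters are assumptions of the sketch; the text above only specifies the intra-cluster edge probability p and the 40% cap on cluster size.

import random

def clustered_random_graph(n, p, max_cluster_frac=0.4):
    # Split n nodes into random clusters, each capped at max_cluster_frac * n.
    nodes = list(range(n))
    random.shuffle(nodes)
    cap = max(2, int(max_cluster_frac * n))
    clusters = []
    while nodes:
        size = random.randint(2, cap)
        clusters.append(nodes[:size])
        nodes = nodes[size:]
    edges = set()
    # Within each cluster, create an edge between any two nodes with probability p.
    for cluster in clusters:
        for i, u in enumerate(cluster):
            for w in cluster[i + 1:]:
                if random.random() < p:
                    edges.add((u, w))
    # Connect consecutive clusters with one random bridge edge each.
    for a, b in zip(clusters, clusters[1:]):
        edges.add((random.choice(a), random.choice(b)))
    return edges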
The generated random graphs are considered as a social network
with complete information (GQ). GP is then generated by randomly
removing edges from GQ, since GP only has partial information of
the social network. In the experiment, we create the generalized
graph from GQ and conduct social network analysis by integrating
GP and the generalized graph of GQ.
3.1.2 Evaluation
In this experiment, closeness centrality is utilized as the social
network analysis. The closeness centrality computed from GQ is
taken as the benchmark, since it has the complete information. The
closeness centrality computed from GP is considered the worst
case, since it only has partial information. The closeness
centrality computed by our proposed subgraph generation and social
network integration techniques is compared with the benchmark. It
is also compared with the worst case to determine how much it can
reduce the errors. The error is measured as the difference between
the closeness centrality obtained from our proposed technique or
from GP and that obtained from GQ, divided by the closeness
centrality obtained from GQ.
In the experiments, we tested the impact of two random graph
generation parameters on the performance of the proposed
techniques. The parameters are Size (the size of GQ in terms of
number of nodes) and Similarity (the similarity between GQ and GP).
The cluster size (the upper limit of cluster size) is set to 40% of
Size. The average closeness centrality and average error reported
in the experiment are computed from ten sets of random graphs.
3.1.3 Experimental Results
Table 1 and Figure 6 present the average closeness centrality
computed from the estimation based on the proposed subgraph
generalization and social network integration techniques (E), from
GQ (benchmark), and from GP (worst case) for different Size. Table
2 and Figure 7 present the average error of E and GP for different
Size. Similarity is set to 50% when we are evaluating the impact of
Size.
Table 1. Average Closeness Centrality of E, GQ and GP with different Size (Similarity = 50%)

Size   200    100    50     20     10
E      0.365  0.353  0.276  0.310  0.349
GQ     0.402  0.390  0.320  0.315  0.379
GP     0.315  0.286  0.207  0.248  0.330
Figure 6. Average Closeness Centrality of E, GQ and GP with different Size
It is found that the error increases when Size decreases from 200
to 50 and then decreases when Size decreases from 50 to 10, for
both E and GP. However, the error of E is consistently and
substantially lower than that of GP. The largest difference in
error, 0.213, is observed when Size is 50. When Size is only 10,
the difference in error is 0.061.
Table 2. Average Error in Closeness Centrality of E and GP with different Size (Similarity = 50%)

Size   200    100    50     20     10
E      0.086  0.096  0.135  0.080  0.070
GP     0.216  0.267  0.356  0.209  0.131
Table 3. Average Closeness Centrality of E, GQ and GP with different Similarity (Size = 50)

Similarity  40%    50%    60%    70%    80%    90%
E           0.290  0.276  0.278  0.287  0.287  0.292
GQ          0.343  0.320  0.327  0.334  0.339  0.330
GP          0.203  0.207  0.245  0.275  0.296  0.318
Figure 7. Average Error in Closeness Centrality of E and GP with different Size (Similarity = 50%)
Figure 8. Average Closeness Centrality of E, GQ and GP with different Similarity (Size = 50)
Table 3 and Figure 8 present the average closeness centrality of E,
GQ, and GP for different Similarity. Table 4 and Figure 9 present
the average error of E and GP for different Similarity. Size is set
to 50 when we evaluate the impact of Similarity.
It is found that the error decreases consistently from 0.426 to
0.044 for GP when Similarity increases from 40% to 90%. Since GP
becomes more similar to GQ when Similarity increases, the error
should reduce to 0 when Similarity is 100%. However, the error
stays at around 0.15 for E when Similarity increases from 40% to
90%; it does not vary considerably when Similarity changes. That
means the proposed technique of integrating generalized information
produces a reasonable performance, with an error of 0.163, even
when the similarity is as low as 40%. However, the performance does
not improve much when Similarity increases. Because the proposed
technique still uses the generalized information rather than the
actual data as Similarity increases, its performance eventually
becomes worse than using GP alone to measure the closeness
centrality. However, this only happens when Similarity is over 80%,
which is very unlikely in reality.
Table 4. Average Error in Closeness Centrality of E and GP with different Similarity (Size = 50)

Similarity  40%    50%    60%    70%    80%    90%
E           0.163  0.135  0.148  0.144  0.152  0.159
GP          0.426  0.356  0.249  0.183  0.129  0.044
Figure 9. Average Error in Closeness Centrality of E and GP with different Similarity
3.2 Experiment 2: Global Salafi Jihad Terrorist Social Network
3.2.1 Dataset
In this experiment, we used the Global Salafi Jihad terrorist
social network [11], [22] as our data source to create incomplete
social networks GP and GQ. The Global Salafi Jihad terrorist social
network has 366 nodes and 1,275 links. There are four major
clusters: Central Staff of al Qaeda (CSQ), Core Arab (CA),
Southeast Asia (SA), and Maghreb Arab (MA). These clusters are
connected in the Global Salafi Jihad terrorist social network. To
simulate the real-world situation in which every agency is well
informed about one or two terrorist clusters but less familiar with
the others, we randomly removed nodes from each cluster with
different percentages and then randomly removed edges from the
remaining subgraph. First, we generated GP by randomly removing
30%, 30%, 70%, and 70% of the nodes from CSQ, CA, SA, and MA
respectively. Similarly, we generated GQ by randomly removing 70%,
70%, 30%, and 30% of the nodes from CSQ, CA, SA, and MA
respectively. An edge with both of its end nodes removed was also
removed. We further removed K% of the edges from GP and GQ. Ten
pairs of GP and GQ were generated for each K.
3.2.2 Evaluation
In this experiment, we tested the impact of K on the performance
of the proposed techniques. We compared the average closeness
centrality and average error obtained from the proposed techniques
(E) with those obtained from the complete Global Salafi Jihad
terrorist social network (G) and those obtained only from GP (the
worst case).
3.2.3 Experimental Results
Table 5 and Figure 10 present the average closeness centrality of
E, G, and GP for different K. Table 6 and Figure 11 present the
average error in closeness centrality of E and GP for different K.
It is found that the error for E decreases from 0.302 to 0.206 when
K decreases from 50 to 30. The error for GP decreases from 0.378 to
0.274 when K decreases from 50 to 30. The average error of E is
consistently lower than that of GP by a substantial amount. When
K = 50, 40, and 30, the error of E is lower than the error of GP by
20%, 23%, and 25% respectively. This shows that a more accurate
closeness centrality can be obtained by integrating the generalized
information from GQ with GP.
Table 5. Average Closeness Centrality of E, G and GP with different K

K    50     40     30
E    0.168  0.183  0.191
G    0.241  0.241  0.241
GP   0.149  0.166  0.175
Figure 10. Average Closeness Centrality of E, G and GP with different K
Table 6. Average Error in Closeness Centrality of E and GP with different K

K    50     40     30
E    0.302  0.240  0.206
GP   0.378  0.312  0.274
Figure 11. Average Error in Closeness Centrality of E and GP with different K
4. Conclusion
Social network analysis is very useful in investigating the
communication patterns of terrorists and criminals and the
structure of their organizations. However, most law enforcement
and intelligence agencies hold only a small piece of information
that is not integrated with the information from other agencies.
Due to privacy issues, information sharing is not always possible.
In this paper, we propose to construct generalized graphs before
sharing a social network with other parties. The generalized graph
is then integrated with the social network owned by the agency
itself to conduct social network analysis such as closeness
centrality. Our experiments show that our proposed technique
improves the closeness centrality measurements substantially. In
our future work, we shall investigate other subgraph generation and
integration techniques and conduct experiments with terrorist
social networks.
5. REFERENCES
[1] L. Backstrom, C. Dwork, and J. Kleinberg, "Wherefore Art Thou R3579X? Anonymized Social Networks, Hidden Patterns, and Structural Steganography," in WWW'07, Banff, Alberta, Canada, 2007.
[2] Z. Baird, J. Barksdale, and M. Vatis, Creating a Trusted Network for Homeland Security, Markle Foundation, 2003.
[3] Z. Baird and J. Barksdale, Mobilizing Information to Prevent Terrorism: Accelerating Development of a Trusted Information Sharing Environment, Markle Foundation, 2006.
[4] K. Caruson, S. A. MacManus, M. Khoen, and T. A. Watson, "Homeland Security Preparedness: The Rebirth of Regionalism," Publius, 35(1), 2005, pp.143-189.
[5] R. R. Friedmann and W. J. Cannon, "Homeland Security and Community Policing: Competing or Complementing Public Safety Policies," Journal of Homeland Security and Emergency Management, 4(4), 2005, pp.1-20.
[6] M. Hay, G. Miklau, D. Jensen, P. Weis, and S. Srivastava, "Anonymizing Social Networks," Technical Report 07-19, University of Massachusetts, Amherst, 2007.
[7] N. Li and T. Li, "t-Closeness: Privacy Beyond k-Anonymity and l-Diversity," in ICDE'07, 2007.
[8] K. Liu and E. Terzi, "Towards Identity Anonymization on Graphs," in ACM SIGMOD'08, Vancouver, BC, Canada: ACM Press, 2008.
[9] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam, "l-Diversity: Privacy Beyond k-Anonymity," Proceedings of the 22nd International Conference on Data Engineering, 2006.
[10] M. E. Nergiz, M. Atzori, and C. Clifton, "Hiding the Presence of Individuals from Shared Databases," in SIGMOD'07, 2007.
[11] M. Sageman, Understanding Terror Networks, University of Pennsylvania Press, 2004.
[12] P. Samarati, "Protecting Respondents' Identities in Microdata Release," IEEE Transactions on Knowledge and Data Engineering, 2001.
[13] L. Sweeney, "k-Anonymity: A Model for Protecting Privacy," International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5), 2002, pp.557-570.
[14] D. Thacher, "The Local Role in Homeland Security," Law & Society, 39(3), 2005, pp.557-570.
[15] B. Thuraisingham, "Security Issues for Federated Database Systems," Computers and Security, North Holland, December 1994.
[16] B. Thuraisingham, "Assured Information Sharing: Technologies, Challenges and Directions," Chapter 1 in Intelligence and Security Informatics: Applications and Techniques, H. Chen and C. C. Yang (eds.), Springer-Verlag, to appear in 2008.
[17] R. C. Wong, J. Li, A. Fu, and K. Wang, "(α,k)-Anonymity: An Enhanced k-Anonymity Model for Privacy-Preserving Data Publishing," Proceedings of SIGKDD, August 20-23, Philadelphia, Pennsylvania, US, 2006.
[18] X. Xiao and Y. Tao, "Personalized Privacy Preservation," Proceedings of SIGMOD, June 27-29, Chicago, Illinois, 2006.
[19] X. Xiao and Y. Tao, "m-Invariance: Towards Privacy Preserving Re-publication of Dynamic Datasets," in ACM SIGMOD'07: ACM Press, 2007.
[20] C. C. Yang, N. Liu, and M. Sageman, "Analyzing the Terrorist Social Networks with Visualization Tools," Proceedings of the IEEE International Conference on Intelligence and Security Informatics, San Diego, CA, US, May 23-24, 2006.
[21] C. C. Yang, "Information Sharing and Privacy Protection of Terrorist or Criminal Social Networks," Proceedings of the IEEE International Conference on Intelligence and Security Informatics, Taipei, Taiwan, 2008.
[22] C. C. Yang and M. Sageman, "Analysis of Terrorist Social Networks with Fractal Views," Journal of Information Science, accepted for publication.
[23] X. Ying and X. Wu, "Randomizing Social Networks: A Spectrum Preserving Approach," in SIAM International Conference on Data Mining (SDM'08), Atlanta, GA, 2008.
[24] B. Zhou and J. Pei, "Preserving Privacy in Social Networks against Neighborhood Attacks," in IEEE International Conference on Data Engineering, 2008.
DESIGN OF A TEMPORAL GEOSOCIAL SEMANTIC WEB
FOR MILITARY STABILIZATION AND RECONSTRUCTION
OPERATIONS
BHAVANI THURAISINGHAM
bhavani.thuraisingham@utdallas.edu
LATIFUR KHAN
[email protected]
MURAT KANTARCIOGLU
[email protected]
VAIBHAV KHADILKAR
vvk072000@utdallas.edu
Department of Computer Science, Erik Jonsson School of Engineering & Computer Science, The University of Texas at Dallas, 800 W. Campbell Road; MS EC31, Richardson, TX 75080 U.S.A.
ABSTRACT
The United States and its Allied Forces have had tremendous
success in combat operations. This includes combat in Germany,
Japan and, more recently, in Iraq and Afghanistan. However, not
all of our stabilization and reconstruction operations (SARO)
have been as successful. Recently, several studies on SARO have
been carried out by the National Defense University as well as
for Army Science and Technology. One of the major
conclusions is that we need to plan for SARO while we are
planning for combat. That is, we cannot start planning for SARO
after the enemy regime has fallen. In addition, the studies have
shown that security, power and jobs are key ingredients for
success during SARO. It is important to give positions to some of
the power players from the fallen regime provided they are
trustworthy. It is critical that investments are made to stimulate
the local economies. The studies have also analyzed the various
technologies that are needed for successfully carrying out SARO
which includes sensors, robotics and information management. In
this project we will focus on the information management
component for SARO. As stated in the work by the Naval
Postgraduate School, we need to determine the social, political
and economic relationships between the local communities as
well as determine who the important people are. This work has
also identified the 5Ws (Who, When, What, Where and Why)
and the H (How).
To address the key technical challenges for SARO, we are
defining a lifecycle for SARO and subsequently developing a
Temporal Geosocial Service Oriented Architecture System
(TGS-SOA) that utilizes Temporal Geosocial Semantic Web (TGS-SW)
technologies for managing this lifecycle. We are developing
techniques for representing temporal geosocial information and
relationships, integrating such information and relationships,
querying such information and relationships and finally
reasoning about such information and relationships so that the
commander can answer questions related to the 5Ws and H. To
our knowledge, this is the first attempt to develop
TGS-SW technologies as well as lifecycle management for
SARO.
Categories and Subject Descriptors
H.2.0 [Information Systems]: General – Security, integrity, and
protection.
General Terms
Security.
Keywords
SARO, SAROL
1. INTRODUCTION
According to a GAO Report published in October 2007 [13],
“DOD has taken several positive steps to improve its ability to
conduct stability operations but faces challenges in developing
capabilities and measures of effectiveness, integrating the
contributions of non-DOD agencies into military contingency
plans, and incorporating lessons learned from past operations
into future plans. These challenges, if not addressed, may hinder
DOD’s ability to fully coordinate and integrate stabilization and
reconstruction activities with other agencies or to develop the
full range of capabilities those operations may require.” Around
the same time, the Center for Technology and National Security
Policy at NDU [4] and the Naval Postgraduate School [16]
identified some key technologies crucial for the military
stabilization and reconstruction processes in Iraq and
Afghanistan. These technologies include those in electronics,
sensors, and medical as well as in information management.
As illustrated in Figure 1-1 (duplicated from [16]), NPS has
identified three types of reconstruction efforts: one they classify
as easy, including activities such as building bridges and schools;
the second they identify as sensitive, namely developing policies
for governance; and the third they identify as hard, namely
understanding the cultures of the people, engaging the warlords in
negotiation, getting their buy-in, and sustaining long-term
security. In addition, the military needs to get information about
the locations of the fiefdoms of the warlords, the security
breaches (e.g., IEDs) at the various locations, as well as
associations between the different groups.
We are addressing the difficult challenges of military
stabilization (as stated in Figure 1-1) by developing innovative
information management technologies. In particular, we are
Figure 1-1. Stabilization and Reconstruction Operations, duplicated from (Guttieri, 2007a). The figure groups activities as EASY (e.g., build schools, wells, clinics; distribute medical supplies; build agriculture systems), SENSITIVE (e.g., counter narcotics, support elections, support DDR, support ANA and ANP, implement job programs), and DIFFICULT (e.g., influence warlords, mitigate conflict, improve human rights, improve governance, foster a sustainable economy, promote stable democracy), building toward self-sufficiency, 'Rule of Law', 'Reconstruction and Development' and 'Enduring Security'.
designing and developing a temporal geosocial semantic web that
will integrate heterogeneous information sources, identify the
relationships between the different groups of people that evolve
over time and location and facilitate sharing of the different types
of information to support SARO. Our goal is to get the right
information to the decision maker so that he/she can make
decisions in the midst of uncertain and unanticipated situations.
The organization of this paper is as follows. Some of the unique
challenges will be discussed in section 2. Supporting
technologies we are utilizing are discussed in section 3. Our
approach to designing and developing the system for SARO is
discussed in section 4. The paper is concluded in section 5.
2. UNIQUE CHALLENGES FOR A SUCCESSFUL SARO
In this section we discuss the technical challenges, our
capabilities related to those challenges, and our strategy for
solving the problem.
2.1 Ingredients for a Successful SARO
Recently, several studies on SARO have emerged. Notable among
them is the study carried out by NDU [4]. In this study the
authors give examples of several successful and failed SAROs.
The authors state that the Iraq war was hugely successful in that
US and Allied forces were able to defeat the enemy in record
time. However, the US and the Allied forces were not prepared
for SARO and subsequently nation building. For SARO to be
successful, its planning should be carried out concurrently with
the planning of the war. This means that as soon as the war ends,
plans are already in place to carry out SARO. Sometimes the
latter part of the war may be carried out in conjunction with
SARO.
The authors discuss the various SAROs that the US has engaged in,
including those in Germany, Japan, Somalia and the Balkans, and
describe the successes of SAROs such as Germany and Japan and the
failure of the SARO in Somalia. The authors also discuss why Field
Marshal Montgomery's mission for regime change in Egypt in 1956
failed: the then Prime Minister Anthony Eden did not plan for what
happens after the regime change, and as a result the operation was
a failure. By contrast, overthrowing communism in Malaya in the
1950s was a huge success for Field Marshal Sir Gerald Templer
because he got the buy-in of the locals, the Malayans, Chinese and
Indians. He also gave them the impression that Britain was not
intending to stay long in Malaya. As a result the locals gave him
their support, and together they were able to overthrow communism.
Based on the above examples, the authors state that four
concurrent tasks have to be carried out in parallel: (i) Security:
ensure that those who attempt to destroy the emergence of a new
society are suppressed. This includes identifying who the
troublemakers or terrorists are and destroying their capabilities.
(ii) Law and order: military and police skills are combined to
ensure that there are no malicious efforts to disturb the peace.
(iii) Repair infrastructure: utilize the expertise of engineers and
geographers, both from allied countries and the local people, to
build the infrastructure. (iv) Establish an interim government
effectively: understand the cultures of the local people, their
religious beliefs and their political connections, and establish a
government.
The authors also state that there are key elements to success:
(i) security, (ii) power and (iii) jobs. We have already explained
the importance of security. Power has to be given to key people.
For example, those who were in powerful positions in the fallen
regime may make alliances with the terrorists. Therefore such
people have to be carefully studied and given power only if
appropriate. Usually after a regime change people are left homeless
and without jobs. Therefore it is important to give incentives for
foreign nations to invest in the local country and create jobs.
This means forming partnerships with the locals as well as with
foreign investors.
To end this section, we quote General Anthony Zinni, USMC (Ret.),
former Commander, US Central Command: "What I need to understand
is how these societies function. What makes them tick? Who makes
the decisions? What is it about their society that is so remarkably
different in their values and the way they think in my western
white man mentality?" Essentially, he states that what is crucial
is getting cultural intelligence. Our project will propose
solutions to precisely capture the cultural and political
relationships and information about the locals, model these
relationships, and exploit these relationships for SARO.
2.2 Unique Technology Challenges for SARO
A technology analysis study carried out by Army Science and
Technology states that many of the technologies for SARO are
the same as those for combat operations [5]. However the study
also states that there are several unique challenges for SARO and
elaborates as follows: “The nature of S&R operations also
demands a wider focus for ISR. In addition to a continuing need
for enemy information (e.g., early detection of an insurgency),
S&R operations require a broad range of essential information
that is not emphasized in combat operations. A broader, more
specific set of information needs must be specified in the
Commander’s Critical Information Requirements (CCIR)
portfolio. CCIR about areas to be stabilized include the nature
of pre-conflict policing and crime, influential social networks,
religious groups, and political affiliations, financial and
economic systems as well as key institutions and how they work
are other critical elements of information. Mapping these
systems for specific urban areas will be tedious though much
data is already resident in disparate databases." The challenges
involved in computational modeling of culturally infused social
networks for SARO are discussed in [28].
3. SUPPORTING TECHNOLOGIES FOR SARO
We are integrating several of our existing technologies and
building new technologies for SARO. We will describe the
supporting (i.e., existing) technologies we are using in this
section and the new technologies in the next.
3.1 Geospatial Semantic Web, Police Blotter, Knowledge Discovery
We have developed a geospatial semantic web together with
information integration and mining. In particular, we have
developed techniques for (i) matching or aligning ontologies,
(ii) a police blotter prototype demonstrated at GEOINT, and
(iii) GRDF and geospatial ontologies for emergency response.
Matching or Aligning Ontologies: In open and evolving
systems various parties continue to adopt differing ontologies
[20], with the result that instead of reducing heterogeneity, the
problem becomes compounded. Matching, or aligning ontologies
is a key interoperability enabler for the semantic web. In this
case, ontologies are taken as input and correspondences between
the semantically related entities of these ontologies are
determined as output, as well as any transformations required for
alignment. It is helpful to have ontologies that we need to match
to refer to the same upper ontology or to conform to the same
reference ontology [12, 25]. A significant number of works have
already been done from the database perspective [9, 27, 29] and
from the machine learning perspective [10, 11, 19, 29]. Our
approach falls into the latter perspective. However, it focuses not
only on traditional data (i.e., text) but also goes beyond text to
geospatial data. The complex nature of this geospatial data poses
further challenges and provides additional clues in the matching
process. Given a
set of data sources, S1, S2, … each of which is represented by
data model ontologies O1, O2, … the goal is to find similar
concepts between two ontologies (namely, O1 and O2) by
examining their respective structural properties and instances, if
available. For the purposes of this problem, O1 and O2 may
belong to different domains drawn from any existing knowledge
domain. Additionally, these ontologies may vary in breadth,
depth, and through the types of relationships between their
constituent concepts.
The challenges involved in the alignment of these ontologies,
assuming that they have already been constructed, include the
proper assignment of instances to concepts in their identifying
ontology. Quantifying the similarity between a concept from the
ontology, O1 (called concept A) and a concept from the O2
ontology (called concept B) involves computing measures taking
into account three separate types of concept similarity: name
similarity, relationship similarity and content similarity.
Matching is accomplished by determining the weighted similarity
between the name similarity and content similarity, along with
their associated instances. Name similarity between A and B
exploits name matching with the help of WordNet and the
Jaro-Winkler string metric. Relationship similarity between A and B
takes into account the number of equivalent spatial relationships,
along with their sibling similarity and parent similarity.
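A minimal sketch of this weighted combination; the weight value is hypothetical, since the actual weighting is not given here.

def concept_similarity(name_sim, content_sim, w_name=0.5):
    # w_name is a hypothetical weight; instance-based content similarity
    # receives the remaining weight.
    return w_name * name_sim + (1 - w_name) * content_sim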
Content similarity: We describe two algorithms we have developed
for content similarity. The first is an extension of the ideas
presented in [8] regarding the use of N-grams extracted from the
values of the compared attributes. Despite the utility of this
algorithm, there are situations where this approach produces
misleading results. To resolve these difficulties, we present a
second instance matching algorithm, called K-means with Normalized
Google Distance (KM-NGD), which determines the semantic similarity
between the values of the compared attributes by leveraging
K-medoid clustering and a measure known as the Normalized Google
Distance.
Content Similarity Using N-grams: Instance matching between two
concepts involves measuring the similarity between the instance
values across all pairs of compared attributes. This is
accomplished by extracting instance values from the compared
attributes, subsequently extracting a characteristic set of N-grams
from these instances, and finally comparing the respective N-grams
for each attribute. In the following, we use the term "value type"
to refer to a unique value of an attribute involved in the
comparison. We extract distinct 2-grams from the instances and
consider each unique 2-gram extracted as a value type. As an
example, for the string "Locust Grove Dr." that might appear under
an attribute named Street for a given concept, some N-grams that
would be extracted are 'Lo', 'oc', 'cu', 'st', 't ', 'ov', 'Dr'
and 'r.'.
N-gram similarity is based on a comparison between the concepts of
entropy and conditional entropy known as the Entropy Based
Distribution (EBD):

EBD = H(C|T) / H(C)    (1)
In this equation, C and T are random variables where C indicates
the union of the attribute types C1 and C2 involved in the
comparison (C indicates "column", which we will use
synonymously with the term “attribute”) and T indicates the
value type, which in this case is a distinct N-gram. EBD is a
normalized value with a range from 0 to 1. Our experiments involve
1:1 comparisons between attributes of compared concepts, so the
value of C would simply be C1 ∪ C2. H(C) represents the entropy of
a group of value types for a particular column (or attribute) while
H(C|T) indicates the conditional entropy of a group of identical
value types.
Intuitively, an attribute contains high entropy if it is impure; that
is, the ratios of value types making up the attribute values are
similar to one another. On the other hand, low entropy in an
attribute exists when one value type exists at a much higher ratio
than any other type. Conditional entropy is similar to entropy in
the sense that ratios of value types are being compared. However,
the difference is that we are computing the ratio between
identical value types extracted from different attributes. Figures
2-1a and 2-1b provide examples to help visualize the concept. In
both examples, crosses indicate value types originating from C1,
while squares indicate value types originating from C2. The
collection of a given value type is represented as a cluster (larger
circle). In figure 2a, the total number of crosses is 10 and the
total number of squares is 11, which implies that entropy is very
high. The conditional entropy is also high, since the ratios of
crosses to squares within two of the clusters are equal within one
and nearly equal within the other. Thus, the ratio of conditional
entropy to entropy will be very close to 1, since the ratio of
crosses to squares is nearly the same from an overall value type
perspective and from an individual value type perspective. Figure
2b portrays a different situation: while the entropy is 1.0, the
ratio of crosses to squares within each individual cluster varies
considerably. One cluster features all crosses and no squares,
while another cluster features a 3:1 ratio of squares to crosses.
The EBD value for this example consequently will be lower than
the EBD for the first example because H (C |T ) will be a lower
value.
Figures 2-1a and 2-1b. Distribution of different value types when EBD is high (2-1a) and low (2-1b). H(C) is similar to H(C|T) in 2-1a but dissimilar in 2-1b.
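The EBD computation can be sketched directly from this description, using 2-grams as the value types; attribute values are assumed to be non-empty plain strings.

import math
from collections import Counter

def entropy(counts):
    # Shannon entropy of a discrete distribution given as raw counts.
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def ebd(values_a, values_b):
    # Extract 2-grams from each attribute's instance values.
    grams_a = [v[i:i + 2] for v in values_a for i in range(len(v) - 1)]
    grams_b = [v[i:i + 2] for v in values_b for i in range(len(v) - 1)]
    # H(C): entropy of the column label over all 2-gram occurrences.
    h_c = entropy([len(grams_a), len(grams_b)])
    # H(C|T): conditional entropy of the column label given the value type.
    count_a, count_b = Counter(grams_a), Counter(grams_b)
    total = len(grams_a) + len(grams_b)
    h_c_given_t = sum((count_a[t] + count_b[t]) / total
                      * entropy([count_a[t], count_b[t]])
                      for t in set(count_a) | set(count_b))
    return h_c_given_t / h_c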
Problems of N-Gram Approach For Instance Matching
Despite the utility of the aforementioned method, it is susceptible
to misleading results. For example, if an attribute named 'City'
associated with a concept from O1 is compared against an
attribute named 'ctyName' associated with a concept from O2, the
attribute values for both concepts might consist of city names
from different parts of the world. 'City' might contain the names
of North American cities, all of which use English and other
Western languages as their basis language, while 'ctyName',
might describe East Asian cities, all of which use languages that
are fundamentally different from English or any Western
language. According to human intuition, it is obvious that the
comparison occurs between two semantically similar attributes.
However, because of the tendency for languages to emphasize
certain sounds and letters over others, the extracted sets of 2-grams from each attribute would very likely be quite different
from one another. For example, some values of 'City' might be
"Dallas", "Houston" and "Halifax", while values of 'ctyName'
might be "Shanghai", "Beijing" and "Tokyo". Based on these
values alone, there is virtually no overlap of N-grams. Because
most of the 2-grams belong specifically to one attribute or the
other, the calculated EBD value would be low. This would most
likely be a problem every time global data needed to be
compared for similarity.
Using Clustering and Semantic Distance for Content
Similarity To overcome the problems of the N-gram, we have
developed method that is free from the syntactic requirements of
N-grams and uses the keywords in the data in order to extract
relevant semantic differences between compared attributes. This
method, known as KM-NGD, extracts distinct keywords from the
compared attributes and places them into distinct semantic
clusters via the K-medoid algorithm, where the distance metric
between each pair of distinct data points in a given cluster (a
data point is represented as an occurrence of one of the distinct
keywords) is known as the Normalized Google Distance (NGD).
The EBD is then calculated by comparing the words contained in
each cluster, where a cluster is considered a distinct value type.
Normalized Google Distance: Before describing the process in
detail, NGD must be formally defined:

NGD(x, y) = (max{log f(x), log f(y)} − log f(x, y)) / (log M − min{log f(x), log f(y)})    (2)
In this formula, f (x ) is the number of Google hits for search
term x, f ( y ) is the number of Google hits for search term y,
f (x ,y ) is the number of Google hits for the tuple of search terms
xy, and M is the number of web pages indexed by Google.
NGD(x, y ) is a measure for the symmetric conditional
probability of co-occurrence of x and y. In other words, given that
term x appears on a web page, NGD(x, y ) will yield a value
indicating the probability that term y also appears on that same
web page. Conversely, given that term y appears on a web page,
NGD(x, y ) will yield a value indicating the probability that term
x also appears on that page. Once the keyword list for a given
attribute comparison has been created, all related keywords are
grouped into distinct semantic clusters. From here, we calculate
the conditional entropy of each cluster by using the number of
occurrences of each keyword in the cluster, which is
subsequently used in the final EBD calculation between the two
attributes. The clustering algorithm used is the K-Medoid
algorithm.
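Equation 2 transcribes directly into code; the hit counts f(x), f(y), f(x,y) and the index size M are assumed to have been obtained from the search engine beforehand and to be positive.

import math

def ngd(f_x, f_y, f_xy, M):
    # Normalized Google Distance computed from raw hit counts (Equation 2).
    lx, ly, lxy = math.log(f_x), math.log(f_y), math.log(f_xy)
    return (max(lx, ly) - lxy) / (math.log(M) - min(lx, ly))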
Geospatial Police Blotter: We have developed a toolkit that (a)
facilitates integration of various police blotters into a unified
representation and (b) supports semantic search at various levels
of abstraction along with spatial and temporal views. Software has
been developed for integration of various police blotters along
with semantic search capability. We are particularly interested in
Police Blotter Crime Analysis. A police blotter is the daily
written record of events (such as arrests) in a police station,
which is released by every police station periodically. These
records are available publicly on the web, which provides a wealth
of information for analyzing crime patterns across multiple
jurisdictions. The blotters may come in different data formats:
structured, semi-structured (HTML), and unstructured (natural
language text). In addition, many environmental criminology
techniques assume that data are locally maintained and that the
dataset is homogeneous as well as certain. This assumption is not
realistic, as data is often managed by different jurisdictions;
therefore, the analyst may have to spend an unusually large amount
of time linking related events across different jurisdictions
(e.g., the sniper shootings across Washington DC, Virginia and
Maryland in October 2002).
There are major challenges that a police officer faces when he
wants to analyze different police blotters to study a pattern
(e.g., a spatial-temporal activity pattern) or a trail of events.
There is no way a police officer can pose a query that will be
handled by considering more than one distributed police blotter on
the fly. There is no cohesive tool for the police officer to view
the blotters from different counties, interact with and visualize
the trail of crimes, and generate analysis reports. With current
tools, the blotters can only be searched by keyword; conceptual
search is not supported, and spatial-temporal patterns and
connections between the various pieces cannot be identified.
Therefore, we need a tool that will integrate multiple distributed
police blotters, extract semantic information from a police
blotter, and provide a seamless framework for queries with multiple
granularities.
With regard to integration, structured, semi-structured and
unstructured data are mapped into relational tables and stored in
Oracle 10g Release 2. For information extraction from unstructured
text we used LingPipe (http://www.alias-i.com/lingpipe/). During
the extraction and mapping process, we exploited a Crime Event
ontology similar to the NIBRS Group A and Group B offences. During
the extraction process, we extracted crime type, offender sex, and
offender race (if available). This ontology is multi-level with
depth 4.
Figure 2-3. Demonstration of Correlation of crimes across
multiple jurisdictions using Google Maps API. By clicking on
the path (blue line), all relevant crime records similar to each
other are shown in popup windows.
To identify correlations of crimes occurring in multiple
jurisdictions, an SQL query is submitted to the database. After
fetching the relevant tuples, subsequent correlation analysis is
performed in main memory (MM). Correlation analysis is accomplished
by calculating the pairwise similarity of these tuples and
constructing a directed graph from the results. Nodes in the graph
represent tuples and edges represent similarity values between
tuples. If the similarity of two tuples falls below a certain
threshold, we remove the corresponding edge from the graph.
Finally, a set of paths in the graph demonstrates correlations of
crimes across multiple jurisdictions. By clicking on a path, all
relevant tuples similar to each other are shown in a popup window
(see Figure 2-3). For implementation, we have developed our
application as an applet in a portlet. To show addresses on the
map, we exploited Yahoo's geocoder, which converts street addresses
to latitude/longitude. To display the map, both OpenMap and the
Google Maps API are used. We used the following data sets in our
demonstration and will examine this data set for our SBIR project.
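The thresholding step can be sketched as follows, with similarity standing in for whatever pairwise similarity measure is computed over the fetched tuples:

def correlation_edges(tuples, similarity, threshold):
    # Keep an edge between two crime records only when their pairwise
    # similarity meets the threshold; paths in the resulting graph then
    # link related events across jurisdictions.
    return [(i, j)
            for i in range(len(tuples))
            for j in range(i + 1, len(tuples))
            if similarity(tuples[i], tuples[j]) >= threshold]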
Figure 2-2. Basic and advanced query interface to perform semantic search using the Google Maps API. Details of a crime are shown in a popup window.
We provide basic and advanced queries to perform semantic search
(Figure 2-2). A basic query allows querying by crime type from the
ontology along with a date filter. An advanced query extends the
basic query facility by adding address fields (block, street,
city). The ontology allows users to search at various levels of
abstraction. Results are shown in two forms of view: spatial and
temporal. Furthermore, in each view results are shown either in
individual crime form or in aggregated (number of crimes) form. In
the temporal view, results are shown on a weekly, bi-weekly,
monthly, quarterly or yearly basis. Clicking on a crime location in
the spatial view shows the details of the crime along with the URL
of the original source.
Police Blotters for Dallas County, available online at
http://www.dallasnews.com/sharedcontent/dws/news/city/collin/blotter/vitindex.html
GRDF and Geospatial Ontologies for Emergency Response: We have
developed a system called DAGIS which utilizes GRDF (our version of
geospatial RDF [1]) and associated geospatial ontologies. DAGIS is
a SOA-based system for geospatial data, and we use a SPARQL-like
query language to query the data sources. Furthermore, we have also
implemented access control for DAGIS. We are currently
investigating how some of the temporal concepts we have developed
for Police Blotter can be incorporated into GRDF and DAGIS [30].
We have also
presented this research to OGC (Open Geospatial Consortium)
[31]. In addition to this effort, we have also developed a
Geospatial Emergency Response system to detect chemical spills
using commercial products such as ArcGIS [7]. We will build on
our extensive experience on geospatial technology as well as our
systems DAGIS/GRDF, Policeblotter, Ontology Matching tools,
as well as the Geospatial emergency response system to develop
our SARO prototype for the military.
3.2 Social Networking for Fighting Against Bioterrorism
Our own model for analyzing various bioterrorism
scenarios is a hybridization of social interactions on a household
scale, situation intervention, and the simplicity of the SIR
approach [24]. The system arose out of a need for a deterministic
model that can balance a desire for accuracy in representing a
potential scenario with computational resources and time.
Recent work has suggested that more detailed models of social
networks have a diminished effect on the results of the spread of
an epidemic [24]. We believe we can generalize complex
interactions into a much more concise simulation without
adversely affecting accuracy. The ultimate goal of our research is
to integrate a model for biological warfare with a system that can
evaluate multiple attacks with respect to passive and active
defenses. As a result, we have created a simulation that serves
as an approximation of the impact of a biological attack with
speed in mind, allowing us to explore a large search space in a
relatively shorter amount of time as compared to existing
detailed models.
The base component of our model is the home unit. A home can
range in size from a single individual to a large household.
Within this unit, the probable states of the individuals are
tracked via a single vector of susceptibility, infection, and
recovery. Given a population distribution of a region and basic
statistical data, we can easily create a series of family units that
represent the basic social components from a rural community to
a major metropolitan area. A single home unit with no
interaction is essentially a basic representation of the SIR model.
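For reference, one discrete-time step of the basic SIR dynamics that a non-interacting home unit reduces to might look as follows; beta and gamma are hypothetical infection and recovery rates.

def sir_step(s, i, r, beta, gamma):
    # s, i, r: susceptible, infected and recovered fractions (summing to 1).
    new_infections = beta * s * i
    recoveries = gamma * i
    return s - new_infections, i + new_infections - recoveries, r + recoveries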
Interaction occurs within what we call social network theaters. A
theater is essentially any gathering area at which two or more
members of a home unit meet. The probability of interaction
depends on the type of location and the social interaction
possible at it. To capture this, separate infection rates are
assignable to each theater. In the event of a life-threatening
scenario such as a bioterrorist attack, we assume a civil authority
will act at some point to prevent a full-scale epidemic. We
model such an entity by providing means in our models to affect
social theaters and the probabilities associated with state
transitions. For simplicity at this point, we do not consider
resource constraints, nor do we model how an event is detected.
The recognition of an attack will be simulated using a variable
delay. After this delay has passed, the infection is officially
recognized.
The most basic form of prevention is inoculating the
population against an expected contagion. Several options exist
at this level, ranging from key personnel to entire cities. Anyone
inoculated is automatically considered recovered. Second, a
quarantine strategy can be used to isolate the infected population
from the susceptible population. This requires the explicit
removal of individuals from home units to appropriate facilities,
and can be simulated on a fractional basis representing
probability of removal with varying levels of accuracy. Third, the
infection and recovery rates can be altered, through such means
as allocating more resources to medical personnel and educating
the general public on means to avoid infection. Finally, a
potentially controversial but interesting option is the isolation of
communities by temporarily eliminating social gathering areas.
For example, public schools could be closed, or martial law
could be declared. The motivating factor is finding ways to force
the population at risk to remain at home. Such methods could
reduce the number of vectors over which an infection could
spread.
Based on the above model, we simulated various bioterrorism
scenarios. A powerful factor that we saw in several of our results
in the epidemiology models was the small-world phenomenon: the
average number of hops between nodes, even on the largest of
networks, tends to be very small. In several
cases, the infection spread to enough individuals within 4 days to
pose a serious threat that could not be easily contained. The
results from closing social theatres made this particularly clear,
as many closings beyond the third day did little to slow the
advance of many epidemics. However, not all intervention
methods are available in every country. It is important to
understand how local governmental powers, traditions and ethics
can impact the options available in a simulation. In some
countries, a government may be able to force citizens to be
vaccinated, while others may have no power at all and must rely
on the desire for protection to motivate action. In other
situations, closing any social theatre may be an explicit power of
the state, in contrast to governing entities that may have varying
abilities to do the same but will not consider it due to severe
social backlash. The impact on society must be
carefully considered beyond economical cost in any course of
action, and there is rarely a clear choice. These answers are
outside the scope of our work, and are better suited to political
and philosophical viewpoints. However, our model helps
governing bodies consider these efforts carefully in light of
public safety and the expenditure of available resources. A major
goal of our research is to provide a means to further our
understanding of how to provide a higher level of security against
malicious activities. This work is a culmination of years of
research into the application of social sciences to computer
science in the realm of modeling and simulation. With detailed
demographic data and knowledge of an impending biological
attack, this model provides the means to both anticipate the
impact on a population and potentially prevent a serious
epidemic. An emphasis on cost-benefit analysis of the results
could potentially save both lives and resources that can be
invested in further refining security for a vulnerable population.
3.3 Assured Information Sharing, Incentives, Risks
Our current research focuses extensively on incentive-based
information sharing, which is a major component of the SARO
system. In particular, we are working on building mechanisms to
give incentives to individuals and organizations for information
sharing. Once such mechanisms are built, we can use concepts from
the theory of contracts [22] to determine appropriate rewards such
as rankings or, in the case of certain foreign partners, monetary
benefits. Currently, we are exploring how to leverage secure
distributed audit logs to rank individual organizations among
trustworthy partners. To handle situations where it is not possible
to carry out auditing, we are developing game-theoretic strategies
for extracting information from the partners. The impact of
behavioral approaches to sharing is also currently being
considered. Finally, we are conducting studies based on economic
theories and integrating the relevant results into incentivized
assured information sharing as well as collaboration.
Auditing System. One motivation for sharing information and
behaving truthfully is the liability imposed on the responsible
partners if the appropriate information is not shared when needed
[32]. For example, an agency may be more willing to share
information if it is held liable. We are devising an auditing
system that securely logs all the queries and responses exchanged
between agencies. For example, Agency B’s query and the
summary of the result given by Agency A (e.g., number of
documents, document ids, their hash values) could be digitally
signed and stored by both agencies. We may also create logs for
subscriber-based information services to ensure that correct and
relevant information is pushed to intended users. Our
mechanism needs to be distributed, secure and efficient. Such an
audit system could be used as a basis for creating information
sharing incentives. First, using such a distributed audit system, it
may be possible to find out whether an agency is truthful or not
by conducting an audit using the audit logs and the documents
stored by the agency. Also, an agency may publish accurate
statistics about each user’s history using the audit system. For
example, Agency B could publish the number of queries it sent to
Agency A that resulted in positive responses, the quantity of
documents that are transmitted, and how useful those documents
were according to scoring metrics. The audit logs and aggregate
statistics could be used to set proper incentives for information
sharing. For example, agencies that are deemed to provide useful
information could be rewarded. At the same time, agencies that
do not provide useful information or withhold information could
be punished. An issue in audit systems is to determine how the
parties involved evaluate the signals produced by the audit. For
example, in public auditing systems, simplicity and transparency
are required for the audit to have necessary political support [2].
Since the required transparency could be provided mainly among
trusted partners, such an audit framework is suitable for
trustworthy coalition partners. Currently, we are exploring the
effect of various alternative parameter choices for our audit
system on the Nash equilibrium [26] in our information sharing
game.
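As a rough illustration of the kind of audit log described above, the following sketch hash-chains each query/response summary to the previous entry and signs it. HMAC with a per-agency key stands in for a true digital signature (e.g., RSA or ECDSA), and all field names are hypothetical.

import hashlib
import hmac
import json
import time

class AuditLog:
    def __init__(self, signing_key: bytes):
        self.key = signing_key
        self.entries = []
        self.prev_hash = b"\x00" * 32          # genesis link for the chain

    def append(self, query: str, doc_ids: list, doc_hashes: list):
        entry = {
            "ts": time.time(),
            "query": query,
            "num_docs": len(doc_ids),
            "doc_ids": doc_ids,
            "doc_hashes": doc_hashes,
            "prev": self.prev_hash.hex(),       # ties this entry to the last one
        }
        payload = json.dumps(entry, sort_keys=True).encode()
        entry["sig"] = hmac.new(self.key, payload, hashlib.sha256).hexdigest()
        self.prev_hash = hashlib.sha256(payload).digest()
        self.entries.append(entry)

# Agency A logs the summary of a response to Agency B's query.
log = AuditLog(signing_key=b"agency-A-secret")
log.append("insurgent activity near airfield",
           ["doc-17", "doc-42"],
           [hashlib.sha256(b"doc-17 body").hexdigest(),
            hashlib.sha256(b"doc-42 body").hexdigest()])

Because each entry embeds the hash of its predecessor, tampering with any stored record invalidates every later signature, which is the property an auditor would check.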
Other research challenges in incentivizing information sharing
are also currently being explored. First, in some cases, due to
legal restrictions or security concerns, an agency may not
be able to satisfactorily answer the required query. This implies
that our audit mechanisms, rating systems and incentive set up
should consider the existing security policies. Second, we need to
ensure that subjective statistics such as ratings cannot be
used to game the incentive system. That is, we need to
ensure that partners do not have incentives to falsify rating
information. For example, to get better rankings, agencies may
try to collude and provide fake ratings. To detect such a situation
and discourage collusion, we are working on various social
analysis techniques. In addition, we will develop tools to analyze
the underlying distributed audit logs securely by leveraging our
previous work on accountable secure multi-party computation
protocols [18].
Behavioral Aspects of Assured Incentivized Information
Sharing. A risk in reducing the complex real-world issues of
information sharing to formal analysis is making unrealistic
assumptions. By drawing on insights from psychology and
related complementary decision sciences, we are considering a
wider range of behavioral hypotheses. The system we are
building seeks to integrate numerous sources of information and
provide a variety of quantitative output to help monitor the
system’s performance, most importantly, sending negative alerts
when the probability that information is being misused rises
above preset thresholds. The quality of the system’s overall
performance will ultimately depend on how human beings wind
up using it.
The field of behavioral economics emerged in recent decades,
borrowing from psychology to build models with more empirical
realism underlying fundamental assumptions about the way in
which decision makers arrive at inferences and take actions. For
example, Nobel Prize winner Kahneman’s work focused
primarily on describing how actual human behavior deviates
from how it is typically described in economics textbooks. The
emerging field of normative behavioral economics now focuses
on how insights from psychology can be used to better design
institutions [3]. A case in point is the design of incentive
mechanisms that motivate a network of users to properly share
information. One way in which psychology can systematically
change the shape of the utility function in an information-sharing
context concerns relative outcomes, interpersonal comparisons,
and status considerations in economic decision making. For
example, is it irrational or uneconomical to prefer a payoff of 60
among a group where the average payoff is 40, over a payoff of
80 in a group context where the average among others is over
100? While nothing in economic theory ever ruled these sorts of
preferences out, their inclusion in formal economic analysis was
quite rare until recent years.
We are trying to augment the formal analysis of the incentivized
information sharing component of our work with a wider
consideration of motivations, including interpersonal
comparisons, as factors that systematically shape behavioral
outcomes and, consequently, the performance of information-
sharing systems. Another theme to emerge in behavioral
economics is the importance of simplicity [14] and the paradox of
too much choice. According to psychologists and evolutionary
biologists, simplicity is often a benefit worth pursuing in its own
right, paying in terms of improved prediction, faster decision
times, and higher satisfaction of users. We are considering a
wide range of information configurations that examine
environments in which more information is helpful, and those in
which less is more. With intense concern for the interface
between real-world human decision makers and the systems, we
will provide practical hints derived from theory to be deployed by
the design team in the field.
4. OUR APPROACH TO BUILDING A
SARO SYSTEM
Our approach is described in the following eight subsections. In
particular, we will describe SAROL and our approach to
designing and developing SAROL through a temporal geosocial
semantic web.
4.1 Overview
In our MURI project, together with the team, we are developing
what is called AISL (assured information sharing lifecycle).
AISL focuses on policy enforcement as well as incentive-based
information sharing, with semantic web technology as the glue.
AISL's goal is to implement the DoD Information Sharing
Strategy set forth by Hon. John Grimes [15]. However, AISL
neither captures geospatial information nor tracks social
networks based on location (note that the main focus of AISL is
to enforce confidentiality for information sharing).
Our design of the SARO system is influenced by our work for the
MURI project with our partners. Our goal is to develop
technologies that will capture not only information but also
social/political relationships, map the individuals in the network
to their locations, reason about the relationships and information
and determine how the nuggets can be used by the commander
for stabilization and reconstruction. In doing so, the commander
has to also determine potential conflicts, terrorist activities as
well as any operation that could hinder or suspend stabilization
and reconstruction. In order to reason about the relationships as
well as to map the individuals of a network to locations, we will
utilize the extensive research we have carried out both on social
network models (section 2.3) and geospatial semantic web and
integration technologies (section 2.2). The first task in this
project will be to develop scenarios based on use cases that are
discussed in a study carried out for the Army Science and
Technology [6] as well as by interviewing experts (e.g. Chait et
al.). Furthermore, in her work Dr. Karen Guttieri states that
Human Terrain is a crucial aspect and that we need hyperlinks to
People, Places, Things and Events to answer questions such as:
Which people are where? Where are their centers and
boundaries? Who are their leaders? Who is who in the zoo? What
are their issues and needs? What is the news and reporting?
Essentially, the human-domain associations build relationships
among the who, what, where, when and why (5W) [17].
We are designing what we call SAROL (Stabilization and
Reconstruction Operations Lifecycle). SAROL will consist of
multiple phases and will discover relationships and information,
model and integrate the relationships and information, as well as
exploit the relationships and information for decision support.
The system that implements SAROL will utilize geosocial
semantic web technologies, a novel semantic web that we will
develop. The basic infrastructure that will glue together the
various phases of SAROL will be based on the SOA paradigm.
We will utilize the technologies that we have developed in social
networking and geospatial semantic web to develop temporal
geosocial semantic web technologies for SAROL.
It should be noted that our initial design focuses on the basic
concepts for SAROL and will involve the development of
TGS-SW. This will include capturing the social relationships
and mapping them to the geolocations. In our advanced design
we will include some advanced techniques such as knowledge
discovery, and risk based trust management for the information
and relationships.
4.2 Scenario for a SARO
In the study carried out for the Army, the authors discussed
their surveys and elaborated on the use cases and
scenarios they developed [6]. For example, at a high level, a
“City Rebuild” use case is the following: “A brigade moves into a
section of a city and is totally responsible for all S&R operations
in that area, which includes the reconstruction of an airfield in its
area of responsibility.” They further state that to move from A to
B in a foreign country the commander should consider various
aspects including the following: “Augmenting its convoy security
with local security forces (political and military considerations)”
and “Avoid emphasizing a military presence near political or
religious sites (political and social considerations).” The authors
then go on to explain how a general situation can then be
elaborated for a particular situation, say in Iraq. They also discuss
the actions to be taken and provide the results of their analysis.
This work is influencing our development of scenarios and use
cases. The use case analysis will then guide us in the
development of SAROL.
4.3 SARO Lifecycle (SAROL)
SAROL consists
of three major
Relationship/
phases shown in
Information
Figure 3-1: (1)
Discovery
information and
relationship
discovery and
acquisition, (2)
information and
Relationship/
Information
relationship
Exploitation
Relationship/ modeling and
integration and
Information
Modeling/
(3) information
Integration
and relationship
exploitation.
Figure 3-1. SAROL
During the discovery and acquisition phase, commanders and key
people will discover the information and relationships, based on
those advertised as well as those obtained through inference.
During the modeling and integration phase, the information and
the relationships have to be modeled, additional information and
relationships inferred, and the information and relationships
integrated. During the exploitation phase, the commanders and
those with authority will exploit the information, make decisions
and take effective actions.
SAROL is highly dynamic, as relationships and information
change over time, and it can react rapidly to situations. The above
three phases are executed multiple times by several processes.
For example, during a follow-on cycle, new relationships and
information, say political in nature, could be discovered,
modeled, integrated and exploited.
Figure 3-3. Temporal Geosocial Semantic Web Technologies (a
layered stack: URI and Unicode; TML+GML+SNML;
TRDF+GRDF+SNRDF; TGS-SW ontologies in OWL; SPARQL
for TGS-SW)
Figure 3-2
illustrates the various modules that will implement SAROL. The
glue consists of the temporal geosocial service oriented
architecture (TGS-SOA) that supports web services and utilizes
temporal geosocial semantic web technologies (TGS-SW). The
high level web services include social networking, geospatial
information management, incentive management and federation
management.
4.4 TGS-SOA
Our architecture is based on services, and we will design and
prototype a TGS-SOA for managing SAROL. Our TGS-SOA
will utilize temporal geosocial semantic web (TGS-SW)
technologies. Through this infrastructure information and
relationships will be made visible, accessible and
understandable to any authorized DoD commanders, allied
forces and external partners. As discussed by Karen Guttieri, it
is important that the partners have a common operational picture,
common information picture and common relationship picture.
Based on the use-case scenarios we will capture the various
relationships, extract additional relationships and also locate the
individuals in the various networks. This will involve designing
the various services such as geospatial integration services and
social network management services as illustrated in Figure 3-2.
The glue that connects these services is based on the SOA
(service oriented architecture) paradigm. However such an
architecture should support temporal geosocial relationships. We
call such an architecture a Temporal Geosocial SOA (TGS-SOA).
It is important that our system captures the semantics of
information and relationships. Therefore we are developing
semantic web technologies for representing, managing and
deducing temporal geosocial relationships. Our current work in
geospatial information management and social networks is
exploring the use of FOAF, GRDF and SNRDF. Therefore we
will incorporate the temporal element into these representations
and subsequently develop appropriate representation schemes
based on RDF and OWL. We call such a semantic web the
Temporal Geosocial Semantic Web (TGS-SW) (Figure 3-3). We
are using commercial tools and standards as much as possible in
our work including Web Services Description Language (WSDL)
and SPARQL. Capturing temporal relationships and information,
e.g., evolving spatial relationships and changes to geospatial
information, is key for our system.
Temporal Social Networking Models
Temporal social networks model, represent and reason about
social networks that evolve over time. Note that in countries like
Iraq and Afghanistan the social and political relationships may be
continually changing due to security, power and jobs; the three
ingredients for a successful SARO. Therefore it is important to
capture the evolution of the relationships. In the first phase, our
design and development of temporal social networks will focus
on two major topics, namely, semantic modeling of temporal
social networks and fundamental social network analysis. For
semantic modeling of temporal social networks, we will extend
the existing semantic web social networking technologies such as
Friend-of-a-Friend (FOAF) ontology (FOAF09) to include
various important aspects such as relationship history that is not
represented in current social network ontologies. For example,
we will include features to model the strength and trust of
relationships among individuals based on the frequency and the
context of the relationship. In addition, we will include features
to model relationship history (e.g., when the relationship has
started and how the relationship has evolved over time) and the
relationship roles (e.g., in the leader/follower relationship, there
is one individual playing the role of the leader and one individual
playing the role of the follower). In essence, by this modeling we
intend to create an advanced version of a social network such as
Facebook, specifically designed for SARO objectives. Note that
there are XML based languages for representing social network
data (SNML) and temporal data (TML). However we need a
semantic language based on RDF and OWL for representing
semantic relationships between the various individuals. We are
exploring RDF for representing social relationships (SNRDF and
extended FOAF). Representing temporal relationships (When) is
an area that needs investigation for RDF and OWL based social
networks.
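A minimal sketch of this extended-FOAF idea, using the rdflib library, is given below. The EX namespace and its property names (source, target, role, strength, since) are hypothetical placeholders for the SNRDF/extended-FOAF vocabulary under development, not an existing ontology.

from rdflib import Graph, Namespace, Literal, BNode, RDF
from rdflib.namespace import FOAF, XSD

EX = Namespace("http://example.org/tgs-sw#")   # hypothetical TGS-SW vocabulary
g = Graph()
g.bind("foaf", FOAF)
g.bind("ex", EX)

alice, bob = EX.alice, EX.bob
g.add((alice, RDF.type, FOAF.Person))
g.add((bob, RDF.type, FOAF.Person))

rel = BNode()                                  # reify the relationship itself
g.add((rel, RDF.type, EX.Relationship))
g.add((rel, EX.source, alice))
g.add((rel, EX.target, bob))
g.add((rel, EX.role, Literal("leader/follower")))
g.add((rel, EX.strength, Literal(0.8, datatype=XSD.double)))
g.add((rel, EX.since, Literal("2007-06-01", datatype=XSD.date)))

print(g.serialize(format="turtle"))

Reifying the relationship as a node of its own is what allows history, strength, and role attributes to be attached, which a bare foaf:knows triple cannot express.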
We are using social network analysis to identify important
properties of the underlying network and to address some of the 5W
(who, what, when, where and why) and 1H (how) queries. To
address queries for determining who to communicate with, we
plan to use various centrality measures such as degree centrality
and betweenness centrality [24] to measure the importance of
certain individuals in a given social network. Based on such
measures that are developed for social network analysis, we can
test which of the centrality measures could be more appropriate
in finding influential individuals in social networks. In particular,
we will test these measures on available social networks such as
Facebook (it is possible to download information about
individuals in your own network on Facebook). For example, if a
centrality measure is a good indicator, then we may expect
individuals with high centrality values to have more posts on their
Facebook Walls or to be tagged in more pictures.
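The following sketch illustrates this kind of centrality test on a synthetic stand-in network (we cannot reproduce Facebook data here); the graph generator and the cut-off of five are illustrative choices.

import networkx as nx

g = nx.barabasi_albert_graph(200, 3, seed=1)   # synthetic stand-in social network
deg = nx.degree_centrality(g)
btw = nx.betweenness_centrality(g)

print("highest degree centrality:     ", sorted(deg, key=deg.get, reverse=True)[:5])
print("highest betweenness centrality:", sorted(btw, key=btw.get, reverse=True)[:5])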
To answer the queries for determining what information is
needed, we are examining the use of relational naïve Bayes
models to predict which attributes of an individual are more
reliable indicators of their friendliness to the military.
Since we do not have open access to such military data, we are
testing our relational naïve Bayes models on Facebook data,
predicting the attributes that are the more important indicators of
an individual's political affiliation. To address queries for
determining when to approach them, we are using various
domain knowledge rules to first address when not to approach
them. For example, in Iraq, it may not be a good idea to approach
Muslim individuals during Friday prayers. Later on, we will try to
build knowledge discovery models to predict the best times for
approaching certain individuals based on their profile features
(e.g., their religion, social affiliation, etc.). In addition, we
plan to use similar knowledge discovery models to answer the
queries for understanding how those individuals may interact
with the military personnel. In order to test our knowledge
discovery models, we are analyzing various e-mail logs of our
group members to see whether our model could predict best
times to send an e-mail to an individual so as to get the shortest
response time and a positive answer. To address queries for
determining why certain individuals’ support is vital, we are
examining community structure mining techniques especially
cluster analysis to see which group a certain individual belongs
to and how the homophily between individuals in the group
affects the link structures. Again, Facebook data could be used to
test some of these hypotheses.
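As a concrete illustration of the naive Bayes idea (the relational extensions are omitted), the sketch below predicts a label from categorical profile features with simple Laplace smoothing; all training rows are fabricated placeholders, not real data.

from collections import Counter, defaultdict

rows = [  # (profile features, label) -- illustrative only
    ({"religion": "A", "group": "students"}, "friendly"),
    ({"religion": "A", "group": "merchants"}, "friendly"),
    ({"religion": "B", "group": "militia"}, "hostile"),
    ({"religion": "B", "group": "students"}, "friendly"),
    ({"religion": "B", "group": "militia"}, "hostile"),
]

label_counts = Counter(label for _, label in rows)
feat_counts = defaultdict(Counter)            # (feature, label) -> value counts
for feats, label in rows:
    for f, v in feats.items():
        feat_counts[(f, label)][v] += 1

def predict(feats):
    def score(label):
        p = label_counts[label] / len(rows)   # prior
        for f, v in feats.items():
            c = feat_counts[(f, label)]
            # likelihood with simple (approximate) Laplace smoothing
            p *= (c[v] + 1) / (sum(c.values()) + len(c) + 1)
        return p
    return max(label_counts, key=score)

print(predict({"religion": "B", "group": "students"}))   # -> "friendly"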
4.5 Temporal Geospatial Information Management
As part of our MURI research we are developing novel
techniques for information quality management and validation,
information search and integration, and information discovery
and analysis that make “information a force multiplier through
sharing.” Our objective is to get the right information at the right
time to the decision maker so that he/she can make the right
decisions to support the SARO in the midst of uncertain and
unanticipated events. Note that while our MURI project focuses
mainly on information sharing of structured and text data, the
work discussed in this paper focuses on managing and sharing
temporal geosocial information. The system architecture is
illustrated in Figure 3-4; it shows how the various geospatial
information management components are integrated with the
social networking components.
Figure 3-4. System Architecture for Geosocial Information
Management
Information management and search. Our system, like most
other current data and information systems, collects and stores
huge amounts of data and information. Such a system is usually
distributed across multiple locations and is heterogeneous in
data/storage structures and contents. It is crucial to organize data
and information systematically and build infrastructures for
efficient, intelligent, secure and reliable access, while
maintaining the benefits of autonomy and flexibility of
distribution. Although such a system is built on top of
database system and Web information system technology, it is
still necessary to investigate several crucial issues to ensure that
our system has high scalability, efficiency, fast or real-time
response, and high quality and relevance of the answers in
response to users’ queries and requests, as well as high reliability
and security. We are exploring ways to develop, test, and refine
new methods for effective and reliable information management
and search. In particular, the following issues are being
explored: (1) data and information indexing, clustering, and
organizing in a structured way to facilitate not only efficient but also
trustable search and analysis; (2) relevance analysis to ensure that
highly relevant, ranked answers are returned; and (3)
aggregation and summary over a time window. For example, a
user may wish to ask for information involving weapons and
insurgents at a particular location over a particular time frame.
Our design and prototype of the police blotter system is
being leveraged for the SARO system.
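A minimal sketch of such windowed aggregation follows, using pandas; the column names and the weekly window are illustrative assumptions, not the police-blotter schema.

import pandas as pd

reports = pd.DataFrame({
    "when": pd.to_datetime(["2009-03-02", "2009-03-04", "2009-03-12"]),
    "where": ["district-9", "district-9", "district-9"],
    "what": ["weapons cache", "insurgent sighting", "weapons cache"],
})

# Count incident reports per week, per location.
weekly = (reports.set_index("when")
                 .groupby([pd.Grouper(freq="W"), "where"])["what"]
                 .count())
print(weekly)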
Information Integration.
We are examining scalable
integration techniques for handling heterogeneous geosocial data
utilizing our techniques on ontology matching and aligning.
Moreover, to ensure data/information from multiple
heterogeneous sources can be integrated smoothly, we are
exploring data/information conversion and transformation rules,
and identifying redundancy and inconsistency. In our recent research on
geospatial information integration, we have developed
knowledge discovery methods that resolve semantic disparities
among distinct ontologies by considering instance alignment
techniques [21]. Each ontological concept is associated with a set
of instances, and using these, one concept from each ontology is
compared for similarity. We examine the instance values of each
concept and apply a widely popular matching strategy utilizing
N-grams present in the data. However, this method often fails
because it relies on shared syntactical data to determine semantic
similarity. Our approach resolves these issues by leveraging
K-medoid clustering and a semantic distance measure applied to
distinct keywords gleaned from the instances, resulting in
distinct semantic clusters. We claim that our algorithm
outperforms N-gram matching over large classes of data. We have
justified this with a series of experimental results that
demonstrate the efficacy of our algorithm on highly variable data.
We are exploiting not only instance matching but also name and
structural matching in our project for geosocial data that evolves
over time.
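A simplified sketch of the instance-alignment idea follows: keywords gleaned from two concepts' instances are clustered with a plain k-medoids loop under a stand-in distance. difflib string similarity substitutes here for the semantic distance measure of [21], which is not reproduced.

import difflib
import random

def dist(a, b):
    # Stand-in "semantic" distance: 1 minus string similarity.
    return 1.0 - difflib.SequenceMatcher(None, a, b).ratio()

def k_medoids(items, k, iters=20, seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(items, k)
    for _ in range(iters):
        clusters = {m: [] for m in medoids}
        for x in items:                         # assign each item to nearest medoid
            clusters[min(medoids, key=lambda m: dist(x, m))].append(x)
        medoids = [min(c, key=lambda x: sum(dist(x, y) for y in c))
                   for c in clusters.values() if c]   # re-pick best medoid
    return medoids

# Keywords drawn from the instances of two ontology concepts (illustrative).
concept_a = ["river", "stream", "creek", "road", "highway", "street"]
concept_b = ["waterway", "riverbed", "roadway", "freeway"]
print(k_medoids(concept_a + concept_b, k=2))

Comparing the resulting cluster medoids across the two concepts gives a similarity signal even when the instances share little surface syntax, which is the weakness of pure N-gram matching noted above.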
Information Analysis and knowledge discovery. The tools we
have developed for information analysis and knowledge
discovery are being exploited for the SARO system [21] for this
project. We are exploring research in the following directions
which extends our prior research. First, information
warehousing, mining and knowledge discovery are being applied
to distributed information sources to extract summary/aggregate
information, as well as the frequency, discrimination, and
correlation measures, in multidimensional space for items and
item sets. Note that the 5Ws and 1H (who, what, why, when,
where and how) are important for SARO. For example, "When to
approach them" might be used where a need exists to meet with
road construction crews in an area that has experienced sporadic
ethnic violence mostly during specific times in the day. For
example, "find all times in a given day along major roads in
Baghdad less than 10 miles from Diyala Governorate where
violent activity has not occurred in at least 2 years." We are
extending our geospatial and social network analysis techniques
to address such questions.
Our work is focusing on the following aspects: (1) scalable
algorithms that can handle large volumes of geosocial
information (information management and search) to facilitate
who, where, when issues; (2) information integration analysis
algorithms to address where and who issues; and (3) knowledge
discovery techniques that can address why and how issues.
4.6 Temporal Geosocial Semantic Web
We are integrating our work on modeling and reasoning about
social networks with our work on geospatial information. First,
we are utilizing concepts from SNML (social network
markup language), GML (geospatial markup language) and TML
(temporal markup language). However, to capture the semantics
and make meaningful inferences, we need something beyond
syntactic representation. As we have stated, we have developed
GRDF (Geospatial RDF) that basically integrates RDF and
GML. One approach is to integrate SNRDF (Social network RDF
that we are examining) with GRDF and also incorporate the
temporal element. Another option is to make extensions to FOAF
to represent temporal geosocial information.
Next we are integrating the geosocial information across multiple
sites so that the commanders and allied forces as well as partners
in the local communities can form a common picture. We have
developed tools for integrating social relationships as well as
geospatial data using ontology matching and alignment.
However, we are exploring ways of extending our tools to handle
possibly heterogeneous geosocial data in databases across
multiple sites.
We are also exploring appropriate query languages to query
geosocial data. There are query languages such as SPARQL and
RQL being developed for RDF databases. We have adapted
SPARQL to query geospatial databases. We are also exploring
query languages for social network data. Query languages for
temporal databases have been developed. Therefore our
challenge in this subtask is to determine the constructs that are
needed to extend languages like SPARQL to query geosocial data
across multiple sites.
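The sketch below shows the flavor of such a query in standard SPARQL executed through rdflib, reusing the hypothetical EX vocabulary from the earlier sketch. The spatial test is reduced to simple coordinate bounds precisely because standard SPARQL has no native spatial or social operators; those missing constructs are what the proposed extensions would add.

from rdflib import Graph, Namespace, Literal
from rdflib.namespace import XSD

EX = Namespace("http://example.org/tgs-sw#")   # hypothetical vocabulary
g = Graph()
g.add((EX.alice, EX.latitude, Literal(33.3, datatype=XSD.double)))
g.add((EX.alice, EX.longitude, Literal(44.4, datatype=XSD.double)))
g.add((EX.alice, EX.observedOn, Literal("2009-03-01", datatype=XSD.date)))

q = """
PREFIX ex: <http://example.org/tgs-sw#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?person WHERE {
  ?person ex:latitude ?lat ; ex:longitude ?lon ; ex:observedOn ?d .
  FILTER (?lat > 33.0 && ?lat < 34.0 && ?lon > 44.0 && ?lon < 45.0
          && ?d >= "2009-01-01"^^xsd:date)
}
"""
for row in g.query(q):
    print(row.person)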
It is also crucial to reason about the information and relationships
to extract useful nuggets. Here we are developing ontologies for
temporal geospatial information. We have developed ontologies
and reasoning tools for geospatial data and social network
relationships based on OWL and OWL-S. We are incorporating
social relationships and temporal data and reason about the data
to uncover new information. For example, if some event occurs at
a particular time at a particular location for 2 consecutive days in
a row and involves the same group of people, then it will like
occur on the third day at the same time and same location
involving the same group of people. We can go on to explore why
such an event has occurred. Perhaps this group of people belong
to a cult and have to carry out activities in such a manner.
Therefore we are developing reasoning tools for geosocial data
and relationships.
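A toy version of this recurrence rule, over fabricated event tuples, is sketched below; a deployed reasoner would operate over the RDF/OWL representations discussed above rather than Python tuples.

from datetime import date, timedelta

# Each observed event: (day, hour, location, group) -- illustrative data.
events = {
    (date(2009, 4, 1), 9, "market", "group-X"),
    (date(2009, 4, 2), 9, "market", "group-X"),
}

def predict_next(events):
    # If the same (hour, location, group) occurred yesterday too,
    # predict the same observation for tomorrow.
    out = set()
    for d, hour, loc, group in events:
        if (d - timedelta(days=1), hour, loc, group) in events:
            out.add((d + timedelta(days=1), hour, loc, group))
    return out

print(predict_next(events))   # predicts (2009-04-03, 9, 'market', 'group-X')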
5. CONCLUSION
This paper has described the challenges for developing a system
for SARO. In particular, we are designing a temporal geospatial
social semantic web that can be utilized by military personnel,
decision makers, and local/government personnel to reconstruct
after a major combat operation. We essentially develop a
lifecycle for SARO. We will utilize the technologies we have
developed including geospatial semantic web and social network
system as well as build new technologies to develop SAROL. We
believe that this is the first attempt at building such a system for
SARO.
There are several areas that need to be included in our research.
One is security and privacy. We need to develop appropriate
policies for SAROL. These policies may include confidentiality
policies, privacy policies and trust policies. Only certain
geospatial as well as social relationships may be visible to certain
parties. Furthermore, the privacy of the individuals involved has
to be protected. Different parties may place different levels of
trust on each other. We believe that building a TGS-SW is a
challenge. Furthermore, incorporating security will make it even
more complex. Nevertheless, security has to be considered at the
beginning of the design and not as an afterthought. Our future
work will also include handling dynamic situations where some
parties may be trustworthy at one time and may be less
trustworthy at another time. We believe that the approach we
have stated is just the beginning to building a successful system
for SARO.
6. ACKNOWLEDGEMENTS
This paper is based on the keynote presentation given by the
authors at PAISI (Pacific Asia Intelligence and Security
Informatics) on April 27, 2009. The research is supported by
Texas Enterprise Funds. The authors are grateful to the AFOSR
MURI Project as well as the IARPA KDD Project for influencing
the research described in this paper. We thank our student Pankil
Doshi for reviewing this paper.
7. REFERENCES
[1] A. Alam et al., 2008. GRDF and Secure GRDF. IEEE ICDE Workshop on Secure Semantic Web, Cancun, Mexico, April 2008. (A version is also to appear in Computer Standards and Interfaces Journal.)
[2] N. Berg, 2006. A Simple Bayesian Procedure for Sample Size Determination in an Audit of Property Value Appraisals. Real Estate Economics 34(1), 133-155.
[3] N. Berg, 2003. Normative Behavioral Economics. Journal of Socio-Economics 32, 411-423.
[4] H. Binnendijk and S. Johnson, 2004. Transformation for Stabilization and Reconstruction Operations. Center for Technology and National Security Policy, National Defense University Press.
[5] R. Chait, A. Sciarretta, and D. Shorts, October 2006. Army Science and Technology Analysis for Stabilization and Reconstruction Operations. Center for Technology and National Security Policy, NDU Press.
[6] R. Chait, A. Sciarretta, J. Lyons, C. Barry, D. Shorts, and D. Long, September 2007. A Further Look at Technologies and Capabilities for Stabilization and Reconstruction Operations. Center for Technology and National Security Policy, NDU Press.
[7] P. Chittumala et al., November 2007. Emergency Response System to Handle Chemical Spills. IEEE Internet Computing.
[8] B. T. Dai, N. Koudas, D. Srivastava, A. K. H. Tung, and S. Venkatasubramanian, 2008. Validating Multi-column Schema Matchings by Type. 24th International Conference on Data Engineering (ICDE), 120-129.
[9] A. Doan and A. Halevy, Spring 2005. Semantic Integration Research in the Database Community: A Brief Survey. AI Magazine, Special Issue on Semantic Integration.
[10] A. Doan, J. Madhavan, P. Domingos, and A. Y. Halevy, 2004. Ontology Matching: A Machine Learning Approach. Handbook on Ontologies, 385-404.
[11] E. Rahm and P. A. Bernstein, 2001. A Survey of Approaches to Automatic Schema Matching. The VLDB Journal 10(4), 334-350.
[12] M. Ehrig, S. Staab, and Y. Sure, 2005. Bootstrapping Ontology Alignment Methods with APFEL. In Y. Gil, E. Motta, V. R. Benjamins, and M. A. Musen (eds.), Proceedings of the 4th International Semantic Web Conference (ISWC 2005), Galway, Ireland, November 6-10, 2005, LNCS 3729, 186-200. Springer.
[13] GAO Report, October 2007. Stabilization and Reconstruction: Actions Needed to Improve Governmentwide Planning and Capabilities for Future Operations. Statement of Joseph A. Christoff, Director, International Affairs and Trade, and Janet A. St. Laurent, Director, Defense Capabilities and Management.
[14] G. Gigerenzer and P. M. Todd, 1999. Simple Heuristics That Make Us Smart. Oxford University Press.
[15] J. Grimes, 2007. DoD Information Sharing Strategy.
[16] K. Guttieri, September 2007. Integrated Education and Training Workshop. Peacekeeping and Stability Operations Institute, Naval Postgraduate School.
[17] K. Guttieri, February 2007. Stability, Security, Transition and Reconstruction: Transformation for Peace. Quarterly Meeting of Transformation Chairs, Naval Postgraduate School.
[18] W. Jiang, C. Clifton, and M. Kantarcioglu, 2007. Transforming Semi-Honest Protocols to Ensure Accountability. Data and Knowledge Engineering (DKE), Elsevier.
[19] Y. Kalfoglou and M. Schorlemmer, October 2003. IF-Map: An Ontology Mapping Method Based on Information Flow Theory. Journal on Data Semantics 1(1), 98-127.
[20] L. Khan, D. McLeod, and E. H. Hovy, 2004. Retrieval Effectiveness of an Ontology-Based Model for Information Selection. The VLDB Journal 13(1), 71-85.
[21] L. Khan et al., May 2007. Geospatial Data Mining for National Security. Proceedings, Intelligence and Security Conference, New Brunswick, NJ.
[22] J. Laffont and D. Martimort, 2001. The Theory of Incentives: The Principal-Agent Model. Princeton University Press.
[23] J. Madhavan, 2005. Corpus-based Schema Matching. ICDE 2005.
[24] M. E. J. Newman, 2003. The Structure and Function of Complex Networks. arXiv.
[25] N. F. Noy and M. A. Musen, 2003. The PROMPT Suite: Interactive Tools for Ontology Merging and Mapping. International Journal of Human-Computer Studies 59(6), 983-1024.
[26] M. Osborne and A. Rubinstein, 1999. A Course in Game Theory. MIT Press.
[27] E. Rahm and P. A. Bernstein, 2001. A Survey of Approaches to Automatic Schema Matching. The VLDB Journal 10(4), 334-350.
[28] Santos, 2007. Computational Modeling: Culturally-Infused Social Networks. Office of Naval Research (ONR) Workshop on Social, Cultural, and Computational Science, and the DoD Workshop on New Mission Areas: Peace Operations, Security, Stabilization, and Reconstruction, Arlington, VA.
[29] G. Stumme and A. Mädche, 2001. FCA-Merge: Bottom-Up Merging of Ontologies. 17th Intl. Joint Conf. on Artificial Intelligence (IJCAI '01), 225-230, Seattle, WA.
[30] G. Subbiah. DAGIS System. MS Thesis, The University of Texas at Dallas.
[31] B. Thuraisingham, October 2006. Geospatial Data Management Research at the University of Texas at Dallas. Presented at the OGC meeting for university members and OGC Interoperability Day, Tysons Corner, VA.
[32] B. Thuraisingham, 2008. Assured Information Sharing. Book chapter in Intelligence and Security Informatics (C. Yang, ed.), Springer.
On the Efficacy of Data Mining for Security Applications
Ted E. Senator
SAIC1
3811 N. Fairfax Drive, Suite 850
Arlington, VA 22203, USA
1-703-469-3422
[email protected]
ABSTRACT
Data mining applications for security have been proposed,
developed, used, and criticized frequently in the recent past. This
paper examines several of the more common criticisms and
analyzes some factors that bear on whether the criticisms are valid
and/or can be overcome by appropriate design and use of the data
mining application.
Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications – data
mining, I.5.2 [Pattern Recognition]: Design Methodology –
classifier design and evaluation, feature evaluation and selection,
pattern analysis.
I.6.3 [Simulation and Modeling]:
Applications. J.7.2 [Computers in Other Systems] K.4.1
[Computers and Society]: Public Policy Issues – human safety,
privacy, use/abuse of power.
General Terms
Management, Measurement, Design, Economics, Security, Legal
Aspects.
Keywords
Data mining, security, applications, pattern matching.
1. INTRODUCTION
Many data mining applications for solving security problems have
been proposed and designed in the recent past, especially since
September 11, 2001. [12] Some of these applications have been
developed and deployed and are in use today. Others have been
cancelled before they were deployed. [16] The reasons for
cancellation have typically been because of concerns over
effectiveness and/or concerns over societal impact, particularly
with respect to privacy and civil liberties. Sometimes, the
concerns have been expressed as a tradeoff between the benefits
in terms of the amount of protection afforded and the costs in
terms of the societal impact of using the system. Some critics of
the use of data mining for security applications have
simultaneously criticized systems both for being ineffective and
for threatening civil liberties. Often – especially in the political
and policy communities – the discussion of these issues is based
on less than thorough analyses of the way these systems actually
operate or the way they could operate if they were designed
effectively.
This paper analyzes several of the criticisms directed at the
effectiveness of various data mining applications for security by
(1) clearly defining the distinct activities that are performed as
part of data mining projects for security applications and (2)
analyzing the criticisms in the context of these activities, while
proposing alternative designs to those assumed by the critics. This
paper does not address the vital issues of privacy and civil
liberties that are raised by these systems, not because these issues
are not important – they most certainly are fundamental
considerations in the design and adoption of any system involving
data mining for security – but simply because that is a separate
topic that requires far more thorough analyses and discussion than
has been addressed in the research and analysis reported here.
The main purpose of this paper is not to argue for any particular
position with respect to the societal costs and benefits of using
data mining for security applications; rather, it is to suggest ideas
that would be part of a more thorough and principled framework
within which to understand the inherent design issues, impacts,
tradeoffs, and possibilities, in the hope that such a framework and
understanding can be used to support rational and informed
societal choices leading to effective security systems that respect
privacy and civil liberties. This paper is offered in the spirit of [2]
– to contribute to informed public debates and sound policy
making that provide appropriate security and maintain civil
liberties informed by careful analyses of alternatives and
possibilities. It is hoped that the discussion and analysis in this
paper will provide more of a mutual understanding between the
technical and policy communities, at least in terms of the ability
to communicate, discuss, and debate issues with a common
understanding of what different terms mean and alternative
solutions may imply.
The paper is based on actual and proposed data mining
applications in the U.S. with which the author is familiar;
however, the general ideas discussed should apply equally well to
non-U.S. applications. What may differ across countries is not the
scientific and engineering principles on which such systems are
based, but rather the values on which societal judgments are based
and the corresponding legal, political, regulatory, and policy
environments in which decisions are made regarding the benefits
and costs of such systems.
1 Affiliation is provided solely for identification purposes. The views and opinions expressed herein are not necessarily those of SAIC, any of its clients, or of any other organization with which the author has been or is affiliated. This work has been prepared independently of the author’s duties and responsibilities as an employee of SAIC.
The paper begins with a review of various definitions of data
mining. It identifies several distinct but related activities that fall
within these definitions. Next, it defines a model of data mining
systems for security applications. This model forms the basis for
the analysis that follows. The model is used to analyze several
criticisms that have been levied at security applications of data
mining. The paper then discusses metrics for the evaluation of
security applications of data mining and finally concludes.
2. WHAT IS A SECURITY APPLICATION
OF DATA MINING?
This section discusses what is meant by an application of data
mining for security. As will be seen from the various examples
that are cited, there is often much confusion about this very issue,
and this confusion is a large contributor to misunderstandings
about the effectiveness of data mining applications. The section
begins with a comparison of definitions of data mining as used in
the technical and the political/policy communities. It continues
with a discussion of what data miners do and suggests a
framework and terminology for distinct but related tasks. It then
uses this framework to understand security applications of data
mining and concludes with a discussion of the sources and role of
patterns in security applications.
2.1 Definitions of Data Mining
There are a number of well-accepted definitions of data mining in
the scientific community. Most of them center on the idea of
pattern discovery. The most widely used definition, from [3] is
that data mining is “the non-trivial process of identifying valid,
novel, potentially useful and ultimately understandable patterns in
data.” A newer definition from Jonas and Harper [5] defines data
mining as “the process of searching data for previously unknown
patterns and often using these patterns to predict future
outcomes.” Note how these definitions, cited in the same order as
they were proposed, both focus on the discovery of patterns while
the new definition adds the emphasis on the use of these
discovered patterns, especially for prediction. 2 Note also how
neither of these definitions mentions data collection, data
aggregation or linking, or particular applications.
In contrast to the definitions used in the scientific community,
politicians have defined data mining both more broadly and more
narrowly. These definitions are broader in so far as they include
search and collection of data, and they are narrower in that they
typically refer to security applications the purpose of which is to
prevent terrorism. “Data Mining is a broad search of public and
non-public databases in the absence of a particularized suspicion
about a person, place or thing. Data mining looks for relations
between things and people without any regard for particularized
suspicion” according to U.S. Senator Russ Feingold on January
16, 2003. The U.S. Department of Defense Technology and
Privacy Advisory Committee in March 2004 defined data mining
as “searches of one or more electronic databases of information
concerning U.S. person by or on behalf of an agency or employee
of the government.” Senator Feingold’s proposed amendment to
HR 5441 defined data mining as “a query or search or other
analysis of 1 or more electronic databases, whereas – (A) at least
1 of the databases was obtained from or remains under the control
of a non-Federal entity, or the information was acquired initially
by another department or agency of the Federal Government for
purposes other than intelligence or law enforcement; (B) a
department or agency of the Federal Government or a non-Federal
entity acting on behalf of the Federal Government is conducting
the query or search or other analysis to find a predictive pattern
indicating terrorist or criminal activity; and (C) the search does
not use a specific individual’s personal identifiers to acquire
information concerning that individual.” Senator Patrick Leahy,
opening the Senate Judiciary Committee Hearing on the
“Balancing Privacy and Security: The Privacy Implications of
Government Data Mining Programs” on January 10, 2007,
defined data mining as “the collection and monitoring of large
volumes of sensitive personal data to identify patterns or
relationships.”
2 Prediction is actually used in two senses here. Prediction can mean “to infer some value about another entity in the database,” or it can mean “to suggest something that might occur in the future based on an analysis of the past.” As the physicist Niels Bohr said, “Prediction is very difficult, especially about the future.”
It is important to note how these definitions of data mining differ
from the scientific definitions. First, they assume a particular
purpose, namely, security applications. Second, they include the
concepts of data collection, monitoring, and search. And third,
they assume that pattern-based searches are being conducted to
identify specific individuals who fit the patterns. The focus is not
so much on pattern discovery, but rather on pattern matching; the
end product is not the patterns themselves as in the scientific
definitions, but rather the matches of the patterns to people (and
perhaps also places, things, events, etc.) to predict something of
interest having to do with security. Note that neither of these sets
of definitions covers the primary activity of data mining
researchers, i.e., developing new algorithms for pattern discovery.
In the U.S., the Federal Agency Data Mining Reporting Act of
2007 (“Data Mining Reporting Act”) requires the “head of each
department or agency of the Federal Government” that is engaged
in activities defined as “data mining” to report annually on such
activities to Congress. The Data Mining Reporting Act defines
data mining as “a program involving pattern-based queries,
searches, or analyses of 1 or more electronic databases” in order
to “discover or locate a predictive pattern or anomaly indicative
of terrorist or criminal activity.” According to [7] and [8], “the
limitation to predictive ‘pattern-based’ data mining is significant
because analysis performed … for counterterrorism and similar
purposes is often performed using various types of link analysis
tools. These tools start with a known or suspected terrorist or
other subject of foreign intelligence interest and use various
methods to uncover links between that known subject and
potential associates or other persons with whom that subject is or
has been in contact. The Data Mining Reporting Act does not
include such analyses within its definition of ‘data mining’
because such analyses are not ‘pattern-based.’ Rather, these
!"##$%&%#'()()*#+,-,%./0
$%&%
$%&%
$%&%
Research
12*3.(&04
5"##6$%&%#'()()*7
="#$'#1>>2(/%&(3)
$%&%
$%&%
$%&%
$%&%
Pattern
Discovery
8%&&,.)-
Linking,
Targeting,
Investigating,
etc.
?"#@,/A.(&B
1>>2(/%&(3)
Pattern
Detection
8.,9(/&(3)-:
;)<,.,)/,-
C3223DEF)
1/&(3)-
Figure 1. Data Mining Activities
analyses rely on inputting the ‘personal identifiers of a specific
individual, or inputs associated with a specific individual or group
of individuals,’ which is excluded from the definition of the act.”
2.2 What Data Miners Do
Based on the above definitions, it appears that data miners engage
in three distinct but related activities: (1) data mining research,
the primary focus of which is algorithm development, (2) data
mining itself, whose primary focus is pattern discovery, and (3)
data mining applications, whose primary focus is predicting or
inferring the value of a feature for some purpose. (In the case of
security applications of data mining, the feature is typically a
likelihood that a particular person is high-risk.) Finally, the data
mining application developer may also engage in a fourth
activity: (4) the design and development of other aspects of an
end-to-end system that makes use of the predicted feature for
some particular purpose. This end-to-end system can involve a
variety of data sources and analytical and investigatory
techniques; may result in many alternative downstream analyses,
decisions and actions; and typically involves human analysts and
other actors. Figure 2 of [15] provides an example of how this
end-to-end process may occur in the context of law enforcement
investigations. (It is important to note that the result of a positive
match to a pattern that may be indicative of increased risk is
usually and appropriately a more thorough analysis by a human
analyst; rarely if ever is a pattern match relied on for any
consequential action, nor should it be.)
Each of the activities performed by data miners has distinct data
requirements and distinct products, as depicted in figure 1. The
figure is intended to depict not only the distinct activities, but also
the different data needs and uses for each activity. Data mining
researchers typically identify a set of databases that have a
characteristic that has not been previously exploited for effective
pattern discovery. They acquire several – or as many as are
readily available – databases that share this characteristic and
develop an algorithm that takes advantage of this characteristic
and results in the discovery of more effective patterns. (Note that
the effectiveness of the patterns is defined with respect to some
particular application task.) The databases the data mining
researchers use need not have information that identifies the
entities in the databases, although they often do need to maintain
unique identifiers for some classes of patterns. The databases need
not be complete or even close to complete with respect to the
actual populations, although some degree of representativeness is
highly desirable. Multiple databases that are about a diverse set of
domains are strongly preferred in order to demonstrate the
widespread applicability and utility of the newly developed
algorithm.
The second activity, the actual mining of data, uses various
algorithms to develop models, or, synonymously, to discover
patterns. This step is sometimes called knowledge discovery, and
the resulting patterns or models are referred to as knowledge. This
activity typically is based on a single database, or at least a single
“virtual database” in so far as the analysis is concerned. 3 All
fields of a database are relevant here, as the purpose of pattern
discovery is to determine which of the fields are relevant and
which are not for the detection of the phenomena being modeled;
however, it does not necessarily require records referring to all
elements of the population, just a large enough sample with
enough examples of the phenomena of interest. This activity of
data mining may more generally be termed “data analysis” – it
could use, for example, statistical or other techniques. The result
of this mining of data is a set of patterns that have predictive
value. This is the activity that conforms to the widely accepted
definition of data mining in the technical community.
Before continuing the discussion, however, two other points are worth mentioning. First, while this discussion of data mining activities is depicted in terms of propositional data, the basic ideas apply to relational data as well. Second, we note that these four distinct activities are often conflated not only by policymakers, end users, and other stakeholders, but also by data miners themselves. In particular, data mining researchers tend to view every activity other than data mining research as “applications,” while those responsible for end-user applications tend to group activities one and two as “research.” 4
3 Note that for purposes of simplicity, we ignore the field of distributed data mining. It is recognized that distributed data mining techniques can be used to discover local patterns that can later be merged; the relevant question in this field is not how databases can be split vertically for pattern discovery, but rather how they can be split horizontally and the patterns combined.
4 In fact, this confusion often manifests itself not only in policy debates, but also in acceptance criteria for data mining conferences and journals.
2.3 Security Applications of Data Mining
The third activity in which data miners engage is the actual
prediction itself. Predictions are made using the patterns
discovered by the second activity on new data elements that had
not been used in the pattern discovery. This activity would
typically be widely applied to all members of the population of
interest, but would require only those fields or attributes that have
been determined to be relevant during the actual data mining
analysis. Each record in the database is matched against the
pattern and an inference or prediction is made. These inferences
or predictions may then be used, typically in conjunction with
additional information that has been collected based on knowing
the identity of the individual whose particular records matched the
pattern, to make a further determination of interest and take
appropriate actions, outside the scope of but resulting from the
data mining application. An example of such inference might be
the assignment of a credit score to a new applicant, based on a
match to patterns that were discovered during the second activity
and were determined to be useful in predicting credit risk.
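The distinction can be made concrete with a small sketch (ours, not the paper's): activity 2 fits a model on historical labeled records, and activity 3 merely scores new records against the resulting pattern, using only the fields found relevant. The data, features, and threshold are fabricated, and scikit-learn is an illustrative choice.

from sklearn.linear_model import LogisticRegression

# Activity 2: pattern discovery on a historical database (features, label).
X_hist = [[0, 1, 3], [1, 0, 0], [1, 1, 4], [0, 0, 1]]
y_hist = [1, 0, 1, 0]                     # 1 == later turned out high-risk
model = LogisticRegression().fit(X_hist, y_hist)

# Activity 3: application -- score a new record never seen during discovery.
new_record = [[1, 1, 2]]
risk = model.predict_proba(new_record)[0][1]
print("flag for human review" if risk > 0.7 else "no action")

Note that, consistent with the point made above, the output of the match is a referral for human analysis, not a consequential action in itself.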
With these four activities in mind, we see that what is typically
referred to as a security application of data mining may combine
aspects of several activities but usually emphasizes some
combination of the third and fourth – the detection of entities in
the database that match particular patterns of interest and their use
in an end-to-end application process for evaluating risk, initiating
and/or conducting investigations, and taking appropriate actions.
Such an application could involve no automated pattern matching
at all, or it could be totally dependent on such automated pattern
matching. Many real applications, such as the one described in
[14], combine these aspects. While the use of the application for
its intended purpose may not include any algorithm development
or pattern discovery, its development process may benefit from
these activities. The application may also include such activities
to enable continuous updating of the patterns to account for
changes in behavior. It may also occasionally take advantage of
the first activity, if improved algorithms can result in the
discovery of more effective patterns.
The fourth activity type is the actual end-to-end security
application. This activity is not performed by scientists or
engineers, but by some organization with an operational mission.
The organization may be responsible for screening applicants for
some purpose, or in a non-security context it might be responsible
for marketing a particular product or service. Such an
organization does not care about the source of the knowledge
used in its systems; it cares simply about their effectiveness. This
knowledge may be the result of patterns discovered by data
mining, or it may arise from other sources, such as a deep
understanding of the domain, a formal set of regulations, etc.
Often, these domain-specific applications do not incorporate any
data mining algorithms or any patterns that were discovered using
such algorithms; why they do not is an issue both for application
developers, who could potentially build more effective systems
by taking advantage of data mining techniques and results, and for
data mining researchers, who could potentially provide more
useful technology for real applications.
Before continuing the discussion, two other points are worth mentioning. First, while this discussion of data mining activities is depicted in terms of propositional data, the basic ideas apply to relational data as well. (For purposes of simplicity, we ignore the field of distributed data mining. It is recognized that distributed data mining techniques can be used to discover local patterns that can later be merged; the relevant question in that field is not how databases can be split vertically for pattern discovery, but rather how they can be split horizontally and the patterns combined.) Second, we note that these four distinct activities are often conflated not only by policymakers, end users, and other stakeholders, but also by data miners themselves. In particular, data mining researchers tend to view every activity other than data mining research as "applications," while those responsible for end-user applications tend to group activities one and two as "research." (This confusion often manifests itself not only in policy debates, but also in acceptance criteria for data mining conferences and journals.) The next section of this paper explores end-to-end security applications in more depth.

Key issues in the design of such applications are: (1) what is the purpose of the application? (2) what data sources are available, appropriate, and useful for the intended purpose? (3) what techniques will best accomplish these purposes with the available data and patterns? (4) what additional justifications are required to acquire additional data? (5) what records are kept after an analysis is performed? and (6) what follow-on actions are allowed as a result of the application? The purpose of the application is determined by some need having to do with an organization's mission, independent of any consideration of the use of data mining. The application may use data from which useful patterns have been mined, or it may not, depending on whether such sources and patterns exist, whether it is appropriate to use such sources for the intended purpose, and whether other sources are more useful for that purpose. In the context of security applications, data that have been collected for security purposes (e.g., existing law enforcement and intelligence databases) are often useful and appropriate sources; other data sources, such as commercial transactions, are often neither useful nor appropriate. The selection bias that results in inclusion in such a security database in the first place may, in fact, be viewed as a prima facie indicator of a high-risk entity; this is why techniques that start from known risky individuals and "connect the dots" are often most effective for security applications. Other pre-screening techniques, such as observations of suspicious behavior or setting off alarms for carrying potentially dangerous material, provide a similar selection bias that may be at least as effective as more abstract pattern matching for purposes of a
particular security application. Techniques that are useful for
security applications may include pattern matching, link analysis,
anomaly detection, and others; often, some combination of these
techniques is most effective. Often the security application
includes additional data collection about entities for whom more
information is justified based on initial indicia of risk or
suspicion; this additional data collection may involve additional
entities who are somehow “connected” to known entities or it
may involve collection of additional types of information through
a subject-based query on a known entity to enable an accurate
determination of that entity’s status. (Such additional collection
based on subject-based queries is depicted by the bi-directional
arrows in activity 4 in figure 1.) The application may store or
discard various intermediate results about a particular entity; these
data retention issues are crucial because of the potential long-term
effect on an individual. If an individual is determined to be low-risk only after an extensive analysis of additional data, pertinent
data about that individual could be discarded at a cost of having to
repeat the analysis in the future; however, if such data are
retained, then it would be essential to prevent the use of such
additional data for any purpose other than avoiding a more
detailed analysis of such an individual in the future. Finally, the
application exists in the context of some business process and
some set of authorizations and authorities, both of which
determine what follow-on actions may result from use of the
application. It is typically this last issue that is of most societal
concern, for this is when consequences can occur. It is important
to note that these consequences are not the direct result of the use
of data mining to discover patterns; rather, they are the result of
policies and procedures that are adopted by a user organization
with regard to the results of a security application.
A key feature of all security applications is that they are multi-stage processes. Each stage passes along the riskiest entities to the subsequent stage for more detailed analysis and discards the low-risk entities. The more detailed analysis may incorporate
additional data sources – data sources that are more expensive to
obtain or data sources the use of which is restricted until some
additional justification exists. The more detailed analysis may,
and often does, result in a conclusion that the entity under
consideration is, in fact, not as risky as determined by the
previous stage and, therefore, cause the entity to be removed from
the risky category.
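To make this multi-stage structure concrete, the following Python sketch may help; it is our illustration rather than any system described in this paper, and all scorers, thresholds, and entities are invented placeholders. Each stage scores what it receives and passes only the riskiest entities to a more detailed, more expensive next stage.

```python
# Illustrative multi-stage screening pipeline. The scorers, thresholds,
# and entities below are invented placeholders, not a real design.
def stage(entities, score, threshold):
    """Pass along the riskiest entities; discard the low-risk ones."""
    return [e for e in entities if score(e) >= threshold]

def cheap_score(e):
    return e["initial_risk"]

def detailed_score(e):
    # A later, more expensive stage may consult additional data (here,
    # a link-analysis score) and may clear entities flagged earlier.
    return e["initial_risk"] * e.get("link_score", 1.0)

population = [
    {"id": 1, "initial_risk": 0.9, "link_score": 0.2},
    {"id": 2, "initial_risk": 0.8, "link_score": 1.5},
    {"id": 3, "initial_risk": 0.1},
]

shortlist = stage(population, cheap_score, 0.5)   # broad, cheap screen
flagged = stage(shortlist, detailed_score, 1.0)   # narrow, detailed review
print([e["id"] for e in flagged])                 # -> [2]; entity 1 was cleared
```

Note how entity 1, flagged by the cheap first stage, is removed from the risky category by the more detailed second stage, exactly the behavior described above.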
A crucial issue in the design of any security application is the
source and role of patterns. Pattern detection may be an effective
technique, especially when applied to existing law enforcement
and intelligence data and used to detect low-level activities and
combine such low-level activities into higher-level plans or
organizations. It may be less useful when applied to screening of
individuals. An important point is that the utility of any pattern
must be established and verified empirically before such a pattern
is used as a component of a security application. Patterns may
come from data mining, but also from other sources. For example,
patterns used in [13] resulted from an analysis of market
regulations and hypotheses about possible schemes to engage in
improper market behavior, and these patterns were deployed only
after a rigorous and iterative process of modification and
validation. Patterns may also come from external sources; for
example, knowledge of a newly confirmed or suspected
adversarial technique could result in the development and use of a
pattern for its detection. Patterns may also arise from anomaly
detection techniques; in this case, normal patterns of activity are
removed from the data and what remains is considered unusual
and potentially suspicious.
We consider two examples of security applications to illustrate
these ideas. First, imagine an application for screening passengers
at airports. In such an application, there is no a priori reason to
suspect that people who choose to fly on airplanes are more likely
than the general population to be dangerous. Rather, the concern
is with the possibility of any particular, dangerous person being
aboard an aircraft. In such an application, initial screening might
be, and often is, based on a physical inspection, close behavioral
observations, detailed questioning (in the case of El Al Israel
Airlines), or some combination, rather than on pattern matching.
Additional data might be considered for people who somehow
appear suspicious on one of these tests. This situation contrasts
with an application that might be used to determine where to
focus investigatory resources based on people who appear in
lawfully collected intelligence databases. In such an intelligence
application, the initial indicia of risk come directly from the fact
that a person is included in the database itself; hence, pattern-based analysis is likely to be a useful tool.
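As a rough illustration of such subject-based analysis, the sketch below (ours; the link graph and names are invented) expands outward from a known subject by a bounded number of hops, which is the essence of "connecting the dots."

```python
# Toy subject-based link expansion: start from an entity already in a
# database and follow recorded links (calls, transfers, travel, etc.)
# a bounded number of hops outward. The graph is invented.
from collections import deque

links = {
    "known_subject": ["a", "b"],
    "a": ["c"],
    "b": [],
    "c": ["d"],
}

def connected(graph, start, max_hops):
    """Breadth-first expansion from a known subject, limited to max_hops."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, hops + 1))
    return seen - {start}

print(connected(links, "known_subject", max_hops=2))  # {'a', 'b', 'c'}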
Finally, it must be noted that a security application will almost
always and should always include strict audit functions, controls
on use, and review mechanisms to ensure that the application is
being used solely for its intended purpose and is not being abused
in any way. In fact, data mining techniques independently
applied to the audit logs are themselves one method to detect,
deter, and guard against abuses of security applications
themselves.
3. CRITICISMS OF SECURITY
APPLICATIONS OF DATA MINING
Security applications of data mining that have received the most criticism include Total Information Awareness (TIA), the Computer-Assisted Passenger Prescreening System (CAPPS II), and the Multistate Anti-Terrorism Information Exchange (MATRIX) [12]. These systems/projects were all cancelled, after expenditures of millions of U.S. dollars, because of concerns both about privacy and civil liberties and about their effectiveness [16]. Secure Flight, a follow-on to CAPPS II, and the Department of Homeland Security's Analysis, Dissemination, Visualization, Insight and Semantic Enhancement (ADVISE) system were also cancelled, due to security vulnerabilities and privacy concerns, respectively [16]. Even research programs that incorporated a full measure of privacy protection and had the sole purpose of determining whether a particular algorithm, technique, or approach could develop patterns that indicate terrorist activity were reported, although they did not meet the requirements of the Data Mining Reporting Act [7], and were later cancelled according to [8]. These research programs were an example of the first activity depicted in figure 1.
Criticisms of the effectiveness of data mining security
applications appear in [1], [5], [9], [10], and [11]. Analyses of
some of these criticisms are contained in [4], [6], [14], and [15].
This section of this paper summarizes the criticisms and analyzes
how they may be addressed in security applications of data
mining, using the model presented in section 2 as the basis for
distinguishing separate, and different activities.
3.1 Too Many False Positives
The simplest criticism of security applications of data mining is frequently expressed as "too many false positives": it is accurately noted that, for events that occur far less frequently than the error rate of a classifier, most positive results will be false positives. This criticism is addressed in detail in [14]; a multi-stage classification architecture, preceded by a high-risk population selection and followed by link analysis, is shown to be one method of mitigating the problem of too many false positives. For example, a 99.9 percent accurate classifier applied to a population of 300 million entities containing only 3,000 true positives (i.e., 0.001 percent) would by itself yield over 100 times more false positives than true positives. However, with a multi-stage classification technique consisting of two independent stages at 99 percent and 99.9 percent accuracy, and assuming that 5 percent of the population falls in a high-risk group that is 10 times more likely to be positive, almost all groups of reasonable size would be detected. The "false-positive" criticism is also addressed in [4] in the context of relational data, ranking classifiers, and multi-pass inference.
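The arithmetic behind this example can be checked directly. The following sketch reproduces the base-rate computation; treating "99.9 percent accurate" as both the false-positive and false-negative rate, and treating the two stages as independent, are our simplifications for illustration.

```python
# Back-of-the-envelope check of the base-rate argument above.
# Population figures come from the text; treating "99.9 percent
# accurate" as both sensitivity and specificity is a simplification.
population = 300_000_000
true_positives = 3_000            # 0.001 percent of the population
accuracy = 0.999

detected = accuracy * true_positives                           # ~2,997
false_alarms = (1 - accuracy) * (population - true_positives)  # ~299,997
print(f"false alarms per true alarm: {false_alarms / detected:.1f}")  # > 100

# Two independent stages (99 percent, then 99.9 percent) applied only to
# a high-risk group of 5 percent of the population: a rough sketch of
# the multi-stage architecture analyzed in [14].
high_risk_group = 0.05 * population
two_stage_false_alarms = (1 - 0.99) * (1 - 0.999) * high_risk_group
print(f"two-stage false alarms: {two_stage_false_alarms:,.0f}")  # ~150
```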
The flaw in the false-positive criticism is that it assumes a single-stage classifier. As should be clear from the discussion in section
2.3 of this paper, no serious security application would be this
simplistic, if only because any credible application designer
would be aware that such an approach would not work. An initial
classifier might be used in some applications as a first level
screener to manage a large workload; such a classifier would have
to be tuned to minimize false negatives. The application would
rely on subsequent stages to rule out false positives. These
subsequent stages would likely employ a combination of
techniques.
3.2 Nobody Does That Anymore
This criticism suggests that matching known patterns is not
useful. The flaw in this criticism is that past threats remain dangerous if they can be executed again: instances of known attack patterns must be prevented, or attackers will simply reuse them. There is no a priori reason to assume that a known
attack pattern will not be reused; in fact, if something is
successful, human nature suggests trying it again. Active
detection of indicators of past patterns – and publicizing the
ability to do so, although not the details of how it is done – will
not only detect such patterns, but also deter individuals from
trying them again. This is why we still have to take off our shoes
at airports even though there have not been any publicized
accounts of attempted use of shoe bombs in quite a while; if we
did not have to have our shoes inspected, then shoe bombing
might return as it is a proven and low-cost attack method. Further,
there may be many potential adversaries who are capable of
executing only a single type of attack; preventing them from
using that method removes them from the potential population of
adversaries. And finally, detecting known attack patterns prevents
potential increases in the population capable of using and
motivated to use that attack type by avoiding the possibility of
copycat attacks. In the context of section 2.3, this criticism relates
to the choice of patterns to use in the security application. These
patterns are easy to specify precisely because they are known, and
therefore, pattern detection is a useful technique for this security
measure.
3.3 It Won’t Be Perfect
Some systems are criticized because they will not be perfect –
there is no way at an acceptable cost to prevent all potential
attacks. This criticism is often explained in terms of the cost of a
false negative – if even one terrorist attack occurs because it is not
detected, the cost to society would be astronomical. (This
situation is frequently contrasted with the cost of a false negative
in a marketing or fraud detection application, in which case, the
right thing to do is minimize the cost across a large number of
cases, in contrast to security applications where the goal is to
prevent all false negatives.) What this criticism ignores is the fact
that no system is or can ever be perfect; rather, the goal is to
maximize effectiveness at a fixed or minimal cost (in terms of
effort to develop and use the system, in terms of disruptions to
normal functions, and in terms of the impact on privacy and civil
liberties). Comparing alternative resource allocations to maximize
effectiveness is the subject of [6]. The right question to ask is not
“Is this system perfect?” but rather “How does this system
increase our overall security in the context of all our other
systems?” An effective security application will be part of a
layered defense that uses a multitude of techniques with
uncorrelated errors; such a design will be most effective at
providing maximum security for a fixed resource allocation.
3.4 It Will Just Make Them Try Something
Else
Many bad guys are intelligent adversaries who can be very creative in devising attacks of different types. This criticism
typically suggests that there is no point in preventing one type of
attack because another equally costly attack can easily be devised
and substituted. However, this criticism ignores the fact that not
every bad guy is capable of creating a new attack method.
Preventing known attacks forces adversaries to spend time
developing new attack methods, acquiring new capabilities and
resources, and training new attackers. This prevention of known
attack types, therefore, has a real cost to adversaries. And not
doing so would have a huge cost in morale to those being attacked
repeatedly by the same methods with no effective response. One
technique for increasing security in the face of potential new
attack types includes red-teaming potential new attack types and
incorporating such patterns in the security application. A second,
related technique is to use patterns corresponding to variants of
known attack types, based on the assumption that variants of
previous attacks are likely to be tried by an intelligent adversary
because they involve minimal change. A third technique to detect
new attack types is to decompose the known attacks into required
constituent activities, and then create new patterns based on novel
recombinations of these lower-level activities.
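As an illustration of the third technique, the sketch below (ours; the attack names and constituent activities are invented placeholders) enumerates unseen recombinations of low-level activities drawn from known attacks as candidate patterns for subsequent validation.

```python
# Hypothetical sketch of the third technique: decompose known attacks
# into constituent activities, then enumerate novel recombinations to
# use as candidate patterns. Activity names are invented placeholders.
from itertools import combinations

known_attacks = {
    "attack_a": {"financing", "recruitment", "materials"},
    "attack_b": {"financing", "surveillance", "travel"},
}

# Pool of low-level constituent activities seen across known attacks.
activities = set().union(*known_attacks.values())

# Candidate patterns: unseen combinations of three constituent activities.
known = set(frozenset(a) for a in known_attacks.values())
candidates = [
    set(c) for c in combinations(sorted(activities), 3)
    if frozenset(c) not in known
]
print(f"{len(candidates)} candidate recombinations to validate")
```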
3.5 It Will Make Them Try Something More
Complicated and Serious
This criticism suggests that prevention of low-consequence
attacks will result in more devastating attacks as adversaries
creatively invent new methods. This is a variant of the previously
discussed criticism – not only will prevention of some attack
types cause other attack types to be used, but the new attack types
will be more serious than those that have been prevented. What
this criticism ignores is that more serious attacks are typically
more complicated; they require far more planning, capabilities,
training, and resources than less serious, simpler attacks. This
additional complexity typically involves a longer time to plan and
prepare for the attack, the involvement of more people in the plan,
and, perhaps most important, more interactions with non-conspirators. All of these factors make it easier to detect the more
complicated and serious attack before it is executed – only one
starting point is needed and many more are available. Not only is
there a cost of something new, but there is an additional cost of
something more complicated.
3.6 It Will Make Them Try Something New
That You Haven’t Thought Of
A further criticism is that effective detection of known attack
patterns ignores detection of new attack patterns that have not yet
been conceptualized by those responsible for security
applications. This criticism is countered by several observations:
(1) that even novel attack patterns involve low level activities that
arouse suspicion (think of the flight training prior to 9/11), (2)
that starting from known subjects can lead to other bad guys (this
is the essence of link analysis), and (3) that novel attacks are
difficult and expensive to devise. It is this last observation that is
key – by detecting previously used attack patterns, those
responsible for security are forcing the bad guys to adapt
constantly. Every attack they try is new, and is being tried for the
first time. This greatly increases the probability that an attack will
not be successful – who gets everything right on the first try?
Forcing their adversaries to invent more complicated and novel
attacks makes their tasks as difficult as possible. It also forces
adversaries to test components of a new attack, which converts
these component activities from novel actions to repeated ones
and makes them amenable to detection techniques that rely on the
use of automated pattern discovery to detect repeated sequences
of related activities.
3.7 There’s Not Enough Training Data
Often, data mining applications for security are compared to
applications in credit card fraud detection. The discussion
typically has an advocate for data mining applications who cites
the high effectiveness in real time of scoring credit card
transactions and a critic who points out that there are a
multitude of examples from which a system can learn the
indicators or patterns of fraud in credit cards compared to few
examples of terrorist attacks. Jonas and Harper [5] make this argument quite effectively, pointing out that with "a relatively small number of attempts every year and only one or two major terrorist incidents every few years – each one distinct in terms of planning and execution –" there are no meaningful patterns "that show what behavior indicates planning or preparation for terrorism."

(The situation is similar to a scene from "Fiddler on the Roof." In the scene, Tevye hears an argument between his neighbors Perchik and Mordcha and, after hearing each of their positions, says "you are right" and "you are also right." Another character, Avram, says, "He's right and he's right? They can't both be right." Tevye replies, "You know, you are also right." As in this scene, who is really right here?)
There are certainly and fortunately a small number of examples of
successful terrorist attacks and known disrupted attacks and,
presumably, a larger but not extremely large number of unknown
disrupted attacks, but nowhere near the amount needed for
statistically valid pattern discovery. However, as in many types of
fraud detection applications, the components of such attacks are
similar. They all involve financing, acquisition of material,
recruitment of participants, communication between the
participants, etc. While these activities occur frequently and
predominantly for legitimate reasons, when combined in
particular contexts, they can potentially provide enough cause for
further information collection and analysis, enabling the type of
link analysis that Jonas and Harper advocate. Improvements in
data mining algorithms that would enable the learning of usefully
discriminative patterns from minimal training data are a challenge
for the research community; while pattern-based data mining may
be inadequate at present and even for the foreseeable future, new
techniques may prove to be useful at some point in the future.
Before they would be deployed or even considered for inclusion
in a security application, such techniques would have to be
subject to a rigorous cost-benefit analysis, including
considerations of data use and privacy implications. In all
likelihood, such techniques would be useful only in combination
with link-analysis techniques, referred to as “subject-based data
analysis” and contrasted with “pattern-based data analysis” by
Jonas and Harper. As the first step in a mass screening system,
such predictive data mining is unlikely to be useful for the reasons
pointed out by Jonas and Harper. However, despite their use of
the term “predictive data mining” to describe what would be
ineffective, they really are arguing against only a particular
design choice rather than against the entire set of data mining
techniques described in section 2 of this paper.
3.8 They Can Reverse Engineer the System
The Carnival Booth algorithm has been proposed as a way that bad guys can reverse engineer a security system [1]. This is a serious criticism, and it deserves a serious and thorough analysis.
The Carnival Booth algorithm is developed and analyzed in the
context of the CAPPS system. The conclusion is that selecting
individuals for increased scrutiny can actually decrease security
because the individuals can probe the selection algorithm to
determine who is likely to be selected and who is not. The
analysis assumes a fixed percentage of people who can be subject
to secondary screening at airports due to a fixed amount of
screening resources and the need to keep passengers flowing
through the system at a reasonable rate. Using the Carnival Booth
algorithm, a terrorist group can determine who is not likely to be
selected for increased scrutiny and then use that person to execute
an attack. The paper suggests that this algorithm would be most effective
when there is a diverse population of potential attackers and
presents anecdotal information that this is indeed the case.
Essentially, the Carnival Booth algorithm works if a terrorist can
determine that his chance of being selected for secondary
screening is less than average; for example (using the numbers
from [1]) if 8 percent of passengers are subject to secondary
screening and 2 percent are selected randomly, then a terrorist has
to reduce his chance of being selected for enhanced screening to
less than 6 percent. While an individual potential attacker cannot
change his chance of being selected under this model, a terrorist
group leader could use a population of potential attackers and
select those who do not get selected on a large number of probing
flights. A potential attacker is not reducing his actual chance of
being selected; rather, he is decreasing the uncertainty in his
estimate of his chance of being selected by repeatedly probing the
system. Once some potential attacker is determined to have a
lower than average chance of being selected for increased
scrutiny, he is given the mission to execute an attack.
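A toy Monte Carlo sketch of this probing strategy may clarify the mechanics. It is ours, with invented profile-risk numbers and group sizes; only the 8 percent and 2 percent figures come from [1]. It illustrates that probing does not change anyone's selection probability, but narrows each recruit's estimate of it.

```python
# Toy Monte Carlo sketch of the probing strategy described in [1],
# under the fixed assumptions used in the text: 8 percent of passengers
# receive secondary screening, 2 percent of selections are random.
# All profile scores and group sizes are invented for illustration.
import random

random.seed(0)

def selected(profile_risk):
    """One flight: True if the passenger is pulled aside. 2% of
    passengers are selected at random; profile-based selection fills
    roughly the rest of the 8% quota in proportion to profile risk."""
    if random.random() < 0.02:
        return True
    return random.random() < 0.06 * profile_risk  # ~6% at average risk 1.0

# A cell of recruits with varying (unknown to them) profile risk.
recruits = {f"recruit_{i}": random.uniform(0.2, 2.0) for i in range(20)}

# Each recruit probes the system with many flights; the leader keeps
# whoever is observed to be selected least often.
probes = 50
observed = {
    name: sum(selected(risk) for _ in range(probes)) / probes
    for name, risk in recruits.items()
}
best = min(observed, key=observed.get)
print(best, f"observed selection rate: {observed[best]:.2%}")
```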
What are the flaws in this strategy? For one, it requires multiple
recruits rather than a single recruit for each position on the attack
team. While there may be one recruit whose profile causes him to
be less likely than average to be selected for increased scrutiny, it
is unlikely that, on average, the recruits will be less likely than
average to meet the selection criteria. In fact, one can make an
argument that people who are subject to terrorist recruitment are
actually, on the average, more likely to be subject to increased
scrutiny, especially if those designing the selection criteria have
insights into what makes someone susceptible to terrorist
recruitment. The Carnival Booth algorithm also assumes that even
if a recruit is selected for increased scrutiny, he will not be
arrested when he is not actually on an attack mission, because he
will not be in possession of any suspicious materiel. This
assumption ignores the fact that the recruit knows he is on a
probing mission for the terrorist group and may behave in a way
that arouses increased suspicion. Even if he is allowed to fly, his
behavior may result in his being the subject of additional
information collection – i.e., the starting point for a link analysis.
The Carnival Booth algorithm, therefore, shares some
characteristics with the classic gambler’s strategy of doubling
every losing bet – without an infinite amount of resources, it will
eventually fail. The fact that there must be a reasonably large
number of recruits for the Carnival Booth algorithm to result in
one recruit with a lower than average selection probability creates
additional risk of mistakes or exposure to the terrorist group. And
the probing activity of the recruits could result in adaptation of
the profiles used for selection, which would defeat the supposed
advantage of the Carnival Booth algorithm, especially if this
adaptation occurred as frequently as the probes.
3.9 You Can’t Catch the Lone Wolf
This criticism is the most serious of all that have been proposed.
A capable individual acting alone, who devises and executes a
serious new attack scheme, will likely be able to evade detection.
Reducing the possibility that this scenario can occur would seem
to be a critical aspect of providing increased security. Tighter
controls on dangerous materials, separating components needed to
create weapons, and reducing the motivations of people to engage
in terrorist activities would seem to be the most effective
strategies for this difficult problem. In a sense, this criticism says
that a security application won’t be able to detect someone who
manages to avoid all of its data sources and analytical techniques.
Such an interpretation is obviously true but not particularly
insightful.
4. METRICS
Any paper with the subject of the efficacy of data mining
applications for any purpose whatsoever is incomplete without at
least a brief discussion of metrics. This paper is no different. We
briefly present a number of metrics that are either implicit or
explicit in the evaluation of data mining applications for security.
Ultimately, it is not analyses of the type discussed herein, but
rather rigorous metrics-based experiments that will establish the
efficacy of alternative designs and techniques for security
applications.
Because of the high cost of a terrorist attack, the typical metric for
a security application is the number (or probability) of a false
negative; i.e., failure to prevent an attack. This is typically traded
off against the probability of a false positive. Because of the
vastly unequal costs of false negatives (extremely high) compared
to false positives and the vastly different numbers of true
positives (extremely low) compared to true negatives in the
population, the likelihood of misclassifications must be weighted
by the costs and frequencies to determine the overall costs of a
security application. The benefits of the security application are
expressed in terms of threats averted.
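A minimal sketch of such a cost-weighted comparison (ours; all rates, costs, and population figures are invented placeholders) makes the weighting explicit:

```python
# Hedged sketch: misclassification rates must be weighted by both
# frequency and cost before two screening designs can be compared.
# Every number below is an invented placeholder.

def expected_cost(fn_rate, fp_rate, n_pos, n_neg, cost_fn, cost_fp):
    """Total expected cost of a screening design on a population with
    n_pos true positives and n_neg true negatives."""
    return fn_rate * n_pos * cost_fn + fp_rate * n_neg * cost_fp

# Design A: fewer false negatives, more false positives; Design B: reverse.
a = expected_cost(fn_rate=0.001, fp_rate=0.01, n_pos=3_000,
                  n_neg=300_000_000, cost_fn=1e9, cost_fp=100)
b = expected_cost(fn_rate=0.01, fp_rate=0.001, n_pos=3_000,
                  n_neg=300_000_000, cost_fn=1e9, cost_fp=100)
print(f"design A: {a:,.0f}  design B: {b:,.0f}")
```

With these (invented) numbers, the design that tolerates more false positives has the lower total expected cost, reflecting the vastly unequal costs of the two error types.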
Because the benefits of a security application depend on the
assumed distribution of threats in a population, it is desirable to
have a metric that illustrates its effectiveness independent of this
distribution. A metric with this property is discussed in [13].
Other metrics that may be used to evaluate alternative system designs include the distribution of costs (e.g., is it better or worse to inconvenience one person a lot or many people a little?) and minimization of the worst-case outcome (i.e., maximizing the likelihood of preventing the most serious threats, even if this means increasing the chances that more instances of less serious threats will occur).
Another class of metrics relates to the various criticisms. How
quickly can new patterns be discovered, validated, and deployed?
What is the value in preventing previous attack patterns compared
to detecting new ones? How can security forces cause maximum
disruption to attackers while minimizing costs to those whom they
are protecting?
5. CONCLUSIONS
What can we conclude? Is data mining useful for security or not?
What aspects of data mining are likely to be useful and what
aspects are likely to be ineffective? What criticisms are valid
because of the requirements of security applications, and what
criticisms really just point out ineffective designs? Where might
additional research yield useful new techniques, and where is it
unlikely to do so?
In some areas, the jury is still out. Data mining algorithms have
not yet resulted in the ability to discover patterns that can predict
terrorism or other security threats that manifest themselves rarely
and as a complex set of related events. They have been effective
at discovering patterns that can detect common events that occur
more frequently, such as cellular telephone or credit card fraud.
A challenge for the research community is to design algorithms
that can extend the range of feasible applications. Even as this
range is extended, it is extremely unlikely that completely
automated pattern discovery will be useful by itself for the
detection of terrorist events.
However, automated pattern
discovery tools may be able to aid in the discovery of patterns of
activity that are components of such threats and that can be
incorporated into security applications.
These security
applications would have to include other techniques as well in
order to be useful for their specific purposes. So while data
mining will not be an entire solution, it can be a useful component
of such a solution.
The hardest threat to detect is the threat of a capable, intelligent
adaptive adversary acting alone. Therefore, the most effective
strategy is one that makes this threat increasingly unlikely. The
other threats, of less capable adversaries, non-adaptive
adversaries, and less-intelligent adversaries, can be effectively
countered by appropriately designed and deployed data mining
applications as a key part of a multi-layered prevention and
detection system. Data mining can be one technique for pattern
discovery, but it is only a part of the design and deployment of an
effective security application. Other techniques, such as starting from known subjects and performing link analyses, detecting dangerous materials, and discouraging terrorist recruitment, are at least as important. While patterns may be useful to guide a search, following connections from known risky subjects matters more.
Finally, it is important once again to note that effective security
applications are complex systems that must have a clearly defined
purpose, clearly specified authorities and authorizations,
appropriate, available and useful data, and clear and manageable
business procedures in addition to effective technologies if they
are to succeed. And they must respect all aspects of privacy, civil
liberties, and other considerations regarding the use and retention
of data for specific purposes. Even with all these constraints, it is
possible to design security applications that can be useful and to
continue research into how to do so.
6. ACKNOWLEDGMENTS
I thank the many colleagues with whom I have had the
opportunity to discuss and refine many of the ideas in this paper
over the years. In particular, I thank Henry Goldberg for helping
to develop many of the ideas discussed in the paper and David
Jensen for helping to develop the model used in figure 2 as well
as for much useful discussion and feedback. Responsibility for
the ideas in the paper is, of course, solely that of the author.
7. REFERENCES
[1] Chakrabarti, S. and Strauss, A. Carnival Booth: An Algorithm for Defeating the Computer-Aided Passenger Screening System. First Monday, Vol. 7, No. 10, 7 October 2002. http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/992/913
[2] Executive Committee of the ACM Special Interest Group on Knowledge Discovery and Data Mining. "Data Mining" Is NOT Against Civil Liberties. June 30, 2003 (revised July 28, 2003). http://www.sigkdd.org/civil-liberties.pdf
[3] Fayyad, U.M., Piatetsky-Shapiro, G., and Smyth, P. From Data Mining to Knowledge Discovery: An Overview. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy, 1-30. Menlo Park, CA: AAAI Press, 1996.
[4] Jensen, D., Rattigan, M., and Blau, H. Information Awareness: A Prospective Technical Assessment. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003) (Washington, DC, USA, August 24-27, 2003). ACM Press, New York, NY, 2003, 378-387.
[5] Jonas, J. and Harper, J. Effective Counterterrorism and the Limited Role of Predictive Data Mining. Policy Analysis No. 584, Cato Institute (December 11, 2006). http://www.cato.org/pubs/pas/pa584.pdf
[6] McLay, L.A., Jacobson, S.H., and Kobza, J.E. Making Skies Safer: Applying Operations Research to Aviation Passenger Prescreening Systems. OR/MS Today, October 2005.
[7] Office of the Director of National Intelligence. Data Mining Report. February 15, 2008. http://www.fbiic.gov/public/2008/feb/ODNI_Data_Mining_Report.pdf
[8] Office of the Director of National Intelligence. Data Mining Report. January 31, 2009. http://www.dni.gov/electronic_reading_room/ODNI_Data_Mining_Report_09.pdf
[9] Paulos, J. Do the Math: Rooting Out Terrorists is Tricky Business. Los Angeles Times, January 23, 2003.
[10] Schneier, B. and Hawley, K. Interview with Kip Hawley (July 30, 2007). http://www.schneier.com/interviewhawley.html
[11] Scientific American (editorial). Total Information Overload. Scientific American, March 2003, 12.
[12] Seifert, J.W. Data Mining and Homeland Security: An Overview. Congressional Research Service (Order Code RL31798), updated January 27, 2006. http://www.au.af.mil/au/awc/awcgate/crs/rl31798.pdf
[13] Senator, T.E. Ongoing Management and Application of Discovered Knowledge in a Large Regulatory Organization: A Case Study of the Use and Impact of NASD Regulation's Advanced Detection System (ADS). In KDD-2000, 44-53.
[14] Senator, T.E. Multi-Stage Classification. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM '05) (Houston, TX, November 27-30, 2005).
[15] Taipale, K. Data Mining and Domestic Security: Connecting the Dots to Make Sense of Data. Columbia Science and Technology Law Review, Vol. 5, No. 2 (Dec 2003). Available at SSRN: http://ssrn.com/abstract=546782
[16] Vijayan, J. House Committee Chair Wants Info on Cancelled DHS Data-Mining Programs: Millions Have Been Spent on Work That Was Eventually Abandoned. ComputerWorld, September 18, 2007. http://www.computerworld.com/action/article.do?command=viewArticleBasic&articleId=9037319
A Study of Online Service and Information Exposure of
Public Companies
S. H. Kwok
Anthony C.T. Lai
Jason C.K. Yeung
Department of ISOM
School of Business Administration
HKUST
[email protected]
Department of ISOM
School of Business Administration
HKUST
Handshake Networking
Hong Kong
[email protected]
[email protected]
ABSTRACT
Public companies are expected to have invested substantial effort and resources in designing and implementing effective security policies for their daily information processing and management against potential cyber attacks. A company web server accessible by the general public, and hence by attackers, is usually a common entry point for cyber attacks. This paper studies and reports the security problems found in the web servers of public companies. We applied several commonly used tools and systems to collect information from the publicly accessible web servers of selected public companies and studied some known security aspects of those companies. Our findings provide insight into the effectiveness of public companies' web servers against cyber attacks. This paper also proposes a risk analysis tool for cyber attacks, known as the pyramid risk analysis tool.
Categories and Subject Descriptors
K.6.5 [Security and Protection]: Unauthorized access (e.g., hacking, phreaking).

General Terms
Cyber security, public company, ports, web server, public server.

Keywords
Cyber security, public company, ports, web server, public server, malicious hackers.

1. INTRODUCTION
Hackers, criminals, and spies have increased their attacks on the information systems and databases of public companies that contain sensitive information. Targets in public companies include financial systems, operations systems, management systems, and so on. An easy and common target is the public web server, because most public companies enable people and users to access their products and services via the Internet.

This paper addresses the research question of how effective the cyber security protections of public companies are. An extended investigation would include an exploration of a public company's network diagram, including its sub-networks, domain addresses, and the technology and systems used in the company; this requires access to internal networks and confidential information. This paper is therefore primarily focused on the security issues and problems in the web servers of public companies. Our findings address the security problems of (1) Common and Necessary Open Ports, (2) Allowed Open DNS, (3) Allowed Open Mail Relay, (4) Web Server Banner and Version Exposure, (5) SPAM Mail Black List, (6) SPF Support, (7) Sensitive Administrative Console and Server Information Exposure, (8) Opened Vulnerable Network Service Ports, and (9) Online Internal-use-only Services. Descriptions of these security problems are presented in Table 1.

Surveyed Areas and their Descriptions and Problems:
(1) Common and Necessary Open Ports: disclose internal server information and layout. In the current research, our survey included TCP port scanning only.
(2) Allowed Open DNS: allow an external party to query the internal mapping between IP addresses and server hosts. More information can be obtained at http://en.wikipedia.org/wiki/Domain_name_system.
(3) Allowed Open Mail Relay: allow non-legitimate emails to be sent by unauthorized parties.
(4) Web Server Banner and Version Exposure: the exposed information could be useful for malicious hackers and attackers to carry out further vulnerability exploitation.
(5) SPAM Mail Black List: disrupts business operations if mails are blocked once the company's domains are put on SPAM mail black lists.
(6) SPF Support (Sender Policy Framework): restricts other people from sending emails with one's domain; in other words, SPF prevents sender address forgery. More information can be obtained at http://www.openspf.org/Introduction.
(7) Sensitive Administrative Console and Server Information Exposure: enables malicious hackers and attackers to access internal applications.
(8) Opened Vulnerable Network Service Ports: apart from the open ports necessary for web services, there are other commonly vulnerable ports that are easy targets for malicious hackers and attackers.
(9) Online Internal-use-only Services: other online portals and entries accessible by the public may increase threats from hackers and attackers.
Table 1: Security Check Areas.
In this paper, we randomly selected ten public companies from the Hong Kong Hang Seng Index (HSI) and ten public companies from the Hong Kong China Enterprises Index (CEI). We applied several commonly used approaches for data collection. The techniques include DNSstuff (http://www.dnsstuff.com), Google (http://johnny.ihackstuff.com/ghdb/), Maltego (http://www.paterva.com), and Nmap (http://nmap.org/). They are commonly used in penetration testing in IS auditing disciplines and practices.
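As an illustration of the TCP port scanning step that tools such as Nmap automate, the following minimal Python sketch (ours; the target host is a placeholder, and such checks should only be run against systems one is authorized to test) attempts TCP connections to a few ports:

```python
# Minimal TCP connect check, the kind of scan Nmap automates.
# The host and port list are placeholders; scan only systems you
# are authorized to test.
import socket

def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Attempt a TCP connection; an accepted connection means open."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    host = "www.example.com"   # placeholder target
    for port in (21, 25, 80, 443, 3389):
        state = "open" if tcp_port_open(host, port) else "closed/filtered"
        print(f"{host}:{port} {state}")
```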
With those techniques, we also investigated several common attacks, including (1) web site defacement, (2) phishing attacks, (3) unauthorized access to internal system information and databases, and (4) denial of service and vulnerability exploitation when systems are not patched promptly (such vulnerabilities could negatively impact business operations once they are exploited).
2. Web Sites and Data Collection Tools
2.1 Selected Web Sites
We selected ten HSI and ten CEI publicly listed companies. The HSI companies span the finance, banking, property, and utility industries, while the CEI companies span the transportation facility, oil, railway, manufacturing, telecommunications, and banking and finance industries [1]. All of the surveyed information could be obtained from Google and various network service providers. To protect the surveyed companies, the figures and comments are presented in a way that cannot be used to identify any individual company.
2.2 Data Collection Tools
The data collection tools used in this paper include DNSstuff, Google [2], Maltego, and Nmap. Table 2 lists the Google checking criteria used in our study: the common online services searched for, and their possible impacts if exposed.

Common Online Services and their Possible Impacts:
- Remote Desktop (Microsoft): allows malicious hackers to take over a machine by remotely connecting to the server or workstation.
- OWA (Microsoft Outlook Web Access): allows malicious hackers to get email access.
- Citrix Metaframe Login: allows malicious hackers to access internal networks.
- Lotus Notes: allows malicious hackers to get email access.
- System/source code/database backup files, server information and configuration files: expose system, internal, and business information.
- Various administrative consoles: allow malicious hackers to access internal applications.
- Restrict search engine with robots.txt: allows malicious hackers to gather information about folders that may contain sensitive information through the robots.txt file.
- Directory traversal: allows malicious hackers to gather sensitive information about a company's folders and files.
Table 2: Selected Google checking criteria for our study.
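As an illustration of one of the checks in Table 2, the following sketch (ours; the URL is a placeholder) fetches robots.txt and lists the Disallow entries that can reveal folders a company would rather keep unindexed:

```python
# Fetch robots.txt and print its Disallow entries. The URL below is a
# placeholder; check only sites you are authorized to assess.
from urllib.request import urlopen

url = "http://www.example.com/robots.txt"  # placeholder
with urlopen(url, timeout=5) as resp:
    for line in resp.read().decode("utf-8", "replace").splitlines():
        if line.lower().startswith("disallow:"):
            print(line)
```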
3. Results and Data Analysis
3.1 Risk Rating Definitions
Based on the authors' industry experience, we have established the risk-rating scheme shown in Table 3 to measure the risk exposure level of various services. (Similar research may be found in the annual Data Breach Report by Verizon Business, http://www.verizonbusiness.com/products/security/risk/databreach.)
Risk Rating Legend (each risk item is marked with one of the following ratings):
- Critical: the vulnerability could be used to compromise the application or infrastructure, resulting in a severe business impact.
- High: a medium to high level of technical knowledge is required for an attacker to gain unauthorized access to a system or data from a single vulnerability, or the vulnerability gives an attacker the ability to harm the professional image of a corporation.
- Medium: a vulnerability that by itself will not allow unauthorized access to systems or data; however, two or more Medium-rated vulnerabilities used in conjunction may allow an attacker unauthorized access to systems or data.
- Low: information disclosure. The information gleaned from these vulnerabilities will not allow an attacker to gain direct access to systems or data; it may, however, be used to escalate a separate vulnerability.
- Observation and Unknown: software or a system that is considered vulnerable, but on which the team was unable to complete testing.
Table 3: Risk Rating Definitions.
3.2 Data Analysis
Ten HSI Companies
We have summarized the results of the surveyed areas and the risk analysis on the HSI companies in Table 4.

Surveyed Areas, with Analysis on HSI and Analysis on CEI:

(1) Common and Necessary Open Ports.
HSI: For most of the web servers, only the necessary ports (TCP ports 80 and 443) are open, which is considered good practice.
CEI: Most of the web servers enable only the necessary ports (TCP ports 80 and 443), which is considered good practice.

(2) Allowed Open DNS.
HSI: Recursive lookups are enabled on most DNS servers. This may cause an excessive load on the DNS server but is not considered a serious risk.
CEI: Same as for HSI.

(3) Allowed Open Mail Relay.
HSI: No open mail relay was found among the surveyed companies, which is good practice against SPAM and non-legitimate mail delivery.
CEI: Same as for HSI.

(4) Web Server Banner and Version Exposure.
HSI: Most of the web servers, especially Microsoft IIS, are configured with the default IIS banner. If the service banners contain useful information such as the software name and its version number, then hackers can better target their exploits. Script-kiddies often scan whole IP blocks for a known vulnerability and attack only those hosts that return a banner showing that they run the vulnerable service.
CEI: Same as for HSI.

(5) SPAM Mail Black List.
HSI: None of the surveyed companies appears on SPAM mail black lists.
CEI: Eight servers are placed on some popular SPAM blacklists. Mails sent from these servers are blocked, and service availability is damaged.

(6) SPF Support.
HSI: Two public companies from the banking industry have adopted SPF to prevent forged email sending.
CEI: Same as for HSI.

(7) Sensitive Administrative Console and Server Information Exposure.
HSI: All scanned web servers are well protected against exposing an administrative console, implying that the console is well hidden and probably available only in an intranet environment.
CEI: Most of the scanned web servers are well protected against exposing an administrative console, implying that the console is well hidden and probably available only in an intranet environment.

(8) Opened Vulnerable Network Service Ports.
HSI: One company exposed its FTP port (21), which may not be necessary and should not be publicly accessible.
CEI: For most of the web servers, many unnecessary ports (TCP ports 21, 445, and 3389) are open that should not be accessible from the Internet. Some of them are even public service ports (possibly FTP and Windows Remote Desktop services). Once such a port is exploited, the hacker or attacker may gain full access to the system.

(9) Online Internal-use-only Services.
HSI: Generally acceptable, but using an insecure protocol like FTP (ftp.domain.com) and a self-describing naming scheme (secure.domain.com) is not recommended. An attacker might easily identify all sensitive components by a brute-force attack with a list of keywords (such as secure.domain.com, vpn.domain.com, ssl.domain.com, etc.).
CEI: Generally acceptable, with the same caveat about insecure protocols and self-describing naming schemes. However, one highway company exposed its content management system administrative console, which could be targeted by malicious hackers for account guessing.

Table 4: Summary of risk analysis on the ten HSI and CEI companies.
In our analysis of the ten HSI companies, the general controls over DNS and mail servers are satisfactory. The companies do not allow open mail relay, so external parties cannot send non-legitimate emails through their mail gateways. In addition, very few of them appear on SPAM mail black lists, and two international banks in Hong Kong actually use SPF.
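As an illustration of the SPF check in surveyed area (6), the sketch below (ours; it assumes the third-party dnspython package, and the domain is a placeholder) looks for a TXT record beginning with "v=spf1":

```python
# Sketch of an SPF lookup using the third-party dnspython package
# (pip install dnspython). The domain below is a placeholder.
import dns.resolver

domain = "example.com"  # placeholder
try:
    answers = dns.resolver.resolve(domain, "TXT")
    spf = [r.to_text() for r in answers if "v=spf1" in r.to_text()]
    print("SPF record:", spf[0] if spf else "none published")
except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN):
    print("no TXT records found for", domain)
```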
However, 50% of the surveyed companies allow open DNS queries, so external parties can query their internal IP address mappings, which could expose internal system infrastructure information. We have rated these items as Medium risk.
The opened-ports situation among the surveyed HSI companies is also satisfactory: they enabled only the necessary ports and did not disclose further information about their server services. However, over 80% exposed their web server banners and versions, and 40% exposed internal-use-only services, including FTP, webmail, and password-protected intranet sites, that should not be reachable or discoverable by the public. We have rated these items as Medium risk.
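As an illustration of the banner exposure check in surveyed area (4), the following sketch (ours; the host is a placeholder) issues an HTTP HEAD request and reads the Server response header:

```python
# Grab the HTTP Server banner with a HEAD request. The host is a
# placeholder; check only sites you are authorized to assess.
from http.client import HTTPConnection

host = "www.example.com"  # placeholder target
conn = HTTPConnection(host, 80, timeout=5)
conn.request("HEAD", "/")
response = conn.getresponse()
print("Server banner:", response.getheader("Server", "(not disclosed)"))
conn.close()
```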
The Google hacking results for the HSI companies are satisfactory: no company disclosed sensitive files or exposed administrative consoles. Nevertheless, a well-known local property company exposed its server information via a phpinfo() page, which reveals its IP address, internal system path mappings, the database management systems in use (whether LDAP and MySQL are running or not), and its web server versions. Even though this does not lead to immediate risk, we have rated it High for that property company because it implies that the company's control over change management, review, and server hardening is insufficient.
Ten CEI Companies
We have likewise summarized the results of the surveyed areas and the risk analysis on the CEI companies in Table 4. Our analysis shows that the general controls over DNS and mail servers in the selected CEI companies are satisfactory. They did not allow open mail relay for external parties to send non-legitimate emails through their mail gateways, and very few of them appear on SPAM mail black lists.

However, 70% of the surveyed CEI companies allowed open DNS queries, so external parties could query their internal IP address mappings, which could expose internal system infrastructure information. We have rated these items as Medium (Amber) risk.

50% of the surveyed CEI companies opened unnecessary ports and disclosed further information about their server services. In the worst case, two companies had opened nearly all vulnerable ports on their web servers, including ports for remote desktop connection, NetBIOS, Remote Procedure Call (RPC), and database connections. These are easy targets for worms and determined hackers seeking to compromise company services; we rate this risk area as Critical for these two CEI companies.

Over 80% of the CEI companies exposed their web server banners and versions, and 80% exposed internal-use-only services, including FTP, webmail, and password-protected intranet sites. One company even exposed its content management administrative console to the public, without a secure channel or any restriction on who may use the console. These services should not be searchable by or available to the public, and we have rated these items as High risk. In addition, 50% of the web servers disclosed their version information, and the disclosed versions are vulnerable to existing attacks [3], including Cross-Site Scripting (XSS), which allows malicious attackers to execute their own code through those web servers.

The Google hacking results for the CEI companies are satisfactory: no company disclosed sensitive files or exposed administrative consoles through search. Nevertheless, one property company exposed its content management administrative console, which we have rated as Medium risk.

Within these ten CEI companies, the railway and oil refinement companies with government support generally have stronger controls and server hardening than the other, privately owned companies.
4. Pyramid Risk Analysis
Based on the top vulnerable ports published in the SANS Top 20 in 2007 [5], we can identify the common vulnerable services and ports targeted by malicious hackers. Using the risk ratings defined in Section 3.1, we rate those vulnerable services and ports; the results are presented in Table 5.
Services Category (Risk Rating): services and their opened ports targeted by malicious hackers.
- Web server services (Low): HTTP (80/tcp, 8000/tcp, 8080/tcp, 8888/tcp).
- Naming services (Medium): DNS (53/udp) to all machines which are not DNS servers; DNS zone transfers (53/tcp); LDAP (389/tcp and 389/udp).
- Mail (Medium): SMTP (25/tcp); POP (109/tcp and 110/tcp); IMAP (143/tcp).
- Login services (High): telnet (23/tcp); FTP (21/tcp); NetBIOS (139/tcp); rlogin et al. (512/tcp to 514/tcp).
- RPC and NFS (High): Portmap/rpcbind (111/tcp and 111/udp); NFS (2049/tcp and 2049/udp); lockd (4045/tcp and 4045/udp).
- NetBIOS in Windows NT and XP (High): 135 (tcp and udp), 137 (udp), 138 (udp), 139 (tcp); for Windows 2000 and later, the same ports plus 445 (tcp and udp).
- X Windows (High): 6000/tcp to 6255/tcp.
- Miscellaneous (High): TFTP (69/udp); finger (79/tcp); NNTP (119/tcp); NTP (123/udp); LPD (515/tcp); Syslog (514/udp); SNMP (161/tcp and 161/udp, 162/tcp and 162/udp); BGP (179/tcp); SOCKS (1080/tcp).
- Database (Critical): Microsoft SQL Server (1433/tcp and 1434/udp); Oracle (1521/tcp); IBM DB2 (523, and 50000 and up); IBM Informix (9088/tcp and 9099/tcp); Sybase (4100/tcp or 2025/tcp); MySQL (3306/tcp); PostgreSQL (5432/tcp).
- Backup servers (Critical): Symantec Veritas Backup Exec (10000/tcp, 8099/tcp, 6106/tcp, 13701/tcp, 13721/tcp, and 13724/tcp); CA BrightStor ARCserve Backup Agent (6050/tcp, 6051/udp, 6070/tcp, 6503/tcp, 41523/tcp, 41524/udp); Sun and EMC Legato Networker (7937/tcp to 9936/tcp).
Table 5: Relationships between services and ports, and defined
risk ratings.
This paper proposes a pyramid risk analysis tool to illustrate these relationships, as shown in Figure 1. The pyramid contains all vulnerable ports and services addressed by SANS, with risk ratings inserted to indicate their risk levels. The apex of the pyramid refers to the most common and necessary ports accessible by the public, whilst the bottom of the pyramid refers to the ports and services that provide sensitive and confidential information and should not be accessible by external parties.

Figure 1: Pyramid risk tool.

If a company has opened high-risk services and ports, for example database and backup services, it is considered to be in danger and is classified into the critical risk level. If a company has opened only web services, it is of low risk.
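The classification logic of the pyramid tool can be sketched mechanically as follows. This is our illustration: the port-to-rating map is a small excerpt from Table 5, and the default rating for unlisted ports is an arbitrary choice.

```python
# Toy version of the pyramid classification: map each observed open
# port to its Table 5 rating and take the worst rating found. The map
# is a small excerpt from Table 5; the "Medium" default for unlisted
# ports is an arbitrary choice for this sketch.
RATING_ORDER = ["Low", "Medium", "High", "Critical"]

PORT_RATING = {
    80: "Low", 443: "Low",                  # web server services
    53: "Medium", 25: "Medium",             # naming and mail services
    21: "High", 23: "High", 139: "High",    # login services
    1433: "Critical", 3306: "Critical",     # database services
    10000: "Critical",                      # backup services
}

def company_risk(open_ports):
    """Overall risk = the highest-rated service exposed to the public."""
    ratings = [PORT_RATING.get(p, "Medium") for p in open_ports]
    return max(ratings, key=RATING_ORDER.index)

print(company_risk({80, 443}))        # Low
print(company_risk({80, 443, 3306}))  # Critical
```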
This pyramid risk tool and model was established according to the authors' industry experience in auditing, risk assessment, and penetration testing, and it incorporates the SANS Top 20 vulnerabilities [5]. The idea behind the pyramid risk tool is to present the relationship between risk levels and the number of services accessible externally and internally. The bottom of the pyramid represents the most critical and important systems and components, because they may contain sensitive enterprise data; once such a component is compromised, business operations can be severely impacted and interrupted. For example, an attack on a company web site alone may carry no legal or operational impact, but attackers can work their way from the top of the pyramid down through the exposed vulnerable services and eventually steal confidential company or customer information.
In addition, according to the 2009 Data Breach Investigations Report published by Verizon Business [6], which covers 90 confirmed data breach cases from 2008, database servers holding online data accounted for 75% of all compromised data records, more than any other company asset, including POS systems, application servers, kiosks, web servers, file servers, and workstations, while representing 30% of the breaches. This shows that the database server is one of the major targets that malicious hackers break into.

We applied the pyramid risk tool to evaluate the risk levels of the ten HSI companies and the ten CEI companies. The results are presented in Figure 2.
The results show that the risk level of the ten HSI companies is lower than that of the ten CEI companies. The situation is particularly dangerous for the two CEI companies that opened their database and backup service ports; they will attract attention from bots and worms for future attacks. By contrast, the ten HSI companies are regarded as safer because their servers enable only the fewest necessary services to be accessible by the public.
According to the 2009 Data Breach Investigations Report published by Verizon Business [6], remote access and management services and web applications are the most popular and easiest attack pathways, contributing 75% of the total number of breach cases. This raises real concerns over how an enterprise should protect its online assets. Most of the HSI companies are local companies with foreign investors, and their awareness and knowledge of web server hardening and server protection are far better than those of the surveyed CEI companies. In particular, the banks' online server security can be rated the most secure among all the surveyed industries, because the monetary authority in Hong Kong has imposed strict compliance requirements for technology risk control [4].
Figure 1: Pyramid risk tool.
To close the security gaps among the CEI companies, relevant security training and awareness programmes should be given to operational and administrative staff. In addition, the government should impose compliance requirements for online security and data privacy protection similar to those in the banking and finance industry. Furthermore, regular security audits and penetration tests should be conducted so that potential risks and flaws are detected earlier. We also suggest that companies put more effort into preparing detailed security policies and standards for daily operations and management. For more detailed recommendations, readers should refer to the SANS Top 20 [5] and the 2009 Data Breach Investigations Report [6] for in-depth control implementation.
The proposed pyramid risk analysis tool is used to rate the risk level of a company. Our results indicate that the risk level of the CEI companies is higher than that of the HSI companies because the CEI companies open high-risk ports and services to the public. These high-risk ports and services are dangerous because attackers are adept at exploiting them and can cause significant impact to the attacked company. The pyramid risk analysis tool can be applied more widely to survey additional public and private companies and to understand the risk level of companies in a particular geographical location, industry or nation. A minimal sketch of this survey step follows.
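The sketch below checks which of the Table 5 TCP ports a host exposes, so that the result can be fed to the classifier above. It is a best-effort illustration only, and assumes that scanning has been authorized by the target company; UDP services (such as Microsoft SQL Server's port 1434) would need different probes, and the wide Legato NetWorker range makes an exhaustive check slow.

import socket

def tcp_port_open(host, port, timeout=2.0):
    """Best-effort TCP connect check: True if the port accepts a connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def survey(host, service_risk):
    """Collect the externally reachable TCP ports from the Table 5 mapping."""
    open_ports = set()
    for _service, (_tier, ports) in service_risk.items():
        for proto, port in ports:
            if proto == "tcp" and tcp_port_open(host, port):
                open_ports.add((proto, port))
    return open_ports

# e.g. classify_company(survey("www.example.com", SERVICE_RISK))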
Figure 2: Risk levels of the selected ten HSI and ten CEI companies.
5. Conclusions
In this paper, we collected data from twenty public companies (ten HSI companies and ten CEI companies) and analyzed their risk levels.
In general, the web servers and other online services of the CEI companies are rather open and vulnerable to attack. In addition, some CEI companies do not own their servers but host their services via third-party service providers, which implies that they do not have adequate controls over their outsourcing vendors. Generally speaking, server hardening is weak in the CEI companies: exposed banners and server version strings give malicious attackers important information for carrying out further attacks. In addition, several of the published and exposed services have known exploits and have not been upgraded to the latest versions, which increases the risk of attack.
6. ACKNOWLEDGMENTS
Our thanks to the security researchers from Valkyrie X Research Lab, including C.K. Huen, Tony Miu, Alan Chung, William Cheung, Jason So, Jason Yeung, Eddie Lau and Leng Lee, for their help in data collection.
7. REFERENCES
[1] Hang Seng Indexes. http://www.hsi.com.hk/HSI-Net/
[2] Johnny Long. 2008. Google Hacking for Penetration Testers, Volume 2.
[3] Vulnerable Apache web server versions. http://httpd.apache.org/security/vulnerabilities_13.html
[4] Technology Risk Management, Hong Kong Monetary Authority. http://www.info.gov.hk/hkma/eng/bank/spma/index.htm
[5] SANS Top 20, 2008. http://www.sans.org/top20/
[6] 2009 Data Breach Investigations Report, May 2009. http://www.verizonbusiness.com/resources/security/reports/2009_databreach_rp.pdf