Homeland Security Data Mining using Social (Dark) Network Analysis
ISI 2008, Keynote Address
Hsinchun Chen, Ph.D.
Director, Artificial Intelligence Lab
Director, NSF COPLINK and Dark Web Research Centers
University of Arizona
Acknowledgements: NSF, LOC, ITIC/KDD, DHS, DOJ
© 2005

Overview

September 11, 2001, London subway bombing… Iraqi and Afghan wars… Madrid bombing, Dutch Hofstad group, Cairo bombing, Toronto plot, German terrorists… (After 2004) All relying on the Internet… Leaderless Jihad… Al Qaeda University on the Web… Cyber seduction for terrorist recruiting… Eurabia… ASEAN Regional Forum on fighting terrorism, separatism, and extremism…
The World is Flat… for better or for worse

Social Movement Organizations (SMO)
• (Political) Activism: political movement, e.g., Young Democrats, Internet petition, global warming
• Extremism: radical ideological movement, e.g., KKK, Skinheads, militia, animal rights, FLG
• Terrorism: violent political movement, e.g., ELF, ALF, Aum, Al Qaeda
…They are all using the web…

Terrorism, Terrorist Networks, Terror on the Internet, Leaderless Jihad

• Islamic fanatics in the Global Salafi Jihad (with roots in Egypt)
• Based on data about 172 Jihadists… social bonds predated ideological commitment
• Small-world network… network robustness… geographical distribution… fuzzy boundaries… cliques… the strength of weak bonds… the power of the Internet
M. Sageman, former foreign service officer in Islamabad, forensic psychiatrist (2004)

• Drawing on an eight-year study of the WWW
• Terrorist organizations and their supporters maintain hundreds of web sites
• Terrorist organizations exploit the Internet to raise funds, recruit members, plan and launch attacks, and publicize their chilling results
• New terrorism, new media… The war over minds… cyberterrorism
• Balancing security and civil liberties
G. Weimann, communication and mass media study, U.
of Haifa (2006)

• “The process of radicalization in a hostile habitat but linked through the Internet leads to a disconnected global network, the Leaderless Jihad.”
• From anecdote to data and from journalism to social sciences
• Going beyond incident databases; detailed evidence-based terrorist (500+) database
• Before 2004: face-to-face interactions, 26 years old
• After 2004: interaction on the Internet: Madrid, Dutch Hofstad, Cairo, Toronto… Irhabi007 and Muntada, 20 years old
M. Sageman (2007)

Intelligence and Security Informatics for International Security: Information Sharing and Data Mining

ISI: Overview
Intelligence and Security Informatics (ISI)
• “Development of advanced information technologies, systems, algorithms, and databases for international, national and homeland security related applications, through an integrated technological, organizational, and policy-based approach” (Chen et al., 2003a)

The World After 9/11, 2001
• WTC, Pentagon attacks
• Afghanistan, Iraqi wars
• Bali, Madrid, London bombings
• Sunni, Shia, sectarian wars
• Jihad, E-Jihad, infectious ideas
• Worms, viruses
• Infectious diseases, bioagents, WMDs
• International, regional, cultural, religious conflicts, …
• Traditional crimes, cyber crimes, narcotics, gangs (MS 13), smuggling, domestic extremists (Oklahoma bombing), cyber security, …

Related ISI Fields
• Network security (Corporate/DOD -- intrusion detection)
• System and information security (Corporate/DOD -- firewalls, viruses, worms, hacking)
• Cyber security (NSF/DOD -- network security, cyber crime)
• Forensics, computer forensics (FBI/Police -- fingerprint, DNA, IP addresses, voiceprint, writeprint)
• Crime analysis (Police/FBI -- information sharing, data mining)
• Intelligence analysis (CIA/NSA -- surveillance, intelligence collection, multilingual data mining)
• Terrorism study (White House -- policy, incident analysis)
• Defense information warfare (DOD -- propaganda,
counterintelligence, psychological warfare)

Crime and Security Concerns
[Figure: crime types and security concerns]

A knowledge discovery research framework for ISI
[Figure: a knowledge discovery research framework for ISI]

ISI Research: KDD Techniques
• Information Sharing and Collaboration
• Crime Association Mining
• Crime Classification and Clustering
• Intelligence Text Mining
• Crime Spatial and Temporal Mining
• Criminal Network Analysis

National Security Critical Mission Areas and AI Lab Projects
• Intelligence and Warning: Dark Web
• Border and Transportation Security: BorderSafe
• Domestic Counter-terrorism: COPLINK, Dark Web
• Protecting Critical Infrastructure and Key Assets
• Defending Against Catastrophic Terrorism: Dark Web, BioPortal
• Emergency Preparedness and Responses

• Intelligence and Security Informatics (ISI): “Development of advanced information technologies, systems, algorithms, and databases for national security related applications, through an integrated technological, organizational, and policy-based approach” (Chen et al., 2003a)
• Data, text, and web mining
• From COPLINK to Dark Web
H. Chen, computer scientist, artificial intelligence, U.
of Arizona (2006)

The ISI Communities
• IEEE Intelligence and Security Informatics (ISI) Conference: 2003 (Tucson), 2004 (Tucson), 2005 (Atlanta), 2006 (San Diego), 2007 (New Brunswick), 2008 (Taiwan)
• Pacific-Asia ISI Workshop (PAISI): 2006 (Singapore), 2007 (Chengdu, China), 2008 (Taiwan)
• EU-ISI Workshop: 2008, Denmark

COPLINK
• 1996-, DOJ, NIJ, NSF, ITIC, DHS
• Connect
• Detect
• Agent
• STV (Spatio-Temporal Visualization)
• CAN (Criminal Activity Network)
• BorderSafe (Mutual Information)
• AI Lab, Knowledge Computing Corporation
• Tucson, Phoenix AZ; 1600 agencies in US

• The New York Times, November 2, 2002
• ABC News, April 15, 2003
• Newsweek Magazine, March 3, 2003

New York Times, Nov 2, 2002
Human, field intelligence
Artificial intelligence

Dark Web
• 2002-, ITIC, NSF, LOC
• Discussions: FBI, DOD/Dept of Army, NSA, DHS
• Connection:
  – Web site spidering
  – Forum spidering
  – Video spidering
• Analysis and Visualization:
  – Link and content analysis (web sites)
  – Web metrics analysis (web site sophistication)
  – Authorship analysis (forums; CyberGate)
  – Sentiment analysis (forums; CyberGate)
  – Video coding and analysis (videos; MCT)

The Dark Web project in the Press
• Project Seeks to Track Terror Web Posts, 11/11/2007
• Researchers say tool could trace online posts to terrorists, 11/11/2007
• Mathematicians Work to Help Track Terrorist Activity, 9/14/2007
• Team from the University of Arizona identifies and tracks terrorists on the Web, 9/10/2007

Dark Web hunts terror suspects, with Hsinchun Chen in the lead: developing web detection software to automatically track terrorist leaders
Director of the University of Arizona Artificial Intelligence Lab, Chinese-American scientist 陳炘鈞 (Dr.
Hsinchun Chen)

(Social/Criminal) Network Analysis

Existing Network Analysis Tools
• First generation: manual approach
  – Anacapa Chart (Harper & Harris, 1975)
• Second generation: graphics-based approach
  – Analyst’s Notebook, Netmap, Watson
  – COPLINK hyperbolic tree view, network view
• Third generation: structural analysis approach

Anacapa Chart (1st generation)
[Figure: association matrix and link chart]

Analyst’s Notebook, Netmap, Watson (2nd generation)
• Analyst’s Notebook. Network nodes are automatically arranged for easy interpretation. Source: i2, Inc.
• Netmap. Different colors are used to represent different entity types. Source: Netmap Analytics, LLC.
• Watson. Relations among a group of people (the central sphere) based on telephone records. Source: Xanalysis, Ltd.

A 9/11 Terrorist Network

Analyst’s Notebook & Starlight
• Analyst’s Notebook, by i2: a 2D graph and timeline layout tool for crime and intelligence analysis
• Starlight, by Pacific Northwest Lab (PNL): a 3D network visualization and navigation tool for intelligence analysis
[Figures: Analyst’s Notebook, i2; Starlight, PNL]

SNA
• Social Network Analysis (SNA) has been widely used to study real-world networks, including dark networks (Kaza, 2005; Koschade, 2006; Xu & Chen, 2005).
• These include qualitative studies that examine the facilitators of link formation and quantitative studies that use statistical methods to measure existing networks.

Characterizing Topological Properties
• L: Average path length
  – The average of all-pair shortest path lengths
• C: Clustering coefficient
  – The tendency to form clusters and groups
  – $C_i = \frac{2 m_i}{k_i (k_i - 1)}$, where $k_i$ is the degree of node i and $m_i$ is the number of links among its neighbors
• p(k): Degree distribution
  – The probability that a randomly selected node has exactly k links
[Figure: example graphs with L = 1 and L = 2; example nodes with C_i = 1.0 and C_i = 3/6 = 0.5; sketch of p(k) versus k]
Network Topology Models
• Random model (Erdős & Rényi, 1959)
• Small-world model (Watts & Strogatz, 1998)
• Scale-free model (Barabási & Albert, 1999)

Random Network
• The probability that two arbitrary nodes are connected is a fixed number, p
• As a result, all nodes have roughly the same number of links (characterizing degree = average degree <k>)
• Random networks are characterized by
  – Short distance
  – Low clustering coefficient
  – Poisson degree distribution (bell-shaped)

Small-World Network
• Average path length:
  – $L_{sw} \sim L_{random}$
• Clustering coefficient:
  – $C_{sw} \gg C_{random}$
• Degree distribution
  – Similar to that of random networks
• Applications
  – The “19 degrees of separation” on the Web (Albert, Jeong, & Barabási, 1999)
  – The small-world properties of metabolic networks in cells imply that cell functions are modularized and localized

Scale-Free Network
• “Scale free” means there is no single characterizing degree in the network
  – Growth
    • Instead of having a fixed number of nodes, the network can grow and include new nodes
  – Preferential attachment
    • $p_i \sim k_i / \sum_j k_j$
    • A node that already has many links is more capable of attracting links from new nodes: “Rich get richer”
• The degree of scale-free networks follows the power-law distribution with a flat tail for large k, $p(k) \sim k^{-\gamma}$
• The ubiquity of SF networks leads to a conjecture that complex systems are governed by the same self-organizing principle

Other Topological Properties
• Number of nodes (n), number of links (m)
• Average degree: $\langle k \rangle = \frac{2m}{n}$
• Density: $d = \frac{2m}{n(n-1)}$
  – The number of actual links divided by the possible number of links in the network
• Assortativity
  – The Pearson correlation between the degrees of two adjacent nodes
• Global efficiency
  – The average of the inverses of the shortest path lengths over all pairs of nodes

Robustness of SF Networks
• Many complex systems display a surprising degree of robustness against errors, e.g.,
  – Organisms grow, persist, and reproduce despite drastic changes in environment
  – Although local area networks often fail, they seldom bring the whole Internet down
• In addition to redundant rewiring, what else can play a role in the robustness of networks? Is it because of the topology (structure)?

Robustness Testing
• How will the connectivity of a network be affected if some nodes are removed from the network?
• How will random node removal (failure) and targeted node removal (attack targeting hubs) affect
  – S: the fraction of nodes in the giant component
  – L: the average path length of the giant component

Robustness Testing (Cont’d)
• SF networks are more robust against failures than random networks because of their skewed degree distribution
• SF networks are more vulnerable to attacks than random networks, again because of their skewed degree distribution
• The power-law degree distribution becomes the Achilles’ heel of SF networks
[Figure: network response to failure and to attack. Adapted from Albert, Jeong, & Barabási (2000)]

Dynamic SNA Methods
• Previous studies focused on static network structures rather than dynamic processes due to:
  – lack of reliable data recovery techniques (Kossinets & Watts, 2006; Moody et al., 2005)
  – few appropriate network measures (Kossinets & Watts, 2006; Wasserman & Faust, 1994)
  – little application of statistical methods for evolving networks

Network Measurement
• Most empirical studies on longitudinal data plot descriptive measures over time.
• Three main types of measures are used in dynamic SNA
  – deterministic measures
  – probabilistic measures
  – temporal measures

Criminal Networks: Structured Information, Police Reports, Criminal Associations

COPLINK Connect
Consolidating & sharing information promotes problem solving and collaboration
[Figure: Records Management Systems (RMS), Gang Database, Mugshots Database]

COPLINK Detect
Consolidated information enables targeted problem solving via powerful investigative criminal association analysis

COPLINK Detect 2.0/2.5

Association Retrieval and Visualization

System Architecture
[Diagram: Criminal-justice Data, Network Creation, Concept Space, Networked Data, Structural Analysis, Network Partition, Hierarchical Clustering, Centrality Measures, Blockmodeling, Network Visualization, MDS]

Network Partition: Hierarchical Clustering
• Major algorithm selection criterion: time complexity
• RNN-based CLINK algorithm (Murtagh, 1984)
  – $O(n^2)$ time
  – $O(n^2)$ space
• Algorithm modification
  – Observation: the original network may not be a connected graph but may consist of several disjoint sub-networks, between which no link exists
  – Output contains multiple hierarchies

SNA and Network Visualization
• SNA
  – Central member identification
    • Degree: counting the direct links a node has
    • Betweenness: using Dijkstra’s shortest-path algorithm (1959)
    • Closeness: using results from the betweenness calculation
  – Blockmodeling
• Network visualization: MDS
  – Calculating the location (x-y coordinates) of each node based on a distance measure (Torgerson’s algorithm)

System Interface
• Nodes represent individual criminals, labeled by their names
• Links represent relationships between criminals
• Adjust the slider to perform clustering and blockmodeling

System Interface
The reduced star structure found using blockmodeling
• Circles represent groups.
• The size of a circle is proportional to the number of group members.
• Each group is labeled by its leader’s name.

System Interface
• The rankings of each group member in terms of centrality measures
• The first one of each column is the leader, gatekeeper, and outlier, respectively
• The inner structure of a selected group
• Adjust the slider to do further blockmodeling

The 744-Member Narcotics Network
The “Meth World”: red nodes represent criminals who had been involved in meth-related crimes since 1995

Subgroup Detection
• Subgroups detected have different characteristics: the subgroups found are consistent with the groups’ specializations or responsibilities in a network
  – White gang members who were involved in assaults and murders
  – White gang members who were involved in crack cocaine
  – Drug dealers
  – Offenders who were responsible for stealing, counterfeiting, and cashing checks and providing money to other groups to carry out drug transactions

Central Member Identification
• A member who scores the highest in degree can be a group leader
[Figure: a group leader identified by the system. This person has a lot of money and plays important roles in drug transactions]

Interaction Pattern Identification
• Frequency of interaction (represented by thickness of lines) between subgroups can indicate the strength of the between-group relationship
[Figure: frequent interactions between the two groups (their leaders were good friends)]

Extraction of Overall Network Structure
A chain structure found in a 60-member network using blockmodel analysis

Usefulness
• Saving investigation time
• Saving training time for new investigators
• Suggesting investigative leads that might otherwise be overlooked
• Helping prove guilt of criminals in court

Temporal Network Analysis
• Research objectives
  – Applying various measures to capture and predict changes in criminal networks over time
• Unit of analysis
  – Individual level
    • Centrality measures: Who will be the next key members?
  – Group level
    • Density: What does the change in density imply about group membership (recruitment and turnover)?
    • Cohesion: Do groups become more cohesive or less cohesive over time? What does this change imply about the operation of the criminal groups?
  – Network level
    • Overall structure: How does the overall network structure of a criminal enterprise change over time? What does this imply about changes in the organization of a criminal enterprise?

The Evolution of “Meth World”
[Figure: the network in years 1995, 1996, 1998, 1999, 2002]

The Evolution of “Meth World”
• Both density and cohesion of the highlighted group dropped in 1994, possibly indicating a turnover
• No connections with people outside of the group existed during 1995 and 1996. The group stayed highly cohesive
• In 1998, 1999, and 2001, group cohesion dropped while density remained high, indicating a tendency to connect to people outside of the group and to recruit new members

Dynamic Network Analysis Research Testbed
• Two related real-world datasets:
  – police incident reports from Tucson Police Department (TPD)
    • 2.03 million individuals
    • 1.34 million vehicles
    • 1990-2005
  – inmate information from the Arizona Department of Corrections (ADOC)
    • 165,540 jailed individuals
    • 1986 to 2006

Facilitator Identification
• In this study, the facilitators included three individual attributes and five shared affiliations.
• Individual attributes: age, race, gender
• Shared affiliations: mutual acquaintance, inmate affiliation, vehicle affiliation, phone affiliation, residential address
• These facilitators were selected based on previous studies and input from domain experts.

Statistical Analysis
• Cox survival analysis was used to examine the significance of facilitators.
  $h(t, x_1, x_2, x_3, \ldots) = h_0(t) \exp(\beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \ldots)$
• $h(t, x_1, x_2, x_3, \ldots)$ is the instantaneous hazard: the probability that the event will happen at time t
  – given that the event has not happened up until time t
  – with the observations of various independent variables ($x_1, x_2, x_3, \ldots$)
• The dependent variable indicates whether a pair of individuals i and j with $d_{ij} = 2$ would subsequently form a new link at time t.

Experimental Results
[Figure: hazard ratios (g) for Vehicle, Mutual acq., Age, Race, and Gender; X-axis from 0 to 45]
Results of multivariate survival analysis (Cox regression) of triadic closure for pairs of individuals. On the X-axis, the figure shows the hazard ratios and their 95% confidence intervals. The probability of triadic closure increases by a factor of the hazard ratio (g) when the corresponding independent variable increases by one unit.

Experimental Results (cont.)
Facilitator | Significant/insignificant in predicting future co-offending
Mutual acquaintances | Significant; criminals with shared ‘friends’ are likely to co-offend in the future
Shared vehicles | Significant; common vehicles point to hidden/future operational links
Homophily in age, race, gender | Insignificant; crime crosses race and gender boundaries (especially in an immigrant city like Tucson)
Common jails | Insignificant; ADOC’s jail segregation system appears to work. Important policy implications.

Link Prediction
• Cox regression can also be used to determine the scale of influence for each of the facilitators.
• Sharing the same vehicle in different crimes increases the probability of triadic closure by a factor of 9.38, and each additional mutual acquaintance increases it by a factor of 10.79.
• Therefore, if two unconnected criminals have used the same vehicle in different crimes and have five mutual acquaintances, then they are $9.38^{1} \times 10.79^{(5-1)} \approx 127141.88$ times more likely to co-offend in the future.
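
The Link Prediction arithmetic above can be checked with a short sketch. It simply multiplies the two hazard ratios quoted on the slide; the function name and parameters are hypothetical, and treating the first mutual acquaintance as the baseline (so only the additional ones count, matching the slide's exponent of 5-1) is an assumption made here, not something the slides state.

```python
def combined_hazard_ratio(shared_vehicle, mutual_acquaintances,
                          hr_vehicle=9.38, hr_mutual=10.79):
    """Multiply per-facilitator hazard ratios (values quoted on the slide).

    Assumption: one mutual acquaintance is the baseline, so only the
    additional ones raise the ratio, as in the slide's exponent (5 - 1).
    """
    ratio = hr_vehicle if shared_vehicle else 1.0
    ratio *= hr_mutual ** max(mutual_acquaintances - 1, 0)
    return ratio


# The slide's example: same vehicle used, five mutual acquaintances.
example = combined_hazard_ratio(shared_vehicle=True, mutual_acquaintances=5)
print(round(example, 2))  # 127141.88
```

Under the Cox model the covariate effects multiply because they add inside the exponential, which is why the per-facilitator ratios are combined as a product rather than a sum.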
Terrorist Networks: Unstructured and Multilingual, Intelligence Reports, Family/Friendship/Disciple Affiliations

A 9/11 Terrorist Network

The Global Salafi Jihad (GSJ) Network
• Based on Dr.
Marc Sageman’s book and data
• Data collected and cross-validated from open sources regarding 366 GSJ members
• Background
  – 75% from upper or middle class
  – Average age is 26
  – Affiliation through friendship, kinship, discipleship, and worship
• Four clumps (based on geographical location)
  – Central Staff
  – Core Arab
  – Maghreb Arab
  – Southeast Asian

GSJ (Cont’d)
• Each clump has its Hubs: important, popular members with many links (high degree)
• The Central Staff clump connects with the other three clumps through Lieutenants: important connectors (high betweenness)
• A clump may contain Cliques: members are nearly fully connected
• Clumps have different structures
  – Scale-free network
  – Hierarchical network

GSJ (Cont’d)
Clump | Hub | Lieutenants | Network Structure
Central Staff | Osama bin Laden | - | -
Core Arab | Khalid Sheikh Mohammed (KSM) | Waleed Mohd Tawfiq bin Attash, Abdal Rahim al Nashiri, Ramzi Mohd Abdullah bin al Shibh | Scale free
Maghreb Arab | Zain al Abidin Mohd Hussein | Fateh Kamel, Amar Makhlulif | Scale free
Southeast Asian | Abu Bakar Baasyir | Encep Nurjaman, Ali Ghufron | Hierarchical

The Dataset
• Excel spreadsheet containing the information about the 366 GSJ members.
• Data characteristics
  – Node: individual terrorist
    • Short name, full name, DOB, education, marital status, etc.
  – Link: relation
    • Operational link (based on attacks)
    • Personal link
      – Acquaintance
      – Friends
      – Family
      – Relative
      – Religious
    • Post-join tie

The Network (with all links)
• A lieutenant acting as a gatekeeper to connect two clumps
• Southeast Asians are led by their own leader
• Three clumps (central, core, Maghreb) are led by members from the “central staff” clump
[Legend: Clumps: Central Staff, Core Arab, Southeast Asian, Maghreb Arab. Node size: Leader; Lieutenant (an important person linking two groups); Other people. Fate: Dead, Captured]

Operational Links
[Figure labels: Bali, 2002; Jakarta, 2003; Singapore Plot, 2001; 9/11, 2001; Strasbourg, 1999; LAX,
1999; France, 1995; Casablanca, 2003; Emb, 1998; Morocco, 1994; Istanbul, 2003]

All Personal Links

Personal Links vs. Operational Links
How did they get involved in 9/11?

Finding the Path Resulting in an Attack

Finding the Network Structure

Use PageRank Algorithm to Calculate the Importance Value of Each Terrorist
• PageRank (Brin & Page, 1998) is a well-known algorithm designed to calculate the authoritativeness of Web pages based on the Web link structure.
• We borrowed the main idea of PageRank to calculate the importance value of each terrorist in the terrorist network based on their relationships:
  – Step 0: Initially, assign equal PageRank scores (importance values) to every terrorist in the network.
  – Step 1: For every terrorist p in the network, calculate its PageRank score as follows:
    $PageRank(p) = \frac{1 - d}{n} + d \sum_{q \text{ links to } p} \frac{PageRank(q)}{c(q)}$
    where d is the damping factor, n is the number of terrorists in the network, and c(q) is the number of links of q.
  – Step 2: Repeat Step 1 until the changes in the PageRank scores are smaller than a threshold value (convergence).

Build Authority Derivation Graph: Reveal the Social Hierarchy among Terrorists
• For each terrorist p in the original full network:
  – Find all the terrorists {q1, q2, …, qm} that have relationships to p
  – Find the qi who has the highest importance value among {q1, q2, …, qm}
  – Draw a directed link from p to qi, indicating that qi is the direct leader of p.
[Figure: example in which p is linked to q1 through q6, with importance values between 0.12 and 0.18 shown on the nodes, and the resulting directed link from p to q1]

The ADG of the GSJ Network (n = 1)
[Figure legend: Central Staff, Core Arab, Southeast Asian, Maghreb Arab]

The ADG of the GSJ Network
• From the ADG of the GSJ network, we can clearly see that:
  – The network has a fanning-out hierarchical structure.
  – The people who were stated as leaders in Dr. Sageman’s book also appear to be leaders in each level of the hierarchy in the network (their names are marked in red).
  – Some leaders have many directly related underlings (e.g.
Hambali), while others have fewer directly related underlings but many levels of underlings (e.g. bin Laden).
  – The whole network seems to be composed of two parts: the north part led by Hambali and the south part led by bin Laden.

Dark Networks: Topology, Disruption Strategy

Introduction
• Many “Dark Networks” (e.g., terrorist networks, drug-trafficking networks, arms smuggling networks, etc.) are hidden from our view yet could bring devastating impacts to our society and economy
• Traditionally, due to the difficulty of collecting and accessing reliable data sources, the topology of these networks is largely unknown
  – Do dark networks share the same topological properties with other empirical networks?
  – Do they follow the same self-organizing principle?
  – How do they achieve efficiency under constant surveillance and threats from authorities?
  – How robust are they against attacks?

The Four Dark Networks
• The Global Salafi Jihad (GSJ) terrorist network
  – Nodes: terrorists from four terrorist groups: Central Staff, Core Arabs, Maghreb Arabs, and Southeast Asian
  – Links: personal links (kinship, friendship, religious ties) and relations formed after joining the GSJ
• The narcotics-trafficking network (Meth World)
  – Nodes: criminals involved in meth-related crimes between 1985 and 2002
  – Links: co-occurrence relations extracted from crime incident reports
• The gang network
  – Nodes: criminals involved in gang-related crimes between 1985 and 2002
  – Links: co-occurrence relations extracted from crime incident reports
• The terrorist web sites (Dark Web)
  – Nodes: Web sites created by four terrorist groups (Al-Gama’a al-Islamiyya, Hizbollah, Al-Jihad, and Palestinian Islamic Jihad) and their supporters
  – Links: composite hyperlinks

The Dark Networks (Cont’d)

Basic Statistics
Measure | GSJ | Meth World | Gang Network | Dark Web
Number of Nodes | 366 | 1349 | 3917 | 104
Number of Links | 1247 | 2392 | 9051 | 156
Size of the Giant Component | 356 (97.3%) | 924 (68.5%) | 2231 (57.0%) | 80 (77.9%)
Link Density | 0.02 | 0.01 | 0.003 | 0.05
Average Degree | 6.97 | 4.62 | 2.87 | 1.94
Maximum Degree | 44 | 37 | 51 | 33
Assortativity | 0.41** | -0.14** | 0.17** | -0.24*
For the giant component. * p < 0.05, ** p < 0.01

Small-World Properties
Measure | GSJ Data | GSJ Random | Meth World Data | Meth World Random | Gang Network Data | Gang Network Random | Dark Web Data | Dark Web Random
Diameter | 9 | 6.00 (0.263) | 17 | 9.57 (0.556) | 22 | 16.40 (0.516) | 12 | 13.16 (0.830)
Average Path Length L | 4.20 | 3.23 (0.040) | 6.49 | 4.52 (0.056) | 9.56 | 4.59 (0.034) | 4.70 | 3.15 (0.108)
Global Efficiency | 0.28 | 0.33 (0.004) | 0.18 | 0.23 (0.003) | 0.12 | 0.23 (0.001) | 0.30 | 0.34 (0.019)
Clustering Coefficient C | 0.55 | 0.020 (0.0029) | 0.60 | 0.005 (0.0014) | 0.68 | 0.002 (0.0005) | 0.47 | 0.049 (0.0155)

Findings about SW Properties
• Dark networks are sparse
• Dark networks are small worlds
  – The average path length (and diameter) is small relative to the network size but slightly larger than that in the random graph counterpart
  – The clustering coefficient is significantly greater than that in the random graph counterpart
• Network members are extremely close to their leaders
  – GSJ: 2.5 steps to Bin Laden, on average
  – Meth World: 3.9 steps to its leader, on average
• GSJ and the gang network are assortative, while the Meth World and the Dark Web are disassortative

Scale-Free Property
Network | p(k) | R²
GSJ | $p(k) = 0.45 k^{-1.38}$ | 0.74
Meth World | $p(k) = 0.86 k^{-1.86}$ | 0.89
Gang Network | $p(k) = 1.14 k^{-1.95}$ | 0.81
Dark Web | $p(k) = 0.35 k^{-1.10}$ | 0.82

Cumulative Degree Distributions
[Figure: cumulative degree distributions P(k) versus k on log-log axes for GSJ, Meth World, Gang Network, and Dark Web, each showing the data and a power-law fit]

Findings about SF Properties
• All four networks display scale-free characteristics
• The power-law distributions fit especially well with the data for large degrees
• The three human networks show somewhat two-regime scaling behavior which
may be due to (Barabási et al., 2002)
  – New links between existing members
  – Rewiring

Implications
• Sparseness and short paths between network members
  – Enhanced efficiency in flow and transmission of information and goods
  – Reduced risks of being detected and captured by authorities
• High clustering coefficient
  – High tendency to form groups and teams
  – Enhanced efficiency in flow of resources within the local group
• High closeness to network leaders
  – Short chain of command and high communication efficiency
• The Dark Web is a special case with relatively large path length (4.70)
  – Reluctance to share potential resources with other terrorist groups
• Dark networks may form following the self-organizing principle

Robustness against Attacks
• Two types of attacks
  – Simultaneous attacks (the degree/betweenness of nodes is not updated after each removal)
  – Progressive attacks (the degree/betweenness of nodes is updated after each removal)
• Two attack strategies
  – Attack on hubs (highest degree)
  – Attack on bridges (highest betweenness)

Simultaneous vs. Progressive Attacks
[Figure: bridge attack on the GSJ network. One panel plots S against the fraction of nodes removed for simultaneous and progressive attacks, with the points fs and fp marked; the other plots average path length against the fraction of nodes removed for the same two attack types]

Hub vs.
Bridge Attacks Meth World GSJ 1 1 0.9 0.8 S (Hub attacks) 0.6 S (Bridge attacks) 0.5 S (Hub attacks) S and <s> S and <s> 0.7 0.4 S (Bridge attacks) 0.3 0.2 0.1 0 0 0 0.2 0.4 0.6 0.8 0 1 0.2 Fraction of nodes removed Fraction of nodes removed Gang Netw ork 1 1 0.8 S (Hub attacks) 0.8 S (Bridge attacks) 0.6 S and <s> S and <s> S (Hub attacks) S (Bridge attacks) 0.4 0.2 0.6 0.4 0.2 0 0 0.2 0 0 Fraction of nodes removed © 2005 0.1 0.2 0.3 0.4 0.5 Fraction of nodes rem oved 101 101 Findings and Implications • Dark networks are more vulnerable to progressive attacks than simultaneous attacks • Dark networks are more vulnerable to bridge attacks than to hub attacks © 2005 102 102 How Well has the Authority Done? (close to random!) Disruption of the GSJ Network S l 7 1.2 Preferential 1 Preferential Real Random 6 Real Random 0.8 5 4 0.6 3 0.4 2 0 0 a © 2005 19 93 19 95 19 95 19 98 19 99 20 00 20 01 20 01 20 01 20 01 20 02 20 02 20 03 20 03 20 03 1 19 93 19 95 19 95 19 98 19 98 19 99 20 01 20 01 20 01 20 01 20 02 20 02 20 02 20 03 20 03 20 03 0.2 b 103 103 Dark Web: Unstructured and Multilingual, Web 1.0 and 2.0, Multifaceted Analysis (Content, Authorship, Sentiment) © 2005 104 • Drawing on an eight-year study of WWW • Terrorist organizations and their supporters maintain hundreds of web sites • Terrorist organizations exploit the Internet to raise funds, recruit members, plan and launch attacks, and publicize their chilling their results. • New terrorism, new media…The war over minds…cyberterrorism • Balancing security and civil liberties © 2005 G. Weimann, communication and mass media study, U. 
of Haifa (2006)

• “The process of radicalization in a hostile habitat but linked through the Internet leads to a disconnected global network, the Leaderless Jihad.”
• From anecdote to data and from journalism to social sciences
• Going beyond incident databases; detailed evidence-based terrorist (500+) database
• Before 2004, face-to-face interactions, 26 years old
• After 2004, interaction on the Internet: Madrid, Dutch Hofstad, Cairo, Toronto…Irhabi007 and Muntada, 20 years old
M. Sageman (2007)

Web Site Example: Links to Multimedia and Manuals
• Link to “The General of Islam” Radio Station
• Azzam Speeches
• Berg beheading
• Other videos of Zarqawi
• Complete 65-page manual of a 50 caliber rifle in PDF
• Source: http://www.al-ghazawat.110mb.com/, French and Arabic Web Site

Web Site Example: Links to Web Sites and Forums
• Links to Several Iraqi Jihadist Web Sites and Forums
• Source: http://almaaber.jeeran.com/, Arabic Web Site

Web Link Analysis – Generating Hyperlink Diagrams
1. Calculate a similarity measure for a pair (A,B) of Web sites based on:
   1. Number of hyperlinks between the two Web sites
   2. Level of the hyperlinks in the Web site hierarchy
   Similarity(A, B) = \sum_{\text{all links } L \text{ b/w } A \text{ and } B} \frac{1}{1 + lv(L)}
   where L is a hyperlink between site A and B; lv(L) is the level of hyperlink L in the Web site hierarchy.
2. The similarity matrix is fed to the multidimensional scaling algorithm (MDS), which generates a 2-dimensional graph of Web sites with embedded distance (similarity) information.
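
The similarity measure in step 1 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the measure is Similarity(A, B) = sum over all links L between A and B of 1 / (1 + lv(L)), that links in either direction count toward the pair, and that lv(L) is 0 for a link on a site's home page. The site names, the link list, and the function name `similarity_matrix` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def similarity_matrix(sites, links):
    """Return {(A, B): similarity} for every unordered pair of sites.

    links: iterable of (source_site, target_site, level) tuples, where
    level is the depth of the hyperlink in the source site's hierarchy
    (assumed here to be 0 for the home page).
    """
    sim = defaultdict(float)
    for src, dst, level in links:
        if src == dst:
            continue  # ignore links that stay inside one site
        pair = tuple(sorted((src, dst)))  # treat A->B and B->A as the same pair
        sim[pair] += 1.0 / (1.0 + level)  # shallow links contribute more
    # Pairs with no links between them get similarity 0.0
    return {pair: sim.get(pair, 0.0) for pair in combinations(sorted(sites), 2)}


# Hypothetical example data: three sites and three cross-site hyperlinks
sites = ["site_a", "site_b", "site_c"]
links = [
    ("site_a", "site_b", 0),  # home-page link
    ("site_b", "site_a", 1),  # one level down
    ("site_a", "site_c", 3),  # deep link
]

matrix = similarity_matrix(sites, links)
for pair, score in matrix.items():
    print(pair, score)
```

The resulting pairwise scores would then be arranged as a matrix and passed to an MDS routine (step 2), which is not shown here.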
Proposed Approach - Content Analysis Coding Scheme
High Level Attribute: Low Level Attributes
• Communications: Email; Telephone; Multimedia; Online Feedback Form
• Fundraising: External Aid Mentioned; Fund Transfer; Donation; Charity; Support Groups
• Sharing Ideology: Mission; Doctrine; Justification of the Use of Violence; Pin-pointing Enemies
• Propaganda (insiders): Slogans; Dates; Martyrs; Leaders; Banners and Seals; Narratives of Operations and Events
• Propaganda (outsiders): References to Western Media Coverage; News Reporting
• Virtual Community: Listserv; Text Chat Room; Message Board; E-conferencing; Webring
• Command and Control: Tactics; Organization Structure; Recording or Videos from Senior Members of the Group; Documentation of Previous Operations
• Recruitment and Training: Operations’ Geographical Area; Explicit Invitation to Join the Movement or Group

U.S. Domestic and Middle Eastern Terrorist/Extremist Web Site Testbed
U.S. Domestic (Category | # URLs | Example Group)
Black Separatist | 2 | “Nation of Islam”
Christian Identity | 13 | “Kinsman Redeemer Ministries”
Militia | 8 | “Michigan Militia”
Neo Confederate | 4 | “Texas League of the South”
White Supremacy | 7 | “Ku Klux Klan”
Neo-Nazis | 9 | “American Nazi Party”
Ecoterrorism/Animal Rights | 1 | “Animal Liberation Front”
Total | 44 |
Middle Eastern (Category | # URLs | Example Group)
Sunni | 24 | “Al-Qaeda”
Shi’a | 5 | “Hizbollah”
Secular | 10 | “Al-Aqsa Martyr’s Brigades”
Total | 39 |

Results – Hyperlink Diagram of U.S. Domestic Groups’ Web sites

Results – Hyperlink Diagram of Middle Eastern Groups’ Web sites
[Figure: hyperlink diagram with labeled clusters: Hizb-ut-Tahrir, Jihad Sympathizers, Tanzeem-e-Islami, Hizbollah, Al-Qaeda linked Web sites, Palestinian terrorist groups.]

Results – Web Usage Patterns for U.S.
Domestic Groups
[Figure: Normalized Content Levels by group (Black Separatists, Christian Identity, Militia, Neoconfederates, Neo-Nazis/White Supremacists, Eco-Terrorism) for eight attributes: Communications, Fundraising, Ideology, Propaganda (insiders), Propaganda (outsiders), Virtual Community, Command and Control, Recruitment and Training.]

Results – Summary of Web Usage Patterns for U.S. Domestic Groups
• “Ideology” and “Propaganda towards insiders” were allocated the highest amount of Web site resources, followed by “Communications.”
• Eco-terrorism and animal rights groups allocated more Web site resources to “Communications” and “Command and Control.”
• “Propaganda towards outsiders” and “Virtual Community” had very limited appearance in U.S. domestic group Web sites.

Results – Web Usage Patterns for Middle Eastern Groups
[Figure: Normalized Content Levels by group (Hizb-ut-Tahrir, Hizbollah, Al-Qaeda Linked Websites, Jihad Sympathizers, Palestinian terrorist groups) for eight attributes: Communications, Fundraising, Sharing Ideology, Propaganda (insiders), Propaganda (outsiders), Virtual Community, Command and Control, Recruitment and Training.]

Results – Summary of Web Usage Patterns for Middle Eastern Groups
• “Sharing Ideology” was the attribute with the highest frequency of occurrence in Middle Eastern terrorist/extremist groups’ Web sites.
• Clandestine groups (e.g., Al-Qaeda) tended to emphasize “Propaganda towards outsiders,” while the more established groups (e.g., Hizbollah, Hamas) directed their propaganda towards insiders.
• The less covert militant groups tended to conduct fundraising on their Web sites; for instance, both Hizb-ut-Tahrir and Hizbollah used the Web to support their fundraising activities.

Forum Interaction Network Analysis: ClearGuidance.com
• Background
– Forum with some members affiliated with the Toronto terror plot.
– Reportedly had as many as 15,000 members.
– Unfortunately, the site went offline in February of 2004, before we began spidering forums.
– We were able to retrieve selected content from various blogs.
Authors: 269 | Messages: 877 | Duration: 9/2002-2/2004

ClearGuidance.com
• Member locations
– Shown to the right are the self-reported member locations.
– Approximately 2/3 of the 269 members specified a location country.
– Breakdown for those who did specify:
• Majority located in USA, UK, Canada, and the Middle East.
[Pie chart: Members Reporting Location. Specified 67%, Unspecified 33%.]
[Pie chart: Member Locations by Region. Regions: USA, Canada, UK, Australia, Europe-Other, Middle East, Other. Slice values: 8%, 28%, 16%, 3%, 6%, 12%, 27% (region-to-value mapping not recoverable).]

ClearGuidance.com
• Toronto plot forum Member Interaction Network
– Blue nodes indicate members with the greatest number of in-links.
– These members are the core set of forum “experts” and propagandists.

Arabic Feature Set
[Figure: feature tree. Feature Set (418): Lexical (79), Syntactic (262), Structural (62), Content Specific (15). Node labels: Char-Based, Word-Based, Char-Level, Letter Frequency, Special Char., Word-Level, Vocab. Richness, Word Length Dist., Elongation, Punctuation, Function Words, Word Roots, Word Structure, Technical Structure, Message Level, Paragraph Level, Contact Information, Font Color, Font Size, Embedded Images, Hyperlinks, Race/Nationality, Violence.]

Arabic Feature Extraction Component
[Diagram: (1) Incoming Message; (2) Elongation Filter (Count +1, Degree +5), producing a Filtered Message; (3) Root Clustering Algorithm, using a Root Dictionary and Similarity Scores (SC), max(SC)+1; (4) Generic Feature Extractor for All Remaining Features Values. Output: Feature Set.]

Sliding Window Algorithm Illustration
1. Compute eigenvectors for 2 principal components of feature group
   Eigenvectors: x = (0.533, -0.541, 0.034, 0.653, 0.975, 0.143); y = (0.956, 0.445, 0.089, 0.456, -0.085, -0.381)
2. Extract feature usage vectors from the message text (e.g., 1,0,0,2,1,2)
3.
Transform into 2-dimensional space: for a feature usage vector Z (e.g., 0,1,3,0,1,0), x = Zx and y = Zy, i.e., the products of Z with the x and y eigenvectors
4. Repeat steps 2 and 3

Author Writeprints
[Figure: writeprints for Anonymous Messages, Author A (10 messages), and Author B (10 messages).]

Forum “Experts”
• The series of overlapping circular patterns for bag-of-word features indicates that the author’s discussion revolves around a related set of topics.
• Bag-of-words are predominantly related to religious topics.
• Many large red blots indicate the presence of features unique to this author.
• This author attempts to use his religious “expertise”.

• This author was later arrested as a major culprit in the Toronto terror plot (“Soldier of God”). He uses many violent affect terms.
• Radar chart showing violent affect feature usages.
• Text annotation view showing key bag-of-words highlighted.
• Comparison to mean shows several high occurrence terms (e.g., jihad, martyrdom).
• Selected feature (i.e., “jihad”) is shown in red.
• The selected feature is this author’s use of the term “jihad”, which is the highest in the forum.
• This author constantly attempts to justify acts of violence and terrorism.
“…there are so many paid sheikhs stuck in this life….no point going to them for fatwas…personally speaking…cuz they don’t even agree with jihad in the first place”

IED Dark Web Network
• www.albasrah.net: Major Iraqi resistance web site
• www.geocities.com/m_ale3dad4: Training materials
• www.saaid.net: Major Dark Web site
• www.geocities.com/maoso3ah: Training materials

Extraction: Retrieved Pages
• Using the lexicon, we used a search engine to extract all web pages with these terms from our collection.
– A total of 2541 relevant web pages were collected from 30 web sites.
– Over 90% of these pages came from a core set of 7 web sites
Total: No. Web Sites 30 | No. Web Pages 2541
Core Web Sites (Web Site | No. Web Pages)
www.qudsay.com | 1209
www.albasrah.net | 332
www.khayma.com | 162
www.jamaat.org | 141
www.hilafet.com | 66
www.geocities.com | 51
[Figure: Frequency Distribution. No. Web Pages (0 to 3000) by Web Site (1 to 30).]

Link Analysis: Hubs
• www.albasrah.net is a website with links to the former Iraqi Baathist regime. It contains a large collection of war images and reports of military operations by Iraqi insurgents.
• www.geocities.com/m_ale3dad4 is a collection of training material. Topics include weapons, their usage, and the manufacturing of IEDs. The website contains video demonstrations, books, and other documents in English and Arabic.
• www.geocities.com/maoso3ah is an “encyclopedia” of military training and preparation for Jihad.
• www.saaid.net is an Islamic directory. Much of the content pertains to unrelated topics. However, some of the contributors support the Jihadi Salafi movement.

Segmentation: Implanting and setting off the device
• Video 28: Insurgents prepare the IED
• Location: Mushahad region, Iraq
• Insurgents: Islamic Front of the Iraqi Resistance
[Video frames: Planting the device; Hiding the device; Setting it off.]

Categorization: Extended Arabic Feature Set
Group | Category | Quantity | Description/Examples
Lexical | Word-Level | 5 | total words, % char. per word
Lexical | Character-Level | 5 | total char., % char. per message
Lexical | Character N-Grams | < 18,278 | count of letters, char. bigrams, trigrams (e.g., اب, کك)
Lexical | Digit N-Grams | < 1,110 | count of digits, digit bigrams, digit trigrams (e.g., 1, 12, 123)
Lexical | Word Length Distribution | 20 | frequency distribution of 1-20 letter words
Lexical | Vocabulary Richness | 8 | richness (e.g., hapax legomena, Yule’s K, Honore’s H)
Lexical | Special Characters | 21 | occurrences of special char. (e.g., @#$%^&*+=)
Syntactic | Function Words | 300 | frequency distribution of function words (e.g., of, for, to)
Syntactic | Punctuation | 12 | occurrence of punctuation marks (e.g., !;:,.?)
Syntactic | Word Root N-Grams | varies | roots, bigrams, trigrams (e.g., كتب, كسب)
Structural/HTML | Message-Level | 6 | e.g., has greeting, has url, requoted content
Structural/HTML | Paragraph-Level | 8 | e.g., number of paragraphs, sentences per paragraph
Structural/HTML | Technical Structure | 50 | e.g., file extensions, fonts, use of images, HTML tags
Structural/HTML | HTML Tag N-Grams | < 46,656 | e.g., <head>, <br>, <td>, <message>
Topical | Word N-Grams | varies | bag-of-words n-grams (e.g., “explosive”, “explosive device”)

IED Site Signatures
• Using feature selection, we were able to get 88.8% accuracy.
• We were also able to isolate a subset of approximately 9,000 key features.
• The table and graph summarize the 100 bootstrapping instance results.
Technique | Features | Mean Accuracy | Standard Deviation | Range
SVM | 21,333 | 81.938 | 5.313 | 65.00 – 92.50
SVM-IG | 9,268 | 88.838 | 3.238 | 80.00 – 96.25
[Figure: Classification Results. Accuracy (%) by bootstrapping instance for SVM and SVM-IG.]

Recommendation: Terrorism Informatics Methodology
• Anecdote → Data → Data Mining (SNA)
• Journalism → Social Sciences → Computational Sciences
• Field, classified, and human intelligence → Open source, web and artificial intelligence

Recommendation: Databases and Tools
• Developing evidence-based, open source collections for the international intelligence community
• Developing advanced open source, web and artificial intelligence tools and linguistic resources for the international intelligence community
• Leveraging the best existing web intelligence and data mining tools
• Monitoring and analyzing radical forums and Web 2.0
• Advancing multilingual and multimedia analysis techniques for intelligence analysis

Recommendation: “Soft Power”
• Identifying and promoting moderate sites, forums, opinion leaders, and statements
• Removing targeted radical sites and forums based on community tagging and automated, “refreshed” spidering
• Enabling stakeholders and “cultural intelligence” through digital libraries
• Promoting positive alternatives, role
models, and local heroes in the Muslim world

Hsinchun Chen … Artificial Intelligence Lab, COPLINK and Dark Web Teams … [email protected] … http://ai.arizona.edu …

EuroISI 2008: December 3-5, 2008, Copenhagen, Denmark; CFP Deadline: July 8, 2008
PAISI 2009: Co-locating with PAKDD, April 27-30, 2009
IEEE ISI 2009: June 8-11, 2009, Dallas, Texas