Data Mining for Cyber Threat Analysis

Vipin Kumar
Army High Performance Computing Research Center
Department of Computer Science, University of Minnesota
http://www.cs.umn.edu/~kumar

Project Participants: A. Lazarevic, V. Kumar, J. Srivastava, H. Ramanni, L. Ertoz, M. Joshi, E. Eilertson, S. Ketkar

Mining Large Data Sets - Motivation
Examples: computational simulations, information assurance and network intrusion detection, sensor networks, homeland defense.
• There is often information "hidden" in the data that is not readily evident.
• Human analysts may take weeks to discover useful information.
• Much of the data is never analyzed at all.

Data Mining for Homeland Defense
• "Data mining and data warehousing are part of a much larger FBI plan to ... discover patterns and relationships that indicate criminal activity" (network intrusions, cyber attacks, terroristic calls, ...) - Federal Computer Week, June 3, 2002
• FBI Director Robert Mueller: "New information technology is critical to conducting business in a different way, critical to analyzing and sharing information on a real-time basis."

Vision Statement to Guide Research in Question & Answering (Q&A) and Text Summarization
by John D. Prange(1), Eduard Hovy(2), and Steve Maiorano(3)

1. INTRODUCTION
Recent developments in natural language processing R&D have made it clear that formerly independent technologies can be harnessed together to an increasing degree in order to form sophisticated and powerful information delivery vehicles. Information retrieval engines, text summarizers, question answering systems, and language translators provide complementary functionalities which can be combined to serve a variety of users, ranging from the casual user asking questions of the web (such as a schoolchild doing an assignment) to a sophisticated knowledge worker earning a living (such as an intelligence analyst investigating terrorist acts).

A particularly useful complementarity exists between text summarization and question answering systems. From the viewpoint of summarization, question answering is one way to provide the focus for query-oriented summarization. From the viewpoint of question answering, summarization is a way of extracting and fusing just the relevant information from a heap of text in answer to a specific non-factoid question. However, both question answering and summarization include aspects that are unrelated to the other. Sometimes, the answer to a question simply cannot be summarized: either it is a brief factoid (the capital of Switzerland is Berne) or the answer is complete in itself (give me the text of the Pledge of Allegiance). Likewise, generic (author's point of view) summaries do not involve a question; they reflect the text as it stands, without input from the system user.

This document describes a vision of ways in which Question Answering and Summarization technology can be combined to form truly useful information delivery tools. It outlines tools at several increasingly sophisticated stages. This vision, and the staging, can be used to inform R&D in question answering and text summarization. The purpose of this document is to provide a background against which NLP research sponsored by DARPA, ARDA, and other agencies can be conceived and guided. An important aspect of this purpose is the development of appropriate evaluation tests and measures for text summarization and question answering, so as to most usefully focus research without over-constraining it.
(1) Technical Director, Advanced Research and Development Activity (ARDA), R&E Building STE 6530, 9800 Savage Road, Fort Meade, MD 20755-6530. [email protected]
(2) Natural Language Group, University of Southern California - Information Sciences Institute, 4676 Admiralty Way, Marina del Rey, CA 90292-6695. [email protected]
(3) Advanced Analytic Tools, LF-7, Washington, DC 20505. [email protected]

Homeland Defense: Key Issues
• Information fusion from diverse data sources, including intelligence agencies, law enforcement, profiles, ...
• Data mining on this information base to uncover latent models and patterns
• Visualization and display tools for understanding the relationships between persons, events and patterns of behavior
[Diagram: cultural, intelligence, and law enforcement data feed an information fusion layer that supports event recognition, association analysis, a threat predictor, and a threat visualizer.]

Information Assurance: Introduction
• As the cost of information processing and Internet accessibility falls, more and more organizations are becoming vulnerable to potential cyber threats: "unlawful attacks and threats of attack against computers, networks, and the information stored therein when done to intimidate or coerce a government or its people" - D. Denning
[Chart: incidents reported to CERT/CC per year, 1990-2001, rising sharply toward tens of thousands.]

Information Assurance: Intrusion Detection
• Intrusion detection: detecting a set of actions that compromise the integrity, confidentiality, or availability of information resources (e.g., viruses and Internet worms, theft of classified information from DOD computers)
• The problem is identifying individuals who are using computers without authorization, as well as those who have legitimate access but are abusing their privileges
• Intrusion Detection System (IDS): a combination of software and hardware that attempts to perform intrusion detection and raise an alarm when a possible intrusion happens

Data Mining on Intelligence Databases
• Purpose: develop methods to identify potential threats by mining intelligence databases
• Example: Forecasting Militarized Interstate Disputes (FORMIDs)
  - Data: social, political, economic, and geographical information for pairs of countries (ratio of military capability, democracy index, level of trade, distance)
  - Predict: the likelihood of militarized interstate disputes (MIDs)
• Overall objective: predict likely instabilities involving pairs of countries
• Collaborators: Sean O'Brien, Center of Army Analysis (CAA); Kosmo Tatalias (NCS)

Data Mining in the Commercial World
Given its success in commercial applications, data mining holds great promise for analyzing large data sets.
• Classification / predictive modeling (direct marketing, fraud detection)
  [Figure: decision tree splitting on attributes such as Employed, # of years, # of years in school, and Married.]
• Clustering (market segmentation)
• Association patterns (marketing / sales promotions), e.g., {Diaper, Milk} -> {Beer} from market-basket transactions such as the table below (a small itemset-counting sketch follows the table):
  TID  Items
  1    Bread, Milk
  2    Beer, Diaper, Bread, Eggs
  3    Beer, Coke, Diaper, Milk
  4    Beer, Bread, Diaper, Milk
  5    Coke, Bread, Diaper, Milk
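Below is a minimal itemset-counting sketch for the market-basket example above. It is illustrative only: the transactions mirror the slide's table, while the support threshold and the brute-force enumeration (rather than Apriori or FP-growth) are assumptions made for brevity.

```python
from itertools import combinations

# Market-basket transactions from the slide's example table (TID 1-5).
transactions = [
    {"Bread", "Milk"},
    {"Beer", "Diaper", "Bread", "Eggs"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Bread", "Diaper", "Milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Enumerate frequent itemsets by brute force (fine for a toy example;
# Apriori or FP-growth would be used on real data).
items = sorted(set().union(*transactions))
min_support = 0.6  # illustrative threshold
frequent = [
    frozenset(c)
    for k in range(1, 4)
    for c in combinations(items, k)
    if support(frozenset(c)) >= min_support
]
print("Frequent itemsets:", [sorted(f) for f in frequent])

# Confidence of the association pattern {Diaper, Milk} -> {Beer}.
lhs, rhs = frozenset({"Diaper", "Milk"}), frozenset({"Beer"})
conf = support(lhs | rhs) / support(lhs)
print(f"{{Diaper, Milk}} -> {{Beer}}: support={support(lhs | rhs):.2f}, confidence={conf:.2f}")
```

On this toy table, {Diaper, Milk} appears in 3 of 5 transactions and {Diaper, Milk, Beer} in 2 of 5, so the rule has support 0.4 and confidence about 0.67.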
Key Technical Challenges
• Large data size: gigabytes of data are common
• High dimensionality: thousands of dimensions are possible
• Spatial/temporal nature of the data: data points that are close in time and space are highly related
• Skewed class distribution: interesting events are very rare - looking for the "needle in a haystack" ("Mining needle in a haystack. So much hay and so little time.")

Data Fusion & Data Preprocessing
• Data from multiple sources
• Data of multiple types (text, images, voice, ...)
• Data cleaning: missing value imputation, scaling, mismatch handling

Intrusion Detection Research at AHPCRC
• Misuse detection - predictive models
  - Mining needle in a haystack: models must be able to handle skewed class distributions, i.e., the class of interest is much smaller than the other classes
  - Learning from data streams: intrusions are sequences of events
• Anomaly and outlier detection
  - Able to detect novel attacks through outlier detection schemes
  - Detect deviations from "normal" behavior as anomalies
• Construction of useful features that may be used in data mining
• Modifying signature-based intrusion detection systems (SNORT) to incorporate anomaly detection algorithms

Summer Institute Projects
• Implementing anomaly/outlier detection algorithms
• Investigating algorithms for classification of rare classes
• Visualization tool for monitoring network traffic and suspicious network behavior

Learning from Rare Classes
• Key issue: examples from rare classes get missed in standard data mining analysis
• Over-sampling the small class or under-sampling the large class
• PNrule and related work [Joshi, Agarwal, Kumar, SIAM 2001, SIGMOD 2001]
• RareBoost [Joshi, Agarwal, Kumar, ICDM 2001, KDD 2002]
• SMOTEBoost [Lazarevic et al., in review]
• Classification based on associations: add frequent itemsets as "meta-features" to the original data set

PN-rule Learning and Related Algorithms*
• P-phase: cover most of the positive examples with high support; seek good recall
• N-phase: remove false positives from the examples covered in the P-phase; N-rules give high accuracy and significant support
• Existing techniques can learn erroneous small signatures for the absence of C, whereas PNrule can learn strong signatures for the presence of NC in the N-phase
* SIAM 2001, SIGMOD 2001, ICDM 2001, KDD 2002

SMOTE and SMOTEBoost
• SMOTE (Synthetic Minority Oversampling Technique) generates artificial minority (rare) class examples along the line segments joining minority examples to their minority-class nearest neighbors (a minimal sketch of this generation step appears after the results tables below)
• Generalization of the over-sampling technique
• Combining SMOTE with boosting further improves prediction performance on rare classes

SMOTE and SMOTEBoost Results
Experimental results on the modified KDDCup 1999 data set. Recall, precision and F-value are reported, where the F-value is the harmonic mean of precision and recall, F = 2PR / (P + R).

Final values for recall, precision and F-value for the U2R class on the KDDCup-99 intrusion data set:

  Method                                Recall   Precision  F-value
  Standard RIPPER                       57.35    84.78      68.42
  SMOTE + RIPPER, Nu2r=100, Nr2l=100    61.76    86.60      72.10
  SMOTE + RIPPER, Nu2r=300, Nr2l=100    80.15    85.16      82.58
  SMOTE + RIPPER, Nu2r=500, Nr2l=100    75.74    88.03      81.42

  Method                                Recall   Precision  F-value
  Standard Boosting                     80.147   90.083     84.83
  SMOTEBoost, Nu2r=100, Nr2l=100        87.50    87.50      87.50
  SMOTEBoost, Nu2r=300, Nr2l=100        88.24    90.91      89.55
  SMOTEBoost, Nu2r=500, Nr2l=100        87.50    92.97      90.15

Final values for recall, precision and F-value for the R2L class on the KDDCup-99 intrusion data set:

  Method                                Recall   Precision  F-value
  Standard RIPPER                       75.98    96.72      85.11
  SMOTE + RIPPER, Nu2r=100, Nr2l=100    87.94    94.47      86.90
  SMOTE + RIPPER, Nu2r=300, Nr2l=100    77.45    92.03      84.11
  SMOTE + RIPPER, Nu2r=500, Nr2l=100    71.75    89.38      79.06

  Method                                Recall   Precision  F-value
  Standard Boosting                     95.46    96.83      96.14
  SMOTEBoost, Nu2r=100, Nr2l=100        97.02    96.54      96.78
  SMOTEBoost, Nu2r=300, Nr2l=100        97.07    95.25      96.15
  SMOTEBoost, Nu2r=500, Nr2l=100        97.38    95.84      96.72
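As referenced above, the following is a minimal sketch of the SMOTE generation step under simplifying assumptions (brute-force neighbor search, toy 2-D data). It is an illustration only, not the authors' implementation and not the SMOTEBoost combination with boosting.

```python
import numpy as np

def smote(minority, n_synthetic, k=5, rng=None):
    """Minimal SMOTE-style oversampling sketch.

    For each synthetic point: pick a random minority example, pick one of its
    k nearest minority-class neighbors, and interpolate a new point at a
    random position on the line segment joining the two.
    """
    rng = np.random.default_rng(rng)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    # Pairwise distances within the minority class (small-data brute force).
    d = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    neighbors = np.argsort(d, axis=1)[:, :k]           # indices of k nearest minority neighbors

    synthetic = np.empty((n_synthetic, minority.shape[1]))
    for i in range(n_synthetic):
        a = rng.integers(n)                            # random minority example
        b = neighbors[a, rng.integers(min(k, n - 1))]  # one of its neighbors
        gap = rng.random()                             # interpolation factor in [0, 1)
        synthetic[i] = minority[a] + gap * (minority[b] - minority[a])
    return synthetic

# Example: generate 300 synthetic minority-class points from a toy 2-D set.
toy_minority = np.random.default_rng(0).normal(size=(20, 2))
print(smote(toy_minority, n_synthetic=300, k=5, rng=0).shape)   # (300, 2)
```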
Classification Based on Associations
• Current approaches use confidence-like measures to select the best rules to be added as features into the classifiers. This may work well only if each class is well represented in the data set.
• For rare class problems, some of the high-recall itemsets could be potentially useful, as long as their precision is not too low.
• Our approach:
  - Apply a frequent itemset generation algorithm to each class.
  - Select the itemsets to be added as features based on precision, recall and F-measure.
  - Apply a classification algorithm, i.e., RIPPER, to the new data set.

Experimental Results (on modified KDD Cup 1999 data)

  Original RIPPER:
  Class    Precision  Recall   F-measure
  dos      99.50%     99.16%   99.33%
  u2r      84.78%     57.35%   68.42%
  r2l      96.72%     75.98%   85.11%
  probe    97.22%     90.27%   93.62%
  normal   95.52%     99.30%   97.37%

  RIPPER with high precision rules:
  Class    Precision  Recall   F-measure
  dos      99.53%     99.65%   99.59%
  u2r      88.57%     68.38%   77.18%
  r2l      96.54%     78.91%   86.84%
  probe    97.40%     94.89%   96.13%
  normal   96.80%     99.25%   98.01%

  RIPPER with high recall rules:
  Class    Precision  Recall   F-measure
  dos      99.51%     98.95%   99.23%
  u2r      90.09%     73.53%   80.97%
  r2l      92.90%     75.88%   83.53%
  probe    96.96%     96.69%   96.83%
  normal   96.29%     98.88%   97.57%

  RIPPER with high F-measure rules:
  Class    Precision  Recall   F-measure
  dos      99.48%     99.79%   99.63%
  u2r      94.17%     83.09%   88.28%
  r2l      96.14%     84.31%   89.84%
  probe    98.25%     92.11%   95.08%
  normal   97.18%     99.26%   98.21%

For rare classes, rules ordered according to F-measure produce the best results.

Anomaly and Outlier Detection
[Figure: a "normal profile" region with anomalous activities outside it, illustrating missed attacks and false alarms.]
• Main assumptions
  - All anomalous activities need closer inspection.
  - Determine a "normal activity profile" and flag an alarm when the current state differs from the normal profile.
  - An expert analyst examines the suspicious activity to make the final determination of whether it is indeed an intrusion.
• Drawbacks: possibly a large number of false alarms, and attacks that go unrecognized
• Supervised (with access to normal data) vs. unsupervised (with NO access to normal data) determination of "normal behavior"

Outlier Detection
• An outlier is defined as a data point which is very different from the rest of the data ("normal data"), based on some measure of similarity
• Outlier detection approaches:
  - Statistics-based approaches
  - Distance-based techniques
  - Clustering-based approaches
  - Density-based schemes

Distance and Density Based Schemes
• Distance-based approaches (NN approach): outliers are points that do not have enough neighbors (a toy sketch of this scoring follows the results below)
• Density-based approaches (LOF approach): find outliers based on the densities of local neighborhoods
• The concept of locality becomes difficult to define due to data sparsity in high-dimensional space
• Clustering-based approaches define outliers as points which do not lie in clusters; they implicitly define outliers as background noise

Outlier Detection Results (on DARPA'98 data)
Detection rate (false alarm rate fixed at 1%):

  Attack type          NN approach      LOF
  Bursty attacks       15/19 (78.9%)    14/19 (73.7%)
  Non-bursty attacks   9/25 (36.0%)     14/25 (56.0%)

[Plot: score values assigned to network connections from bursty attacks (connection score vs. connection number) for the LOF and NN approaches.]
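For illustration, the distance-based (NN) scoring summarized above can be sketched as follows. This is a toy, assumption-laden version (brute-force distances, synthetic "connection records", an assumed k, and a 1% alarm budget), not the AHPCRC implementation; the density-based LOF scheme would replace the k-NN distance with a local-density ratio.

```python
import numpy as np

def knn_distance_scores(X, k=5):
    """Outlier score = distance to the k-th nearest neighbor (larger = more anomalous)."""
    X = np.asarray(X, dtype=float)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # brute-force pairwise distances
    d.sort(axis=1)                                              # column 0 is the distance to self (0)
    return d[:, k]                                              # distance to the k-th nearest neighbor

# Toy "connection records": a dense normal cloud plus a few scattered points.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(500, 4)), rng.uniform(-8, 8, size=(5, 4))])

scores = knn_distance_scores(X, k=5)
budget = max(1, int(0.01 * len(X)))            # flag the top 1%, mirroring the fixed alarm rate
flagged = np.argsort(scores)[::-1][:budget]
print("Flagged indices:", sorted(flagged.tolist()))
```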
Modifying SNORT
• SNORT contains a simple SPADE (Statistical Packet Anomaly Detection Engine); SPADE only compares the statistics of packets
• Our approach:
  - Integrate our implemented outlier detection schemes into SNORT, since it is open source
  - Improve the detection of novel intrusions and suspicious behavior by using more sophisticated outlier detection schemes

SNN Clustering - Finding Patterns in Noisy Data
• Finds clusters of arbitrary shapes, sizes and densities
• Handles high-dimensional data
• Density invariant
• Built-in noise removal

Topics from the Los Angeles Times (Jan. 1989)
• 3204 articles, 31472 words (LA Times, January 1989)
• afghanistan embassi guerrilla kabul moscow rebel soviet troop ussr withdraw
• chancellor chemic export german germani kadafi kohl libya libyan plant poison weapon west
• ahead ball basket beate brea chri coach coache consecut el final finish foul fourth free game grab half halftim hill host jef lead league led left los lost minut miss outscor overal plai player pointer quarter ralli rank rebound remain roundup scor score scorer season shot steal straight streak team third throw ti tim trail victori win won
• ab bengal bowl cincinnati craig dalla denver esiason field football francisco giant jerri joe miami minnesota montana nfl oppon pass pittsburgh quarterback rice rush super table taylor terri touchdown yard

Topics from the FBI web site
• uss rocket cabin aircraft fuel hughes twa missile redstone (1999 Congressional statement - TWA 800)
• theft art stolen legislation notices recoveries (FBI Art Theft Program)
• signature ink writings genuine printing symbols (forensic science communications)
• forged memorabilia authentic bullpen (FBI major investigation - Operation Bullpen)
• arabia saudi towers dhahran (June 25, 1996 bombing of the Khobar Towers military housing complex in Dhahran, Kingdom of Saudi Arabia - REWARD)
• classified philip quietly ashcroft hanssen drop cia affidavit tenet dedication compromised kgb successor helen volunteered (Agent Robert Philip Hanssen - espionage)

Topics from the FBI web site (continued)
• afghanistan ramzi bin yousef islamic jihad egyptian bombings egypt pakistan hamas yemen headquartered usama laden kenya tanzania nairobi embassies dar salaam rahman mohamed abdel affiliated camps opposed deserve legat enemies vigilance plots casualties
• enterprise asian chinese enterprises korean vietnamese italian cartels heroin cosa nostra sicilian lcn
• firearms firearm bullets ammunition cartridges
• perpetrating bioterrorism responders credible exposed biological articulated covert hoax wmd assumes

Future Applications
• Unclassified telephone call data to be provided by INSCOM
• Goal: to find a terrorist in a haystack of call records
• Nodes: people; edges: telephone calls (date / time / duration)

Conclusions
• Predictive models specifically designed for rare classes can help improve the detection of small attack types
• Simple outlier detection approaches appear capable of detecting anomalies
• Clustering-based approaches show promise in identifying novel attack types
• Integrating data mining techniques into SNORT should improve the detection rate

Data Mining Process
• Data mining: "non-trivial extraction of implicit, previously unknown and potentially useful information from data"

Modified KDDCup 1999 Data Set
• KDDCup 1999 data is based on the DARPA 1998 data set
• Remove duplicates and merge the train and test data sets
• Sample 69,980 examples from the merged data set (sample from the neptune and normal subclasses; other subclasses remain intact)
• Divide in equal proportion into training and test sets (a rough sketch of these steps follows)
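The preparation steps above can be sketched roughly as follows. This is a hedged illustration only: the file names, the `label` column, and the per-class sample sizes are assumptions (the real KDDCup-99 files are header-less CSVs whose column names must be supplied separately), and the exact sampling used to reach 69,980 examples in the original study may differ.

```python
import pandas as pd

# Assumed file names and an assumed `label` column for illustration.
train = pd.read_csv("kddcup_train.csv")
test = pd.read_csv("kddcup_test.csv")

# 1. Merge the train and test portions and drop duplicate connection records.
merged = pd.concat([train, test], ignore_index=True).drop_duplicates()

# 2. Down-sample only the dominant 'neptune.' and 'normal.' subclasses,
#    leaving every other subclass intact (target sizes here are illustrative).
keep_all = merged[~merged["label"].isin(["neptune.", "normal."])]
downsampled = pd.concat(
    [merged[merged["label"] == lbl].sample(n=n, random_state=0)
     for lbl, n in [("neptune.", 10_000), ("normal.", 20_000)]]
)
sample = pd.concat([keep_all, downsampled]).sample(frac=1.0, random_state=0)

# 3. Split the sampled data in equal proportion into new training and test sets.
half = len(sample) // 2
new_train, new_test = sample.iloc[:half], sample.iloc[half:]
print(len(new_train), len(new_test))
```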
DARPA 1998 Data Set
• The DARPA 1998 data set (prepared and managed by MIT Lincoln Lab) includes a wide variety of intrusions simulated in a military network environment
• 9 weeks of raw TCP dump data: 7 weeks for training (5 million connection records) and 2 weeks for testing (2 million connection records)
• Connections are labeled as normal or attacks (4 main categories of attacks, 38 attack types):
  - DOS: denial of service
  - Probe: e.g., port scanning
  - U2R: unauthorized access to root privileges
  - R2L: unauthorized remote login to a machine
• Two types of attacks:
  - Bursty attacks: involve multiple network connections
  - Non-bursty attacks: involve single network connections

Terrorist Threat Analyzer & Predictor (T-TAP)

Operational Capability:
[Diagram: cultural, intelligence, and law enforcement data feed an information fusion layer in the T-TAP system, supporting event recognition, association analysis, a threat predictor, and a threat visualizer.]
• Ability to match data from multiple sources, resolving structural and semantic conflicts.
• Ability to recognize events and behaviors, each of whose (partial) information is available in different data streams.
• Ability to identify latent associations between suspects and their linkages to events/behaviors.
• Capability to predict threatening activities and behaviors with high probability.
• Ability to visualize interesting and significant events and behaviors.

Proposed Technical Approach: New Effort
Key Technologies:
• High-dimensional data clustering - METIS.
• Spatio-temporal change point detection.
• Association analysis and belief revision.
• High-dimensional classification - SVM, NN, boosting.
Task List:
• T1: Develop data fusion algorithms to match diverse intelligence, law enforcement, and cultural data.
• T2: Develop event & behavior recognition algorithms across multiple, multi-media data streams.
• T3: Association analysis algorithms to determine hidden connections between suspects, events and behaviors; develop networks of associations.
• T4: Predictive models of threatening events and behaviors.
• T5: Interestingness- and relevance-based visualization models of significant events and behaviors.

Rough Order of Magnitude Cost and Schedule:
• Tasks 1, 2, 3 will each proceed in parallel for the first 12 months, with version 1 released at the end of 6 months. At this point tasks 4, 5 will start and proceed in parallel. Cross feedback across the various tasks will lead to final, refined tools at the end of the 18-month period.

Deliverables:
• Database schema and structures to store terrain info.
• Software implementation of concealed cavities prediction model.
• User manuals, test reports, database schema.
• Quarterly technical and status reports, and final report.
Corporate Information:
• Vipin Kumar (POC)
• Army High Performance Computing Research Center, University of Minnesota, 1100 Washington Avenue SE, Minneapolis, MN 55455
• Phone: (612) 626-8095; E-mail: [email protected]

Distributed Virtual Integrated Threat Analysis Database (DVITAD)

Operational Capability:
[Diagram: database connectors feed a virtual integrated database that supports information integration (matching, clustering, profiles, networks, tracks) and threat analysis tools.]
• Global integrated view across multiple databases: integrated schema, global dictionary and directory, common ontology
• Threat analysis object repository:
  - comprehensive suspect dossiers
  - temporal activity tracks
  - association networks
• Interactive database exploration:
  - field- and value-based querying
  - approximate-match-based retrieval

Proposed Technical Approach: New Effort
Key Technologies:
• Semantic object matching
• Clustering large datasets - METIS
• Latent/indirect association analyzer
Task List:
• T1: Data normalization through the development of wrappers/connectors for the various databases.
• T2: Integrated schema creation to model the information found in all databases.
• T3: Matching entities (suspects, events, etc.) across multiple data sources; resolving conflicting attribute values for entities across databases.
• T4: Clustering of suspect profiles.
• T5: Building networks of hidden associations between suspects.
• T6: Constructing temporal activity tracks of events and linking suspects to the tracks.

Rough Order of Magnitude Cost and Schedule:
• Tasks 1 and 2 will each proceed in parallel for the first 12 months, with version 1 released at the end of 6 months. At this point tasks 3, 4, 5, 6 will start and proceed in parallel. Cross feedback across the various tasks will lead to final, refined tools at the end of the 18-month period.

Deliverables:
• Global schema, dictionary, and directory for the integrated database.
• Software that realizes the virtual integrated DB view.
• User manuals, test reports, database schema.
• Quarterly technical and status reports, and final report.

Corporate Information:
• Vipin Kumar (POC)
• Army High Performance Computing Research Center, University of Minnesota, 1100 Washington Avenue SE, Minneapolis, MN 55455
• Phone: (612) 626-8095; E-mail: [email protected]