Data mining for crime detection
Sunita Sarawagi, IIT Bombay
http://www.it.iitb.ac.in/~sunita

Digital age
Data about people, organizations, and operations is increasingly available online:
– Phone calls, credit card and ATM usage
– Birth records, employment records
– Residence, land owned
– Countries visited
– Types of licenses
– Places visited: monitoring cameras
– Money transfers
Cheap storage and fast machines make it easy to store and analyze these data.

Digital crimes
Crimes are increasingly technical:
– Credit card frauds
– Stock market scams
– Hacker attacks on government computers and networks
– Insurance frauds

Data mining
The process of semi-automatically analyzing large databases to find patterns that are:
– valid: hold on new data with some certainty
– novel: non-obvious to the system
– useful: it should be possible to act on the pattern
– understandable: humans should be able to interpret the pattern

Existing applications
Banking: loan/credit card approval
– predict good customers based on old customers
Customer relationship management
– identify customers who are likely to leave for a competitor
Targeted marketing
– identify likely responders to promotions
Medicine: disease outcome, effectiveness of treatments
– analyze patient disease histories; find relationships between diseases

Applications in crime investigation
Fraud detection: telecommunications, financial transactions
– from an online stream of events, identify the fraudulent ones
Interpret insurance claims (in text format)
– classify a claim as valid or not
Detect attacks and intrusions on computers and networks by profiling normal behaviour
Health insurance frauds
– cohorts of doctors that ping-pong patients to each other
Identify links amongst people

The KDD process
Data in a warehouse is extracted via ODBC, passed through preprocessing utilities (sampling, attribute transformation), fed to scalable mining algorithms (association, classification, clustering, sequence mining), and the resulting patterns are explored with visualization tools.

Mining operations
– Classification and regression: classification trees, neural networks, Bayesian learning, nearest neighbour, radial basis functions, support vector machines, meta-learning methods (bagging, boosting)
– Clustering: hierarchical, EM, density based
– Sequence mining: time-series similarity, temporal patterns
– Itemset mining: association rules, causality

Classification
Given old data about customers and payments, predict a new applicant's loan eligibility.
Previous customers (age, salary, profession, location, customer type) are used to train a classifier; the resulting decision rules (e.g. Salary > 5 L and Prof. = Exec) are then applied to a new applicant's data to predict good/bad.

Classification methods
Goal: predict class Ci = f(x1, x2, ..., xn)
– Regression (linear or any other polynomial): a*x1 + b*x2 + c = Ci
– Nearest neighbour
– Decision tree classifier: divide the decision space into piecewise constant regions
– Probabilistic/generative models
– Neural networks: partition by non-linear boundaries

Nearest neighbour
Define proximity between instances, find the neighbours of a new instance, and assign the majority class.
Case-based reasoning: used when attributes are more complicated than real-valued.
Pros: fast training.
Cons: slow during application; no feature selection; the notion of proximity is vague.

Decision trees
A tree whose internal nodes are simple decision rules on one or more attributes and whose leaf nodes are predicted class labels.
Example tree (figure): splits on Degree = BSc, Score > 90 and Rank < 6, with Good/Bad labels at the leaves.
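To make the loan-eligibility classification example concrete, here is a minimal sketch using a decision-tree classifier from scikit-learn. The table of previous customers, the column names, and the depth limit are illustrative assumptions, not data or settings from the talk.

```python
# A minimal sketch of the loan-approval classification example: train on past
# customers, inspect the learned decision rules, predict for a new applicant.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical "previous customers" table (values are made up)
old = pd.DataFrame({
    "age":     [25, 42, 35, 51, 29, 46],
    "salary":  [3.0, 8.5, 6.2, 12.0, 2.5, 7.0],   # salary in lakhs
    "is_exec": [0, 1, 1, 1, 0, 0],                # Profession = Exec, one-hot encoded
    "good":    [0, 1, 1, 1, 0, 1],                # customer type: 1 = good, 0 = bad
})

features = ["age", "salary", "is_exec"]
clf = DecisionTreeClassifier(max_depth=2).fit(old[features], old["good"])
print(export_text(clf, feature_names=features))   # the learned decision rules

# Apply the rules to a new applicant's data
new_applicant = pd.DataFrame({"age": [33], "salary": [6.0], "is_exec": [1]})
print(clf.predict(new_applicant)[0])              # 1 = good, 0 = bad
```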
Neural networks
A set of nodes connected by directed, weighted edges.
The basic NN unit computes o = σ(Σ_{i=1..n} w_i·x_i), where σ(y) = 1 / (1 + e^(-y)).
A more typical NN arranges such units into layers: input nodes x1, x2, x3, ..., hidden nodes and output nodes.
Used for face recognition and other image-recognition tasks.

Association rules
Given a set T of groups of items, e.g. a set of baskets of items purchased: {milk, cereal}, {tea, milk}, {tea, rice, bread}, ...
Goal: find all rules on itemsets of the form a --> b such that
– the support of a and b is greater than a user threshold s
– the conditional probability (confidence) of b given a is greater than a user threshold c
Example: milk --> bread.
A lot of work has been done on scalable algorithms for this problem.

Applications of fast itemset counting
Find correlated events:
– applications in medicine: find redundant tests
– cross-selling in retail and banking
– intrusion detection
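As a minimal illustration of the support and confidence definitions above, the sketch below counts them directly over a handful of baskets. The extra {milk, bread} basket and the threshold values are illustrative assumptions; real association-rule miners (e.g. Apriori-style algorithms) count supports far more efficiently than this brute force loop.

```python
# A brute-force sketch of itemset support and rule confidence, following the
# definitions on the slide above. Baskets and thresholds are illustrative.
baskets = [
    {"milk", "cereal"},
    {"tea", "milk"},
    {"tea", "rice", "bread"},
    {"milk", "bread"},        # extra basket added for illustration
]

def support(itemset):
    """Fraction of baskets that contain every item of the itemset."""
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

def confidence(a, b):
    """Estimated conditional probability of b given a: support(a U b) / support(a)."""
    return support(a | b) / support(a)

s, c = 0.2, 0.3               # user thresholds for support and confidence
a, b = {"milk"}, {"bread"}
if support(a | b) > s and confidence(a, b) > c:
    print(f"milk --> bread  support={support(a | b):.2f}  confidence={confidence(a, b):.2f}")
```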
Case study: data mining for network intrusion detection

Fighting intrusion
– Prevention: isolate from the network, strict authentication measures, encryption
– Preemption: "do unto others before they do unto you"
– Deterrence: dire warnings, "we have a bomb too"
– Deflection: diversionary techniques to lure attackers away
– Detection
– Counter-attacks

Intrusion detection methods
Anomaly-based
– study typical patterns of normal use and detect abnormal usage
– cannot distinguish illegal from merely abnormal behaviour
Signature-based
– model signatures of previous attacks and flag matching patterns
– cannot detect new intrusions
Use a hybrid of the two.

Intrusion detection methods (contd.)
Automatic rules
– use historical audit trails and an intelligent learning technique to model normal and intrusion traffic
– may not provide full coverage
Policy-driven rules
– a security expert codifies the rules
– manually intensive, might miss patterns, may not evolve as the normal usage pattern slowly drifts
Use a hybrid of the two.

Current intrusion detection approaches
Main problems: manual and ad hoc.
Misuse detection:
– known intrusion patterns have to be hand-coded
– unable to detect any new intrusions (that have no matching patterns recorded in the system)
Anomaly detection:
– selecting the right set of system features to be measured is ad hoc and based on experience
– unable to capture sequential interrelations between events

Data mining
Why is it applicable to intrusion detection?
– Normal and intrusive activities leave evidence in audit data.
– From a data-centric point of view, intrusion detection is a data analysis process.
– Successful applications in related domains, e.g. fraud detection and fault/alarm management.

Relevant data mining algorithms
– Classification: maps a data item into one of several pre-defined categories
– Link analysis: determines relations between fields in the database
– Sequence analysis: models sequence patterns

Intrusion detection
Intrusions can be detected at the
– network level (denial-of-service attacks, open port scans, etc.): a sequence of TCP dumps
– host level (attacks on privileged programs such as lpr and sendmail): a sequence of system calls, e.g. lseek, lstat, mmap, execve, ioctl, ioctl, close, execve, close, unlink; the set S of all possible system calls has roughly 100 elements

Classification models on sendmail
Philosophy:
– For most privileged programs, the short sequences of system calls made during normal executions are very consistent, yet different from the sequences of its abnormal (exploited) executions as well as the executions of other programs.
The sendmail data:
– Each trace has two columns: the process ids and the system call numbers.
– Normal traces: sendmail and the sendmail daemon.
– Abnormal traces generated by known sendmail attacks: sunsendmailcap, syslog-remote, syslog-local, decode, sm5x and sm565a.

Classification models on sendmail (contd.)
Data preprocessing (the most challenging step):
– convert the sequential data to record data
– use a sliding window to create sequences of consecutive system calls
– label the sequences to create training data, e.g. sequences of length 7:
    4 2 66 66 4 138 66      "normal"
    5 5 5 4 59 105 104      "abnormal"
Learning patterns of normal sequences:
– Each record: n consecutive system calls plus a class label, "normal" or "abnormal".
– Training data: sequences from 80% of the normal traces plus some of the attack traces.
– Testing data: traces not used in training, including some unknown attacks.
– Use RIPPER to learn specific rules.

sendmail Experiment 1
Examples of output RIPPER rules:
– if the 2nd system call is vtimes and the 7th is vtrace, then the sequence is "normal"
– if the 6th system call is lseek and the 7th is sigvec, then the sequence is "normal"
– ...
– if none of the above, then the sequence is "abnormal"
Using the learned rules to analyze a new trace:
– label all sequences according to the rules
– define a region as l consecutive sequences
– define an "abnormal" region as one having more "abnormal" sequences than normal ones
– calculate the percentage of "abnormal" regions
– the trace is "abnormal" if the percentage is above a threshold
Setup:
– training data includes sequences from the intrusion traces and from 80% of the normal sendmail traces
– sequence length = 7
– the percentage of abnormal "regions" in each trace (shown in the table below) is used as the intrusion indicator
– the output rule sets contain ~250 rules, each with 2 or 3 attribute tests; compare this with the ~1,500 different sequences in total
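Here is a minimal sketch of the trace-analysis procedure just described: slide a length-7 window over a trace of system-call numbers, label each sequence with a rule set, and score the trace by the fraction of abnormal regions. The rule format, the region size and the threshold are illustrative assumptions standing in for the RIPPER output and the paper's actual parameters.

```python
# Sketch of the sendmail experiment pipeline: sliding-window sequences,
# rule-based labelling, and region-based scoring of a whole trace.
WINDOW = 7          # length of each system-call sequence (as in the experiment)
REGION = 20         # number of consecutive sequences per region (assumed value)
THRESHOLD = 0.10    # fraction of abnormal regions above which a trace is flagged (assumed)

def sliding_sequences(trace, n=WINDOW):
    """Convert a trace (list of system-call numbers) into length-n sequences."""
    return [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]

def label_sequence(seq, rules):
    """Apply simplified RIPPER-style rules: each rule is (position, syscall, label).
    If no rule fires, fall back to the default 'abnormal' label."""
    for pos, call, label in rules:
        if seq[pos] == call:
            return label
    return "abnormal"

def score_trace(trace, rules):
    """Return the fraction of regions in the trace that are abnormal."""
    labels = [label_sequence(s, rules) for s in sliding_sequences(trace)]
    regions = [labels[i:i + REGION] for i in range(0, len(labels), REGION)]
    abnormal = sum(1 for r in regions if r.count("abnormal") > r.count("normal"))
    return abnormal / max(len(regions), 1)

# Illustrative rule set (positions are 0-based; syscall numbers are made up)
rules = [(1, 66, "normal"), (5, 19, "normal")]

trace = [4, 2, 66, 66, 4, 138, 66, 5, 5, 5, 4, 59, 105, 104] * 10
print("abnormal" if score_trace(trace, rules) > THRESHOLD else "normal")
```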
sendmail Experiment 1: results

  trace              % of abnormal regions
  sscp-1             32.2
  sscp-2             30.4
  sscp-3             30.4
  syslog-remote-1    21.2
  syslog-remote-2    15.6
  syslog-local-1     11.1
  syslog-local-2     15.9
  decode-1            2.1
  decode-2            2.0
  sm565a              8.0
  sm5x                6.5
  sendmail             0.1

Illegal use of sendmail shows a much higher fraction of abnormal regions, as found by the rule learner.

Mining models on tcpdump data
– Packets of incoming, outgoing and internal broadcast traffic.
– Much richer than the earlier sendmail data.
– Needs extensive preprocessing to convert it into a usable form.

Preprocessing steps
– Raw audit data is converted to ASCII-level packets.
– Packets are aggregated into connection-level records:
  – record connection attempts
  – monitor data packets and count the number of bytes in each direction
  – watch how the connection is terminated
– Feature selection methods from data mining are used to augment the connection records with temporal features.

Structure of a connection record
Each record has:
– start time and duration
– participating hosts and ports (applications)
– statistics (e.g. number of bytes)
– a flag: "normal" or a connection/termination error
– the protocol: TCP or UDP
– a collection of temporal features extracted using data mining; for example, in a port scan there are multiple rejected connections to the same host

Manually chosen features
"Generic" features:
– protocol (service), protocol type (tcp, udp, icmp, etc.)
– duration of the connection
– flag (connection established and terminated properly, SYN error, rejected, etc.)
– whether the connection is from/to the same IP/port pair
"Content" features (only useful for TCP connections):
– number of failed logins
– successfully logged in or not
– number of root shell prompts
– "su root" attempted or not
– number of accesses to security control files
– number of compromised states (e.g. "Jumping to address", "path not found", ...)
– number of write accesses to files
– number of "hot" indicators (the sum of all the above)

Features from mined patterns
Temporal and statistical "traffic" features:
– number of connections to the same destination host as the current connection in the past 2 seconds, and among these connections:
  – number of rejected connections
  – number of different services
  – rate (%) of connections that have the same service
  – rate (%) of different (unique) services

Experimental setup
– DARPA provided normal traffic mixed with simulated intrusions in a military network.
– 4 GB of tcpdump data covering 7 weeks of network traffic.
– 5 million connection records of about 100 bytes each.
– Test records: 2 million over two weeks, with 38 attack types, 14 of which appear only in the test data.
– Attack types:
  – DOS: denial-of-service attacks
  – PROBING: port scans
  – unauthorized access from a remote machine, e.g. password guessing
  – unauthorized access to the local superuser, e.g. buffer-overflow attacks

Example rules
buffer_overflow :- hot >= 3, compromised >= 1, su_attempted <= 0, root_shell >= 1.
ipsweep :- protocol = eco_i, srv_diff_host_rate >= 0.5, count <= 2, srv_count >= 6.
smurf :- protocol = ecr_i, count >= 5, srv_count >= 5.
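To show how rules of this form would be applied, here is a minimal sketch that evaluates the three example rules against a single connection record held as a dict of the features listed above. The sample record's values, and the treatment of each rule as an independent predicate, are illustrative assumptions rather than part of the described system.

```python
# Applying the mined RIPPER-style example rules to one connection record.
def buffer_overflow(r):
    return (r["hot"] >= 3 and r["compromised"] >= 1
            and r["su_attempted"] <= 0 and r["root_shell"] >= 1)

def ipsweep(r):
    return (r["protocol"] == "eco_i" and r["srv_diff_host_rate"] >= 0.5
            and r["count"] <= 2 and r["srv_count"] >= 6)

def smurf(r):
    return r["protocol"] == "ecr_i" and r["count"] >= 5 and r["srv_count"] >= 5

# Hypothetical connection record (values are made up)
record = {"protocol": "ecr_i", "hot": 0, "compromised": 0, "su_attempted": 0,
          "root_shell": 0, "srv_diff_host_rate": 0.0, "count": 9, "srv_count": 12}

for name, rule in [("buffer_overflow", buffer_overflow),
                   ("ipsweep", ipsweep), ("smurf", smurf)]:
    if rule(record):
        print("alert:", name)   # this sample record triggers the smurf rule
```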
Results
Compared with four other groups using knowledge-engineering approaches, this method was
– the best for PROBING attacks
– among the best for DOS and user-to-root attacks
– no system was good at detecting remote-to-local attacks
Difficulties of the other methods:
– large amount of data
– they do not generalize

Mining market
Generic data mining tools (around 20 to 30 mining-tool vendors):
– SAS's Enterprise Miner
– Clementine
– IBM's Intelligent Miner
Many DBMS and data warehousing systems come packaged with standard data mining tools.
Several fraud detection products: http://www.kdnuggets.com/solutions/fraud-detection.html

Data warehousing and analysis
(Architecture figure.) Operational data from many sources (branch databases in Bombay, Delhi and Calcutta on Oracle, IMS and SAS, plus GIS and census data) is merged, cleaned and summarized into a relational data warehouse (e.g. Redbrick) holding detailed transactional data; the warehouse is then accessed by direct query, reporting tools (Crystal Reports), OLAP tools (Essbase) and mining tools (Intelligent Miner).

Multidimensional data analysis
Example: sales volume as a function of product, month and region.
Dimensions: Product, Location, Time.
Hierarchical summarization paths:
– Product: industry, category, product
– Location: region, country, city, office
– Time: year, quarter, month, week, day

A sample data cube
(Figure: a cube with dimensions Date (1Qtr–4Qtr, sum), Product (TV, PC, VCR, sum) and Country (U.S.A., Canada, Mexico, sum); one cell holds, for example, the total annual sales of TVs in the U.S.A.)

Typical OLAP operations
– Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction.
– Drill down (roll down): the reverse of roll-up; move from a higher-level summary to a lower-level summary or detailed data, or introduce new dimensions.
– Slice and dice: project and select.
– Pivot (rotate): reorient the cube for visualization; turn 3D into a series of 2D planes.

OLAP
Navigational operators: pivot, drill-down, roll-up, select.
Hypothesis-driven search, e.g. factors affecting loan defaulters:
– view the defaulting rate by age, aggregated over the other dimensions
– for a particular age segment, drill down along profession
Demo in the afternoon.
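As a minimal sketch of the roll-up, drill-down and slice-and-dice operations above, the snippet below uses pandas group-bys as a stand-in for an OLAP engine; the sales values and column names are illustrative assumptions.

```python
# OLAP-style operations on a tiny sales cube, approximated with pandas.
import pandas as pd

sales = pd.DataFrame({
    "product": ["TV", "TV", "PC", "VCR", "PC", "TV"],
    "country": ["USA", "Canada", "USA", "Mexico", "Canada", "USA"],
    "quarter": ["1Qtr", "1Qtr", "2Qtr", "2Qtr", "3Qtr", "4Qtr"],
    "amount":  [120, 80, 200, 60, 150, 90],
})

# Roll up: summarize away the quarter dimension, leaving product x country totals
rollup = sales.groupby(["product", "country"])["amount"].sum()

# Drill down: reintroduce the quarter dimension for a finer-grained view
drilldown = sales.groupby(["product", "country", "quarter"])["amount"].sum()

# Slice: fix one dimension (country = USA), then dice by product and quarter
usa = sales[sales["country"] == "USA"]
dice = usa.pivot_table(index="product", columns="quarter", values="amount", aggfunc="sum")

print(rollup, drilldown, dice, sep="\n\n")
```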
Link data from several data sources
Goal: link records about the same person across lists such as taxpayers, land records, passport, transport and telephone databases.
Duplicates:
  SOUZA ,D ,D ,,GORA VILLA ,,VIMAN NAGAR ,,,411014
  DERYCK ,D ,SOZA ,03 ,GERA VILLA ,,VIMAN NAGAR PUNE ,411014
Non-duplicates:
  CHAFEKAR ,RAMCHANDRA ,DAMODAR ,SHOP 8 ,H NO 509 NARAYAN PETH PUNE 411030
  CHITRAV ,RAMCHANDRA ,D ,FLAT 5 ,H NO 2105 SADASHIV PETH PUNE 411 030

Machine learning approach
Given examples of duplicate and non-duplicate pairs, learn to predict whether a new pair is a duplicate or not.
Input features: various kinds of similarity functions between attributes
– edit distance, Soundex, n-grams on text attributes
– absolute difference on numeric attributes
These capture domain-specific knowledge about comparing data.

The learning approach
(Figure: labelled record pairs are mapped through similarity functions f1, f2, ..., fn into feature vectors tagged duplicate (1) or non-duplicate (0); a classifier, e.g. a decision tree over features such as YearDifference, All-Ngrams, AuthorTitleNgrams, TitleIsNull, PageMatch and AuthorEditDist, is trained on these vectors and then applied to mapped pairs from the unlabelled list.)

Experiences with the learning approach
Too much manual search in preparing the training data:
– hard to spot challenging and covering sets of duplicates in large lists
– even harder to find close non-duplicates that capture the nuances, e.g. instances that are similar on one attribute but dissimilar on another
Active learning is a generalization of this.

The active learning approach
(Figure: as before, labelled pairs are mapped into feature vectors and used to train a classifier, but an active learner now examines the pool of mapped unlabelled pairs and selects the instances whose labels would be most informative for the next round of labelling.)

Architecture of ALIAS
(Figure: initial training records Lp and a large unlabelled record list are mapped, via the similarity functions F, into labelled training instances T and a pool Dp of mapped unlabelled instances; a classifier is trained on T, the active learner uses a predicate for the uncertain region and similarity indices to select instances S for labelling, further pairs are inferred using transitivity, and the learned deduplication function is applied by an evaluation engine to produce groups of duplicates in the large list A.)

Benefits of active learning
Learning a deduplication function on Bibtex entries, with 100 labelled pairs:
– active learning: 97% (peak)
– random selection: only 30%
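To make the active learning loop concrete, here is a minimal sketch of uncertainty sampling over mapped record pairs, in the spirit of the approach above but not the ALIAS implementation: train on a few labelled pairs, then repeatedly ask a (simulated) human to label the unlabelled pair the classifier is least sure about. The feature vectors, the logistic-regression learner and the oracle are illustrative assumptions.

```python
# Active learning for deduplication via uncertainty sampling (illustrative sketch).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: similarity features (f1, f2) of a record pair; label 1 = duplicate
X_labeled = np.array([[1.0, 0.4], [0.0, 0.1], [0.9, 0.7], [0.1, 0.2]])
y_labeled = np.array([1, 0, 1, 0])

# Pool of mapped but unlabelled pairs
X_pool = np.array([[0.5, 0.5], [0.9, 0.1], [0.2, 0.8], [0.05, 0.05], [0.95, 0.9]])

def oracle(x):
    # Stand-in for the human labeller: call it a duplicate if similarities are high
    return int(x.mean() > 0.5)

for _ in range(3):                               # three rounds of active learning
    clf = LogisticRegression().fit(X_labeled, y_labeled)
    probs = clf.predict_proba(X_pool)[:, 1]
    i = int(np.argmin(np.abs(probs - 0.5)))      # most uncertain pair in the pool
    X_labeled = np.vstack([X_labeled, X_pool[i]])
    y_labeled = np.append(y_labeled, oracle(X_pool[i]))
    X_pool = np.delete(X_pool, i, axis=0)

clf = LogisticRegression().fit(X_labeled, y_labeled)   # final deduplication function
print(clf.predict(np.array([[0.8, 0.6]])))             # classify a new record pair
```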