Data Mining
Pat Talbot, Ryan Sanders, Dennis Ellis
11/14/01
TRW Private/Proprietary

Background
• 1998: Scientific Data Mining at the JNIC (CRAD)
  • Tools: MineSet 2.5 was used. A trade study scored Clementine highest, but it was too expensive.
  • Training: Knowledge Discovery in Databases conference (KDD '98)
  • Database: Model Reuse Repository with 325 records, 12 fields of metadata
  • Results: the "Rules Visualizer" algorithm provided "prevalence" and "predictability"; a clustering algorithm showed correlation of fields
• 1999: Wargame 2000 Performance Patterns (CRAD)
  • Tools: MineSet 2.5 was again used
  • Training: KDD '99 tutorials and workshops
  • Database: numerical data from Wargame 2000 performance benchmarks
  • Results: insight into the speedup obtainable from a parallel discrete event simulation
• 2000: Strategic Offense/Defense Integration (IR&D)
  • Tools: Weka freeware from the University of Waikato, New Zealand
  • Training: purchased the Weka text, Data Mining by Ian Witten and Eibe Frank
  • Databases: STRATCOM Force Readiness; heuristics for performance metrics
  • Results: a rule induction tree provided tabular understanding of structure
• 2001: Offense/Defense Integration (IR&D)
  • Tools: Weka with GUI; also evaluated Oracle's Darwin
  • Database: Master Integrated Data Base (MIDB), 300 records, 14 fields (unclassified)
  • Results: visual display of easy-to-understand IF-THEN rules as a tree structure; performance benchmarking of file sizes and execution times

Success Story
• Contractual work: Joint National Integration Center (1998–1999)
  • Under the Technology Insertion Studies and Analysis (TISA) Delivery Order
  • Results: the "Rules Visualizer" algorithm provided "prevalence" and "predictability"; a clustering algorithm showed correlation of fields
• Importance:
  • Showed practical applications of the technology
  • Provided training and attendance at data mining conferences
  • Attracted the attention of the JNIC Chief Scientist, who used it for analysis

Data Mining Techniques
Uses algorithmic* techniques for information extraction:
• Shallow data: discover with SQL
• Multi-dimensional data: discover with OLAP
• Hidden data: discover with Weka, using rule induction, neural networks, regression modeling, k-nearest-neighbor clustering, or radial basis functions; rule induction is preferred when an explanation is required
• Deep data: discover only with clues
* Non-parametric statistics, machine learning, connectionist methods

Data Mining Process
• Data sources: STRATCOM data stores, requirements database, simulation output, MIDB
• Flow: target data → preprocessing & cleaning → transformation & reduction → storage/retrieval (database, datamart, warehouse; indexing via hypercube or multicube) → data mining (patterns & models) → visualization & evaluation → knowledge → situation assessment

Data Mining Flow Diagram
• External inputs: Force Readiness data file
• Input processing: Weka data mining software, with defaults for algorithm choice (Quinlan rule induction tree or clustering algorithm), output parameters, and control parameters
• Output: rule tree / clusters revealing underlying structure, patterns, and hidden problems (consistency, missing data, corrupt data, outliers, exceptions, old data); links to effects

Current Objectives
• Automated assessment and prediction of threat activity
  • Discover patterns in threat data
  • Predict future threat activity
• Use data mining to construct Dempster-Shafer belief networks
• Integrate with conceptual clustering, data fusion, and terrain reasoning
[Diagram: Terrain Reasoning (geographic data on routes, threat movements), Conceptual Clustering (new concepts, attributes, relations), Data Fusion (evidence, belief in hypotheses), and Data Mining (IF-THEN rules, new patterns) linked by "automate" arrows to hypothesis and impact assessment.]

Rule Induction Tree Format
• Example: rule induction tree for weather data
  Database:  #  Outlook  Humidity  Winds  Temp  Deploy
             1  Sunny    51%       True   78    Yes
             2  Rainy    70%       False  98    No
             ...
             n
  (Temp had no effect!)
• Decision supported: weather influence on asset deployment ("Deploy asset?")
• Leaf annotation "no (3.0/2.0)" means 3 records met the rule and 2 did not

Results: READI Rule Tree
Reference: Talbot, P., TRW Technical Review Journal, Spring/Summer 2001, pp. 92–93.
• Quinlan C4.5 classifier
• Links are IFs; white nodes are THENs; yellow nodes are C-ratings
[Tree diagram over the Overall Training, Mission Capable, Authorized Platforms, Category, Comments, Edited, ICBM, Spares, Training, and Platforms Ready fields, with C-0 through C-5 training and rating nodes, split thresholds, and occurrence counts at the leaves.]
• Example rule: in the "Overall Training" database, if C-1 training was received, then all but 4 squadrons are Mission Capable and all but 2 are then Platform Ready. If Platform Ready, 564 are then rated C-1.
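The rule induction used in these slides is Quinlan-style: each split in the tree is chosen by information gain over the class attribute. A minimal sketch in Python, using a hypothetical six-record stand-in for the weather table above (Weka implements the full C4.5 algorithm, with pruning, as its J48 classifier):

```python
# Minimal ID3/C4.5-style split selection, sketched for the weather example.
# The records below are hypothetical stand-ins for the slide's database.
from collections import Counter
from math import log2

records = [
    {"outlook": "sunny", "windy": True,  "deploy": "yes"},
    {"outlook": "sunny", "windy": True,  "deploy": "yes"},
    {"outlook": "rainy", "windy": False, "deploy": "no"},
    {"outlook": "rainy", "windy": True,  "deploy": "no"},
    {"outlook": "sunny", "windy": False, "deploy": "yes"},
    {"outlook": "rainy", "windy": False, "deploy": "no"},
]

def entropy(rows):
    """Shannon entropy of the class ('deploy') distribution."""
    counts = Counter(r["deploy"] for r in rows)
    total = len(rows)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def info_gain(rows, attr):
    """Expected reduction in class entropy after splitting on attr."""
    base = entropy(rows)
    remainder = 0.0
    for v in {r[attr] for r in rows}:
        sub = [r for r in rows if r[attr] == v]
        remainder += (len(sub) / len(rows)) * entropy(sub)
    return base - remainder

gains = {a: info_gain(records, a) for a in ("outlook", "windy")}
best = max(gains, key=gains.get)
print(best, round(gains[best], 3))  # -> outlook 1.0
```

With this toy data, `outlook` separates the classes perfectly (gain 1.0) while `windy` scores near zero, which is the same mechanism that flagged Temp in the slide as having no effect.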
Results: READI Clustering
• STRATCOM Force Readiness database: 1,136 instances
[Clustering plot of readiness ratings C=1, C=2, C=3 by category.]

Results: Heuristics Rule Tree
• Six-variable damage expectancy
[Tree diagram branching on variable type (SVbr, ARbr, ARrb, DMbr, DMrb, SVrb), weapon type (conventional/nuclear), intent (destroy, defend, deny, preempt, retaliate), and arena (strategic/tactical), with damage-expectancy percentages and occurrence/miss counts at the leaves.]
• Example rule: IF variable type is DMbr AND weapon type is conventional THEN DMbr = 80%; occurs 8 times and does not occur 5 times

Results: MIDB – 1
• MIDB data table: difficult to see patterns!
  10001001012345,TI5NA,80000,DBAKN12345,000001,KN,A,OPR,40000000N,128000000E,19970101235959,0290,SA-2
  10001001012346,TI5NA,80000,DBAKN12345,000002,KN,A,OPR,39500000N,127500000E,19970101225959,0290,SA-2
  10001001012347,TI5CA,80000,DBAKN12345,000003,KN,A,OPR,39400000N,127400000E,19970101215959,0290,SA-3
  ...
• Rules that determine whether a threat surface-to-air site is operational (OPR) or not (NOP):
  1) if SA-2 and lat <= 39.1, then 3 are NOP
  2) if SA-2 and lat > 39.1, then 9 are OPR
  3) if SA-3 and lat <= 38.5, then 3 are OPR
  4) if SA-3 and lat > 38.5, then 9 are NOP
  5) if SA-13, then 6 are NOP
• MIDB rule tree: easy to see patterns!
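The induced MIDB rules are directly executable, which is part of their appeal over the raw table. A minimal sketch encoding the five rules above, with hypothetical (site type, latitude) inputs rather than actual MIDB rows:

```python
# The five IF-THEN rules for surface-to-air site status, encoded directly.
# Inputs below are illustrative, not actual MIDB records.
def classify(site_type, lat):
    """Return 'OPR' or 'NOP' for a threat surface-to-air site."""
    if site_type == "SA-2":
        return "NOP" if lat <= 39.1 else "OPR"  # rules 1 and 2
    if site_type == "SA-3":
        return "OPR" if lat <= 38.5 else "NOP"  # rules 3 and 4
    if site_type == "SA-13":
        return "NOP"                            # rule 5
    return None  # no rule covers this site type

print(classify("SA-2", 40.0))  # -> OPR
print(classify("SA-3", 39.4))  # -> NOP
```

Because the rule set is just nested conditionals, gaps in coverage (here, any site type other than SA-2/SA-3/SA-13) surface immediately, which is the consistency-and-completeness use noted in the lessons learned.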
Results: MIDB 300 Records
• Rules that determine whether a threat surface-to-air site is operational (OPR) or not (NOP):
  1) if lat > 39.4, then 59 are OPR, 2 aren't
  2) if lat <= 39.4 and lon > 127.2, then 56 are NOP, 2 aren't
  3) if lon <= 127.2 and lat > 39.1, then 31 are OPR, 2 aren't
  4) if lon <= 127.2 and lat <= 39.1, then:
     • SA-2: 30 are NOP
     • SA-3 and lat <= 38.5: 30 are OPR, 1 isn't
     • SA-3 and lat > 38.5: 30 are NOP, 1 isn't
     • SA-13 and lat <= 36.5: 2 are OPR
     • SA-13 and lat > 36.5: 60 are OPR, 6 aren't

Current Work: SUBDUE
• Hierarchical conceptual clustering
  • Structured and unstructured data
  • Clusters attributes with graphs
  • Hypothesis generation
• Example: Subclass 7: SA-13s (29) at (34.3 N, 129.3 E) are not operational

Applicability
• Benefits:
  • Automatically discovers patterns in data
  • Resulting rules are easy to understand in "plain English"
  • Quantifies rules in executable form
  • Explicitly picks out corrupt data, outliers, and exceptions
  • Graphical user interface allows easy understanding
• Example uses:
  • Database validation
  • 3-D sortie deconfliction
  • Determining trends in activity
  • Finding hidden structure and dependencies
  • Creating or modifying belief networks

Lessons Learned
• Data sets: choose one that you understand; this makes cleaning, formatting, default parameter settings, and interpretation much easier.
• Background: knowledge of non-parametric statistics helps determine which patterns are statistically significant.
• Tools: many are just fancy GUIs with database query and plot functionality, and most are overpriced ($100K/seat for high-end tools for mining business data).
• Uses: a new one was discovered in every task, e.g., checking the consistency and completeness of rule sets; may also be useful for organizing textual evidence.
• Algorithms: must provide understandable patterns; some algorithms do not!
• Integration: challenging to interface these inductive and abductive methods with deductive methods such as belief networks.

Summary
• TRW has many technical people in Colorado with data mining experience
  • Hands-on with commercial and academic tools
• Interesting and useful results have been produced
  • Patterns in READI and MIDB found using rule induction algorithms
  • Outliers, corrupt data, and exceptions are flagged
  • Novel uses, such as checking consistency and completeness of rule sets, demonstrated
• Lessons learned have been described
  • Good starting point for future work
  • Challenge is interfacing data mining algorithms with others
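The objectives call for constructing Dempster-Shafer belief networks from mined rules, and the integration challenge noted above is exactly the hand-off from induced rule confidences to belief masses. A minimal sketch of Dempster's rule of combination over a two-hypothesis frame {OPR, NOP}, with illustrative (not mined) mass values:

```python
# Dempster's rule of combination over the frame {OPR, NOP}, sketched to
# show how rule confidences might feed a Dempster-Shafer structure.
# All mass values below are illustrative assumptions.
def combine(m1, m2):
    """Combine two mass functions keyed by frozenset focal elements."""
    combined = {}
    conflict = 0.0
    for a, w1 in m1.items():
        for b, w2 in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + w1 * w2
            else:
                conflict += w1 * w2  # mass assigned to contradictory pairs
    # Normalize out the conflicting mass.
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

OPR, NOP = frozenset({"OPR"}), frozenset({"NOP"})
THETA = OPR | NOP  # the full frame, i.e. ignorance

# Two pieces of evidence, e.g. two induced rules with different confidences.
m1 = {OPR: 0.6, THETA: 0.4}
m2 = {OPR: 0.7, NOP: 0.1, THETA: 0.2}
m = combine(m1, m2)
print(round(m[OPR], 3))  # -> 0.872
```

Combining the two sources raises belief in OPR above either source alone while retaining a small residual mass on NOP and on ignorance, which is the behavior a belief network built from mined rules would rely on.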