Automated Procedures for Improving the Accuracy of Sensor-Based Monitoring Data
Rebecca Buchheit
AIS Lab

Background
• sporadic use of knowledge discovery in databases (KDD) techniques in civil infrastructure
• relative youth of data mining research
• difficult to systematically apply KDD process
• KDD process tools (CRISP-DM) still under development
• KDD process highly domain dependent
• time consuming to teach data mining analysts domain knowledge

Research Objectives
• develop a framework for systematically applying KDD process to civil infrastructure data analysis needs
  – set of guidelines for inexperienced analysts
  – checklist for more experienced analysts
• describe intersection of KDD process characteristics and civil infrastructure
  – what problems are well-suited to KDD?
  – what characteristics are unique to infrastructure?

Summary
• increased data collection => increased need to intelligently analyze data
• KDD process as a “power tool” for analyzing data for high-level knowledge
• civil infrastructure problems are well-suited to data mining but will need to apply entire KDD process to get good results
• proposed framework will help researchers to systematically apply KDD process to their data analysis problems

Data Quality
• What is it?
  – in this talk, “accuracy”
  – how close is the observed value to the true value?
  – “ground truth” is rare
  – look for anomalous patterns
• Why is it important?
  – poor quality data may taint analyses
  – patterns of poor quality data may overwhelm data mining/machine learning algorithms

Mn/ROAD Data
• weigh-in-motion data
  – axle spacings and weights, speed, lane, error codes
• derived quantities courtesy Mn/ROAD
  – equivalent standard axle loads (ESALs)
  – FHWA vehicle type
  – gross vehicle weight
  – total vehicle length
• trucks only (type >= 4)
• Jan 1 ’98 to Dec 31 ’00
• about 3 million vehicles

Sample Data

Overview of Approach
• use statistical analysis and data mining algorithms to separate anomalies from normal data
  – clustering
  – regression
  – physical constraints
  – statistical properties
• focus on differences between anomalies and normal data to help discover causation

Clustering
• group data into “natural classes”
• anomalies separated from normal data
• used Autoclass clustering algorithm

Clustering Results

Regression
∑ESAL = (3.531 ± 0.176) ∑vehicles − (1.252 ± 0.099) ∑axles + (0.066 ± 0.003) ∑GVW − (139.000 ± 79.813)
• confidence interval of 95%
• R-square (fit) = 0.923
• if error > 15% then identify as anomaly

Regression Results

Binary Constraints (1)
constraint                             # violations (3,068,384 total)
offscale hit error                     61,129 (1.99%)
significant weight difference error    11,107 (0.36%)
different axle counts error            69,521 (2.27%)
tailgating                             10,211 (0.33%)
speed >= 64.37 km/h                    51,114 (1.86%)
speed <= 128.74 km/h                   3,723 (0.12%)

Binary Constraints (2)
constraint                             # violations (3,068,384 total)
gross weight <= 45,359 kg              24,897 (0.81%)
length <= 22.86 m                      79,454 (2.59%)
unknown vehicle type                   190,191 (6.20%)
number of axles != 0                   47 (0.00%)
number of axles <= 8                   57,114 (1.86%)

Constraint Interactions
c1                   c2                   % interactions
slow speed           length over limit    63.5%
length over limit    slow speed           45.7%
tailgating           unknown type         31.7%
high speed           unknown type         28.7%
overweight           diff axle counts     25.2%
tailgating           slow speed           21.1%
tailgating           length over limit    15.2%

Distribution Constraints
• use a goodness-of-fit test to compare distributions from the same day of week
  – length
  – gross weight
  – ESALs
  – lane

Anomaly Identification
• identify days with higher than normal concentrations of binary constraint violations
• identify days that are not likely to have come from the baseline distributions for length, ESALs, gross weight and lane

Binary Constraints Results

Distribution Constraints Results

A Quick Refresher
• used four different procedures to detect anomalies
  – clustering
  – regression
  – binary (physical) constraints
  – distribution constraints
• next up
  – what is causing the anomalies?
  – can we fix them?

Gross Vehicle Weight

Lane

What Happened?
• two vehicles traveling slowly and close together (tailgating) may be recorded as a single vehicle
• lightweight vehicles are tailgating cars
  – cars not supposed to be in database
  – mis-classified because of tailgating
  – this causes the “high” vehicle counts
• very heavy vehicles are tailgating trucks
• lane 1 (right-hand side) data is missing for all “low” vehicle count days

Can It Be Fixed? (1)
• removed all tailgating cars
  – lightweight
  – short
  – 2 or 3 axles
  – error code
• “halved” all tailgating trucks
  – very long
  – very heavy
  – more than 9 axles
  – error code

Can It Be Fixed? (2)
• inserted lane 1 vehicles from same time period in 2000
• “shifted” days to make sure day of week was constant
  – Tuesday Sept 8 1998 => Tuesday Sept 5 2000

Summary
• statistical analysis and data mining algorithms can be used to detect systematic anomalies in data
  – focus on differences between anomalies and normal data to help discover causes
  – need domain knowledge to understand causation

Current Progress/Future Work
• integrate algorithms into data quality assessment program == automation
  – physical constraints
  – distribution constraints
  – other statistical characteristics of data
  – clustering
  – regression, neural networks
• will support infrastructure-related data collection activities
• use algorithms to identify and “clean” anomalies

Acknowledgements
• Minnesota Department of Transportation, especially Maggi Chalkline
• based upon work supported by the National Science Foundation, under Grant Numbers 9987871 and DGE 9553380
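
The regression rule from the Regression slide (flag a period when the error exceeds 15%) can be sketched as follows. The coefficients are the point estimates from the slide. The function names, the aggregation period, the units of GVW and the choice to measure error relative to the predicted value are assumptions, since the slides do not state them.

```python
def predicted_esal(n_vehicles: float, n_axles: float, total_gvw: float) -> float:
    """Predicted summed ESALs from the fitted regression (point estimates only)."""
    return 3.531 * n_vehicles - 1.252 * n_axles + 0.066 * total_gvw - 139.000


def is_regression_anomaly(observed_esal: float, n_vehicles: float,
                          n_axles: float, total_gvw: float,
                          tolerance: float = 0.15) -> bool:
    """Flag the period if the relative error exceeds the tolerance (15% on the slide).

    Assumption: error is |observed - predicted| / |predicted|.
    """
    predicted = predicted_esal(n_vehicles, n_axles, total_gvw)
    if predicted == 0:
        raise ValueError("relative error is undefined when the prediction is zero")
    relative_error = abs(observed_esal - predicted) / abs(predicted)
    return relative_error > tolerance
```

With hypothetical totals of 1000 vehicles, 4000 axles and a summed GVW of 30000, the prediction is about 364 ESALs, so an observed total of 500 would be flagged and an observed total of 400 would not.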
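
The threshold-based rows of the Binary Constraints slides can be written as predicates that a valid record should satisfy. This is a minimal sketch: the thresholds come from the slides, while the record field names and the helper functions are hypothetical. The `interaction_rate` function reflects one plausible reading of "% interactions" (the share of records violating c1 that also violate c2), which the slides do not define.

```python
from typing import Callable, Dict, List, Sequence

Record = Dict[str, float]

# Each entry is a condition a valid record should satisfy.
# Thresholds are from the slides; field names are hypothetical.
CONSTRAINTS: Dict[str, Callable[[Record], bool]] = {
    "speed >= 64.37 km/h": lambda r: r["speed_kmh"] >= 64.37,
    "speed <= 128.74 km/h": lambda r: r["speed_kmh"] <= 128.74,
    "gross weight <= 45,359 kg": lambda r: r["gross_weight_kg"] <= 45359,
    "length <= 22.86 m": lambda r: r["length_m"] <= 22.86,
    "number of axles != 0": lambda r: r["n_axles"] != 0,
    "number of axles <= 8": lambda r: r["n_axles"] <= 8,
}


def violations(record: Record) -> List[str]:
    """Names of all constraints that the record fails, in table order."""
    return [name for name, holds in CONSTRAINTS.items() if not holds(record)]


def interaction_rate(records: Sequence[Record], c1: str, c2: str) -> float:
    """Share of records violating c1 that also violate c2 (assumed definition)."""
    c1_hits = [r for r in records if not CONSTRAINTS[c1](r)]
    if not c1_hits:
        return 0.0
    both = sum(1 for r in c1_hits if not CONSTRAINTS[c2](r))
    return both / len(c1_hits)
```

Because the rate is conditioned on c1, it is not symmetric, which matches the different percentages the Constraint Interactions slide reports for (slow speed, length over limit) and (length over limit, slow speed).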
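
The Distribution Constraints slide does not name the goodness-of-fit test that was used. As an illustration only, the sketch below uses Pearson's chi-square statistic to compare one day's binned counts (for example, vehicle lengths) against a baseline histogram built from the same day of week. The function name and the binning are assumptions.

```python
from typing import Sequence


def chi_square_statistic(observed: Sequence[float],
                         baseline: Sequence[float]) -> float:
    """Pearson chi-square statistic of observed bin counts against a baseline.

    The baseline is rescaled to the observed total, so it can be given as
    counts or as proportions. Every baseline bin must be positive.
    """
    if len(observed) != len(baseline):
        raise ValueError("observed and baseline must have the same number of bins")
    total_observed = sum(observed)
    total_baseline = sum(baseline)
    statistic = 0.0
    for obs, base in zip(observed, baseline):
        expected = total_observed * base / total_baseline
        if expected <= 0:
            raise ValueError("every baseline bin must have a positive expected count")
        statistic += (obs - expected) ** 2 / expected
    return statistic
```

For four bins there are three degrees of freedom, and the 5% critical value of the chi-square distribution is 7.815. A day with counts [40, 20, 20, 20] against a uniform baseline gives a statistic of 12.0 and would be flagged as unlikely to come from the baseline.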
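
The day "shift" on the Can It Be Fixed? (2) slide (Tuesday Sept 8 1998 => Tuesday Sept 5 2000) is consistent with picking the date in the donor year that is nearest to the same calendar date and falls on the same weekday. The sketch below implements that reading. The function name is hypothetical and the rule itself is an assumption, since the slides only give the one example.

```python
from datetime import date, timedelta


def matching_weekday(day: date, target_year: int) -> date:
    """Nearest date to day's month/day in target_year with the same weekday.

    Note: date.replace raises ValueError for Feb 29 when target_year is not
    a leap year, and the result can spill into a neighbouring year for dates
    within three days of Jan 1 or Dec 31.
    """
    anchor = day.replace(year=target_year)
    # Days to move forward (0..6) to reach the same weekday as `day`.
    offset = (day.weekday() - anchor.weekday()) % 7
    # Prefer moving backward when that is the shorter shift.
    if offset > 3:
        offset -= 7
    return anchor + timedelta(days=offset)
```

Calling `matching_weekday(date(1998, 9, 8), 2000)` returns `date(2000, 9, 5)`, which reproduces the example on the slide: both dates are Tuesdays and they are exactly 104 weeks apart.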