Data Mining in Civil Infrastructure

Automated Procedures for Improving the Accuracy of Sensor-Based Monitoring Data
Rebecca Buchheit
AIS Lab

Background
• sporadic use of KDD techniques in civil infrastructure
• relative youth of data mining research
• difficult to systematically apply KDD process
• KDD process tools (CRISP-DM) still under development
• KDD process highly domain dependent
• time-consuming to teach data mining analysts domain knowledge

Research Objectives
• develop a framework for systematically applying KDD process to civil infrastructure data analysis needs
  – set of guidelines for inexperienced analysts
  – checklist for more experienced analysts
• describe intersection of KDD process characteristics and civil infrastructure
  – what problems are well-suited to KDD?
  – what characteristics are unique to infrastructure?

Summary
• increased data collection => increased need to intelligently analyze data
• KDD process as a “power tool” for analyzing data for high-level knowledge
• civil infrastructure problems are well-suited to data mining but will need to apply entire KDD process to get good results
• proposed framework will help researchers to systematically apply KDD process to their data analysis problems

Data Quality
• What is it?
  – in this talk, “accuracy”
  – how close is the observed value to the true value?
  – “ground truth” is rare
  – look for anomalous patterns
• Why is it important?
  – poor quality data may taint analyses
  – patterns of poor quality data may overwhelm data mining/machine learning algorithms

Mn/ROAD Data
• weigh-in-motion data
  – axle spacings and weights, speed, lane, error codes
• derived quantities courtesy Mn/ROAD
  – equivalent standard axle loads (ESALs)
  – FHWA vehicle type
  – gross vehicle weight
  – total vehicle length
• trucks only (type >= 4)
• Jan 1 ’98 to Dec 31 ’00
• about 3 million vehicles

Sample Data

Overview of Approach
• use statistical analysis and data mining algorithms to separate anomalies from normal data
  – clustering
  – regression
  – physical constraints
  – statistical properties
• focus on differences between anomalies and normal data to help discover causation

Clustering
• group data into “natural classes”
• anomalies separated from normal data
• used Autoclass clustering algorithm

Clustering Results

Regression
∑ESAL = (3.531 ± 0.176) ∑vehicles - (1.252 ± 0.099) ∑axles + (0.066 ± 0.003) ∑GVW - (139.000 ± 79.813)
• 95% confidence interval
• R-squared (fit) = 0.923
• if error > 15% then identify as anomaly

Regression Results

Binary Constraints (1)
constraint                            # violations (3,068,384 total)
offscale hit error                    61,129 (1.99%)
significant weight difference error   11,107 (0.36%)
different axle counts error           69,521 (2.27%)
tailgating                            10,211 (0.33%)
speed >= 64.37 km/h                   51,114 (1.86%)
speed <= 128.74 km/h                  3,723 (0.12%)

Binary Constraints (2)
constraint                            # violations (3,068,384 total)
gross weight <= 45,359 kg             24,897 (0.81%)
length <= 22.86 m                     79,454 (2.59%)
unknown vehicle type                  190,191 (6.20%)
number of axles != 0                  47 (0.00%)
number of axles <= 8                  57,114 (1.86%)

Constraint Interactions
c1                  c2                  % interactions
slow speed          length over limit   63.5%
length over limit   slow speed          45.7%
tailgating          unknown type        31.7%
high speed          unknown type        28.7%
overweight          diff axle counts    25.2%
tailgating          slow speed          21.1%
tailgating          length over limit   15.2%

Distribution Constraints
• use a goodness-of-fit test to compare distributions from the same day of week
  – length
  – gross weight
  – ESALs
  – lane

Anomaly Identification
• identify days with higher than normal concentrations of binary constraint violations
• identify days that are not likely to have come from the baseline distributions for length, ESALs, gross weight and lane

Binary Constraints Results

Distribution Constraints Results

A Quick Refresher
• used four different procedures to detect anomalies
  – clustering
  – regression
  – binary (physical) constraints
  – distribution constraints
• next up
  – what is causing the anomalies?
  – can we fix them?

Gross Vehicle Weight

Lane

What Happened?
• two vehicles traveling slowly and close together (tailgating) may be recorded as a single vehicle
• lightweight vehicles are tailgating cars
  – cars not supposed to be in database
  – mis-classified because of tailgating
  – this causes the “high” vehicle counts
• very heavy vehicles are tailgating trucks
• lane 1 (right-hand side) data is missing for all “low” vehicle count days

Can It Be Fixed? (1)
• removed all tailgating cars
  – lightweight
  – short
  – 2 or 3 axles
  – error code
• “halved” all tailgating trucks
  – very long
  – very heavy
  – more than 9 axles
  – error code

Can It Be Fixed? (2)
• inserted lane 1 vehicles from same time period in 2000
• “shifted” days to make sure day of week was constant
  – Tuesday Sept 8 1998 => Tuesday Sept 5 2000

Summary
• statistical analysis and data mining algorithms can be used to detect systematic anomalies in data
  – focus on differences between anomalies and normal data to help discover causation
  – need domain knowledge to understand causation

Current Progress/Future Work
• integrate algorithms into data quality assessment program == automation
  – physical constraints
  – distribution constraints
  – other statistical characteristics of data
  – clustering
  – regression, neural networks
• will support infrastructure-related data collection activities
• use algorithms to identify and “clean” anomalies

Acknowledgements
• Minnesota Department of Transportation, especially Maggi Chalkline
• based upon work supported by the National Science Foundation, under Grant Numbers 9987871 and DGE 9553380
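
The Binary Constraints and Constraint Interactions slides describe per-vehicle physical checks and how often two violations occur together. The sketch below is one possible way to express that idea, not the talk's actual implementation. The record field names (`speed_kmh`, `gross_weight_kg`, `length_m`, `axles`) and both helper functions are hypothetical. Reading "% interactions" as the share of c1 violators that also violate c2 is an assumption. Only the numeric limits come from the slides.

```python
from collections import Counter
from itertools import permutations

# Each entry is a condition a valid record should satisfy; the limits are the
# ones listed on the Binary Constraints slides. Field names are hypothetical.
CONSTRAINTS = {
    "slow speed": lambda v: v["speed_kmh"] >= 64.37,
    "high speed": lambda v: v["speed_kmh"] <= 128.74,
    "overweight": lambda v: v["gross_weight_kg"] <= 45_359,
    "length over limit": lambda v: v["length_m"] <= 22.86,
    "zero axles": lambda v: v["axles"] != 0,
    "too many axles": lambda v: v["axles"] <= 8,
}


def violations(vehicle):
    """Return the names of all constraints this record fails."""
    return {name for name, holds in CONSTRAINTS.items() if not holds(vehicle)}


def interaction_rates(vehicles):
    """Map each ordered pair (c1, c2) to the fraction of c1 violators
    that also violate c2 (assumed meaning of "% interactions")."""
    single = Counter()  # violations per constraint
    joint = Counter()   # co-occurring violations per ordered pair
    for vehicle in vehicles:
        broken = violations(vehicle)
        single.update(broken)
        joint.update(permutations(sorted(broken), 2))
    return {(c1, c2): n / single[c1] for (c1, c2), n in joint.items()}
```

Because the rate is conditioned on c1, the pair (slow speed, length over limit) and its reverse can differ, as they do in the slide's table.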

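The Regression slide's rule ("if error > 15% then identify as anomaly") can be sketched as follows. This is a minimal illustration under stated assumptions, not the talk's actual procedure. It uses only the point estimates from the slide and ignores the ± terms. The slides do not say what "error" is measured against, so the sketch assumes it is relative to the observed ESAL sum. The units of GVW and the aggregation period are also not given, and the function names are hypothetical.

```python
# Point estimates from the Regression slide; the ± ranges are not used here.
B_VEHICLES = 3.531
B_AXLES = -1.252
B_GVW = 0.066
INTERCEPT = -139.000
THRESHOLD = 0.15  # "if error > 15% then identify as anomaly"


def predicted_esal(n_vehicles, n_axles, total_gvw):
    """Predicted summed ESALs for one aggregation period."""
    return (B_VEHICLES * n_vehicles
            + B_AXLES * n_axles
            + B_GVW * total_gvw
            + INTERCEPT)


def is_anomaly(observed_esal, n_vehicles, n_axles, total_gvw):
    """Flag the period when the relative error exceeds the threshold.

    Assumption: the error is taken relative to the observed value.
    """
    predicted = predicted_esal(n_vehicles, n_axles, total_gvw)
    if observed_esal == 0:
        # Relative error is undefined; flag unless the prediction is also zero.
        return predicted != 0
    return abs(observed_esal - predicted) / abs(observed_esal) > THRESHOLD
```

With made-up inputs (not Mn/ROAD data) of 1000 vehicles, 4000 axles and a GVW sum of 50000, the prediction is about 1684. An observed sum of 1700 would then pass, while 2500 would be flagged.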