Sensors & Knowledge Discovery
(a.k.a. Data Mining)
H. Scott Matthews
April 6, 2004
(originally presented by Rebecca Buchheit)
Civil and Environmental Engineering
Carnegie Mellon University
Recap
Sensors - what are they?
Sensor Networks - how they help us
Sensor Signal Acquisition and Use
Next - how to use the data!
Life Cycles of Sensor Networks
Currently, sensors and sensor systems are fairly proprietary
e.g. a ‘Johnson Controls’ HVAC sensor system uses only their equipment
Need to design more robust networks that are standards-driven and open
Life Cycles (2)
In addition, sensor networks tend to have very short ‘lifetimes’
i.e. we build one, use it for a few years, and then replace it with a newer/better one
Need to plan for, and design architectures for, sensor networks that will last the life of the infrastructure we are monitoring
e.g. 50-100 years for bridges (to manage life-cycle cost, LCC)
A Knowledge Discovery Framework for Civil Infrastructure Contexts
Rebecca Buchheit
Department of Civil and Environmental Engineering
Carnegie Mellon University
Motivation
• condition and usage patterns of critical infrastructure attracting increased attention
• deteriorating infrastructure + cheap data collection methods = health monitoring, transportation management, other data-intensive civil infrastructure techniques
Motivation
• amount of data, relationships between attributes, context-sensitivity, observational collection methods => data mining and knowledge discovery in databases (KDD) process
• our ability to collect data far outstrips our ability to analyze and understand the data at a high level of abstraction
Databases + Statistics + Machine Learning = Data Mining
[Figure: diagram labeled statistics, databases, machine learning, and data mining]
Definitions
Data Mining
• algorithms to extract patterns from large data sets
Knowledge Discovery in Databases
• “... the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.” [Fayyad et al.]
• Uses observational, not controlled, data
Knowledge Discovery Process
Steps
domain understanding
data understanding
data preparation
data modeling (a.k.a. “data mining”)
results evaluation
deployment
CRISP-DM
CRoss-Industry Standard Process for Data Mining
high-level, hierarchical, iterative process model for KDD
provides framework for applying KDD consistently
Domain Understanding
evaluate fit between KDD and the problem
• how much data?
• what type of data?
• perceived quality of data?
• what is being measured?
• right data to answer the question?
• organizational support?
Data Understanding
summary statistics
plotting and visualization
missing values
• randomly missing
• influenced by a measured factor
• influenced by an unmeasured factor
evaluate quality of existing data
• what is “good” data?
• what do we do with “bad” data?
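A minimal sketch of the first items on this slide, summary statistics and a count of missing values, using only the Python standard library. The readings and the `summarize` helper are hypothetical illustrations, not data or code from the original study.

```python
import statistics

# Hypothetical hourly temperature readings (deg F); None marks a missing value.
readings = [68.0, 70.5, None, 71.0, 69.5, None, 72.0]

def summarize(values):
    """Return basic summary statistics and the number of missing entries."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "min": min(present),
        "max": max(present),
    }

summary = summarize(readings)
print(summary)
```

Counting the missing entries says nothing about why they are missing; deciding between the three cases listed above still takes domain knowledge.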
Data Preparation
most time-consuming part of KDD
data selection
• which records (“rows”) to use
• which attributes (“columns”) to use
data cleaning
• do something to bad and missing data
integrate data from different sources
transform data
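The preparation steps above can be sketched roughly as follows. The record layout, the field names, and the valid temperature range are all assumptions made for illustration, and dropping bad rows is only one of several possible ways to "do something" to bad and missing data.

```python
# Hypothetical raw sensor records; field names are illustrative only.
raw = [
    {"sensor": "A", "temp_f": 68.0, "wind_mph": 5.0, "note": "ok"},
    {"sensor": "A", "temp_f": None, "wind_mph": 7.0, "note": "dropout"},
    {"sensor": "B", "temp_f": 500.0, "wind_mph": 3.0, "note": "spike"},
    {"sensor": "B", "temp_f": 50.0, "wind_mph": 4.0, "note": "ok"},
]

def prepare(records, columns, valid_range):
    """Select columns, drop missing or out-of-range rows, convert deg F to deg C."""
    low, high = valid_range
    cleaned = []
    for rec in records:
        temp = rec["temp_f"]
        # Data cleaning: this sketch simply drops bad and missing rows.
        if temp is None or not (low <= temp <= high):
            continue
        # Data selection: keep only the requested attributes ("columns").
        row = {c: rec[c] for c in columns}
        # Data transformation: add the temperature in deg C.
        row["temp_c"] = (temp - 32.0) * 5.0 / 9.0
        cleaned.append(row)
    return cleaned

prepared = prepare(raw, ["sensor", "wind_mph"], (-40.0, 130.0))
print(prepared)
```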
Data Modeling/Data Mining
choose an algorithm
• choose parameters for that algorithm
• apply algorithm to data
• evaluate results
– predictive accuracy
– descriptive coverage
• repeat as necessary
repeat as necessary
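The inner loop on this slide (choose parameters, apply, evaluate, repeat) might look like the sketch below. The one-parameter threshold "algorithm", the labeled data, and the candidate values are hypothetical; for brevity the sketch scores accuracy on the same data it tunes on, whereas a real study would normally evaluate on held-out data.

```python
# Hypothetical labeled data: (measurement, is_damaged).
data = [(0.2, False), (0.4, False), (0.5, False),
        (0.7, True), (0.9, True), (1.1, True)]

def accuracy(threshold, samples):
    """Predictive accuracy of the rule 'damaged if measurement > threshold'."""
    correct = sum((x > threshold) == label for x, label in samples)
    return correct / len(samples)

def tune(samples, candidates):
    """Try each candidate parameter value and keep the most accurate one."""
    best_t, best_acc = None, -1.0
    for t in candidates:            # choose parameters
        acc = accuracy(t, samples)  # apply the algorithm and evaluate results
        if acc > best_acc:          # repeat as necessary, keeping the best so far
            best_t, best_acc = t, acc
    return best_t, best_acc

best_threshold, best_accuracy = tune(data, [0.3, 0.6, 0.8])
print(best_threshold, best_accuracy)
```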
Data Mining Goals
Prediction
• predict the value of one or more variables based on the values of other variables
Description
• describe the data set in a compact, human-understandable form
Data Mining Tasks
• Classification
• Regression
• Clustering
• Deviation detection
• Summarization
• Dependency modeling
Classification
learn how to classify data items into predefined groups
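As one possible illustration, the sketch below uses a nearest-centroid classifier, chosen here only because it is short; the slides do not name a specific algorithm. The features (crack width and deflection, in mm) and the class labels are hypothetical.

```python
def train_centroids(samples):
    """Average the feature vectors of each predefined class."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def classify(centroids, features):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def dist2(center):
        return sum((a - b) ** 2 for a, b in zip(center, features))
    return min(centroids, key=lambda lab: dist2(centroids[lab]))

# Hypothetical training items: (crack width mm, deflection mm) -> condition class.
training = [
    ((0.1, 1.0), "good"), ((0.2, 1.2), "good"),
    ((0.9, 3.0), "poor"), ((1.1, 3.4), "poor"),
]
centroids = train_centroids(training)
print(classify(centroids, (0.3, 1.5)))
print(classify(centroids, (0.8, 2.9)))
```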
Regression
map a real-valued dependent variable to one or more independent variables
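A minimal sketch for the single-variable case, using ordinary least squares to fit a straight line. The data points and the `fit_line` helper are hypothetical; the slides do not specify a fitting method.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```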
Clustering
learn “natural” classes or clusters of data
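One common way to find such clusters is k-means; the sketch below is a one-dimensional version with fixed starting centers so that the result is repeatable. The algorithm choice, the values, and the starting centers are assumptions for illustration only.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on scalar values, starting from the given centers."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each value in the group of its nearest center.
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Update step: move each center to its group mean (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Hypothetical measurements that fall into two obvious groups.
values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centers, groups = kmeans_1d(values, centers=[0.0, 6.0])
print(centers)
print(groups)
```

Unlike classification, no group labels are supplied; the only inputs are the data and the number of clusters implied by the starting centers.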
Deviation Detection
detect changes or deviations from “normal” or baseline state
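A simple sketch of this idea flags new readings that lie more than a chosen number of standard deviations from the baseline mean. The baseline values, the new readings, and the limit of three standard deviations are assumptions, not values from the slides.

```python
import statistics

def find_deviations(baseline, new_values, z_limit=3.0):
    """Return new values farther than z_limit sample standard deviations from the baseline mean."""
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    return [v for v in new_values if abs(v - mean) > z_limit * stdev]

# Hypothetical baseline ("normal") readings and later readings to check.
baseline = [10.0, 10.2, 9.8, 10.1, 9.9]
new_readings = [10.1, 9.7, 12.5]
flagged = find_deviations(baseline, new_readings)
print(flagged)
```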
Summarization
summarize subsets of data set
computer industry: mean salary = $65k
service industry: mean salary = $20k
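The salary example can be reproduced with a group-wise mean, as in the sketch below. The individual records are invented so that the group means match the slide's figures; they are not real data.

```python
from collections import defaultdict

def mean_by_group(records, group_key, value_key):
    """Compute the mean of value_key for each distinct value of group_key."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [running sum, count]
    for rec in records:
        entry = totals[rec[group_key]]
        entry[0] += rec[value_key]
        entry[1] += 1
    return {group: total / n for group, (total, n) in totals.items()}

# Hypothetical records chosen to match the slide's summary values.
records = [
    {"industry": "computer", "salary": 60000},
    {"industry": "computer", "salary": 70000},
    {"industry": "service", "salary": 18000},
    {"industry": "service", "salary": 22000},
]
means = mean_by_group(records, "industry", "salary")
print(means)
```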
Dependency Modeling
learn relationships between attributes or between items in the data set
• pattern recognition
• time series analysis
• association rules
In 80% of the cases, an engineer with a PE and 10 years experience is a project manager.
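An association rule such as the one above is usually reported with its support and confidence; the "80% of the cases" figure corresponds to the confidence. The sketch below computes both on invented records built so that the confidence comes out at 80%; the field names and the "10 or more years" reading of the rule are assumptions.

```python
def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) for the rule 'IF antecedent THEN consequent'."""
    matches = [r for r in records if antecedent(r)]
    both = [r for r in matches if consequent(r)]
    support = len(both) / len(records)
    confidence = len(both) / len(matches) if matches else 0.0
    return support, confidence

# Hypothetical personnel records (10 engineers in total).
engineers = (
    [{"pe": True, "years": 12, "role": "project manager"}] * 4
    + [{"pe": True, "years": 15, "role": "designer"}]
    + [{"pe": False, "years": 11, "role": "designer"}] * 2
    + [{"pe": True, "years": 3, "role": "designer"}] * 2
    + [{"pe": False, "years": 2, "role": "project manager"}]
)

support, confidence = rule_stats(
    engineers,
    antecedent=lambda r: r["pe"] and r["years"] >= 10,
    consequent=lambda r: r["role"] == "project manager",
)
print(support, confidence)
```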
Data Mining in the Intelligent Workplace (IW)
concept description using classification
environmental conditions affect hot water energy consumption
• used outside temperature, solar radiation and wind speed
• solar radiation and wind speed not significant above 80F and below 50F
• IF temperature between 20F and 30F THEN energy usage between 47,393 kJ and 131,875 kJ
• describes >50% of the instances in that energy usage range
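The last bullet can be read as the rule's descriptive coverage: of all instances whose energy usage falls in the stated range, the share that also satisfies the temperature condition. That reading, the `rule_coverage` helper, and the instances below are assumptions for illustration; only the temperature and energy bounds come from the slide.

```python
# Hypothetical (outside temperature in deg F, hot water energy in kJ) instances.
instances = [
    (22.0, 50_000.0),
    (25.0, 90_000.0),
    (28.0, 130_000.0),
    (45.0, 60_000.0),
    (24.0, 20_000.0),
    (70.0, 10_000.0),
]

def rule_coverage(data, temp_range, energy_range):
    """Share of instances in the energy range that the IF condition also matches."""
    t_lo, t_hi = temp_range
    e_lo, e_hi = energy_range
    in_class = [(t, e) for t, e in data if e_lo <= e <= e_hi]
    covered = [(t, e) for t, e in in_class if t_lo <= t <= t_hi]
    return len(covered) / len(in_class) if in_class else 0.0

coverage = rule_coverage(instances, (20.0, 30.0), (47_393.0, 131_875.0))
print(coverage)
```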
Results Evaluation
do results meet client’s criteria?
novel?
understandable?
valid (modeling phase)?
useful?
Results Deployment
explain results to client
improvements to data collection?
ongoing process applied to new data?
Benefits of KDD
Intelligent Workplace
• confirmation that system is (not) working
• continue to monitor control system
• in future, predict missing values to complete energy studies
Apply Data Mining to
Civil Infrastructure?
• civil infrastructure meets guidelines for selecting potential data mining problems
• significant impact
• no good alternatives exist
• prior/domain knowledge
• effects of noisy data are mitigated
• sufficient data
• relevant attributes are being measured
Background
• sporadic use of KDD techniques in civil infrastructure
• relative youth of data mining research
• difficult to systematically apply KDD process
• KDD process tools (CRISP-DM) still under development
• KDD process highly domain dependent
• time consuming to teach data mining analysts domain knowledge
Research Objectives
• develop a framework for systematically applying KDD process to civil infrastructure data analysis needs
• set of guidelines for inexperienced analysts
• checklist for more experienced analysts
• describe intersection of KDD process characteristics and civil infrastructure
• what problems are well-suited to KDD?
• what characteristics are unique to infrastructure?
Summary
• increased data collection => increased need to intelligently analyze data
• KDD process as a “power tool” for analyzing data for high-level knowledge
• civil infrastructure problems are well-suited to data mining but will need to apply entire KDD process to get good results
• proposed framework will help researchers to systematically apply KDD process to their data analysis problems