Download Data Mining: My view - Department of Computer Science

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

The Measure of a Man (Star Trek: The Next Generation) wikipedia , lookup

Pattern recognition wikipedia , lookup

Data (Star Trek) wikipedia , lookup

Time series wikipedia , lookup

Transcript
NTNU
Data Mining: My view
Helge Langseth
[email protected]
Group of Statistics,
Dept. of Mathematical Sciences
Norwegian University of Science and Technology
Data Mining: My view – p.1/12
NTNU
Contents
What is Data Mining
Representation
Complexity problems
Examples
Conclusions
Data Mining: My view – p.2/12
NTNU
What is Data Mining
Data Mining is the automatization of the process of
finding interesting patterns in huge datasets.
Huge datasets: Enormous number of objects; each
having a comprehensive description
Interesting patterns: The patterns must be
represented, validated, and evaluated
Automatization: The patterns are to be discovered
in a (semi-)automatic way
Data Mining: My view – p.3/12
NTNU
What is Data Mining
Data Mining is the automatization of the process of
finding interesting patterns in huge datasets.
Validation: User initiated. “Is it true that . . . ”
Exploration: Data-driven. “See what we find . . . ”
Predictive ⇐ All examples are like this
Explanation
Data Mining: My view – p.3/12
NTNU
Data Mining is interdisciplinary
Statistics
Other
things . . .
Machine learning
Artificial
intelligence
DATA MINING
Computer
science
Databasetechnology
Domainknowledge
Data Mining: My view – p.4/12
NTNU
Example: Stock market
Huge amounts of historic data available:
Prizes of stocks
Other relevant quantities:
Interest rates
Exchange rates
Oil prize
...
Can vi learn interesting patterns from the dataset
that can help us decide which stocks that are
“good buys”?
Data Mining: My view – p.5/12
NTNU
Example: Stock market
Data Mining: My view – p.5/12
NTNU
Example: Stock market
Data Mining: My view – p.5/12
NTNU
The Data Mining process
Database
Stock prizes
Interest rates
Exchange rates
Oil prize
...
Data Mining
Stock prizes
Interest rates
Exchange rates
Oil prize
...
Decision rules
Buy?
Data Mining: My view – p.6/12
NTNU
The Data Mining process
Other examples:
The analysis of text (finding
relevant information, e.g.,
personalized news, . . . )
Recommender system: Find
books that fits my taste
(amazon.com)
Examine medical images: Finding cancer
Online monitoring of technical equipment to predict failures
...
Data Mining: My view – p.6/12
NTNU
Representation
The patterns must be represented in the computer
Lots of different representations are available, like
if – then – else rules
Neural networks
Fuzzy rules
Probability distributions
Kohonen maps
Decision trees
Instance-based methods
...
I have my own favourite as well. . .
Data Mining: My view – p.7/12
NTNU
Bayesian networks
Battery
Fuel
Headlights
Fuel Gauge
Engine turns
Nodes: Attributes of the problem
Arcs: “Cause-and-effect-structures”
Car starts
Well-defined syntax
Well-defined semantics
Data Mining: My view – p.8/12
NTNU
Main challenge: Complexity
We are working with large datasets:
Many attributes (p > 102 )
Many records (N > 106 )
Problems with computational complexity will
dominate:
The “classic” problems are N P hard
We will typically not look at algorithms of
complexity higher than O(N · p3 )
We need to make some approximations:
Look at sub-classes of models
Heuristic search
Data Mining: My view – p.9/12
NTNU
Example: Classification
Classify students as either boy or girl based on p
college-grades
The number of possible model structures grows
hyper-exponential in p:
p
1
2
3
4
5
# structures
3
25
543
29, 281
3, 781, 503
10
3 · 1022
We have to make assumptions to reduce the size of the
search-space
Data Mining: My view – p.10/12
NTNU
Example: Classification
1
Best modell
0.8
0.6
0.4
0.2
0
Best search-space?
Data Mining: My view – p.10/12
NTNU
Example: Classification
No assumptions:
2
∼ O 3p structures. Irregular search-space
makes the heuristic search difficult
Naïve Bayes:
O(1) structures. Direct calculation, “safe” choice
Tree Augmented Naïve Bayes:
O(p2 ) structures. Direct calculation
Hierarchical Naïve Bayes:
O(p!) structures. Regular search-space;
heuristic search reasonably efficient
Data Mining: My view – p.10/12
40
21_FT0151:value [m3/h]
21_FT0101:value [m3/h]
440
420
400
380
360
340
0
1000
2000
3000
4000
5000
6000
7000
8000
9000
21_FT0451:value [m3/h]
21_FT0401:value [m3/h]
180
160
140
0
1000
2000
3000
4000
5000
6000
7000
8000
9000
21_PDY0103:value [barg]
21_FT1000:value [m3/h]
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
200
180
160
1.38
400
380
360
0
1000
2000
3000
4000
5000
6000
7000
8000
9000
1.36
1.34
1.32
1.3
1.28
1.26
10000
21
21_PST0104:value [barg]
1.4
1.35
1.3
1.25
0
220
140
10000
420
340
0
−20
240
200
120
20
−40
10000
220
21_PDY0153:value [barg]
NTNU
Example: Analysis of sensor-data
0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
20.5
20
19.5
Data Mining: My view – p.11/12
NTNU
Example: Analysis of sensor-data
Desired functionality:
On-line monitoring of the process system
Early warning of critical failures
Optimize maintenance by identifying
“unimportant” failures
Available data:
Thousands of sensors
Seansors read with frequency 1Hz
Data Mining: My view – p.11/12
NTNU
Example: Analysis of sensor-data
Difficult because:
We lack system knowledge
Knowledge about the overall process lacking
We don’t know all failure modes, and not the
signatures of unseer failures
Complexity problems:
Failures cannot be detected by considering a single
sensor; must look at several dimensions evolving
over time
Calculations to be done in “real time”.
Data Mining: My view – p.11/12
The Task Reality
NTNU
Conclusions
Data Mining: My view – p.12/12
NTNU
Conclusions
Data Mining is the automatization of the process
of finding interesting patterns in huge datasets.
The field borrows concepts and ideas from other
scientific subjects. Major challenges in
— computer science (DB management, AI/ML)
— traditional statistics
Data Mining is among the fastest increasing
business technologies in the world (wrt. turnover)
Data Mining applications are becoming integral
parts of our daily lives
Data Mining: My view – p.12/12