Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
NTNU Data Mining: My view Helge Langseth [email protected] Group of Statistics, Dept. of Mathematical Sciences Norwegian University of Science and Technology Data Mining: My view – p.1/12 NTNU Contents What is Data Mining Representation Complexity problems Examples Conclusions Data Mining: My view – p.2/12 NTNU What is Data Mining Data Mining is the automatization of the process of finding interesting patterns in huge datasets. Huge datasets: Enormous number of objects; each having a comprehensive description Interesting patterns: The patterns must be represented, validated, and evaluated Automatization: The patterns are to be discovered in a (semi-)automatic way Data Mining: My view – p.3/12 NTNU What is Data Mining Data Mining is the automatization of the process of finding interesting patterns in huge datasets. Validation: User initiated. “Is it true that . . . ” Exploration: Data-driven. “See what we find . . . ” Predictive ⇐ All examples are like this Explanation Data Mining: My view – p.3/12 NTNU Data Mining is interdisciplinary Statistics Other things . . . Machine learning Artificial intelligence DATA MINING Computer science Databasetechnology Domainknowledge Data Mining: My view – p.4/12 NTNU Example: Stock market Huge amounts of historic data available: Prizes of stocks Other relevant quantities: Interest rates Exchange rates Oil prize ... Can vi learn interesting patterns from the dataset that can help us decide which stocks that are “good buys”? Data Mining: My view – p.5/12 NTNU Example: Stock market Data Mining: My view – p.5/12 NTNU Example: Stock market Data Mining: My view – p.5/12 NTNU The Data Mining process Database Stock prizes Interest rates Exchange rates Oil prize ... Data Mining Stock prizes Interest rates Exchange rates Oil prize ... Decision rules Buy? Data Mining: My view – p.6/12 NTNU The Data Mining process Other examples: The analysis of text (finding relevant information, e.g., personalized news, . . . ) Recommender system: Find books that fits my taste (amazon.com) Examine medical images: Finding cancer Online monitoring of technical equipment to predict failures ... Data Mining: My view – p.6/12 NTNU Representation The patterns must be represented in the computer Lots of different representations are available, like if – then – else rules Neural networks Fuzzy rules Probability distributions Kohonen maps Decision trees Instance-based methods ... I have my own favourite as well. . . Data Mining: My view – p.7/12 NTNU Bayesian networks Battery Fuel Headlights Fuel Gauge Engine turns Nodes: Attributes of the problem Arcs: “Cause-and-effect-structures” Car starts Well-defined syntax Well-defined semantics Data Mining: My view – p.8/12 NTNU Main challenge: Complexity We are working with large datasets: Many attributes (p > 102 ) Many records (N > 106 ) Problems with computational complexity will dominate: The “classic” problems are N P hard We will typically not look at algorithms of complexity higher than O(N · p3 ) We need to make some approximations: Look at sub-classes of models Heuristic search Data Mining: My view – p.9/12 NTNU Example: Classification Classify students as either boy or girl based on p college-grades The number of possible model structures grows hyper-exponential in p: p 1 2 3 4 5 # structures 3 25 543 29, 281 3, 781, 503 10 3 · 1022 We have to make assumptions to reduce the size of the search-space Data Mining: My view – p.10/12 NTNU Example: Classification 1 Best modell 0.8 0.6 0.4 0.2 0 Best search-space? Data Mining: My view – p.10/12 NTNU Example: Classification No assumptions: 2 ∼ O 3p structures. Irregular search-space makes the heuristic search difficult Naïve Bayes: O(1) structures. Direct calculation, “safe” choice Tree Augmented Naïve Bayes: O(p2 ) structures. Direct calculation Hierarchical Naïve Bayes: O(p!) structures. Regular search-space; heuristic search reasonably efficient Data Mining: My view – p.10/12 40 21_FT0151:value [m3/h] 21_FT0101:value [m3/h] 440 420 400 380 360 340 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 21_FT0451:value [m3/h] 21_FT0401:value [m3/h] 180 160 140 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 21_PDY0103:value [barg] 21_FT1000:value [m3/h] 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 200 180 160 1.38 400 380 360 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 1.36 1.34 1.32 1.3 1.28 1.26 10000 21 21_PST0104:value [barg] 1.4 1.35 1.3 1.25 0 220 140 10000 420 340 0 −20 240 200 120 20 −40 10000 220 21_PDY0153:value [barg] NTNU Example: Analysis of sensor-data 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 20.5 20 19.5 Data Mining: My view – p.11/12 NTNU Example: Analysis of sensor-data Desired functionality: On-line monitoring of the process system Early warning of critical failures Optimize maintenance by identifying “unimportant” failures Available data: Thousands of sensors Seansors read with frequency 1Hz Data Mining: My view – p.11/12 NTNU Example: Analysis of sensor-data Difficult because: We lack system knowledge Knowledge about the overall process lacking We don’t know all failure modes, and not the signatures of unseer failures Complexity problems: Failures cannot be detected by considering a single sensor; must look at several dimensions evolving over time Calculations to be done in “real time”. Data Mining: My view – p.11/12 The Task Reality NTNU Conclusions Data Mining: My view – p.12/12 NTNU Conclusions Data Mining is the automatization of the process of finding interesting patterns in huge datasets. The field borrows concepts and ideas from other scientific subjects. Major challenges in — computer science (DB management, AI/ML) — traditional statistics Data Mining is among the fastest increasing business technologies in the world (wrt. turnover) Data Mining applications are becoming integral parts of our daily lives Data Mining: My view – p.12/12