Database Issues in Smart Homes
Pervasive Intelligent Environments, Spring 2004
March 2, 2004
CRESCENT TCU Dept. of Computer Science

Topics: Lecture 3
• Preparing for prediction & decision making: Data Mining/KDD
• An example of some of the issues we’ve discussed
  – “Towards Sensor Database Systems”, Bonnet, Gehrke, Seshadri
Data mining material taken from Elmasri & Navathe, 4th edition

Data Warehouses (1 more thing)
• Repositories for data mining activities
  – Aggregates/summaries of data help efficiency
• Optimized for decision support, not transaction processing
• Definition (Elmasri, page 900)
  – “A subject-oriented, integrated, non-volatile, time-variant collection of data in support of management’s decisions”
• Replace “management” with “smart home agents”

Data Mining Definition
• Discovery of new information in terms of patterns or rules from vast amounts of data
• Extracts patterns that can’t readily be found by asking the right questions (queries)
  – TOO MUCH DATA FOR HUMANS
• Emerged from
  – Artificial Intelligence: machine learning, neural nets, genetic algorithms
  – Statistics
  – Operations Research

6 STEPS TO DM: some may be done as part of warehouse creation
• Data selection: pick the data needed
• Data cleansing
  – Fix bad data (e.g., spelling, zip codes)
  – Hard to deal with missing, erroneous, conflicting, redundant data
• Enrichment
  – Add data (e.g., age, gender, income)
• Data transformation
  – Aggregate (e.g., zip codes → regions)
• Data mining
• Reporting on discovered knowledge (K)

Types of results
• Association rules
  – Buy diapers → buy lots of beer
• Sequential patterns
  – Buy house → buy furniture within months
• Classification trees
  – Types of buyers (upscale, bargain-conscious, …)
• Why do it?
  – Make more money
  – Science & medicine
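The diapers and beer rule under Types of results can be made concrete with a small sketch. The slides do not define how a rule is scored; the sketch below assumes the two standard measures, support (the fraction of baskets containing every item in the rule) and confidence (the fraction of baskets containing the left-hand side that also contain the right-hand side). The basket data and the function names are made up for illustration.

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    if not transactions:
        return 0.0
    wanted = frozenset(itemset)
    hits = sum(1 for basket in transactions if wanted <= basket)
    return hits / len(transactions)


def confidence(
    transactions: List[FrozenSet[str]],
    lhs: Iterable[str],
    rhs: Iterable[str],
) -> float:
    """Estimate of P(rhs in basket | lhs in basket) for the rule lhs -> rhs."""
    lhs_set = frozenset(lhs)
    lhs_support = support(transactions, lhs_set)
    if lhs_support == 0:
        return 0.0
    return support(transactions, lhs_set | frozenset(rhs)) / lhs_support


# Hypothetical shopping baskets (not data from the lecture).
baskets = [
    frozenset({"diapers", "beer", "chips"}),
    frozenset({"diapers", "beer"}),
    frozenset({"diapers", "milk"}),
    frozenset({"bread", "milk"}),
]

# Rule: diapers -> beer
rule_support = support(baskets, {"diapers", "beer"})       # 2 of 4 baskets
rule_confidence = confidence(baskets, {"diapers"}, {"beer"})  # 2 of 3 diaper baskets
```

A mining algorithm would typically keep only the rules whose support and confidence both exceed user-chosen thresholds; the thresholds themselves are a design choice the slides leave open.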

DM/KDD Goals
• Find patterns to predict future events
• Find major groupings
  – Groupings of buyers, stars, diseases, …
• Find which group something belongs to
  – Creditworthiness

What are we learning?
• Association rules
• Classification hierarchies
• Clustering
• Sequential patterns
• Patterns within time series
• Type of result, inputs & algorithms vary
• Often interested in some combination of these types of knowledge (K)

Clustering
– Unsupervised learning techniques
  • Training samples are unclassified
  • Vs. supervised learning (classification)
– Drug categories for depression
– Categories of TV viewers
– Categories of buyers (likely, unlikely)
– Categories of households?
  • Single male, mother/children, conventional (M/D/kids), DINKs

Sequential patterns
• Detecting associations among events with certain temporal relationships
• Example:
  – Cardiac bypass for blocked arteries
  – AND within 18 months, high blood urea
  – THEN kidney failure likely in next 18 months
• Particularly important in smart homes

Sequential Pattern Discovery
• Sequence of itemsets
  – Grocery store purchases by 1 person (3 itemsets)
    • {soy milk, bread, chocolate}, {bananas, chocolate}, {lettuce, tomato, chocolate}
• 2 subsequences
  – {soy milk, bread, chocolate}, {bananas, chocolate}
  – {bananas, chocolate}, {lettuce, tomato, chocolate}

Sequential pattern discovery
• The support for a sequence S is the % of the given set U of sequences of which S is a subsequence.
  – That is: how many times does S show up?
• Find all subsequences from the given sequence sets that have a user-defined minimum support.
• The sequence S1, S2, …, Sn is a predictor of the “fact” that a customer who buys itemset S1 is likely to buy itemset S2, then S3, …
• Prediction support is based on the frequency of this sequence in the past
• Many open research issues in creating good algorithms

Patterns within time series
• Finding 2 patterns that occur over time
  – 2003 stock prices of Choice Homes and Home Depot
  – 2 products show the same sales pattern in summer but a different one in winter
  – Solar magnetic wind patterns may predict earth atmospheric changes

Time series pattern discovery
• Time series are sequences of events
  – Event could be a transaction (closing daily stock price)
  – Look at sequences over n days, or
  – Longest period in which change is no greater than 1%
• Comparing
  – Must define similarity measures

Other approaches in DM/KDD
• Neural nets
  – Infer a function from a set of examples
    • Non-parametric curve-fitting
    • Interpolates to solve new problems
  – Supervised & unsupervised algorithms
  – Classification
  – Time series
  – Can’t see what it learned (not declarative)

Other approaches in DM/KDD
• Genetic algorithms
  – Set up
    • Representation (strings over an alphabet)
    • Evaluation (fitness) function
    • Parameters: # of generations, cross-over rate, mutation rate, etc.
  – Randomized (probabilistic operators), parallel search over search space
  – Used for problem solving and clustering

Sensor DB Article
• Design
  – Distributed vs. warehouse approach
  – Sensor data
    • Measurement uncertainty, communications failures
    • Data representation
• Data model
  – Relational +
    • Sensor descriptions, including location
  – Special representation for sensor sequences
    • ADT attribute represents sensor data as output of ADT functions
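Returning to the support measure defined under Sequential pattern discovery: the following Python sketch computes the fraction of sequences in U that contain a candidate sequence S, then keeps the candidates that meet a minimum support. The slides do not spell out the matching rule, so the sketch assumes one common reading: S is a subsequence of T when the itemsets of S can be matched, in order, to itemsets of T that contain them. The first customer is the grocery example from the slides; the other customers and all function names are hypothetical.

```python
from typing import FrozenSet, List

Sequence = List[FrozenSet[str]]


def is_subsequence(candidate: Sequence, sequence: Sequence) -> bool:
    """True if each itemset of `candidate` is contained, in order, in a
    distinct itemset of `sequence` (greedy left-to-right matching)."""
    pos = 0
    for itemset in sequence:
        if pos < len(candidate) and candidate[pos] <= itemset:
            pos += 1
    return pos == len(candidate)


def sequence_support(candidate: Sequence, sequences: List[Sequence]) -> float:
    """Fraction of `sequences` of which `candidate` is a subsequence."""
    if not sequences:
        return 0.0
    hits = sum(1 for s in sequences if is_subsequence(candidate, s))
    return hits / len(sequences)


def frequent(candidates: List[Sequence], sequences: List[Sequence],
             min_support: float) -> List[Sequence]:
    """Keep the candidates whose support meets the user-defined minimum."""
    return [c for c in candidates
            if sequence_support(c, sequences) >= min_support]


def fs(*items: str) -> FrozenSet[str]:
    return frozenset(items)


# U: one purchase sequence per customer. Only the first is from the slides.
U = [
    [fs("soy milk", "bread", "chocolate"), fs("bananas", "chocolate"),
     fs("lettuce", "tomato", "chocolate")],
    [fs("bread"), fs("bananas", "chocolate", "milk")],
    [fs("lettuce"), fs("bananas")],
    [fs("bananas", "chocolate"), fs("tomato", "lettuce")],
]

S = [fs("bananas", "chocolate"), fs("lettuce")]
s_support = sequence_support(S, U)  # matched by customers 1 and 4
kept = frequent([S, [fs("bread"), fs("lettuce")]], U, min_support=0.5)
```

This only scores candidates that are handed to it. Generating the candidates efficiently is the hard part, which is where the research issues mentioned on the slide come in.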

Sensor DB Article: Queries
• Sample queries/characteristics (2nd page) and sample extended SQL (3.1)
• Long-running (continuous) queries
  – Incremental queries retrieve all data over a t-second interval, repeated every t seconds; take the union of the results
  – WHERE $every() in SQL
• Aggregates over time windows
• Virtual joins for ADT (slow) functions
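The incremental evaluation of a long-running query described above can be sketched in Python as follows. This is only an illustration of the idea (run the same query over each t-second interval and take the union of the results), not the article's extended SQL or the semantics of `$every()`. The `Reading` type, the function names, and the sample readings are all assumptions made for the example.

```python
from typing import Callable, List, NamedTuple, Optional


class Reading(NamedTuple):
    timestamp: float  # seconds since the query started
    sensor_id: str
    value: float


def query_interval(readings: List[Reading], start: float, end: float,
                   predicate: Callable[[Reading], bool]) -> List[Reading]:
    """One incremental query: matching readings in the interval [start, end)."""
    return [r for r in readings
            if start <= r.timestamp < end and predicate(r)]


def continuous_query(readings: List[Reading], start: float, t: float,
                     rounds: int,
                     predicate: Callable[[Reading], bool]) -> List[Reading]:
    """Repeat the incremental query every t seconds and union the results.
    The intervals are disjoint, so the union is a plain concatenation."""
    result: List[Reading] = []
    for k in range(rounds):
        lo = start + k * t
        result.extend(query_interval(readings, lo, lo + t, predicate))
    return result


def window_averages(readings: List[Reading], start: float, t: float,
                    rounds: int) -> List[Optional[float]]:
    """Aggregate over time windows: mean value per t-second window
    (None for a window with no readings)."""
    averages: List[Optional[float]] = []
    for k in range(rounds):
        lo = start + k * t
        window = query_interval(readings, lo, lo + t, lambda r: True)
        averages.append(
            sum(r.value for r in window) / len(window) if window else None
        )
    return averages


# Hypothetical temperature readings from two sensors.
readings = [
    Reading(0.5, "kitchen", 20.0),
    Reading(1.5, "kitchen", 24.0),
    Reading(2.5, "hall", 19.0),
    Reading(3.5, "kitchen", 26.0),
]

hot = continuous_query(readings, 0.0, 2.0, 2, lambda r: r.value > 22.0)
averages = window_averages(readings, 0.0, 2.0, 2)
```

In a real sensor database the readings would arrive over time rather than sit in a list, so each round would see only the data produced since the previous round; the half-open intervals keep a reading from being returned twice.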