Data Mining in Ubiquitous Distributed Environments
Assaf Schuster
Technion
SEBD Tutorial, June 06

Purpose of this Tutorial
• Convergence of distributed systems and data mining
• Evolving field, no systematic coverage of all aspects
• Will present: issues, challenges, examples of algorithmic approaches, ideas, tradeoffs of accuracy vs. overhead
• Will not present: formal treatment, proofs, details, technology, systems, hardware…

Ubiquitous Computing Systems
• Various systems: Grid, P2P, WSN, MANET
• Several similar technological aspects
  – Scale, aim for at least 10K (10M in P2P)
    • partial failure, heterogeneity, dynamic state / data
  – Multi-user, a 10K system serves >= 1K users
    • resource sharing, caching, consistency
  – Lots of distributed data
    • streams, incremental, anytime, local filtering, locality filtering
  – Cooperation of self-motivated parties
    • trust management, security, privacy, competitive market, self vs. global optimizations
  – Stringently resource limited
    • in-network computing, storage distribution
• Non-similar technological aspects

Ubiquitous Data Mining
• For the community
  – E.g., P2P recommendations based on e-interaction
• For security
  – E.g., identify and avert DoS attacks (Overpeer and P2P poisoning)
• For administration
  – E.g., misconfiguration detection system (DataMiningGrid demo)
• For data cleansing
  – E.g., in-network outlier detection (and removal) in WSN
• DM using HPC
  – E.g., idle-cycle batch systems for high-complexity analysis tasks (Superlink-Online)

Technological Challenges: Algorithms
• Scalable and resource-limited distributed DM
  – Algorithms for 10K peers, algorithms limited to two messages per peer per hour, synchronization-less, iteration-less, bag-of-tasks, dynamic divisibility, etc.
• Monitoring
  – Distributed, local filtering
• Success, correctness, and consistency
  – Partial failure, message dropping, heterogeneity, etc. can yield all sorts of trouble
• Reusability, incrementality
  – E.g., multi-class classifiers, multi-metric k-means clustering, etc.

Technological Challenges: Systems
• Exploitation & HCI
  – Lay user (parameterless) DM, interactive DM
  – DM-based autonomous ubiquitous systems
• Security, fraud, and privacy
  – Authorization, public-key infrastructure, trust management, data pollution
• Longevity of DM jobs
  – Resource sharing, non-dedicated resources
• Communication patterns
  – Esp. reliability and addressability. Are these problems best solved by suitable algorithms?
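
Illustration (not from the slides): the "Monitoring: distributed, local filtering" item under Technological Challenges: Algorithms can be made concrete with a small simulation. The sketch below assumes a simple sum-threshold task: n peers each hold one reading, and the system should detect when the sum exceeds a global threshold. Each peer keeps a local limit and stays silent while its reading is within that limit; since the limits always add up to the global threshold, silence from all peers implies the sum is within bounds. The class name `ThresholdMonitor`, its methods, and the slack-rebalancing rule are hypothetical choices for this sketch, not the tutorial's algorithm.

```python
class ThresholdMonitor:
    """Simulates n peers plus a coordinator watching sum(values) > global_threshold.

    Assumptions of this sketch: readings start at 0, global_threshold is
    positive, and messages are never lost.
    """

    def __init__(self, n_peers: int, global_threshold: float):
        self.global_threshold = global_threshold
        self.values = [0.0] * n_peers
        # Local limits always sum to the global threshold.
        self.limits = [global_threshold / n_peers] * n_peers
        self.alarm = False   # True while the last poll found the sum too high
        self.messages = 0    # messages spent on reports and polls

    def observe(self, peer_id: int, new_value: float) -> bool:
        """Apply one local reading; return True if the global threshold is exceeded."""
        self.values[peer_id] = new_value
        # Peer-side filter: stay silent while within the local limit.
        if not self.alarm and new_value <= self.limits[peer_id]:
            return False
        # One report from this peer plus a poll of the other n - 1 peers.
        self.messages += len(self.values)
        total = sum(self.values)
        self.alarm = total > self.global_threshold
        if not self.alarm:
            # Redistribute the remaining slack evenly so every peer is
            # within its new limit and the limits still sum to the threshold.
            slack = (self.global_threshold - total) / len(self.values)
            self.limits = [v + slack for v in self.values]
        return self.alarm
```

Calling `observe(peer_id, value)` for each new reading and inspecting `messages` afterwards shows how many updates were absorbed locally. While the alarm is raised, this sketch polls on every update, which is the simplest way to notice the sum dropping back below the threshold; a real deployment would likely want something cheaper.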