Data Mining Algorithms for Large-Scale Distributed Systems
Presenter: Ran Wolff
Joint work with Assaf Schuster
2003

What is Data Mining?
Data mining problems all deal with the automatic analysis of large databases.
The outcome of a data mining algorithm is a model which uncovers the nature of the data.

Main Data Mining Problems
Association rules: Source IP begins with 132.68 => packets per connection > 1000
Classification: Source IP begins with 132.68 and TTL < 5 => will be dropped
Clustering: There are three types of packets coming from 132.68: Simple, Heavy load, and Malicious.
In data mining the answer precedes the question.

Why Data Mine an LSD System?
Data mining is good: when properly used, data mining yields money.
It is otherwise difficult to monitor an LSD system: lots of data, spread across the system, impossible to collect.
Many interesting phenomena are inherently distributed (e.g., DDoS); it is not enough to just monitor a few nodes.

Our Work
We developed an association rule mining algorithm that works well in LSD systems:
Local and therefore scalable
Asynchronous and therefore fast
Dynamic and therefore incremental and robust
Accurate: you get what you expect
Anytime: you get early results fast

In a Teaspoon
A distributed data mining algorithm can be described as a series of distributed decisions.
Those decisions are reduced to a majority vote.
We developed a majority voting protocol which has all those good qualities.
The outcome is an LSD association rule mining algorithm (still to come: classification).

Main Results
By the time the database is scanned once, in parallel, the average node has discovered 95% of the rules, and has less than 10% false rules.
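
The idea of reducing a mining decision to a majority vote can be illustrated with a minimal sketch. This is not the protocol from the talk: it gathers every node's counts in one place, whereas the talk describes a protocol that is local and asynchronous. The function names (`local_counts`, `majority_decision`), the sample transactions, and the thresholds are all assumptions made for illustration. Each transaction "votes" yes if it contains the candidate itemset, and the itemset is declared frequent when the yes-votes reach the required fraction of all votes.

```python
from fractions import Fraction


def local_counts(transactions, itemset):
    """Return (yes_votes, total_votes) for one node.

    Each transaction votes 'yes' if it contains every item of `itemset`.
    """
    items = frozenset(itemset)
    yes = sum(1 for t in transactions if items <= set(t))
    return yes, len(transactions)


def majority_decision(per_node_counts, threshold):
    """Decide whether the yes-votes reach `threshold` of all votes.

    Equivalent to checking sum(yes_i - threshold * total_i) >= 0, so each
    node contributes only a single number to the decision.
    """
    lam = Fraction(threshold)
    excess = sum(yes - lam * total for yes, total in per_node_counts)
    return excess >= 0


# Hypothetical data: two nodes, each holding a small local database.
node_a = [{"a", "b"}, {"a"}, {"a", "b", "c"}]
node_b = [{"b"}, {"a", "b"}]

counts = [local_counts(db, {"a", "b"}) for db in (node_a, node_b)]
is_frequent = majority_decision(counts, Fraction(1, 2))
```

Here the itemset {a, b} appears in 3 of the 5 transactions overall, so it passes a threshold of 1/2 and fails a threshold of 3/4. Writing the test as a sum of per-node terms is what makes it look like a vote: no node needs to ship its transactions, only its tally.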