Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
CS 590M Fall 2001: Security Issues in Data Mining Lecture 3: Classification What is Classification? • Problem: assign items to pre-defined classes – Sample Y = Y1 … Yn – Set of classes X – Given Y, choose C that contains Y • How do we know how to do this? – Training data: set of items for which proper Xi is known. Issues • Classification accuracy – False positives, False negatives – No clear “best” metric • Computation cost – Training – Classification Approaches: • • • • Naïve Bayes K-Nearest Neighbor Decision rules/Decision trees Neural Networks Naïve Bayes: History Bayes classifier: From Probability Theory • Idea: A-posteriori P[Y | Xi]P[ Xi] P[ Xi | Y ] probability of class j P[Y | Xj]P[ Xj] given all inputs is best possible classifier • Problem: doesn’t Y2 generalize. Y1 Y4 • Solution: Bayesian Y3 Belief network P(Xi|Y) = P(Y4|Y2,Y3)P(Y2|Y1)P(Y3|Y1)P(Y1) Problems with Bayesian Belief Network • What should the network structure be? – Some work in how to learn the structure – Getting it wrong results in over-specificity • What are the probabilities? – Learning techniques exist here • Computational cost to learn network Naïve Bayes • Two-layer Bayes network – No need to learn structure – Assumes inputs independent • Learn the probabilities that work best on training data Y1 Y2 X Y3 P(X|Y1...Yn) = P(X)*Πi P(Yi|X) K-Nearest Neighbor • Idea: Choose “closest” training item – Class of test is same as class of closest training item – Need to define distance • What if this is a bad match? – Find K closest items – Use most common class in those K KNN: Advantages • As training set → ∞, K → ∞, result approaches optimal – View as “best probability over all samples”: this is Bayes theorem • Training simple – Just put training set into a data structure KNN: Problems • With small K, only captures convex classes • High dimensionality: may be “nearest” in irrelevant attributes • Query time: Search all training data – Algorithms to make this faster • But good enough to be “standard” for comparison Classification and Security • Ideas on how to use classifiers to improve security – Intrusion Detection –? • Potential risks – Identifying private information based on similarity with training data