Support Vector Machine
& Classification using Weka
(Brain, Computation, and Neural Learning)
Byoung-Hee Kim
Biointelligence Lab. CSE
Seoul National University
Agenda
 Support Vector Machine*
 History and basic concepts of SVM
 How SVM works
 Optimization
 Kernel trick
 Properties of SVM
 SVM classification in Weka
 Review of Weka & dataset for Project #2
 Using SVM in Weka
* Many slides are from “Martin Law, A Simple Introduction to Support Vector Machines, Lecture Slide for CSE 802, Michigan State Univ.”
(C) 2009-2011, SNU Biointelligence Lab, http://bi.snu.ac.kr/
History of Support Vector Machine
 SVM was first introduced in 1992
 SVM became popular because of its success in handwritten digit recognition
 SVM is now regarded as an important example of "kernel methods", one of the key areas in machine learning
 Popularity
 SVM is regarded as a first choice for classification problems
 Many commercial analyzers contain SVM
  IBM® SPSS® Statistics
  SAS Enterprise Miner
  Oracle Data Mining (ODM)
 Sequential minimal optimization (SMO) is one of the most popular algorithms for large-margin classification by SVM
 Examples closest to the hyperplane are the support vectors
 The xi with non-zero αi are the support vectors
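As a hedged sketch (the support vectors, labels, and alphas below are toy numbers, not from the slides), the decision function built from the support vectors, f(x) = Σi αi yi K(xi, x) + b, can be written as:

```python
# Sketch of the SVM decision function f(x) = sum_i alpha_i * y_i * K(x_i, x) + b.
# Only examples with non-zero alpha_i (the support vectors) contribute.

def linear_kernel(u, v):
    """K(u, v) = u . v"""
    return sum(a * b for a, b in zip(u, v))

def decision(x, support_vectors, labels, alphas, b):
    """The sign of f(x) gives the predicted class (+1 or -1)."""
    f = sum(a * y * linear_kernel(sv, x)
            for sv, y, a in zip(support_vectors, labels, alphas)) + b
    return 1 if f >= 0 else -1

# Hypothetical trained values: two support vectors on either side of x = 0.
svs    = [(1.0,), (-1.0,)]
labels = [1, -1]
alphas = [0.5, 0.5]
b      = 0.0

print(decision((2.0,), svs, labels, alphas, b))   # point on the positive side -> 1
print(decision((-2.0,), svs, labels, alphas, b))  # point on the negative side -> -1
```

Note that examples with αi = 0 could be dropped from the sums above without changing any prediction, which is exactly the sparseness property discussed later.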
Hard Margin vs. Soft Margin
 Hard margin formulation
 Soft margin formulation
 Parameter C is the upper bound of the αi
 C can be viewed as a way to control overfitting
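The slide equations were lost in transcription; for reference, the standard hard- and soft-margin formulations from the SVM literature are:

```latex
% Hard margin (primal):
\min_{\mathbf{w},\, b} \ \tfrac{1}{2}\|\mathbf{w}\|^2
\quad \text{s.t.} \quad y_i(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1 \ \ \forall i

% Soft margin (primal), with slack variables \xi_i:
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \ \tfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_i \xi_i
\quad \text{s.t.} \quad y_i(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0

% In the dual, the only change is the box constraint on the multipliers:
% hard margin: \alpha_i \ge 0;  soft margin: 0 \le \alpha_i \le C
```

The last line is why C is described as the upper bound of the αi.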
Extension to Non-linear Decision Boundary
 How to generalize it to become nonlinear?
 Key idea: transform xi to a higher dimensional space to "make life easier"
 Input space: the space where the points xi are located
 Feature space: the space of φ(xi) after transformation
Extension to Non-linear Decision Boundary
 Why transform?
 A linear operation in the feature space is equivalent to a nonlinear operation in the input space
 Classification can become easier with a proper transformation. In the XOR problem, for example, adding a new feature x1x2 makes the problem linearly separable
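A minimal sketch of this XOR observation, using the ±1 encoding of inputs and labels:

```python
# The XOR problem: four points that no single line can separate in 2-D.
# Adding the product feature x1*x2 as a third coordinate makes them
# linearly separable: the plane x3 = 0 splits the two classes.

points = {(-1, -1): -1, (-1, 1): 1, (1, -1): 1, (1, 1): -1}  # XOR labels

for (x1, x2), label in points.items():
    x3 = x1 * x2  # the added feature
    # In (x1, x2, x3) space, the sign of -x3 recovers the label:
    predicted = 1 if -x3 > 0 else -1
    print((x1, x2), "->", (x1, x2, x3), "label:", label, "predicted:", predicted)
```

Every prediction matches its label, so a linear classifier in the 3-D feature space solves a problem that is not linearly separable in the 2-D input space.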
Kernel Functions
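The kernel formulas on this slide did not survive transcription; as a sketch, the two kernels offered with Weka's SMO (PolyKernel and RBFKernel) have the standard forms below (assumed from the literature, not taken from the slide):

```python
import math

# Common SVM kernel functions. K(u, v) measures similarity between examples.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(u, v, exponent=1.0):
    """Polynomial kernel (u . v)^p; exponent 1.0 gives the linear kernel."""
    return dot(u, v) ** exponent

def rbf_kernel(u, v, gamma=0.5):
    """RBF (Gaussian) kernel exp(-gamma * ||u - v||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

u, v = (1.0, 2.0), (2.0, 1.0)
print(poly_kernel(u, v, 1.0))   # 4.0 (linear kernel)
print(poly_kernel(u, v, 2.0))   # 16.0
print(rbf_kernel(u, u))         # 1.0 (identical points are maximally similar)
```

Because the kernel is evaluated directly on input-space points, the feature map φ never has to be computed explicitly; this is the kernel trick.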
Properties of SVM
 Flexibility in choosing a similarity function
 Sparseness of solution when dealing with large data sets
 only support vectors are used to specify the separating hyperplane
 Ability to handle large feature spaces
 complexity does not depend on the dimensionality of the feature space
 Overfitting can be controlled by the soft margin approach
 Nice math property: a simple convex optimization problem which is guaranteed to converge to a single global solution
 Feature selection
Introduction to Weka
 Weka: Data Mining Software in Java
 Weka is a collection of machine learning algorithms for data mining & machine learning tasks
 What can you do with Weka?
  data pre-processing, feature selection, classification, regression, clustering, association rules, and visualization
 Weka is open source software issued under the GNU General Public License
 How to get it? http://www.cs.waikato.ac.nz/ml/weka/ or just type 'Weka' in Google.
Dataset #1: Pima Indians Diabetes
 Description
 Pima Indians have the highest prevalence of diabetes in the world
 We will build classification models that diagnose whether a patient shows signs of diabetes
 http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes
 Configuration of the data set
 768 instances
 8 attributes
  age, number of times pregnant, results of medical tests/analyses
  all numeric (integer or real-valued)
  a discretized set will also be provided
 Class value = 1 (positive example)
  interpreted as "tested positive for diabetes"
  268 instances
 Class value = 0 (negative example)
  500 instances
Dataset #2: Handwritten Digits (MNIST)
 Description
 The MNIST database of handwritten digits contains digits written by office workers and students
 We will build a recognition model based on classifiers with a reduced set of MNIST
 http://yann.lecun.com/exdb/mnist/
 Configuration of the data set
 Attributes
  pixel values in gray level in a 28x28 image
  784 attributes (all 0~255 integers)
 Full MNIST set
  Training set: 60,000 examples
  Test set: 10,000 examples
 For our practice, a reduced set with 800 examples is used
 Class value: 0~9, representing the digits 0 to 9
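A minimal sketch of how a 28x28 gray-level image maps onto the 784 attributes (a toy all-zero image is used here, not real MNIST data):

```python
# Flatten a 28x28 gray-level image (values 0-255) row by row into the
# 784 numeric attributes used by the classifiers.
image = [[0] * 28 for _ in range(28)]            # toy 28x28 image, all zeros
attributes = [pixel for row in image for pixel in row]
print(len(attributes))  # 784
```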
Practice
 Basic
 Comparing the performances of algorithms
  MultilayerPerceptron vs. J48 vs. SVM
 Checking the trained model (structure & parameters)
 Tuning parameters to get better models
 Understanding 'Test options' & 'Classifier output' in Weka
 Advanced
 Building committee machines using 'meta' algorithms for classification
 Preprocessing / data manipulation – applying 'Filter'
 Batch experiments with 'Experimenter'
 Designing & running a batch process with 'KnowledgeFlow'
Mini-project
 Classification problem with Weka
 Data set
  3 different data sets
  You should include at least one set from the UCI ML repository (http://archive.ics.uci.edu/ml/) and the MNIST set
 Classification methods
  MLP: iterations, learning rate, momentum, # of hidden nodes
  SVM: will be addressed in the next class
  J48: default options only
Support Vector Machine in Weka
• Load a file that contains the training data by clicking the 'Open file' button ('ARFF' or 'CSV' formats are readable)
• Click the 'Classify' tab
• Click the 'Choose' button and select 'weka – functions – SMO'
• Click 'SMO' and set parameters for SMO
• Set parameters for Test
• Click 'Start' for learning
How SMO works in Weka
 Classification algorithm
 Implements John Platt's sequential minimal optimization (SMO) algorithm for training a support vector classifier
 Multi-class problems are solved using pairwise classification
 To obtain proper probability estimates, use the option that fits logistic regression models to the outputs of the support vector machine
 References
 J. Platt: Fast Training of Support Vector Machines using Sequential Minimal Optimization. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods – Support Vector Learning, 1998.
 S.S. Keerthi, S.K. Shevade, C. Bhattacharyya, K.R.K. Murthy (2001). Improvements to Platt's SMO Algorithm for SVM Classifier Design. Neural Computation, 13(3):637–649.
Parameters of SMO in Weka
 buildLogisticModels: whether to fit logistic models to the outputs (for proper probability estimates)
 numFolds: the number of folds for the cross-validation used to generate training data for the logistic models
 randomSeed: random number seed for the cross-validation
 c: the complexity parameter C; it is the upper bound of the αi
 filterType: determines how/if the data will be transformed
 kernel: the kernel to use
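The same parameters can also be set from Weka's command line; the following is a hedged sketch (the file names are examples, and the flag list should be checked against the SMO class's own help output):

```shell
# Train SMO on an ARFF file with C = 1.0, logistic models enabled,
# and an explicit kernel specification (paths are examples).
java -cp weka.jar weka.classifiers.functions.SMO \
    -t diabetes.arff \
    -C 1.0 \
    -M \
    -K "weka.classifiers.functions.supportVector.PolyKernel -E 1.0"
```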
Parameter Setting Guide
 Suggested
 buildLogisticModels: True
 numFolds: 3 or 5
 randomSeed: any value
 C (complexity parameter, upper bound of the αi)
 Your own choice
 filterType
 Kernel and its subsequent parameters
 Debug – if on, you can see intermediate results
 Do not change!
 epsilon
 checksTurnedOff
 toleranceParameter
Parameter Setting Guide - Kernel
 PolyKernel
 Try various exponents (floating-point values are allowed)
 1.0: linear kernel
 RBFKernel
 Try various gammas
 The gamma value corresponds to the inverse of the variance (width of the kernel)
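A small numeric check of this correspondence, assuming the common convention gamma = 1/(2σ²) for the kernel exp(-gamma·||u−v||²) (exact conventions vary between implementations):

```python
import math

# RBF kernel written two ways: with gamma, and with the Gaussian variance
# sigma^2. They coincide when gamma = 1 / (2 * sigma**2), so a larger gamma
# means a smaller variance, i.e. a narrower kernel.

def rbf_gamma(sq_dist, gamma):
    return math.exp(-gamma * sq_dist)

def rbf_sigma(sq_dist, sigma):
    return math.exp(-sq_dist / (2 * sigma ** 2))

sigma = 2.0
gamma = 1 / (2 * sigma ** 2)   # 0.125
for sq_dist in (0.0, 1.0, 4.0):
    assert math.isclose(rbf_gamma(sq_dist, gamma), rbf_sigma(sq_dist, sigma))

# Narrow vs. wide kernel evaluated at the same squared distance:
print(rbf_gamma(4.0, gamma=2.0))    # small value: narrow kernel
print(rbf_gamma(4.0, gamma=0.125))  # larger value: wide kernel
```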
Test Options and Classifier Output
 Setting the data set used for evaluation
 There are various metrics for evaluation
Committee Machine in Weka
 Using committee machines / ensemble learning in Weka
 Boosting: AdaBoostM1
 Voting committee: Vote
 Bagging
Data Manipulation with Filter in Weka
 Attribute
 Selection, discretize
 Instance
 Re-sampling, selecting specified folds
Using Experimenter in Weka
 Tool for 'batch' experiments
• Click 'New'
• Set the experiment type / iteration control
• Set the datasets / algorithms
• Select the 'Run' tab and click 'Start'
• If it has finished successfully, click the 'Analyse' tab and see the summary
KnowledgeFlow for Analysis Process Design
 Comparable to the 'Process Flow Diagram' of SAS® Enterprise Miner