Support Vector Machine
A Support Vector Machine (SVM) provides a binary classification mechanism based on finding a separating hyperplane between a set of samples with positive and negative outputs. It assumes the data is linearly separable.
The problem can be posed as a quadratic programming optimization that maximizes the margin subject to a set of linear constraints (i.e., samples on one side of the hyperplane must have positive output while those on the other side must have negative output), and solved with standard quadratic programming techniques.
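In symbols, this hard-margin formulation can be sketched as follows, where each sample x_i has label y_i in {+1, -1} and the margin is 2/||w||:

```latex
\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^2
\quad \text{subject to} \quad
y_i\,(w \cdot x_i + b) \ge 1, \qquad i = 1, \dots, n
```

Minimizing ||w|| is equivalent to maximizing the margin, and the constraints force every sample onto the correct side of the hyperplane.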
If the data is not linearly separable due to noise (while the majority is still linearly separable), an error term is added to the optimization to penalize misclassified points.
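This "soft-margin" variant introduces a slack variable for each sample and a penalty constant C (which corresponds to the cost parameter in the R example below):

```latex
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i\,(w \cdot x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0
```

A larger C punishes margin violations more heavily, trading a wider margin for fewer training errors.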
If the data distribution is fundamentally non-linear, the trick is to transform the data into a higher-dimensional space where it becomes linearly separable. The optimization objective turns out to depend only on dot products of the transformed points in the high-dimensional space, which is equivalent to applying a kernel function in the original (pre-transformation) space.
The kernel function therefore provides a cheap way to get the effect of transforming each point into a high-dimensional space (we never compute the transformation explicitly) while performing the quadratic optimization in that high-dimensional space.
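As a small illustration of this equivalence (a sketch, not part of the SVM training itself): for 2-dimensional points, the degree-2 polynomial kernel K(x, z) = (x . z)^2 gives exactly the dot product of an explicit 3-dimensional feature map.

```r
# Explicit feature map for the degree-2 polynomial kernel in 2 dimensions:
# phi(v) = (v1^2, sqrt(2)*v1*v2, v2^2)
phi <- function(v) c(v[1]^2, sqrt(2) * v[1] * v[2], v[2]^2)

x <- c(1, 2)
z <- c(3, 4)

sum(x * z)^2          # kernel in the original space: (1*3 + 2*4)^2 = 121
sum(phi(x) * phi(z))  # dot product in the transformed space: also 121
```

The kernel computes in 2 dimensions what would otherwise require an explicit (and, for higher degrees, much larger) transformation.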
There are a couple of tuning parameters (e.g., gamma and cost), so training is usually conducted in two steps: finding the optimal parameters via cross-validation, then training the SVM model with those parameters. Here is some example code in R:
> tune <- tune.svm(Species~., data=iristrain, gamma=10^(-6:-1), cost=10^(1:4))
Parameter tuning of 'svm':
- sampling method: 10-fold cross validation
- best parameters:
- best performance: 0.03333333
> model <- svm(Species~., data=iristrain, method="C-classification", kernel="radial", probability=T, gamma=0.001, cost=10000)
> prediction <- predict(model, iristest, probability=T)
> table(iristest$Species, prediction)
            prediction
             setosa versicolor virginica
  setosa         10          0         0
  versicolor      0         10         0
  virginica       0          3         7
SVM with a kernel function is a highly effective model and works well across a wide range of problem sets. Although it is a binary classifier, it can easily be extended to multi-class classification by training a group of binary classifiers and combining them with a "one vs all" or "one vs one" strategy.
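A minimal "one vs all" sketch in R (assuming the e1071 package, and that a positive decision value indicates the first factor level, as e1071 reports): train one binary SVM per class, then pick the class whose classifier produces the largest decision value.

```r
library(e1071)
data(iris)

classes <- levels(iris$Species)

# One binary SVM per class: this class vs. everything else.
models <- lapply(classes, function(cl) {
  y <- factor(ifelse(iris$Species == cl, cl, "other"),
              levels = c(cl, "other"))
  svm(x = iris[, 1:4], y = y, kernel = "radial")
})

# Score every sample with every binary classifier, then pick the class
# whose classifier is most confident (largest decision value).
scores <- sapply(models, function(m)
  attr(predict(m, iris[, 1:4], decision.values = TRUE), "decision.values"))
pred <- classes[max.col(scores)]
```

Note that e1071's own multi-class svm() already does "one vs one" internally; this sketch only shows the idea.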
SVM predicts the output based on the sample's distance to the separating hyperplane, which does not directly estimate the probability of the prediction. We therefore use a calibration technique: fit a logistic regression model between the distance to the hyperplane and the binary output, and use that regression model to estimate the probability.
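A sketch of this calibration (Platt scaling) on a hypothetical two-class subset of iris; note that e1071's svm() with probability=TRUE performs an equivalent calibration internally:

```r
library(e1071)
data(iris)

# Reduce iris to a binary problem for illustration.
binary <- iris[iris$Species != "setosa", ]
binary$Species <- factor(binary$Species)

model <- svm(Species ~ ., data = binary, kernel = "radial")

# Signed distances to the hyperplane (decision values).
d <- attr(predict(model, binary, decision.values = TRUE),
          "decision.values")

# Logistic regression mapping distance -> class probability.
calib <- glm(binary$Species ~ d, family = binomial)
prob <- predict(calib, type = "response")  # calibrated probabilities
```

In practice the calibration model should be fit on held-out data rather than the training set to avoid biased probability estimates.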