SVMs extend the Maximal Margin Classifier to data that is not linearly separable, using slack variables (soft margin classification) for overlapping but roughly linear data and kernel functions (the kernel trick) for non-linear cases.
Objective Function: Minimize the following objective to find the optimal hyperplane:
\(\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i\)
Subject to \(y_i(w^{T}x_i + b) \geq 1 - \xi_i\) and \(\xi_i \geq 0\) for all \(i\), where:
\(w\) is the weight vector
\(b\) is the bias
\(\xi_i\) are slack variables representing the degree of misclassification of \(x_i\)
\(C\) is the regularization parameter controlling the trade-off between margin maximization and classification error.
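As a rough illustration (not part of the standard formulation above), the sketch below evaluates this primal objective in Python for a fixed \(w\) and \(b\): each slack value is taken as the hinge loss \(\max(0,\, 1 - y_i(w^{T}x_i + b))\), the smallest \(\xi_i\) that satisfies its constraint. The helper name `primal_objective` and the toy data are assumptions for illustration only.

```python
import numpy as np

def primal_objective(w, b, X, y, C):
    """Soft-margin SVM primal objective: 0.5 * ||w||^2 + C * sum(xi)."""
    margins = y * (X @ w + b)               # y_i (w^T x_i + b) for each point
    slack = np.maximum(0.0, 1.0 - margins)  # xi_i = max(0, 1 - margin), so xi_i >= 0
    return 0.5 * np.dot(w, w) + C * slack.sum()

# Tiny 2-D example with labels in {-1, +1}
X = np.array([[2.0, 2.0], [1.5, 1.8], [-1.0, -1.2], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = np.array([0.5, 0.5]), 0.0
print(primal_objective(w, b, X, y, C=1.0))  # larger C penalizes the slack terms more
```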
Dual Problem Solution: The SVM optimization problem in its dual form allows the incorporation of kernel functions:
\(\max_{\alpha} \; \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{n}\alpha_i \alpha_j y_i y_j K(x_i, x_j)\)
Subject to \(0 \leq \alpha_i \leq C\) for all \(i\) and \(\sum_{i=1}^{n}\alpha_i y_i = 0\), where:
\(\alpha_i\) are the Lagrange multipliers
\(K(x_i, x_j)\) is the kernel function evaluating the dot product of \(x_i\) and \(x_j\) in the transformed feature space.
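To make the dual concrete, here is a minimal sketch that builds an RBF Gram matrix and evaluates the dual objective for a hand-picked \(\alpha\) satisfying the constraints; `rbf_kernel`, `dual_objective`, the value of `gamma`, and the toy data are illustrative assumptions, and no solver is involved.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=0.5):
    """RBF kernel matrix with K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)
    return np.exp(-gamma * sq_dists)

def dual_objective(alpha, y, K):
    """sum_i alpha_i - 0.5 * sum_ij alpha_i alpha_j y_i y_j K(x_i, x_j)."""
    ay = alpha * y
    return alpha.sum() - 0.5 * ay @ K @ ay

X = np.array([[2.0, 2.0], [1.5, 1.8], [-1.0, -1.2], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.3, 0.0, 0.2, 0.1])  # satisfies 0 <= alpha_i <= C (C=1) and sum(alpha_i * y_i) = 0

print(dual_objective(alpha, y, rbf_kernel(X, X)))
```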
Slack Variables \(\xi_i\): Allow for flexibility in classification by permitting data points to be within the margin or incorrectly classified, i.e., the soft margin approach.
Regularization Parameter (\(C\)): Balances the trade-off between achieving a wide margin and minimizing the classification error; higher \(C\) values lead to less tolerance for misclassification.
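As a quick illustration of this trade-off (assuming scikit-learn's `SVC` and a synthetic overlapping dataset, neither of which is part of the original text), fitting a linear-kernel SVM at increasing \(C\) typically leaves fewer points inside the margin and hence fewer support vectors:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters, so some points must violate a hard margin
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Larger C penalizes slack more heavily; on this kind of data that usually
    # narrows the margin and reduces the support-vector count.
    print(f"C={C:>6}: {clf.n_support_.sum()} support vectors")
```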
Kernel Functions: Transform the original feature space into a higher-dimensional space, enabling SVMs to find a separating hyperplane in cases where data is not linearly separable. Common kernels include linear, polynomial, RBF, and sigmoid.
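A short comparison sketch, again assuming scikit-learn: on concentric circles (not linearly separable in the original space), non-linear kernels such as RBF typically fare much better than the linear kernel; the exact accuracies are illustrative only.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric circles: no separating hyperplane exists in the original 2-D space
X, y = make_circles(n_samples=300, factor=0.4, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    print(f"{kernel:>8}: test accuracy = {clf.score(X_test, y_test):.2f}")
```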
Dual Formulation: Simplifies the problem by focusing on Lagrange multipliers, allowing the use of kernel functions and making the problem solvable even when the feature space is high-dimensional or infinite.
Support Vectors: Data points corresponding to non-zero \(\alpha_i\) values; these are the critical elements that define the hyperplane and margin.
Decision Function: For a new data point \(x\), the decision function becomes \(\text{sign}\left(\sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b\right)\), determining the class membership based on the sign of the output.
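To tie the last two points together, a hedged sketch assuming scikit-learn's `SVC`: the fitted model stores only the support vectors, `dual_coef_` holds the corresponding products \(\alpha_i y_i\), and `intercept_` holds \(b\), so the decision function above can be reconstructed by hand. The toy data, the `rbf` helper, and the test point are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, cluster_std=1.8, random_state=1)
clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)

# Only points with nonzero alpha_i are retained as support vectors
print("support vectors kept:", clf.support_vectors_.shape[0], "of", X.shape[0])

def rbf(a, b, gamma=0.5):
    """K(a, b) = exp(-gamma * ||a - b||^2), matching the gamma used above."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

x_new = np.array([0.0, 0.0])
# sum_i (alpha_i * y_i) * K(x_i, x_new) + b, summed over the support vectors only
score = sum(coef * rbf(sv, x_new)
            for coef, sv in zip(clf.dual_coef_[0], clf.support_vectors_)) + clf.intercept_[0]

print("manual decision value :", score)
print("sklearn decision value:", clf.decision_function([x_new])[0])
print("predicted side of the hyperplane:", np.sign(score))
```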