# Mathematical Formulation of SVM

In this article we develop the mathematical formulation of the soft-margin variant of the classical maximum margin linear classifier.

## How do we compute the margin?

The main idea of the algorithm is to choose a linear boundary that separates both classes. Suppose we have the point $x_1$ with $y_1 = 1$ and the point $x_2$ with $y_2 = -1$ as the extreme points of the clusters, lying exactly on the margin hyperplanes:

$$ w^t x_1 + b = 1 $$

$$ w^t x_2 + b = -1 $$

Subtracting one equation from the other gives:

$$ w^t (x_1 - x_2) = 2 $$

If we normalize the left side, we get the size of the margin on the right-hand side:

$$ \frac{w^t (x_1 - x_2)}{||w||} = \frac{2}{||w||} $$

Finally, the margin $\rho$ is:

$$ \rho = \frac{2}{||w||} $$

## Optimization model

In order to maximize the margin $\rho$ one can minimize its inverse $\frac{||w||}{2}$. To make the optimization problem easier to solve we can get rid of the square root in the norm and write:

$$ \min_{w, b, \zeta}{\frac{w^t w}{2} + C \sum_{i=1}^{n}{\zeta_i}} $$

Subject to:

$$ y_i(w^t x_i + b) \ge 1 - \zeta_i ,\quad i=1..n $$

$$ \zeta_i \ge 0 ,\quad i=1..n $$

where the first constraint states that the training points need to be assigned to the corresponding class. We allow some misclassification at a cost $C$ through the slack variables $\zeta_i$.

## Getting the optimum of the problem

To solve this problem we can compute the Lagrangian function and solve the unconstrained equivalent:

$$ \mathcal{L}(w, b, \zeta) = {\frac{w^t w}{2} + C \sum_{i=1}^{n}{\zeta_i}} + \sum_{i=1}^{n}{\alpha_i (1 - \zeta_i - y_i(w^t x_i + b))} + \sum_{i=1}^{n}{\beta_i(-\zeta_i)} $$

> Notice the minus sign in the terms corresponding to the constraints, to match the canonical form of an optimization problem:
>$$ \min_{x} f(x)$$
>$$ g(x) \le 0 $$

The stationary point of the Lagrangian gives us the optimum of the problem. Taking the derivatives with respect to the parameters and setting them to zero:

$$ \frac{\partial \mathcal{L}}{\partial w} = w - \sum_{i=1}^{n}{\alpha_i y_i x_i} = 0$$

$$ \frac{\partial \mathcal{L}}{\partial b} = \sum_{i=1}^{n}{\alpha_i y_i} = 0$$

$$ \frac{\partial \mathcal{L}}{\partial \zeta_i} = C - \alpha_i - \beta_i = 0, \quad i=1..n$$

If we expand the Lagrangian:

$$ \mathcal{L}(w, b, \zeta) = \frac{w^t w}{2} + \sum_{i=1}^{n}{\alpha_i} - \sum_{i=1}^{n}{\alpha_i y_i w^t x_i} - \sum_{i=1}^{n}{\alpha_i y_i b} + C \sum_{i=1}^{n}{\zeta_i} - \sum_{i=1}^{n}{\alpha_i \zeta_i} - \sum_{i=1}^{n}{\beta_i\zeta_i} $$

And embed the stationary points of the Lagrangian in the following way:

$$ \mathcal{L}(w, b, \zeta) = \frac{\sum_{i, j=1}^{n}{\alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle}}{2} - \sum_{i, j=1}^{n}{\alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle} + \sum_{i=1}^{n}{\alpha_i} - \sum_{i=1}^{n}{\alpha_i y_i b} + \sum_{i=1}^{n}{(C - \alpha_i - \beta_i)\zeta_i} $$

1. We replace $w$ with the first expression of the stationary point, $w = \sum_{i=1}^{n}{\alpha_i y_i x_i}$.
2. We remove the term $\sum_{i=1}^{n}{\alpha_i y_i b} = b \sum_{i=1}^{n}{\alpha_i y_i}$, because $\sum_{i=1}^{n}{\alpha_i y_i} = 0$.
3. We remove the last group because $C - \alpha_i - \beta_i = 0, \quad i=1..n$.
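Before continuing, we can sanity-check the first stationarity condition, $w = \sum_{i}{\alpha_i y_i x_i}$, numerically. This is a minimal sketch using scikit-learn (used throughout the examples below): `SVC` exposes the products $\alpha_i y_i$ of the support vectors in `dual_coef_` and, for a linear kernel, $w$ itself in `coef_`. The toy dataset and `random_state` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy dataset, chosen only to check the identity w = sum_i alpha_i y_i x_i.
x, y = make_blobs(centers=[[6, 6], [12, 12]], random_state=0)
model = SVC(kernel="linear").fit(x, y)

# dual_coef_ holds alpha_i * y_i for the support vectors;
# support_vectors_ holds the corresponding x_i.
w_from_alphas = model.dual_coef_ @ model.support_vectors_
print(np.allclose(w_from_alphas, model.coef_))  # True
```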
Now the Lagrangian has the following form:

$$ \mathcal{L}(\alpha) = \sum_{i=1}^{n}{\alpha_i} -\frac{\sum_{i, j=1}^{n}{\alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle}}{2} $$

If we add the constraints coming from the stationary point, together with the dual feasibility conditions $\alpha_i, \beta_i \ge 0$, we get the dual problem:

$$ \max_{\alpha} {\mathcal{L}(\alpha)} $$

Subject to:

$$ \sum_{i=1}^{n}{\alpha_i y_i} = 0 $$

$$ 0 \le \alpha_i \le C, \quad i=1..n$$

The first constraint corresponds to the $\frac{\partial \mathcal{L}}{\partial b} = 0$ condition. The second one is a little bit more tricky: since $\alpha_i \ge 0$ and $\beta_i \ge 0$, the condition $\frac{\partial \mathcal{L}}{\partial \zeta_i} = 0$ gives an upper bound on $\alpha_i$:

$$ C - \alpha_i - \beta_i = 0 \implies \alpha_i = C - \beta_i \le C $$

with $\alpha_i = C$ exactly when $\beta_i = 0$.

## How we classify

Once we get the solution of the dual problem and the values of $\alpha^*$ and $b^*$, the decision function is given by:

$$ d(x; \alpha^*, b^*) = \operatorname{sign}\left(\sum_{sv}{\alpha_i^* y_i \langle x_i, x \rangle} + b^*\right) $$

where $b^*$ can be computed from the KKT complementary slackness conditions:

$$ \alpha_i^*(y_i({w^*}^t x_i + b^*) - 1 + \zeta_i^*) = 0 $$

$$ \beta_i^* \zeta_i^* = 0$$

Now, the interesting part is that the decision function depends only on the support vectors (see the summation index). That is because $\alpha_i = 0$ for any point where the margin constraint is not binding.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
```

```python
x, y = make_blobs(centers=[[6, 6], [12, 12]])
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7)

model = SVC(kernel="linear")
model.fit(x_train, y_train)

print(f"train accuracy:\t{accuracy_score(y_train, model.predict(x_train)):.2%}")
print(f"test accuracy:\t{accuracy_score(y_test, model.predict(x_test)):.2%}")
```

In this case the dataset is linearly separable.

```
train accuracy: 100.00%
test accuracy:  100.00%
```
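To connect this back to the dual formulation, we can rebuild the decision values from the support vectors alone, following the expression for $d(x; \alpha^*, b^*)$ above. This is a quick sketch that reuses `model` and `x_test` from the previous cell: `dual_coef_` holds $\alpha_i^* y_i$, `support_vectors_` holds the $x_i$, and `intercept_` holds $b^*$.

```python
# Decision values from support vectors only:
# f(x) = sum_sv alpha_i* y_i <x_i, x> + b*
f_manual = model.dual_coef_ @ (model.support_vectors_ @ x_test.T) + model.intercept_

# They match the values computed internally by scikit-learn.
print(np.allclose(f_manual.ravel(), model.decision_function(x_test)))  # True
```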
Let's define a function to plot it.

```python
def plot_svm(x_train, y_train, model):
    # Plot the predicted regions
    xlim = np.min(x_train[:, 0]), np.max(x_train[:, 0])
    ylim = np.min(x_train[:, 1]), np.max(x_train[:, 1])
    DecisionBoundaryDisplay.from_estimator(model, x_train, response_method="predict", alpha=0.4)

    # Plot training data
    plt.scatter(x_train[y_train==0][:, 0], x_train[y_train==0][:, 1], color="red", label="train - 0")
    plt.scatter(x_train[y_train==1][:, 0], x_train[y_train==1][:, 1], color="blue", label="train - 1")

    # Plot support vectors
    plt.scatter(x_train[model.support_][:, 0], x_train[model.support_][:, 1], color="green", label="Support Vectors")

    # Plot the separating hyperplane w^t x + b = 0
    w1, w2 = model.coef_[0]
    b = model.intercept_
    margin = 1 / np.sqrt(np.sum(model.coef_**2))  # half-margin, 1 / ||w||
    x1 = np.linspace(*xlim, 2)
    x2 = (-w1 * x1 - b) / w2
    plt.plot(x1, x2, color="green", linestyle="dashed", linewidth=2)

    # Plot the upper margin, w^t x + b = 1
    x1 = np.linspace(*xlim, 2)
    x2 = (-w1 * x1 - b) / w2 + np.sqrt(1 + (w1/w2)**2) * margin
    plt.plot(x1, x2, color="black", linestyle="dashed", linewidth=2, label="margin")

    # Plot the lower margin, w^t x + b = -1
    x1 = np.linspace(*xlim, 2)
    x2 = (-w1 * x1 - b) / w2 - np.sqrt(1 + (w1/w2)**2) * margin
    plt.plot(x1, x2, color="black", linestyle="dashed", linewidth=2)

    plt.xlabel("x1")
    plt.ylabel("x2")
    plt.xlim(xlim)
    plt.ylim(ylim)
    plt.legend()

plot_svm(x_train, y_train, model)
plt.show()
```

![](_images/svm_1.bmp)

We can see only a few support vectors in comparison with the dataset size. Since the classes are linearly separable, there is no training point inside the margin.

If the dataset is not linearly separable, we need to pay the price $C \sum_{i=1}^{n}\zeta_i$ for misclassification. In the examples below we increase the value of $C$ and see how the margin is reduced.

```python
x, y = make_blobs(centers=[[6, 6], [8, 8]])
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7)

C = 0.05
model = SVC(C=C, kernel="linear")
model.fit(x_train, y_train)

print(f"train accuracy:\t{accuracy_score(y_train, model.predict(x_train)):.2%}")
print(f"test accuracy:\t{accuracy_score(y_test, model.predict(x_test)):.2%}")

plot_svm(x_train, y_train, model)
plt.title(f"SVM with C={C:.2f}")
plt.show()
```

```
train accuracy: 92.86%
test accuracy:  100.00%
```

![](_images/svm_2.bmp)

![](_images/svm_3.bmp)

## Non-linearly separable data

There are some cases where the data is not linearly separable, like the following:

```python
from sklearn.datasets import make_circles

x, y = make_circles([100, 100], factor=0.5)
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7)

C = 100
model = SVC(C=C, kernel="linear")
model.fit(x_train, y_train)

print(f"train accuracy:\t{accuracy_score(y_train, model.predict(x_train)):.2%}")
print(f"test accuracy:\t{accuracy_score(y_test, model.predict(x_test)):.2%}")

plot_svm(x_train, y_train, model)
plt.show()
```

![](_images/svm_4.bmp)

To avoid this we can use the kernel trick. The basic idea is to use a function $\phi$ that maps each point $x = [x_1, x_2]^t$ to a higher-dimensional representation, for example $\phi(x) = [x_1, x_2, x_1^2 + x_2^2]^t$. Now the decision function changes to:

$$ d(x; \alpha^*, b^*) = \operatorname{sign}\left(\sum_{sv}{\alpha_i^* y_i \langle \phi(x_i), \phi(x) \rangle} + b^*\right) $$

The inner product $\langle \phi(x_i), \phi(x_j) \rangle$ can be computed implicitly by a kernel function that fulfills the conditions of Mercer's theorem: the kernel $k: \mathcal{X}\times\mathcal{X} \to \mathbb{R}$ must be symmetric and positive semi-definite.
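Before using kernels implicitly, here is a small sketch of the explicit lifting described above. The third coordinate $x_1^2 + x_2^2$ is a choice made for this particular dataset: it measures the squared distance to the origin, which is exactly what separates the two rings, so a plain linear SVM on the lifted data succeeds.

```python
# Explicit feature map phi(x) = [x1, x2, x1^2 + x2^2] on the circles data.
# The inner ring has x1^2 + x2^2 ~ 0.25 and the outer one ~ 1, so the
# lifted dataset is linearly separable in 3D.
x, y = make_circles([100, 100], factor=0.5)
phi_x = np.c_[x, x[:, 0]**2 + x[:, 1]**2]

model = SVC(kernel="linear")
model.fit(phi_x, y)
print(f"train accuracy:\t{accuracy_score(y, model.predict(phi_x)):.2%}")  # expect ~100%
```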
The objective of the dual problem that determines the $\alpha_i$ becomes:

$$ \mathcal{L}(\alpha) = \sum_{i=1}^{n}{\alpha_i} -\frac{\sum_{i, j=1}^{n}{\alpha_i \alpha_j y_i y_j \langle \phi(x_i), \phi(x_j) \rangle}}{2} $$

This is interesting because we don't need to compute the mapping $\phi(x)$ explicitly; we only need the corresponding entry of the Gram matrix $G_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$.

### Example

In the following example we choose the kernel:

$$ k(x, y) = \exp(-\gamma ||x-y||^2) $$

This is called the radial basis function (`rbf`) kernel.

```python
x = np.linspace(-2, 2)
gammas = [1, 5, 10]
for gamma in gammas:
    k = np.exp(-gamma * x**2)  # k(x, 0)
    plt.plot(x, k, label=f"gamma = {gamma:.1f}")
plt.title("RBF kernel")
plt.legend()
plt.grid()
plt.show()
```

![](_images/svm_5.bmp)

```python
x, y = make_circles([100, 100], factor=0.5)
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7)

C = 1
gamma = 1
model = SVC(C=C, kernel="rbf", gamma=gamma)
model.fit(x_train, y_train)

print(f"train accuracy:\t{accuracy_score(y_train, model.predict(x_train)):.2%}")
print(f"test accuracy:\t{accuracy_score(y_test, model.predict(x_test)):.2%}")

# Plot decision boundary
DecisionBoundaryDisplay.from_estimator(model, x_train, response_method="predict", alpha=0.4)

# Plot training data
plt.scatter(x_train[y_train==0][:, 0], x_train[y_train==0][:, 1], color="red", label="train - 0")
plt.scatter(x_train[y_train==1][:, 0], x_train[y_train==1][:, 1], color="blue", label="train - 1")

# Plot support vectors
plt.scatter(x_train[model.support_][:, 0], x_train[model.support_][:, 1], color="green", label="Support Vectors")

plt.legend()
plt.show()
```

![](_images/svm_6.bmp)

If we increase $\gamma$, the decision boundary between the support vectors becomes sharper. We get a doughnut-shaped decision boundary.

```python
C = 100
gamma = 200
model = SVC(C=C, kernel="rbf", gamma=gamma)
model.fit(x_train, y_train)

print(f"train accuracy:\t{accuracy_score(y_train, model.predict(x_train)):.2%}")
print(f"test accuracy:\t{accuracy_score(y_test, model.predict(x_test)):.2%}")

# Plot decision boundary
DecisionBoundaryDisplay.from_estimator(model, x_train, response_method="predict", alpha=0.4)

# Plot training data
plt.scatter(x_train[y_train==0][:, 0], x_train[y_train==0][:, 1], color="red", label="train - 0")
plt.scatter(x_train[y_train==1][:, 0], x_train[y_train==1][:, 1], color="blue", label="train - 1")

# Plot support vectors
plt.scatter(x_train[model.support_][:, 0], x_train[model.support_][:, 1], color="green", label="Support Vectors")

plt.legend()
plt.show()
```

![](_images/svm_7.bmp)
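As a final check of the claim that only the Gram matrix is needed, we can train the same kind of model from a precomputed kernel. This sketch uses `rbf_kernel` from `sklearn.metrics.pairwise` together with `kernel="precomputed"`: with this option `fit` receives the train-train Gram matrix and `predict` receives the kernel values between test and training points. It reuses `x_train`, `x_test`, `y_train`, `y_test` from the circles example above.

```python
from sklearn.metrics.pairwise import rbf_kernel

gamma = 1
G_train = rbf_kernel(x_train, x_train, gamma=gamma)  # G_ij = exp(-gamma ||x_i - x_j||^2)
model_pre = SVC(C=1, kernel="precomputed")
model_pre.fit(G_train, y_train)

# For prediction, evaluate the kernel between test and training points.
G_test = rbf_kernel(x_test, x_train, gamma=gamma)
print(f"test accuracy:\t{accuracy_score(y_test, model_pre.predict(G_test)):.2%}")
```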