
Math behind Soft-Margin SVM Classifier - Sequential Minimal Optimization

The decision function for an SVM classifier is given by:

\begin{equation*}\begin{aligned}\hat{f}(\bar x) = \text{sgn}(\bar w.\bar x - b) \hspace{5 mm} \bar x \in \mathbb{R}^n \hspace{5 mm} \text{and} \hspace{2 mm} \bar w \in \mathbb{R}^n\end{aligned}\end{equation*}

\bar w is the normal vector and b is the offset term for the decision surface \bar w.\bar x = b.
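
As a quick illustration, here is a minimal Python sketch of this decision function (the function name and values are my own, and sgn(0) is arbitrarily treated as +1):

```python
import numpy as np

def svm_predict(w: np.ndarray, b: float, x: np.ndarray) -> int:
    """Predicted label sgn(w.x - b), returned as +1 or -1."""
    return 1 if w @ x - b >= 0 else -1

# Example: a point on the positive side of the surface w.x = b
print(svm_predict(np.array([1.0, 2.0]), 0.5, np.array([3.0, 1.0])))  # prints 1
```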

The corresponding supporting hyperplanes are as follows:

\begin{equation*}\begin{aligned}\bar w.\bar x = b + 1 - \xi \hspace{5 mm} \forall \hspace{2 mm} (\bar x, y) \mid \bar x \in \mathbb{R}^n, \hspace{2 mm} \hspace{2 mm} y = +1, \hspace{2 mm} \xi \geq 0 \end{aligned}\end{equation*}

\begin{equation*}\begin{aligned}\bar w.\bar x = b - 1 + \xi \hspace{5 mm} \forall \hspace{2 mm} (\bar x, y) \mid \bar x \in \mathbb{R}^n, \hspace{2 mm} \hspace{2 mm} y = -1, \hspace{2 mm} \xi \geq 0 \end{aligned}\end{equation*}

In either of the above supporting hyperplanes, \xi \geq 0 is known as the slack variable or error term that measures how far a particular point lies on the wrong side of its respective hyperplane.
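
To make the slack concrete: for a fixed candidate surface (\bar w, b), the smallest \xi that satisfies the relevant constraint together with \xi \geq 0 works out to max(0, 1 - y(\bar w.\bar x - b)), i.e., the hinge loss of the point. A small sketch (the function name and values are illustrative):

```python
import numpy as np

def slack(w: np.ndarray, b: float, x: np.ndarray, y: int) -> float:
    """Smallest feasible slack for point (x, y) given the surface (w, b)."""
    return max(0.0, 1.0 - y * (w @ x - b))

w, b = np.array([1.0, 0.0]), 0.0
print(slack(w, b, np.array([2.0, 0.0]), +1))  # 0.0 -> outside its margin
print(slack(w, b, np.array([0.5, 0.0]), +1))  # 0.5 -> inside the margin
print(slack(w, b, np.array([2.0, 0.0]), -1))  # 3.0 -> entirely on the wrong side
```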

The optimization problem to compute a soft-margin decision surface \bar w^*.\bar x = b^* is expressed as follows:

\begin{equation*}\begin{aligned} \underset{\bar w, b, \xi}{\text{min}} & \left( \frac{1}{2} \bar w.\bar w \hspace{2 mm} + \hspace{2 mm} C \sum\limits_{i=1}^m \xi_i  \right) \end{aligned}\end{equation*}

\begin{aligned}\text{subject to: }\end{aligned}
\begin{equation*}\begin{aligned} \hspace{0.7 in} \bar w.\bar x_i \geq b + 1 - \xi_i \hspace{5 mm} \forall \hspace{2 mm} (\bar x_i, y_i)  \in \mathbb{D},  \hspace{2 mm} \hspace{2 mm} y_i = +1, \hspace{2 mm} \end{aligned}\end{equation*}

\begin{equation*}\begin{aligned} \hspace{0.7 in} \bar w.\bar x_j \leq b - 1 + \xi_j \hspace{5 mm} \forall \hspace{2 mm} (\bar x_j, y_j)  \in \mathbb{D},  \hspace{2 mm} \hspace{2 mm} y_j = -1, \hspace{2 mm} \end{aligned}\end{equation*}

\begin{equation*}\begin{aligned} \hspace{0.7 in} \xi_i \geq 0, \hspace{2 mm} i = 1, \ldots, m \end{aligned}\end{equation*}

\begin{equation*}\begin{aligned} \hspace{0.7 in} \text{where} \hspace{2 mm} \mathbb{D} = \left\{ (\bar x_i, y_i) : \bar x_i \in \mathbb{R}^n, \hspace{1 mm} y_i \in \{+1, -1\}, \hspace{1 mm} i = 1, \ldots, m \right\} \hspace{2 mm} \text{ is the training set}  \end{aligned}\end{equation*}

Rewriting Supporting Hyperplane Constraints in Compact Form:

The above supporting hyperplane constraints can be rewritten in a compact form as follows:

For the supporting hyperplane constraint covering all training points labeled +1, we can rewrite the dot product and the free term as follows:

\begin{equation*}\begin{aligned} \hspace{0.7 in} \bar w. (+1)\bar x_i \geq 1 + (+1)b - \xi_i \hspace{5 mm} \forall \hspace{2 mm} (\bar x_i, y_i)  \in \mathbb{D},  \hspace{2 mm} \hspace{2 mm} y_i = +1, \hspace{2 mm} \end{aligned}\end{equation*}

Now we can substitute y_i in place of +1, since y_i = +1 for every point covered by this constraint and the inequality still holds:

\begin{equation*}\begin{aligned} \hspace{0.7 in} \bar w. y_i\bar x_i \geq 1 + y_ib - \xi_i \hspace{5 mm} \forall \hspace{2 mm} (\bar x_i, y_i)  \in \mathbb{D},  \hspace{2 mm} \hspace{2 mm} y_i = +1, \hspace{2 mm} \end{aligned}\end{equation*}

Rearranging the terms, the constraint becomes:

\begin{equation*}\begin{aligned} \hspace{0.7 in} \bar w. y_i\bar x_i - y_ib + \xi_i - 1  \geq 0 \hspace{5 mm} \forall \hspace{2 mm} (\bar x_i, y_i)  \in \mathbb{D},  \hspace{2 mm} \hspace{2 mm} y_i = +1, \hspace{2 mm} \end{aligned}\end{equation*}

\begin{equation*}\begin{aligned}\hspace{0.7 in} \Rightarrow y_i(\bar w.\bar x_i - b) + \xi_i - 1  \geq 0 \hspace{5 mm} \forall \hspace{2 mm} (\bar x_i, y_i)  \in \mathbb{D},  \hspace{2 mm} \hspace{2 mm} y_i = +1, \hspace{2 mm} \end{aligned}\end{equation*}

For the supporting hyperplane constraint covering all training points labeled -1, multiplying both sides by -1 and flipping the inequality from \leq to \geq, we get:

\begin{equation*}\begin{aligned} \hspace{0.7 in} (-1)( \bar w.\bar x_j) \geq (-1)(b - 1 + \xi_j) \hspace{5 mm} \forall \hspace{2 mm} (\bar x_j, y_j)  \in \mathbb{D},  \hspace{2 mm} \hspace{2 mm} y_j = -1, \hspace{2 mm} \end{aligned}\end{equation*}

\begin{equation*}\begin{aligned} \hspace{0.7 in} \Rightarrow \bar w.(-1)\bar x_j \geq  1 - \xi_j + (-1)b \hspace{5 mm} \forall \hspace{2 mm} (\bar x_j, y_j)  \in \mathbb{D},  \hspace{2 mm} \hspace{2 mm} y_j = -1, \hspace{2 mm} \end{aligned}\end{equation*}

Now we can substitute y_j in place of -1, since y_j = -1 for every point covered by this constraint and the inequality still holds:

\begin{equation*}\begin{aligned} \hspace{0.7 in} \bar w.y_j\bar x_j \geq  1 - \xi_j + y_jb \hspace{5 mm} \forall \hspace{2 mm} (\bar x_j, y_j)  \in \mathbb{D},  \hspace{2 mm} \hspace{2 mm} y_j = -1, \hspace{2 mm} \end{aligned}\end{equation*}

Rearranging the terms, the constraint becomes:

\begin{equation*}\begin{aligned} \hspace{0.7 in} \bar w.y_j\bar x_j - y_jb + \xi_j -1 \geq  0 \hspace{5 mm} \forall \hspace{2 mm} (\bar x_j, y_j)  \in \mathbb{D},  \hspace{2 mm} \hspace{2 mm} y_j = -1, \hspace{2 mm} \end{aligned}\end{equation*}

\begin{equation*}\begin{aligned} \hspace{0.7 in} \Rightarrow y_j(\bar w.\bar x_j - b) + \xi_j -1 \geq  0 \hspace{5 mm} \forall \hspace{2 mm} (\bar x_j, y_j)  \in \mathbb{D},  \hspace{2 mm} \hspace{2 mm} y_j = -1, \hspace{2 mm} \end{aligned}\end{equation*}

As is evident, the compact constraints for both supporting hyperplanes have the same form. Hence they can be expressed as a single constraint over all points in the training set \mathbb{D}:

\begin{equation*}\begin{aligned}\hspace{0.7 in} y_i(\bar w.\bar x_i - b) + \xi_i - 1  \geq 0 \hspace{5 mm} \forall \hspace{2 mm} (\bar x_i, y_i)  \in \mathbb{D},  \hspace{2 mm} y_i \in \{+1,-1\} \end{aligned}\end{equation*}
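
As a quick numerical sanity check, the compact form can be compared against the two label-specific forms on random values (all values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
w, b = rng.normal(size=3), 0.5
for _ in range(1000):
    x, xi = rng.normal(size=3), abs(rng.normal())
    for y in (+1, -1):
        compact = y * (w @ x - b) + xi - 1 >= 0
        original = (w @ x >= b + 1 - xi) if y == +1 else (w @ x <= b - 1 + xi)
        assert compact == original
print("compact constraint agrees with both label-specific constraints")
```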

Optimization Problem with Constraints in Compact Form

\begin{equation*}\begin{aligned} \underset{\bar w, b, \xi}{\text{min}} & \left( \frac{1}{2} \bar w.\bar w \hspace{2 mm} + \hspace{2 mm} C \sum\limits_{i=1}^m \xi_i  \right) \end{aligned}\end{equation*}

\begin{aligned}\text{subject to: }\end{aligned}
\begin{equation*}\begin{aligned}\hspace{0.7 in} y_i(\bar w.\bar x_i - b) + \xi_i - 1  \geq 0 \hspace{5 mm} \forall \hspace{2 mm} (\bar x_i, y_i)  \in \mathbb{D},  \hspace{2 mm} y_i \in \{+1,-1\} \end{aligned}\end{equation*}

\begin{equation*}\begin{aligned} \hspace{0.7 in} \xi_i \geq 0, \hspace{2 mm} i = 1, \ldots, m \end{aligned}\end{equation*}

\begin{equation*}\begin{aligned} \hspace{0.7 in} \text{where} \hspace{2 mm} \mathbb{D} = \left\{ (\bar x_i, y_i) : \bar x_i \in \mathbb{R}^n, \hspace{1 mm} y_i \in \{+1, -1\}, \hspace{1 mm} i = 1, \ldots, m \right\} \hspace{2 mm} \text{ is the training set}  \end{aligned}\end{equation*}
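
With the constraints in compact form, the primal can be handed to a generic constrained solver. Below is a minimal sketch using SciPy's SLSQP method on a made-up 2-D toy set (the data, C, and the variable packing are all assumptions for illustration; a real SVM implementation would use a dedicated QP or SMO solver instead):

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
m, n = X.shape
C = 1.0

# Pack the decision variables as z = (w_1..w_n, b, xi_1..xi_m)
def objective(z):
    w, xi = z[:n], z[n + 1:]
    return 0.5 * w @ w + C * xi.sum()

def margin(z):
    # Compact constraint: y_i (w.x_i - b) + xi_i - 1 >= 0 for every point
    w, b, xi = z[:n], z[n], z[n + 1:]
    return y * (X @ w - b) + xi - 1

bounds = [(None, None)] * (n + 1) + [(0, None)] * m   # enforces xi_i >= 0
res = minimize(objective, np.zeros(n + 1 + m), method='SLSQP',
               bounds=bounds, constraints=[{'type': 'ineq', 'fun': margin}])
print('w* =', res.x[:n], ' b* =', res.x[n])
```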

Optimization Problem in Lagrange Form

\begin{equation*}\begin{aligned} \underset{\bar \alpha, \bar \beta}{\text{max}}  \hspace{1 mm} \underset{\bar w, b, \xi}{\text{min}} & \left( \frac{1}{2} \bar w.\bar w \hspace{2 mm} + \hspace{2 mm} C \sum\limits_{i=1}^m \xi_i - \sum\limits_{i=1}^m \alpha_i (y_i(\bar w.\bar x_i - b) + \xi_i - 1) - \sum\limits_{i=1}^m \beta_i \xi_i  \right)  \end{aligned}\end{equation*}

\begin{aligned}\text{subject to: }\end{aligned}
\begin{equation*}\begin{aligned} \hspace{0.7 in} \alpha_i \geq 0, \hspace{2 mm} \beta_i \geq 0, \hspace{2 mm} i = 1, \ldots, m. \end{aligned}\end{equation*}

Where there is a will, there is a way - Intuition behind Lagrange Optimization

In many constrained optimization problems (both maximization and minimization), the problem is first converted into what is called the Lagrangian form, and the optimal solution is then computed.

The following is the general representation of a Constrained minimization problem:

\begin{equation*}\begin{aligned} \underset{\bar x}{\text{min}} \hspace{2 mm} f(\bar x) \hspace{5 mm} \bar x \in \mathbb{R}^n \end{aligned}\end{equation*}

\begin{aligned}\text{subject to: }\end{aligned}
\begin{equation*}\begin{aligned} \hspace{0.7 in} g_i(\bar x) \geq 0, \; i = 1, \ldots, m. \end{aligned}\end{equation*}

f(\bar x) is the objective function and the g_i(\bar x) are the constraint functions.

Note that the constraint functions g_i(\bar x) are assumed to be linearly independent - meaning there is no linear relationship among them, so no one function can be expressed as a combination of the others scaled by constants. (In the standard KKT treatment, this requirement is placed on the constraint gradients at the optimal point.) Formally, no non-trivial linear combination of them vanishes:

\begin{aligned} \sum\limits_{i=1}^m c_i \times g_i(\bar x) = 0 \hspace{2 mm} \text{for all} \hspace{2 mm} \bar x \hspace{2 mm} \Rightarrow \hspace{2 mm} c_1 = c_2 = \ldots = c_m = 0 \end{aligned}
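
Numerically, linear independence of a set of gradient vectors can be checked by stacking them as rows of a matrix and testing for full row rank; a small sketch with two made-up linear constraints (so their gradients are constant):

```python
import numpy as np

# Illustrative constraints g1 = x1 + x2 - 2 and g2 = x1 - 1; being linear,
# their gradients are the constant vectors below.
grads = np.array([[1.0, 1.0],   # grad g1
                  [1.0, 0.0]])  # grad g2
print(np.linalg.matrix_rank(grads) == grads.shape[0])  # True -> independent
```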

The above optimization problem is then written in Lagrange optimization form as follows:

\begin{equation*}\begin{aligned}\underset{\bar \alpha}{\text{max}} \hspace{2 mm} \underset{\bar x}{\text{min}} & f(\bar x) - \sum\limits_{i=1}^m \alpha_i \times g_i(\bar x) \end{aligned}\end{equation*}
\begin{aligned}\text{where } \alpha_i \geq 0, \; i = 1, \ldots, m. \end{aligned}

The values \alpha_1, \alpha_2, \ldots, \alpha_m in the above representation are called Lagrange multipliers (a.k.a. KKT multipliers).

There seem to be two intuitions behind the above formulation of the Lagrangian optimization - one simple enough to explain in a top-down manner, and a second, somewhat related to the first, that requires a deeper understanding. We will look at both intuitions.

Intuition #1:
If an optimal solution \bar x^* exists for the constrained optimization problem, then at that optimal point the gradient vector of the objective function, \nabla f(\bar x^*), lies in the span of the gradient vectors of the constraint functions, \nabla g_i(\bar x^*). This means that the objective function gradient at the optimal point \bar x^* can be expressed as a linear combination, with non-negative coefficients, of the constraint function gradient vectors:

\begin{aligned}\nabla f(\bar x^*) = \alpha_1 \nabla g_1(\bar x^*) + \alpha_2 \nabla g_2(\bar x^*) + \ldots + \alpha_m \nabla g_m(\bar x^*), \hspace{2 mm} \text{where} \hspace{2 mm} \alpha_i \geq 0 \end{aligned}

\begin{aligned}\Rightarrow \nabla f(\bar x^*) - (\alpha_1 \nabla g_1(\bar x^*) + \alpha_2 \nabla g_2(\bar x^*) + \ldots + \alpha_m \nabla g_m(\bar x^*)) = 0 \end{aligned}

\begin{aligned}\Rightarrow \nabla f(\bar x^*) - \sum\limits_{i=1}^m \alpha_i \nabla g_i(\bar x^*) = 0 \end{aligned}
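
For a concrete feel, take the toy problem f(\bar x) = x_1^2 + x_2^2 subject to g(\bar x) = x_1 + x_2 - 2 \geq 0 (an assumed example, solved in full in a later sketch). At its optimum \bar x^* = (1, 1), the above condition holds with \alpha_1 = 2:

```python
import numpy as np

x_star, alpha_star = np.array([1.0, 1.0]), 2.0
grad_f = 2 * x_star            # gradient of f = x1^2 + x2^2 at x*
grad_g = np.array([1.0, 1.0])  # gradient of g = x1 + x2 - 2 (constant)
print(np.allclose(grad_f, alpha_star * grad_g))  # True
```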

This stationarity equation in the terms \nabla f(\bar x^*) and \nabla g_i(\bar x^*) seems to suggest that there exists a function L(\bar \alpha, \bar x) whose gradient \nabla L is 0 at \bar x^*.

The function L(\bar \alpha, \bar x) and its gradient \nabla L can be represented more generally as follows:

\begin{aligned}\Rightarrow L(\bar \alpha, \bar x) = f(\bar x) - \sum\limits_{i=1}^m \alpha_i \times g_i(\bar x) \end{aligned}

\begin{aligned}\Rightarrow \nabla L(\bar \alpha, \bar x) = \nabla f(\bar x) - \sum\limits_{i=1}^m \alpha_i \nabla g_i(\bar x) \end{aligned}

Note that the values \alpha_1, \alpha_2, \ldots \alpha_m in the above equations are represented as vector \bar \alpha for convenience such that \bar \alpha = (\alpha_1, \alpha_2, \ldots \alpha_m).

Also note that the gradient \nabla L is a vector and, since \bar x \in \mathbb{R}^n, \nabla L has the components \left( \frac{\partial L}{\partial x_1}, \frac{\partial L}{\partial x_2}, \ldots, \frac{\partial L}{\partial x_n} \right).

If \nabla L = 0 at \bar x^*, then each of the components \frac{\partial L}{\partial x_i} is equal to 0.

Since \bar x^* is unknown, and since finding the optimal \bar x^* is our goal, the generalized function L(\bar \alpha, \bar x) gives some confidence that the optimal \bar x^* can be found by solving for a point where \nabla L = 0. However, for the generalized function L(\bar \alpha, \bar x) there can be many points \bar x where \nabla L = 0. All such points are called stationary points (for the Lagrangian they are typically saddle points), and the challenge is then to determine which of them is the optimal point \bar x^* that yields the optimal solution to the primary objective function f(\bar x) subject to the constraints g_i(\bar x) \geq 0.

The intuition postulates that under certain special circumstances (which we will quantify mathematically in later paragraphs), one of the saddle points \bar x^* on L(\bar \alpha, \bar x) can potentially be the optimal point for the main objective function f(\bar x) subject to constraints g_i(\bar x) \geq 0.

Note that the values of the multipliers \alpha_i in the function L(\bar \alpha, \bar x) are unknown. Essentially, if we are to use L(\bar \alpha, \bar x) to find the optimal point \bar x^*, the challenge is twofold: we have to search not only for \bar x^* but also, simultaneously, for the optimal values of \alpha_i, so that the desired saddle point \bar x^* can be found.

This implies that we have to find an optimal combination of \bar x^* = ( x_1^*, x_2^*, \ldots, x_n^* ) and \bar \alpha^* = ( \alpha_1^*, \alpha_2^*, \ldots, \alpha_m^*) with \nabla L(\bar \alpha^*, \bar x^*) = 0, such that \bar x^* can potentially be the optimal point for the main objective function f(\bar x) subject to constraints g_i(\bar x) \geq 0.

Note that:

\begin{aligned}\nabla L(\bar \alpha^*, \bar x^*) = 0 \Rightarrow \frac{\partial L(\bar \alpha^*, \bar x^*)}{\partial x_1} = \frac{\partial L(\bar \alpha^*, \bar x^*)}{\partial x_2} = \ldots = \frac{\partial L(\bar \alpha^*, \bar x^*)}{\partial x_n} = 0 \end{aligned}

If we go with the above intuition and solve for (\bar \alpha^*, \bar x^*) using L(\bar \alpha, \bar x) then, analytically speaking, the real challenge is that we have to find the n variables in \bar x^* = ( x_1^*, x_2^*, \ldots, x_n^* ) and the m values in ( \alpha_1^*, \alpha_2^*, \ldots, \alpha_m^*). Essentially there are (n+m) unknowns to solve for. So far, from the condition \nabla L(\bar \alpha^*, \bar x^*) = 0, we have arrived at the following n equations in (n+m) unknowns.

\begin{aligned}\nabla L = 0 \Rightarrow \frac{\partial L}{\partial x_1} = \frac{\partial L}{\partial x_2} = \ldots = \frac{\partial L}{\partial x_n} = 0 \end{aligned}

Missing are the m additional equations in (n+m) unknowns.

This is where the primary intuition can be revisited. It suggested that, under certain special conditions, one of the saddle points \bar x^* of L(\bar \alpha, \bar x) can potentially be the optimal point for the main objective function f(\bar x) subject to the constraints g_i(\bar x) \geq 0. Essentially this means that when such conditions are met, the optimal value of the objective function f(\bar x) at \bar x^*, subject to g_i(\bar x^*) \geq 0, is nothing but the value of the function L(\bar \alpha, \bar x) at (\bar \alpha^*, \bar x^*).

\begin{aligned}\Rightarrow L(\bar \alpha^*, \bar x^*) = f(\bar x^*)\end{aligned}

However by formulation of generalized function L(\bar \alpha, \bar x), we know that:

\begin{aligned} L(\bar \alpha, \bar x) = f(\bar x) - \sum\limits_{i=1}^m \alpha_i \times g_i(\bar x) \end{aligned}

So the value of function L(\bar \alpha, \bar x) at (\bar \alpha^*, \bar x^*) is then given by:

\begin{aligned}\Rightarrow L(\bar \alpha^*, \bar x^*) = f(\bar x^*) - \sum\limits_{i=1}^m \alpha_i^* \times g_i(\bar x^*) \end{aligned}

This can be re-written as:

\begin{aligned}\Rightarrow L(\bar \alpha^*, \bar x^*) - f(\bar x^*) + \sum\limits_{i=1}^m \alpha_i^* \times g_i(\bar x^*) = 0 \end{aligned}

If \begin{aligned}L(\bar \alpha^*, \bar x^*) = f(\bar x^*)\end{aligned} then the above equation results in:

\begin{aligned}\Rightarrow L(\bar \alpha^*, \bar x^*) - L(\bar \alpha^*, \bar x^*) + \sum\limits_{i=1}^m \alpha_i^* \times g_i(\bar x^*) = 0 \end{aligned}

\begin{aligned}\Rightarrow \sum\limits_{i=1}^m \alpha_i^* \times g_i(\bar x^*) = 0 \end{aligned}

Since \alpha_i^* \geq 0 and, by feasibility of \bar x^*, g_i(\bar x^*) \geq 0, every term \alpha_i^* \times g_i(\bar x^*) in the above sum is non-negative. A sum of non-negative terms can equal 0 only when each term is itself equal to 0.

And voila! We have the special conditions and the missing m equations that must be satisfied at (\bar \alpha^*, \bar x^*) - the well-known complementary slackness conditions:

\begin{aligned}\Rightarrow \alpha_i \times g_i(\bar x) = 0, i = 1, \ldots, m. \end{aligned}

Now, using the following (n+m) equations in (n+m) unknowns, we can find the optimal combination of \bar x^* = ( x_1^*, x_2^*, \ldots, x_n^* ) and \bar \alpha^* = ( \alpha_1^*, \alpha_2^*, \ldots, \alpha_m^*) with \nabla L(\bar \alpha^*, \bar x^*) = 0, such that \bar x^* can potentially be the optimal point for the main objective function f(\bar x) subject to constraints g_i(\bar x) \geq 0.

\begin{aligned}\frac{\partial L}{\partial x_1} = \frac{\partial L}{\partial x_2} = \ldots = \frac{\partial L}{\partial x_n} = 0 \hspace{2 mm} \text{and} \hspace{2 mm} \alpha_i \times g_i(\bar x) = 0, \hspace{2 mm} i = 1, \ldots, m. \end{aligned}
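
Returning to the toy problem from earlier (f(\bar x) = x_1^2 + x_2^2, g(\bar x) = x_1 + x_2 - 2 \geq 0, so n = 2 and m = 1), here is a sketch that assembles these (n+m) equations symbolically with SymPy and filters out candidates violating feasibility or \alpha \geq 0:

```python
import sympy as sp

x1, x2, a = sp.symbols('x1 x2 alpha', real=True)
f = x1**2 + x2**2
g = x1 + x2 - 2
L = f - a * g

# n equations dL/dx_i = 0, plus m equations alpha_i * g_i = 0
eqs = [sp.diff(L, x1), sp.diff(L, x2), a * g]
for sol in sp.solve(eqs, [x1, x2, a], dict=True):
    if sol[a] >= 0 and g.subs(sol) >= 0:   # keep feasible points with alpha >= 0
        print(sol, ' f =', f.subs(sol))
# expected surviving solution: x1 = 1, x2 = 1, alpha = 2, with f = 2
```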

Hence Intuition #1 forms the basis for formulating the Lagrangian optimization problem from the constrained optimization problem.

Notes About Inequality Constraints:

  • In minimization problems, the standard convention is to state the constraints as g_i(\bar x) \geq 0. In this case, the gradient vector of the objective function \nabla f and the gradient vectors of the constraint functions \nabla g_i at the optimal point \bar x^* point in the same direction, and the \alpha_i in \nabla f(\bar x^*) = \sum\limits_{i=1}^m \alpha_i \nabla g_i(\bar x^*) are \geq 0. If the constraints are instead declared as g_i(\bar x) \leq 0, the gradient vectors point in opposite directions and the \alpha_i are \leq 0 to compensate.

  • Alternatively, in minimization problems with constraints declared as g_i(\bar x) \leq 0, the linear combination of gradient vectors can be expressed as \nabla f(\bar x^*) = -\sum\limits_{i=1}^m \alpha_i \nabla g_i(\bar x^*) to account for the opposite directions, and the \alpha_i are then \geq 0.

  • In maximization problems, the standard convention is to state the constraints as g_i(\bar x) \leq 0. In this case, \nabla f and the \nabla g_i at the optimal point \bar x^* point in the same direction, and the \alpha_i in \nabla f(\bar x^*) = \sum\limits_{i=1}^m \alpha_i \nabla g_i(\bar x^*) are \geq 0. If the constraints are declared as g_i(\bar x) \geq 0, the gradient vectors point in opposite directions and the \alpha_i are \leq 0 to compensate.

  • Alternatively, in maximization problems with constraints declared as g_i(\bar x) \geq 0, the linear combination can be expressed as \nabla f(\bar x^*) = -\sum\limits_{i=1}^m \alpha_i \nabla g_i(\bar x^*) to account for the opposite directions, and the \alpha_i are then \geq 0.