These are my solutions to the Chapter 2 exercises of this wonderful primer on non-convex optimization.

Ex 2.1: Show that strong smoothness does not imply convexity by constructing a nonconvex function \(f: R^p \rightarrow R\) that is 1-SS

Let \(f(x) = - \left\Vert x \right\Vert_2^2\), which is non-convex.

So \(\nabla f(x) = -2x\)

\[f(y) - f(x) - \langle \nabla f(x) , y - x \rangle = \left\Vert x \right\Vert_2^2 - \left\Vert y \right\Vert_2^2 - \langle \nabla f(x) , y - x \rangle\] \[= \left\Vert x \right\Vert_2^2 - \left\Vert y \right\Vert_2^2 - 2\left\Vert x \right\Vert_2^2 + 2\langle x , y \rangle\] \[= - \left\Vert x-y \right\Vert_2^2 \leq 0 \leq \frac{1}{2} \left\Vert x-y \right\Vert_2^2\]
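Here is a quick numerical sanity check of this bound (not part of the proof; the dimension and the sampled points are arbitrary choices):

```python
# Check the 1-SS upper bound f(y) - f(x) - <grad f(x), y - x> <= 1/2 ||x - y||^2
# for f(x) = -||x||_2^2 at randomly drawn points.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: -np.dot(x, x)
grad_f = lambda x: -2.0 * x

p = 5
for _ in range(1000):
    x, y = rng.normal(size=p), rng.normal(size=p)
    lhs = f(y) - f(x) - grad_f(x) @ (y - x)
    rhs = 0.5 * np.dot(x - y, x - y)
    assert lhs <= rhs + 1e-12
print("1-SS upper bound held at all sampled points")
```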

Ex 2.2. Show that if a differentiable function f has bounded gradients, i.e., \(\left\Vert \nabla f(x) \right\Vert_2^2 \leq G\) for all \(x \in R^d\), then \(f\) is Lipschitz. What is its Lipschitz constant?

From the mean value theorem, there exists a point \(c\) on the segment between \(x\) and \(y\) such that \(f(y) - f(x) = \langle \nabla f(c), y - x \rangle\)

So, by the Cauchy-Schwarz inequality, \(\lvert f(y) - f(x) \rvert \leq \Vert \nabla f(c) \Vert_2 \Vert y - x \Vert_2\)

So, \(\lvert f(y) - f(x) \rvert \leq \sqrt{G} \left\Vert y-x \right\Vert_2\)

The Lipschitz constant is \(\sqrt{G}\)
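As a quick illustration (my own example, not from the book), take \(f(x) = \sqrt{1 + \Vert x \Vert_2^2}\), for which \(\Vert \nabla f(x) \Vert_2^2 = \frac{\Vert x \Vert_2^2}{1 + \Vert x \Vert_2^2} \leq 1\), so \(G = 1\) and the claimed Lipschitz constant is \(\sqrt{G} = 1\):

```python
# Check |f(y) - f(x)| <= sqrt(G) * ||y - x||_2 for f(x) = sqrt(1 + ||x||^2), G = 1.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sqrt(1.0 + np.dot(x, x))

G, d = 1.0, 4
for _ in range(1000):
    x, y = 10 * rng.normal(size=d), 10 * rng.normal(size=d)
    assert abs(f(y) - f(x)) <= np.sqrt(G) * np.linalg.norm(y - x) + 1e-12
print("Lipschitz bound held at all sampled points")
```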

Ex 2.3. Show that for any point \(z \not\in B_2(r)\), the projection onto the ball is given by \(\Pi_{B_2(r)}(z) = r \cdot \frac{z}{\Vert z \Vert_2}\)

Geometrically, the closest point of the set \(B_2(r)\) to a point outside it lies on the boundary of the ball, in the same direction as \(z\). The unit vector in that direction is \(\hat{e} = \frac{z}{\Vert z \Vert_2}\), and the magnitude is \(r\)

So, the projection is \(\Pi_{B_2(r)}(z) = r\hat{e} = r\frac{z}{\Vert z \Vert_2}\)

Proof by Contradiction:

Assume there is a point \(\hat{z} \in B_2(r)\) with \(\hat{z} \neq r\frac{z}{\Vert z \Vert_2}\) that is the projection of \(z\) onto the L2 ball.

From the projection lemma, which states: for any set (convex or not) \(C \subset R^p\) and \(z \in R^p\), let \(\hat{z} = \Pi_C(z)\); then for all \(x \in C\), \(\Vert\hat{z} - z\Vert_2 \leq \Vert x - z\Vert_2\).

So, \(\Vert\hat{z} - z\Vert_2 \leq \Vert r\frac{z}{\Vert z \Vert_2} - z\Vert_2 = \left\lvert \frac{r}{\Vert z \Vert_2} - 1 \right\rvert \Vert z\Vert_2 = \Vert z \Vert_2 - r\), since \(r \lt \Vert z \Vert_2\).

But by the triangle inequality, every \(\hat{z} \in B_2(r)\) satisfies \(\Vert\hat{z} - z\Vert_2 \geq \Vert z \Vert_2 - \Vert \hat{z} \Vert_2 \geq \Vert z \Vert_2 - r\), with equality only when \(\Vert \hat{z} \Vert_2 = r\) and \(\hat{z}\) points in the same direction as \(z\), i.e. \(\hat{z} = r\frac{z}{\Vert z \Vert_2}\). This contradicts the assumption \(\hat{z} \neq r\frac{z}{\Vert z \Vert_2}\).
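A quick numerical sanity check of the closed form (not a proof; \(r\), \(z\), and the sampling scheme below are illustrative): the candidate \(r\frac{z}{\Vert z \Vert_2}\) should be at least as close to \(z\) as any other sampled point of the ball.

```python
# Compare the claimed projection r * z / ||z||_2 against random points of B_2(r).
import numpy as np

rng = np.random.default_rng(0)
p, r = 3, 2.0

z = rng.normal(size=p)
z = 5.0 * z / np.linalg.norm(z)              # a point with ||z||_2 = 5 > r
proj = r * z / np.linalg.norm(z)             # claimed projection onto B_2(r)

for _ in range(10000):
    x = rng.normal(size=p)
    x = rng.uniform(0.0, r) * x / np.linalg.norm(x)   # a random point inside B_2(r)
    assert np.linalg.norm(proj - z) <= np.linalg.norm(x - z) + 1e-12
print("r * z / ||z||_2 was closest among all sampled points of the ball")
```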

Ex 2.4. Show that a horizon-oblivious setting of \(\eta_t = \frac{1}{\sqrt{t}}\) while executing the Projected Gradient Descent algorithm with a convex function with bounded gradients also ensures convergence.

Define the potential function \(\Phi_t = f(x^t) - f(x^\star)\), where \(x^\star\) is the minimizer. From the convexity of \(f\), we can upper-bound \(\Phi_t\).

\begin{align} \Phi_t &= f(x^t) - f(x^\star) \\ &\leq \langle \nabla f(x^t), x^t - x^\star \rangle \\ &\leq \frac{1}{2\eta_t} \left( \Vert x^t - x^\star \Vert_2^2 + \eta_t^2G^2 - \Vert z^{t+1} - x^\star \Vert_2^2 \right) \end{align}

where \(z^{t+1} = x^t - \eta_t \nabla f(x^t)\) is the (unprojected) gradient update at \(x^t\), and \(x^{t+1} = \Pi_X(z^{t+1})\) is its projection onto the convex constraint set \(X\). The last inequality follows by expanding \(\Vert z^{t+1} - x^\star \Vert_2^2 = \Vert x^t - x^\star \Vert_2^2 - 2\eta_t\langle \nabla f(x^t), x^t - x^\star \rangle + \eta_t^2\Vert \nabla f(x^t) \Vert_2^2\) and applying the gradient bound \(\Vert \nabla f(x^t) \Vert_2 \leq G\).

We also know that, because the constraint set is convex, the projection step satisfies \(\Vert z^{t+1} - x^\star \Vert_2^2 \geq \Vert x^{t+1} - x^\star \Vert_2^2\)

So, \(\Phi_t \leq \frac{1}{2\eta_t} \left(\Vert x^{t} - x^\star \Vert_2^2 - \Vert x^{t+1} - x^\star \Vert_2^2 \right) + \frac{\eta_t G^2}{2}\)

Refer to Pg. 21 of this book.

Because \(\eta_t\) varies with \(t\), the above form doesn't allow for an easy telescoping operation, so we have to make an additional assumption. One such assumption is a bound on the diameter of the convex constraint set, \(\mathrm{diam}(X) \leq D\).

\begin{align} \frac{1}{T}\sum_{t=1}^{T} \Phi_t &\leq \frac{1}{2T}\left(\Vert x^\star \Vert_2^2 + \sum_{t=2}^{T} \Vert x^t - x^\star \Vert_2^2\left(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}}\right) + \sum_{t=1}^{T}\eta_t G^2 \right) \\ &\leq \frac{1}{2T}\left(\Vert x^\star \Vert_2^2 + D^2\sum_{t=2}^{T} \left(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}}\right) + \sum_{t=1}^{T}\eta_t G^2 \right) \\ &= \frac{1}{2T}\left(\Vert x^\star \Vert_2^2 + D^2\left(\frac{1}{\eta_T} - \frac{1}{\eta_{1}}\right) + \sum_{t=1}^{T}\eta_t G^2 \right) \\ &\leq \frac{1}{2T}\left(\Vert x^\star \Vert_2^2 + D^2(\sqrt{T} - 1) + (2\sqrt{T} - 1)G^2 \right) \\ &\leq \frac{1}{2\sqrt{T}}\left(\frac{\Vert x^\star \Vert_2^2}{\sqrt{T}} + D^2 + 2G^2 \right) \end{align}

(Using \(x^1 = 0\), \(\eta_1 = 1\), \(\eta_T = \frac{1}{\sqrt{T}}\), and \(\sum_{i=1}^{n}\frac{1}{\sqrt{i}} \leq 2\sqrt{n} - 1\).) Since the right-hand side is \(O\left(\frac{1}{\sqrt{T}}\right)\), the average (and hence the best) iterate's function value converges to \(f(x^\star)\).
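Below is a small simulation of this result (not part of the proof; the objective, the constraint set, and the horizon are illustrative choices of mine): projected gradient descent with \(\eta_t = \frac{1}{\sqrt{t}}\) on \(f(x) = \Vert x - c \Vert_2^2\) over the unit ball \(B_2(1)\), with \(c\) outside the ball so that the minimizer is \(x^\star = \frac{c}{\Vert c \Vert_2}\).

```python
# A toy run of projected gradient descent with the horizon-oblivious step size
# eta_t = 1/sqrt(t). Objective, constraint set, and horizon are illustrative.
import numpy as np

def project_unit_ball(z):
    """Projection onto B_2(1): rescale if the point lies outside the ball."""
    norm = np.linalg.norm(z)
    return z if norm <= 1.0 else z / norm

c = np.array([3.0, -4.0])              # ||c||_2 = 5, so c lies outside B_2(1)
x_star = c / np.linalg.norm(c)         # constrained minimizer
f = lambda x: np.dot(x - c, x - c)

x = np.zeros(2)                        # x^1 = 0
running_gap, avg_gaps = 0.0, []
T = 10000
for t in range(1, T + 1):
    grad = 2.0 * (x - c)
    x = project_unit_ball(x - grad / np.sqrt(t))   # eta_t = 1/sqrt(t)
    running_gap += f(x) - f(x_star)
    avg_gaps.append(running_gap / t)

# The average optimality gap keeps shrinking, consistent with the O(1/sqrt(T)) bound.
print(avg_gaps[99], avg_gaps[999], avg_gaps[-1])
```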

Ex 2.5. Show that if \(f : R^p \rightarrow R\) is a strongly convex function that is differentiable, then there is a unique point \(x^* \in R^p\) that minimizes the function value \(f\), i.e., \(f(x^*) = \min_{x \in R^p} f(x)\)

Proof by Contradiction: Suppose \(x^*\) and \(y^*\) are two distinct points that both minimize \(f\), i.e. \(f(x^*) = f(y^*) = f^*\). Strong convexity implies strict convexity, so for any \(\theta \in (0, 1)\), \(f(\theta x^* + (1-\theta)y^*) < \theta f(x^*) + (1-\theta)f(y^*) = f^*\). This contradicts \(f^*\) being the minimum value, so the minimizer is unique. (A minimizer exists in the first place because strong convexity forces \(f(x) \rightarrow \infty\) as \(\Vert x \Vert_2 \rightarrow \infty\), so \(f\) attains its minimum.)

Ex 2.6. Show that the set of sparse vectors \(B_0(s) \subset R^p\) is non-convex for any \(s < p\). What happens when \(s = p\) ?

Take \(x_1\) such that \(\Vert x_1 \Vert_0 = s\), with its \(s\) non-zero entries in the first \(s\) coordinates, and take \(x_2\) such that \(\Vert x_2 \Vert_0 = 1\), with only its last entry non-zero.

Any convex combination \(\theta x_1 + (1-\theta)x_2\) with \(\theta \in (0,1)\) then has \(s + 1\) non-zero entries, so it is not in \(B_0(s)\) when \(s < p\). Thus \(B_0(s)\) is non-convex. When \(s = p\), however, \(B_0(p) = R^p\), which is convex.
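A concrete instance of this argument (the dimensions and values below are just illustrative):

```python
# Convex combination of two vectors in B_0(s) that is not s-sparse (p = 5, s = 3).
import numpy as np

p, s = 5, 3
x1 = np.zeros(p); x1[:s] = 1.0     # ||x1||_0 = s, support on the first s coordinates
x2 = np.zeros(p); x2[-1] = 7.0     # ||x2||_0 = 1, support on the last coordinate

theta = 0.5
x = theta * x1 + (1 - theta) * x2
print(np.count_nonzero(x))         # 4 = s + 1 > s, so x is outside B_0(s)
```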

Ex 2.7. Show that \(B_{rank(r)} \subseteq R^{n \times n}\), the set of \(n \times n\) matrices with rank at most \(r\), is non-convex for any \(r < n\). What happens when \(r = n\)?

A convex combination of two matrices of rank at most \(r\) can have rank greater than \(r\): with \(A = \sum_{i=1}^{r} e_ie_i^T\) (rank \(r\)) and \(B = e_{r+1}e_{r+1}^T\) (rank \(1\)), where \(e_i\) are the standard basis vectors, the combination \(\frac{1}{2}A + \frac{1}{2}B\) has rank \(r + 1\). So \(B_{rank(r)} \subset R^{n \times n}\) is non-convex for \(r < n\). For \(r = n\), every \(n \times n\) matrix has rank at most \(n\), so \(B_{rank(n)} = R^{n \times n}\), which is convex.
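A concrete instance with illustrative numbers (\(n = 4\), \(r = 2\)):

```python
# Two matrices of rank at most r whose convex combination has rank r + 1.
import numpy as np

n, r = 4, 2
A = np.zeros((n, n)); A[:r, :r] = np.eye(r)   # rank r
B = np.zeros((n, n)); B[r, r] = 1.0           # rank 1 <= r

C = 0.5 * A + 0.5 * B                         # a convex combination of A and B
print(np.linalg.matrix_rank(A), np.linalg.matrix_rank(B), np.linalg.matrix_rank(C))
# prints: 2 1 3 -> rank(C) = r + 1 > r, so B_rank(r) is not convex
```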

Ex 2.9. Consider a least squares optimization problem with a strongly convex and smooth objective. Show that the condition number of this problem is equal to the condition number of the Hessian matrix of the objective function.

Let the objective be

\begin{align} f(w)
&= \frac{1}{2} (Xw - y)^T (Xw-y) \\ &= \frac{1}{2} (X(w-w^\star) + Xw^\star - y)^T (X(w-w^\star) + Xw^\star - y) \end{align}

Let \(Xw^\star - y = P\) and \(X(w-w^\star) = Q\), where \(w^\star = (X^TX)^{-1}X^Ty\) is the minimizer.

So,

\begin{align} f(w) &= \frac{1}{2}(Q + P)^T (Q + P) \\ &= \frac{1}{2}(Q^TQ + Q^TP + P^TQ) + \frac{1}{2}P^TP \\ &= \frac{1}{2}Q^TQ + \frac{1}{2}P^TP \\ &= \frac{1}{2}(w-w^\star)^TX^TX(w-w^\star) + f(w^\star) \end{align}

The cross terms vanish because, by the normal equations, \(X^T(Xw^\star - y) = 0\), so \(Q^TP = (w-w^\star)^TX^T(Xw^\star - y) = 0\).

So the Hessian of the objective is \(H = \nabla^2 f(w) = X^TX\). The condition number \(\kappa\) of the Hessian \(H\) is \(\kappa = \frac{\lambda_1}{\lambda_n}\), where \(\lambda_1\) and \(\lambda_n\) are its largest and smallest eigenvalues.

From strong convexity and smoothness,

\[\frac{\alpha}{2} \Vert w - w^\star \Vert_2^2 \leq f(w) - f(w^\star) \leq \frac{\beta}{2} \Vert w - w^\star \Vert_2^2\]

So, \(\frac{f(w) - f(w^\star)}{\frac{\alpha}{2} \Vert w - w^\star \Vert_2^2} \in [1, \frac{\beta}{\alpha}]\)

Since \(f(w) - f(w^\star) = \frac{1}{2}(w - w^\star)^TH(w - w^\star)\), writing \(v = w - w^\star\) this says \(\frac{v^THv}{\alpha v^Tv} \in [1, \frac{\beta}{\alpha}]\) for any non-zero vector \(v\).

If \(v = v_n\), where \(v_n\) is the eigenvector of the Hessian corresponding to \(\lambda_n\), and \(\alpha\) is the largest constant for which the strong convexity bound holds (so the lower bound above is tight), then,

\begin{align} \frac{v_n^THv_n}{\alpha v_n^Tv_n} &= \frac{v_n^T \lambda_n v_n}{\alpha v_n^T v_n} \\ &= \frac{\lambda_n}{\alpha} = 1 \end{align}

So \(\lambda_n = \alpha\). Similarly, taking \(v = v_1\) and \(\beta\) as the smallest valid smoothness constant gives \(\lambda_1 = \beta\). Hence the condition number of the problem, \(\frac{\beta}{\alpha}\), equals \(\frac{\lambda_1}{\lambda_n}\), the condition number of the Hessian.
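A numerical illustration (not a proof; the random least-squares instance below is my own): the extreme eigenvalues of \(H = X^TX\) do act as tight strong convexity and smoothness constants of \(f\), so the two condition numbers coincide.

```python
# Check that alpha = lambda_min(H) and beta = lambda_max(H) bound the
# suboptimality gap of a random least-squares objective, with H = X^T X.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

H = X.T @ X
eigvals = np.linalg.eigvalsh(H)            # eigenvalues in ascending order
lam_min, lam_max = eigvals[0], eigvals[-1]

w_star = np.linalg.solve(H, X.T @ y)       # minimizer of f
f = lambda w: 0.5 * np.dot(X @ w - y, X @ w - y)

# alpha/2 ||w - w*||^2 <= f(w) - f(w*) <= beta/2 ||w - w*||^2 at random points
for _ in range(1000):
    w = w_star + rng.normal(size=d)
    gap = f(w) - f(w_star)
    sq = np.dot(w - w_star, w - w_star)
    assert lam_min / 2 * sq - 1e-9 <= gap <= lam_max / 2 * sq + 1e-9

print("condition number of the problem and of H:", lam_max / lam_min)
```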