A linear model \(y = w^T \phi(x)\) has an equivalent dual representation, \(y = \sum_{i=1}^N \alpha_i k(x, x_i)\). Here, the \(\alpha_i\) are scalars, and \(k(x, x_i)\) is a kernel function that measures the similarity between the vectors \(x\) and \(x_i\) by computing the inner product \(\langle \phi(x), \phi(x_i) \rangle\). The \(x_i\) are the observations from the training set.
Consider a linear regression model whose objective function \(J(w)\) can be written as follows:
\[J(w) = \frac{1}{2} \sum_{i=1}^N \{w^T \phi(x_i) - y_i\}^2 + \frac{\lambda}{2} w^Tw \tag{1}\]Setting the gradient \(\nabla_w J(w) = 0\), we get
\[w = -\frac{1}{\lambda} \sum_{i=1}^N \{w^T \phi(x_i) - y_i \}\phi(x_i) \tag{2}\]Let
\[\alpha_i = -\frac{1}{\lambda} \{ w^T \phi(x_i) - y_i\} \tag{3}\]So,
\[w = \sum_{i=1}^N \alpha_i \phi(x_i) = \Phi^T \alpha \tag{4}\]Substituting for \(w\), we get,
\[y = w^T \phi(x) = \alpha^T \Phi \phi(x) = \sum_{i=1}^N \alpha_i \phi(x_i)^T \phi(x) = \sum_{i=1}^N \alpha_i k(x, x_i) \tag{5}\]You can thus see that the linear model in \(w\) can now be expressed as a linear combination of functions of the training observations. But what about \(\alpha_i\) ? It is still a function of \(w\) which we haven’t got rid of yet.
Observe,
\[\alpha_1 = -\frac{1}{\lambda} \{\alpha^T \Phi \phi(x_1) - y_1\}\] \[\dots\] \[\alpha_N = -\frac{1}{\lambda} \{\alpha^T \Phi \phi(x_N) - y_N\}\]So,
\[\alpha = -\frac{1}{\lambda} [ \Phi \Phi^T \alpha - Y] \tag{6}\]Thus,
\[\alpha = (K + \lambda I_N)^{-1} Y \tag{7}\]where \(K = \Phi \Phi^T\) is the kernel matrix, aka the Gram matrix. This is where the problem lies. As you can see, computing the kernel matrix and its inverse becomes prohibitive in both memory and time beyond even 10,000 observations. The solution proposed by Rahimi and Recht tries to solve this problem for a large number of observations.
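For concreteness, here is a minimal NumPy sketch of this dual solution with an RBF kernel; the data, bandwidth `gamma`, and regularization `lam` below are arbitrary toy choices.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # k(a, b) = exp(-gamma * ||a - b||^2), computed for all pairs of rows of A and B
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                        # N = 500 training observations
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)     # toy targets

lam = 1e-2
K = rbf_kernel(X, X)                                 # the N x N Gram matrix -- the bottleneck
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y) # eq. (7)

X_new = rng.normal(size=(10, 5))
y_pred = rbf_kernel(X_new, X) @ alpha                # eq. (5): sum_i alpha_i k(x, x_i)
```

Both the \(O(N^2)\) memory for \(K\) and the \(O(N^3)\) solve are what the random-features trick below avoids.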
Observe that \(p(w)\), the Fourier transform of the Gaussian (RBF) kernel, also happens to be a probability distribution, of the form \(\mathcal{N}(0, I_d)\).
Bochner’s Theorem generalizes this result and states that the Fourier transform of a normalized, shift-invariant kernel (\(k(x, y) = k(x-y)\)) is a probability distribution. Therefore, we can sample \(w\) vectors from \(p(w)\).
From the inverse Fourier transform,
\[\begin{eqnarray} k(x-y) &=& \int_{R^d} p(w) e^{j w^T (x-y)} dw \\ &=& \mathbb{E}_w [e^{j w^T (x-y)}] \\ &=& \mathbb{E}_{w \sim p(w)} [\gamma_w(x) \gamma_w(y)^*] \end{eqnarray} \tag{9}\]where,
\[\gamma_w(x) = e^{j w^T x}\]and
\[\gamma_w(-x) = \gamma_w(x)^*\]is the complex conjugate.
Observe that both \(k\) and \(p(w)\) are real and even, therefore the complex part of the exponential can be dropped,
\[\begin{eqnarray} k(x-y) &=& \mathbb{E}_{w \sim p(w)} [\gamma_w(x) \gamma_w(y)^*] \\ &=& \mathbb{E}_w [\cos(w^T x) \cos(w^T y) + \sin(w^T x) \sin(w^T y)] \tag{10} \end{eqnarray}\]Therefore,
\[\begin{eqnarray} k(x-y) &\approx& \frac{1}{D} \sum_{i=1}^D \cos(w_i^T x) \cos(w_i^T y) + \sin(w_i^T x) \sin(w_i^T y)\\ &=& z_w(x)^T z_w(y) \tag{11} \end{eqnarray}\]where \(z_w(x) = \sqrt{\frac{1}{D}}[\cos(w_1^T x), \sin(w_1^T x), \dots, \cos(w_D^T x), \sin(w_D^T x)]^T\)
The RHS is thus an unbiased estimate of the kernel function, and the variance of the approximation can be decreased by increasing \(D\).
This is the key insight from the paper. Now that the kernel can be approximated as a dot product between random features, we can fit linear models in the random feature space instead, using the primal formulation.
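To make this concrete, here is a minimal sketch of the feature map \(z_w\) for the Gaussian kernel \(k(x-y) = \exp(-\Vert x-y \Vert_2^2/2)\), whose \(p(w)\) is \(\mathcal{N}(0, I_d)\) as noted earlier; the data sizes and \(D\) are arbitrary choices.

```python
import numpy as np

def random_fourier_features(X, W):
    # z_w(x) = sqrt(1/D) [cos(w_1^T x), ..., cos(w_D^T x), sin(w_1^T x), ..., sin(w_D^T x)]
    # (the ordering of the coordinates does not affect the dot product)
    D = W.shape[0]
    proj = X @ W.T                                    # shape (N, D)
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(D)

rng = np.random.default_rng(0)
d, D = 5, 2000
X = rng.normal(size=(100, d))

W = rng.normal(size=(D, d))                           # w_i ~ p(w) = N(0, I_d)
Z = random_fourier_features(X, W)
K_approx = Z @ Z.T                                    # eq. (11): z(x)^T z(y)

sq_dists = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
K_exact = np.exp(-0.5 * sq_dists)                     # exact Gaussian kernel
print(np.abs(K_approx - K_exact).max())               # shrinks as D grows
```

A linear model can then be fit on \(Z\) directly, at \(O(ND)\) cost per pass, instead of materializing the \(N \times N\) Gram matrix.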
Observe,
\[|z_{w_i} (x)| = \bigg| \frac{1}{\sqrt{D}}[\cos(w_i^T x), \sin(w_i^T x)]^T \bigg| \leq \frac{1}{\sqrt{D}} \tag{12}\]Hoeffding’s bound provides an upper bound on the probability of deviation of a sum of independent variables \(X_i\) from their mean.
\[P\bigg(\bigg|\sum_{i=1}^N X_i - \mathbb{E}\bigg[\sum_{i=1}^N X_i\bigg]\bigg| \geq Nt\bigg) \leq 2\exp\bigg(-\frac{2 N^2 t^2}{\sum_{i=1}^N (b_i - a_i)^2}\bigg) \tag{13}\]where \(a_i\) and \(b_i\) are the lower and upper bounds on \(X_i\).
Using the above identity,
\[\begin{eqnarray} P\bigg( \bigg| \sum_{i=1}^D {z_{w_i} (x)^T z_{w_i} (y)} - k(x,y)\bigg| \geq \epsilon\bigg) &\leq& 2\exp\bigg(-\frac{2\epsilon^2}{\sum_{i=1}^D (2/D)^2}\bigg) \\ &=& 2\exp\bigg(-\frac{D\epsilon^2}{2}\bigg) \tag{14} \end{eqnarray}\]Here each term satisfies \(|z_{w_i}(x)^T z_{w_i}(y)| \leq 1/D\), so \(b_i - a_i = 2/D\). The paper proves a slightly looser Hoeffding bound of \(2\exp(-\frac{D\epsilon^2}{4})\) by taking the approximation \(k(x-y) = \mathbb{E}_w [\cos(w^T x) \cos(w^T y)]\).
The paper goes on to prove a much stronger argument about the faithfulness of this approximation, essentially proving that it holds across all of the metric space. To understand the covering number argument and the proof, we will need to understand some mathematical vocabulary.
A metric space is a 2-tuple \((X, d_X)\), where \(X\) is a set and \(d_X\) is a metric defined on it.
In \((\mathbb{R}^d, l_2)\), the ball of radius \(r\) centred at \(x\) is
\[B_r(x) = \{y \in \mathbb{R}^d : ||y-x||_2 \leq r\}\]The covering number \(N_r\) is the minimum number of spherical balls of radius \(r\) required to completely cover a given space \(\mathcal{M} \subset \mathbb{R}^d\). These balls can be centred in a set \(C \subset \mathbb{R}^d\), where \(C\) may or may not be a subset of \(\mathcal{M}\).
\[N_r = \min \big\{ |C| : \mathcal{M} \subseteq \cup_{x \in C} B_r(x) \big\}\]If \(C \subset \mathcal{M}\), it is an internal covering; otherwise, it is an external covering.
An upper bound for the covering number of a metric space \(\mathcal{M}\) has been proved here, which is
\[N_r \leq \bigg(\frac{2\, diam(\mathcal{M})}{r} \bigg)^d \tag{15}\]Now let \(C \subset \mathcal{M}\). The covering radius \(r_c(\mathcal{M}, C)\) is the smallest radius \(r_c\) such that \(\mathcal{M}\) can be covered by spherical balls of radius \(r_c\) centred in \(C\).
\[r_c(\mathcal{M}, C) = \min \big\{ r : \cup_{x \in C} B_r(x) \supseteq \mathcal{M} \big\}\]The packing radius of a set of points \(X\) is the maximum radius of spherical balls centred at the points of \(X\) such that all the balls are disjoint; in other words, half of the smallest distance between any two points of \(X\).
\[r_p(X) = \max \big\{ r : B_r(x) \cap B_r(y) = \emptyset \;\; \forall\, x \neq y \in X \big\}\]An \(\epsilon\)-packing is a set \(X \subset \mathcal{M}\) which has packing radius \(\geq \epsilon/2\), and an \(\epsilon\)-covering is a set \(X \subset \mathcal{M}\) which has covering radius \(\leq \epsilon\). An \(\epsilon\)-net is both an \(\epsilon\)-packing and an \(\epsilon\)-covering.
This means, the \(\epsilon\)-net is a set \(X \subset \mathcal{M}\), s.t.
\[r_p(X) \geq \epsilon/2\]and,
\[r_c(\mathcal{M}, X) \leq \epsilon\]Therefore, as \(X \subset \mathcal{M}\),
\[r_p(X) = \epsilon/2\] \[r_c(\mathcal{M}, X) = \epsilon\]Markov’s inequality states that for a non-negative random variable \(X\),
\[P(X > \epsilon) \leq \frac{\mathbb{E}(X)}{\epsilon}\]The union bound states that
\[P(A \cup B) \leq P(A) + P(B)\]Generalizing,
\[P\big(\cup_{i=1}^N A_i\big) \leq \sum_{i=1}^N P(A_i)\]The Lipschitz constant \(L_f\) of a function \(f\) is an indicator of the smoothness of the function; smoother functions have lower values of \(L_f\).
For a function \(f : \mathbb{R}^d \rightarrow \mathbb{R}^k\)
\[L_f = \sup_{x, y} \frac{||f(x) - f(y)||_2}{||x - y||_2}\]If f is differentiable,
\[L_f = \sup_{x} |f'(x)|\]**Claim**: Let \(\mathcal{M}\) be a compact subset of \(\mathbb{R}^d\) with diameter \(diam(\mathcal{M})\). Then, for the mapping \(z\), we have
\[P[\sup_{x, y \in \mathcal{M}} |z(x)^T z(y) - k(x,y) | \geq \epsilon] \leq 2^8 \bigg(\frac{\sigma_p diam(\mathcal{M})}{\epsilon}\bigg)^2 \exp(-\frac{D\epsilon^2}{4(d+2)})\]where \(\sigma_p^2 = \mathbb{E}[w^T w]\) is the second moment of the Fourier transform of \(k\).
Firstly, let’s parse this claim.
On the LHS, we are computing the probability that the maximum approximation error over any two points \(x, y \in \mathcal{M}\) exceeds some \(\epsilon (>0)\). On the RHS, we are claiming an upper bound on this probability as a function of \(\epsilon\) and \(D\); \(diam(\mathcal{M})\) and \(\sigma_p\) are constants. Thus the probability that the approximation error is large (i.e. exceeds \(\epsilon\)) is extremely small, drops exponentially in \(D\) and \(\epsilon^2\), and the guarantee holds uniformly over the space \(\mathcal{M}\).
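One way to read the claim (a simple rearrangement, not something stated in this form above) is to ask how many random features suffice for a target failure probability \(\delta\). Setting the RHS equal to \(\delta\) and solving for \(D\) gives
\[D \geq \frac{4(d+2)}{\epsilon^2} \log \bigg( \frac{2^8}{\delta} \Big(\frac{\sigma_p\, diam(\mathcal{M})}{\epsilon}\Big)^2 \bigg)\]so \(D\) needs to grow only linearly with the input dimension \(d\) and logarithmically with \(diam(\mathcal{M})\) and \(1/\delta\).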
The proof is interesting and beautiful, to say the least. At an intuitive level, it works as follows: define the approximation error \(f(\Delta) = z(x)^T z(y) - k(\Delta)\) on the set of differences \(\Delta = x - y \in \mathcal{M}_\Delta\), cover \(\mathcal{M}_\Delta\) with a finite \(\epsilon\)-net of \(T\) balls with centres (anchors) \(\Delta_i\), control the error at the anchors with the Hoeffding bound, and control the behaviour between anchors through the Lipschitz constant of \(f\) via Markov's inequality.
Now, for \(\vert f(\Delta)\vert \leq \beta\) to hold throughout \(\mathcal{M}_\Delta\), it suffices that \(\vert f(\Delta_i) \vert \leq \beta/2\) at every anchor \(\Delta_i, i \in [1, T]\), and that the Lipschitz constant of \(f\) satisfies \(L_f \leq \frac{\beta}{2r}\), where \(r\) is the radius of the covering balls.
So,
\[\begin{eqnarray} P(\vert f(\Delta) \vert \leq \beta \text{ throughout } \mathcal{M}_\Delta) &\geq& P\Big(\vert f(\Delta_1) \vert \leq \beta/2 \wedge \dots \wedge \vert f(\Delta_T) \vert \leq \beta/2 \wedge L_f \leq \frac{\beta}{2r}\Big) \\ &=& 1 - P\Big(\vert f(\Delta_1) \vert \geq \beta/2 \vee \dots \vee \vert f(\Delta_T) \vert \geq \beta/2 \vee L_f \geq \frac{\beta}{2r}\Big) \\ &\geq& 1 - \sum_{i=1}^{T} P(\vert f(\Delta_i) \vert \geq \beta/2) - P\Big(L_f \geq \frac{\beta}{2r}\Big) \end{eqnarray} \tag{16}\]
A result from functional analysis gives a bound on the covering number: \(T \leq (2\, diam(\mathcal{M}_\Delta) /r )^d = (4\, diam(\mathcal{M})/r)^d\), since \(diam(\mathcal{M}_\Delta) = 2\, diam(\mathcal{M})\).
So, \(P(\vert f(\Delta) \vert \leq \beta) \geq 1 - 2 \bigg(\frac{4 diam(\mathcal{M})}{r} \bigg)^d \exp\bigg(-\frac{D\beta^2}{8}\bigg) - \bigg(\frac{2r \sigma_p}{\beta} \bigg)^2\)
Assuming \(\bigg( \frac{diam(\mathcal{M}) \sigma_p}{\beta} \bigg) > 1\), choosing \(r\) to balance the two correction terms (i.e. setting them equal) and simplifying the constants yields the bound stated in the claim, which completes the proof.
Let’s assume we have a dataset \(X = (x_1, x_2, ..., x_N)\), of N iid observations \(x_i\), and we are trying to estimate the distribution of \(X\) using some model parameterized by \(\theta\).
\[P(\theta \vert X) = \frac{P(X \vert \theta) P(\theta)}{P(X)}\]Here, \(P(\theta \vert X)\) is defined as the posterior probability of \(\theta\) given \(X\), \(P(X \vert \theta)\) is the likelihood and \(P(\theta)\) is the prior probability of \(\theta\). \(P(X)\) is often called the evidence.
Let’s assume that we decide to model the likelihood of our dataset \(X\) using a Gaussian distribution parameterized by mean \(\mu\) and fixed variance \(\sigma^2\), i.e.
\[\begin{eqnarray} \label{eq:1} P(X \vert \mu, \sigma^2) &=& \mathcal{N}(X \vert \mu, \sigma^2) \nonumber \\ &=& \prod_{i=1}^{N} \mathcal{N} (x_i \vert \mu, \sigma^2) \nonumber \\ &=&\frac{1}{( 2 \pi \sigma^2)^{N/2}} \exp\bigg(-\frac{1}{2 \sigma^2} \sum_{i=1}^N (x_i - \mu)^2\bigg) \nonumber \end{eqnarray} \tag{1}\]Now, if we assume that the parameter \(\mu\) is itself distributed normally, i.e.
\[\begin{eqnarray} \label{eq:2} P(\mu) &=& \mathcal{N}(\mu \vert \mu_0, \sigma_0^2) \nonumber \\ &=& \frac{1}{\sqrt{2 \pi \sigma_0^2}} \exp(-\frac{1}{2\sigma_0^2} (\mu - \mu_0)^2) \end{eqnarray} \tag{2}\]The posterior distribution \(P(\mu \vert X)\) will then be proportional to the product of the likelihood and the prior, both of which have quadratic terms in the exponential. Thus, the posterior will also be Gaussian.
\[\begin{eqnarray} P(\mu \vert X) &\propto& P(X \vert \mu, \sigma^2) P(\mu) \nonumber \\ &\propto& \mathcal{N}(X | \mu, \sigma^2) \mathcal{N}(\mu | \mu_0, \sigma_0^2) \nonumber \\ &=& \mathcal{N} (\mu | \mu_N, \sigma_N^2) \\ &=& \frac{1}{\sqrt{2 \pi \sigma_N^2}} \exp(-\frac{1}{2\sigma_N^2} (\mu - \mu_N)^2) \end{eqnarray} \tag{3}\]We can solve for \(\mu_N\) and \(\sigma_N\) by “completing the square”, or in simpler terms, equating the coefficients for 1st order terms and 2nd order terms in \(\mu\).
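For reference, carrying out this comparison of coefficients gives the standard conjugate-update result (quoted here rather than re-derived, with \(\bar{x} = \frac{1}{N}\sum_{i=1}^N x_i\) the sample mean):
\[\mu_N = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\,\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\,\bar{x}, \qquad \frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}\]The posterior mean interpolates between the prior mean and the sample mean, and the posterior precision is the sum of the prior precision and \(N\) times the noise precision.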
Here, the Gaussian prior on \(\mu\) is conjugate to the likelihood, i.e. the product of the likelihood and this prior results in a posterior distribution which takes the same form as the prior. Such a prior is known as a conjugate prior.
As the posterior takes on the same form as the prior, this lends very naturally to the framework of online learning or sequential estimation. With every new datum, the updated posterior can serve as the new prior for the next datum. We will use this very insight in our implementation of Bayesian Linear Regression.
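As a small illustration of this sequential view, here is a minimal sketch (plain NumPy, made-up numbers) that folds in one observation at a time, using the single-observation form of the conjugate update; after all the data it matches the batch posterior above.

```python
import numpy as np

# Sequential estimation of the mean of a Gaussian with known variance.
# The posterior after each observation becomes the prior for the next one.
rng = np.random.default_rng(0)
sigma2 = 1.0                       # known likelihood variance
mu_n, sigma2_n = 0.0, 10.0         # prior mean and variance (arbitrary choices)

data = rng.normal(loc=2.5, scale=np.sqrt(sigma2), size=100)
for x in data:
    sigma2_new = 1.0 / (1.0 / sigma2_n + 1.0 / sigma2)   # combine precisions
    mu_n = sigma2_new * (mu_n / sigma2_n + x / sigma2)   # precision-weighted mean
    sigma2_n = sigma2_new

print(mu_n, sigma2_n)              # mean approaches 2.5, variance keeps shrinking
```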
We will briefly digress and discuss Linear Gaussian Models, which will help us derive expressions for Bayesian Linear Regression.
Let
\[p(x) = \mathcal{N} (x | \mu, \Lambda^{-1}) \tag{9}\] \[p(y \vert x) = \mathcal{N} (y | Ax + b, L^{-1}) \tag{10}\]Let’s define the joint distribution of \(x\) and \(y\). Define \(z = \begin{bmatrix} x \\ y \end{bmatrix}\)
\[\begin{eqnarray} \ln p(z) &=& \ln p(x) + \ln p(y \vert x) \\ &=& -\frac{1}{2}(x - \mu)^T \Lambda (x-\mu) -\frac{1}{2}(y - Ax - b)^T L (y - Ax - b) + const \end{eqnarray}\]2nd Order Terms
For the precision of \(z\), we collect the 2nd order terms in \(x\) and \(y\).
\[= -\frac{1}{2}x^T (\Lambda + A^TLA)x - \frac{1}{2}y^TLy + \frac{1}{2}y^TLAx + \frac{1}{2}x^TA^TLy = -\frac{1}{2} \begin{bmatrix} x \\ y \end{bmatrix}^T \begin{bmatrix} \Lambda + A^TLA & -A^TL \\ -LA & L \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}\]\(= -\frac{1}{2}z^TRz\), where \(R = \begin{bmatrix} \Lambda + A^TLA & -A^TL \\ -LA & L \end{bmatrix}\)
\[\Sigma_z = R^{-1} = \begin{bmatrix} \Lambda^{-1} & \Lambda^{-1}A^T \\ A\Lambda^{-1} & L^{-1} + A\Lambda^{-1}A^T\end{bmatrix} \tag{11}\]1st Order Terms
Identify the linear terms in \(x\) and \(y\) to find an expression for the mean of the joint distribution.
\[= [ x^T\Lambda\mu - x^TA^TLb + y^TLb ] = \begin{bmatrix} x \\ y \end{bmatrix}^T \begin{bmatrix} \Lambda \mu - A^TLb \\ Lb \end{bmatrix}\]Equating with 1st order term in joint distribution in \(z\)
\[z^T R \mu_z = z^T \begin{bmatrix} \Lambda \mu - A^TLb \\ Lb \end{bmatrix}\]So,
\[\mu_z = R^{-1} \begin{bmatrix} \Lambda \mu - A^TLb \\ Lb \end{bmatrix} = \begin{bmatrix} \mu \\ A\mu + b \end{bmatrix}\tag{12}\]Now that we have the joint Gaussian distribution of \(x\) and \(y\) specified by \(\mu_z\) and \(\Sigma_z\), we can easily find the conditional distribution of \(x \vert y\).
From the expression for the joint distribution of \(x\) and \(y\), we can derive the parameters governing the conditional distribution of \(x \vert y\), i.e.
\[\mu_{x \vert y} = (\Lambda + A^TLA)^{-1} \{A^TL (y-b) + \Lambda \mu \} \tag{13}\] \[\Sigma_{x \vert y} = (\Lambda + A^TLA)^{-1} \tag {14}\]The parameters \(\mu_y\) and \(\Sigma_y\) governing the marginal distribution of y are given by
\[\mu_y = A\mu + b \tag{15}\] \[\Sigma_y = (R^{-1})_{yy} = L^{-1} + A\Lambda^{-1}A^T \tag{16}\]In the Bayesian setting of Linear Regression, we set a prior \(p(w)\) over the weights \(w\) and the objective is to learn the posterior distribution of \(w\) given the likelihood \(p(y \vert w, x)\).
The likelihood is defined as \(p(y \vert w, x) = \mathcal{N}(y \vert w^Tx + b, \beta^{-1} )\). Here \(\beta^{-1}\) is the variance of irreducible noise in our model, which we are assuming to be fixed and known for the sake of simplicity. We choose a Gaussian prior on \(w\) which is conjugate to the likelihood. As we have seen before, the posterior too will assume the same form as the prior, which is Gaussian. This allows for sequential learning.
\[p(w) = \mathcal{N}(w \vert \mu_0, \Sigma_0) \tag{17}\] \[p(y \vert w, x) = \mathcal{N}(y \vert w^Tx + b, \beta^{-1}) \tag{18}\]From equations 13 and 14,
\[m_N = \mu_{w|y} = S_N (\beta x (y-b) + \Sigma_0^{-1}\mu_0) \tag{19}\] \[S_N = \Sigma_{w|y} = (\Sigma_0^{-1} + \beta x x^T)^{-1} \tag{20}\]While equations 19 and 20 give us an expression to update the posterior distribution of \(w\), they don’t help us make any predictions on new points.
The predictive distribution \(p(\hat{y} \vert x, X, Y, \beta, \mu_0, \Sigma_0)\) of a new point given all the training points is defined by
\[\begin{eqnarray} p(\hat{y} \vert X, Y, \beta, \mu_0, \Sigma_0) &=& \int p(\hat{y} \vert w, x, \beta) p(w \vert X, Y, \mu_0, \Sigma_0) dw \\ &=& \int \mathcal{N}(\hat{y} \vert w^Tx, \beta^{-1}) \mathcal{N}(w \vert m_N, S_N) dw \\ &=& \int \mathcal{N}(\hat{y} - w^Tx \vert 0, \beta^{-1}) \mathcal{N}(w \vert m_N, S_N) dw \\ &=& \mathcal{N}(\hat{y} \vert m_N^Tx, \beta^{-1} + x^TS_Nx) \end{eqnarray} \tag{21}\]The above expression can be derived by simply substituting into identities 15 and 16.
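Putting equations 19–21 together, here is a minimal sketch of sequential Bayesian linear regression (assuming the offset \(b = 0\) and a known noise precision \(\beta\); the class and variable names are mine, and the full version is in the notebook linked below).

```python
import numpy as np

class BayesianLinearRegression:
    """Sequential Bayesian linear regression with a Gaussian prior on w (offset b = 0)."""

    def __init__(self, d, beta, mu0=None, Sigma0=None):
        self.beta = beta                                     # known noise precision
        self.mean = np.zeros(d) if mu0 is None else mu0      # posterior mean m_N
        self.cov = np.eye(d) if Sigma0 is None else Sigma0   # posterior covariance S_N

    def update(self, x, y):
        # eq. (20): S_N = (S_old^{-1} + beta * x x^T)^{-1}; the old posterior is the new prior
        prec_old = np.linalg.inv(self.cov)
        prec_new = prec_old + self.beta * np.outer(x, x)
        cov_new = np.linalg.inv(prec_new)
        # eq. (19): m_N = S_N (beta * x * y + S_old^{-1} m_old)
        self.mean = cov_new @ (self.beta * x * y + prec_old @ self.mean)
        self.cov = cov_new

    def predict(self, x):
        # eq. (21): predictive mean m_N^T x, predictive variance 1/beta + x^T S_N x
        return self.mean @ x, 1.0 / self.beta + x @ self.cov @ x

# Toy usage with made-up data
rng = np.random.default_rng(0)
w_true = np.array([1.5, -2.0])
model = BayesianLinearRegression(d=2, beta=25.0)             # noise std = 0.2
for _ in range(200):
    x = rng.normal(size=2)
    y = w_true @ x + rng.normal(scale=0.2)
    model.update(x, y)
print(model.mean)                                            # approaches w_true
```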
The complete Jupyter notebook is here.
Let \(f(x) = - \left\Vert x \right\Vert_2^2\), which is non-convex.
So \(\nabla f(x) = -2x\)
\[f(y) - f(x) - \langle \nabla f(x) , y - x \rangle = \left\Vert x \right\Vert_2^2 - \left\Vert y \right\Vert_2^2 - \langle \nabla f(x) , y - x \rangle\] \[= \left\Vert x \right\Vert_2^2 - \left\Vert y \right\Vert_2^2 - 2\left\Vert x \right\Vert_2^2 + 2\langle x , y \rangle\] \[= - \left\Vert x-y \right\Vert_2^2 \leq 0 \leq \frac{1}{2} \left\Vert x-y \right\Vert_2^2\]By the mean value theorem, there exists a point \(c\) on the segment between \(x\) and \(y\) such that \(f(y) - f(x) = \langle \nabla f(c), y - x \rangle\).
So, by Cauchy-Schwarz, and assuming the gradient is bounded as \(\left\Vert \nabla f \right\Vert_2^2 \leq G\) everywhere, \(\vert f(y) - f(x) \vert \leq \left\Vert \nabla f(c) \right\Vert_2 \left\Vert y-x \right\Vert_2 \leq \sqrt{G} \left\Vert y-x \right\Vert_2\)
The Lipschitz constant is therefore \(\sqrt{G}\).
Geometrically, the closest point in the set \(B_2(r)\) to a point outside it will be on the surface of the set and in the same direction. So, direction of the unit vector is given by \(\hat{e} = \frac{z}{\vert z \rvert}\), and magnitude is \(r\)
So, the projection is \(\Pi_{B_2(r)}(z) = r\hat{e} = r\frac{z}{\lvert z \rvert}\)
Assume there is a point \(\hat{z} \neq r\frac{z}{\lvert z \rvert}\) that is the projection of \(z\) onto the L2-ball.
From the Projection Lemma, which states: for any set (convex or not) \(C \subset \mathbb{R}^p\) and \(z \in \mathbb{R}^p\), let \(\hat{z} = \Pi_C(z)\); then for all \(x \in C\), \(\Vert\hat {z} − z\Vert_2 \leq \Vert x − z\Vert_2\).
So, \(\Vert\hat{z} - z\Vert_2 \leq \Vert r\frac{z}{\lvert z \rvert} - z\Vert_2 = \Vert z\Vert_2 \left\lvert \frac{r}{\lvert z \rvert} - 1\right\rvert = \lvert z \rvert - r\) (using \(r \lt \lvert z \rvert\), since \(z\) is outside the ball).
But by the triangle inequality, every point \(x\) in the ball satisfies \(\Vert x - z \Vert_2 \geq \lvert z \rvert - \Vert x \Vert_2 \geq \lvert z \rvert - r\), with equality only when \(x = r\frac{z}{\lvert z \rvert}\). So for \(\hat{z} \neq r\frac{z}{\lvert z \rvert}\), we would have \(\Vert\hat{z} - z\Vert_2 \gt \lvert z \rvert - r\). This is a contradiction.
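A minimal sketch of this projection in code (NumPy; the test point is arbitrary):

```python
import numpy as np

def project_l2_ball(z, r):
    """Euclidean projection of z onto B_2(r) = {x : ||x||_2 <= r}."""
    norm = np.linalg.norm(z)
    if norm <= r:
        return z                  # already inside the ball
    return r * z / norm           # rescale to the surface, keeping the direction

z = np.array([3.0, 4.0])          # ||z||_2 = 5
print(project_l2_ball(z, 1.0))    # [0.6, 0.8]
```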
Define the potential function \(\Phi_t = f(x^t) - f(x^\star)\), where \(x^\star\) is the minimizer. From convexity of \(f\), we can upper-bound \(\Phi_t\).
\begin{align} \Phi_t &= f(x^t) - f(x^\star) \\ &\leq \langle \nabla f(x^t), x^t - x^\star \rangle \\ &\leq \frac{1}{2\eta_t} \left( \Vert x^t - x^\star \Vert_2^2 + \eta_t^2G^2 - \Vert z^{t+1} - x^\star \Vert_2^2 \right) \end{align}
where \(z^{t+1} = x^t - \eta_t \nabla f(x^t)\) is the gradient update at \(x^t\) before projection, and \(x^{t+1}\) is its projection onto the convex constraint set.
We also know from the Projection Lemma that \(\Vert z^{t+1} - x^\star \Vert_2^2 \geq \Vert x^{t+1} - x^\star \Vert_2^2\)
So, \(\Phi_t \leq \frac{1}{2\eta_t} \left(\Vert x^{t} - x^\star \Vert_2^2 - \Vert x^{t+1} - x^\star \Vert_2^2 \right) + \frac{\eta_t G^2}{2}\)
Refer to pg. 21 of this book.
The above form doesn’t allow for an easy telescoping sum, so we will have to make some additional assumptions. One such assumption is on the diameter of the convex constraint set: \(diam(X) \leq D\).
\begin{align} \frac{1}{T}\sum_{t=1}^T \Phi_t &\leq \frac{1}{2T}\left(\Vert x^\star \Vert_2^2 + \sum_{t=2}^{T} \Vert x^t - x^\star \Vert_2^2\left(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}}\right) + \sum_{t=1}^T \eta_t G^2 \right) \\ &\leq \frac{1}{2T}\left(\Vert x^\star \Vert_2^2 + D^2\sum_{t=2}^{T} \left(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}}\right) + \sum_{t=1}^T \eta_t G^2 \right) \\ &= \frac{1}{2T}\left(\Vert x^\star \Vert_2^2 + D^2\left(\frac{1}{\eta_T} - \frac{1}{\eta_{1}}\right) + \sum_{t=1}^T \eta_t G^2 \right) \\ &\leq \frac{1}{2T}\left(\Vert x^\star \Vert_2^2 + D^2(\sqrt{T} - 1) + (2\sqrt{T} - 1) G^2 \right) \\ &\leq \frac{1}{2\sqrt{T}}\left(\frac{\Vert x^\star \Vert_2^2}{\sqrt{T}} + D^2 + 2G^2 \right) \end{align}
(using \(x^1 = 0\), \(\eta_t = \frac{1}{\sqrt{t}}\) so that \(\eta_1 = 1\) and \(\eta_T = \frac{1}{\sqrt{T}}\), \(\Vert x^t - x^\star \Vert_2 \leq D\), and \(\sum_{t=1}^{T}\frac{1}{\sqrt{t}} \leq 2\sqrt{T} - 1\))
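Here is a minimal sketch of projected gradient descent with the \(\eta_t = 1/\sqrt{t}\) schedule analysed above, using the L2-ball projection from earlier as the constraint set; the objective and radius are toy choices.

```python
import numpy as np

def project_l2_ball(z, r):
    norm = np.linalg.norm(z)
    return z if norm <= r else r * z / norm

def projected_gradient_descent(grad, x0, r, T):
    x, avg = x0, np.zeros_like(x0)
    for t in range(1, T + 1):
        eta = 1.0 / np.sqrt(t)            # eta_t = 1 / sqrt(t)
        z = x - eta * grad(x)             # unprojected update z^{t+1}
        x = project_l2_ball(z, r)         # x^{t+1} = projection of z^{t+1}
        avg += x
    return avg / T                        # averaged iterate, as in the bound above

# Toy problem: minimize f(x) = ||x - c||^2 over B_2(1), with c outside the ball
c = np.array([2.0, 0.0])
grad = lambda x: 2 * (x - c)
print(projected_gradient_descent(grad, x0=np.zeros(2), r=1.0, T=1000))   # ~ [1, 0]
```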
Proof by Contradiction: let \(x^\star \neq y^\star\) be two points that both minimize \(f\), i.e. \(f(x^\star) = f(y^\star) = f^\star\). From strong convexity, for any \(\theta \in (0,1)\), \(f(\theta x^\star + (1-\theta)y^\star) < \theta f(x^\star) + (1-\theta)f(y^\star) = f^\star\), which contradicts \(f^\star\) being the minimum value.
Take \(x_1\) s.t. \(\Vert x_1 \Vert_0 = s\), with the first \(s\) elements non-zero, and take \(x_2\) s.t. \(\Vert x_2 \Vert_0 = 1\), with only the last element non-zero.
A convex combination of \(x_1\) and \(x_2\) (with both weights non-zero) has \(s+1\) non-zero elements and so is not in \(B_0(s) \subset \mathbb{R}^p\) if \(s < p\). Thus \(B_0(s)\) is non-convex. \(B_0(p)\), however, is convex.
The sum of two singular matrices can be non-singular (e.g. \(\mathrm{diag}(1,0) + \mathrm{diag}(0,1) = I\)), and halving it does not change the rank, so \(B_{rank(r)} \subset \mathbb{R}^{n \times n}\) is non-convex for \(r < n\) and convex for \(r = n\).
Let the objective be
\begin{align} f(w)
&= \frac{1}{2} (Xw - y)^T (Xw-y) \\ &= \frac{1}{2} (X(w-w^\star) + Xw^\star - y)^T (X(w-w^\star) + Xw^\star - y) \end{align}
Let \(Xw^\star - y = P\) and \(X(w-w^\star) = Q\), where \(w^\star = (X^TX)^{-1}X^Ty\) is the minimizer.
So,
\begin{align} f(w) &= \frac{1}{2}(Q + P)^T (Q + P) \\ &= \frac{1}{2}(Q^TQ + Q^TP + P^TQ) + \frac{1}{2}P^TP \\ &= \frac{1}{2}Q^TQ + \frac{1}{2}P^TP \\ &= \frac{1}{2}(w-w^\star)^TX^TX(w-w^\star) + f(w^\star) \end{align}
The cross terms vanish because \(X^TP = X^T(Xw^\star - y) = 0\) by the normal equations.
So the Hessian of the objective is \(H = \nabla^2 f(w) = X^TX\). The condition number \(\kappa\) of the Hessian \(H\) is \(\kappa = \frac{\lambda_1}{\lambda_n}\), where \(\lambda_1\) and \(\lambda_n\) are the largest and smallest eigenvalues.
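As a quick illustration with random toy data, the condition number can be read off the eigenvalues of \(X^TX\):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
H = X.T @ X                       # Hessian of the least-squares objective

eigvals = np.linalg.eigvalsh(H)   # eigenvalues of the symmetric Hessian, ascending order
kappa = eigvals[-1] / eigvals[0]  # condition number = lambda_1 / lambda_n
print(kappa)
```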
From strong convexity and smoothness,
\[\frac{\alpha}{2} \Vert w - w^\star \Vert_2^2 \leq f(w) - f(w^\star) \leq \frac{\beta}{2} \Vert w - w^\star \Vert_2^2\]So, \(\frac{f(w) - f(w^\star)}{\frac{\alpha}{2} \Vert w - w^\star \Vert_2^2} \in [1, \frac{\beta}{\alpha}]\)
Using \(f(w) - f(w^\star) = \frac{1}{2}(w - w^\star)^T H (w - w^\star)\) from the decomposition above and writing \(v = w - w^\star\), this says \(\frac{v^THv}{\alpha v^Tv} \in [1, \frac{\beta}{\alpha}]\), for any vector \(v\).
If \(v = v_n\), where \(v_n\) is the eigenvector of the Hessian corresponding to \(\lambda_n\), then,
\begin{align} \frac{v_n^THv_n}{\alpha v_n^Tv_n} &= \frac{v_n^T \lambda_n v_n}{\alpha v_n^T v_n} \\ &= \frac{\lambda_n}{\alpha} = 1 \end{align}
So \(\lambda_n = \alpha\). And, similarly, \(\lambda_1 = \beta\)