Locally Linear Embeddings

ML
tutorial
Author

Nikhil R.

Published

February 6, 2025

Abstract

Locally Linear Embedding (LLE) is a classical unsupervised manifold learning technique that uncovers low-dimensional representations of high-dimensional data, computed by minimizing a sum-of-squares cost function, first on local reconstructions and then on the target embeddings. This article provides a terse introduction to the method.

Keywords

LLE, manifold learning, dimensionality reduction, eigenvalue problem

Introduction

Locally Linear Embedding (LLE), an instance of classical unsupervised manifold learning, aims to uncover low-dimensional representations of high-dimensional data while preserving local geometric relationships, through the intuitive principle that a point and its neighbours, sampled from a smooth underlying manifold, lie on or around a locally linear patch. A global embedding is computed through minimization of a sum-of-squares cost function, first on local reconstructions of each point, and then on the target embeddings. Compared with traditional dimensionality reduction techniques such as Principal Component Analysis (PCA), LLE is more effective at capturing complex, nonlinear structures, making it well suited to applications in image processing and pattern recognition.

Notation

In this article, for any matrix, \(\|\cdot\|_{F}^{2}\) denotes the squared Frobenius norm, \(\|\cdot\|_{2}^{2}\) the squared L2-norm, \({M}_{p}(\mathbb{R})\) the set of all square matrices of dimension \(p \times p\) over the reals, and \({M}_{m \times n}(\mathbb{R})\) the set of all matrices over the reals of dimensions \(m \times n\). The term \(\partial_{x}f\) denotes the partial derivative of \(f\) with respect to \(x\), and \(\nabla f\) the gradient. If \(\mathbf{A} = \partial_{\mathbf{X}}f(\mathbf{X})\), where \(f\) is a scalar function of a matrix \(\mathbf{X}\), then \(A_{ij} = \partial_{X_{ij}} f(\mathbf{X})\).

Construction

Suppose \(\mathbf{A} \in M_{n}(\mathbb{R})\) is symmetric, and \(\mathbf{x} \in \mathbb{R}^{n}\), then the function \(f = \mathbf{x}^{T}\mathbf{A}\mathbf{x}\) has gradient \(2\mathbf{A}\mathbf{x}\).

Proof. Let \(\mathbf{A} = [\mathbf{a}_1 .. \mathbf{a}_n]\), and \(\mathbf{v} = \mathbf{A}\mathbf{x}\). Then \(\partial_{x_i}f = \hat{e}_i^T(\mathbf{A}\mathbf{x}) + \mathbf{x}^T\partial_{x_i}(\mathbf{A}\mathbf{x})\) by the product rule for bilinear operators. But \(\hat{e}_i^T \mathbf{v} = v_i\), and \(\mathbf{x}^T\partial_{x_i}(\mathbf{A}\mathbf{x}) = \mathbf{x}^T \mathbf{a}_i = (\mathbf{A}^T\mathbf{x})_i = v_i\) by the symmetry of \(\mathbf{A}\). So \(\partial_{x_i} f = 2 v_i\), that is, \(\nabla f = 2\mathbf{A}\mathbf{x}\). ◻

Remark. The matrix identity \(\partial_{\mathbf{X}} tr (\mathbf{X^T A X}) = 2 \mathbf{A X}\) also follows.
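A quick numerical check of this gradient identity, as a minimal NumPy sketch; the matrix, test point, and finite-difference step size are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
A = (A + A.T) / 2                      # symmetrize A
x = rng.standard_normal(n)

f = lambda z: z @ A @ z                # f(z) = z^T A z

# Central finite differences vs. the closed form 2 A x
h = 1e-6
numeric = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(n)])
print(np.allclose(numeric, 2 * A @ x, atol=1e-4))   # True
```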

A vector \(\mathbf{x}\) is said to be reconstructed by ‘neighbours’ \(\mathbf{v}_1, \mathbf{v}_2, ..., \mathbf{v}_k\) with weights \(\mathbf{w'} \in \mathbb{R}^{k}\) when \(\|\mathbf{x} - [\mathbf{v}_1 .. \mathbf{v}_k]\mathbf{w'}\|_{2}^{2}\) is minimized subject to \(\sum_i{w'_{i}} = 1\). This is denoted \(\mathbf{w'}= \psi(\mathbf{x}; \mathbf{v}_1, ..., \mathbf{v}_k)\).

Consider a vector \(\mathbf{x}\) and neighbours \(\mathbf{v}_1, ..., \mathbf{v}_k \in \mathbb{R}^d\). Let \(\mathbf{G}\) be the Gram matrix of \((\mathbf{x}\mathbf{1}^{T} - [\mathbf{v}_1 .. \mathbf{v}_k])\). Then: \[\psi(\mathbf{x}; \mathbf{v}_1, ..., \mathbf{v}_k) = \frac{\mathbf{G}^{-1}\mathbf{1}}{\mathbf{1}^{T}\mathbf{G}^{-1}\mathbf{1}}\] when \(\mathbf{G}\) is invertible.

Proof. Consider the variable vector \(\mathbf{w'}\) and the objective \(\varepsilon(\mathbf{w'}) = \|\mathbf{x} - [\mathbf{v}_1 .. \mathbf{v}_k]\mathbf{w'}\|_{2}^{2} = \|\mathbf{x} -\mathbf{X}\mathbf{w'}\|_{2}^{2}\), where \(\mathbf{X} = [\mathbf{v}_1 .. \mathbf{v}_k]\). Re-writing \(\varepsilon(\mathbf{w'})\) using \(\mathbf{1}^T\mathbf{w'} = 1\) (so that \(\mathbf{x} = \mathbf{x}\mathbf{1}^T\mathbf{w'}\)) and applying \(\|\mathbf{a}\|_{2}^{2} = \mathbf{a}^T\mathbf{a}\) gives \(((\mathbf{x}\mathbf{1}^T - \mathbf{X})\mathbf{w'})^{T}((\mathbf{x}\mathbf{1}^T - \mathbf{X})\mathbf{w'})\), so \(\varepsilon(\mathbf{w'}) = \mathbf{w'}^T\mathbf{G}\mathbf{w'}\). The constrained extremum, which must be a minimum since \(\mathbf{G}\) is positive semi-definite, occurs when \(\nabla \varepsilon = \lambda \mathbf{1}\), where \(\lambda\) is the Lagrange multiplier. Since \(\mathbf{G}\) is invertible, \(2 \mathbf{G} \mathbf{w'} = \lambda \mathbf{1} \implies \mathbf{w'} = \frac{\lambda}{2}\mathbf{G}^{-1}\mathbf{1}\). The constraint \(\mathbf{1}^T\mathbf{w'} = \frac{\lambda}{2}\mathbf{1}^T\mathbf{G}^{-1}\mathbf{1} = 1\) gives \(\lambda = \frac{2}{\mathbf{1}^T\mathbf{G}^{-1}\mathbf{1}}\). The result follows by substitution. ◻

Remark. The invertibility of \(\mathbf{G}\) is an important pre-condition that is difficult to guarantee in practice, since it requires the displacements to the neighbours to be linearly independent. Furthermore, \(\mathbf{G}\) has rank at most \(\min\{d, k\}\), so invertibility requires \(d \geq k\) (as is often the case). This problem is side-stepped by introducing a regularization parameter \(\epsilon \in \mathbb{R}^{+}\) to obtain \(\mathbf{G'} = \mathbf{G} + \epsilon \mathbf{I}\), which is invertible since \(\mathbf{G}\) is positive semi-definite and hence \(\mathbf{G'}\) is positive definite.
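A minimal sketch of \(\psi\) in NumPy under the above definitions, including the regularization just described. The function name `reconstruction_weights` and the default value of \(\epsilon\) are illustrative choices, not prescribed by the text.

```python
import numpy as np

def reconstruction_weights(x, V, eps=1e-3):
    """w' = argmin ||x - V w'||^2 subject to 1^T w' = 1.

    x   : (d,) point to reconstruct.
    V   : (d, k) matrix with the neighbours v_1, ..., v_k as columns.
    eps : regularization added to the Gram matrix (illustrative default).
    """
    k = V.shape[1]
    D = x[:, None] - V                  # displacements, i.e. x 1^T - [v_1 .. v_k]
    G = D.T @ D + eps * np.eye(k)       # regularized Gram matrix G' = G + eps I
    w = np.linalg.solve(G, np.ones(k))  # G'^{-1} 1
    return w / w.sum()                  # normalize so that 1^T w' = 1
```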

The k-NN Graph for a set of points \(\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_n \in \mathbb{R}^d\) is generated from the k nearest neighbours in the following manner (a code sketch follows the list):

  1. Let vertices of the graph be \(v_1, v_2, ... , v_n\).

  2. Let \(\mathbf{N}\) be the array of indices such that \(\forall i \in \mathbb{N}_{\leq n}\) and \(j \in \mathbb{N}_{\leq k}\), \(N_{ij}\) is the index of the vector that is the j-th nearest neighbour of \(\mathbf{x}_i\).

  3. For each \(\mathbf{x}_i\), compute \(\mathbf{w'} = \psi(\mathbf{x}_i;\mathbf{x}_{N_{i1}}, .., \mathbf{x}_{N_{ik}})\), then add an edge from \(v_i\) to \(v_{N_{ij}}\) with weight \(w'_{j}\) for \(j = 1, ..., k\).
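Putting the three steps together, a sketch of the weight-matrix construction, reusing the hypothetical `reconstruction_weights` helper from above; the brute-force distance computation is chosen for clarity, not efficiency.

```python
import numpy as np

def knn_weight_matrix(X, k, eps=1e-3):
    """W[i, j] is the weight of the edge from v_i to v_j, or 0 if x_j is
    not among the k nearest neighbours of x_i.  X is (n, d), points as rows."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # pairwise squared distances
    np.fill_diagonal(d2, np.inf)            # a point is not its own neighbour
    N = np.argsort(d2, axis=1)[:, :k]       # N[i, j]: index of the j-th nearest neighbour of x_i
    W = np.zeros((n, n))
    for i in range(n):
        W[i, N[i]] = reconstruction_weights(X[i], X[N[i]].T, eps)
    return W
```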

The points \(\mathbf{y}_1, \mathbf{y}_2, ..., \mathbf{y}_n \in \mathbb{R}^{p}\) are embedding points for a k-NN Graph with adjacency matrix \(\mathbf{W}\) when \(\frac{1}{n}{\mathbf{Y} \mathbf{Y}^T} = \mathbf{I}\) and \(\|\mathbf{Y}(\mathbf{I} - \mathbf{W})^T\|_{F}^{2}\) is minimized, where \(\mathbf{Y} = [\mathbf{y}_1, \mathbf{y}_2, ..., \mathbf{y}_n]\).

The columns of \(\mathbf{Y}^{T} \in M_{n \times p }(\mathbb{R})\) are the eigenvectors corresponding to the \(p\) smallest non-zero eigenvalues of \(\mathbf{M} = (\mathbf{I} - \mathbf{W})^{T}(\mathbf{I} - \mathbf{W})\).

Proof. Since \(\|\mathbf{A}\|_{F}^{2} = tr(\mathbf{A}\mathbf{A}^T)\), we have the objective \(\varepsilon(\mathbf{Y}) = tr(\mathbf{Y}\mathbf{M}\mathbf{Y}^T)\). The Lagrange function, expressing both the objective and the constraints, is \[\mathcal{L} = tr(\mathbf{Y}\mathbf{M}\mathbf{Y}^T) - tr(\mathbf{ Y Y}^T \mathbf{\Lambda}) + tr(\mathbf{\Lambda})\] where \(\mathbf{\Lambda} \in M_p(\mathbb{R})\) is an upper-triangular matrix of coefficients corresponding to the unit covariance constraint on the embeddings. Choosing \(\mathbf{\Lambda}\) to be symmetric, the present form allows differentiation with respect to \(\mathbf{Y}^T\), yielding the target eigenvalue problem: \[\frac{\partial \mathcal{L}}{\partial {\mathbf{Y}^T}} = 2\mathbf{MY}^T - 2\mathbf{Y}^T \mathbf{\Lambda} = 0 \implies \mathbf{M} = \mathbf{Y_{*}^T \Lambda_{*} (Y^T_{*})^{-1}}\] where \(\mathbf{Y} \in M_{p \times n}(\mathbb{R})\) is a submatrix of \(\mathbf{Y}_{*} \in M_{n}(\mathbb{R})\) and \(\mathbf{\Lambda}_{*} \in M_{n}(\mathbb{R})\) is the corresponding extension of \(\mathbf{\Lambda}\); each column of \(\mathbf{Y}^T\) is therefore an eigenvector of \(\mathbf{M}\). Minimization of the objective requires the columns of \(\mathbf{Y}^T\) to be those with the smallest eigenvalues. ◻

Remark. The matrix \(\mathbf{M}\) is the Gram matrix of the Laplacian \((\mathbf{I - W})\) of the k-NN graph, and shares its null space. Eigenvectors in the null space occur as a consequence of the k-NN graph comprising multiple connected components, and do not capture characteristics of the underlying manifold. Furthermore, since the all-ones vector \(\mathbf{1}\) is always in the null space (the rows of \(\mathbf{W}\) sum to 1), it is discarded, and the retained eigenvectors are orthogonal to it; the resulting embeddings therefore have a mean of 0.
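A sketch of the embedding step under these results: the bottom eigenvector (eigenvalue zero, the constant vector) is discarded per the remark, and the unit-covariance constraint \(\frac{1}{n}\mathbf{Y}\mathbf{Y}^T = \mathbf{I}\) is met by scaling the unit-norm eigenvectors by \(\sqrt{n}\). The sketch assumes a single connected component, so that only one zero eigenvalue is present.

```python
import numpy as np

def lle_embed(W, p):
    """Return Y^T as an (n, p) matrix of embedding points (rows).

    Columns are the eigenvectors of M = (I - W)^T (I - W) with the p
    smallest non-zero eigenvalues, scaled so that (1/n) Y Y^T = I.
    """
    n = W.shape[0]
    L = np.eye(n) - W
    M = L.T @ L
    vals, vecs = np.linalg.eigh(M)          # eigenvalues in ascending order
    Yt = vecs[:, 1:p + 1]                   # drop the constant (zero-eigenvalue) eigenvector
    return np.sqrt(n) * Yt
```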

The embedding of a vector \(\mathbf{x}_{*}\) not encountered in the sample \(\mathbf{x}_1, ...,\mathbf{x}_n\) with embeddings \(\mathbf{y}_{i}\) is \([\mathbf{y}_{n_1}, \mathbf{y}_{n_2}, .., \mathbf{y}_{n_k}]\mathbf{w'}\), where \(\mathbf{w'} = \psi(\mathbf{x}_{*};\mathbf{x}_{n_1}, .., \mathbf{x}_{n_k})\) and \(\mathbf{x}_{n_1}, .., \mathbf{x}_{n_k}\) are the k nearest neighbours of \(\mathbf{x}_{*}\).
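A sketch of this out-of-sample step, again reusing the hypothetical `reconstruction_weights` helper; here `X` holds the training points as rows and `Yt` their embeddings as rows (the matrix \(\mathbf{Y}^T\) above).

```python
import numpy as np

def embed_new_point(x_new, X, Yt, k, eps=1e-3):
    """Embed an unseen point by reconstructing it from its k nearest
    training neighbours and applying the same weights to their embeddings."""
    d2 = ((X - x_new) ** 2).sum(axis=1)
    nbrs = np.argsort(d2)[:k]                            # indices n_1, ..., n_k
    w = reconstruction_weights(x_new, X[nbrs].T, eps)    # psi(x_*; x_{n_1}, ..., x_{n_k})
    return Yt[nbrs].T @ w                                # [y_{n_1} .. y_{n_k}] w'
```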