Introduction
Concentration inequalities bound how a random variable deviates from some value (usually the expectation), given assumptions about the underlying distribution. The simplest (and consequently the most versatile) result among these is the familiar Markov inequality.
Proposition 1 (Markov Inequality) Let \(X : \Omega \to \mathbb{R}_{0}^{+}\) be a random variable over the probability space \((\mu, \Sigma, \Omega)\), then for any \(a > 0\): \[ \mathbb{P}_{\mu}[X \geq a] \leq \frac{\mathbb{E}_{\mu}X}{a} \]
Proof. Let \(W = \{ \omega \in \Omega : X(\omega) \geq a \}\). Then, set \(Y = a \mathbf{1}_{W}\). Clearly, \(Y \leq X\) since \(\forall \omega \notin W, X \geq 0 = Y\) and \(\forall \omega \in W, Y = a \leq X\). Then \(\mathbb{E}_{\mu}{X} \geq \mathbb{E}_{\mu}{Y} = a \mathbb{E}_{\mu}[\mathbf{1}_{W}]\). But, \(\mathbb{E}_{\mu}[\mathbf{1}_{W}] = \mathbb{P}_{\mu}[W]\). ◻
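As a quick empirical sanity check, the inequality can be observed directly by simulation. The sketch below (the distribution, sample size, and seed are arbitrary choices) compares the empirical tail frequency of an exponential rv. against \(\mathbb{E}X/a\):

```python
import math
import random

random.seed(0)
n = 100_000
samples = [random.expovariate(1.0) for _ in range(n)]  # non-negative, E[X] = 1
mean = sum(samples) / n

for a in (1.0, 2.0, 5.0):
    empirical = sum(x >= a for x in samples) / n  # empirical P[X >= a]
    assert empirical <= mean / a  # Markov's bound holds, with room to spare
```

Since Markov's inequality uses only non-negativity and the mean, the bound is loose here: the true tail \(e^{-a}\) decays exponentially while the bound decays only as \(1/a\).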
Remark. We can also obtain the related Chebyshev inequality by the substitution \(X \to (X - \mathbb{E}X)^2\).
Convex functions are also rather useful in obtaining inequalities via Jensen’s inequality. A function \(f : \mathcal{V} \to \mathbb{R}\), where \(\mathcal{V}\) is a convex subset of a real vector space, is convex if and only if \(\forall x_1, x_2 \in \mathcal{V}, f(t x_1 + (1-t)x_2) \leq t f(x_1) + (1-t)f(x_2)\) for all \(t \in [0, 1]\). For the purposes of the present discussion, we will restrict ourselves to convex functions on real intervals. A subderivative of a convex function defined over an open interval \(\mathcal{I}\) at a point \(x_0\) is a real number \(c\) such that \(\forall x \in \mathcal{I}\), \(f(x) - f(x_0) \geq c(x - x_0)\). The set of all subderivatives at \(x_0\) is called the subdifferential and is denoted \(\partial f(x_0)\). If \(f\) is differentiable at \(x_0\), then \(\partial f(x_0) = \{f'(x_0)\}\).
Proposition 2 (Monotonicity of Difference Quotients of Convex Functions) Let \(f : \mathcal{I} \to \mathbb{R}\) be a convex function defined over an open interval \(\mathcal{I}\) and define \(q(x,y) = \frac{f(y) - f(x)}{y - x}\) for \(x \neq y\), then \(q(x,y)\) is non-decreasing in both arguments.
Proof. By symmetry, it suffices to show that \(q(x,y)\) is non-decreasing in \(y\). Let \(x_1 < x < x_2\), then \(f(x) \leq \frac{x_2 - x}{x_2 - x_1} f(x_1) + \frac{x - x_1}{x_2 - x_1} f(x_2)\). So \(f(x) - f(x_1) \leq \frac{x - x_1}{x_2 - x_1} (f(x_2) - f(x_1))\). Dividing both sides by \(x - x_1\) gives \(q(x_1,x) \leq q(x_1,x_2)\). ◻
Remark. We can now conclude that the convex function \(f\) has a subderivative everywhere. For any \(x \in \mathcal{I}\) pick \(x_1 < x\), then \(\inf_{x_2 > x}{q(x, x_2)}\) exists as \(q(x, x_2) \geq q(x_1,x)\).
Proposition 3 (Jensen’s Inequality) Let \(X : \Omega \to \mathcal{I}\) be a random variable over the probability space \((\mu, \Sigma, \Omega)\) and \(f : \mathcal{I} \to \mathbb{R}\) be a convex function, then \(f(\mathbb{E}_{\mu}[X]) \leq \mathbb{E}_{\mu}[f(X)]\).
Proof. Let \(c \in \partial f(\mathbb{E}_{\mu}[X])\), then \(\forall x \in \mathcal{I}, f(x) - f(\mathbb{E}_{\mu}X) \geq c(x - \mathbb{E}_{\mu}X)\). So, \(f(X) - f(\mathbb{E}_{\mu}X) \geq c(X - \mathbb{E}_{\mu}X)\). Taking expectations on both sides gives \(\mathbb{E}_{\mu}[f(X)] - f(\mathbb{E}_{\mu}X) \geq c(\mathbb{E}_{\mu}X - \mathbb{E}_{\mu}X) = 0\). ◻
Remark. Jensen’s inequality is quite flexible since it is often not difficult to construct a probability space to apply the result for discrete and continuous cases alike. For instance, suppose one wants to obtain the AM-GM inequality; simply set \(f = - \ln\) and consider the probability space \((\mu, \Sigma, \Omega)\) for finite \(\Omega\), \(\Sigma = \mathcal{P}(\Omega)\) with \(\mu\) as the uniform distribution over \(\Omega\).
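As a concrete instance of this construction, the sketch below (the sample values are arbitrary) checks the AM-GM inequality numerically via \(f = -\ln\) and the uniform distribution on a finite set:

```python
import math
import random

random.seed(1)
xs = [random.uniform(0.1, 10.0) for _ in range(50)]

arith = sum(xs) / len(xs)                                # E[X]
geom = math.exp(sum(math.log(x) for x in xs) / len(xs))  # exp(E[ln X])

# Jensen with f = -ln: -ln(E[X]) <= E[-ln X], i.e. geom <= arith.
assert geom <= arith
```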
The Trick
Let \(X = \sum{X_i}\) be the random variable (rv.) defined as the sum of \(n\) independent and identically distributed (i.i.d.) rv. \(X_i\), such that all moments of \(X_i\) (and consequently of \(X\)) exist. Consequently, it makes sense to consider the moment generating function (mgf) \(M_X(\lambda) = \sum_{n \in \mathbb{N}}{\frac{\lambda^n}{n!}\mathbb{E}[X^n]} = \mathbb{E}[e^{\lambda X}]\). But, as \(X\) is a sum of independent terms, we have: \(M_X(\lambda) = \prod{\mathbb{E}[e^{\lambda X_i}]}\). Set \(Y = e^{\lambda X}\); for \(\lambda > 0\), by monotonicity of \(e^x\), \(\mathbb{P}[X \geq a] = \mathbb{P}[Y \geq e^{\lambda a}]\). Now, apply Markov’s inequality: \(\mathbb{P}[Y \geq e^{\lambda a}] \leq e^{-\lambda a}\mathbb{E}[e^{\lambda X}]\). But, the term \(\mathbb{P}[X \geq a]\) is in fact free of \(\lambda\). Consequently, \[ \mathbb{P}[X \geq a] \leq \inf_{\lambda \in \mathbb{R}^{+}} e^{-\lambda a}\mathbb{E}[e^{\lambda X}] \tag{1}\]
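The infimum in Equation 1 can also be evaluated numerically. The sketch below uses a sum of \(n\) standard normals, whose mgf \(\mathbb{E}[e^{\lambda X_i}] = e^{\lambda^2/2}\) is known, so a grid-search minimum can be compared against the closed form \(e^{-a^2/2n}\) (the values of \(n\), \(a\), and the grid are arbitrary choices):

```python
import math

n, a = 10, 8.0

def chernoff(lam):
    # e^{-lam a} * prod of mgfs = exp(-lam * a + n * lam^2 / 2)
    return math.exp(-lam * a + n * lam ** 2 / 2)

# crude grid search over lambda > 0
grid = [i / 1000 for i in range(1, 5000)]
best = min(chernoff(lam) for lam in grid)
closed_form = math.exp(-a ** 2 / (2 * n))  # attained at lambda* = a / n

assert abs(best - closed_form) / closed_form < 1e-3
```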
At this point, one might desire to witness this trick in action. An illustrative example is the case of i.i.d. Bernoulli rv. Suppose \(X_i \sim Bernoulli(p)\), then \(\mathbb{E}[e^{\lambda X_i}] = (1 - p) + p e^{\lambda}\), which gives \(f(\lambda) = e^{-\lambda a}(pe^{\lambda} + (1-p))^n\) as the function to minimize. Setting \(f'(\lambda) = 0\) obtains \(\lambda_{*} = \ln{a(1-p)} -\ln{p(n - a)}\). Substitute \(\lambda_{*}, p \to u, a \to n(u + t)\) in \(f(\lambda_{*})\) to get \(f_{min} = \bigg(\frac{u}{u + t}\bigg)^{n(u+t)}\bigg(\frac{1-u}{1-u-t}\bigg)^{n(1-u-t)}\). This expression seems familiar. To admit a simplification, let us consider \(-\frac{1}{n}\ln(f_{min})\) which gives: \[ -\frac{1}{n}\ln(f_{min}) = (u+t)\ln\bigg(\frac{u+t}{u}\bigg) + (1-u-t)\ln\bigg(\frac{1-u-t}{1-u}\bigg) = D_{KL}(p' || q') \]
where \(p' \sim Bernoulli(u+t), q' \sim Bernoulli(u)\) and \(D_{KL}\) is the relative entropy or Kullback-Leibler divergence.
As pleasing as the appearance of relative entropy is, for the task at hand, it may be more conducive to consider a simpler lower bound. Let \(R(t) = D_{KL}(p'||q')\), then \(R'(t) = (\ln(u+t) - \ln(1-u-t)) - (\ln(u) - \ln(1-u))\) so \(R''(t) = \frac{1}{(u+t)(1-u-t)} \geq 4\). Expanding \(R(t)\) via Taylor’s theorem about \(t = 0\) gives \(R(t) = R(0) + tR'(0) + \frac{t^2}{2}R''(\xi)\) for some \(\xi \in [0, t]\). But, \(R(0) = 0\) and \(R'(0) = 0\) so \(R(t) = \frac{t^2}{2}R''(\xi) \geq 2t^2\).
Proceeding, we now have: \(\mathbb{P}\big[\frac{1}{n}X \geq u+t\big] \leq e^{-2nt^2}\). Combining with the result of sending \(p \to 1-p, u \to 1-u\) we have: \[ \mathbb{P}\bigg[\bigg|\frac{1}{n}X - u\bigg| \geq t\bigg] \leq 2e^{-2nt^2} \]
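A quick simulation (parameters and seed are arbitrary choices) confirms that the empirical deviation frequency of a Bernoulli sample mean indeed sits below \(2e^{-2nt^2}\):

```python
import math
import random

random.seed(2)
p, n, t, trials = 0.3, 200, 0.1, 20_000

def deviates():
    # one experiment: does the sample mean deviate from p by at least t?
    mean = sum(random.random() < p for _ in range(n)) / n
    return abs(mean - p) >= t

empirical = sum(deviates() for _ in range(trials)) / trials
bound = 2 * math.exp(-2 * n * t ** 2)

assert empirical <= bound
```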
Since \(u = \mathbb{E}[\frac{1}{n}X]\) is the only artifact of the original distribution, a natural question is to ask if this result can be salvaged even if an arbitrary distribution is “cast” into a Bernoulli rv. whilst preserving the expectation. In particular, consider independent \(X_i \in [0, 1]\) (a.s.) with \(\mathbb{E}[X_i] = u_i \in [0, 1]\), then set \(Y_i \sim Bernoulli(u_i)\). Conveniently, \(\mathbb{E}[e^{\lambda X_i}] \leq (e^{\lambda} -1)\mathbb{E}X_i + 1 = \mathbb{E}[e^{\lambda Y_i}]\) as \(\mathbb{E}[X_i^{n}] \leq \mathbb{E}X_i\), and \(M_X(\lambda) \leq \prod{(e^\lambda u_i + (1-u_i))}\). Applying the AM-GM inequality and setting \(\frac{1}{n}\sum{u_i} \to u\), \(M_X(\lambda) \leq (\frac{1}{n}\sum{(e^\lambda u_i + (1-u_i))})^n = (u e^\lambda + (1-u))^n\). The proof ensues as before. For completeness, the aforementioned ideas are packaged into the following theorems.
It is noteworthy that the role of independence in the Chernoff Technique is merely to ensure that the mgf of the sum is the product of the mgfs. Crucially, although independence is a sufficient condition, it is not necessary. Furthermore, in most applications of Equation 1 the resulting bounds are slack, largely due to the error incurred by Markov’s inequality, which is sharp only in degenerate cases.
Definition 1 A collection of rv. \(X_i\) with sum \(X = \sum{X_i}\) is said to be \(\ln M\)-subadditive with respect to a collection of real-valued functions \(\{v_i\}_{i \in [n]}\) if and only if \(\ln{M_X(\lambda)} \leq \sum{v_i(\lambda)}\) for all \(\lambda \in \mathbb{R}\). If the collection \(\{v_i\}\) is a parametrized family, then a single function \(v(\lambda, \theta)\), where \(\theta\) is the parameter of the family, is used to denote the collection. If both \(\{v_i\}\) and \(v(\lambda, \theta)\) are omitted, then presume \(v_i = \ln(M_{X_i}(\lambda))\).
Remark. It is evident that this definition is entirely algebraically motivated. The term “subadditive” is only accurate when \(v_i = \ln(M_{X_i}(\lambda))\) and is used to evoke the intuition that the mgf of the sum is upper bounded by the product of the mgf’s.
Proposition 4 If \(X_i\) are independent rv., then \(X_i\) are \(\ln M\)-subadditive.
Theorem 1 (Heterogenous, No-tail) Let \(X_i: \Omega \to \mathbb{R}\) be \(\ln M\)-subadditive rv. such that \(X_i \in [a, b]\) (a.s.) and define \(\bar{X} = \frac{1}{n}\sum{X_i}\), then for any \(t > 0\): \[ \mathbb{P}[|\bar{X} - \mathbb{E}[\bar{X}]| \geq (b-a)t] \leq 2e^{-2nt^2} \tag{2}\]
Remark. The rv. \(X_i\) are “thin-tailed” as they have no tails at all. Additionally, for any collection of bounded \(X_i\), the largest interval containing all the \(X_i\) must be chosen, weakening the bound.
Corollary 1 Let \(X_i: \Omega \to \mathbb{R}\) be \(\ln M\)-subadditive rv. such that \(X_i \in [a, b]\) (a.s.) and define \(\bar{X} = \frac{1}{n}\sum{X_i}\) with \(\mathbb{E}\bar{X} \in [u_0, u_1]\). Then, for each \(\delta \in (0, 1)\) with probability at least \((1 - \delta)\): \[ u_0 - (b-a)\sqrt{\frac{1}{2n}\ln{\frac{2}{\delta}}} \leq \bar{X} \leq u_1 + (b-a)\sqrt{\frac{1}{2n}\ln{\frac{2}{\delta}}} \]
Remark. The corollary explicitly demonstrates the distinctive feature of concentration-type bounds, specifically, that they translate bounds on expectations to bounds in probability.
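In practical terms, the corollary yields a distribution-free confidence interval for bounded rv. A minimal sketch (the helper name and the uniform test data are illustrative assumptions):

```python
import math
import random

def hoeffding_interval(samples, delta, a=0.0, b=1.0):
    # (1 - delta) confidence interval for the mean of [a, b]-bounded samples
    n = len(samples)
    mean = sum(samples) / n
    half_width = (b - a) * math.sqrt(math.log(2 / delta) / (2 * n))
    return mean - half_width, mean + half_width

random.seed(3)
samples = [random.random() for _ in range(10_000)]  # uniform on [0, 1], mean 1/2
lo, hi = hoeffding_interval(samples, delta=0.05)
assert lo <= 0.5 <= hi  # true mean covered (w.p. >= 0.95; seed fixed here)
```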
Theorem 2 (Thin-tail) Let \(X_i: \Omega \to \mathbb{R}\) be \(\ln M\)-subadditive rv. with \(\{v_i\}\), \(K = \frac{1}{n}\sum_{i} v_i\), and sample mean \(\bar{X} = \frac{1}{n}\sum{X_i}\), then there exists a rate function \(I(a)\) such that for any \(a \in \mathbb{R}\): \[ \begin{aligned} \mathbb{P}[\bar{X} \geq a] \leq e^{-nI(a)} \text{ and } & I(a) = \sup_{\lambda \in \mathbb{R}}{(\lambda a - K(\lambda))} \end{aligned} \tag{3}\]
Remark. It is convenient when \(\{v_i\}\) is linear in the parameter \(\theta \in \mathbb{R}^d\), since then \(K(\lambda) = v(\lambda, \bar{\theta})\). For identically distributed \(\ln M\)-subadditive rv., the statement makes explicit the connection between the bounds and the Legendre transform of \(\ln{M_X(\lambda)}\) where \(M_X(\lambda)\) is the mgf of any \(X_i\). Moreover, in case of independent rv., Cramér’s theorem shows that the bound so obtained is sharp in the limit: \[ \lim_{n \to \infty} -\frac{1}{n}\ln{\mathbb{P}[\bar{X} \geq a]} = I(a) \] Choosing the codomain of the rate function to be the extended reals handles the cases when \(a\) is outside the support of \(\bar{X}\).
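The Legendre transform in Equation 3 can be evaluated by brute force. The sketch below (grid bounds and resolution are arbitrary choices) recovers the known rate function \(I(a) = a^2/2\) for \(K(\lambda) = \lambda^2/2\), i.e. the standard normal case:

```python
def rate(a, K, lo=-20.0, hi=20.0, steps=40_000):
    # I(a) = sup_lambda (lambda * a - K(lambda)), approximated on a grid
    step = (hi - lo) / steps
    return max(lam * a - K(lam) for lam in (lo + i * step for i in range(steps + 1)))

def K(lam):
    return lam ** 2 / 2  # cumulant generating function of N(0, 1)

for a in (-1.0, 0.5, 2.0):
    assert abs(rate(a, K) - a ** 2 / 2) < 1e-6  # matches the closed form
```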
A commonly encountered variant is the multiplicative form of Theorem 1 obtained by “casting” to Poisson rv. with the rate function \(I(a) = a \ln({\frac{a}{u}}) - a + u\) and then sending \(a \to (1 \pm \delta)u\). This approach avoids the AM-GM inequality, and instead relies on the bound \(\sum{\ln{M_{X_i}(\lambda)}} \leq \sum{\ln(1 + u_i(e^{\lambda} -1))} \leq n u (e^{\lambda} - 1)\) where \(u = \frac{1}{n}\sum{u_i}\).
Theorem 3 (Multiplicative Form) Let \(X_i: \Omega \to \mathbb{R}\) be \(\ln M\)-subadditive rv. such that \(X_i \in [0, 1]\) (a.s.) and define \(\bar{X} = \frac{1}{n}\sum{X_i}\) with \(\mathbb{E}\bar{X} = u\), then for any \(\delta \in (0, 1)\): \[ \mathbb{P}[\bar{X} \geq (1+\delta)u] \leq \bigg(\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\bigg)^{n u} \]
\[ \mathbb{P}[\bar{X} \leq (1-\delta)u] \leq \bigg(\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}\bigg)^{n u} \]
Corollary 2 Let \(X_i: \Omega \to \mathbb{R}\) be \(\ln M\)-subadditive rv. such that \(X_i \in [0, 1]\) (a.s.) and define \(\bar{X} = \frac{1}{n}\sum{X_i}\) with \(\mathbb{E}\bar{X} = u\), then for any \(\delta \in (0, 1)\): \[ \mathbb{P}\bigg[\bigg|\frac{\bar{X}}{u} - 1\bigg| \geq \delta\bigg] \leq 2\exp\bigg({-\frac{n u \delta^2}{3}}\bigg) \]
Remark. The simplified bound is obtained by applying the inequality \((\delta + 1) \ln(\delta + 1) - \delta \geq \frac{\delta^2}{3}\), which holds for \(\delta \in (-1, 1)\).
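The inequality underlying the simplification can be verified on a grid; the sketch below checks \((\delta+1)\ln(\delta+1) - \delta \geq \delta^2/3\) across \((-1, 1)\) (the grid resolution is an arbitrary choice):

```python
import math

for i in range(-999, 1000):
    d = i / 1000  # d sweeps (-1, 1)
    lhs = (1 + d) * math.log(1 + d) - d
    assert lhs >= d ** 2 / 3 - 1e-12  # tolerance for floating-point noise
```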
Hoeffding’s Extension
The bounds obtained for non-identically distributed rv. in the previous section can be improved upon by incorporating Hoeffding’s Lemma.
Proposition 5 (Popoviciu) Let \(X : \Omega \to \mathbb{R}\) be a rv. such that \(X \in [a, b]\) (a.s.) then \(\text{Var}(X) \leq \frac{(b-a)^2}{4}\).
Proof. Let \(g(x) = (b-x)(x-a)\), then \(g(x) \geq 0\) for all \(x \in [a, b]\) and \(g(x) = -x^2 + (a+b)x - ab\). So, \(\mathbb{E}[g(X)] = g(\mathbb{E}[X]) - (\mathbb{E}[X^2] - (\mathbb{E}X)^2) = g(\mathbb{E}[X]) - \text{Var}(X) \geq 0\). Thus, \(\text{Var}(X) \leq g(\mathbb{E}[X]) \leq g(\frac{a+b}{2}) = \frac{(b-a)^2}{4}\). ◻
Lemma 1 (Hoeffding’s Lemma) Let \(X : \Omega \to \mathbb{R}\) be a rv. such that \(X \in [a, b]\) (a.s.) and \(\mathbb{E}[X] = 0\), then \(\mathbb{E}[e^{\lambda X}] \leq e^{\frac{\lambda^2}{8}(b-a)^2}\) for all \(\lambda \in \mathbb{R}\).
Proof. By convexity of \(e^x\), \(\forall x \in [a, b], e^{\lambda x} \leq \frac{b-x}{b-a}e^{\lambda a} + \frac{x-a}{b-a}e^{\lambda b}\). Since \(\mathbb{E}[X] = 0\), taking expectations gives \(\mathbb{E}[e^{\lambda X}] \leq \frac{b}{b-a}e^{\lambda a} - \frac{a}{b-a}e^{\lambda b}\). Set \(g(\lambda) = \ln{\big(\frac{b}{b-a}e^{\lambda a} - \frac{a}{b-a}e^{\lambda b}\big)}\), then \(g(0) = 0\) and \(g'(0) = 0\). Writing \(g''(\lambda) = \alpha (b-a)^2 (1-\alpha)\) with \(\alpha = \frac{b}{b-ae^{(b-a)\lambda}}\) gives \(g''(\lambda) \leq \frac{(b-a)^2}{4}\). By Taylor’s theorem, \(g(\lambda) = g(0) + \lambda g'(0) + \frac{\lambda^2}{2}g''(\xi)\) for some \(\xi \in [0, \lambda]\). So, \(g(\lambda) \leq \frac{\lambda^2}{8}(b-a)^2\). ◻
Equipped with Hoeffding’s Lemma, we can now upper bound the mgf of the sum of bounded \(\ln M\)-subadditive rv. as \(M_X(\lambda) \leq \prod{M_{X_i}(\lambda)} \leq \exp(\lambda n u + \frac{\lambda^2}{4}A^2 n)\) where \(u = \frac{1}{n}\sum{\mathbb{E}{X_i}}, A^2 = \frac{1}{2 n}\sum{(b_i - a_i)^2}\). Proceeding as in Equation 1, with \(a \to n(u + t)\), we get \(f(\lambda) = -\lambda t + \frac{\lambda^2}{4}A^2 = (\frac{1}{2}\lambda A - \frac{1}{A}t)^2 - \frac{1}{A^2}t^2\) as the function to be minimized. Thus, \(-\frac{1}{n} \ln \mathbb{P}[\bar{X} \geq u+t] \geq \frac{t^2}{A^2}\).
Theorem 4 (Hoeffding’s Inequality) Let \(X_i: \Omega \to \mathbb{R}\) be \(\ln M\)-subadditive rv. such that \(X_i \in [a_i, b_i]\) (a.s.), define \(\bar{X} = \frac{1}{n}\sum{X_i}\), then for any \(t > 0\): \[ \mathbb{P}\big[ |\bar{X} - \mathbb{E}\bar{X}| \geq t \big] \leq 2\exp\left(-\frac{2n^2 t^2}{\sum{(b_i - a_i)^2}}\right) \tag{4}\]
Remark. The bound Equation 2 is readily obtained by setting \(a_i \to a, b_i \to b\). Otherwise, the denominator \(\sum{(b_i - a_i)^2}\) is smaller than \(n(b-a)^2\) when \(a = \min a_i\) and \(b = \max b_i\) so the bound is tighter.
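To see the heterogeneous-range bound of Theorem 4 at work, the sketch below simulates independent uniforms on differing intervals (the intervals, seed, and trial count are arbitrary choices):

```python
import math
import random

random.seed(4)
ranges = [(0, 1), (0, 2), (-1, 1), (0, 0.5)] * 25  # n = 100, mixed [a_i, b_i]
n = len(ranges)
true_mean = sum((a + b) / 2 for a, b in ranges) / n
denom = sum((b - a) ** 2 for a, b in ranges)

t, trials = 0.2, 20_000
hits = 0
for _ in range(trials):
    xbar = sum(random.uniform(a, b) for a, b in ranges) / n
    hits += abs(xbar - true_mean) >= t
empirical = hits / trials
bound = 2 * math.exp(-2 * n ** 2 * t ** 2 / denom)

assert empirical <= bound
```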
Naturally, if a non-trivial bound on variance is available then the result of Hoeffding’s inequality can be improved upon. In this regard, the following lemma of Bernstein can be applied.
Lemma 2 (Bernstein’s Lemma) Let \(X : \Omega \to \mathbb{R}\) be a rv. such that \(|X| \leq c\) (a.s.), \(\mathbb{E}[X] = 0\) and \(\text{Var}(X) = \sigma^2 \leq c^2\) then \(\ln \mathbb{E}[e^{\lambda X}] \leq \frac{\sigma^2}{c^2}{(e^{\lambda c} -1 - \lambda c})\) for all \(\lambda \geq 0\).
Proof. We have \(\mathbb{E}[e^{\lambda X}] = 1 + \sum_{n = 2}^{\infty}{\frac{\lambda^n}{n!}}\mathbb{E}[X^n] = 1 + \sigma^2 B \leq e^{\sigma^2 B}\) where \(B = \sum_{n=2}^{\infty}{\frac{\lambda^{n}}{n!}\frac{\mathbb{E}[X^n]}{\sigma^2}}\). But, \(|X| \leq c\) (a.s.) gives \(\mathbb{E}[X^n] \leq c^{n-2}\mathbb{E}[X^2] = c^{n-2}\sigma^2\) for \(n \geq 2\), so \(B \leq \sum_{n=2}^{\infty}{\frac{\lambda^{n}}{n!}c^{n-2}} = \frac{1}{c^2}(e^{\lambda c} - 1 - \lambda c)\). ◻
With this lemma, the function to be minimized is \(f(t) = -ta + \frac{\sigma^2}{c^2}(e^{tc} - tc - 1)\). Setting \(f'(t) =0\) we get \(t_{*} = \frac{1}{c}\ln\big(\frac{ac}{\sigma^2} + 1\big)\). Substituting \(t \to t_{*}\) we get \(f_{min} = -\frac{\sigma^2}{c^2}h\big(\frac{ac}{\sigma^2}\big)\) where \(h(x) = (x+1)\ln(x+1) - x\). At this point, one may consider the Padé approximant of \(h\) and attempt to obtain a lower bound. Indeed \(h(x) \geq \frac{x^2}{2(1+\frac{x}{3})}\).
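The Padé-style lower bound on \(h\) can be checked numerically on a grid (the range and resolution are arbitrary choices):

```python
import math

for i in range(0, 10_001):
    x = i / 100  # x sweeps [0, 100]
    h = (1 + x) * math.log(1 + x) - x
    assert h >= x ** 2 / (2 * (1 + x / 3)) - 1e-9  # floating-point tolerance
```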
Theorem 5 (Bernstein’s Inequality) Let \(X_i: \Omega \to \mathbb{R}\) be \(\ln M\)-subadditive rv. such that \(|X_i - \mathbb{E}X_i| \leq c\) (a.s.), define \(\bar{X} = \frac{1}{n}\sum{X_i}\) and \(\sigma^2 = \frac{1}{n}\sum{\text{Var}(X_i)}\), then for any \(t > 0\): \[ \mathbb{P}\big[ |\bar{X} - \mathbb{E}\bar{X}| \geq t \big] \leq 2\exp\left(-\frac{n t^2}{2(\sigma^2 + \frac{c t}{3})}\right) \tag{5}\]
Martingale Variants
As previously mentioned, independence is not a necessary condition for the Chernoff Technique. It is worthwhile to examine Martingale Difference Sequences (MDS), which are a common relaxation of independence.
Proposition 6 (Tower Property of Conditional Expectation) Let \((\mu, \Sigma, \Omega)\) be a probability space and \(\mathcal{H} \subseteq \mathcal{G}\) be sub-\(\sigma\)-algebras of \(\Sigma\), then:
\[ \mathbb{E}[\mathbb{E}[X | \mathcal{G}] | \mathcal{H}] = \mathbb{E}[\mathbb{E}[X | \mathcal{H}] | \mathcal{G}] = \mathbb{E}[X | \mathcal{H}] \]
Remark. In the interest of brevity, an appeal to intuition is made for the notion that “finer” \(\sigma\)-algebras (\(\mathcal{G}\)) are more informative than “coarser” ones (\(\mathcal{H}\)). Consequently, conditioning on \(\mathcal{G}\), with detailed information, and then conditioning on a coarser \(\mathcal{H}\) is the same as conditioning on \(\mathcal{H}\) alone (discarding detailed information). Likewise, \(\mathcal{G}\) contains the information needed to determine the conditional expectation with respect to the coarser \(\mathcal{H}\) (which already discards information), so the value is unchanged. Setting \(\mathcal{H} = \{\emptyset, \Omega\}\), we recover the law of total expectation \(\mathbb{E}[\mathbb{E}[X | \mathcal{G}]] = \mathbb{E}[X]\).
Definition 2 (Filtration) A sequence of \(\sigma\)-algebras \(\{\mathcal{F}_t\}_{t \in \mathbb{N}}\) is a filtration if and only if \(\mathcal{F}_t \subseteq \mathcal{F}_{t+1}\) for all \(t\).
Remark. For the present discussion, only discrete time filtrations are considered. The natural filtration of a sequence of rv. \(X_t\) is the filtration \(\{\sigma(X_1, \dots, X_t)\}_{t \in \mathbb{N}}\) where \(\sigma(X_1, \dots, X_t)\) is the smallest \(\sigma\)-algebra with respect to which \(X_1, \dots, X_t\) are measurable.
Definition 3 (MDS) A sequence of rv. \(X_t\) with \(\mathbb{E}{|X_t|} < \infty\) is a martingale difference sequence if and only if \(\mathbb{E}[X_t | \mathcal{F}_{t-1}] = 0\) for all \(t \in \mathbb{N}\) where \(\{\mathcal{F}_t\}_{t \in \mathbb{N}}\) is a filtration such that \(X_t\) is measurable with respect to \(\mathcal{F}_t\) for all \(t\).
Remark. The partial sums of an MDS form a martingale, and the converse is also true up to an initial random variable.
Proposition 7 (MDS \(\ln M\)-subadditivity) Let \(X_t\) be an MDS such that \(\ln\mathbb{E}[e^{\lambda X_t} | \mathcal{F}_{t-1}] \leq v_t(\lambda)\) for all \(t, \lambda\) then \(X_t\) are \(\ln M\)-subadditive with \(\{v_t\}\).
Proof. Define \(S_n = \sum_{t=1}^n X_t\) (with \(S_0 = 0\)) and set \(A_k(\lambda) = \ln\mathbb{E}[e^{\lambda S_{k-1}}] + \sum_{t=k}^n{v_t(\lambda)}\), so that \(A_{n+1}(\lambda) = \ln M_{S_n}(\lambda)\). Since \(S_{k-1} = S_{k-2} + X_{k-1}\) and \(S_{k-2}\) is measurable with respect to \(\mathcal{F}_{k-2}\), the tower property gives \(\ln\mathbb{E}[e^{\lambda S_{k-1}}] = \ln\mathbb{E}[e^{\lambda S_{k-2}}\mathbb{E}[e^{\lambda X_{k-1}} | \mathcal{F}_{k-2}]]\). Now, \(\mathbb{E}[e^{\lambda X_{k-1}} | \mathcal{F}_{k-2}] \leq e^{v_{k-1}(\lambda)}\), so \(A_k(\lambda) \leq \ln\mathbb{E}[e^{\lambda S_{k-2}}] + v_{k-1}(\lambda) + \sum_{t=k}^n{v_t(\lambda)} = A_{k-1}(\lambda)\). Chaining these inequalities, \(\ln M_{S_n}(\lambda) = A_{n+1}(\lambda) \leq A_1(\lambda) = \sum_{t=1}^n{v_t(\lambda)}\). ◻
Combining Proposition 7 with Lemma 1 leads us to an analogous inequality for martingales.
Theorem 6 (Azuma’s Inequality) Let \(X_t: \Omega \to \mathbb{R}\) be a martingale such that \(|X_t - X_{t-1}| \leq c_t\) (a.s.) over \(\mathcal{F}_{t-1}\) for all \(t\), then for any \(\epsilon > 0\): \[ \mathbb{P}\big[ |X_n - X_0| \geq \epsilon \big] \leq 2\exp\left(-\frac{\epsilon^2}{2\sum_{i=1}^n{c_i^2}}\right) \tag{6}\]
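The canonical example is the simple \(\pm 1\) random walk, a martingale with \(c_t = 1\). A simulation sketch (parameters and seed are arbitrary choices):

```python
import math
import random

random.seed(5)
n, eps, trials = 100, 25, 20_000
hits = 0
for _ in range(trials):
    walk = sum(random.choice((-1, 1)) for _ in range(n))  # X_n - X_0
    hits += abs(walk) >= eps
empirical = hits / trials
bound = 2 * math.exp(-eps ** 2 / (2 * n))  # Azuma with c_t = 1

assert empirical <= bound
```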
We could also obtain a Bernstein-type inequality for martingales by applying Lemma 2 instead of Lemma 1. The resulting bound is known as Freedman’s inequality.
Theorem 7 (Freedman’s Inequality) Let \(X_t: \Omega \to \mathbb{R}\) be a martingale such that \(|X_t - X_{t-1}| \leq c\) (a.s.) over \(\mathcal{F}_{t-1}\) for all \(t\), define \(V_n = \sum_{t=1}^n{\mathbb{E}[(X_t - X_{t-1})^2 | \mathcal{F}_{t-1}]}\), then for any \(\epsilon > 0\): \[ \mathbb{P}\big[ |X_n - X_0| \geq \epsilon \land V_n \leq \sigma^2 \big] \leq 2\exp\left(-\frac{\epsilon^2}{2(\sigma^2 + \frac{c \epsilon}{3})}\right) \]
A recurring martingale is the exposure martingale, obtained by computing conditional expectations of a function while sequentially revealing the variables that comprise its argument.
Proposition 8 (Exposure Martingale) Let \(\Omega = \prod_{i=1}^n{\Omega_i}\) be a product space, \(f: \Omega \to \mathbb{R}\) a function, and let \(W_i: \Omega \to \Omega_i\) denote the projection \(W_i(\mathbf{w}) = w_i\) of \(\mathbf{w} \in \Omega\) onto \(\Omega_i\). Define the exposure martingale of \(f\) as the sequence of rv. \(X_t: \Omega \to \mathbb{R}\) given by \(X_t = \mathbb{E}[f(W_1, \dots, W_n) | W_1, \dots, W_t]\). Then, the exposure martingale \(X_t\) is indeed a martingale.
Remark. Evidently, \(\{\mathcal{F}_t\}\), where \(\mathcal{F}_t = \sigma(\pi_i^{-1}(S_i) : S_i \in \Sigma_i, i \in [t]) = \sigma(S_1 \times \dots \times S_t \times \Omega_{t+1} \times \dots \times \Omega_n : S_i \in \Sigma_i)\), is a filtration since \(\mathcal{F}_t \supseteq \mathcal{F}_{t-1}\).
To apply Theorem 6 to an exposure martingale, an upper bound on the change in the value of the function \(f\) when one variable is revealed is sufficient. This leads to the notion of bounded differences and co-ordinatewise Lipschitz functions.
Definition 4 (Bounded Differences) A function \(f:\mathcal{A}_1 \times \dots \times \mathcal{A}_n \to \mathbb{R}\) satisfies the bounded differences property if and only if \(\forall i \in [n], \exists c_i \in \mathbb{R}^{+}\) such that \(|f(\mathbf{a}) - f(\mathbf{a}')| \leq c_i\) whenever \(\mathbf{a} = (a_1, \dots, a_i, \dots ,a_n)\) and \(\mathbf{a}' = (a_1, \dots, a_i', \dots, a_n)\) differ only in the \(i\)-th co-ordinate. Let \(L = \max{c_i}\), then \(f\) is said to be co-ordinatewise \(L\)-Lipschitz.
Lemma 3 (Bounded Differences Exposure Martingale) Let \(f:\mathcal{A}_1 \times \dots \times \mathcal{A}_n \to \mathbb{R}\) be a function satisfying the bounded differences property with constants \(c_i\) and let \(X_t\) be the exposure martingale of \(f\) with filtration \(\{\mathcal{F}_t\}\), then \(|X_t - X_{t-1}| \leq c_t\) (a.s.) over \(\mathcal{F}_{t-1}\) for all \(t\).
Proof. Let \(\mathbf{w} \in \Omega\) be the outcome corresponding to observation \(\mathbf{a} \in \mathcal{A}_1 \times \dots \times \mathcal{A}_n\) and \(W_i: \Omega \to \mathcal{A}_i\) be the exposure rv. Then \(X_t = \mathbb{E}[f(W_1, \dots, W_t, \dots, W_n)|\mathcal{F}_t]\), but \(W_1,\dots,W_t\) are measurable with respect to \(\mathcal{F}_t\) so \(X_t(\mathbf{w}) = \mathbb{E}[f(a_1, \dots, a_t, W_{t+1}, \dots, W_n)|\mathcal{F}_t]\). Define the rv. \(g(W_{t+1},\dots, W_n) = f(a_1, \dots, a_t, W_{t+1}, \dots, W_n)\) so that \(X_t(\mathbf{w}) = \mathbb{E}[g(W_{t+1},\dots,W_n)|\mathcal{F}_t] = \mathbb{E}[g(W_{t+1},\dots,W_n)|\mathcal{F}_{t-1}]\). So \(|X_t - X_{t-1}| \leq \mathbb{E}[|g(W_{t+1},\dots,W_n) - f(a_1, \dots, a_{t-1}, W_t, W_{t+1}, \dots, W_n)| \,|\, \mathcal{F}_{t-1}]\). Clearly, the arguments of \(g\) and \(f\) differ only in the \(t\)-th co-ordinate, so \(|g(W_{t+1},\dots,W_n) - f(a_1, \dots, a_{t-1}, W_t, W_{t+1}, \dots, W_n)| \leq c_t\). Therefore \(|X_t - X_{t-1}| \leq c_t\) (a.s.) over \(\mathcal{F}_{t-1}\). ◻
Equipped with Lemma 3, we can now model a Lipschitz function of independent rv. as a martingale with bounded differences and apply Azuma’s inequality.
Theorem 8 (McDiarmid’s Inequality) Let \(X_i: \Omega \to \mathcal{A}_i\) be independent rv., let \(f:\mathcal{A}_1 \times \dots \times \mathcal{A}_n \to \mathbb{R}\) be a function satisfying the bounded differences property with constants \(c_i\), then for any \(t > 0\): \[ \mathbb{P}\big[ |f(X_1, \dots, X_n) - \mathbb{E}[f(X_1, \dots, X_n)]| \geq t \big] \leq 2\exp\left(-\frac{2t^2}{\sum_{i=1}^n{c_i^2}}\right) \]
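As an illustration, consider \(f\) counting distinct values among \(n\) independent draws from a finite alphabet; changing one draw changes \(f\) by at most one, so \(c_i = 1\). A simulation sketch (alphabet size, \(n\), and seed are arbitrary choices, and the empirical mean stands in for \(\mathbb{E}[f]\)):

```python
import math
import random

random.seed(6)
n, k, trials = 50, 20, 20_000  # n draws from an alphabet of k symbols

def f():
    # number of distinct symbols seen; bounded differences with c_i = 1
    return len({random.randrange(k) for _ in range(n)})

samples = [f() for _ in range(trials)]
ef = sum(samples) / trials  # empirical proxy for E[f]

t = 8
empirical = sum(abs(s - ef) >= t for s in samples) / trials
bound = 2 * math.exp(-2 * t ** 2 / n)  # McDiarmid with sum c_i^2 = n

assert empirical <= bound
```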
In Action
Having established the Chernoff Technique and its variants, we can examine some of its applications.
Set Balancing
Consider the problem of dividing a group of subjects into two qualitatively similar groups in the design of an experiment. Let there be \(N \in \mathbb{N}\) subjects and \(M \in \mathbb{N}\) attributes; the problem is to find a partition of the subjects into two groups T (treatment) and C (control) such that the excess of subjects with a given attribute is small. Formally, let \(A \in \{0, 1\}^{M \times N}\) be the matrix of attributes where \(A_{ij} = 1\) if subject \(j\) has attribute \(i\) and \(0\) otherwise. Let \(\mathbf{x} \in \{-1, 1\}^N\) be the vector of group assignments where \(x_j = 1\) if subject \(j\) is assigned to group T and \(-1\) otherwise. The problem is to find an assignment such that \(\|A\mathbf{x}\|_{\infty}\) is small. The rate function for a Rademacher rv. is \(I(a) = \frac{1}{2}(1 + a)\ln(1 + a) + \frac{1}{2}(1 - a)\ln(1 - a)\). Observe that \(I(a) \geq \frac{1}{2}a^2\) (by Taylor’s theorem and \(I''(a) \geq 1\)), so \(\mathbb{P}[|(A\mathbf{x})_i| \geq t] \leq 2e^{-\frac{t^2}{2N}}\) (since at most \(N\) terms may participate in the sum). By the union bound, \(\mathbb{P}[\|A\mathbf{x}\|_{\infty} \geq t] \leq 2Me^{-\frac{t^2}{2N}}\). Setting \(\delta = 2Me^{-\frac{t^2}{2N}}\) gives \(t = \sqrt{2N\ln(2M/\delta)}\). Choosing \(\delta = \frac{2}{M}\), we have, with probability at least \(\big(1 - \frac{2}{M}\big)\), \(\|A\mathbf{x}\|_{\infty} \lt \sqrt{4N\ln(M)}\) when \(\mathbf{x}_i \sim Rademacher\). Thus, randomly assigning subjects to groups T and C with equal probability is sufficient to ensure that the excess of subjects with a given attribute is small with high probability. This is moderately useful in scenarios where finding out the attributes of the subjects is costly or infeasible. In the absence of knowledge of the attributes, any deterministic algorithm would be susceptible to an adversarial setup of attributes, so the random assignment is optimal in this sense.
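A single random draw already exhibits the guarantee; the sketch below (matrix dimensions and seed are arbitrary choices) checks \(\|A\mathbf{x}\|_{\infty} < \sqrt{4N\ln(M)}\) for a random attribute matrix:

```python
import math
import random

random.seed(7)
M, N = 200, 500
A = [[random.randint(0, 1) for _ in range(N)] for _ in range(M)]  # attributes
x = [random.choice((-1, 1)) for _ in range(N)]                    # assignment

# discrepancy ||A x||_inf: worst attribute imbalance between T and C
disc = max(abs(sum(a_ij * x_j for a_ij, x_j in zip(row, x))) for row in A)
threshold = math.sqrt(4 * N * math.log(M))

assert disc < threshold  # holds with probability at least 1 - 2/M
```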
Cliques in Random Graphs
Consider the Erdős–Rényi model \(G \sim \mathcal{G}(n, p)\) where \(n\) is the number of vertices and \(p\) is the probability of an edge between any two vertices. Fix an arbitrary enumeration of the edges of \(K_{n} \supseteq G\), each of which is present in \(G\) with probability \(p\). Let \(X_t\) be the expected number of cliques of size \(k\) in \(G\) having examined the membership of the first \(t\) edges. By Proposition 8, \(X_t\) is the edge exposure martingale. Moreover, \(|X_t - X_{t-1}| \leq \binom{n-2}{k-2}\) (a.s.) since the addition of an edge can create at most \(\binom{n-2}{k-2}\) new cliques. Thus, by Azuma’s inequality, \(\mathbb{P}\big[X_n \leq X_0 - \epsilon\big] \leq \exp\left(-\frac{\epsilon^2}{n(n-1)\binom{n-2}{k-2}^2}\right)\). But \(X_0 = \binom{n}{k}p^{\binom{k}{2}}\) so we have with probability at least \((1 - \frac{1}{n})\): \[ \rho(n, p, k) \geq \max\bigg(p^{\frac{1}{2}k(k-1)} - k(k-1)\sqrt{\frac{\ln n}{n(n-1)}}, 0\bigg) \]
where \(\rho(n, p, k)\) is the clique density of \(G\) defined as the fraction of all possible \(k\)-cliques that are present in \(G\). An illustration of this bound is provided in Figure 1 for triangles.
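For triangles (\(k = 3\)), the bound can be compared against a direct count; the sketch below (graph size, \(p\), and seed are arbitrary choices) counts triangles from the adjacency matrix and checks the high-probability lower bound on the density:

```python
import math
import random

random.seed(8)
n, p = 60, 0.5
adj = [[0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        if random.random() < p:
            adj[i][j] = adj[j][i] = 1  # undirected edge with probability p

# direct triangle count over all vertex triples
triangles = sum(adj[i][j] and adj[j][k] and adj[i][k]
                for i in range(n) for j in range(i + 1, n) for k in range(j + 1, n))
density = triangles / math.comb(n, 3)  # rho(n, p, 3)

slack = 3 * 2 * math.sqrt(math.log(n) / (n * (n - 1)))  # k(k-1) sqrt(ln n / n(n-1))
assert density >= max(p ** 3 - slack, 0)  # the high-probability lower bound
```

For this modest \(n\) the slack term dominates and the bound is vacuous; the count itself nonetheless concentrates near \(p^3\), as the theorem predicts for large \(n\).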
Random Projections
Consider a matrix \(\mathbf{A} \in \mathbb{R}^{m,n}\) such that \(A_{ij} \overset{i.i.d.}{\sim} \mathcal{N}(0,1)\). Let \(\mathbf{x} \in \mathbb{S}^{n-1}\) be a unit vector in \(\mathbb{R}^n\), then \(\mathbf{y} = \mathbf{A}\mathbf{x}\) is a Gaussian vector in \(\mathbb{R}^m\) with \(y_i \sim \mathcal{N}(0,1)\), since \(Var(y_i) = \sum_{j=1}^n Var(A_{ij}x_j) = \sum_{j=1}^n x_j^2 = 1\). Naturally, we have the following proposition.
Proposition 9 (Relative Length of Image under Random Matrix) Let \(\mathbf{A} \in \mathbb{R}^{m,n}\) be a random matrix such that \(A_{ij} \overset{i.i.d.}{\sim} \mathcal{N}(0,1)\), then for any nonzero \(\mathbf{x} \in \mathbb{R}^n\): \[ \frac{||\mathbf{Ax}||_2^2}{||\mathbf{x}||_2^2} \sim \chi^2(m) \]
We may also attempt to obtain a tail bound for the relative length by applying the Chernoff Technique. Recall that the rate function for \(\chi^2(m)\) is \(I(a) = \frac{1}{2}(a - m - m\ln{\frac{a}{m}})\), so \(\mathbb{P}\bigg[\bigg|\frac{||\mathbf{Ax}||_2^2}{||x||_2^2} - m\bigg| \geq \epsilon m\bigg] \leq e^{-\frac{m}{2}(\epsilon - \ln(1+\epsilon))} + e^{-\frac{m}{2}(-\epsilon - \ln(1-\epsilon))}\). From the power series expansions we have \(\epsilon - \ln(1+\epsilon) \geq \frac{\epsilon^2}{2} - \frac{\epsilon^3}{3}\) and \(-\epsilon - \ln(1-\epsilon) \geq \frac{\epsilon^2}{2} + \frac{\epsilon^3}{3}\), so we get the bound \(\mathbb{P}\bigg[\bigg|\frac{||\mathbf{Ax}||_2^2}{||x||_2^2} - m\bigg| \geq \epsilon m\bigg] \leq 2e^{-\frac{m}{2}\left(\frac{\epsilon^2}{2} - \frac{\epsilon^3}{3}\right)}\). If we consider a collection of \(N\) such vectors, then applying the union bound we get \(\mathbb{P}\bigg[\exists i \in [N] : \bigg|\frac{||\mathbf{Ax}_i||_2^2}{||\mathbf{x}_i||_2^2} - m\bigg| \geq \epsilon m\bigg] \leq 2Ne^{-\frac{m}{2}\left(\frac{\epsilon^2}{2} - \frac{\epsilon^3}{3}\right)}\). We may now seek the minimum value of \(m\) such that all vector lengths are within \((m - \epsilon m, m + \epsilon m)\). In other words, we want \(1 - 2N\exp(-\frac{m}{2}(\frac{\epsilon^2}{2} - \frac{\epsilon^3}{3})) > 0\). Solving for \(m\) we get \(m > \frac{2\ln(2N)}{\frac{\epsilon^2}{2} - \frac{\epsilon^3}{3}}\). If we now set \(\mathbf{P} = \frac{1}{\sqrt{m}}\mathbf{A}\) then we have, in effect shown (for pairwise displacements):
Lemma 4 Let \(\mathbf{x}_1, \dots, \mathbf{x}_N \in \mathbb{R}^n\) be a collection of vectors and \(\epsilon \in (0, 1)\), then for any \(m > \frac{4\ln(N)}{\frac{\epsilon^2}{2} - \frac{\epsilon^3}{3}}\), there exists a linear map \(\mathbf{P} : \mathbb{R}^n \to \mathbb{R}^m\) such that \(\forall i, j \in [N]\): \[ (1 - \epsilon)||\mathbf{x}_i - \mathbf{x}_j||_2^2 \leq ||\mathbf{P}\mathbf{x}_i - \mathbf{P}\mathbf{x}_j||_2^2 \leq (1 + \epsilon)||\mathbf{x}_i - \mathbf{x}_j||_2^2 \]
Remark. We can now conclude that, insofar as a finite set of \(N\) points is concerned, approximately distance-preserving maps to \(\mathbb{R}^m\) exist for \(m = O(\epsilon^{-2}\ln(N))\). The celebrated Johnson-Lindenstrauss lemma is a strengthening of this result, which gives the tighter bound \(m \geq \frac{8\ln(N)}{\epsilon^2}\), and shows that there exists a set of points for which \(\Omega(\epsilon^{-2}\ln(N))\) dimensions are necessary to preserve pairwise distances within a factor of \((1 \pm \epsilon)\).
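Finally, the whole argument can be exercised end to end: draw a Gaussian matrix, scale by \(1/\sqrt{m}\), and check that all pairwise squared distances are preserved within \((1 \pm \epsilon)\). The sketch below takes \(m\) with a factor-two margin over the lemma's threshold so that a single seeded draw succeeds comfortably (all parameters are arbitrary choices):

```python
import math
import random

random.seed(9)
N, n, eps = 30, 200, 0.5
# lemma threshold, doubled for headroom in a one-shot demonstration
m = 2 * math.ceil(4 * math.log(N) / (eps ** 2 / 2 - eps ** 3 / 3))

pts = [[random.gauss(0, 1) for _ in range(n)] for _ in range(N)]
P = [[random.gauss(0, 1) / math.sqrt(m) for _ in range(n)] for _ in range(m)]

def proj(v):
    # P v, the scaled Gaussian projection into R^m
    return [sum(r_j * v_j for r_j, v_j in zip(row, v)) for row in P]

proj_pts = [proj(v) for v in pts]

def sqdist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

ok = all(
    (1 - eps) * sqdist(pts[i], pts[j])
    <= sqdist(proj_pts[i], proj_pts[j])
    <= (1 + eps) * sqdist(pts[i], pts[j])
    for i in range(N) for j in range(i + 1, N)
)
assert ok  # all pairwise squared distances preserved within (1 +- eps)
```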