An elementary model
Consider the data-generating process specified by the tuple \((\mu, A, a, c)\), where \(\mu\) is a probability measure over a set of objects \(\mathcal{X}\), each with an associated label in \(\{0,1\}\); \(A\) is the set of all possible object attributes; \(a\) is an observation rule mapping objects to attributes; and \(c\) gives the conditional probability of the label being \(1\) given the observed attribute. The classification task is to learn an approximation \(s\) of \(c\) from training data \(\tau = ((a_1,y_1),\ldots,(a_n, y_n))\), \(a_i \in A\), \(y_i \in \{0,1\}\). A learner \(L\) is an algorithm that consumes \(\tau\) and outputs \(s\) with no knowledge of \(c\) or \(\mu\). Generalization accuracy is the probability of predicting the correct label for unobserved attributes.
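To make the setup concrete, here is a minimal, illustrative Python sketch. The names `sample_training_data` and `toy_learner`, the uniform attribute distribution, and the choice \(m = 4\) are assumptions made for this note, not part of the model's definition; each object is identified with its observed attribute.

```python
# Illustrative sketch of the (mu, A, a, c) data-generating process, assuming a
# finite attribute set and identifying each object with its observed attribute.
import numpy as np

rng = np.random.default_rng(0)

m = 4                              # |A|: number of distinct attributes (assumed small)
p_attr = np.full(m, 1.0 / m)       # P(a_i): attribute distribution induced by mu and a
c = rng.uniform(size=m)            # c_i = P(label = 1 | attribute a_i)

def sample_training_data(n):
    """Draw tau = ((a_1, y_1), ..., (a_n, y_n)) from the generating process."""
    attrs = rng.choice(m, size=n, p=p_attr)
    labels = (rng.uniform(size=n) < c[attrs]).astype(int)
    return list(zip(attrs, labels))

def toy_learner(tau):
    """A toy learner L: s_i is the empirical frequency of label 1 for attribute a_i,
    defaulting to 0.5 for attributes never seen in tau. L never sees c or mu."""
    s = np.full(m, 0.5)
    for i in range(m):
        ys = [y for (a, y) in tau if a == i]
        if ys:
            s[i] = float(np.mean(ys))
    return s

print(toy_learner(sample_training_data(n=10)))
```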
A Conservation Law
A useful notion of generalization performance is generalization accuracy measured relative to a random guesser, which always achieves a generalization accuracy of \(0.5\). Accordingly, define generalization performance = generalization accuracy \(- 0.5\).
Let \(T = (A \times \{0,1\})^n\) be the set of all possible training data and let \(GP_{L}(a_0, c; \tau)\) denote the generalization performance of a learner \(L\) on objects with \(a(x) = a_0\), given \(\tau \in T\). Since \(A\) is finite, say \(|A| = m\), every \(c\) can be identified with an \(m\)-tuple \((c(a_1), c(a_2), \ldots, c(a_m))\).
To evaluate the learner over all possible targets, we can consider an expression of the form: \[\int_{[0,1]^m}{\mathbb{E}_{T}\,\mathbb{E}_A\,[GP_L(c)]\, dc} = \int_{[0,1]^m}{\sum_{\tau \in T}\sum_{a_i \notin \tau} \mathbb{P}(\tau | c)\,\mathbb{P}(a_i)\,GP_L(a_i, c; \tau)\, dc}\]
Now consider just the expression \(I(\tau) = \int_{[0,1]^m}{\sum_{a_i \notin \tau}{\mathbb{P}(\tau|c)\,\mathbb{P}(a_i)\,GP_L(a_i, c; \tau)}\, dc}\). For a fixed \(\tau \in T\), define, for every \(c \in [0,1]^m\), the point \(c'\) by \(c'_i = c_i\) if \(a_i \in \tau\) and \(c'_i = 1 - c_i\) otherwise. For a given \(\tau\), the map \(c \mapsto c'\) is therefore a measure-preserving bijection of \([0,1]^m\) onto itself. Observe that \(\mathbb{P}(\tau|c) = \mathbb{P}(\tau|c')\), since \(c\) and \(c'\) agree on every attribute appearing in \(\tau\). Changing variables (and using this observation), we have \(I(\tau) = \int_{[0,1]^m}{\sum_{a_i \notin \tau}{\mathbb{P}(\tau|c)\,\mathbb{P}(a_i)\,GP_L(a_i, c'; \tau)}\, dc}\). Let \(s_i\) be the probability that \(L\) predicts 1 for \(a_i\); since \(L\) sees only \(\tau\), \(s_i\) does not depend on \(c\). Then, for all \(a_i \notin \tau\), \[\begin{aligned} GP_L(a_i, c; \tau) &= s_i c_i + (1-s_i)(1-c_i) - 0.5 \\ GP_L(a_i, c';\tau) &= s_i (1-c_i) + (1-s_i)c_i - 0.5 \\ \implies GP_L(a_i,c;\tau) &+ GP_L(a_i,c';\tau) = 0. \end{aligned}\]
Thus, averaging the two expressions for \(I(\tau)\), \[I(\tau) = \frac{1}{2}\int_{[0,1]^m}{\mathbb{P}(\tau|c)\sum_{a_i \notin \tau}\mathbb{P}(a_i)\big(GP_L(a_i,c;\tau) + GP_L(a_i,c';\tau)\big)\, dc} = 0\]
We have effectively shown that \[\int_{[0,1]^m}{\mathbb{E}_{T}\,\mathbb{E}_A\,[GP_L(c)]\, dc} = \sum_{\tau \in T}{I(\tau)} = 0\]
In other words, “generalization performance is conserved over all learning situations”! Consequently:
- If there are problems on which one learner significantly out-performs another, then there must exist problems on which it under-performs by the same total amount.
- Averaged over all problems, every optimizer is as good (or as bad) as every other optimizer (Wolpert and Macready, 1997).
- We cannot, a priori, expect a learner to do better than random chance.
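The conservation law can be sanity-checked numerically. The sketch below reuses the toy setup and frequency-counting learner assumed earlier and averages the generalization performance on unseen attributes over targets \(c\) drawn uniformly from \([0,1]^m\) and training sets \(\tau\); up to Monte Carlo noise the average comes out to zero, as it would for any learner whose predictions depend only on \(\tau\).

```python
# Monte Carlo check of the conservation law for a frequency-counting learner:
# averaging GP over c ~ Uniform([0,1]^m), training sets tau, and unseen attributes
# should give ~0 (the learner's predictions depend only on tau, never on c).
import numpy as np

rng = np.random.default_rng(1)
m, n, trials = 4, 6, 100_000
p_attr = np.full(m, 1.0 / m)

total = 0.0
for _ in range(trials):
    c = rng.uniform(size=m)                           # target drawn uniformly from [0,1]^m
    attrs = rng.choice(m, size=n, p=p_attr)
    labels = (rng.uniform(size=n) < c[attrs]).astype(int)
    s = np.full(m, 0.5)                               # learner's predicted P(y = 1 | a_i)
    for i in np.unique(attrs):
        s[i] = labels[attrs == i].mean()
    unseen = np.setdiff1d(np.arange(m), attrs)
    if unseen.size == 0:
        continue                                      # GP is only measured off the training set
    gp = s[unseen] * c[unseen] + (1 - s[unseen]) * (1 - c[unseen]) - 0.5
    total += (p_attr[unseen] * gp).sum()

print(total / trials)   # hovers around 0, as the argument above predicts
```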
Kolmogorov-style NFL
Goldblum et al. (2024) discuss a Kolmogorov-style formulation of the No Free Lunch Theorem. Let \(T \in (X \times Y)^{n}\) be a dataset sampled uniformly over both objects (\(X\)) and labels (\(Y\)). Then, with probability at least \(1 - \delta\), for every classifier that uses a conditional distribution \(p(y|x)\), the empirical cross-entropy (\(CE\)) is bounded below as: \[CE(p) \geq \ln|Y| -\frac{\ln(2)}{n}\big(K(p) + 2\log_{2}(K(p)) - \log(\delta) + c\big)\] where \(c\) is a constant depending on the choice of language for \(K\). This is an intuitive result: in particular, it implies that if the data is incompressible and the dataset is large enough, no model can represent a classifier with cross-entropy appreciably lower than that of random guessing.
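As a back-of-the-envelope illustration (not a computation from the paper), the bound can be evaluated for assumed values of \(K(p)\), treated here as a model's description length in bits, with the unsubscripted \(\log\) read as base 2 and the language-dependent constant set to zero; all of these readings are assumptions for the sketch.

```python
# Illustrative evaluation of the cross-entropy lower bound above. K(p) is uncomputable,
# so it is stood in for by an assumed description length in bits; the unsubscripted log
# is read as base 2 and the language-dependent constant c is set to 0.
import math

def ce_lower_bound(num_classes, n, k_bits, delta=0.05, const=0.0):
    """Lower bound (in nats) on empirical cross-entropy for a classifier of complexity k_bits."""
    return math.log(num_classes) - (math.log(2) / n) * (
        k_bits + 2 * math.log2(k_bits) - math.log2(delta) + const
    )

# 10 classes, a million uniformly sampled examples, and a classifier describable in
# ~100k bits: the bound sits within ~0.07 nats of ln(10) ~ 2.30, i.e. random guessing.
print(ce_lower_bound(num_classes=10, n=1_000_000, k_bits=100_000))
```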
PAC Learnability and NFL
Valiant (1984) introduced Probably Approximately Correct (PAC) learning as a formal model for machine learning tasks, which casts learning as a search through a set of hypotheses that, with high probability, returns a hypothesis with bounded error.
A hypothesis class \(\mathcal{H}\) is agnostic PAC-learnable with respect to a set \(Z\) and a (measurable) loss function \(l:\mathcal{H} \times Z \to \mathbb{R}_{+}\) if there exist a function \(m_{\mathcal{H}}: (0,1)^2 \to \mathbb{N}\) and a learning algorithm such that: for every \(\epsilon, \delta \in (0,1)\) and every probability distribution \(D\) over \(Z\), when running the algorithm on \(m \geq m_{\mathcal{H}}(\epsilon, \delta)\) i.i.d. examples generated by \(D\), the algorithm returns \(h \in \mathcal{H}\) such that, with probability at least \(1 - \delta\), \[\mathcal{L}_{D}(h) \leq \min_{h' \in \mathcal{H}}{\mathcal{L}_D(h')} + \epsilon\] where \(\mathcal{L}_{D}(h) = \mathbb{E}_{z \sim D}[l(h, z)]\).
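For a finite hypothesis class with a loss bounded in \([0,1]\), uniform convergence gives the standard sample-complexity bound \(m_{\mathcal{H}}(\epsilon, \delta) \leq \lceil 2\ln(2|\mathcal{H}|/\delta)/\epsilon^{2} \rceil\) for ERM (Shalev-Shwartz and Ben-David, 2022). A small helper, with example inputs chosen purely for illustration:

```python
# Evaluates the standard agnostic-PAC sample-complexity bound for a finite hypothesis
# class with loss bounded in [0,1]: m_H(eps, delta) <= ceil(2 * ln(2|H|/delta) / eps^2).
import math

def agnostic_pac_sample_complexity(h_size, eps, delta):
    """Number of i.i.d. examples sufficient for ERM over a finite class of size h_size."""
    return math.ceil(2 * math.log(2 * h_size / delta) / eps ** 2)

# e.g. |H| = 1000 hypotheses, excess error eps = 0.05, confidence delta = 0.05:
print(agnostic_pac_sample_complexity(1000, 0.05, 0.05))   # ~8.5k examples
```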
Suppose \(A\) is a learning algorithm for binary classification with respect to the 0-1 loss over a domain \(\mathcal{X}\), and let \(m \in \mathbb{N}\) with \(m < \frac{|\mathcal{X}|}{2}\). Then there exists a distribution \(D\) over \(\mathcal{X} \times \{0,1\}\) and a labelling function \(f\) such that \(\mathcal{L}_D(f) = 0\) and \(\mathbb{P}_{S \sim D^{m}}\big(\mathcal{L}_D(A(S)) \geq \frac{1}{8}\big) \geq \frac{1}{7}\).
It can be shown that, for a set \(C \subset \mathcal{X}\) with \(|C| = 2m\), any learner that observes only \(m\) examples has no information about the labels of the remaining points of \(C\). Considering all \(2^{2m}\) functions \(f_1, f_2, \ldots, f_{2^{2m}}\) from \(C\) to \(\{0,1\}\), a “complement-pairing” argument shows that there exists \(i \in [2^{2m}]\) with \[{\mathbb{E}_{S \sim D_{i}^{m}}}[{\mathcal{L}_{D_i}(A(S))}] \geq \frac{1}{4}\] where \[D_i(x,y) = \begin{cases} \frac{1}{2m} & \text{if } y = f_i(x)\\ 0 & \text{otherwise.} \end{cases}\]
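For intuition, the construction can be verified by brute force on a tiny instance: enumerate every labelling of a \(2m\)-point set and compute the exact expected loss of a concrete learner. The learner below (one that memorizes its sample and predicts 0 elsewhere) is an assumption made for the sketch; the worst labelling already forces an expected loss well above \(1/4\).

```python
# Brute-force check of the claim above for m = 2: over all 2^(2m) labelings of a
# 2m-point set C, some D_i forces an expected 0-1 loss of at least 1/4 on a toy
# learner that memorizes its sample and predicts 0 on unseen points.
import itertools

m = 2
C = list(range(2 * m))                                        # the 2m-point domain
labelings = list(itertools.product([0, 1], repeat=len(C)))    # all f_i : C -> {0,1}

def learner(sample):
    """Toy learner A: remember observed labels, predict 0 everywhere else."""
    memory = dict(sample)
    return lambda x: memory.get(x, 0)

worst = 0.0
for f in labelings:
    # D_i is uniform over C with labels f; enumerate all m-point training sequences.
    expected_loss = 0.0
    for xs in itertools.product(C, repeat=m):
        h = learner([(x, f[x]) for x in xs])
        loss = sum(h(x) != f[x] for x in C) / len(C)          # 0-1 loss under D_i
        expected_loss += loss / len(C) ** m                   # each sequence has prob (1/2m)^m
    worst = max(worst, expected_loss)

print(worst)    # 0.5625 for this learner, comfortably >= 1/4
```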
For a hypothesis class \(\mathcal{H}\) of binary labelling functions over \(\mathcal{X}\), the VC-Dimension is defined as the cardinality of the largest \(C \subset \mathcal{X}\) such that the restriction \(\mathcal{H}_{C}\) contains all functions from \(C\) to \(\{0,1\}\), i.e., \(C\) is shattered by \(\mathcal{H}\).
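The definition translates directly into a brute-force procedure for finite classes: for increasing \(k\), check whether any \(C\) of size \(k\) is shattered. The sketch below does exactly that; the helper name `vc_dimension` and the threshold-function example class are assumptions for illustration.

```python
# Brute-force VC-Dimension over a finite domain: find the largest C whose restriction
# H|_C realizes all 2^|C| labelings. Hypotheses are dicts mapping points to {0,1}.
import itertools

def vc_dimension(domain, hypotheses):
    best = 0
    for k in range(1, len(domain) + 1):
        for C in itertools.combinations(domain, k):
            patterns = {tuple(h[x] for x in C) for h in hypotheses}
            if len(patterns) == 2 ** k:          # C is shattered by H
                best = k
                break
        else:
            return best                          # no set of size k is shattered; stop
    return best

# Threshold functions h_t(x) = 1[x >= t] over {0, 1, 2, 3}: VC-Dimension 1.
domain = [0, 1, 2, 3]
thresholds = [{x: int(x >= t) for x in domain} for t in range(5)]
print(vc_dimension(domain, thresholds))          # -> 1
```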
The Fundamental Theorem of Statistical Learning states that VC-Dimension characterizes PAC-learnability. In particular, a hypothesis class is PAC-learnable if and only if it has finite VC-Dimension.
Ultimately:
- PAC Learnability is not inconsistent with No Free Lunch Theorems.
- NFLs consider hypothesis classes of infinite VC-Dimension.
- PAC-Learning succeeds by incorporating an Inductive Bias.
Simplicity Bias
Part of the success of deep neural networks can be attributed to an inherent simplicity bias. Recent research provides empirical support:
- GPT-2 models (both pre-trained and randomly initialized) produce low-complexity bitstrings with higher probability (Goldblum et al., 2024).
- Deep networks are inductively biased to find lower “effective-rank” (\(-\sum{\tilde{\sigma_i}\ln(\tilde{\sigma_i})}\), the entropy of the normalized singular values) embeddings (Huh et al., 2022); see the sketch below.
- Although deep networks, which are often over-parametrized, can memorize noisy data, they tend to learn simple patterns first (as measured by fewer critical samples) (Arpit et al., 2017).
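For reference, the “effective-rank” quantity mentioned in the list above, the entropy of the normalized singular values of an embedding matrix, can be computed directly. The sketch below is illustrative only and is not the evaluation protocol of Huh et al.

```python
# Entropy of the normalized singular values of an embedding matrix ("effective rank"
# in the sense used above); a genuinely low-rank embedding scores markedly lower.
import numpy as np

def effective_rank_entropy(embeddings):
    """embeddings: (n_samples, dim) matrix of representations."""
    sigma = np.linalg.svd(embeddings, compute_uv=False)
    sigma_norm = sigma / sigma.sum()                 # normalized singular values
    sigma_norm = sigma_norm[sigma_norm > 0]          # drop exact zeros before taking logs
    return float(-(sigma_norm * np.log(sigma_norm)).sum())

rng = np.random.default_rng(0)
full_rank = rng.normal(size=(256, 64))
low_rank = rng.normal(size=(256, 4)) @ rng.normal(size=(4, 64))
print(effective_rank_entropy(full_rank), effective_rank_entropy(low_rank))
```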
References
Schaffer, C. (1994) ‘A conservation law for generalization performance’, Machine Learning Proceedings 1994, pp. 259–265. doi:10.1016/b978-1-55860-335-6.50039-8.
Shalev-Shwartz, S. and Ben-David, S. (2022) Understanding machine learning: From theory to algorithms. Cambridge: Cambridge University Press.
Valiant, L.G. (1984) ‘A theory of the learnable’, Proceedings of the sixteenth annual ACM symposium on Theory of computing - STOC ’84, pp. 436–445. doi:10.1145/800057.808710.
Wolpert, D.H. and Macready, W.G. (1997) ‘No free lunch theorems for optimization’, IEEE Transactions on Evolutionary Computation, 1(1), pp. 67–82. doi:10.1109/4235.585893.