A Universal Law of Robustness via Isoperimetry

src: (Bubeck & Sellke, 2021)

Here's another attempt to explain the overparameterisation enigma found in deep learning ($p \gg n$).

  • to fit (interpolate) $n$ data points of $d$ dimensions, one requires $nd$ parameters if one wants a smooth function (without the smoothness requirement, $n$ parameters should suffice) -- see the rough numbers after this list
  • thus, overparameterisation is necessary to ensure smoothness (which in turn implies robustness)
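
As a rough sense of scale (illustrative numbers only, assuming CIFAR-10-sized data; these figures are not from the paper), the two parameter counts differ by a factor of $d$:

$$
n = 5\times 10^4,\; d = 3{,}072
\;\Longrightarrow\;
\underbrace{n \approx 5\times 10^4}_{\text{bare interpolation}}
\quad\text{vs.}\quad
\underbrace{nd \approx 1.5\times 10^8}_{\text{smooth interpolation}}
\text{ parameters.}
$$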

Their theorem/proof works in the contrapositive: for any function class smoothly parameterised by $p$ parameters (each of size at most $\text{poly}(n,d)$) and any $d$-dimensional data distribution satisfying mild regularity (isoperimetry) conditions, any function in this class that fits the data below the noise level must have Lipschitz constant larger than $\sqrt{\dfrac{nd}{p}}$ (up to constants and logarithmic factors).

  • note that this result doesn't say that you'll fit a smooth function if you're overparameterising -- just that if you don't have enough parameters, then there's no way you're fitting a smooth one.

    • thus, in the context of scale (à la GPT-3), this result says that scale is necessary to achieve robustness (though not sufficient).
  • the noise level here is defined as the expected conditional variance $\sigma^2 := \mathbb{E}_{\mu}\left[\text{Var}(y \mid x)\right] > 0$. noise is necessary from a theoretical perspective: without it, a smooth function could in principle fit the data perfectly, and no such lower bound could hold.
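
To make the bound concrete, here is a minimal Python sketch that plugs a few parameter counts $p$ into $\sqrt{nd/p}$; the CIFAR-10-scale values of $n$ and $d$ are assumptions for illustration, not numbers from the paper:

```python
import math

def lipschitz_lower_bound(n: int, d: int, p: int) -> float:
    """Lower bound sqrt(n*d/p) (up to constants) on the Lipschitz constant
    of any p-parameter model fitting n d-dimensional points below the
    noise level, per Bubeck & Sellke (2021)."""
    return math.sqrt(n * d / p)

# Hypothetical CIFAR-10-scale sizes, purely illustrative.
n, d = 50_000, 3_072

for p in (n, 10 * n, n * d):
    bound = lipschitz_lower_bound(n, d, p)
    print(f"p = {p:>12,d}  ->  Lipschitz constant >~ {bound:.1f}")

# p = n    -> bound ~ sqrt(d) ~ 55: a smooth interpolant is impossible
# p = n*d  -> bound ~ 1: smoothness is no longer ruled out (though not guaranteed)
```

Only when $p$ is on the order of $nd$ does the bound fall to $O(1)$, matching the bullets above: scale is necessary for smoothness, not sufficient.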


  1. Bubeck, S. & Sellke, M., 2021. A Universal Law of Robustness via Isoperimetry. arXiv.org.