Overparameterized Regression
src: (Arora et al., 2018), section 3 on $\ell_p$ regression.
Summary
They show that in the simple scalar linear regression problem with $\ell_p$ loss, i.e. loss function
$$L(w) \;=\; \mathbb{E}_{(x,y)\sim S}\!\left[\tfrac{1}{p}\,\big(y - x^\top w\big)^p\right],$$
if you overparameterize ever so slightly, replacing $w$ by the product $w_1 w_2$ of a vector $w_1$ and a scalar $w_2$,
$$L(w_1, w_2) \;=\; \mathbb{E}_{(x,y)\sim S}\!\left[\tfrac{1}{p}\,\big(y - x^\top w_1 w_2\big)^p\right],$$
then gradient descent on $(w_1, w_2)$ behaves, in terms of the end-to-end parameter $w = w_1 w_2$, like an accelerated version of gradient descent: roughly, an update with an adaptive step size plus a momentum-like term that accumulates past gradients.
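As a quick sanity check of the idea (not the paper's own experiment), here is a minimal sketch comparing plain gradient descent on the $\ell_p$ loss with gradient descent on the overparameterized $w_1 w_2$ form; the synthetic data, $p = 4$, step size, and initialization are all illustrative assumptions.

```python
# Minimal sketch: vanilla GD on the l_p regression loss vs. GD on the
# overparameterized factorization w = w1 * w2 (w1 a vector, w2 a scalar).
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 200, 5, 4                      # p even so (y - x^T w)^p is a valid loss
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=n)

def loss(w):
    return np.mean((y - X @ w) ** p) / p

def grad(w):
    r = y - X @ w
    return -(X.T @ (r ** (p - 1))) / n   # gradient of (1/p) * mean(r^p) w.r.t. w

eta, T = 1e-3, 2000

# 1) vanilla gradient descent on w
w = np.zeros(d)
for _ in range(T):
    w -= eta * grad(w)

# 2) gradient descent on the overparameterized objective L(w1, w2) = L(w1 * w2)
w1, w2 = np.zeros(d), 1.0
for _ in range(T):
    g = grad(w1 * w2)                    # gradient w.r.t. the end-to-end parameter
    w1, w2 = w1 - eta * w2 * g, w2 - eta * (w1 @ g)

print("vanilla GD loss:        ", loss(w))
print("overparameterized loss: ", loss(w1 * w2))
```

Note that the overparameterized version is still linear in its end-to-end behavior; the only change is in the optimization dynamics, which is exactly the point of the result.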
Thoughts
I remember Arora presenting this particular result (or something like it) a few years back, and finding it very intriguing. I think this goes nicely with the theme of [[statistics-vs-ml]], though from a slightly different angle. Essentially: we statisticians are afraid of overparameterization, because it removes specificity and leads to ambiguity. But when it comes to these gradient methods, it actually helps to overparameterize!
- Arora, S., Cohen, N. & Hazan, E., 2018. On the optimization of deep networks: Implicit acceleration by overparameterization. In 35th International Conference on Machine Learning, ICML 2018, pp. 372–389.