Been saying this for years: our assumptions about optimization and emergent behavior are oversimplified. Excited to see some real analysis on the interplay between gradient descent and normalization.
https://www.reddit.com/user/GeorgeBird1