Been saying this for years: our assumptions about optimization and emergent behavior are oversimplified. Excited to see some real analysis on the interplay between gradient descent and normalization. https://www.reddit.com/user/GeorgeBird1