I analyzed some decoder-only transformer models using Lyapunov spectral analysis and found that the per-layer ratio of the MLP and attention spectral norms is a strong indicator of whether a model's representations will collapse to rank-1 by the final layers.
In my experiments, keeping this spectral ratio roughly in the 0.5–2 range was what kept the model stable through the final layers.
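For anyone who wants to try the measurement themselves, here is a minimal sketch of the per-layer spectral-ratio check. The post doesn't specify exactly which weight matrices are compared, so this is an assumption on my part: it uses GPT-2 from Hugging Face `transformers`, takes the attention output projection (`attn.c_proj`) and the MLP output projection (`mlp.c_proj`) as the representative matrices, and defines the spectral norm as the largest singular value. See the repo below for the actual method.

```python
# Sketch: per-layer MLP/attention spectral-norm ratio for GPT-2.
# Assumptions (not from the post): c_proj weights stand in for the
# attention and MLP blocks, and "spectral norm" = largest singular value.
import torch
from transformers import GPT2Model

model = GPT2Model.from_pretrained("gpt2")

def spectral_norm(weight: torch.Tensor) -> float:
    # Matrix 2-norm of a 2-D weight tensor, i.e. its largest singular value.
    return torch.linalg.matrix_norm(weight, ord=2).item()

for i, block in enumerate(model.h):
    attn_norm = spectral_norm(block.attn.c_proj.weight)
    mlp_norm = spectral_norm(block.mlp.c_proj.weight)
    ratio = mlp_norm / attn_norm
    flag = "" if 0.5 <= ratio <= 2.0 else "  <-- outside 0.5-2 band"
    print(f"layer {i:2d}: MLP/attn spectral ratio = {ratio:.3f}{flag}")
```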
Github repo: https://github.com/yousef-rafat/the-1-1-rule




