
Is the Transformer Architecture Just One of a Set of Equivalent Architectures?

By Bbenzon @bbenzon

This strengthens my hypothesis that there is a large set of equivalent neural architectures of which the original transformer is just one sample. https://t.co/rW5eGrbSn0

— Richard Socher (@RichardSocher) March 14, 2025

We found a surprisingly simple alternative to normalization layers:
the scaled tanh function (yes, we go back to the 80s).
We call it Dynamic Tanh, or DyT. pic.twitter.com/0sZ44mbZHR

— Zhuang Liu (@liuzhuang1234) March 14, 2025
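For readers who want to see what that looks like concretely, here is a minimal sketch of a Dynamic Tanh layer as described in the thread: an element-wise tanh(αx) with a learnable scalar α, followed by the usual learnable per-channel scale and shift. The class name and initialization values below are illustrative, not the paper's reference code.

```python
import torch
import torch.nn as nn

class DynamicTanh(nn.Module):
    """Sketch of a Dynamic Tanh (DyT) layer: y = gamma * tanh(alpha * x) + beta.

    alpha is a learnable scalar; gamma and beta are learnable per-channel
    parameters, mirroring the affine terms of a normalization layer.
    (Initialization values here are illustrative.)
    """
    def __init__(self, dim: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1) * alpha_init)  # learnable scalar
        self.gamma = nn.Parameter(torch.ones(dim))             # per-channel scale
        self.beta = nn.Parameter(torch.zeros(dim))             # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Purely element-wise: no mean or variance statistics are computed.
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```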

Therefore, we replace norm layers with the proposed Dynamic Tanh (DyT) layer, and it is really simple: pic.twitter.com/qWToAhEmWX

— Zhuang Liu (@liuzhuang1234) March 14, 2025
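The swap the tweet is pointing at is essentially drop-in: wherever a transformer block applies LayerNorm (or RMSNorm), a DyT layer of the same width goes in its place. Below is a hedged sketch of a generic pre-norm block with that substitution, assuming the DynamicTanh class from the previous sketch is defined in the same file; the layer sizes are placeholders, and this is not the paper's actual training code.

```python
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Generic pre-norm transformer block with its norm layers swapped for DyT.

    Structurally identical to a standard block; only the normalization
    modules change. Hyperparameters are placeholders.
    """
    def __init__(self, dim: int = 512, heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = DynamicTanh(dim)   # was: nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = DynamicTanh(dim)   # was: nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual + attention
        x = x + self.mlp(self.norm2(x))                     # residual + MLP
        return x
```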

DyT is faster than RMSNorm (common in frontier LLMs) on H100s pic.twitter.com/i7zgVTYKLB

— Zhuang Liu (@liuzhuang1234) March 14, 2025
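A plausible reason for the speed claim: DyT is purely element-wise, while RMSNorm has to reduce over the hidden dimension (a root mean square per token) before it can rescale anything. The rough micro-benchmark below sketches that comparison, assuming a CUDA-capable GPU and PyTorch 2.4+ for torch.nn.RMSNorm; the shapes and any resulting numbers are illustrative, not the figures reported in the thread.

```python
import torch
import torch.nn as nn

def time_fn(fn, x, iters=100):
    """Crude CUDA timing helper: warm up, then average milliseconds per call."""
    for _ in range(10):
        fn(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

if __name__ == "__main__":
    dim, tokens = 4096, 8192  # illustrative LLM-ish shapes
    x = torch.randn(tokens, dim, device="cuda", dtype=torch.bfloat16)

    # RMSNorm: needs a mean-of-squares reduction over `dim` for every token.
    rms = nn.RMSNorm(dim).to(device="cuda", dtype=torch.bfloat16)

    # DyT: element-wise only, no per-token statistics.
    alpha = torch.tensor(0.5, device="cuda", dtype=torch.bfloat16)
    gamma = torch.ones(dim, device="cuda", dtype=torch.bfloat16)
    beta = torch.zeros(dim, device="cuda", dtype=torch.bfloat16)

    def dyt(t):
        return gamma * torch.tanh(alpha * t) + beta

    print(f"RMSNorm: {time_fn(rms, x):.4f} ms/call")
    print(f"DyT:     {time_fn(dyt, x):.4f} ms/call")
```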

Normalization layers have always been one of the more mysterious aspects of deep learning for me, and this work has given me a better understanding of their role.
Given that model training and inference can require tens of millions in compute, DyT has the potential to contribute…

— Zhuang Liu (@liuzhuang1234) March 14, 2025
