If you look at their excellent paper & code, the reward model is a logical function that was handcrafted & progammed by engineers.
— Chomba Bupe (@ChombaBupe) February 1, 2025
DeepSeek RL approach is impressive in the sense that it reduces the need for tedious supervised fine tuning (SFT) but isn't really general.
Culture Magazine
Author's Latest Articles
-
Groupthink Drove Yann LeCun out of Meta
-
Some More Flowers
-
Effing the Ineffable
-
Trees in Hoboken, 11th Street
