
Carefully Curating Your Data Makes for More Efficient Machine Learning

By Bbenzon @bbenzon

1/ Is scale all you need for AGI? (unlikely). But our new paper "Beyond neural scaling laws: beating power law scaling via data pruning" shows how to achieve much superior exponential decay of error with dataset size rather than slow power law neural scaling https://t.co/Vn62UJXGTd pic.twitter.com/vVt4xDBcr7

— Surya Ganguli (@SuryaGanguli) June 30, 2022
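
The contrast being drawn is between the usual power-law relationship between test error and training-set size and the faster, exponential decay the paper reports when examples are selected with a good pruning metric. Schematically (the constants a, b, c and the exponent ν below are placeholders, not values from the paper):

    E_{\text{random}}(N) \approx a\, N^{-\nu}
    \qquad \text{vs.} \qquad
    E_{\text{pruned}}(N) \approx b\, e^{-cN}

Because the measured exponent ν is typically small, halving the error under the power law demands a large multiplicative increase in data, while under exponential decay it demands only an additive one.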

Later in the thread:

6/ Overall this work suggests that our current ML practice of collecting large amounts of random data is highly inefficient, leading to huge redundancy in the data, which we show mathematically is the origin of very slow, unsustainable power law scaling of error with dataset size pic.twitter.com/2aNv0ssb9S

— Surya Ganguli (@SuryaGanguli) June 30, 2022
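
In practice, pruning that redundancy means ranking examples with a per-example difficulty or informativeness metric (the paper evaluates several, including a self-supervised one) and keeping only the most useful fraction. Here is a minimal sketch of the selection step in Python, assuming the scores have already been computed by whatever metric you choose; prune_by_score, its arguments, and the toy scores are illustrative names and numbers, not code from the paper.

    import numpy as np

    def prune_by_score(scores: np.ndarray, keep_frac: float, keep_hard: bool = True) -> np.ndarray:
        """Return the indices of examples to retain.

        scores    : per-example difficulty, higher = harder, shape (N,)
        keep_frac : fraction of the dataset to keep, in (0, 1]
        keep_hard : keep the hardest examples (the paper finds this helps when data
                    is plentiful, while keeping the easiest helps when data is scarce)
        """
        n_keep = max(1, int(round(keep_frac * len(scores))))
        order = np.argsort(scores)  # sorted from easiest to hardest
        return order[-n_keep:] if keep_hard else order[:n_keep]

    # Toy usage: keep the hardest 30% of a randomly scored dataset.
    rng = np.random.default_rng(0)
    scores = rng.random(10_000)  # stand-in for a real pruning metric
    kept = prune_by_score(scores, keep_frac=0.3)
    print(f"{len(kept)} of {len(scores)} examples retained")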

8/ Indeed, the initial computational cost of creating a foundation dataset through data pruning can be amortized across efficiency gains in training many downstream models, just as the initial cost of training foundation models is amortized across faster fine-tuning on many tasks

— Surya Ganguli (@SuryaGanguli) June 30, 2022
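
The amortization point is simple arithmetic: the one-time cost of scoring and pruning a dataset is spread over every downstream model later trained on it. A back-of-the-envelope sketch with made-up numbers (none of these figures come from the paper):

    # Hypothetical figures, chosen only to illustrate the amortization argument.
    pruning_cost = 500.0      # one-time GPU-hours to score and prune the dataset
    saving_per_model = 100.0  # GPU-hours saved by each downstream run on the smaller set
    n_models = 20             # downstream models trained on the same pruned dataset

    amortized_cost = pruning_cost / n_models                  # 25 GPU-hours per model
    net_saving = n_models * saving_per_model - pruning_cost   # 1500 GPU-hours overall
    break_even = pruning_cost / saving_per_model              # pruning pays off after 5 models
    print(amortized_cost, net_saving, break_even)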
