Training optimization - a galois77 Collection

updated Sep 2

The Curse of Depth in Large Language Models

Paper • 2502.05795 • Published Feb 9 • 40

Transformers without Normalization

Paper • 2503.10622 • Published Mar 13 • 170

Parallel Scaling Law for Language Models

Paper • 2505.10475 • Published May 15 • 83

Learning to Skip the Middle Layers of Transformers

Paper • 2506.21103 • Published Jun 26 • 18

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published Jun 26 • 75

All is Not Lost: LLM Recovery without Checkpoints

Paper • 2506.15461 • Published Jun 18 • 37

TiKMiX: Take Data Influence into Dynamic Mixture for Language Model Pre-training

Paper • 2508.17677 • Published Aug 25 • 14