⛳ Optimizer: What Does It Do and Why We Need It

Did you know? Training a large language model like GPT with Adam means storing three full-precision values for every parameter: one for the parameter itself and two for the optimizer's state. For a 7B-parameter model, that's not 28 GB but closer to 84 GB once the optimizer state is counted. Naturally, optimizers must be doing something important to justify such a high cost.

Training a neural network is fundamentally a search problem: we're trying to find the best possible set of weights that minimize our loss function. But this isn't a simple downhill walk—the loss landscape is filled with valleys, plateaus, and treacherous terrain that can trap your training.

The Basic Problem: Vanilla Gradient Descent

Imagine you're trying to find the lowest point in a hilly landscape while blindfolded. The simplest approach: feel the slope beneath your feet and take a step downhill. This is **Stochastic Gradient Descent (SGD)**—at each step, calculate the gradient and move in the direction of steepest descent.

    Loss
      ↑
      │     You are here
      │         ●
      │        ╱ ╲
      │       ╱   ╲      
      │      ╱     ╲___
      │     ╱          ╲___
      │____╱               ╲___
      └──────────────────────────→ Parameters

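In code, one SGD step is a single line per parameter. Here is a minimal NumPy sketch on a toy one-dimensional loss (the quadratic loss and learning rate are illustrative choices, not anything specific to the models discussed here):

```python
import numpy as np

def loss(theta):
    return (theta - 3.0) ** 2         # toy loss: a bowl with its minimum at theta = 3

def grad(theta):
    return 2.0 * (theta - 3.0)        # gradient of that loss

theta = np.array(10.0)                # arbitrary starting point
lr = 0.1                              # learning rate (step size)

for step in range(50):
    theta = theta - lr * grad(theta)  # move against the gradient: steepest descent

print(theta, loss(theta))             # theta ends up very close to 3.0
```
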
But the real landscape is much more difficult.

Problem 1: Stuck in Shallow Valleys

Pure gradient descent walks downhill until every direction around it points uphill, then stops. It has found a minimum, but not necessarily the minimum. In high-dimensional spaces with billions of parameters, there are countless such traps.

(Note: Modern research suggests that for overparameterized networks, saddle points—not local minima—are the main obstacle. But the intuition remains: vanilla SGD gets stuck in places it shouldn't.)

Problem 2: Thrashing in Narrow Ravines

      │  ╲                              ╱
      │   ╲        ●→               ←●  ╱
      │    ╲         ●→           ←●   ╱
      │     ╲          ●→       ←●    ╱
      │      ╲           ●→   ←●     ╱
      │       ╲            ●←●      ╱
      │        ╲_____    ↓↓↓↓   ___╱
      │              ╲__________╱
      └────────────────────────────────→
           Steep      Wasted energy     Slow progress
           walls      zigzagging!       along valley

In dimensions with steep curvature perpendicular to the path forward, SGD bounces back and forth. The gradient keeps pointing at the walls instead of down the valley. Progress toward the actual optimum is painfully slow.

This happens all the time in neural networks—think of a weight that interacts strongly with many others versus one that's relatively independent.
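
To make the ravine concrete, here is a tiny sketch on an ill-conditioned quadratic, f(x, y) = 0.5 * (100 * x² + y²), where x is the steep cross-valley direction and y is the gentle along-valley direction (the curvatures and learning rate are illustrative assumptions):

```python
import numpy as np

# Toy ravine: loss f(x, y) = 0.5 * (100 * x**2 + y**2).
# x is the steep "wall" direction, y is the gentle "valley floor" direction.
def grad(theta):
    x, y = theta
    return np.array([100.0 * x, y])

theta = np.array([1.0, 10.0])
lr = 0.019   # just below the stability limit (2/100) set by the steep direction

for step in range(8):
    theta = theta - lr * grad(theta)
    print(step, theta)

# x flips sign every step (zigzagging against the walls) while y shrinks by
# only ~2% per step (slow progress along the valley floor).
```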

Problem 3: Glacial Progress on Plateaus

      │
      │           ● → ● → ● → ● → ● → ●
      │  ________________________________________
      │                                         ╲
      │                                          ╲
      │                                           ╲___
      └────────────────────────────────────────────────→
           Tiny gradient = tiny, tiny steps
           (this could take millions of iterations)

On plateaus or in regions with very small gradients, vanilla SGD crawls. The gradient tells you which way is down, but when the slope is nearly zero, you barely move. Training can get stuck here for thousands of iterations.

Real neural networks have lots of these flat regions, especially early in training when weights are randomly initialized.

Solution 1: Momentum (Remember Where You Came From)

(Historical note: The term "optimizer" emerged in the machine learning community around the late 1980s and early 1990s as training algorithms became more sophisticated. Early papers simply referred to "gradient descent" or "learning rules," but as methods like momentum and adaptive learning rates became standard, we needed a term for these complete update strategies. By the time libraries like Theano and later PyTorch/TensorFlow arrived, "optimizer" was the standard abstraction—an object that encapsulates the entire parameter update logic.)

Here's the first key insight from physics: objects in motion tend to stay in motion.

SGD with Momentum doesn't just look at the current gradient—it accumulates a moving average of recent gradients. Think of it roughly like this: 90% of your previous velocity, plus 10% new gradient information. (The exact math uses exponential moving averages, but this gives you the feel.)

This solves several problems at once:

Escaping shallow regions: If you've been moving in a direction and suddenly hit an uphill slope, momentum carries you forward anyway. You have "speed" built up.

Smoothing through ravines: When you bounce left, then right, then left again, the velocity in those directions cancels out. But velocity along the valley floor accumulates. You stop fighting the walls and start making forward progress.

Pushing through plateaus: Even with tiny gradients, your accumulated velocity keeps you moving. You don't get stuck the moment the gradient shrinks.

The key intuition: momentum makes optimization decisions based on where you've been recently, not just where you are right now.
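
In code, this is one extra state variable per parameter. Here is a minimal sketch of SGD with momentum, using the same 90%/10% split described above (real libraries differ slightly in how they weight the new gradient, but the structure is the same):

```python
import numpy as np

def momentum_step(theta, velocity, grad, lr=0.1, beta=0.9):
    """One SGD-with-momentum step (illustrative hyperparameters).

    velocity is an exponential moving average of recent gradients, so the
    direction we move depends on where we've been, not just the current slope.
    """
    velocity = beta * velocity + (1 - beta) * grad  # 90% old velocity + 10% new gradient
    theta = theta - lr * velocity                   # step along the accumulated direction
    return theta, velocity
```

On the ravine from Problem 2, the alternating left-right gradient components largely cancel inside velocity, while the consistent along-valley component survives, so the zigzag damps out and forward progress speeds up.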

Solution 2: Adaptive Learning Rates (Different Terrain Needs Different Stride)

But momentum alone isn't enough. Here's the second problem: different parameters live in completely different terrain.

    Parameter A (steep)     Parameter B (gentle)
    
      │    ╱╲                  │
      │   ╱  ╲                 │
      │  ╱    ╲                │
      │_╱______╲___            │_______________╲___
      └───────────→            └────────────────────→

If you use the same step size (learning rate) for both:

  • Too large → you overshoot and bounce on parameter A
  • Too small → you barely move on parameter B

The insight: Adapt the learning rate for each parameter based on how much it's been changing recently.

This is where RMSProp (Root Mean Square Propagation) enters the picture. Instead of just tracking the gradient direction (first moment), RMSProp tracks something like the magnitude of recent gradients (loosely speaking, a second moment—the mean of squared gradients).

Parameter A: Large, volatile gradients recently
→ RMSProp uses SMALLER effective learning rate (careful on steep terrain)

Parameter B: Small, gentle gradients recently  
→ RMSProp uses LARGER effective learning rate (confident on gentle terrain)

Think of it as automatic terrain detection. Parameters that have been experiencing large gradient swings get treated cautiously. Parameters that have been stable get pushed harder.
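
A sketch of the RMSProp update itself (standard form; the decay rate and epsilon below are the usual illustrative defaults):

```python
import numpy as np

def rmsprop_step(theta, sq_avg, grad, lr=0.001, beta=0.9, eps=1e-8):
    """One RMSProp step.

    sq_avg is a moving average of squared gradients: a per-parameter estimate
    of how volatile the gradient has been recently.
    """
    sq_avg = beta * sq_avg + (1 - beta) * grad ** 2      # second moment (magnitude history)
    theta = theta - lr * grad / (np.sqrt(sq_avg) + eps)  # volatile parameter -> smaller step
    return theta, sq_avg
```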

Adam Optimizer and Its Cost

Adam (Adaptive Moment Estimation) brought these two insights together and quickly became the default choice for training neural networks:

  1. First moment (momentum): Track where you've been moving → which direction to go
  2. Second moment (RMSProp-style): Track how volatile gradients have been → how big a step to take

The result: Adam automatically takes small, careful steps on steep or noisy terrain and large, confident steps on gentle terrain—simultaneously, independently, for every single parameter.
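
Putting the two together gives the Adam update from Kingma & Ba (2014); this sketch uses the paper's standard hyperparameter defaults:

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for a parameter tensor. t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad        # first moment: which way have we been moving?
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment: how volatile has this parameter been?
    m_hat = m / (1 - beta1 ** t)              # bias correction (m and v start at zero)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

Notice the two extra arrays, m and v, that must be carried along for every parameter. That is exactly where the memory cost discussed below comes from.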

This is why the Adam family (AdamW and others) became the default for deep learning: these optimizers work well across wildly different architectures with minimal tuning, handle parameters that barely change alongside parameters that change drastically, and converge reliably even when billions of parameters create an incomprehensibly complex loss landscape.

But they come with a price.

For every parameter θ in your model, Adam needs to track:

  • The parameter value itself: θ
  • First moment (momentum): m
  • Second moment (magnitude history): v

That's roughly three times the memory compared to just storing parameters.

7B parameter model:

Parameters:        ████████████████  ~28 GB
First moment:      ████████████████  ~28 GB
Second moment:     ████████████████  ~28 GB
                   ────────────────────────
Total:             ~84 GB
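
The arithmetic behind that bar chart, assuming full precision (4 bytes per value) and ignoring gradients and activations, which add even more:

```python
params = 7e9                  # 7B parameters
bytes_per_value = 4           # full precision (fp32)

weights = params * bytes_per_value     # ~28 GB
adam_state = 2 * weights               # first moment m + second moment v, ~56 GB

print((weights + adam_state) / 1e9, "GB")  # ~84.0 GB
```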

Looking Ahead

Now we arrive at a natural question: Could we build an optimizer that matches the performance of Adam but uses less memory?

The answer is yes, but it requires rethinking what momentum really means and finding a completely different way to achieve adaptive behavior. Enter the Muon optimizer. Stay tuned!

Further Reading

  • Adam: A Method for Stochastic Optimization (Kingma & Ba, 2014) – The original paper
  • An Overview of Gradient Descent Optimization Algorithms (Ruder, 2016) – Comprehensive survey
  • Decoupled Weight Decay Regularization (Loshchilov & Hutter, 2019) – The AdamW paper
  • Why Momentum Really Works (Goh, 2017) at distill.pub/2017/momentum – Interactive demonstrations

Note: Throughout this article, we've used physics-inspired language (momentum, velocity, terrain) to build intuition. These are mathematical algorithms, not physical systems—the "momentum" is an exponential moving average, not actual inertia. But the metaphors help us understand the behavior and design principles behind these crucial tools.
