Blog Post Plan: The Silent Killer — How ReLU Murdered My Bounding Boxes

Hook

A single dead neuron in a 4-channel output head caused every bounding box to lose its left edge. The model trained for hours, the loss converged, and inference produced confident detections — but every box only covered the right half of each person. Here’s how I found it, why it happened, and three ways to prevent it.

1. The Setup (~200 words)

  • What is TTFNet? Anchor-free detector, heatmap + 4-channel offset regression (left, top, right, bottom)
  • The lightweight twist: using MobileNetV3-Small instead of the original ResNet-18 + Deformable Conv backbone
  • The offset head: wh = F.relu(self.wh_head(x)) * Config.wh_offset_base
  • Why ReLU? Offsets are distances from center to edges — should always be positive
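For concreteness, a minimal sketch of what this head could look like (the `OffsetHead` class name, layer widths, and default `wh_offset_base` value are illustrative; the `wh_head` attribute, the ReLU, and the four-channel output follow the post):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class OffsetHead(nn.Module):
    """Illustrative offset head: 3x3 conv, ReLU, then a 1x1 conv down to the
    four offset channels (left, top, right, bottom)."""

    def __init__(self, in_channels: int = 96, wh_offset_base: float = 16.0):
        super().__init__()
        self.wh_offset_base = wh_offset_base
        self.wh_head = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 4, kernel_size=1),  # one channel per edge offset
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The ReLU that keeps offsets non-negative -- and hides the bug
        return F.relu(self.wh_head(x)) * self.wh_offset_base
```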

2. The Symptom (~150 words + images)

  • Show the broken inference result (boxes only covering right half of people)
  • “The model is confident and boxes are at the right locations… but they’re all too narrow”
  • Initial suspicion: decoding bug? scale mismatch? wrong channel order?

Reproduction Steps

git clone <repo> && cd ttfnet
git checkout dead-relu-demo  # branch with the original bug
uv sync
./run.sh  # download dataset + train
# After ~5 epochs, run inference:
uv run python eval.py --checkpoint checkpoints/ttfnet-epoch=05-*.ckpt --image WiderPerson/Images/000089.jpg
# Observe: boxes only cover right half of objects

3. The Investigation (~400 words)

Step 1: Verify the decode logic

  • Trace get_bboxes: x1 = center_x - wh[0], x2 = center_x + wh[2]
  • Compare against original TTFNet source — identical
  • Compare against training loss computation — consistent
  • Conclusion: decode is correct
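A hypothetical re-implementation of that decode step, using the same channel convention (`decode_boxes` and its signature are illustrative, not the repo's actual API):

```python
import torch


def decode_boxes(centers: torch.Tensor, wh: torch.Tensor) -> torch.Tensor:
    """centers: (N, 2) as (cx, cy); wh: (N, 4) as (left, top, right, bottom).
    Each wh channel is a distance from the center to one box edge."""
    x1 = centers[:, 0] - wh[:, 0]
    y1 = centers[:, 1] - wh[:, 1]
    x2 = centers[:, 0] + wh[:, 2]
    y2 = centers[:, 1] + wh[:, 3]
    return torch.stack([x1, y1, x2, y2], dim=1)


# With channel 0 dead (left offset == 0), x1 collapses onto the center:
centers = torch.tensor([[100.0, 50.0]])
wh_dead = torch.tensor([[0.0, 30.0, 25.0, 40.0]])
print(decode_boxes(centers, wh_dead))  # x1 == 100.0 == center_x
```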

Step 2: Dump raw predictions

# Debug script output:
wh channel 0 (left):   min=0.000, max=0.000, mean=0.000  ← DEAD!
wh channel 1 (top):    min=7.1,   max=210.1, mean=73.4
wh channel 2 (right):  min=5.0,   max=67.4,  mean=26.5
wh channel 3 (bottom): min=12.5,  max=225.2, mean=59.7
  • Channel 0 is always exactly zero → x1 = center_x - 0 = center_x
  • That’s why boxes start at center and extend only right
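A minimal version of such a debug dump, assuming `wh` is the raw (B, 4, H, W) head output (`dump_wh_stats` is a hypothetical helper, not from the repo):

```python
import torch


def dump_wh_stats(wh: torch.Tensor) -> None:
    """Print per-channel min/max/mean and flag channels stuck at zero."""
    names = ["left", "top", "right", "bottom"]
    for ch, name in enumerate(names):
        c = wh[:, ch]
        flag = "  ← DEAD!" if c.abs().max().item() == 0 else ""
        print(f"wh channel {ch} ({name}): min={c.min().item():.3f}, "
              f"max={c.max().item():.3f}, mean={c.mean().item():.3f}{flag}")


wh = torch.rand(1, 4, 8, 8) * 50
wh[:, 0] = 0.0  # simulate the dead left channel
dump_wh_stats(wh)
```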

Step 3: Inspect the weights

Channel 0: weight std=0.0004, bias=-0.004  ← never learned
Channel 1: weight std=0.221,  bias=0.660   ← healthy
Channel 2: weight std=0.083,  bias=0.518   ← healthy
Channel 3: weight std=0.215,  bias=0.902   ← healthy
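A sketch of how those numbers could be produced, assuming the final layer of `wh_head` is a Conv2d with one output channel per offset (the 0.01 threshold is an illustrative cutoff, not from the repo):

```python
import torch
import torch.nn as nn


def inspect_head(conv: nn.Conv2d) -> None:
    """Print per-output-channel weight std and bias, flagging dead channels."""
    for ch in range(conv.out_channels):
        std = conv.weight[ch].std().item()
        bias = conv.bias[ch].item()
        verdict = "never learned" if std < 0.01 else "healthy"
        print(f"Channel {ch}: weight std={std:.4f}, bias={bias:+.3f}  ← {verdict}")


inspect_head(nn.Conv2d(64, 4, kernel_size=1))
```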

4. The Root Cause (~300 words)

The Dead ReLU Problem

  • Initialization: normal_init(std=0.001) → weights ≈ 0, bias = 0
  • Pre-activation output hovers around 0
  • If channel 0 drifts slightly negative early in training → ReLU clips to 0 → gradient = 0 → permanently dead
  • Classic “dying ReLU” problem, but manifesting at the output channel level
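The mechanism can be demonstrated in a few lines:

```python
import torch
import torch.nn.functional as F

# A slightly negative pre-activation is clipped to zero by ReLU, and the
# gradient flowing back through it is exactly zero as well.
x = torch.tensor([-0.002], requires_grad=True)  # pre-activation of channel 0
y = F.relu(x)
y.sum().backward()
print(y.item(), x.grad.item())  # → 0.0 0.0: no learning signal reaches x
```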

Why did it work in the original TTFNet?

  • Original uses ResNet-18 backbone + Modulated Deformable Convolutions → richer features, stronger gradients
  • MobileNetV3-Small produces weaker features → more prone to dead neurons in downstream heads
  • It’s a training dynamics issue, not a math issue — the architecture is correct

Diagram: Gradient Flow Through ReLU

Pre-activation: -0.002 (slightly negative due to random init)
         ↓
     ReLU(x) = 0
         ↓
   Gradient = 0 (no learning signal)
         ↓
   Weight stays near 0 forever
         ↓
   Channel is dead ☠️

5. Three Fixes (~300 words)

Fix A: Leaky ReLU (activation swap)

wh = F.leaky_relu(self.wh_head(x)) * Config.wh_offset_base
  • Passes 1% of the gradient for negative inputs (default negative_slope=0.01)
  • Offsets can go slightly negative, but downstream clamp handles it
  • Tradeoff: minor departure from original paper

Fix B: Positive bias initialization (init fix)

nn.init.constant_(self.wh_head[-1].bias, 0.1)
  • Keeps original ReLU
  • Pre-activation starts at ~0.1 (safely positive) → ReLU never clips it
  • Tradeoff: initial predictions are biased, but training corrects it quickly
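A quick sanity check of Fix B in isolation, assuming a 1×1 Conv2d final layer (the layer shape is illustrative): with std=0.001 weights over 64 inputs the noise term is ~0.008, so a +0.1 bias puts essentially every pre-activation on the safe side of ReLU at initialization.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(64, 4, kernel_size=1)
nn.init.normal_(conv.weight, std=0.001)   # the post's normal_init(std=0.001)
nn.init.constant_(conv.bias, 0.1)          # Fix B: start safely positive
pre = conv(torch.randn(1, 64, 16, 16))
print((pre > 0).float().mean().item())     # ≈ 1.0: almost nothing starts clipped
```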

Fix C: Softplus (smooth activation)

wh = F.softplus(self.wh_head(x)) * Config.wh_offset_base
  • softplus(x) = log(1 + exp(x)) — smooth approximation of ReLU
  • Never has zero gradient, always positive output
  • Tradeoff: slightly more compute than ReLU
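A quick check of the gradient at a slightly negative pre-activation:

```python
import torch
import torch.nn.functional as F

# softplus'(x) = sigmoid(x), which is never zero, so a channel that drifts
# negative still receives a learning signal and can recover.
x = torch.tensor([-0.002], requires_grad=True)
F.softplus(x).sum().backward()
print(f"{x.grad.item():.4f}")  # sigmoid(-0.002) ≈ 0.4995
```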

Comparison Table

| Fix                    | Matches Paper | Prevents Dead Neurons | Output Always Positive | Compute Cost  |
|------------------------|---------------|-----------------------|------------------------|---------------|
| ReLU (original)        | ✅            | ❌                    | ✅                     | Cheapest      |
| Leaky ReLU             | ❌            | ✅                    | ❌                     | Same          |
| Positive bias init     | ✅            | ✅ (probabilistic)    | ✅                     | Same          |
| Softplus               | ❌            | ✅                    | ✅                     | Slightly more |
| Leaky ReLU + bias init | ❌            | ✅✅                  | ❌                     | Same          |

6. The Broader Lesson (~200 words)

  • “Dying ReLU” is a well-known problem, but usually discussed in hidden layers. Here it killed an output channel
  • Lightweight backbones make downstream heads more fragile
  • Debugging tip: when bounding boxes are systematically shifted, dump per-channel statistics before checking the decode logic
  • Prevention: always verify that all output channels have non-zero statistics after a few training iterations — add this as a sanity check callback

7. Bonus: A PyTorch Lightning Callback for Early Detection (~100 words)

import torch
import pytorch_lightning as pl

class DeadChannelDetector(pl.Callback):
    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if batch_idx == 100:  # one-off check after 100 iterations
            with torch.no_grad():
                layer = pl_module.model.head.wh_head[-1]
                for ch in range(layer.out_channels):
                    w_std = layer.weight[ch].std().item()
                    # dead channels have weights *and* bias pinned near zero
                    if w_std < 0.01 and abs(layer.bias[ch].item()) < 0.01:
                        print(f"⚠️ wh_head channel {ch} may be dead (weight std={w_std:.4f})")

Metadata

  • Estimated length: ~1800 words + code blocks + images
  • Target audience: ML practitioners doing object detection
  • Tags: PyTorch, object detection, debugging, ReLU, weight initialization, TTFNet