Blog Post Plan: The Silent Killer — How ReLU Murdered My Bounding Boxes

Hook

A single dead neuron in a 4-channel output head caused every bounding box to lose its left edge. The model trained for hours, the loss converged, and inference produced confident detections — but every box only covered the right half of each person. Here’s how I found it, why it happened, and three ways to prevent it.

1. The Setup (~200 words)

  • What is TTFNet? Anchor-free detector, heatmap + 4-channel offset regression (left, top, right, bottom)
  • The lightweight twist: using MobileNetV3-Small instead of the original ResNet-18 + Deformable Conv backbone
  • The offset head: wh = F.relu(self.wh_head(x)) * Config.wh_offset_base
  • Why ReLU? Offsets are distances from center to edges — should always be positive
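For concreteness, a minimal sketch of what this head could look like (the `OffsetHead` class name, layer widths, and default `wh_offset_base` value are illustrative; the `wh_head` attribute, the ReLU, and the four-channel output follow the post):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class OffsetHead(nn.Module):
    """Illustrative offset head: 3x3 conv, ReLU, then a 1x1 conv down to the
    four offset channels (left, top, right, bottom)."""

    def __init__(self, in_channels: int = 96, wh_offset_base: float = 16.0):
        super().__init__()
        self.wh_offset_base = wh_offset_base
        self.wh_head = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 4, kernel_size=1),  # one channel per edge offset
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The ReLU that keeps offsets non-negative -- and hides the bug
        return F.relu(self.wh_head(x)) * self.wh_offset_base
```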

2. The Symptom (~150 words + images)

  • Show the broken inference result (boxes only covering right half of people)
  • “The model is confident and boxes are at the right locations… but they’re all too narrow”
  • Initial suspicion: decoding bug? scale mismatch? wrong channel order?

Reproduction Steps

git clone <repo> && cd ttfnet
git checkout dead-relu-demo  # branch with the original bug
uv sync
./run.sh  # download dataset + train
# After ~5 epochs, run inference:
uv run python eval.py --checkpoint checkpoints/ttfnet-epoch=05-*.ckpt --image WiderPerson/Images/000089.jpg
# Observe: boxes only cover right half of objects

3. The Investigation (~400 words)

Step 1: Verify the decode logic

  • Trace get_bboxes: x1 = center_x - wh[0], x2 = center_x + wh[2]
  • Compare against original TTFNet source — identical
  • Compare against training loss computation — consistent
  • Conclusion: decode is correct
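A hypothetical re-implementation of that decode step, using the same channel convention (`decode_boxes` and its signature are illustrative, not the repo's actual API):

```python
import torch


def decode_boxes(centers: torch.Tensor, wh: torch.Tensor) -> torch.Tensor:
    """centers: (N, 2) as (cx, cy); wh: (N, 4) as (left, top, right, bottom).
    Each wh channel is a distance from the center to one box edge."""
    x1 = centers[:, 0] - wh[:, 0]
    y1 = centers[:, 1] - wh[:, 1]
    x2 = centers[:, 0] + wh[:, 2]
    y2 = centers[:, 1] + wh[:, 3]
    return torch.stack([x1, y1, x2, y2], dim=1)


# With channel 0 dead (left offset == 0), x1 collapses onto the center:
centers = torch.tensor([[100.0, 50.0]])
wh_dead = torch.tensor([[0.0, 30.0, 25.0, 40.0]])
print(decode_boxes(centers, wh_dead))  # x1 == 100.0 == center_x
```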

Step 2: Dump raw predictions

# Debug script output:
wh channel 0 (left):   min=0.000, max=0.000, mean=0.000  ← DEAD!
wh channel 1 (top):    min=7.1,   max=210.1, mean=73.4
wh channel 2 (right):  min=5.0,   max=67.4,  mean=26.5
wh channel 3 (bottom): min=12.5,  max=225.2, mean=59.7
  • Channel 0 is always exactly zero → x1 = center_x - 0 = center_x
  • That’s why boxes start at center and extend only right
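A minimal version of such a debug dump, assuming `wh` is the raw (B, 4, H, W) head output (`dump_wh_stats` is a hypothetical helper, not from the repo):

```python
import torch


def dump_wh_stats(wh: torch.Tensor) -> None:
    """Print per-channel min/max/mean and flag channels stuck at zero."""
    names = ["left", "top", "right", "bottom"]
    for ch, name in enumerate(names):
        c = wh[:, ch]
        flag = "  ← DEAD!" if c.abs().max().item() == 0 else ""
        print(f"wh channel {ch} ({name}): min={c.min().item():.3f}, "
              f"max={c.max().item():.3f}, mean={c.mean().item():.3f}{flag}")


wh = torch.rand(1, 4, 8, 8) * 50
wh[:, 0] = 0.0  # simulate the dead left channel
dump_wh_stats(wh)
```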

Step 3: Inspect the weights

Channel 0: weight std=0.0004, bias=-0.004  ← never learned
Channel 1: weight std=0.221,  bias=0.660   ← healthy
Channel 2: weight std=0.083,  bias=0.518   ← healthy
Channel 3: weight std=0.215,  bias=0.902   ← healthy
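A sketch of how those numbers could be produced, assuming the final layer of `wh_head` is a Conv2d with one output channel per offset (the 0.01 threshold is an illustrative cutoff, not from the repo):

```python
import torch
import torch.nn as nn


def inspect_head(conv: nn.Conv2d) -> None:
    """Print per-output-channel weight std and bias, flagging dead channels."""
    for ch in range(conv.out_channels):
        std = conv.weight[ch].std().item()
        bias = conv.bias[ch].item()
        verdict = "never learned" if std < 0.01 else "healthy"
        print(f"Channel {ch}: weight std={std:.4f}, bias={bias:+.3f}  ← {verdict}")


inspect_head(nn.Conv2d(64, 4, kernel_size=1))
```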

4. The Root Cause (~300 words)

The Dead ReLU Problem

  • Initialization: normal_init(std=0.001) → weights ≈ 0, bias = 0
  • Pre-activation output hovers around 0
  • If channel 0 drifts slightly negative early in training → ReLU clips to 0 → gradient = 0 → permanently dead
  • Classic “dying ReLU” problem, but manifesting at the output channel level
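The mechanism can be demonstrated in a few lines:

```python
import torch
import torch.nn.functional as F

# A slightly negative pre-activation is clipped to zero by ReLU, and the
# gradient flowing back through it is exactly zero as well.
x = torch.tensor([-0.002], requires_grad=True)  # pre-activation of channel 0
y = F.relu(x)
y.sum().backward()
print(y.item(), x.grad.item())  # → 0.0 0.0: no learning signal reaches x
```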

Why did it work in the original TTFNet?

  • Original uses ResNet-18 backbone + Modulated Deformable Convolutions → richer features, stronger gradients
  • MobileNetV3-Small produces weaker features → more prone to dead neurons in downstream heads
  • It’s a training dynamics issue, not a math issue — the architecture is correct

Diagram: Gradient Flow Through ReLU

Pre-activation: -0.002 (slightly negative due to random init)
         ↓
     ReLU(x) = 0
         ↓
   Gradient = 0 (no learning signal)
         ↓
   Weight stays near 0 forever
         ↓
   Channel is dead ☠️

5. Three Fixes (~300 words)

Fix A: Leaky ReLU (activation swap)

wh = F.leaky_relu(self.wh_head(x)) * Config.wh_offset_base
  • Passes 1% of the gradient for negative inputs (default negative_slope=0.01)
  • Offsets can go slightly negative, but downstream clamp handles it
  • Tradeoff: minor departure from original paper

Fix B: Positive bias initialization (init fix)

nn.init.constant_(self.wh_head[-1].bias, 0.1)
  • Keeps original ReLU
  • Pre-activation starts at ~0.1 (safely positive) → ReLU never clips it
  • Tradeoff: initial predictions are biased, but training corrects it quickly
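A quick sanity check of Fix B in isolation, assuming a 1×1 Conv2d final layer (the layer shape is illustrative): with std=0.001 weights over 64 inputs the noise term is ~0.008, so a +0.1 bias puts essentially every pre-activation on the safe side of ReLU at initialization.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(64, 4, kernel_size=1)
nn.init.normal_(conv.weight, std=0.001)   # the post's normal_init(std=0.001)
nn.init.constant_(conv.bias, 0.1)          # Fix B: start safely positive
pre = conv(torch.randn(1, 64, 16, 16))
print((pre > 0).float().mean().item())     # ≈ 1.0: almost nothing starts clipped
```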

Fix C: Softplus (smooth activation)

wh = F.softplus(self.wh_head(x)) * Config.wh_offset_base
  • softplus(x) = log(1 + exp(x)) — smooth approximation of ReLU
  • Never has zero gradient, always positive output
  • Tradeoff: slightly more compute than ReLU
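A quick check of the gradient at a slightly negative pre-activation:

```python
import torch
import torch.nn.functional as F

# softplus'(x) = sigmoid(x), which is never zero, so a channel that drifts
# negative still receives a learning signal and can recover.
x = torch.tensor([-0.002], requires_grad=True)
F.softplus(x).sum().backward()
print(f"{x.grad.item():.4f}")  # sigmoid(-0.002) ≈ 0.4995
```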

Comparison Table

| Fix                    | Matches Paper | Prevents Dead Neurons | Output Always Positive | Compute Cost  |
|------------------------|---------------|-----------------------|------------------------|---------------|
| ReLU (original)        | ✅            | ❌                    | ✅                     | Cheapest      |
| Leaky ReLU             | ❌            | ✅                    | ❌                     | Same          |
| Positive bias init     | ✅            | ✅ (probabilistic)    | ✅                     | Same          |
| Softplus               | ❌            | ✅                    | ✅                     | Slightly more |
| Leaky ReLU + bias init | ❌            | ✅✅                  | ❌                     | Same          |

6. The Broader Lesson (~200 words)

  • “Dying ReLU” is a well-known problem, but usually discussed in hidden layers. Here it killed an output channel
  • Lightweight backbones make downstream heads more fragile
  • Debugging tip: when bounding boxes are systematically shifted, dump per-channel statistics before checking the decode logic
  • Prevention: always verify that all output channels have non-zero statistics after a few training iterations — add this as a sanity check callback

7. Bonus: A PyTorch Lightning Callback for Early Detection (~100 words)

import torch
import pytorch_lightning as pl

class DeadChannelDetector(pl.Callback):
    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if batch_idx == 100:  # one-off check after 100 iterations
            with torch.no_grad():
                layer = pl_module.model.head.wh_head[-1]
                for ch in range(layer.out_channels):
                    w_std = layer.weight[ch].std().item()
                    # dead channels have weights *and* bias pinned near zero
                    if w_std < 0.01 and abs(layer.bias[ch].item()) < 0.01:
                        print(f"⚠️ wh_head channel {ch} may be dead (weight std={w_std:.4f})")

Metadata

  • Estimated length: ~1800 words + code blocks + images
  • Target audience: ML practitioners doing object detection
  • Tags: PyTorch, object detection, debugging, ReLU, weight initialization, TTFNet