Blog Post Plan: The Silent Killer — How ReLU Murdered My Bounding Boxes
Hook
A single dead neuron in a 4-channel output head caused every bounding box to lose its left edge. The model trained for hours, the loss converged, and inference produced confident detections — but every box only covered the right half of each person. Here’s how I found it, why it happened, and three ways to prevent it.
1. The Setup (~200 words)
- What is TTFNet? Anchor-free detector, heatmap + 4-channel offset regression (left, top, right, bottom)
- The lightweight twist: using MobileNetV3-Small instead of the original ResNet-18 + Deformable Conv backbone
- The offset head:
wh = F.relu(wh_head(x)) * wh_offset_base
- Why ReLU? Offsets are distances from center to edges — should always be positive
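The setup above can be sketched as a minimal head module. This is a hypothetical reconstruction, not the repo's actual code: `WHHead`, the layer widths, `in_channels`, and the `wh_offset_base` value are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch of the 4-channel offset head; in_channels, hidden
# width, and wh_offset_base are illustrative, not the repo's actual values.
class WHHead(nn.Module):
    def __init__(self, in_channels=64, wh_offset_base=16.0):
        super().__init__()
        self.wh_offset_base = wh_offset_base
        self.wh_head = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 4, kernel_size=1),  # left, top, right, bottom
        )

    def forward(self, x):
        # ReLU keeps offsets non-negative: they are distances from center to edges
        return F.relu(self.wh_head(x)) * self.wh_offset_base
```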
2. The Symptom (~150 words + images)
- Show the broken inference result (boxes only covering right half of people)
- “The model is confident and boxes are at the right locations… but they’re all too narrow”
- Initial suspicion: decoding bug? scale mismatch? wrong channel order?
Reproduction Steps
git clone <repo> && cd ttfnet
git checkout dead-relu-demo # branch with the original bug
uv sync
./run.sh # download dataset + train
# After ~5 epochs, run inference:
uv run python eval.py --checkpoint checkpoints/ttfnet-epoch=05-*.ckpt --image WiderPerson/Images/000089.jpg
# Observe: boxes only cover right half of objects
3. The Investigation (~400 words)
Step 1: Verify the decode logic
- Trace get_bboxes:
x1 = center_x - wh[0]
x2 = center_x + wh[2]
- Compare against original TTFNet source — identical
- Compare against training loss computation — consistent
- Conclusion: decode is correct
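The decode step being verified looks roughly like this. It is a simplified sketch: the real `get_bboxes` operates on whole heatmaps, and `decode_boxes` is a hypothetical helper name for a single center point.

```python
import torch

def decode_boxes(center_x, center_y, wh):
    # Simplified sketch of the decode being verified; the real get_bboxes
    # works on full heatmaps, and decode_boxes is a hypothetical name.
    x1 = center_x - wh[0]  # left edge: center minus left offset
    y1 = center_y - wh[1]
    x2 = center_x + wh[2]  # right edge: center plus right offset
    y2 = center_y + wh[3]
    return x1, y1, x2, y2

# With a dead left channel (wh[0] == 0), the box starts exactly at center:
wh = torch.tensor([0.0, 10.0, 20.0, 15.0])
x1, y1, x2, y2 = decode_boxes(50.0, 50.0, wh)
print(float(x1), float(x2))  # 50.0 70.0 -> the left half is gone
```

The decode itself is correct; the symptom comes entirely from the zeroed input channel.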
Step 2: Dump raw predictions
# Debug script output:
wh channel 0 (left): min=0.000, max=0.000, mean=0.000 ← DEAD!
wh channel 1 (top): min=7.1, max=210.1, mean=73.4
wh channel 2 (right): min=5.0, max=67.4, mean=26.5
wh channel 3 (bottom): min=12.5, max=225.2, mean=59.7
- Channel 0 is always exactly zero → x1 = center_x - 0 = center_x
- That’s why boxes start at center and extend only right
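A per-channel dump like the one above can be produced with a small helper. `dump_wh_stats` is a hypothetical name, and the left/top/right/bottom channel order is an assumption carried over from the head definition.

```python
import torch

@torch.no_grad()
def dump_wh_stats(wh):
    # Hypothetical debug helper; assumes wh has shape (B, 4, H, W) with
    # channel order left/top/right/bottom.
    for ch, name in enumerate(["left", "top", "right", "bottom"]):
        c = wh[:, ch]
        flag = " <- DEAD!" if c.abs().max().item() == 0 else ""
        print(f"wh channel {ch} ({name}): min={c.min().item():.3f}, "
              f"max={c.max().item():.3f}, mean={c.mean().item():.3f}{flag}")
```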
Step 3: Inspect the weights
Channel 0: weight std=0.0004, bias=-0.004 ← never learned
Channel 1: weight std=0.221, bias=0.660 ← healthy
Channel 2: weight std=0.083, bias=0.518 ← healthy
Channel 3: weight std=0.215, bias=0.902 ← healthy
4. The Root Cause (~300 words)
The Dead ReLU Problem
- Initialization:
normal_init(std=0.001) → weights ≈ 0, bias = 0
- Pre-activation output hovers around 0
- If channel 0 drifts slightly negative early in training → ReLU clips to 0 → gradient = 0 → permanently dead
- Classic “dying ReLU” problem, but manifesting at the output channel level
Why did it work in the original TTFNet?
- Original uses ResNet-18 backbone + Modulated Deformable Convolutions → richer features, stronger gradients
- MobileNetV3-Small produces weaker features → more prone to dead neurons in downstream heads
- It’s a training dynamics issue, not a math issue — the architecture is correct
Diagram: Gradient Flow Through ReLU
Pre-activation: -0.002 (slightly negative due to random init)
↓
ReLU(x) = 0
↓
Gradient = 0 (no learning signal)
↓
Weight stays near 0 forever
↓
Channel is dead ☠️
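The diagram can be reproduced with two lines of autograd:

```python
import torch
import torch.nn.functional as F

# A slightly negative pre-activation, as after near-zero initialization
x = torch.tensor([-0.002], requires_grad=True)
F.relu(x).sum().backward()
print(x.grad)  # zero gradient: no learning signal ever reaches this channel
```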
5. Three Fixes (~300 words)
Fix A: Leaky ReLU (activation swap)
wh = F.leaky_relu(self.wh_head(x)) * Config.wh_offset_base
- Allows a 1% gradient for negative values (PyTorch’s default negative_slope=0.01)
- Offsets can go slightly negative, but downstream clamp handles it
- Tradeoff: minor departure from original paper
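The same two-line experiment shows why Fix A recovers:

```python
import torch
import torch.nn.functional as F

# Same pre-activation as before, but through leaky_relu: the slightly
# negative input now receives a scaled-down (1%) gradient.
x = torch.tensor([-0.002], requires_grad=True)
F.leaky_relu(x).sum().backward()  # default negative_slope = 0.01
print(x.grad)  # nonzero gradient, so the channel can climb back out
```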
Fix B: Positive bias initialization (init fix)
nn.init.constant_(self.wh_head[-1].bias, 0.1)
- Keeps original ReLU
- Output starts at ~0.1 (safely positive) → ReLU never kills it
- Tradeoff: initial predictions are biased, but training corrects it quickly
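A minimal sketch of the init fix. The head's last layer is assumed here to be a 1×1 conv with 4 output channels; the sizes are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sketch of Fix B: assumed last layer of the wh head.
last = nn.Conv2d(64, 4, kernel_size=1)
nn.init.normal_(last.weight, std=0.001)  # near-zero weights, as in the original init
nn.init.constant_(last.bias, 0.1)        # positive bias keeps the pre-activation above 0
out = last(torch.randn(1, 64, 8, 8))
print(out.min().item())  # roughly 0.1, so ReLU passes gradient from step one
```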
Fix C: Softplus (smooth activation)
wh = F.softplus(self.wh_head(x)) * Config.wh_offset_base
- softplus(x) = log(1 + exp(x)) — smooth approximation of ReLU
- Never has zero gradient, always positive output
- Tradeoff: slightly more compute than ReLU
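A quick check of both softplus properties:

```python
import torch
import torch.nn.functional as F

# softplus(x) = log(1 + exp(x)): strictly positive output, and its
# derivative is sigmoid(x), which is never exactly zero.
x = torch.tensor([-5.0, -0.002, 0.0, 3.0], requires_grad=True)
y = F.softplus(x)
y.sum().backward()
assert (y > 0).all()       # always-positive offsets, like ReLU
assert (x.grad > 0).all()  # but the gradient never dies
```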
Comparison Table
| Fix | Matches Paper | Prevents Dead Neurons | Output Always Positive | Compute Cost |
|---|---|---|---|---|
| ReLU (original) | ✅ | ❌ | ✅ | Cheapest |
| Leaky ReLU | ❌ | ✅ | ❌ | Same |
| Positive bias init | ✅ | ✅ (probabilistic) | ✅ | Same |
| Softplus | ❌ | ✅ | ✅ | Slightly more |
| Leaky ReLU + bias init | ❌ | ✅✅ | ❌ | Same |
6. The Broader Lesson (~200 words)
- “Dying ReLU” is a well-known problem, but usually discussed in hidden layers. Here it killed an output channel
- Lightweight backbones make downstream heads more fragile
- Debugging tip: when bounding boxes are systematically shifted, dump per-channel statistics before checking the decode logic
- Prevention: always verify that all output channels have non-zero statistics after a few training iterations — add this as a sanity check callback
7. Bonus: A PyTorch Lightning Callback for Early Detection (~100 words)
class DeadChannelDetector(pl.Callback):
    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if batch_idx == 100:  # check after 100 iterations
            with torch.no_grad():
                bias = pl_module.model.head.wh_head[-1].bias
                for ch in range(bias.shape[0]):
                    if abs(bias[ch].item()) < 0.01:
                        print(f"⚠️ wh_head channel {ch} may be dead (bias={bias[ch]:.4f})")
Metadata
- Estimated length: ~1800 words + code blocks + images
- Target audience: ML practitioners doing object detection
- Tags: PyTorch, object detection, debugging, ReLU, weight initialization, TTFNet