Vision Transformers

ViT
Transformers
PyTorch
From attention mechanisms to ViT, DETR, and SegFormer — implemented from scratch in PyTorch.
Published

February 15, 2026

A hands-on study of Vision Transformers, progressing from foundational attention mechanisms to state-of-the-art architectures for classification, detection, and segmentation.

Repository: digital-nomad-cheng/vit

flowchart LR
    A["01 Attention"] --> B["02 ViT"]
    B --> C["03 DETR"]
    B --> D["04 SegFormer"]

    style A fill:#4a90d9,color:#fff
    style B fill:#7b61ff,color:#fff
    style C fill:#e06c75,color:#fff
    style D fill:#56b6c2,color:#fff

Sub-projects

01 — Transformer Attention

Core multi-head self-attention and positional encoding, implemented from scratch in a Jupyter notebook. Covers the building blocks that power all subsequent projects.

🔗 View code
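The building blocks covered in 01 can be sketched as a minimal multi-head self-attention module. This is an illustrative from-scratch implementation, not the notebook's exact code; class and variable names here are my own:

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Scaled dot-product attention split across several heads (illustrative sketch)."""
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)   # fused Q, K, V projection
        self.proj = nn.Linear(dim, dim)      # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        # (B, N, 3*C) -> (B, N, 3, heads, head_dim) -> 3 x (B, heads, N, head_dim)
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)
        # attention weights: softmax(QK^T / sqrt(d_head)) over the key axis
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        attn = attn.softmax(dim=-1)
        # weighted sum of values, then merge heads back into one channel dim
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

x = torch.randn(2, 16, 64)                   # (batch, tokens, embed dim)
y = MultiHeadSelfAttention(64, 8)(x)
print(y.shape)                               # torch.Size([2, 16, 64])
```

Attention is shape-preserving: the output has the same `(batch, tokens, dim)` layout as the input, which is what lets ViT, DETR, and SegFormer stack these blocks freely.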

02 — ViT on CIFAR-10

Vision Transformer (ViT) image classification on CIFAR-10. Also includes a Transformer-in-Transformer (TNT) variant trained on MNIST, exploring nested tokenisation at both patch and pixel levels.

🔗 View code
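The core ViT idea, splitting an image into fixed-size patches and treating them as tokens, can be sketched for CIFAR-10's 32x32 images. Patch size, embed dim, and names below are illustrative assumptions, not the repo's exact configuration:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split a 32x32 image into 4x4 patches, embed them, prepend a [CLS] token (sketch)."""
    def __init__(self, img_size: int = 32, patch: int = 4, in_ch: int = 3, dim: int = 192):
        super().__init__()
        self.num_patches = (img_size // patch) ** 2          # 64 patches
        # A strided conv is equivalent to flattening each patch and projecting it
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))      # learnable [CLS] token
        self.pos = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B = x.shape[0]
        x = self.proj(x).flatten(2).transpose(1, 2)          # (B, 64, dim)
        cls = self.cls.expand(B, -1, -1)
        return torch.cat([cls, x], dim=1) + self.pos         # (B, 65, dim)

tokens = PatchEmbed()(torch.randn(2, 3, 32, 32))
print(tokens.shape)                                          # torch.Size([2, 65, 192])
```

The resulting token sequence feeds straight into a stack of attention blocks; classification reads off the `[CLS]` token. TNT additionally tokenises the pixels inside each patch with an inner Transformer.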

03 — DETR on VisDrone

DEtection TRansformer (DETR) fine-tuned for aerial object detection on the VisDrone dataset. The architecture was rewritten to exactly match the official Facebook DETR implementation, enabling direct loading of pretrained weights and head-only fine-tuning for faster convergence.

🔗 View code
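Head-only fine-tuning boils down to freezing the pretrained backbone and transformer and leaving only the prediction heads trainable. A minimal sketch of that pattern, using a toy stand-in for the DETR model (the real model would be loaded with the pretrained weights; the head names `class_embed` and `bbox_embed` follow the official DETR implementation, while everything else here is a placeholder):

```python
import torch.nn as nn

# Toy stand-in for a DETR-like model; in practice this is the rewritten model
# with pretrained facebookresearch/detr weights loaded into it.
model = nn.Module()
model.backbone = nn.Linear(8, 8)
model.transformer = nn.Linear(8, 8)
model.class_embed = nn.Linear(8, 12)   # classification head (placeholder sizes)
model.bbox_embed = nn.Linear(8, 4)     # box regression head (placeholder sizes)

# Freeze everything, then unfreeze only the prediction heads
for p in model.parameters():
    p.requires_grad = False
for head in (model.class_embed, model.bbox_embed):
    for p in head.parameters():
        p.requires_grad = True

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)
# ['class_embed.weight', 'class_embed.bias', 'bbox_embed.weight', 'bbox_embed.bias']
```

With the backbone frozen, the optimizer only sees the head parameters, which is what makes convergence on a new dataset like VisDrone fast.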

04 — SegFormer on ADE20K

SegFormer B0 semantic segmentation on ADE20K. Combines hierarchical Transformer encoders with lightweight MLP decoders for efficient dense prediction.

🔗 View code
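SegFormer's decoder is notably simple: project each encoder scale to a common dimension, upsample everything to the highest-resolution grid, concatenate, and classify. A sketch of that all-MLP decode head, assuming MiT-B0's channel widths (32, 64, 160, 256) and ADE20K's 150 classes; the class and variable names are mine:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLPDecoder(nn.Module):
    """All-MLP decode head: per-scale 1x1 projection, upsample, fuse, classify (sketch)."""
    def __init__(self, in_dims=(32, 64, 160, 256), dim=256, num_classes=150):
        super().__init__()
        # A 1x1 conv applied per pixel is an MLP over the channel dimension
        self.proj = nn.ModuleList(nn.Conv2d(c, dim, 1) for c in in_dims)
        self.fuse = nn.Conv2d(len(in_dims) * dim, dim, 1)
        self.classify = nn.Conv2d(dim, num_classes, 1)

    def forward(self, feats):
        size = feats[0].shape[-2:]               # 1/4-resolution grid
        ups = [F.interpolate(p(f), size=size, mode='bilinear', align_corners=False)
               for p, f in zip(self.proj, feats)]
        return self.classify(self.fuse(torch.cat(ups, dim=1)))

# Feature pyramid at strides 4/8/16/32 for a 512x512 input
feats = [torch.randn(1, c, 512 // s, 512 // s)
         for c, s in zip((32, 64, 160, 256), (4, 8, 16, 32))]
logits = MLPDecoder()(feats)
print(logits.shape)                              # torch.Size([1, 150, 128, 128])
```

The logits come out at 1/4 resolution and are bilinearly upsampled to the input size at the end, which keeps the decoder cheap relative to the hierarchical encoder.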

References

| Paper | Year | Sub-project |
| --- | --- | --- |
| Attention Is All You Need (Vaswani et al.) | 2017 | 01 |
| An Image is Worth 16x16 Words (Dosovitskiy et al.) | 2020 | 02 |
| Transformer in Transformer (Han et al.) | 2021 | 02 |
| End-to-End Object Detection with Transformers (Carion et al.) | 2020 | 03 |
| SegFormer: Simple and Efficient Design for Semantic Segmentation (Xie et al.) | 2021 | 04 |

Tech Stack

Python · PyTorch · uv