```mermaid
flowchart LR
A["01 Attention"] --> B["02 ViT"]
B --> C["03 DETR"]
B --> D["04 SegFormer"]
style A fill:#4a90d9,color:#fff
style B fill:#7b61ff,color:#fff
style C fill:#e06c75,color:#fff
style D fill:#56b6c2,color:#fff
```
# Vision Transformers
A hands-on study of Vision Transformers, progressing from foundational attention mechanisms to state-of-the-art architectures for classification, detection, and segmentation.
Repository: `digital-nomad-cheng/vit`
## Sub-projects

### 01 — Transformer Attention
Core multi-head self-attention and positional encoding, implemented from scratch in a Jupyter notebook. Covers the building blocks that power all subsequent projects.
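The central computation is scaled dot-product attention split across heads. The module below is a minimal illustrative sketch of that mechanism (the class name and dimensions are my own, not the notebook's):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    """Minimal multi-head self-attention: project to Q/K/V, attend per head, merge."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        assert dim % num_heads == 0, "embedding dim must divide evenly across heads"
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)   # fused Q, K, V projection
        self.proj = nn.Linear(dim, dim)      # output projection after head merge

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, D = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)              # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = F.softmax(attn, dim=-1)                    # weights over keys sum to 1
        out = (attn @ v).transpose(1, 2).reshape(B, N, D) # merge heads back to dim D
        return self.proj(out)
```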
### 02 — ViT on CIFAR-10
Vision Transformer (ViT) image classification on CIFAR-10. Also includes a Transformer-in-Transformer (TNT) variant trained on MNIST, exploring nested tokenisation at both patch and pixel levels.
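ViT's patch-level tokenisation turns an image into a sequence: non-overlapping patches are linearly projected, a learnable `[CLS]` token is prepended, and positional embeddings are added. A sketch of that step (illustrative sizes for CIFAR-10, not the repo's exact module):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and embed each as a token."""

    def __init__(self, img_size=32, patch_size=4, in_chans=3, dim=64):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A stride-p conv is equivalent to flattening p x p patches + a linear layer.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B = x.shape[0]
        x = self.proj(x).flatten(2).transpose(1, 2)  # (B, num_patches, dim)
        cls = self.cls_token.expand(B, -1, -1)       # one [CLS] token per image
        return torch.cat([cls, x], dim=1) + self.pos_embed
```

TNT applies the same idea a second time inside each patch, tokenising individual pixels, which is why it is explored on the smaller MNIST images.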
### 03 — DETR on VisDrone
DEtection TRansformer (DETR) fine-tuned for aerial object detection on the VisDrone dataset. The architecture was rewritten to match the official Facebook DETR implementation exactly, so pretrained weights can be loaded directly and only the prediction heads fine-tuned for faster convergence.
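Head-only fine-tuning boils down to disabling gradients everywhere except the prediction heads. A hedged sketch of that pattern (`freeze_except_heads` is my own helper; `class_embed` and `bbox_embed` are the head names used in the official DETR code):

```python
import torch.nn as nn

def freeze_except_heads(model: nn.Module, head_keywords=("class_embed", "bbox_embed")):
    """Freeze every parameter whose name does not contain a head keyword.

    Returns the list of trainable parameters, ready to hand to an optimizer.
    """
    for name, p in model.named_parameters():
        p.requires_grad = any(k in name for k in head_keywords)
    return [p for p in model.parameters() if p.requires_grad]
```

With the official weights this would look something like `model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True)` followed by building the optimizer from `freeze_except_heads(model)` only.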
### 04 — SegFormer on ADE20K
SegFormer B0 semantic segmentation on ADE20K. Combines hierarchical Transformer encoders with lightweight MLP decoders for efficient dense prediction.
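The "lightweight MLP decoder" works by projecting each encoder stage's feature map to a common width, upsampling everything to the highest-resolution stage, and fusing before a per-pixel classifier. A sketch of that decode head (dimensions are illustrative B0-style values, not the repo's exact configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLPDecodeHead(nn.Module):
    """SegFormer-style all-MLP decoder over multi-scale encoder features.

    Each stage is projected with a 1x1 conv (a per-pixel linear layer),
    upsampled to the finest stage's resolution, concatenated, and fused.
    """

    def __init__(self, in_dims=(32, 64, 160, 256), embed_dim=256, num_classes=150):
        super().__init__()
        self.projs = nn.ModuleList(nn.Conv2d(d, embed_dim, 1) for d in in_dims)
        self.fuse = nn.Conv2d(embed_dim * len(in_dims), embed_dim, 1)
        self.classifier = nn.Conv2d(embed_dim, num_classes, 1)

    def forward(self, feats):
        target = feats[0].shape[-2:]  # finest (highest-resolution) stage
        ups = [F.interpolate(p(f), size=target, mode="bilinear", align_corners=False)
               for p, f in zip(self.projs, feats)]
        return self.classifier(self.fuse(torch.cat(ups, dim=1)))
```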
## References
| Paper | Year | Sub-project |
|---|---|---|
| Attention Is All You Need (Vaswani et al.) | 2017 | 01 |
| An Image is Worth 16x16 Words (Dosovitskiy et al.) | 2020 | 02 |
| Transformer in Transformer (Han et al.) | 2021 | 02 |
| End-to-End Object Detection with Transformers (Carion et al.) | 2020 | 03 |
| SegFormer: Simple and Efficient Design for Semantic Segmentation (Xie et al.) | 2021 | 04 |
## Tech Stack

Python · PyTorch · uv