GPT-2 From Scratch

Python · PyTorch · LLM

A from-scratch PyTorch implementation of GPT-2, built to understand every layer of a decoder-only transformer. Covers tokenization, learned embeddings, positional encodings, masked multi-head attention, feed-forward blocks, layer normalization, and the full training loop. Trained on a custom corpus to demonstrate end-to-end language model training without relying on HuggingFace abstractions.
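As a rough illustration of the masked multi-head attention step mentioned above, here is a minimal sketch. It is not taken from the project's code: the class name `CausalSelfAttention` and the parameters `d_model`, `n_heads`, and `max_len` are hypothetical names chosen for this example, and dropout is omitted for brevity.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalSelfAttention(nn.Module):
    """Masked multi-head self-attention (illustrative sketch, not the project's code)."""

    def __init__(self, d_model: int, n_heads: int, max_len: int):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.n_heads = n_heads
        # One linear layer produces queries, keys, and values together.
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)
        # Lower-triangular mask: position t may attend only to positions <= t.
        mask = torch.tril(torch.ones(max_len, max_len, dtype=torch.bool))
        self.register_buffer("mask", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape  # batch, sequence length, embedding width
        head_dim = C // self.n_heads
        q, k, v = self.qkv(x).split(C, dim=2)
        # Reshape each to (B, n_heads, T, head_dim).
        q = q.view(B, T, self.n_heads, head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_heads, head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_heads, head_dim).transpose(1, 2)
        # Scaled dot-product scores, shape (B, n_heads, T, T).
        att = (q @ k.transpose(-2, -1)) / math.sqrt(head_dim)
        # Hide future positions before the softmax.
        att = att.masked_fill(~self.mask[:T, :T], float("-inf"))
        att = F.softmax(att, dim=-1)
        y = att @ v
        # Merge the heads back into a single embedding dimension.
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)


if __name__ == "__main__":
    attn = CausalSelfAttention(d_model=16, n_heads=4, max_len=8)
    out = attn(torch.randn(2, 5, 16))
    print(out.shape)  # torch.Size([2, 5, 16])
```

The causal mask is what makes the model decoder-only: changing a later token leaves the outputs at all earlier positions unchanged, so each position can be trained to predict its successor.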
