
Qwen3 From Scratch
A PyTorch implementation of the Qwen3 0.6B language model, built from scratch with a modern transformer architecture.
Technology Stack
- Python & PyTorch
Key Challenges
- Implementing Grouped Query Attention from first principles
- Designing and managing a KV cache for efficient inference
- Implementing RoPE positional encoding correctly
- Optimizing memory usage for a 0.6B-parameter model
- Understanding weight tying between the embeddings and the LM head
Key Learnings
- Deep understanding of transformer architecture internals
- GQA implementation and efficiency benefits over MHA
- RoPE mechanics and frequency calculations
- RMSNorm advantages over LayerNorm
- SwiGLU activation function and gating mechanisms
- Pre-norm vs post-norm architecture tradeoffs
Qwen3 From Scratch: Building a Modern LLM
Overview
Qwen3 From Scratch is a complete PyTorch implementation of the Qwen3 language model (~0.6B parameters) built entirely from first principles. This project demonstrates modern transformer architecture components including Grouped Query Attention (GQA), Rotary Position Embedding (RoPE), RMSNorm, and SwiGLU feed-forward networks. It serves as both an educational resource for understanding LLM internals and a functional implementation suitable for experimentation.
How It Works
- Token Embedding: Input tokens are converted to 1024-dimensional embeddings from a vocabulary of 151,936 tokens.
- Transformer Blocks: 28 sequential transformer layers process the embeddings with attention and feed-forward networks.
- Grouped Query Attention: Efficient attention mechanism with 16 query heads sharing 8 key-value heads for reduced memory footprint.
- Position Encoding: RoPE applies rotary embeddings to queries and keys for relative position awareness.
- Language Modeling Head: Final output projects back to vocabulary space for next-token prediction.
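The sketch below walks the shapes of these steps end to end; the variable names are illustrative, not the project's actual identifiers, and the 28 transformer blocks are elided.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim = 151_936, 1024

tok_emb = nn.Embedding(vocab_size, emb_dim)
lm_head = nn.Linear(emb_dim, vocab_size, bias=False)   # tied to tok_emb's weights in the real model

token_ids = torch.randint(0, vocab_size, (1, 8))        # (batch, seq_len)
x = tok_emb(token_ids)                                  # (1, 8, 1024)
# ... 28 transformer blocks (GQA + SwiGLU FFN, with RoPE and RMSNorm) refine x here ...
logits = lm_head(x)                                     # (1, 8, 151_936)
next_token = logits[:, -1].argmax(dim=-1)               # greedy next-token prediction
```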
Key Features
Grouped Query Attention (GQA)
- Reduces KV cache memory requirements by 50% compared to Multi-Head Attention.
- 16 query heads with 8 KV heads for efficient attention computation.
- Q/K normalization for improved training stability.
- Proper causal masking for autoregressive generation.
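A minimal sketch of the head-sharing idea, assuming the common repeat-interleave approach; dimensions follow the numbers above.

```python
import torch

batch, seq_len, head_dim = 1, 8, 128
n_heads, n_kv_heads = 16, 8
group_size = n_heads // n_kv_heads           # 2 query heads share each KV head

q = torch.randn(batch, n_heads,    seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# Only n_kv_heads K/V tensors are stored (half the KV cache of full MHA);
# they are expanded on the fly so every query head has a matching K/V head.
k = k.repeat_interleave(group_size, dim=1)   # (1, 16, 8, 128)
v = v.repeat_interleave(group_size, dim=1)   # (1, 16, 8, 128)
```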
Advanced Position Encoding
- RoPE with base frequency of 1,000,000 for better extrapolation.
- Pre-computed sine and cosine values for efficiency.
- Supports context lengths up to 4,096 tokens.
- Rotation applied directly to query and key vectors.
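A sketch of the precompute-then-rotate pattern using the stated head size, base frequency, and context length; the project's actual function names and tensor layout may differ.

```python
import torch

head_dim, base, context_len = 128, 1_000_000, 4096

# Precompute sin/cos once for all positions.
inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
positions = torch.arange(context_len).float()
angles = positions[:, None] * inv_freq[None, :]   # (context_len, head_dim/2)
angles = torch.cat([angles, angles], dim=-1)      # (context_len, head_dim)
cos, sin = angles.cos(), angles.sin()

def apply_rope(x, cos, sin):
    # x: (batch, n_heads, seq_len, head_dim)
    seq_len = x.shape[2]
    x1, x2 = x[..., : head_dim // 2], x[..., head_dim // 2 :]
    rotated = torch.cat([-x2, x1], dim=-1)
    return x * cos[:seq_len] + rotated * sin[:seq_len]

q = torch.randn(1, 16, 10, head_dim)
q_rot = apply_rope(q, cos, sin)                   # same shape, now position-aware
```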
Modern Normalization
- RMSNorm throughout the model for stable training.
- Pre-norm architecture with normalization before attention and FFN.
- Learnable scaling parameters (alpha) for each norm layer.
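A minimal RMSNorm sketch illustrating the normalization described above (no mean subtraction, unlike LayerNorm); the name of the learnable scale parameter here is illustrative.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))   # learnable per-dimension scale

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.scale * (x / rms)

x = torch.randn(1, 8, 1024)
print(RMSNorm(1024)(x).shape)   # torch.Size([1, 8, 1024])
```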
SwiGLU Feed-Forward Network
- Gated activation: a SiLU (Swish) gate branch multiplied elementwise with a linear up-projection.
- 3x expansion ratio (1024 → 3072 → 1024).
- Bias-free linear layers for parameter efficiency.
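A sketch of the SwiGLU FFN with the stated sizes (1024 → 3072 → 1024) and bias-free projections; the layer names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, emb_dim=1024, hidden_dim=3072):
        super().__init__()
        self.fc_gate = nn.Linear(emb_dim, hidden_dim, bias=False)
        self.fc_up   = nn.Linear(emb_dim, hidden_dim, bias=False)
        self.fc_down = nn.Linear(hidden_dim, emb_dim, bias=False)

    def forward(self, x):
        # SiLU (Swish) gate multiplied elementwise with the linear up-projection.
        return self.fc_down(F.silu(self.fc_gate(x)) * self.fc_up(x))

x = torch.randn(1, 8, 1024)
print(SwiGLUFFN()(x).shape)     # torch.Size([1, 8, 1024])
```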
Why I Built This
The motivation behind Qwen3 From Scratch includes:
- Deep Learning: Understand transformer internals by implementing each component from scratch.
- Modern Architecture: Explore state-of-the-art techniques like GQA and RoPE in practice.
- Educational Value: Create a clear, well-documented reference implementation for others learning LLMs.
- Experimentation: Provide a foundation for testing architectural modifications and optimizations.
- Research Understanding: Bridge the gap between reading papers and practical implementation.
Tech Stack
Core Framework
- Python & PyTorch: Complete model implementation with tensor operations.
- Custom Components: Hand-built attention, normalization, and position encoding layers.
Architecture Features
- Grouped Query Attention for memory efficiency.
- Rotary Position Embedding for position awareness.
- RMSNorm for training stability.
- SwiGLU activation for improved performance.
Technical Implementation
Model Configuration
- Parameters: ~612M total (accounting for weight tying between embeddings and LM head).
- Layers: 28 transformer blocks with residual connections.
- Hidden Size: 1024-dimensional embeddings throughout.
- Attention: 16 query heads with 128-dimensional head size, 8 KV heads.
- FFN: 3072-dimensional intermediate layer with SwiGLU activation.
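For reference, the configuration implied by these numbers as a plain dictionary; the key names are illustrative, not necessarily those used in the project.

```python
QWEN3_0_6B_CFG = {
    "vocab_size": 151_936,
    "emb_dim": 1024,
    "n_layers": 28,
    "n_heads": 16,
    "n_kv_heads": 8,
    "head_dim": 128,
    "hidden_dim": 3072,        # SwiGLU intermediate size
    "context_length": 4096,
    "rope_base": 1_000_000,
}
```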
Attention Mechanism
- GQA Implementation: KV heads are repeated to match query head count during attention computation.
- Causal Masking: Upper triangular mask prevents attending to future tokens.
- Score Computation: Scaled dot-product attention with proper normalization.
- KV Cache Structure: Designed to support efficient incremental decoding during inference.
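A sketch of the masked, scaled dot-product step described above, continuing from the GQA expansion shown earlier (q, k, v already have matching head counts).

```python
import math
import torch

batch, n_heads, seq_len, head_dim = 1, 16, 8, 128
q = torch.randn(batch, n_heads, seq_len, head_dim)
k = torch.randn(batch, n_heads, seq_len, head_dim)
v = torch.randn(batch, n_heads, seq_len, head_dim)

scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)        # scaled dot product
causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal, float("-inf"))            # block attention to future tokens
weights = torch.softmax(scores, dim=-1)
context = weights @ v                                         # (1, 16, 8, 128)
```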
Position Encoding
- Frequency Calculation: Per-dimension inverse frequencies scaled by the 1,000,000 base, following the complex-exponential view of RoPE.
- Rotation Application: Query and key vectors rotated using precomputed sin/cos values.
- Head Dimension Handling: Proper splitting and recombination for multi-head attention.
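A small sketch of the splitting and recombination around the rotation step; shapes follow the configuration above, and the rotation itself is the `apply_rope` sketch shown earlier.

```python
import torch

batch, seq_len, n_heads, head_dim = 1, 8, 16, 128

q_proj = torch.randn(batch, seq_len, n_heads * head_dim)             # output of the Q projection
q = q_proj.view(batch, seq_len, n_heads, head_dim).transpose(1, 2)   # (1, 16, 8, 128)
# ... apply_rope(q, cos, sin) would rotate each head here ...
q_merged = q.transpose(1, 2).contiguous().view(batch, seq_len, n_heads * head_dim)
```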
Normalization Strategy
- Pre-Norm Architecture: RMSNorm applied before attention and FFN sub-layers.
- Residual Connections: Skip connections around each sub-layer preserve gradient flow.
- Final Normalization: RMSNorm before the language modeling head for output stability.
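A sketch of the pre-norm residual pattern, with placeholder sub-layers standing in for grouped query attention and the SwiGLU FFN.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    def __init__(self, dim=1024):
        super().__init__()
        # nn.RMSNorm needs PyTorch >= 2.4; the project builds its own RMSNorm (sketched earlier).
        self.norm1, self.norm2 = nn.RMSNorm(dim), nn.RMSNorm(dim)
        self.attn = nn.Identity()   # stand-in for grouped query attention
        self.ffn = nn.Identity()    # stand-in for the SwiGLU feed-forward network

    def forward(self, x):
        x = x + self.attn(self.norm1(x))   # normalize *before* the sub-layer (pre-norm)
        x = x + self.ffn(self.norm2(x))    # residual connections preserve gradient flow
        return x

print(PreNormBlock()(torch.randn(1, 8, 1024)).shape)   # torch.Size([1, 8, 1024])
```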
Memory Optimization
- Weight Tying: Embeddings and LM head share parameters, reducing total count.
- GQA Efficiency: Fewer KV heads reduce memory requirements for attention.
- Float32 Precision: Default precision with flexibility for mixed-precision training.
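A sketch of the weight-tying idea and the parameter savings it implies; the module names are illustrative.

```python
import torch.nn as nn

vocab_size, emb_dim = 151_936, 1024
tok_emb = nn.Embedding(vocab_size, emb_dim)
lm_head = nn.Linear(emb_dim, vocab_size, bias=False)
lm_head.weight = tok_emb.weight                                   # same Parameter object, stored once

assert lm_head.weight is tok_emb.weight
print(f"parameters shared by tying: {vocab_size * emb_dim:,}")    # 155,582,464
```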
Behind the Scenes
Building Qwen3 from scratch was an exercise in understanding every detail of modern transformer architectures. The process involved studying the original Qwen papers, analyzing existing implementations, and debugging numerical issues with attention masks and position encodings. The project reveals how seemingly simple components like RoPE and RMSNorm significantly impact model behavior and training dynamics. It also demonstrates that building an LLM from scratch, while challenging, is achievable and provides invaluable insights into how these powerful models actually work under the hood.