
Qwen3 From Scratch
A PyTorch implementation of the Qwen3 0.6B language model, built from scratch with a modern transformer architecture.
Technology Stack
- Python & PyTorch
Key Challenges
- Implementing Grouped Query Attention from first principles
- Designing and managing a KV cache for efficient inference
- Implementing RoPE positional encoding correctly
- Optimizing memory usage for a 0.6B-parameter model
- Understanding weight tying between the embeddings and the LM head
Key Learnings
- Deep understanding of transformer architecture internals
- GQA implementation and efficiency benefits over MHA
- RoPE mechanics and frequency calculations
- RMSNorm advantages over LayerNorm
- SwiGLU activation function and gating mechanisms
- Pre-norm vs post-norm architecture tradeoffs
Qwen3 From Scratch: Building a Modern LLM
Overview
Qwen3 From Scratch is a complete PyTorch implementation of the Qwen3 language model (~0.6B parameters) built entirely from first principles. This project demonstrates modern transformer architecture components including Grouped Query Attention (GQA), Rotary Position Embedding (RoPE), RMSNorm, and SwiGLU feed-forward networks. It serves as both an educational resource for understanding LLM internals and a functional implementation suitable for experimentation.
How It Works
- Token Embedding: Input tokens are converted to 1024-dimensional embeddings from a vocabulary of 151,936 tokens.
- Transformer Blocks: 28 sequential transformer layers process the embeddings with attention and feed-forward networks.
- Grouped Query Attention: Efficient attention mechanism with 16 query heads sharing 8 key-value heads for reduced memory footprint.
- Position Encoding: RoPE applies rotary embeddings to queries and keys for relative position awareness.
- Language Modeling Head: Final output projects back to vocabulary space for next-token prediction.
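The sketch below walks the shapes of these steps end to end; the variable names are illustrative, not the project's actual identifiers, and the 28 transformer blocks are elided.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim = 151_936, 1024

tok_emb = nn.Embedding(vocab_size, emb_dim)
lm_head = nn.Linear(emb_dim, vocab_size, bias=False)   # tied to tok_emb's weights in the real model

token_ids = torch.randint(0, vocab_size, (1, 8))        # (batch, seq_len)
x = tok_emb(token_ids)                                  # (1, 8, 1024)
# ... 28 transformer blocks (GQA + SwiGLU FFN, with RoPE and RMSNorm) refine x here ...
logits = lm_head(x)                                     # (1, 8, 151_936)
next_token = logits[:, -1].argmax(dim=-1)               # greedy next-token prediction
```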
Key Features
Grouped Query Attention (GQA)
- Reduces KV cache memory requirements by 50% compared to Multi-Head Attention.
- 16 query heads with 8 KV heads for efficient attention computation.
- Q/K normalization for improved training stability.
- Proper causal masking for autoregressive generation.
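A minimal sketch of the head-sharing idea, assuming the common repeat-interleave approach; dimensions follow the numbers above.

```python
import torch

batch, seq_len, head_dim = 1, 8, 128
n_heads, n_kv_heads = 16, 8
group_size = n_heads // n_kv_heads           # 2 query heads share each KV head

q = torch.randn(batch, n_heads,    seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# Only n_kv_heads K/V tensors are stored (half the KV cache of full MHA);
# they are expanded on the fly so every query head has a matching K/V head.
k = k.repeat_interleave(group_size, dim=1)   # (1, 16, 8, 128)
v = v.repeat_interleave(group_size, dim=1)   # (1, 16, 8, 128)
```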
Advanced Position Encoding
- RoPE with base frequency of 1,000,000 for better extrapolation.
- Pre-computed sine and cosine values for efficiency.
- Supports context lengths up to 4,096 tokens.
- Rotation applied directly to query and key vectors.
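A sketch of the precompute-then-rotate pattern using the stated head size, base frequency, and context length; the project's actual function names and tensor layout may differ.

```python
import torch

head_dim, base, context_len = 128, 1_000_000, 4096

# Precompute sin/cos once for all positions.
inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
positions = torch.arange(context_len).float()
angles = positions[:, None] * inv_freq[None, :]   # (context_len, head_dim/2)
angles = torch.cat([angles, angles], dim=-1)      # (context_len, head_dim)
cos, sin = angles.cos(), angles.sin()

def apply_rope(x, cos, sin):
    # x: (batch, n_heads, seq_len, head_dim)
    seq_len = x.shape[2]
    x1, x2 = x[..., : head_dim // 2], x[..., head_dim // 2 :]
    rotated = torch.cat([-x2, x1], dim=-1)
    return x * cos[:seq_len] + rotated * sin[:seq_len]

q = torch.randn(1, 16, 10, head_dim)
q_rot = apply_rope(q, cos, sin)                   # same shape, now position-aware
```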
Modern Normalization
- RMSNorm throughout the model for stable training.
- Pre-norm architecture with normalization before attention and FFN.
- Learnable scaling parameters (alpha) for each norm layer.
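A minimal RMSNorm sketch illustrating the normalization described above (no mean subtraction, unlike LayerNorm); the name of the learnable scale parameter here is illustrative.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))   # learnable per-dimension scale

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.scale * (x / rms)

x = torch.randn(1, 8, 1024)
print(RMSNorm(1024)(x).shape)   # torch.Size([1, 8, 1024])
```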
SwiGLU Feed-Forward Network
- Gated activation: a SiLU (Swish) gate branch multiplied elementwise with a linear up-projection.
- 3x expansion ratio (1024 → 3072 → 1024).
- Bias-free linear layers for parameter efficiency.
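A sketch of the SwiGLU FFN with the stated sizes (1024 → 3072 → 1024) and bias-free projections; the layer names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, emb_dim=1024, hidden_dim=3072):
        super().__init__()
        self.fc_gate = nn.Linear(emb_dim, hidden_dim, bias=False)
        self.fc_up   = nn.Linear(emb_dim, hidden_dim, bias=False)
        self.fc_down = nn.Linear(hidden_dim, emb_dim, bias=False)

    def forward(self, x):
        # SiLU (Swish) gate multiplied elementwise with the linear up-projection.
        return self.fc_down(F.silu(self.fc_gate(x)) * self.fc_up(x))

x = torch.randn(1, 8, 1024)
print(SwiGLUFFN()(x).shape)     # torch.Size([1, 8, 1024])
```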
Why I Built This
The motivation behind Qwen3 From Scratch includes:
- Deep Learning: Understand transformer internals by implementing each component from scratch.
- Modern Architecture: Explore state-of-the-art techniques like GQA and RoPE in practice.
- Educational Value: Create a clear, well-documented reference implementation for others learning LLMs.
- Experimentation: Provide a foundation for testing architectural modifications and optimizations.
- Research Understanding: Bridge the gap between reading papers and practical implementation.
Tech Stack
Core Framework
- Python & PyTorch: Complete model implementation with tensor operations.
- Custom Components: Hand-built attention, normalization, and position encoding layers.
Architecture Features
- Grouped Query Attention for memory efficiency.
- Rotary Position Embedding for position awareness.
- RMSNorm for training stability.
- SwiGLU activation for improved performance.
Technical Implementation
Model Configuration
- Parameters: ~612M total (accounting for weight tying between embeddings and LM head).
- Layers: 28 transformer blocks with residual connections.
- Hidden Size: 1024-dimensional embeddings throughout.
- Attention: 16 query heads with 128-dimensional head size, 8 KV heads.
- FFN: 3072-dimensional intermediate layer with SwiGLU activation.
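For reference, the configuration implied by these numbers as a plain dictionary; the key names are illustrative, not necessarily those used in the project.

```python
QWEN3_0_6B_CFG = {
    "vocab_size": 151_936,
    "emb_dim": 1024,
    "n_layers": 28,
    "n_heads": 16,
    "n_kv_heads": 8,
    "head_dim": 128,
    "hidden_dim": 3072,        # SwiGLU intermediate size
    "context_length": 4096,
    "rope_base": 1_000_000,
}
```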
Attention Mechanism
- GQA Implementation: KV heads are repeated to match query head count during attention computation.
- Causal Masking: Upper triangular mask prevents attending to future tokens.
- Score Computation: Scaled dot-product attention with proper normalization.
- KV Cache Structure: Designed to support efficient incremental decoding during inference.
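A sketch of the masked, scaled dot-product step described above, continuing from the GQA expansion shown earlier (q, k, v already have matching head counts).

```python
import math
import torch

batch, n_heads, seq_len, head_dim = 1, 16, 8, 128
q = torch.randn(batch, n_heads, seq_len, head_dim)
k = torch.randn(batch, n_heads, seq_len, head_dim)
v = torch.randn(batch, n_heads, seq_len, head_dim)

scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)        # scaled dot product
causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal, float("-inf"))            # block attention to future tokens
weights = torch.softmax(scores, dim=-1)
context = weights @ v                                         # (1, 16, 8, 128)
```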
Position Encoding
- Frequency Calculation: Per-dimension inverse frequencies scaled by the 1,000,000 base, following the complex-exponential view of RoPE.
- Rotation Application: Query and key vectors rotated using precomputed sin/cos values.
- Head Dimension Handling: Proper splitting and recombination for multi-head attention.
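A small sketch of the splitting and recombination around the rotation step; shapes follow the configuration above, and the rotation itself is the `apply_rope` sketch shown earlier.

```python
import torch

batch, seq_len, n_heads, head_dim = 1, 8, 16, 128

q_proj = torch.randn(batch, seq_len, n_heads * head_dim)             # output of the Q projection
q = q_proj.view(batch, seq_len, n_heads, head_dim).transpose(1, 2)   # (1, 16, 8, 128)
# ... apply_rope(q, cos, sin) would rotate each head here ...
q_merged = q.transpose(1, 2).contiguous().view(batch, seq_len, n_heads * head_dim)
```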
Normalization Strategy
- Pre-Norm Architecture: RMSNorm applied before attention and FFN sub-layers.
- Residual Connections: Skip connections around each sub-layer preserve gradient flow.
- Final Normalization: RMSNorm before the language modeling head for output stability.
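A sketch of the pre-norm residual pattern, with placeholder sub-layers standing in for grouped query attention and the SwiGLU FFN.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    def __init__(self, dim=1024):
        super().__init__()
        # nn.RMSNorm needs PyTorch >= 2.4; the project builds its own RMSNorm (sketched earlier).
        self.norm1, self.norm2 = nn.RMSNorm(dim), nn.RMSNorm(dim)
        self.attn = nn.Identity()   # stand-in for grouped query attention
        self.ffn = nn.Identity()    # stand-in for the SwiGLU feed-forward network

    def forward(self, x):
        x = x + self.attn(self.norm1(x))   # normalize *before* the sub-layer (pre-norm)
        x = x + self.ffn(self.norm2(x))    # residual connections preserve gradient flow
        return x

print(PreNormBlock()(torch.randn(1, 8, 1024)).shape)   # torch.Size([1, 8, 1024])
```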
Memory Optimization
- Weight Tying: Embeddings and LM head share parameters, reducing total count.
- GQA Efficiency: Fewer KV heads reduce memory requirements for attention.
- Float32 Precision: Default precision with flexibility for mixed-precision training.
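A sketch of the weight-tying idea and the parameter savings it implies; the module names are illustrative.

```python
import torch.nn as nn

vocab_size, emb_dim = 151_936, 1024
tok_emb = nn.Embedding(vocab_size, emb_dim)
lm_head = nn.Linear(emb_dim, vocab_size, bias=False)
lm_head.weight = tok_emb.weight                                   # same Parameter object, stored once

assert lm_head.weight is tok_emb.weight
print(f"parameters shared by tying: {vocab_size * emb_dim:,}")    # 155,582,464
```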
Behind the Scenes
Building Qwen3 from scratch was an exercise in understanding every detail of modern transformer architectures. The process involved studying the original Qwen papers, analyzing existing implementations, and debugging numerical issues with attention masks and position encodings. The project reveals how seemingly simple components like RoPE and RMSNorm significantly impact model behavior and training dynamics. It also demonstrates that building an LLM from scratch, while challenging, is achievable and provides invaluable insights into how these powerful models actually work under the hood.