DistilCLIP

Python, PyTorch, Vision Transformer, Multimodal AI

A from-scratch CLIP implementation pairing a Vision Transformer with DistilBERT for multimodal image-text understanding. Trained on the Naruto BLIP Captions dataset for 25 epochs, it learns joint representations of images and text, enabling zero-shot image classification and text-to-image retrieval. Built entirely in PyTorch to demystify contrastive learning and multimodal architectures.
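
To make the contrastive objective concrete, here is a minimal sketch of a CLIP-style symmetric loss and a zero-shot classification step in PyTorch. It is not the project's actual code: the function names `clip_contrastive_loss` and `zero_shot_classify`, the fixed temperature of 0.07, and the assumption that both encoders have already been projected to the same embedding dimension are all illustrative choices.

```python
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over image-text similarities (hypothetical helper).

    image_emb, text_emb: (batch, dim) tensors where row i of each is a matched pair.
    temperature: assumed fixed here; it could also be a learnable parameter.
    """
    # L2-normalize so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; entry [i, j] compares image i with text j.
    logits = image_emb @ text_emb.t() / temperature

    # The matching caption for image i sits on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2


def zero_shot_classify(image_emb, class_text_emb):
    """Pick, for each image, the class prompt with the highest cosine similarity."""
    image_emb = F.normalize(image_emb, dim=-1)
    class_text_emb = F.normalize(class_text_emb, dim=-1)
    return (image_emb @ class_text_emb.t()).argmax(dim=-1)


if __name__ == "__main__":
    # Random tensors stand in for the outputs of the ViT and DistilBERT encoders.
    images = torch.randn(4, 64)
    texts = torch.randn(4, 64)
    print(clip_contrastive_loss(images, texts))
```

Text-to-image retrieval follows the same pattern as `zero_shot_classify` with the roles swapped: embed one query caption, compare it against a bank of image embeddings, and rank the images by similarity.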