The OpenAI CLIP model are really impressive and how it’s a foundation for stuff like stable diffusion is awesome. The thing about CLIP models which I am most impressed by is the wide range of applications it be used for like Semantic Video Search, Zero shot image classification, searching images in your gallery etc.

I recently started reading CLIP paper and paper claims to have very high accuracy in image clssification accuracy. To test that claim, I thought of trying it out that in a kaggle competition I had recently participated.

The kaggle competition is a Corn image classification competition and is asking to classify images of corn seeds into following categories:

  • pure
  • broken
  • discolored
  • silkcut

I used open_clip, an open source implementation of CLIP which is having higher accuracy compared to model weights released by OpenAI. Even after using one of the best trained CLIP models available it got me a classification accuracy score of 27.95% in private LB whereas Resnet or Convnext models could have given easily above 75% score.

Model Public LB Private LB Notebook link
ViT-B-32-quickgelu 0.16666 0.18397 link
ViT-H-14 0.28591 0.27955 link
Convnext model 0.76149 0.75386 link