The OpenAI CLIP model is really impressive, and the fact that it serves as a foundation for models like Stable Diffusion is awesome. What impresses me most about CLIP is the wide range of applications it can be used for: semantic video search, zero-shot image classification, searching images in your gallery, and so on.
I recently started reading the CLIP paper, which claims very high accuracy in zero-shot image classification. To test that claim, I thought of trying it out on a Kaggle competition I had recently participated in.
The Kaggle competition is a corn image classification competition that asks you to classify images of corn seeds into the following categories:
- pure
- broken
- discolored
- silkcut
I used open_clip, an open-source implementation of CLIP whose released weights achieve higher accuracy than the model weights released by OpenAI.
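For reference, here is a minimal sketch of what zero-shot classification with open_clip looks like. The prompt template, the laion2b_s32b_b79k weights, and the file name seed.jpg are illustrative assumptions; the exact setup I used is in the linked notebooks.

```python
import torch
import open_clip
from PIL import Image

# Load pretrained CLIP weights and the matching preprocessing transforms.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-H-14")
model.eval()

# One text prompt per corn seed class.
classes = ["pure", "broken", "discolored", "silkcut"]
text = tokenizer([f"a photo of a {c} corn seed" for c in classes])

# "seed.jpg" is a placeholder for one competition image.
image = preprocess(Image.open("seed.jpg")).unsqueeze(0)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity between the image and each class prompt.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(classes[probs.argmax().item()])
```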
Even with one of the most accurate CLIP models available (ViT-H-14), zero-shot classification got me a score of only 27.95% on the private LB, whereas a ResNet or ConvNeXt model could easily have scored above 75%.
| Model | Public LB | Private LB | Notebook link |
|---|---|---|---|
| ViT-B-32-quickgelu | 0.16666 | 0.18397 | link |
| ViT-H-14 | 0.28591 | 0.27955 | link |
| ConvNeXt model | 0.76149 | 0.75386 | link |
UPDATE
When I shared these results on Twitter, YiYi Xu suggested trying out linear probing with CLIP. She pointed out that I was not comparing apples to apples: I was using CLIP zero-shot while comparing it against a fine-tuned ConvNeXt model. To level the playing field, I should use linear probing, which uses the training data to fit a logistic regression model on top of the frozen CLIP image features.
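As a rough sketch of what that looks like (assuming the `model` and `preprocess` objects from the earlier snippet, and lists of PIL training/test images passed in as arguments; the real pipeline is in the linked notebook):

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

def extract_features(model, preprocess, images):
    """Encode PIL images into L2-normalized CLIP feature vectors."""
    feats = []
    with torch.no_grad():
        for img in images:
            f = model.encode_image(preprocess(img).unsqueeze(0))
            f /= f.norm(dim=-1, keepdim=True)
            feats.append(f.squeeze(0).cpu().numpy())
    return np.stack(feats)

def linear_probe(model, preprocess, train_images, train_labels, test_images):
    """Fit a logistic regression on frozen CLIP features, then predict."""
    X_train = extract_features(model, preprocess, train_images)
    X_test = extract_features(model, preprocess, test_images)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, train_labels)
    return clf.predict(X_test)
```

The CLIP backbone stays frozen here; only the logistic regression head is trained, which is what makes it a fairer comparison against a fully fine-tuned model than pure zero-shot.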
Based on this, I applied linear probing to the dataset. My updated results are the following:
| Model | Public LB | Private LB | Notebook link |
|---|---|---|---|
| Zero-shot ViT-B-32-quickgelu | 0.16666 | 0.18397 | link |
| Zero-shot ViT-H-14 | 0.28591 | 0.27955 | link |
| Linear probing w/ ViT-H-14 | 0.71982 | 0.72583 | link |
| ConvNeXt model | 0.76149 | 0.75386 | link |
Note: This article was originally published in a Kaggle discussion here