How well do CLIP models classify corn seeds?

Tags: coding, Deep learning, experience, kaggle

Published: November 7, 2022

The OpenAI CLIP model is really impressive, and the fact that it serves as a foundation for models like Stable Diffusion is awesome. What impresses me most about CLIP is the wide range of applications it can be used for: semantic video search, zero-shot image classification, searching images in your gallery, and so on.

I recently started reading the CLIP paper, which claims very high zero-shot image classification accuracy. To test that claim, I decided to try it out on a Kaggle competition I had recently participated in.

The Kaggle competition is a corn image classification competition that asks participants to classify images of corn seeds into the following categories:

I used open_clip, an open-source implementation of CLIP whose pretrained weights achieve higher accuracy than the model weights released by OpenAI.
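
For reference, here is a minimal zero-shot classification sketch with open_clip. The checkpoint tag `laion2b_s32b_b79k` is one of the LAION-2B ViT-H-14 weights in open_clip; the class names and image path below are placeholders I made up, not the competition's actual labels:

```python
import torch
import open_clip
from PIL import Image

# Load ViT-H-14 with LAION-2B weights plus its image preprocessing pipeline.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-H-14")
model.eval()

# Placeholder class names -- substitute the competition's actual categories.
class_names = ["pure", "broken", "discolored", "silkcut"]
text = tokenizer([f"a photo of a {c} corn seed" for c in class_names])

image = preprocess(Image.open("seed.jpg")).unsqueeze(0)  # hypothetical file

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity between the image and each class prompt,
    # turned into a probability distribution over the classes.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(class_names[probs.argmax(dim=-1).item()])
```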

Even with one of the most accurate CLIP models available (ViT-H-14), zero-shot classification got me a score of 27.95% on the private LB, whereas a ResNet or ConvNeXt model could easily have scored above 75%.

| Model              | Public LB | Private LB | Notebook |
|--------------------|-----------|------------|----------|
| ViT-B-32-quickgelu | 0.16666   | 0.18397    | link     |
| ViT-H-14           | 0.28591   | 0.27955    | link     |
| ConvNeXt           | 0.76149   | 0.75386    | link     |

UPDATE

When I shared these results on Twitter, YiYi Xu suggested trying linear probing with CLIP. She pointed out that I was not comparing apples to apples: I was comparing zero-shot CLIP against a fine-tuned ConvNeXt model. To level the playing field, I should use linear probing, which uses the training data to fit a logistic regression classifier on top of the frozen CLIP image features. A sketch of what that looks like follows.
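
This is a minimal linear-probing sketch, reusing the `model` and `preprocess` objects from the zero-shot snippet above; the `train_images`, `train_labels`, and `test_images` variables are hypothetical placeholders for however you load the competition data:

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

def extract_features(images):
    """Encode a list of PIL images with the frozen CLIP image tower."""
    feats = []
    with torch.no_grad():
        for img in images:
            x = preprocess(img).unsqueeze(0)
            f = model.encode_image(x)
            f /= f.norm(dim=-1, keepdim=True)  # L2-normalize each embedding
            feats.append(f.squeeze(0).cpu().numpy())
    return np.stack(feats)

# train_images / train_labels / test_images are hypothetical placeholders.
X_train = extract_features(train_images)
X_test = extract_features(test_images)

# The "linear probe": a logistic regression trained on frozen CLIP features.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train_labels)
preds = clf.predict(X_test)
```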

Based on this, I applied linear probing on the dataset. My updated results are the following:

| Model                        | Public LB | Private LB | Notebook |
|------------------------------|-----------|------------|----------|
| Zero-shot ViT-B-32-quickgelu | 0.16666   | 0.18397    | link     |
| Zero-shot ViT-H-14           | 0.28591   | 0.27955    | link     |
| Linear probing w/ ViT-H-14   | 0.71982   | 0.72583    | link     |
| ConvNeXt                     | 0.76149   | 0.75386    | link     |

Note: This article was originally published as a Kaggle discussion post here.