OpenAI Whisper and its amazing power to do fine-tuning.

FOSSASIA Summit, Singapore

Kurian Benoy

Saturday, April 15, 2023


  • OpenAI Whisper and its awesome features
  • Fine-tuning and how to fine-tune Whisper?
  • Results on fine-tuning Whisper in Malayalam
  • Conclusion


  • AI Engineer & Team Lead @
  • Volunteer @ Swathanthra Malayalam Computing (SMC)
  • Open-source enthusiast
  • Not affiliated with OpenAI

OpenAI Whisper

  • Whisper is one of the most underrated models released by OpenAI.
  • It was open-sourced on September 21, 2022, with the release of the inference code and pre-trained model weights.

About OpenAI Whisper Model

  • Whisper is a computer program which can listen to people talking and write down what they say.
  • Whisper can understand people speaking different languages and can even translate what they say into English.
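Both abilities can be tried in a few lines. This is a minimal sketch, assuming the open-source `openai-whisper` Python package is installed; the function name `transcribe_and_translate` and the `audio_path` argument are made up for illustration.

```python
def transcribe_and_translate(audio_path, model=None, model_name="base"):
    """Return (transcript in the spoken language, English translation)."""
    if model is None:
        # Imported lazily; assumes `pip install openai-whisper` has been run.
        import whisper

        model = whisper.load_model(model_name)

    # Default task: write down what was said, in the language it was spoken in.
    transcript = model.transcribe(audio_path)["text"].strip()
    # task="translate" asks Whisper to output English text instead.
    english = model.transcribe(audio_path, task="translate")["text"].strip()
    return transcript, english
```

Calling `transcribe_and_translate("talk.mp3")` (a hypothetical file) would download the `base` weights on first use and then run both passes on the same audio.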

Whisper Models

Size     Parameters   Required VRAM   Relative speed
tiny     39 M         ~1 GB           ~32x
base     74 M         ~1 GB           ~16x
small    244 M        ~2 GB           ~6x
medium   769 M        ~5 GB           ~2x
large    1550 M       ~10 GB          1x
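
One way to read the table is to pick the largest checkpoint whose approximate VRAM need fits the GPU at hand. The sketch below hard-codes the figures from the table; the helper name `largest_model_that_fits` is made up for illustration, and real memory use will vary with batch size and precision.

```python
# Approximate VRAM needs (GB) copied from the table above, smallest model first.
WHISPER_MODELS = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large", 10),
]


def largest_model_that_fits(vram_gb):
    """Return the largest model whose listed VRAM need is <= vram_gb, or None."""
    best = None
    for name, required_gb in WHISPER_MODELS:
        if required_gb <= vram_gb:
            best = name  # later entries are larger models, so keep overwriting
    return best


print(largest_model_that_fits(6))
```

The larger checkpoints are slower (the relative speed column), so a smaller model may still be the better choice when latency matters more than accuracy.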

English Speech Recognition

Whisper is competitive with state-of-the-art commercial and open-source systems. Diagram from the Whisper research paper, p. 9.

Multi-lingual Speech recognition

  • The Whisper model is trained on 99 languages.
  • The OpenAI Whisper API supports only 57 of them, because performance on the remaining languages is not good enough.

Runs in almost any device

  • Because Whisper took the open-source route, Georgi Gerganov was able to develop whisper.cpp, a port of OpenAI’s Whisper model in C/C++.

  • It supports the below platforms:

  1. macOS (Intel and ARM)
  2. iOS
  3. Android
  4. Linux / FreeBSD
  5. WebAssembly etc.

Awesome community plugins

What is fine tuning?

A pre-trained model is a large model that has already been trained on a specific task. To fit it to our own dataset, we continue training from the pre-trained weights and get a new model that works very well for our task.

Picture from a fast.ai lesson covering the steps in fine-tuning a text classifier model

Fine tuning is still relevant

Why try fine-tuning in Whisper?

  • When the open-source Whisper model doesn’t give good results on your problem.

What are the steps for fine-tuning Whisper?

Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers

What are the steps for fine-tuning Whisper?

  1. Preparing Environment
  2. Load dataset
  3. Prepare Feature Extractor, Tokenizer and Data
  4. Training and evaluation
  5. Building a demo (optional)
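
Step 3 is the least obvious one, so here is a rough sketch of it using 🤗 Transformers. It is an illustration under assumptions, not the exact code from the blog post: the function names are made up, the checkpoint `openai/whisper-small` is just an example, and the column names `audio` and `sentence` assume a Common Voice style dataset.

```python
def load_whisper_components(checkpoint="openai/whisper-small", language="Malayalam"):
    """Load the processor (feature extractor + tokenizer) and the model to fine-tune."""
    # Imported lazily; assumes `pip install transformers` has been run.
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    processor = WhisperProcessor.from_pretrained(
        checkpoint, language=language, task="transcribe"
    )
    model = WhisperForConditionalGeneration.from_pretrained(checkpoint)
    return processor, model


def prepare_example(example, feature_extractor, tokenizer):
    """Turn one dataset row into the inputs and labels used for training."""
    audio = example["audio"]  # assumed already resampled to 16 kHz, as Whisper expects
    features = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"])
    return {
        "input_features": features.input_features[0],  # log-Mel spectrogram
        "labels": tokenizer(example["sentence"]).input_ids,  # target token ids
    }
```

With the real libraries, `processor.feature_extractor` and `processor.tokenizer` would be passed to `prepare_example`, and the function would be mapped over a dataset loaded in step 2. Step 4 would then hand the prepared data and the model to a trainer such as `Seq2SeqTrainer`.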

Whisper Event

  • The Hugging Face team ran a Whisper fine-tuning event for 2 weeks, from 5th December 2022 to 19th December 2022. The results were out on 23rd December 2022.
  • The goal was to fine-tune the Whisper model to build state-of-the-art speech recognition systems in the languages of our choice 🗣

Malayalam models in Whisper Event

  • For the language Malayalam, the results are as follows:

Malayalam model performance in the Whisper Event, according to the leaderboard

Winning models in Malayalam in Whisper Event

  • The winning model for Common voice: thennal/whisper-medium-ml
  • The winning model for Fleurs: parambharat/whisper-small-ml

I was not convinced

I was sceptical about the winning models because:

  1. Achieving 10% WER in Malayalam is astonishing.
  2. Malayalam does not have even a single yardstick to compare against. Most previous work was done on proprietary datasets and was not open-sourced.
  3. Malayalam is a morphologically complex language, so even achieving 30% WER is a big deal.
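
For readers new to the metric: word error rate (WER) is the number of word substitutions, deletions and insertions needed to turn the model output into the reference transcript, divided by the number of reference words. Character error rate (CER) is the same idea at character level. The sketch below is a plain-Python illustration of that standard definition, assuming non-empty references and no text normalization; it is not the code used by the leaderboard or by any benchmarking tool.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


print(f"WER: {wer('the cat sat on the mat', 'the cat sat on mat'):.2%}")
print(f"CER: {cer('abcd', 'abed'):.2%}")
```

Because a single wrong suffix makes a whole word count as an error, WER tends to be harsh on languages with long, heavily inflected words, which is why CER is often reported next to it.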

I was not convinced

  1. Didn’t trust the Hugging Face way of evaluating models.

thennal/whisper-medium-ml model card readme

Last commit in thennal/whisper-medium-ml

I wanted to build something new

Time for a new adventure

Results on Benchmarking in Malayalam

  • Started with 6 fine-tuned models in Malayalam
  1. thennal/whisper-medium-ml
  2. parambharat/whisper-tiny-ml
  3. parambharat/whisper-base-ml
  4. parambharat/whisper-small-ml
  5. anuragshas/whisper-large-v2-ml
  6. DrishtiSharma/whisper-large-v2-malayalam

Benchmarking results on the Common Voice dataset

Output from benchmarking tool

Results in Common Voice dataset

WER in Common Voice-9 test split

CER in Common Voice dataset

CER in Common Voice-9 test split

Benchmarking results on the Malayalam Speech Corpus dataset

Output from benchmarking tool

Results in Malayalam Speech Corpus dataset


CER in Malayalam Speech Corpus

Character Error rate in MSC


  • In Malayalam, we have achieved phenomenal results with fine-tuned Whisper models.
  • We seem to have built a really good ASR system suitable for production use cases.
  • You can do the same in your own language, especially if it is a low-resource language.

Thanks to

Tributes to Areeb Jamal

Source: Areeb Jamal Forever in our Hearts and our Memory - FOSSASIA blog