Kurian Benoy – OpenAI Whisper and its amazing power to do fine-tuning.

Outline

OpenAI Whisper and its awesome features
Fine-tuning and how to fine-tune Whisper?
Results on fine-tuning Whisper in Malayalam
Conclusion

$whoami

AI Engineer & Team Lead @ Sentient.io
Volunteer @ Swathanthra Malayalam Computing (SMC)
Open-source enthusiast
Not affiliated with OpenAI

OpenAI Whisper

Whisper is one of the most underrated models released by OpenAI.
It was open-sourced on September 21, 2022, with the release of the inference code and pre-trained model weights.

About OpenAI Whisper Model

Whisper is a computer program which can listen to people talking and write down what they say.
Whisper can understand people speaking different languages and can even translate what they say into English.
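As a minimal sketch of the two abilities above (transcribing, and translating into English), assuming the open-source `openai-whisper` Python package is installed. The helper names `build_options` and `run_whisper` and the file name in the usage note are hypothetical, chosen only for illustration.

```python
def build_options(language=None, translate=False):
    """Build keyword arguments for Whisper's transcribe call."""
    # "transcribe" keeps the spoken language; "translate" outputs English text.
    options = {"task": "translate" if translate else "transcribe"}
    if language is not None:
        options["language"] = language  # e.g. "ml" for Malayalam
    return options


def run_whisper(audio_path, model_name="base", language=None, translate=False):
    """Load a Whisper checkpoint and return the recognised text for one file."""
    # Imported lazily so build_options works even without the package installed.
    import whisper  # pip install openai-whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path, **build_options(language, translate))
    return result["text"]
```

For example, `run_whisper("speech.mp3", language="ml")` would transcribe a (hypothetical) Malayalam recording, and adding `translate=True` would ask for English output instead.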

Whisper Models

Size	Parameters	Required VRAM	Relative speed
tiny	39 M	~1 GB	~32x
base	74 M	~1 GB	~16x
small	244 M	~2 GB	~6x
medium	769 M	~5 GB	~2x
large	1550 M	~10 GB	1x
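One way to read the table is to pick the largest model that fits on the available GPU. The sketch below encodes the approximate figures from the table; the function name `largest_model_that_fits` is hypothetical, and real memory use may differ from these rough values.

```python
# (size, parameters in millions, approx. required VRAM in GB, approx. relative speed)
WHISPER_MODELS = [
    ("tiny", 39, 1, 32),
    ("base", 74, 1, 16),
    ("small", 244, 2, 6),
    ("medium", 769, 5, 2),
    ("large", 1550, 10, 1),
]


def largest_model_that_fits(vram_gb):
    """Return the size name with the most parameters that fits in vram_gb, or None."""
    fitting = [m for m in WHISPER_MODELS if m[2] <= vram_gb]
    if not fitting:
        return None
    return max(fitting, key=lambda m: m[1])[0]
```

With about 4 GB of VRAM this picks `small`; with 10 GB or more it picks `large`.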

English Speech Recognition

Whisper is competitive with state-of-the-art commercial and open-source systems. Diagram from the Whisper research paper, p. 9.

Multi-lingual Speech recognition

The Whisper model is trained on 99 languages.
The OpenAI Whisper API supports just 57 languages, because performance in the remaining languages is not good enough.

Runs on almost any device

Because Whisper took the open-source route, Georgi Gerganov was able to develop whisper.cpp, a port of OpenAI’s Whisper model in C/C++.
It supports the below platforms:

Mac OS (Intel and ARM)
iOS
Android
Linux/FreeBSD
WebAssembly, etc.

Awesome community plugins

Word-level timestamps with whisper-timestamped, whisperX, etc.
Fine-tuned Whisper models are achieving SOTA results in many languages
Speaker diarization
Audio classification using OpenAI’s Whisper

What is fine tuning?

A pre-trained model is a large model that has already been trained on some task. To adapt it to our own dataset, we continue training from the pre-trained weights and so build a new model that works very well for our task.
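The idea can be illustrated with a deliberately tiny toy, which is not how Whisper itself is trained: a one-parameter model `y = w * x`, where the "pre-trained" weight is assumed to be already close to what the new task needs. All numbers here are made up for illustration.

```python
def train(w, data, steps, lr=0.01):
    """Plain gradient descent on mean squared error for the model y = w * x."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


# Our "specific dataset": the target task follows y = 2.5 * x.
target_data = [(x, 2.5 * x) for x in (1.0, 2.0, 3.0)]

pretrained_w = 2.0  # pretend this came from training on a large, related dataset
fine_tuned = train(pretrained_w, target_data, steps=5)  # start from pre-trained weight
from_scratch = train(0.0, target_data, steps=5)         # start from nothing

print(f"fine-tuned w = {fine_tuned:.3f}, from-scratch w = {from_scratch:.3f}")
```

With the same small training budget, the run that starts from the pre-trained weight ends up closer to the target than the run that starts from zero.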

Picture from a fast.ai lesson covering the steps in fine-tuning a text classifier model

Fine tuning is still relevant

Why try fine-tuning in Whisper?

When the open-source Whisper model doesn’t give good results on your problem.

What are the steps for fine-tuning Whisper?

Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers

Preparing Environment
Load dataset
Prepare Feature Extractor, Tokenizer and Data
Training and evaluation
Building a demo (optional)
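The loading and data-preparation steps above can be sketched roughly as follows, assuming the 🤗 Transformers library. The checkpoint name, the helper names, and the `audio` and `sentence` column names (the layout used by Common Voice style datasets) are assumptions for illustration; training itself, for example with a sequence-to-sequence trainer, is left out.

```python
def load_processor_and_model(checkpoint="openai/whisper-small", language="Malayalam"):
    """Load the feature extractor + tokenizer (bundled as a processor) and the model."""
    # Imported lazily so prepare_example can be used without transformers installed.
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    processor = WhisperProcessor.from_pretrained(
        checkpoint, language=language, task="transcribe"
    )
    model = WhisperForConditionalGeneration.from_pretrained(checkpoint)
    return processor, model


def prepare_example(example, processor):
    """Turn one dataset row into model inputs (audio features) and labels (token ids)."""
    audio = example["audio"]
    # Raw waveform -> log-Mel spectrogram features expected by Whisper.
    features = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    )
    return {
        "input_features": features.input_features[0],
        # Reference transcript -> label token ids.
        "labels": processor.tokenizer(example["sentence"]).input_ids,
    }
```

A function like `prepare_example` would typically be applied to every row of the loaded dataset before training and evaluation.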

Whisper Event

The Hugging Face team ran a two-week Whisper fine-tuning event from 5th December 2022 to 19th December 2022. The results were announced on 23rd December 2022.
The goal was to fine-tune the Whisper model to build state-of-the-art speech recognition systems in the languages of our choice 🗣

Malayalam models in Whisper Event

For the language Malayalam, the results are as follows:

Malayalam models performance in whisper event according to leaderboard

Winning models in Malayalam in Whisper Event

The winning model for Common voice: thennal/whisper-medium-ml
The winning model for Fleurs: parambharat/whisper-small-ml

I was not convinced

I was sceptical about the winning models because:

Achieving 10% WER in Malayalam is astonishing.
In Malayalam there is not even a single yardstick to compare against. Most previous work was done on proprietary datasets and was not open-sourced.
Malayalam is a morphologically complex language, so even achieving 30% WER is a big deal.
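To make the metric concrete, here is a minimal sketch of word error rate (WER) and character error rate (CER), both computed as edit distance divided by the reference length. It uses made-up English strings and no text normalisation, so it illustrates the definitions rather than any particular evaluation pipeline.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate; assumes a non-empty reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate; assumes a non-empty reference."""
    return edit_distance(reference, hypothesis) / len(reference)


reference = "the unbelievably long word"
hypothesis = "the unbelievable long word"
print(f"WER = {wer(reference, hypothesis):.2f}, CER = {cer(reference, hypothesis):.3f}")
```

A single wrong character makes the whole word count as an error, so one slip in a four-word sentence already costs 25% WER while the CER stays low. In a language with long, heavily inflected words, this is why WER tends to be high.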

I was not convinced

Didn’t trust the Hugging Face way of evaluating models.

thennal/whisper-medium-ml model card readme

Last commit in thennal/whisper-medium-ml

I wanted to build something new

New GitHub project for Malayalam ASR benchmarking

Time for a new adventure

Results on Benchmarking in Malayalam

Started with 6 fine-tuned models in Malayalam

thennal/whisper-medium-ml
parambharat/whisper-tiny-ml
parambharat/whisper-base-ml
parambharat/whisper-small-ml
anuragshas/whisper-large-v2-ml
DrishtiSharma/whisper-large-v2-malayalam

Compared them with the 6 model versions released by OpenAI.
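A comparison like this can be organised as a small loop over model IDs. The sketch below is not the benchmarking tool from the talk; it is a hypothetical harness that assumes the 🤗 Transformers `pipeline` API for transcription and takes the scoring function (for example a WER implementation) as a parameter.

```python
def hf_transcriber(model_id):
    """Return a function mapping an audio file path to text for one Hub model."""
    # Imported lazily so the harness can be exercised without transformers installed.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=model_id)
    return lambda audio_path: asr(audio_path)["text"]


def benchmark(model_ids, samples, score, make_transcriber=hf_transcriber):
    """Score every model on every sample and return (model_id, mean score), best first.

    samples: list of (audio_path, reference_text) pairs.
    score:   function (reference, hypothesis) -> error rate, lower is better.
    """
    results = []
    for model_id in model_ids:
        transcribe = make_transcriber(model_id)
        scores = [score(ref, transcribe(path)) for path, ref in samples]
        results.append((model_id, sum(scores) / len(scores)))
    return sorted(results, key=lambda item: item[1])
```

Note that this averages per-utterance scores, which is not identical to a corpus-level error rate computed over all words at once. Either choice should be applied consistently across the models being compared.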

Results of benchmarking on the Common Voice dataset

Output from benchmarking tool

Results in Common Voice dataset

WER in Common Voice-9 test split

CER in Common Voice dataset

CER in Common Voice-9 test split

Results of benchmarking on the Malayalam Speech Corpus dataset

Output from benchmarking tool

Results in Malayalam Speech Corpus dataset

WER in MSC

CER in Malayalam Speech Corpus

Character Error rate in MSC

Conclusion

In Malayalam, we have achieved phenomenal results with fine-tuned Whisper models.
We seem to have built really good ASR models suitable for production use cases.
You can also do it in your own language, especially if it is a low-resource language.

Thanks to

OpenAI team - Alec Radford, Jong Wook Kim, Christine McLeavey and the other authors of the Whisper paper
Hugging Face team - Sanchit Gandhi, Nicolas Patry, Vaibhav Srivastav, etc.
Kavya Manohar and other members of the Swathanthra Malayalam Computing community.

Tributes to Areeb Jamal

Source: Areeb Jamal Forever in our Hearts and our Memory - FOSSASIA blog