OpenAI Whisper and its amazing fine-tuning capabilities.
FOSSASIA Summit, Singapore
Saturday, April 15, 2023
OpenAI Whisper and its awesome features
Fine-tuning and how to fine-tune Whisper?
Results of fine-tuning Whisper in Malayalam
AI Engineer & Team Lead @ Sentient.io
Volunteer @ Swathanthra Malayalam Computing (SMC)
Not affiliated with OpenAI
Whisper is one of the most under-rated models released by OpenAI.
It was open-sourced on September 21, 2022, with the release of the inference code and pre-trained model weights.
According to the research paper (p.2), the name Whisper is an abbreviation of WSPR:
Web-scale Supervised Pretraining for Speech Recognition.
About OpenAI Whisper Model
Whisper is a computer program which can listen to people talking and write down what they say.
Whisper can understand people speaking different languages and can even translate what they say into English.
English Speech Recognition
Whisper is competitive with state-of-the-art commercial and open-source systems. Diagram from the
Whisper research paper, p.9
A list of ideas for how to use Whisper in your own applications:
- English speech recognition
- Multi-lingual speech recognition
- Support for multiple tasks
- Can run on almost any device with whisper.cpp
- Awesome community plugins
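As a minimal sketch of the first two ideas, this is how transcription looks with the open-source whisper Python package (the model size and audio file name below are placeholders):

```python
# pip install -U openai-whisper
import whisper

# Load one of the pre-trained checkpoints (tiny / base / small / medium / large).
model = whisper.load_model("base")

# Transcribe an audio file; Whisper detects the spoken language automatically.
result = model.transcribe("speech.mp3")
print(result["text"])

# The same model can also translate non-English speech into English.
translated = model.transcribe("speech.mp3", task="translate")
print(translated["text"])
```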
Multi-lingual Speech recognition
The Whisper model is trained on 99 languages.
The OpenAI Whisper API supports just 57 languages, as performance for some languages is not very good.
More details can be found in Section 3.4 of the
Whisper research paper, pp.6-8. Zero-shot Whisper improves performance on Multilingual LibriSpeech (MLS) but is still significantly behind Maestro, XLS-R, and mSLAM on VoxPopuli. (
Whisper research paper, p.7)
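When the language is known in advance, it can be pinned instead of auto-detected. A small sketch, assuming a Malayalam audio file (the file name is a placeholder):

```python
import whisper

model = whisper.load_model("small")

# Force Malayalam ("ml") instead of relying on automatic language detection.
result = model.transcribe("malayalam_sample.wav", language="ml", task="transcribe")
print(result["text"])
```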
Runs on almost any device
Since Whisper followed the open-source route, Georgi Gerganov developed whisper.cpp, a port of OpenAI's Whisper model in C/C++.
It supports the below platforms:
Mac OS (Intel and ARM)
WebAssembly, etc.
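whisper.cpp itself is C/C++, but it can also be driven from a script by calling the compiled binary. A rough sketch, assuming whisper.cpp has been built and a ggml model downloaded as described in its README (the binary name, flags, and paths follow the upstream examples at the time and may change):

```python
import subprocess

# Run the whisper.cpp CLI on a 16-bit 16 kHz WAV file.
# The paths below are placeholders taken from the whisper.cpp README.
subprocess.run(
    [
        "./main",
        "-m", "models/ggml-base.en.bin",  # ggml-format model weights
        "-f", "samples/jfk.wav",          # input audio
    ],
    check=True,
)
```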
What is fine-tuning?
A pre-trained model is a large model that has already been trained on a particular task with a large amount of data. If we want to fit it to our specific dataset, we continue training the pre-trained model on our data to build a new model which works well for our task.
fast.ai lesson covering the steps in fine-tuning a text classifier model
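As a rough illustration of the same idea with Hugging Face transformers (the checkpoint and the tiny toy dataset below are purely illustrative), fine-tuning means loading pre-trained weights and continuing training on your own labelled examples:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Start from pre-trained weights instead of a randomly initialized model.
checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# A toy labelled dataset standing in for your real task data.
raw = Dataset.from_dict({
    "text": ["great movie", "terrible movie"],
    "label": [1, 0],
})
encoded = raw.map(
    lambda ex: tokenizer(ex["text"], truncation=True, padding="max_length", max_length=32)
)

# Continue training the pre-trained model on the new data.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="toy-finetuned-classifier", num_train_epochs=1),
    train_dataset=encoded,
)
trainer.train()
```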
Fine-tuning is still relevant
Why try fine-tuning Whisper?
For your problem, the open-source Whisper model may not give good results out of the box.
What are the steps for fine-tuning Whisper?
Prepare Feature Extractor, Tokenizer and Data
Training and evaluation
Building a demo (optional); see the code sketch below
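A condensed sketch of the first two steps, closely following the Hugging Face fine-tuning blog post; the checkpoint, dataset, and hyper-parameters here are illustrative, and the Common Voice 11 Malayalam data requires accepting the dataset terms on the Hub:

```python
from dataclasses import dataclass

from datasets import Audio, load_dataset
from transformers import (Seq2SeqTrainer, Seq2SeqTrainingArguments,
                          WhisperForConditionalGeneration, WhisperProcessor)

# 1. Prepare feature extractor, tokenizer and data.
processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="Malayalam", task="transcribe"
)
common_voice = load_dataset(
    "mozilla-foundation/common_voice_11_0", "ml", split="train+validation"
)
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16_000))

def prepare(batch):
    audio = batch["audio"]
    # Log-Mel input features for the encoder, token ids as decoder labels.
    batch["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch

common_voice = common_voice.map(prepare, remove_columns=common_voice.column_names)

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: WhisperProcessor

    def __call__(self, features):
        # Pad audio features and label token ids separately.
        input_features = [{"input_features": f["input_features"]} for f in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")
        label_features = [{"input_ids": f["labels"]} for f in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
        # Replace padding with -100 so it is ignored by the loss.
        batch["labels"] = labels_batch["input_ids"].masked_fill(
            labels_batch["attention_mask"].ne(1), -100
        )
        return batch

# 2. Training and evaluation.
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-small-ml",   # placeholder output / Hub repo name
    per_device_train_batch_size=16,
    learning_rate=1e-5,
    max_steps=4000,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=common_voice,
    data_collator=DataCollatorSpeechSeq2SeqWithPadding(processor=processor),
    tokenizer=processor.feature_extractor,
)
trainer.train()
```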
The Hugging Face team conducted a Whisper fine-tuning event for two weeks, from 5th December 2022 to 19th December 2022. The results were out on 23rd December 2022.
The goal was to fine-tune the Whisper model to build state-of-the-art speech recognition systems in the languages of our choice 🗣
Malayalam models in Whisper Event
For the language Malayalam, the results are as follows:
Malayalam models' performance in the Whisper event
Winning models in Malayalam in Whisper Event
The winning model for Common Voice:
The winning model for Fleurs:
I was not convinced
I was sceptical about the winning models because:
Achieving 10% WER in Malayalam is astonishing.
In Malayalam there is not even a single yardstick to compare against. Most previous work was done on proprietary datasets and not open-sourced.
Malayalam is a morphologically complex language, so even achieving 30% WER is a big deal.
I wanted to build something new
Time for a new adventure
Results on Benchmarking in Malayalam
Started with 6 fine-tuned models in Malayalam
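Benchmarking here boils down to computing the Word Error Rate (WER) and Character Error Rate (CER) on a held-out test split. A minimal sketch with the Hugging Face evaluate library (the checkpoint below is a placeholder, not necessarily one of the six benchmarked models, and this is not the exact benchmarking tool used):

```python
import evaluate
from datasets import Audio, load_dataset
from transformers import pipeline

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Placeholder checkpoint; substitute any fine-tuned Malayalam Whisper model.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# Common Voice 9 Malayalam test split, resampled to 16 kHz.
ds = load_dataset("mozilla-foundation/common_voice_9_0", "ml", split="test")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

predictions, references = [], []
for sample in ds:
    out = asr(sample["audio"]["array"])
    predictions.append(out["text"])
    references.append(sample["sentence"])

print("WER:", 100 * wer_metric.compute(predictions=predictions, references=references))
print("CER:", 100 * cer_metric.compute(predictions=predictions, references=references))
```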
Results on benchmarking in the Common Voice dataset
Output from benchmarking tool
Results in Common Voice dataset
WER in Common Voice-9 test split
CER in Common Voice dataset
CER in Common Voice-9 test split
Results on benchmarking in the Malayalam Speech Corpus dataset
Output from benchmarking tool
Results in Malayalam Speech Corpus dataset
WER in MSC
CER in Malayalam Speech Corpus
Character Error rate in MSC
WER in the Malayalam Common Voice-9 dataset:
tiny - 102.7
base - 122.9
small - 104.8
medium - 137.8
large - 107.1
large-v2 - 103.2
In Malayalam we have achieved phenomenal results with fine-tuned Whisper models.
We seem to have built really good ASR models suitable for production use-cases.
You can also do it in your own language, especially if it is a low-resource language.