OpenAI Whisper’s amazing power of fine-tuning, demonstrated on Malayalam
Summit 2023 @ Indian Institute of Information Technology, Kottayam (IIIT-K)
Saturday, June 10, 2023
OpenAI Whisper
- I think Whisper is the most underrated model released by OpenAI.
- It was open-sourced on September 21, 2022, with the release of the inference code and pre-trained model weights.
About OpenAI Whisper Model
- Whisper is a computer program which can listen to people talking and write down what they say. (Automatic Speech Recognition Model)
- Whisper can understand people speaking different languages and can even translate what they say into English. (Supports transcription and translation to English; see the sketch below)
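To make this concrete, here is a minimal sketch of transcription and translation with the open-source `openai-whisper` package; the model size and the file name `speech.mp3` are placeholders, not anything from the talk.

```python
# A minimal sketch using the open-source `openai-whisper` package
# (pip install -U openai-whisper); "speech.mp3" is a placeholder path.
import whisper

model = whisper.load_model("small")  # downloads pre-trained weights

# Transcribe in the original spoken language
result = model.transcribe("speech.mp3")
print(result["text"])

# Or translate the speech into English instead
result_en = model.transcribe("speech.mp3", task="translate")
print(result_en["text"])
```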
Malayalam is a complex language
Whisper Event
- The Hugging Face team conducted a Whisper fine-tuning event for two weeks, from 5th December 2022 to 19th December 2022. The results were announced on 23rd December 2022.
- The goal was to fine-tune the Whisper model to build state-of-the-art speech recognition systems in the languages of our choice 🗣 (a compressed sketch of the recipe follows)
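The event followed the recipe in the Hugging Face "Fine-Tune Whisper" blog post; below is a compressed, hedged sketch of that setup. The checkpoint and hyperparameters are illustrative, not any winner's exact configuration, and the dataset preparation and data collator are elided.

```python
# Sketch of the event-style fine-tuning setup (after the Hugging Face
# "Fine-Tune Whisper" blog post). All values here are illustrative.
from transformers import (
    Seq2SeqTrainingArguments,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)

# Load a pre-trained checkpoint and a processor configured for Malayalam
processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="Malayalam", task="transcribe"
)
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-small-ml",   # illustrative output directory
    per_device_train_batch_size=16,
    learning_rate=1e-5,
    max_steps=4000,
    fp16=True,
    predict_with_generate=True,
)
# A Seq2SeqTrainer is then built with the model, these arguments, the
# processed speech dataset splits, and a padding data collator, and
# trainer.train() fine-tunes the model.
```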
Malayalam models produced in Whisper Event
- For the language Malayalam, the results are as follows:
Performance of Malayalam models in the Whisper Event, according to the leaderboard
Winning models in Malayalam in Whisper Event
- The winning model for Common Voice:
thennal/whisper-medium-ml
- The winning model for FLEURS (a usage sketch of a winning checkpoint follows):
parambharat/whisper-small-ml
I was not convinced
- Didn’t trust the Hugging Face way of evaluating models.
thennal/whisper-medium-ml model card README
Last commit in thennal/whisper-medium-ml
Objective of my benchmarking
- To test whether a 10% WER was achievable on the available academic datasets.
Datasets
- Common Voice 11 Malayalam subset
- SMC Malayalam Speech Corpus
Metrics for evaluating ASR models
- ASR evaluation relies on comparison between the ground truth and the ASR output.
- Common metrics for ASR evaluation, which are popular and good enough, are (a worked example follows this list):
1. Word Error Rate (WER)
2. Character Error Rate (CER)
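To make the two metrics concrete, here is a small worked example using the Hugging Face `evaluate` library; the toy sentences are made up for illustration.

```python
# WER and CER on a toy English pair, via the `evaluate` library.
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

references = ["the cat sat on the mat"]
predictions = ["the cat sat on mat"]

# WER = (substitutions + deletions + insertions) / words in reference
print(wer_metric.compute(references=references, predictions=predictions))
# -> 0.1667: one deleted word out of six reference words

# CER applies the same edit-distance formula at the character level
print(cer_metric.compute(references=references, predictions=predictions))
```

Lower is better for both; CER is usually the kinder metric for an agglutinative language like Malayalam, where a single wrong suffix flips an entire word in WER.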
I wanted to build something new
Time for a new adventure
Methodology for benchmarking
- Create it as a Python library, so further Whisper-based transformer models can be benchmarked (see the sketch after this list).
- Calculate WER, CER, model size, and the time taken to benchmark the model for the listed datasets.
- Build a reproducible approach, so the benchmarking results are stored as datasets.
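Under those constraints, the core of the library boils down to a loop like the hedged sketch below. The real package (linked later in these slides) differs in its details; in particular, the `"ml"` config and the `"sentence"` column name follow Common Voice conventions and are assumptions here.

```python
# Hedged sketch of the benchmarking loop; the actual library
# (malayalam_asr_benchmarking) differs in its details.
import time

import evaluate
from datasets import Audio, load_dataset
from transformers import pipeline


def benchmark_model(model_id: str, dataset_id: str,
                    config: str = "ml", split: str = "test") -> dict:
    """Transcribe a dataset with a Whisper checkpoint and report WER/CER."""
    asr = pipeline("automatic-speech-recognition", model=model_id)
    ds = load_dataset(dataset_id, config, split=split)
    ds = ds.cast_column("audio", Audio(sampling_rate=16_000))  # Whisper rate

    start = time.time()
    predictions = [
        asr({"raw": x["audio"]["array"],
             "sampling_rate": x["audio"]["sampling_rate"]})["text"]
        for x in ds
    ]
    elapsed = time.time() - start

    wer = evaluate.load("wer").compute(references=ds["sentence"],
                                       predictions=predictions)
    cer = evaluate.load("cer").compute(references=ds["sentence"],
                                       predictions=predictions)
    return {"model": model_id, "wer": wer, "cer": cer, "time_s": elapsed}


# e.g. benchmark_model("thennal/whisper-medium-ml",
#                      "mozilla-foundation/common_voice_11_0")
```

Running a loop like this over each checkpoint on the next slide produces the result tables that follow.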
Benchmarked models
- thennal/whisper-medium-ml
- parambharat/whisper-tiny-ml
- parambharat/whisper-base-ml
- parambharat/whisper-small-ml
- anuragshas/whisper-large-v2-ml
- DrishtiSharma/whisper-large-v2-malayalam
Results of benchmarking on the Common Voice dataset
Output from benchmarking tool
WER in Common Voice dataset
Word Error Rate in Common Voice-9 test split
CER in Common Voice dataset
Character Error Rate in Common Voice-9 test split
Results of benchmarking on the Malayalam Speech Corpus dataset
Output from benchmarking tool
WER in Malayalam Speech Corpus
Word Error Rate in MSC
CER in Malayalam Speech Corpus
Character Error Rate in MSC
Links to Project
GitHub project
https://github.com/kurianbenoy/malayalam_asr_benchmarking
Benchmarking results
- Results on SMC Malayalam Speech Corpus
https://huggingface.co/datasets/kurianbenoy/malayalam_msc_benchmarking/tree/main
- Results on Common Voice 11 (see the loading sketch below)
https://huggingface.co/datasets/kurianbenoy/malayalam_common_voice_benchmarking
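Since the results live on the Hub as datasets, they can be pulled back down for analysis; a hedged sketch, assuming the stored files are in a format `load_dataset` can parse:

```python
# Loading the published benchmarking results back from the Hub;
# assumes the repo's files are in a `load_dataset`-parsable format.
from datasets import load_dataset

results = load_dataset("kurianbenoy/malayalam_common_voice_benchmarking")
print(results)
```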
Future Ideas for Benchmarking
- Something very similar to the Open LLM Leaderboard, with results for the latest Malayalam speech models.
- It should include results for Kaldi, Meta’s MMS, Wav2Vec, etc.
Open LLM Leaderboard in Hugging Face Spaces
Conclusion
- In Malayalam, we have achieved phenomenal results with fine-tuned Whisper models.
- The best model is:
thennal/whisper-medium-ml
- I think it seems to be a good ASR model, suitable for production use-cases.
- You can do the same for your own language, especially if it is a low-resource language.