OpenAI Whisper and its amazing fine-tuning capabilities.
FOSSASIA Summit, Singapore
Saturday, April 15, 2023
OpenAI Whisper and its awesome features
Fine-tuning and how to fine-tune Whisper?
Results of fine-tuning Whisper in Malayalam
AI Engineer & Team Lead @ Sentient.io
Volunteer @ Swathanthra Malayalam Computing (SMC)
Not affiliated with OpenAI
Whisper is one of the most under-rated models released by OpenAI.
It was open-sourced on September 21, 2022, with the release of the inference code and pre-trained model weights.
According to the research paper (p.2), the name Whisper is an abbreviation of WSPR:
Web-scale Supervised Pretraining for Speech Recognition.
About OpenAI Whisper Model
Whisper is a computer program which can listen to people talking and write down what they say.
Whisper can understand people speaking different languages and can even translate what they say into English.
English Speech Recognition
Whisper is competitive with state-of-the-art commercial and open-source systems. Diagram from the
Whisper research paper, p.9
A list of ideas for how to use Whisper in your own applications:
- English speech recognition
- Multi-lingual speech recognition
- Support for multiple tasks
- Can run on almost any device with whisper.cpp
- Awesome community plugins
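As a minimal sketch of the first two ideas, this is how transcription looks with the open-source whisper Python package (the model size and audio file name below are placeholders):

```python
# pip install -U openai-whisper
import whisper

# Load one of the pre-trained checkpoints (tiny / base / small / medium / large).
model = whisper.load_model("base")

# Transcribe an audio file; Whisper detects the spoken language automatically.
result = model.transcribe("speech.mp3")
print(result["text"])

# The same model can also translate non-English speech into English.
translated = model.transcribe("speech.mp3", task="translate")
print(translated["text"])
```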
Multi-lingual Speech recognition
The Whisper model is trained on 99 languages.
The OpenAI Whisper API supports just 57 languages, as performance for some languages is not very good.
More details can be found in Section 3.4 of the
Whisper research paper, pp.6-8. Zero-shot Whisper improves performance on Multilingual LibriSpeech (MLS) but is still significantly behind Maestro, XLS-R, and mSLAM on VoxPopuli. (
Whisper research paper, p.7)
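When the language is known in advance, it can be pinned instead of auto-detected. A small sketch, assuming a Malayalam audio file (the file name is a placeholder):

```python
import whisper

model = whisper.load_model("small")

# Force Malayalam ("ml") instead of relying on automatic language detection.
result = model.transcribe("malayalam_sample.wav", language="ml", task="transcribe")
print(result["text"])
```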
Runs on almost any device
Since Whisper followed the open-source route, Georgi Gerganov developed whisper.cpp, a port of OpenAI's Whisper model in C/C++.
It supports the below platforms:
Mac OS (Intel and ARM)
WebAssembly, etc.
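whisper.cpp itself is C/C++, but it can also be driven from a script by calling the compiled binary. A rough sketch, assuming whisper.cpp has been built and a ggml model downloaded as described in its README (the binary name, flags, and paths follow the upstream examples at the time and may change):

```python
import subprocess

# Run the whisper.cpp CLI on a 16-bit 16 kHz WAV file.
# The paths below are placeholders taken from the whisper.cpp README.
subprocess.run(
    [
        "./main",
        "-m", "models/ggml-base.en.bin",  # ggml-format model weights
        "-f", "samples/jfk.wav",          # input audio
    ],
    check=True,
)
```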
What is fine-tuning?
A pre-trained model is a large model that has already been trained on a particular task with a large amount of data. If we want to fit it to our specific dataset, we continue training the pre-trained model on our data to build a new model which works well for our task.
fast.ai lesson covering the steps in fine-tuning a text classifier model
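As a rough illustration of the same idea with Hugging Face transformers (the checkpoint and the tiny toy dataset below are purely illustrative), fine-tuning means loading pre-trained weights and continuing training on your own labelled examples:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Start from pre-trained weights instead of a randomly initialized model.
checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# A toy labelled dataset standing in for your real task data.
raw = Dataset.from_dict({
    "text": ["great movie", "terrible movie"],
    "label": [1, 0],
})
encoded = raw.map(
    lambda ex: tokenizer(ex["text"], truncation=True, padding="max_length", max_length=32)
)

# Continue training the pre-trained model on the new data.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="toy-finetuned-classifier", num_train_epochs=1),
    train_dataset=encoded,
)
trainer.train()
```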
Fine-tuning is still relevant
Why try fine-tuning Whisper?
For your problem, the open-source Whisper model may not give good results out of the box.
What are the steps for fine-tuning Whisper?
Prepare Feature Extractor, Tokenizer and Data
Training and evaluation
Building a demo (optional); see the code sketch below
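A condensed sketch of the first two steps, closely following the Hugging Face fine-tuning blog post; the checkpoint, dataset, and hyper-parameters here are illustrative, and the Common Voice 11 Malayalam data requires accepting the dataset terms on the Hub:

```python
from dataclasses import dataclass

from datasets import Audio, load_dataset
from transformers import (Seq2SeqTrainer, Seq2SeqTrainingArguments,
                          WhisperForConditionalGeneration, WhisperProcessor)

# 1. Prepare feature extractor, tokenizer and data.
processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="Malayalam", task="transcribe"
)
common_voice = load_dataset(
    "mozilla-foundation/common_voice_11_0", "ml", split="train+validation"
)
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16_000))

def prepare(batch):
    audio = batch["audio"]
    # Log-Mel input features for the encoder, token ids as decoder labels.
    batch["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch

common_voice = common_voice.map(prepare, remove_columns=common_voice.column_names)

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: WhisperProcessor

    def __call__(self, features):
        # Pad audio features and label token ids separately.
        input_features = [{"input_features": f["input_features"]} for f in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")
        label_features = [{"input_ids": f["labels"]} for f in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
        # Replace padding with -100 so it is ignored by the loss.
        batch["labels"] = labels_batch["input_ids"].masked_fill(
            labels_batch["attention_mask"].ne(1), -100
        )
        return batch

# 2. Training and evaluation.
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-small-ml",   # placeholder output / Hub repo name
    per_device_train_batch_size=16,
    learning_rate=1e-5,
    max_steps=4000,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=common_voice,
    data_collator=DataCollatorSpeechSeq2SeqWithPadding(processor=processor),
    tokenizer=processor.feature_extractor,
)
trainer.train()
```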
The Hugging Face team conducted a Whisper fine-tuning event for two weeks, from 5th December 2022 to 19th December 2022. The results were out on 23rd December 2022.
The goal was to fine-tune the Whisper model to build state-of-the-art speech recognition systems in the languages of our choice 🗣
Malayalam models in Whisper Event
For the language Malayalam, the results are as follows:
Malayalam models' performance in the Whisper event
Winning models in Malayalam in Whisper Event
The winning model for Common Voice:
The winning model for Fleurs:
I was not convinced
I was sceptical about the winning models because:
Achieving 10% WER in Malayalam is astonishing.
In Malayalam there is not even a single yardstick to compare against. Most previous work was done on proprietary datasets and not open-sourced.
Malayalam is a morphologically complex language, so even achieving 30% WER is a big deal.
I wanted to build something new
Time for a new adventure
Results on Benchmarking in Malayalam
Started with 6 fine-tuned models in Malayalam
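Benchmarking here boils down to computing the Word Error Rate (WER) and Character Error Rate (CER) on a held-out test split. A minimal sketch with the Hugging Face evaluate library (the checkpoint below is a placeholder, not necessarily one of the six benchmarked models, and this is not the exact benchmarking tool used):

```python
import evaluate
from datasets import Audio, load_dataset
from transformers import pipeline

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Placeholder checkpoint; substitute any fine-tuned Malayalam Whisper model.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# Common Voice 9 Malayalam test split, resampled to 16 kHz.
ds = load_dataset("mozilla-foundation/common_voice_9_0", "ml", split="test")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

predictions, references = [], []
for sample in ds:
    out = asr(sample["audio"]["array"])
    predictions.append(out["text"])
    references.append(sample["sentence"])

print("WER:", 100 * wer_metric.compute(predictions=predictions, references=references))
print("CER:", 100 * cer_metric.compute(predictions=predictions, references=references))
```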
Results on benchmarking in the Common Voice dataset
Output from benchmarking tool
Results in Common Voice dataset
WER in Common Voice-9 test split
CER in Common Voice dataset
CER in Common Voice-9 test split
Results on benchmarking in the Malayalam Speech Corpus dataset
Output from benchmarking tool
Results in Malayalam Speech Corpus dataset
WER in MSC
CER in Malayalam Speech Corpus
Character Error rate in MSC
WER in the Malayalam Common Voice-9 dataset:
tiny - 102.7
base - 122.9
small - 104.8
medium - 137.8
large - 107.1
large-v2 - 103.2
In Malayalam we have achieved phenomenal results with fine-tuned Whisper models.
We seem to have built really good ASR models suitable for production use-cases.
You can also do it in your own language, especially if it is a low-resource language.