OpenAI Whisper and its amazing power to do fine-tuning, demonstrated on my mother tongue
Kurian Benoy
Sunday, October 1, 2023
Outline
What is OpenAI Whisper?
Features of OpenAI Whisper
What is Fine-tuning and how to fine-tune Whisper?
About my mother tongue
Methodology of benchmarking whisper models
Results on benchmarking Whisper model
Future Ideas & Conclusion
$whoami
AI Engineer & Team Lead @ sentient.io
Volunteer @ Swathanthra Malayalam Computing (SMC)
FOSS enthusiast
Not affiliated with OpenAI
Disclaimer
This talk is not generated.
If I use something generated, I will explicitly mark it as coming from an LLM.
OpenAI Whisper
I think Whisper is the most underrated model released by OpenAI.
It was open-sourced on September 21, 2022, with the release of the inference code and pre-trained model weights.
About OpenAI Whisper Model
Whisper is a computer program which can listen to people talking and write down what they say. (Automatic Speech Recognition Model)
Whisper can understand people speaking different languages and can even translate what they say into English. (Supports transcription and translation to English)
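As a sketch of how both tasks look in code (assuming the `openai-whisper` package is installed; the file path and helper name are placeholders, not from the talk):

```python
def transcribe(path: str, model_size: str = "small", translate: bool = False) -> str:
    """Transcribe an audio file with openai-whisper.

    With translate=True, Whisper translates the speech into English
    instead of transcribing it in the original language.
    """
    import whisper  # pip install openai-whisper

    model = whisper.load_model(model_size)
    task = "translate" if translate else "transcribe"
    result = model.transcribe(path, task=task)
    return result["text"]
```

For example, `transcribe("speech.mp3", translate=True)` would return the English translation of a Malayalam recording.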
Whisper Models
Size     Parameters   Required VRAM   Relative speed
tiny     39 M         ~1 GB           ~32x
base     74 M         ~1 GB           ~16x
small    244 M        ~2 GB           ~6x
medium   769 M        ~5 GB           ~2x
large    1550 M       ~10 GB          1x
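One practical use of the table above is choosing the largest checkpoint that fits in the GPU memory you have. This is a hypothetical helper (not part of the Whisper library), with the VRAM figures hard-coded from the table:

```python
# (name, parameters, approx. required VRAM in GB) from the table above
WHISPER_MODELS = [
    ("tiny", "39 M", 1),
    ("base", "74 M", 1),
    ("small", "244 M", 2),
    ("medium", "769 M", 5),
    ("large", "1550 M", 10),
]

def largest_model_for_vram(vram_gb: float) -> str:
    """Return the largest Whisper model fitting in the given VRAM ('' if none fits)."""
    fitting = [name for name, _, need in WHISPER_MODELS if need <= vram_gb]
    return fitting[-1] if fitting else ""
```

For example, with a 6 GB card `largest_model_for_vram(6)` picks `"medium"`, since `large` needs about 10 GB.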
English Speech Recognition
Whisper is competitive with state-of-the-art commercial and open-source systems. Diagram from the Whisper research paper, p. 9.
Multi-lingual Speech recognition
The Whisper model is trained on 99 languages.
The OpenAI Whisper API supports only 57 languages, as performance in the remaining languages is not good enough.
Runs in almost any device
Since Whisper followed the open-source route, Georgi Gerganov developed whisper.cpp, a port of OpenAI’s Whisper model in C/C++.
It supports the platforms below:
macOS (Intel and ARM)
iOS
Android
Linux/FreeBSD
WebAssembly, etc.
Runs faster now
Whisper JAX, a JAX implementation of the OpenAI Whisper model, gives up to a 70x speedup on TPU.
A pre-trained model is a large model trained on a generic task. Fine-tuning means taking that pre-trained model and continuing its training on our own dataset, so that the resulting model works well for our specific task.
Picture from the fast.ai lesson covering the steps in fine-tuning a text classifier model
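The idea can be illustrated with a toy example (pure Python, not Whisper itself): start from "pretrained" weights and continue gradient descent on a small task-specific dataset, rather than training from scratch.

```python
def fine_tune(xs, ys, w, b, lr=0.05, epochs=300):
    """Continue gradient descent from pretrained weights (w, b) on new data."""
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            err = (w * x + b) - y  # prediction error on one sample
            w -= lr * err * x
            b -= lr * err
    return w, b

# "Pretrained" weights from some generic task
w0, b0 = 1.0, 0.0
# Task-specific data following y = 2x + 1
xs, ys = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]
w, b = fine_tune(xs, ys, w0, b0)  # converges near w=2, b=1
```

The same principle applies to Whisper: the pre-trained checkpoint already knows a lot about speech, and fine-tuning only has to adapt it to the target language.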
Native speakers: 38+ million (according to the 2011 census).
Spoken in: Kerala, Lakshadweep, Puducherry, and wherever Malayalees live.
Malayalam is a morphologically complex language.
Whisper Event
The HuggingFace team conducted a Whisper fine-tuning event for two weeks, from 5th December 2022 to 19th December 2022. The results were out on 23rd December 2022.
The goal was to fine-tune the Whisper model to build state-of-the-art speech recognition systems in the languages of our choice 🗣
Malayalam models produced in Whisper Event
For the language Malayalam, the results are as follows:
Malayalam models’ performance in the Whisper event, according to the leaderboard
Winning models in Malayalam in Whisper Event
The winning model for Common Voice: thennal/whisper-medium-ml
The winning model for FLEURS: parambharath/whisper-small-ml
I was not convinced
I was sceptical about the winning models because:
Achieving 11% WER in Malayalam is astonishing.
In Malayalam there is not even a single yardstick for comparison. Most previous work was done on proprietary datasets and was not open-sourced.
import time
from typing import List

from transformers import pipeline


def evaluate_whisper_model_common_voice(
    model_name: str,           # model name on the Hugging Face Hub
    werlist: List[float],      # WER list
    cerlist: List[float],      # CER list
    modelsizelist: List[str],  # model size list
    timelist: List[float],     # time (s) list
    bs: int = 16,              # batch size; default value is 16
) -> None:
    whisper_asr = pipeline(
        "automatic-speech-recognition", model=model_name, device=0
    )
    dataset = load_common_voice_malayalam_dataset()

    predictions = []
    references = []
    start = time.time()
    for out in whisper_asr(data(dataset), batch_size=bs):
        predictions.append(normalizer(out["text"]))
        references.append(normalizer(out["reference"][0]))
    end = time.time()
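The function above collects normalized predictions and references, but the metric computation itself is not shown. WER is the word-level edit distance between reference and hypothesis divided by the number of reference words; a minimal pure-Python sketch (the actual benchmark presumably uses a library such as `jiwer` or Hugging Face `evaluate`):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

So an 11% WER claim means roughly one word-level error in every nine reference words, which is why it deserved verification on an open benchmark.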