TTGR and TTR plot of Malayalam for SMC Corpus of Wikipedia text from K. Manohar et al.
OpenAI Whisper
I think Whisper is the most underrated model released by OpenAI.
It was open-sourced on September 21, 2022, with the release of the inference code and pre-trained model weights.
About OpenAI Whisper Model
Whisper is a computer program that can listen to people talking and write down what they say (an Automatic Speech Recognition model).
Whisper can understand people speaking different languages and can even translate what they say into English (it supports transcription as well as translation to English).
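As a rough illustration of both capabilities, here is a minimal sketch using the openai-whisper Python package; the model size and the audio file name speech.mp3 are placeholders, not something from the original setup.

```python
import whisper

# Load a pre-trained checkpoint by its size name (see the table below).
model = whisper.load_model("small")

# Transcribe the audio in the language that was spoken.
result = model.transcribe("speech.mp3")  # "speech.mp3" is a placeholder file
print(result["text"])

# Translate the speech into English instead of transcribing it.
translation = model.transcribe("speech.mp3", task="translate")
print(translation["text"])
```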
Whisper Models
Size      Parameters   Required VRAM   Relative speed
tiny      39 M         ~1 GB           ~32x
base      74 M         ~1 GB           ~16x
small     244 M        ~2 GB           ~6x
medium    769 M        ~5 GB           ~2x
large     1550 M       ~10 GB          1x
English Speech Recognition
Whisper is competitive with state-of-the-art commercial and open-source systems. Diagram from the Whisper research paper, p. 9.
Multilingual Speech Recognition
The Whisper model is trained on 99 languages.
The OpenAI Whisper API supports only 57 of those languages, since performance in the remaining ones is not good enough.
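For a specific language such as Malayalam, the language can be forced instead of relying on automatic detection; a small sketch, again with a placeholder audio file:

```python
import whisper

model = whisper.load_model("medium")
# "ml" is the language code for Malayalam; this skips automatic language detection.
result = model.transcribe("malayalam_speech.wav", language="ml")
print(result["text"])
```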
Runs on almost any device
Since Whisper followed the open-source route, Georgi Gerganov was able to develop whisper.cpp, a port of OpenAI’s Whisper model in C/C++.
A pre-trained model is a large model that has already been trained on a particular task with a lot of data. If we want to fit it to our own dataset, we fine-tune the pre-trained model to build a new model that works well for our specific task.
Picture from the fast.ai lesson covering the steps in fine-tuning a text classifier model
The Hugging Face team conducted a Whisper fine-tuning event for two weeks, from 5th December 2022 to 19th December 2022. The results were out on 23rd December 2022.
The goal was to fine-tune the Whisper model to build state-of-the-art speech recognition systems in the languages of our choice 🗣
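The broad shape of such a fine-tuning run, loosely following the Hugging Face fine-tuning recipe, is sketched below. This is a condensed illustration rather than the exact event setup: the checkpoint size, hyperparameters, and the simplified data collator are assumptions, evaluation is omitted, and the Common Voice dataset is gated on the Hugging Face Hub.

```python
from dataclasses import dataclass

from datasets import Audio, load_dataset
from transformers import (
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)

# Pre-trained checkpoint to adapt; "small" is an assumption for this sketch.
processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="Malayalam", task="transcribe"
)
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Malayalam subset of Common Voice 11, resampled to Whisper's 16 kHz input rate.
train_ds = load_dataset("mozilla-foundation/common_voice_11_0", "ml", split="train")
train_ds = train_ds.cast_column("audio", Audio(sampling_rate=16_000))

def prepare(batch):
    audio = batch["audio"]
    batch["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch

train_ds = train_ds.map(prepare, remove_columns=train_ds.column_names)

@dataclass
class SpeechCollator:
    """Pads log-Mel features and label ids; -100 hides padding from the loss."""

    def __call__(self, features):
        batch = processor.feature_extractor.pad(
            [{"input_features": f["input_features"]} for f in features],
            return_tensors="pt",
        )
        labels = processor.tokenizer.pad(
            [{"input_ids": f["labels"]} for f in features], return_tensors="pt"
        )
        batch["labels"] = labels["input_ids"].masked_fill(
            labels["attention_mask"].ne(1), -100
        )
        return batch

args = Seq2SeqTrainingArguments(
    output_dir="whisper-small-ml",  # assumed output directory name
    per_device_train_batch_size=16,
    learning_rate=1e-5,
    max_steps=4000,
    fp16=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=SpeechCollator(),
    tokenizer=processor.feature_extractor,
)
trainer.train()
```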
Malayalam models produced in Whisper Event
For the language Malayalam, the results are as follows:
Performance of Malayalam models in the Whisper Event, according to the leaderboard
Winning models for Malayalam in the Whisper Event
The winning model for Common Voice: thennal/whisper-medium-ml
The winning model for Fleurs: parambharath/whisper-small-ml
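To try one of these fine-tuned checkpoints directly, here is a minimal sketch with the transformers pipeline; the audio file name is a placeholder.

```python
from transformers import pipeline

# Load the winning Common Voice checkpoint from the Hugging Face Hub.
asr = pipeline("automatic-speech-recognition", model="thennal/whisper-medium-ml")

# "sample_malayalam.wav" is a placeholder for any Malayalam audio file.
print(asr("sample_malayalam.wav")["text"])
```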
Question Time
Can you name three Malayalam fonts? (Hint: SMC makes a lot of fonts.)
Who developed the user-friendly GNU/Linux distribution called Slynux while he was in high school? (Hint: He is an xMECian.)
I was not convinced
Didn’t trust the Hugging Face way of evaluating models.
thennal/whisper-medium-ml model card readme
Last commit in thennal/whisper-medium-ml
Objective of my benchmarking
To test whether 10% WER was achievable on the available academic datasets.
Datasets
Common Voice 11 Malayalam subset
SMC Malayalam Speech Corpus
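For reference, the Common Voice subset can be pulled straight from the Hugging Face Hub, as sketched below (the dataset is gated and requires accepting its terms once); the SMC Malayalam Speech Corpus is distributed separately by SMC.

```python
from datasets import Audio, load_dataset

# Malayalam test split of Common Voice 11 (gated; needs Hub access approval).
cv_ml_test = load_dataset("mozilla-foundation/common_voice_11_0", "ml", split="test")

# Whisper models expect 16 kHz audio, so resample on the fly.
cv_ml_test = cv_ml_test.cast_column("audio", Audio(sampling_rate=16_000))
print(cv_ml_test)
```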
Metrics for evaluating ASR models
ASR evaluation relies on a comparison between the ground truth and the ASR output.
The common metrics for ASR evaluation, which are popular and good enough, are the Word Error Rate (WER) and the Character Error Rate (CER).
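As an illustration, both metrics can be computed with the jiwer library; the reference and hypothesis strings below are made-up examples, not from the benchmark.

```python
import jiwer

reference = "whisper is an automatic speech recognition model"   # ground truth (made up)
hypothesis = "whisper is an automatic speech recognized model"   # ASR output (made up)

# WER: word-level edits (substitutions + insertions + deletions) / reference word count.
print("WER:", jiwer.wer(reference, hypothesis))

# CER: the same edit-distance ratio computed at the character level.
print("CER:", jiwer.cer(reference, hypothesis))
```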
faster-whisper is a reimplementation of OpenAI’s Whisper model using CTranslate2, which is a fast inference engine for Transformer models.
This implementation is up to 4 times faster than openai/whisper for the same accuracy while using less memory. The efficiency can be further improved with 8-bit quantization on both CPU and GPU.
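A minimal sketch of transcribing with faster-whisper; the model size, device, compute type, and audio file name are assumptions, not the exact benchmarking configuration.

```python
from faster_whisper import WhisperModel

# Load a CTranslate2-converted Whisper checkpoint; int8 keeps the memory footprint low.
model = WhisperModel("medium", device="cpu", compute_type="int8")

# transcribe() returns a lazy generator of segments plus language-detection info.
segments, info = model.transcribe("malayalam_speech.wav", language="ml")
print("Detected language:", info.language)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```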
CTranslate2
An awesome library for optimizing ML models for production.
CTranslate2 is a C++ and Python library for efficient inference with Transformer models.
The project implements a custom runtime that applies many performance optimization techniques such as weights quantization, layers fusion, batch reordering, etc., to accelerate and reduce the memory usage of Transformer models on CPU and GPU.
CTranslate2 Whisper converter
CTranslate2 provides a utility for converting any Whisper-based model into a model that faster-whisper can load.
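For example, a Whisper checkpoint from the Hugging Face Hub can be converted with CTranslate2's Transformers converter; here is a sketch using its Python API (CTranslate2 also ships an equivalent ct2-transformers-converter command-line tool), where the output directory name is an assumption.

```python
from ctranslate2.converters import TransformersConverter

# Convert a fine-tuned Whisper checkpoint into the CTranslate2 format that
# faster-whisper can load; float16 quantization roughly halves the weight size.
converter = TransformersConverter("thennal/whisper-medium-ml")
converter.convert("whisper-medium-ml-ct2", quantization="float16")
```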