Kurian Benoy – Malayalam Projects I am working on at the moment!

Outline

About Malayalam
malayalam_asr_benchmarking nbdev project
vegam-whisper-medium ml model and Pallakku

$whoami

AI Engineer & Team Lead @ Sentient.io
Volunteer @ Swathanthra Malayalam Computing (SMC)
Open-source enthusiast
Not affiliated with OpenAI

Disclaimer

Nothing in this talk is generated.

unless explicitly marked, or in a screenshot from an LLM

About Malayalam

Malayalam is my mother tongue.
Native speakers: 38+ million.
Spoken in: Kerala, Lakshadweep, Puducherry, and wherever Mallus live.

Malayalam is a morphologically complex language

It has more complex morphology than languages such as English, Tamil, Hindi, Spanish and Finnish.
Morphological complexity can be measured with metrics such as type-token ratio (TTR) and moving-average type-token ratio (MATTR) [1], [2]

Types and Tokens

To be or not to be is the question
Type count: 7
Token count: 9

Type Token Ratio (TTR)

\[\begin{gather*} TTR = \frac{\text{Type count}}{\text{Token count}} \end{gather*}\]

To be or not to be is the question

\[\begin{gather*} TTR = 7 \div 9 \end{gather*}\]
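The type and token counts above can be reproduced with a few lines of Python. This is a minimal sketch, assuming whitespace tokenization and case folding (so "To" and "to" count as one type); the function names and the MATTR window size of 5 are illustrative choices, not taken from the cited papers.

```python
def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace (an assumed, very simple tokenizer)."""
    return text.lower().split()


def type_token_ratio(tokens: list[str]) -> float:
    """TTR = number of distinct tokens (types) / total number of tokens."""
    if not tokens:
        raise ValueError("need at least one token")
    return len(set(tokens)) / len(tokens)


def mattr(tokens: list[str], window: int) -> float:
    """Moving-average TTR: mean TTR over every sliding window of `window` tokens."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(tokens) < window:
        # Assumption: fall back to plain TTR when the text is shorter than the window.
        return type_token_ratio(tokens)
    ratios = [
        type_token_ratio(tokens[i : i + window])
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


tokens = tokenize("To be or not to be is the question")
type_count = len(set(tokens))
token_count = len(tokens)
ttr = type_token_ratio(tokens)
print(type_count, token_count, round(ttr, 3))  # 7 9 0.778
print(mattr(tokens, window=5))
```

Plain TTR falls as a text gets longer, because common words repeat; MATTR averages over fixed-size windows so that texts of different lengths can be compared.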

TTGR and TTR plot of Malayalam for SMC Corpus of Wikipedia text from K. Manohar et al.

Comparison of Malayalam TTR with that of European Union Constitution Corpus and DoE-CIIL Corpus from K. Manohar et al.

Malayalam_asr_benchmarking project

Whisper Event

The Hugging Face team ran a Whisper fine-tuning event for two weeks, from 5th December 2022 to 19th December 2022. The results were announced on 23rd December 2022.
The goal was to fine-tune the Whisper model to build state-of-the-art speech recognition systems in the languages of our choice 🗣

Malayalam models in Whisper Event

For the language Malayalam, the results are as follows:

Malayalam models performance in whisper event according to leaderboard

Winning models in Malayalam in Whisper Event

The winning model for Common voice: thennal/whisper-medium-ml
The winning model for Fleurs: parambharat/whisper-small-ml

I was not convinced

I was sceptical about the winning models because:

Achieving 10% WER in Malayalam is astonishing.
In Malayalam there is not even a single yardstick to compare against. Most previous work was done on proprietary datasets and was not open-sourced.
Malayalam is a morphologically complex language, so even achieving 30% WER is a big deal.
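For readers new to the metrics: word error rate (WER) is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words; character error rate (CER) is the same computed over characters. The sketch below is a minimal, dependency-free illustration with hypothetical function names; it is not the code used by the Whisper Event leaderboard or by the benchmarking project, and it applies no text normalization.

```python
from collections.abc import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions and deletions."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference items
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
print(round(wer(reference, hypothesis), 3))  # 0.333
print(round(cer(reference, hypothesis), 3))  # 0.227
```

Note that WER counts a word as wrong even when a single character differs. In a morphologically rich language, where one long word carries several suffixes, a small suffix error costs a whole word, which is one reason low WER figures are hard to reach and why CER is reported alongside WER.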

I was not convinced

Didn’t trust the Hugging Face way of evaluating models.

thennal/whisper-medium-ml model card readme

I was not convinced

Didn’t trust the Hugging Face way of evaluating models.

Last commit in thennal/whisper-medium-ml

I wanted to build something new

New github project for Malayalam ASR Benchmarking

Time for a new adventure

Benchmarked models

I started with 6 models fine-tuned for Malayalam and compared them with 6 model versions released by OpenAI.

thennal/whisper-medium-ml
parambharat/whisper-tiny-ml
parambharat/whisper-base-ml
parambharat/whisper-small-ml
anuragshas/whisper-large-v2-ml
DrishtiSharma/whisper-large-v2-malayalam

Results of benchmarking on the Common Voice dataset

Output from benchmarking tool

WER in Common Voice dataset

WER in Common Voice-9 test split

CER in Common Voice dataset

CER in Common Voice-9 test split

Results of benchmarking on the Malayalam Speech Corpus dataset

Output from benchmarking tool

WER in Malayalam Speech Corpus

WER in MSC

CER in Malayalam Speech Corpus

Character Error rate in MSC

End Goal

Something very similar to the Open LLM Leaderboard, with results for the latest Malayalam speech models.
Should include results for Kaldi, Wav2Vec, Whisper, MMS etc.

Open LLM Leaderboard in Hugging Face Spaces

vegam-whisper-medium ml model and Pallakku

Inspired by

faster-whisper is a reimplementation of OpenAI’s Whisper model using CTranslate2, which is a fast inference engine for Transformer models.
This implementation is up to 4 times faster than openai/whisper for the same accuracy while using less memory. The efficiency can be further improved with 8-bit quantization on both CPU and GPU.

CTranslate2

CTranslate2 ships a utility for converting any Whisper-based model into a model that faster-whisper can load.

ct2-transformers-converter \
 --model openai/whisper-tiny \
 --output_dir whisper-tiny-ct2
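Once converted, the output directory can be loaded with faster-whisper. The following is a minimal sketch, assuming the `faster-whisper` package is installed and that `whisper-tiny-ct2` is the directory produced by the command above; the `transcribe` helper, the CPU device and the `int8` compute type are illustrative choices, and the audio path passed in is hypothetical.

```python
def transcribe(model_dir: str, audio_path: str, language: str = "ml") -> str:
    """Transcribe one audio file with a CTranslate2-converted Whisper model.

    `language="ml"` is the Whisper language code for Malayalam.
    """
    # Imported lazily so this module can be loaded where faster-whisper is absent.
    from faster_whisper import WhisperModel

    # int8 on CPU is an assumed configuration; use device="cuda" with
    # compute_type="float16" on a GPU.
    model = WhisperModel(model_dir, device="cpu", compute_type="int8")

    # transcribe() returns a lazy generator of segments plus an info object;
    # decoding happens while the generator is consumed.
    segments, _info = model.transcribe(audio_path, language=language)
    return " ".join(segment.text.strip() for segment in segments)
```

A call such as `transcribe("whisper-tiny-ct2", "sample.wav")` would return the joined segment text, where `sample.wav` stands in for any local audio file. If the converted directory does not contain a `tokenizer.json`, faster-whisper may need network access to fetch a tokenizer.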

Vegam Whisper models released

I converted thennal/whisper-medium-ml into faster-whisper-based models for Malayalam:

Pallakku

Pallakku is a Malayalam speech-to-text demo that uses the model weights of vegam-whisper-medium-ml.
Now hosted as:

🤗 spaces
GPU-based microservice (coming soon.)

🤗 spaces

Thanks to

OpenAI team - Alec Radford, Jong Wook Kim, Christine McLeavey and the other authors of the Whisper paper
Creators of CTranslate2 and faster-whisper - Guillaume Klein
Hugging Face team - Sanchit Gandhi, Nicolas Patry, Vaibhav Srivastav etc.
Kavya Manohar
Santhosh Thottingal

Thennal D K
Other members in Swathanthra Malayalam Computing community.
Jarvis Labs

Appendix

Why I love nbdev?

I love Jupyter notebook
website + pypi + anaconda + github project
I love Quarto

Nbdev is a secret weapon for productivity

I have built one company project and two Python packages with nbdev so far:

malayalam_asr_benchmarking
whisper_normalizer

Nbdev opinion by Hamel

From Hamel's answer in the "nbdev for production code" topic

My favourite nbdev project

Quarto

nbdev stands on the giant shoulders of Quarto.
Learning Quarto is like learning a new programming language.

My favourite quarto projects