Malayalam projects I am working on at the moment!

Delft-Fastai check-in sessions

Sunday, May 28, 2023

Outline

  • About Malayalam
  • malayalam_asr_benchmarking nbdev project
  • vegam-whisper-medium-ml model and Pallakku

$whoami

  • AI Engineer & Team Lead @ Sentient.io
  • Volunteer @ Swathanthra Malayalam Computing (SMC)
  • Open-source enthusiast
  • Not affiliated with OpenAI

Disclaimer

  • Nothing in this talk is generated, unless explicitly marked or shown in a screenshot from an LLM.

About Malayalam

  • Malayalam is my mother tongue.
  • Native speakers: 38+ million.
  • Spoken in: Kerala, Lakshadweep, Puducherry, and wherever Mallus live.

Malayalam is a morphologically complex language

  • Its morphology is more complex than that of languages such as English, Tamil, Hindi, Spanish and Finnish.
  • Morphological complexity can be quantified with metrics such as type-token ratio (TTR) and moving-average type-token ratio (MATTR) [1], [2]

Types and Tokens

  • To be or not to be is the question
  • Type count: 7
  • Token count: 9

Type Token Ratio (TTR)

\[\begin{gather*} TTR = \frac{\text{Type count}}{\text{Token count}} \end{gather*}\]
  • To be or not to be is the question
\[\begin{gather*} TTR = 7 \div 9 \end{gather*}\]
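The type and token counts above can be reproduced with a few lines of Python. This is a minimal sketch, not the method used in the cited papers: it assumes lowercased whitespace tokenization and the full sentence "To be or not to be is the question", which gives 7 types and 9 tokens. The function names and the MATTR window size of 5 are illustrative choices.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return text.lower().split()


def type_token_ratio(tokens):
    """TTR = number of distinct tokens (types) / total number of tokens."""
    if not tokens:
        raise ValueError("TTR is undefined for an empty token list")
    return len(set(tokens)) / len(tokens)


def moving_average_ttr(tokens, window=5):
    """MATTR: mean TTR over every sliding window of `window` tokens."""
    if len(tokens) < window:
        # Fall back to plain TTR when the text is shorter than one window.
        return type_token_ratio(tokens)
    ratios = [
        type_token_ratio(tokens[i:i + window])
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


tokens = tokenize("To be or not to be is the question")
print(len(set(tokens)), len(tokens))                   # 7 9
print(round(type_token_ratio(tokens), 3))              # 0.778
print(round(moving_average_ttr(tokens, window=5), 3))  # 0.92
```

Plain TTR drops as a text gets longer, because common words repeat. MATTR averages over fixed-size windows, which makes corpora of different sizes easier to compare.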

TTGR and TTR plot of Malayalam for SMC Corpus of Wikipedia text from K. Manohar et al.

Comparison of Malayalam TTR with that of European Union Constitution Corpus and DoE-CIIL Corpus from K. Manohar et al.

Malayalam_asr_benchmarking project

Whisper Event

  • The Hugging Face team ran a two-week Whisper fine-tuning event from 5th December 2022 to 19th December 2022. The results were announced on 23rd December 2022.
  • The goal was to fine-tune the Whisper model to build state-of-the-art speech recognition systems in the languages of our choice 🗣

Malayalam models in Whisper Event

  • For the language Malayalam, the results are as follows:

Performance of Malayalam models in the Whisper Event according to the leaderboard

Winning models in Malayalam in Whisper Event

  • The winning model for Common voice: thennal/whisper-medium-ml
  • The winning model for Fleurs: parambharat/whisper-small-ml

I was not convinced

I was sceptical about the winning models because:

  1. Achieving 10% WER in Malayalam is astonishing.
  2. In Malayalam there is not even a single yardstick to compare against. Most previous work used proprietary datasets that were never open-sourced.
  3. Malayalam is a morphologically complex language, so even achieving 30% WER is a big deal.

I was not convinced

  4. I didn’t trust the Hugging Face way of evaluating models.

thennal/whisper-medium-ml model card readme

Last commit in thennal/whisper-medium-ml

I wanted to build something new

Time for a new adventure

Benchmarked models

  1. thennal/whisper-medium-ml
  2. parambharat/whisper-tiny-ml
  3. parambharat/whisper-base-ml
  4. parambharat/whisper-small-ml
  5. anuragshas/whisper-large-v2-ml
  6. DrishtiSharma/whisper-large-v2-malayalam

Benchmarking results on the Common Voice dataset

Output from benchmarking tool

WER in Common Voice dataset

WER in Common Voice-9 test split

CER in Common Voice dataset

CER in Common Voice-9 test split
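For readers new to these metrics: WER is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. CER is the same at character level. The sketch below shows the definitions in plain Python. It is an illustration only, not the implementation used by the benchmarking tool, and the English sentences are made-up examples.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    # Characters here are Unicode code points; real tools usually
    # normalize the text before scoring.
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
print(round(wer(reference, hypothesis), 3))  # 0.333 (1 substitution + 1 deletion over 6 words)
print(round(cer(reference, hypothesis), 3))  # 0.227 (5 character edits over 22 characters)
```

Because one wrong suffix makes a whole word count as an error, WER tends to be harsher than CER on morphologically rich languages with long inflected word forms.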

Benchmarking results on the Malayalam Speech Corpus dataset

Output from benchmarking tool

WER in Malayalam Speech Corpus

WER in MSC

CER in Malayalam Speech Corpus

Character Error rate in MSC

End Goal

  • Something very similar to the Open LLM Leaderboard, with results for the latest Malayalam speech models.
  • Should include results for Kaldi, Wav2Vec, Whisper, MMS etc.

Open LLM Leaderboard in Hugging Face Spaces

vegam-whisper-medium-ml model and Pallakku

Inspired by

  • faster-whisper is a reimplementation of OpenAI’s Whisper model using CTranslate2, which is a fast inference engine for Transformer models.
  • This implementation is up to 4 times faster than openai/whisper for the same accuracy while using less memory. The efficiency can be further improved with 8-bit quantization on both CPU and GPU.

CTranslate2

  • CTranslate2 provides a utility for converting any Whisper-based model into a faster-whisper-compatible model:
ct2-transformers-converter \
 --model openai/whisper-tiny \
 --output_dir whisper-tiny-ct2
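Once converted, the output directory can be loaded with the faster-whisper Python package. The following is a rough sketch under some assumptions: the `whisper-tiny-ct2` directory comes from the command above, `audio.wav` is a hypothetical input file, and `format_segment` is an illustrative helper rather than part of the library.

```python
def transcribe(model_dir, audio_path, language="ml"):
    """Transcribe one audio file with a CTranslate2-converted Whisper model."""
    # Imported lazily so format_segment works without faster-whisper installed.
    from faster_whisper import WhisperModel

    # compute_type="int8" requests 8-bit quantization on CPU.
    model = WhisperModel(model_dir, device="cpu", compute_type="int8")
    # transcribe() returns a lazy generator of segments plus metadata.
    segments, _info = model.transcribe(audio_path, language=language, beam_size=5)
    return [(segment.start, segment.end, segment.text) for segment in segments]


def format_segment(start, end, text):
    """Render one transcribed segment with its timestamps."""
    return f"[{start:.2f}s -> {end:.2f}s] {text.strip()}"


# Example usage (needs faster-whisper, the converted model and an audio file):
# for start, end, text in transcribe("whisper-tiny-ct2", "audio.wav"):
#     print(format_segment(start, end, text))
```

The `language="ml"` argument pins decoding to Malayalam instead of relying on automatic language detection.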

Vegam Whisper models released

  • I converted thennal/whisper-medium-ml into faster-whisper-based models for Malayalam:
  1. kurianbenoy/vegam-whisper-medium-ml
  2. kurianbenoy/vegam-whisper-medium-ml-fp16

Pallakku

  • Pallakku is a Malayalam speech-to-text demo that uses the model weights of vegam-whisper-medium-ml.
  • Now hosted as:
  1. 🤗 Spaces
  2. GPU-based microservice (coming soon)

🤗 spaces

Thanks to

Appendix

Why I love nbdev?

  • I love Jupyter notebooks
  • website + pypi + anaconda + github project
  • I love Quarto

Nbdev is a secret weapon for productivity

  • I have built one company project and two Python packages with nbdev so far:
  1. malayalam_asr_benchmarking
  2. whisper_normalizer

Nbdev opinion by Hamel

From Hamel's answer in the "Nbdev for production code" topic

My favourite nbdev project

Quarto

  • nbdev builds on the giant shoulders of Quarto
  • Learning Quarto is like learning a new programming language.

Quarto

My favourite Quarto projects