ASR for Malayalam

Kurian Benoy | 2021MCS120014

Sunday, September 3, 2023

Automatic Speech Recognition

Automatic Speech Recognition(ASR) is the use of technology to process human speech into readable text.

Malayalam is a dravidian language which has 38 million+ speakers(2011 census).
Malayalam is morphologically complex language which has complex morphology compared to languages like Finnish, Estonian, English, Tamil, Hindi etc.[3]

Using HMM, [1] showed Malayalam speech recognition of numbers was possible when trained in a corpus of 420 sentences which contained 21 speakers. Similarly using HMM and ANN[4] demonstrated malayalam speech recognition. [1] and [4] demonstrated their resuts in internal test sets and claims to have word recongition accuracy of 91% and 86.67% respectively.

Kavya Manohar et.al[2] proposed a open vocabulary speech recognition in Malayalam. It involves building a hybrid ASR model with acoustic model ASR and that builds using language model and prounciation lexion. The study examined WER in medium and large OOV test which are open source and concluded it can give 10 to 7% improvement over simply using acoustic ASR.

There are ASR’s which is originally trained for multiple languages supporting Malayalam as well.
Whisper [5] which use a encoder-decoder based model which supports speech recognition in 99+ languages. For Malayalam in Common Voice 9 dataset for malayalam subset it reported a WER of 103.2 with largev2 model.
MMS [6] which uses a CTC model supports speech recognition in 1000+ languages. For Malayalam subset in Fleurs dataset it reported a WER of 39.7 with MMS L-1107 no LM checkpoint.

A number of existing techniques are available for Malayalam speech to text, yet there is no comparission on how well one existing models perform over another as most of analysis is done in private datasets.
Doesn’t support speech transcription of long-form audios with time-stamps
Open research methadology is not followed which makes it challenging to identify datasets and models.

Fine-tuned ASR model weights which are open source and can be tested in approporiate web app UI.
An ASR system which is able to support long form transcription.
A leaderboard of best ASR models available in Malayalam

Using hypothesised approach proposed in [7] for languages like English, French we hope to build similar models which can support long form audio.

Semester 1:

Semester 2:

Build open-source Malayalam ASR model which support long-form audio transcription.

[1] Speech Recognition of Malayalam Numbers,Cini Kurian, Kannan Balakrishnan
[2] Syllable Subword Tokens for Open Vocabulary Speech Recognition in Malayalam by Kavya Manohar, A.R Jayan, Rajeev Rajan
[3] Quantitative Analysis of the Morphological Complexity of Malayalam Language
[4] HMM/ANN hybrid model for continuous Malayalam speech recognition, Anuj Mohammed & K.N Ramachandran Nair

[5] Robust Speech Recognition via Large-Scale Weak Supervision, Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine Mcleavey, Ilya Sutskever
[6] Scaling Speech Technology to 1,000+ Languages, Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli
[7] WhisperX: Time-Accurate Speech Transcription of Long-Form Audio, Max Bain