ASR for Malayalam

Kurian Benoy | 2021MCS120014

Sunday, September 3, 2023

Automatic Speech Recognition

  • Automatic Speech Recognition(ASR) is the use of technology to process human speech into readable text.

About Malayalam

  • Malayalam is a dravidian language which has 38 million+ speakers(2011 census).
  • Malayalam is morphologically complex language which has complex morphology compared to languages like Finnish, Estonian, English, Tamil, Hindi etc.[3]

Literature Review

  • Using HMM, [1] showed Malayalam speech recognition of numbers was possible when trained in a corpus of 420 sentences which contained 21 speakers. Similarly using HMM and ANN[4] demonstrated malayalam speech recognition. [1] and [4] demonstrated their resuts in internal test sets and claims to have word recongition accuracy of 91% and 86.67% respectively.

Literature Review

  • Kavya Manohar[2] proposed a open vocabulary speech recognition in Malayalam. It involves building a hybrid ASR model with acoustic model ASR and that builds using language model and prounciation lexion. The study examined WER in medium and large OOV test which are open source and concluded it can give 10 to 7% improvement over simply using acoustic ASR.

Literature Review

  • There are ASR’s which is originally trained for multiple languages supporting Malayalam as well.
  • Whisper [5] which use a encoder-decoder based model which supports speech recognition in 99+ languages. For Malayalam in Common Voice 9 dataset for malayalam subset it reported a WER of 103.2 with largev2 model.
  • MMS [6] which uses a CTC model supports speech recognition in 1000+ languages. For Malayalam subset in Fleurs dataset it reported a WER of 39.7 with MMS L-1107 no LM checkpoint.


  • A number of existing techniques are available for Malayalam speech to text, yet there is no comparission on how well one existing models perform over another as most of analysis is done in private datasets.
  • Doesn’t support speech transcription of long-form audios with time-stamps
  • Open research methadology is not followed which makes it challenging to identify datasets and models.

Project Objectives

  1. Build an open-source based ASR model
  • Achieve a WER of less than 0.15
  1. Benchmark ASR models in datasets

  2. Support Malayalam long form audio speech transcription

Expected Outcome

  1. Fine-tuned ASR model weights which are open source and can be tested in approporiate web app UI.
  2. An ASR system which is able to support long form transcription.
  3. A leaderboard of best ASR models available in Malayalam

Proposed Methodology

  1. Build an open-source based ASR model
  • Collect dataset for training ASR model
  • Select appropriate base architecture
  • Then fine-tune models
  • Evaluate

Proposed Methodology

  1. Benchmark ASR models in datasets
  • Using appropriate selected datasets
  • Evaluate the performance of ASR model
  • The proposed metrics are WER, CER, time taken, model size etc.

Proposed Methodology

  1. Support Malayalam long form audio speech transcription
  • Using hypothesised approach proposed in [7] for languages like English, French we hope to build similar models which can support long form audio.

Project Plan

Semester 1:

  1. Build an open-source based ASR model

  2. Benchmark ASR models in datasets

Semester 2:

  1. Support Malayalam long form audio speech transcription


  • Build open-source Malayalam ASR model which support long-form audio transcription.


  • [1] Speech Recognition of Malayalam Numbers,Cini Kurian, Kannan Balakrishnan
  • [2] Syllable Subword Tokens for Open Vocabulary Speech Recognition in Malayalam by Kavya Manohar, A.R Jayan, Rajeev Rajan
  • [3] Quantitative Analysis of the Morphological Complexity of Malayalam Language
  • [4] HMM/ANN hybrid model for continuous Malayalam speech recognition, Anuj Mohammed & K.N Ramachandran Nair


  • [5] Robust Speech Recognition via Large-Scale Weak Supervision, Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine Mcleavey, Ilya Sutskever
  • [6] Scaling Speech Technology to 1,000+ Languages, Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli
  • [7] WhisperX: Time-Accurate Speech Transcription of Long-Form Audio, Max Bain