Automatic Speech Recognition(ASR) is the use of technology to process human speech into readable text.
About Malayalam
Malayalam is a dravidian language which has 38 million+ speakers(2011 census).
Malayalam is morphologically complex language which has complex morphology compared to languages like Finnish, Estonian, English, Tamil, Hindi etc.[3]
Literature Review
Using HMM, [1] showed Malayalam speech recognition of numbers was possible when trained in a corpus of 420 sentences which contained 21 speakers. Similarly using HMM and ANN[4] demonstrated malayalam speech recognition. [1] and [4] demonstrated their resuts in internal test sets and claims to have word recongition accuracy of 91% and 86.67% respectively.
Literature Review
Kavya Manohar et.al[2] proposed a open vocabulary speech recognition in Malayalam. It involves building a hybrid ASR model with acoustic model ASR and that builds using language model and prounciation lexion. The study examined WER in medium and large OOV test which are open source and concluded it can give 10 to 7% improvement over simply using acoustic ASR.
Literature Review
There are ASR’s which is originally trained for multiple languages supporting Malayalam as well.
Whisper [5] which use a encoder-decoder based model which supports speech recognition in 99+ languages. For Malayalam in Common Voice 9 dataset for malayalam subset it reported a WER of 103.2 with largev2 model.
MMS [6] which uses a CTC model supports speech recognition in 1000+ languages. For Malayalam subset in Fleurs dataset it reported a WER of 39.7 with MMS L-1107 no LM checkpoint.
Challenges
A number of existing techniques are available for Malayalam speech to text, yet there is no comparission on how well one existing models perform over another as most of analysis is done in private datasets.
Doesn’t support speech transcription of long-form audios with time-stamps
Open research methadology is not followed which makes it challenging to identify datasets and models.
Project Objectives
Build an open-source based ASR model
Achieve a WER of less than 0.15
Benchmark ASR models in datasets
Support Malayalam long form audio speech transcription
Expected Outcome
Fine-tuned ASR model weights which are open source and can be tested in approporiate web app UI.
An ASR system which is able to support long form transcription.
A leaderboard of best ASR models available in Malayalam
Proposed Methodology
Build an open-source based ASR model
Collect dataset for training ASR model
Select appropriate base architecture
Then fine-tune models
Evaluate
Proposed Methodology
Benchmark ASR models in datasets
Using appropriate selected datasets
Evaluate the performance of ASR model
The proposed metrics are WER, CER, time taken, model size etc.
Proposed Methodology
Support Malayalam long form audio speech transcription
Using hypothesised approach proposed in [7] for languages like English, French we hope to build similar models which can support long form audio.
Project Plan
Semester 1:
Build an open-source based ASR model
Benchmark ASR models in datasets
Semester 2:
Support Malayalam long form audio speech transcription
Summary
Build open-source Malayalam ASR model which support long-form audio transcription.
References
[1] Speech Recognition of Malayalam Numbers,Cini Kurian, Kannan Balakrishnan
[2] Syllable Subword Tokens for Open Vocabulary Speech Recognition in Malayalam by Kavya Manohar, A.R Jayan, Rajeev Rajan
[3] Quantitative Analysis of the Morphological Complexity of Malayalam Language
[4] HMM/ANN hybrid model for continuous Malayalam speech recognition, Anuj Mohammed & K.N Ramachandran Nair
References
[5] Robust Speech Recognition via Large-Scale Weak Supervision, Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine Mcleavey, Ilya Sutskever