Building great datasets on your mother tongue with Open-Weight models

IndiaFOSS 2025

Kurian Benoy

Sunday, September 21, 2025

Outline

Why even generate synthetic data?
Generating Synthetic Data with Wan2.2 Text-to-Video Models for Onam Images
Generating Malayalam Speech Dataset
Conclusion

$whoami - Kurian Benoy

An open-source developer who is the creator of “whisper_normalizer”
ML Engineer @ Sarvam
I love home food and watching birds
GOATs in various sports: Leo Messi (Football), Roger Federer (Tennis), Virat Kohli (Cricket), Mary Kom (Boxing)
GOATs in chess, IMO: Judit Polgar and D Gukesh

$whoami -

Why I do FOSS?

Disclaimer

Why even Synthetic data?

The number of real datasets is very low.

Provenance of Synthetic data

Cost aspect?

Proposed Method for Synthetic Speech Dataset (1/2)

Input Source (Malayalam Text)
  - Directly start with text (from dataset, Wikipedia, government docs, OER, etc.).
  - No speech-to-text conversion needed.
Text Preprocessing
  - Clean the text: remove special characters, normalize spellings, handle punctuation.
Malayalam Text Processing
  - Tokenize words, handle grammar-specific features, normalize Unicode characters.
Grapheme-to-Phoneme (G2P) Conversion
  - Convert Malayalam letters (graphemes) into their sounds (phonemes).
  - Example: “കേരളം” → /keːɾɐɭɐm/
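The cleaning and G2P steps above can be sketched with the Python standard library alone. This is a toy illustration, not the pipeline used in the talk: `clean_text` and `toy_g2p` are hypothetical names, the set of punctuation kept is an assumption, and the G2P table covers only the letters of the slide's example word. A real system would need a full Malayalam G2P resource and would verbalize digits rather than drop them.

```python
import re
import unicodedata

# Keep the Malayalam block (U+0D00..U+0D7F), ZWNJ/ZWJ (used in some legacy
# chillu encodings), whitespace, and basic punctuation. Everything else,
# including Latin letters and digits, is replaced by a space (an assumption).
_NOT_ALLOWED = re.compile(r"[^\u0D00-\u0D7F\u200C\u200D\s.,?!]")


def clean_text(text: str) -> str:
    """Normalize Unicode, drop out-of-script characters, squeeze whitespace."""
    # NFC makes canonically equivalent sequences compare equal.
    text = unicodedata.normalize("NFC", text)
    text = _NOT_ALLOWED.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


# Toy tables: only the letters needed for the example word.
CONSONANTS = {
    "\u0D15": "k",  # MALAYALAM LETTER KA
    "\u0D30": "ɾ",  # MALAYALAM LETTER RA
    "\u0D33": "ɭ",  # MALAYALAM LETTER LLA
}
VOWEL_SIGNS = {
    "\u0D47": "eː",  # MALAYALAM VOWEL SIGN EE
}
ANUSVARA = "\u0D02"
VIRAMA = "\u0D4D"
INHERENT_VOWEL = "ɐ"


def toy_g2p(word: str) -> str:
    """Map a single word to phonemes using the inherent-vowel rule."""
    phonemes = []
    for i, ch in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch in CONSONANTS:
            phonemes.append(CONSONANTS[ch])
            # A consonant carries the inherent vowel unless a vowel sign
            # or a virama follows it.
            if nxt not in VOWEL_SIGNS and nxt != VIRAMA:
                phonemes.append(INHERENT_VOWEL)
        elif ch in VOWEL_SIGNS:
            phonemes.append(VOWEL_SIGNS[ch])
        elif ch == ANUSVARA:
            phonemes.append("m")
        elif ch == VIRAMA:
            continue  # virama only suppresses the inherent vowel
        else:
            raise ValueError(f"letter not in toy table: {ch!r}")
    return "".join(phonemes)
```

With these tables, `toy_g2p(clean_text("കേരളം"))` reproduces the transcription on the slide; any other word will raise until the tables are extended.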

Proposed Method for Synthetic Speech Dataset (2/2)

Speech Synthesis (TTS)
  - Map phonemes to synthetic speech representation.
  - Uses models like IndicTTS, IndicF5, XTTS.
Generate Spectrogram (Acoustic Model)
  - Convert phoneme sequence into a spectrogram (visual sound representation).
Vocoder → Convert to Speech
  - Convert spectrogram into natural audio waveforms.
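From the dataset builder's side, the synthesis stages above collapse into one call: text in, waveform out. The loop around that call can be sketched as follows. Everything here is an assumption made for illustration: `placeholder_synthesize` is a hypothetical stand-in that emits a plain tone (it is not a TTS model), and the sample rate, file names, and manifest fields are invented. A real run would pass a function wrapping one of the models named above.

```python
import json
import math
import struct
import wave
from pathlib import Path

SAMPLE_RATE = 22050  # assumed output rate; real models define their own


def placeholder_synthesize(text: str) -> list:
    """Stand-in for a TTS model: a 220 Hz tone whose length grows with the text."""
    n = int(SAMPLE_RATE * 0.05 * max(len(text), 1))
    return [0.3 * math.sin(2 * math.pi * 220 * t / SAMPLE_RATE) for t in range(n)]


def write_wav(path: Path, samples: list) -> float:
    """Write mono 16-bit PCM and return the duration in seconds."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    pcm = struct.pack(f"<{len(ints)}h", *ints)
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes = 16-bit samples
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(pcm)
    return len(samples) / SAMPLE_RATE


def build_dataset(sentences, out_dir, synthesize=placeholder_synthesize,
                  model_name="placeholder"):
    """Synthesize each sentence, save it as WAV, and write a JSONL manifest."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    rows = []
    for i, text in enumerate(sentences):
        wav_path = out / f"utt_{i:05d}.wav"
        duration = write_wav(wav_path, synthesize(text))
        rows.append({
            "id": i,
            "text": text,
            "audio": wav_path.name,
            "duration_s": round(duration, 3),
            "model": model_name,  # records which model produced the clip
        })
    with open(out / "metadata.jsonl", "w", encoding="utf-8") as f:
        for row in rows:
            # ensure_ascii=False keeps Malayalam text readable in the file
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return rows
```

Recording the model name next to every clip keeps the origin of each synthetic sample traceable once clips from several models are mixed into one dataset.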

Proposed Models for Synthetic Speech Dataset

IndicF5: Open-source Indic TTS, supports Indian scripts.
XTTS-v2: Multilingual, high-quality cross-lingual TTS.
FreeVC24 / Voice Cloning: Generate diverse speakers.

Domains used for Text Data

News Websites
Wikipedia
Literature and Books
OpenSubtitles.org
Crowdsourced Text Datasets

Public Dataset created

Creating Synthetic Image Dataset with Wan2.2

What is the Wan2.2 model?

How well does it perform for the Malayalam language?

Thank you

Question and Answer