Building great datasets on your mother tongue with Open-Weight models
IndiaFOSS 2025
Sunday, September 21, 2025
Outline
- Why even generate synthetic data?
- Generating synthetic data with Wan2.2 text-to-video models for Onam images
- Generating a Malayalam speech dataset
- Conclusion
$whoami
- An open-source developer, creator of “whisper_normalizer”, and ML developer @Sarvam
Disclaimer
Why even generate synthetic data?
- Real datasets are scarce.