Building great datasets on your mother tongue with Open-Weight models

IndiaFOSS 2025

Kurian Benoy

Sunday, September 21, 2025

Outline

  • Why even generate synthetic data?
  • Generating synthetic data with Wan2.2 text-to-video models for Onam images
  • Generating Malayalam Speech Dataset
  • Conclusion

$whoami

  • An open-source developer, creator of “whisper_normalizer”, and ML developer at Sarvam

Disclaimer

Why even generate synthetic data?

  • Real datasets are scarce.