Prevent Cadence from defaulting to Hindi punctuation for Marathi transcripts

#12
by vishald07 - opened

We are using Cadence for punctuation restoration on Marathi transcripts generated by speech-to-text. However, because the transcripts often contain broken or phonetically inconsistent words, Cadence applies Hindi-style punctuation instead of Marathi-style punctuation.

Is there a way to explicitly tell Cadence which language it's processing, or configure it to prefer Marathi-specific punctuation rules? Any best practices for handling noisy ASR output with Cadence would also be appreciated.

Sample Text
हिं्मत नाही म्हणून ुम्ही सगळ्यांनी निर्वसनी झालं पाहिजे नाहीतर ुम्हाला कुणालाच मुलांना सांगण्याचा अधिकार पोहोचणारच नाही

AI4Bharat org

What do you mean by "Marathi-style" punctuation and "Hindi-style" punctuation? Could you provide a detailed example showing where it uses Hindi-style punctuation inappropriately?
We did not train Cadence with any such preference. It follows the punctuation usually found in Marathi text on the web, since that is what it was trained on.

If there is something wrong with how it's punctuating Marathi text, we can fix it and release a new version.
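
One way to put together the kind of detailed example asked for above is to tally the punctuation marks in the model's output and quote the text around each suspicious one. The following is only a minimal sketch, assuming the "Hindi-style" complaint is about the danda (।) appearing where Marathi text would normally end a sentence with a full stop. It does not call Cadence at all: `punctuated` is a hypothetical placeholder for whatever text the model returned, and `tally_marks` and `danda_contexts` are illustrative helper names, not part of any library.

```python
from collections import Counter

# Marks to tally. Assumption: the danda is the mark being called "Hindi-style".
MARKS = {
    "\u0964": "danda",
    "\u0965": "double danda",
    ".": "full stop",
    ",": "comma",
    "?": "question mark",
    "!": "exclamation mark",
}


def tally_marks(text: str) -> Counter:
    """Count how often each tracked punctuation mark occurs in the text."""
    return Counter(MARKS[ch] for ch in text if ch in MARKS)


def danda_contexts(text: str, width: int = 15) -> list[str]:
    """Return a short snippet around every danda so each case can be quoted."""
    return [
        text[max(0, i - width): i + width + 1]
        for i, ch in enumerate(text)
        if ch == "\u0964"
    ]


# Hypothetical model output, made up purely for illustration.
punctuated = "तुम्ही या। मी येतो."

print(tally_marks(punctuated))
print(danda_contexts(punctuated))
```

Posting the raw input, the punctuated output, and the snippets from `danda_contexts` side by side would show exactly which positions are in question.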
