Prevent Cadence from defaulting to Hindi punctuation for Marathi transcripts

#12

by vishald07 - opened Jun 23


We are using Cadence for punctuation restoration on Marathi transcripts generated from speech-to-text. However, since the transcripts often contain broken or phonetically inconsistent words, Cadence applies Hindi-style punctuation instead of Marathi.

Is there a way to explicitly tell Cadence which language it's processing, or configure it to prefer Marathi-specific punctuation rules? Any best practices for handling noisy ASR output with Cadence would also be appreciated.

Sample Text
हिं्मत नाही म्हणून ुम्ही सगळ्यांनी निर्वसनी झालं पाहिजे नाहीतर ुम्हाला कुणालाच मुलांना सांगण्याचा अधिकार पोहोचणारच नाही
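
One low-cost check for noisy ASR output like the sample above is to flag words whose Devanagari is structurally malformed before they reach the punctuation model. The sketch below is only an illustration: `flag_broken_words` is a hypothetical helper, not part of Cadence, and the two rules it applies (a word that starts with a combining sign, a virama that does not follow a letter or nukta) are assumptions about what "broken" means here.

```python
import unicodedata

VIRAMA = "\u094D"  # Devanagari sign virama (halant)
NUKTA = "\u093C"   # Devanagari sign nukta


def is_combining_mark(ch):
    """True for dependent vowel signs, anusvara, virama and similar marks."""
    return unicodedata.category(ch) in ("Mn", "Mc")


def flag_broken_words(text):
    """Return (word, reasons) pairs for words that look structurally malformed.

    Hypothetical heuristic, not a full Devanagari validator:
      1. the word starts with a combining mark (its base letter is missing)
      2. a virama follows something other than a letter or nukta
    """
    flagged = []
    for word in text.split():
        reasons = []
        if is_combining_mark(word[0]):
            reasons.append("starts with a combining sign")
        for prev, ch in zip(word, word[1:]):
            is_letter = unicodedata.category(prev) == "Lo"
            if ch == VIRAMA and not (is_letter or prev == NUKTA):
                reasons.append("virama does not follow a letter")
                break
        if reasons:
            flagged.append((word, reasons))
    return flagged
```

Words flagged this way could be reviewed or corrected upstream. Whether doing so changes the punctuation Cadence produces has not been verified here.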

psidharth567

AI4Bharat org 11 days ago

What do you mean by "Marathi style" punctuation and "Hindi style" punctuation? Could you provide a detailed example showing where it uses Hindi-style punctuation inappropriately?
We did not train Cadence with any such preference. It follows the punctuation usually found in Marathi text on the web (as that is what it was trained on).

If there is something wrong with how it's punctuating Marathi text, then we can fix it and release a new version.
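
To put together the kind of example requested above, it may help to tally which marks appear in a punctuated output. This is a rough sketch, assuming that "Hindi-style" refers to the danda (।, U+0964) and "Marathi-style" to the full stop, which the thread does not confirm. `punctuation_profile` is a hypothetical helper and does not call Cadence; it only inspects a string you pass in.

```python
from collections import Counter

DANDA = "\u0964"  # Devanagari danda, assumed here to be the "Hindi-style" sentence end

# Marks to count; this set is an assumption and can be extended as needed.
MARKS = {
    DANDA: "danda",
    ".": "full stop",
    "?": "question mark",
    "!": "exclamation mark",
    ",": "comma",
}


def punctuation_profile(text):
    """Count how often each tracked punctuation mark occurs in text."""
    return Counter(MARKS[ch] for ch in text if ch in MARKS)
```

Running this on the punctuated output for a few transcripts, and posting the input alongside the counts, would show concretely which marks are being inserted.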
