"Whisper Hebrish": A Code-Switching ASR Fine-Tune for English-Hebrew Immigrant Speech Patterns
Yesterday I finally wrapped up a project that’s been percolating for almost a year: creating a personal fine-tune of OpenAI Whisper (Large V3 Turbo). Because it’s trained on my voice, the model is private — nobody else sounds like me, after all.
When I first got deeply into voice tech and speech-to-text (STT), fine-tuning Whisper was a dream goal. I speak with a faint Irish accent, but more importantly, my voice is uniquely mine. Everyone’s is.
Fine-tuning ASR models is often associated with domain-specific vocab — medical, legal, technical — but it can substantially improve accuracy even if you speak in perfectly ordinary English and mostly talk about the weather. You don’t need to speak a rare language or work in radiology to benefit. If you're motivated enough to build the training dataset, the improvements are there for the taking.
It’s not glamorous work. In my case, it meant recording hundreds of strange LLM-generated sentences at odd hours of the night. The key is precision: the text transcript is the ground truth, and you want sentences that reflect the kind of things you actually say. If Whisper always turns “Hugging Face” into “Huffington Face,” then into the dataset it goes.
For Whisper, keep samples under 30 seconds: the model processes audio in 30-second windows, and short clips also keep WhisperX word-level alignment happy. Speak normally. Use a decent microphone. That’s the core of it.
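The 30-second constraint is easy to enforce programmatically before training. A minimal stdlib sketch (assuming your recordings are WAV files; the file name below is just an illustration):

```python
import wave

MAX_SECONDS = 30.0  # Whisper operates on 30-second audio windows


def wav_duration(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def validate_clip(path: str) -> bool:
    """True if the clip fits inside a single Whisper window."""
    return wav_duration(path) <= MAX_SECONDS
```

Run it over your recordings directory before training and re-record anything that fails the check.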
Synthetic Data Makes This Easy
LLMs are excellent for generating synthetic training data. My early attempt involved reading Sherlock Holmes aloud — not recommended. Instead, I generated 500 short sentences on varied topics and then asked the model for targeted samples: for example, “Give me 10 sentences that use ‘Hugging Face’ and ‘GitHub.’”
Record the audio. Pair each sentence with its audio file in a JSON array. Congratulations — you’re already 80% of the way to a Whisper fine-tune. The rest is Python plus a GPU that costs roughly as much as a modest apartment.
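The pairing step can be as simple as writing a manifest file. A sketch, assuming one WAV per sentence; the sentences, file names, and schema here are my own illustrative choices, though the `{"audio": ..., "text": ...}` shape is a common convention for Hugging Face audio datasets:

```python
import json

# Hypothetical sentence list; in practice, your LLM-generated prompts.
sentences = [
    "I pushed the model to Hugging Face last night.",
    "Clone the repo from GitHub and run the training script.",
]

# Pair each ground-truth transcript with its recording.
manifest = [
    {"audio": f"recordings/{i:04d}.wav", "text": text}
    for i, text in enumerate(sentences)
]

with open("dataset.json", "w", encoding="utf-8") as f:
    json.dump(manifest, f, ensure_ascii=False, indent=2)
```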
From Dataset to Fine-Tune
I finished the dataset about a month ago and trained the fine-tune on Whisper Large V3 Turbo. (Turbo is itself an OpenAI fine-tune, so this is a fine-tune on a fine-tune.)
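For readers curious what “the rest is Python” looks like: the standard recipe is the Hugging Face Seq2SeqTrainer loop. The sketch below follows that recipe — the base model id is real, but the dataset path, collator, and hyperparameters are illustrative guesses, not my exact configuration:

```python
import torch
from datasets import Audio, load_dataset
from transformers import (
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)

BASE = "openai/whisper-large-v3-turbo"  # Turbo: OpenAI's own distilled fine-tune
processor = WhisperProcessor.from_pretrained(BASE)
model = WhisperForConditionalGeneration.from_pretrained(BASE)

# Hypothetical manifest path; expects {"audio": ..., "text": ...} rows.
ds = load_dataset("json", data_files="dataset.json", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))


def prepare(batch):
    # Convert raw audio to log-mel features and text to label token ids.
    audio = batch["audio"]
    batch["input_features"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    batch["labels"] = processor.tokenizer(batch["text"]).input_ids
    return batch


ds = ds.map(prepare, remove_columns=ds.column_names)


def collate(features):
    # Pad audio features and label ids separately.
    batch = processor.feature_extractor.pad(
        [{"input_features": f["input_features"]} for f in features],
        return_tensors="pt",
    )
    labels = processor.tokenizer.pad(
        [{"input_ids": f["labels"]} for f in features], return_tensors="pt"
    )
    # Replace padding with -100 so it is ignored by the loss.
    batch["labels"] = labels["input_ids"].masked_fill(
        labels["attention_mask"].ne(1), -100
    )
    return batch


# Illustrative hyperparameters only; tune for your dataset size.
args = Seq2SeqTrainingArguments(
    output_dir="whisper-finetune",
    per_device_train_batch_size=8,
    learning_rate=1e-5,
    max_steps=500,
    fp16=True,
)
trainer = Seq2SeqTrainer(
    model=model, args=args, train_dataset=ds, data_collator=collate
)
trainer.train()
```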
Using a Modal A100 GPU, the training ran so quickly I barely had time to ask myself “Was this worth it?” (Yes.) Modal’s documentation even walks you through computing baseline vs. post-training WER (word error rate) so you can quantify the gains.
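WER is just word-level edit distance divided by reference length. In practice you’d use standard tooling (e.g. the `evaluate` or `jiwer` packages, as Modal’s docs do), but the metric itself fits in a few lines of stdlib Python:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return float(bool(hyp))
    # Classic dynamic-programming edit distance, one row at a time.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(
                d[j] + 1,         # deletion
                d[j - 1] + 1,     # insertion
                prev + (r != h),  # substitution
            )
    return d[-1] / len(ref)


ref = "I went to the makolet today"
hyp = "I went to the Macaulay today"
print(round(wer(ref, hyp), 3))  # 1 substitution out of 6 words → 0.167
```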
Alongside my private model, I decided to create a public fine-tune as well. For this, I built a small English-Hebrew code-switching dataset yesterday and ran it through the same training pipeline.
The result is Whisper Hebrish — available on Hugging Face and on Replicate for inference.
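If you want to try it locally, the standard `transformers` pipeline should work. Note the model id below is my assumption based on the Hugging Face space name — check the space for the exact repository:

```python
from transformers import pipeline

# Assumed model id; see the danielrosehill/Whisper-Hebrish space on
# Hugging Face for the actual repository name.
asr = pipeline(
    "automatic-speech-recognition",
    model="danielrosehill/whisper-hebrish",
)

# "my_clip.wav" is a placeholder for your own recording.
result = asr("my_clip.wav")
print(result["text"])
```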
What Is Code-Switching?
Code-switching is what happens when speakers mix languages mid-sentence. As an English-speaking immigrant in Israel, I live in a constant linguistic mash-up. Americans have “Spanglish.” We have “Hebrish.”
I do this unconsciously: “I’m popping down to the makolet” (makolet = grocery store).
Code-switching shows up everywhere. Hebrew borrows from Arabic. Arabic speakers in Israel casually use bits of Hebrew, like “shoo il mushkile ma'3a il-mazgan?!” (“what’s the problem with the mazgan?!”, mazgan being Hebrew for air conditioner).
It’s a natural, pervasive part of multilingual communities. Linguists don’t design languages; humans improvise them.
But the API Wants a Language Tag?!
ASR models don’t have an innate notion of “language.” They operate on phonetics and token prediction — a Transformer guessing the most likely next unit of text based on the audio signal.
This approach is brilliant and powerful, but it has consequences. An API might ask you to specify a language code like en. Doing so typically conditions decoding on English, or routes you to an English-only variant. But what happens when you say a Hebrew word or two? If you switch to a multilingual model, you may lose accuracy.
Code-switching challenges the assumption of a single-language input. The model’s accuracy drops not because it can’t “understand” Hebrew, but because your voice no longer sounds like the monolingual data it was trained on.
Method
I created Whisper Hebrish to test whether a simple code-switching fine-tune could fix these edge cases. Based on my initial tests, the answer seems to be yes.
Here’s a sample sentence, transcribed by both models.
My fine-tune produced:
I went to the makolet today to pick up some bread, and I also got my teudat zehut.
Stock Whisper Large V3 Turbo produced:
I went to the Macaulay today to pick up some bread and I also got my Theodette Sahoot.
“Macaulay” and “Theodette Sahoot” are, of course, nonsense.
You can try a side-by-side comparison in this demo:
https://huggingface.co/spaces/danielrosehill/Whisper-Hebrish
Dataset & Creation Process
To build the dataset, I asked Claude to generate a CSV of 500 words and phrases commonly used by English-speaking residents in Israel. It generated a delightfully eclectic list: place names, bureaucratic terminology, groceries, you name it.
I selected a subset of terms I knew Whisper struggled with, added my own, and generated sentence-level training data around them. Claude even produced a quick GUI to help me record each line efficiently.
My wife listened in confusion while I said things like “I need to buy challah and bring it to Yerushalayim” into a microphone. But it worked.
The dataset is here:
https://huggingface.co/datasets/danielrosehill/English-Hebrew-Mixed-Sentences
Training ran on a Modal A100 GPU, which chewed through the job effortlessly.
Fine-Tuning ASR Models Really Works
Whisper is one of the few open-source projects that genuinely feels like a gift to humanity. When you fine-tune it on your own voice, the improvements can be enormous — and measurable via WER.
I voice-type for around two hours per day. My personal fine-tune, running on my own machine, beats every commercial model I’ve tested, including stock Whisper via the OpenAI API.
I’m not claiming brilliance; the credit goes to the Whisper team, the LLM that helped generate the dataset, and Modal’s GPU.
The larger point is that anyone can make their own ASR fine-tune and deploy it on Hugging Face or Replicate for serverless inference.
Wrapping Up
Whisper Hebrish was a fun (and slightly chaotic) afternoon experiment, but it reflects something I care about deeply: making STT genuinely usable for multilingual daily life.
If you’d like to chat about the project, collaborate, or try building your own ASR fine-tune, you’re welcome to reach out.