My voice phone agent uses two models: one to hear the caller, one to think. Gemma 4 can do both at once, so I tried deleting the speech-to-text model entirely. Across English, French, and Mandarin, here’s the head-to-head on response time and the thing that actually matters on a phone line: did it reply correctly. In English, one model beat two on both speed and accuracy. In French, it was faster and missed only one reply. In Chinese, it confidently told me a football field is 100 yards. Here’s the data.
The cascade, and the temptation to collapse it
My voice phone agent uses the same architecture as almost every spoken assistant: a cascade of specialist models.
caller audio ──▶ [ faster-whisper ] ──▶ text ──▶ [ Gemma LLM ] ──▶ reply ──▶ [ TTS ] ──▶ audio
                   speech-to-text                  reasoning
Two models loaded, two GPU residents, two sequential hops before the caller hears anything. Then Gemma 4 shipped in Ollama with an audio capability — it can take audio directly as input. Which raises the obvious question:
Why transcribe at all? If the LLM can hear the caller, it can understand and answer in one step. One model, one hop.
So I set up a clean head-to-head, holding the LLM constant so the only thing that changes is the input modality:
- Pipeline A — the cascade: audio → faster-whisper → text → gemma4:12b → reply
- Pipeline B — multimodal: audio → gemma4:12b (audio in) → reply
Same model doing the reasoning in both. I measured the two things that actually matter for a phone agent:
- Reply correctness — did the agent answer the question correctly?
- Response time — how long until the reply is ready.
A note on what to measure, because for a telephony system it changes the conclusion. The instinct is to score the speech-to-text by Word Error Rate. But a caller never hears the transcript — they hear the reply. What matters on a phone line is whether the agent did the right thing: answered the question, booked the table, cancelled the reservation. A transcript that drops a filler word or writes “7:00” for “seven o’clock” is harmless if the reply is still correct. So reply correctness is the primary metric here; transcription WER is only a diagnostic I use later to explain why the replies succeed or fail. Judge the system by the thing the caller actually experiences.
Methodology
I wrote 42 caller turns (15 English, 15 French, 12 Mandarin) and synthesized them to speech with gTTS: factual questions, arithmetic, and task requests (bookings, cancellations), each with a checkable answer key so correctness is objective, not vibes:
| Spoken turn | Correct iff reply contains |
|---|---|
| “What is the capital of France?” | paris |
| “What is fifteen plus twenty-seven?” | 42 / forty-two |
| “I’d like to book a table for four at seven tonight.” | (four and seven) |
| “法国的首都是哪里?” | 巴黎 |
| “二乘以八等于几?” | 16 / 十六 |
Grading is regex on the reply, case-insensitive, with Traditional→Simplified normalization for Chinese (faster-whisper and Gemma both emit Traditional characters; the keys are Simplified). Both pipelines used identical generation settings (temperature=0, reasoning disabled). Hardware: one 24 GB GPU, Ollama 0.30.7, gemma4:12b (Q4_K_M), faster-whisper medium.
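The grading rule can be sketched roughly as follows. This is a minimal illustration rather than the contents of e2e/regrade.py: the `T2S` table is a tiny hand-written stand-in that covers only characters from the examples in this post (a real run would want a full Traditional-to-Simplified converter), and the function names and key patterns are assumptions.

```python
import re

# Tiny Traditional -> Simplified table: an illustrative stand-in only.
# It covers just the characters needed for the examples in this post.
T2S = str.maketrans({"國": "国", "條": "条", "於": "于",
                     "語": "语", "數": "数", "個": "个"})

def normalize(reply: str) -> str:
    """Map Traditional characters to Simplified so replies match Simplified keys."""
    return reply.translate(T2S)

def is_correct(reply: str, key_patterns: list[str]) -> bool:
    """A reply is correct iff every key pattern matches, case-insensitively."""
    text = normalize(reply)
    return all(re.search(p, text, flags=re.IGNORECASE) for p in key_patterns)

# Hypothetical answer keys in the spirit of the table above.
KEYS = {
    "capital_en": [r"paris"],
    "sum_en": [r"42|forty[- ]two"],
    "booking_en": [r"four|4", r"seven|7"],  # party size AND time must both appear
    "capital_zh": [r"巴黎"],
    "product_zh": [r"16|十六"],
}
```

Keys like these are deliberately loose substring matches, so a manual pass over the replies that matched is still worthwhile.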
One setup gotcha worth saving you an hour: Ollama’s native /api/chat silently ignores the audios field — you get “there’s no audio attached.” Audio only works through the OpenAI-compatible /v1/chat/completions endpoint as an input_audio content part (base64, 16 kHz mono).
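A small helper can catch the other half of that gotcha, a clip in the wrong format, before it reaches the model. The sketch below uses only the standard library; the helper names are hypothetical, and the payload shape is the same input_audio content part used in the turn handler later in this post.

```python
import base64
import io
import wave

def make_audio_part(wav_bytes: bytes) -> dict:
    """Wrap a WAV clip as an OpenAI-style input_audio content part.

    Raises ValueError unless the clip is 16 kHz mono.
    """
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        if wav.getframerate() != 16000 or wav.getnchannels() != 1:
            raise ValueError("expected 16 kHz mono WAV")
    return {
        "type": "input_audio",
        "input_audio": {
            "data": base64.b64encode(wav_bytes).decode("ascii"),
            "format": "wav",
        },
    }

def silent_wav(seconds: float = 0.1, rate: int = 16000, channels: int = 1) -> bytes:
    """Build a short silent 16-bit PCM clip in memory, for trying the helper out."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate) * channels)
    return buf.getvalue()
```

Calling `make_audio_part(silent_wav())` returns a part ready to drop into a user message's content list, while an 8 kHz or stereo clip fails fast with a ValueError instead of producing a confusing reply.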
English: the one-model dream, realized
| Pipeline | Reply accuracy | Median latency |
|---|---|---|
| A — cascade (faster-whisper → gemma4) | 93 % | 0.81 s |
| B — multimodal (gemma4 audio) | 100 % | 0.66 s |
The multimodal model won on both axes. Every English question got a correct answer, and it did so faster than the cascade — because one model call beats two sequential ones (transcribe-then-reason). Side by side, they’re indistinguishable in quality:
"What is two times eight?"
A: Two times eight is 16. (0.79 s)
B: Two times eight is sixteen. (0.60 s)
"How many legs does a spider have?"
A: A spider has eight legs. (0.76 s)
B: A spider has eight legs. (0.61 s)
(The single cascade “miss” wasn’t an error — asked to book a table, it replied “what’s the name of the restaurant?”, a sensible clarifying turn that just didn’t contain the answer-key words. The multimodal model happened to confirm the booking outright.)
For an English voice agent, this is the result you hoped for: drop a whole model, get faster and at-least-as-accurate replies.
French: not an English fluke
To check whether this is an English-only trick, I ran the full 15-item set in French, both pipelines, same as English:
| Pipeline | Reply accuracy | Median latency |
|---|---|---|
| A — cascade (faster-whisper → gemma4) | 100 % | 0.89 s |
| B — multimodal (gemma4 audio) | 93 % | 0.71 s |
French behaves like English: the multimodal model understands the audio and answers correctly. Same near-indistinguishable quality:
Quelle est la capitale de la France ?
A: La capitale de la France est Paris. B: La capitale de la France est Paris. ✓
Quel est le contraire de chaud ?
A: Le contraire de chaud est froid. B: Le contraire de chaud est froid. ✓
Combien de pattes a une araignée ?
A: …elle possède huit pattes articulées. B: Une araignée a généralement huit pattes. ✓
The single multimodal “miss” is revealing: asked “combien font quinze plus vingt-sept ?”, it answered “quarenta e dois”, which is the right answer (42) but in Portuguese. The audio was understood; the model just slipped languages. For a French phone line that’s still a failure (the caller hears Portuguese), so it’s scored wrong, but it’s a language-control wobble, not deafness. Nothing like the Chinese collapse you’re about to see. The single-model result generalizes at least to French: gemma4’s audio input is not an English-only trick.
Chinese: the same model, off a cliff
| Pipeline | Reply accuracy | Median latency |
|---|---|---|
| A — cascade (faster-whisper → gemma4) | 92 % | 0.83 s |
| B — multimodal (gemma4 audio) | 8 % † | 0.67 s |
† The single “correct” is a fluke — asked water’s boiling point, B answered “a football field is 100 yards”, which matched the “100” key. True accuracy is 0 / 12.
Same LLM. Same questions. The only change is feeding audio instead of a transcript — and the agent goes from 92 % to effectively zero. The failure isn’t subtle mistakes; it’s the model answering completely unrelated questions, because it can’t make out the Mandarin at all and confabulates:
法国的首都是哪里? (What's the capital of France?)
A: 法國的首都是巴黎。 ✓ (The capital of France is Paris.)
B: 法語的字母數量是26個。 ✗ (The French alphabet has 26 letters.)
一只蛛蛛有几条腿? (How many legs does a spider have?)
A: 蛛蛛有八條腿。 ✓ (A spider has eight legs.)
B: The nearest subway station is just around the corner. ✗
二乘以八等于几? (What is two times eight?)
A: 2乘以8等於16。 ✓
B: 法国一共12个省。 ✗ (France has 12 provinces.)
The cascade, meanwhile, handles Chinese almost perfectly — because faster-whisper transcribes the Mandarin correctly and then the very same Gemma answers it correctly.
Why: it’s the ears, not the brain
The reasoning is identical across both pipelines, so the collapse must be in the audio understanding. A separate transcription benchmark on Google’s FLEURS read-speech set confirms it precisely:
For the WER/CER run I asked Gemma 4 to transcribe, nothing more — a single user turn (no system prompt), temperature=0, reasoning_effort="none":
Transcribe the speech in this audio exactly, word for word, in its original language. Output only the transcription text, nothing else.
| | English (WER ↓) | Chinese (CER ↓) |
|---|---|---|
| faster-whisper medium | 3.6 % | 12.6 % |
| gemma4:12b (audio) | 9.2 % | 76.6 % |
(WER/CER are computed after normalization — Whisper’s English text normalizer so “7:00” equals “seven o’clock”, and Han-character-only comparison for Chinese — applied identically to both systems.)
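For readers who want to see what those two numbers measure, here is a minimal sketch of both metrics as edit distance over tokens. It is not the benchmark's score.py: lowercasing and whitespace splitting stand in for Whisper's English text normalizer, the Han filter uses the basic CJK Unified Ideographs range only, and both functions assume a non-empty reference.

```python
import re

def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance between two token sequences (single-row DP)."""
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev_diag, row[0] = row[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(row[j] + 1,            # deletion
                      row[j - 1] + 1,        # insertion
                      prev_diag + (r != h))  # substitution, or free match
            prev_diag, row[j] = row[j], cur
    return row[-1]

def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words, hyp_words = ref.lower().split(), hyp.lower().split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)

HAN = re.compile(r"[\u4e00-\u9fff]")  # basic CJK Unified Ideographs block

def cer_han(ref: str, hyp: str) -> float:
    """Character error rate over Han characters only (punctuation ignored)."""
    ref_chars, hyp_chars = HAN.findall(ref), HAN.findall(hyp)
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)
```

Because insertions count as errors, a hypothesis much longer than the reference can push either rate above 100 %.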
Gemma 4’s audio encoder hears English well enough (9.2 % word error, and it even paraphrases sensibly) that the downstream answer comes out right. On Chinese it’s at 76.6 % character error: it isn’t transcribing, it’s hallucinating, so the brain, however capable, is reasoning over noise. Garbage in, confident garbage out. The audio front-end isn’t English-only (French replies held up nearly as well), but in these tests its coverage stops short of Mandarin, at least at this model size and quantization.
The latency nuance (and a landmine)
The multimodal path’s latency edge (0.66 s vs 0.81 s) is real but modest: collapsing two sequential calls into one saves a hop, and faster-whisper itself is cheap (~0.2–0.3 s). The bigger win is operational — one model to load, serve, and monitor instead of two.
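A timing harness for this kind of comparison can be very small. The sketch below is an assumption about shape, not the code in e2e/run_e2e.py: it times each call wall-clock until the full reply is ready and reports the median, and the two `fake_*` pipelines are sleep-based stand-ins so the example runs without a GPU.

```python
import statistics
import time
from typing import Callable, Iterable

def median_latency(pipeline: Callable[[bytes], str], turns: Iterable[bytes]) -> float:
    """Run `pipeline` on each turn and return the median wall-clock seconds."""
    latencies = []
    for audio in turns:
        start = time.perf_counter()
        pipeline(audio)
        latencies.append(time.perf_counter() - start)
    return statistics.median(latencies)

def fake_cascade(audio: bytes) -> str:
    time.sleep(0.02)  # stand-in for the transcription hop
    time.sleep(0.03)  # stand-in for the LLM call on the transcript
    return "reply"

def fake_multimodal(audio: bytes) -> str:
    time.sleep(0.03)  # stand-in for a single audio-in LLM call
    return "reply"
```

Median rather than mean keeps one runaway call from dominating the summary, which is also why a stall like the one described next deserves to be reported separately.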
One landmine, though: Gemma 4 is a reasoning model, and on the audio endpoint its thinking trace is hard to suppress reliably (reasoning_effort, think:false, and enable_thinking were all flaky in my tests). Most calls were snappy, but in separate runs a single transcription occasionally ballooned to ~40 seconds of internal monologue, sometimes leaking raw <channel|> control tokens. For a real-time phone line, an unpredictable 40 s stall is its own disqualifier until the control surface stabilizes.
How it plugs into a phone agent
In a SIP/telephony stack the multimodal swap is small. Your phone bridge (e.g. pyVoIP over a VoIP provider) already buffers the caller’s audio and uses voice-activity detection to decide when they’ve stopped talking. Today that buffer goes to a speech-to-text model and the text goes to an LLM. With Gemma 4 you collapse those two steps into one call:
┌─ before ─────────────────────────────────────────────┐
caller ▶ VAD ▶ buffer ▶ [faster-whisper] ▶ text ▶ [LLM] ▶ reply ▶ TTS ▶ caller
└─────────────────────────────────────────────────────┘
┌─ after ──────────────────────────────────────────────┐
caller ▶ VAD ▶ buffer ▶──────────▶ [Gemma 4 audio] ▶ reply ▶ TTS ▶ caller
└─────────────────────────────────────────────────────┘
The turn handler becomes one function — resample the buffered turn to 16 kHz mono, attach it as an input_audio part, and keep the running conversation as ordinary text history:
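import base64
import requests

# AGENT_PROMPT and stream_to_tts are defined elsewhere in the agent.
def handle_turn(turn_wav_16k_mono, history):  # history = prior turns (text)
    # turn_wav_16k_mono: raw bytes of one caller turn, already a 16 kHz mono WAV
    audio_b64 = base64.b64encode(turn_wav_16k_mono).decode()
    messages = [{"role": "system", "content": AGENT_PROMPT}, *history,
                {"role": "user", "content": [
                    {"type": "input_audio",
                     "input_audio": {"data": audio_b64, "format": "wav"}}]}]
    # stream=True makes requests hand over chunks as they arrive instead of buffering
    r = requests.post("http://localhost:11434/v1/chat/completions", json={
        "model": "gemma4:12b", "messages": messages,
        "stream": True, "temperature": 0.0, "reasoning_effort": "none"},
        stream=True)
    reply = stream_to_tts(r)  # speak as tokens arrive; returns the full reply text
    # Only text goes into history, so a placeholder stands in for the caller audio
    history += [{"role": "user", "content": "(caller audio)"},
                {"role": "assistant", "content": reply}]
    return reply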
Four things that matter in practice:
- Use the OpenAI-compatible endpoint. The audio goes only through /v1/chat/completions as input_audio; the native /api/chat audios field is silently dropped.
- History stays text. Only the current turn is audio; past turns ride along as transcribed/assistant text, which keeps the prompt small and the model grounded.
- Stream into the TTS. Start synthesizing the reply on the first clause so the caller hears audio in well under a second — the same trick that makes the cascade feel fast.
- Guard the reasoning blowup. Cap the response (timeout / max tokens) and fall back to the cascade if a turn stalls, so one runaway thinking trace can’t freeze the call.
That’s the whole change — for an English- or French-language line. For Chinese you keep faster-whisper in front, and Gemma 4 stays purely the text brain.
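The "guard the reasoning blowup" point above can be sketched as a deadline with a fallback. This is one possible shape under stated assumptions, not the agent's actual code: `reply_with_fallback` and the two stand-in pipelines are hypothetical names, and a thread-based deadline only stops waiting for the stalled call, it does not cancel the work on the server.

```python
import concurrent.futures
import time
from typing import Callable

def reply_with_fallback(primary: Callable[[bytes], str],
                        fallback: Callable[[bytes], str],
                        audio: bytes,
                        timeout_s: float) -> str:
    """Return primary(audio), or fallback(audio) if primary exceeds timeout_s."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(primary, audio)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return fallback(audio)
    finally:
        # Do not block on the stalled call; its thread finishes in the background.
        pool.shutdown(wait=False)

def stalled_multimodal(audio: bytes) -> str:
    time.sleep(0.3)  # stand-in for a runaway thinking trace
    return "late reply"

def cascade(audio: bytes) -> str:
    return "cascade reply"  # stand-in for faster-whisper followed by the LLM
```

In a real handler the HTTP request would also carry its own timeout and a token cap, so the abandoned call does not keep the GPU busy while the fallback runs.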
Verdict: one model, but mind the language
| | English | French | Chinese |
|---|---|---|---|
| Replace the cascade with one multimodal model? | Yes | Yes | No |
| Reply accuracy, multimodal | 100 % | 93 % | ~0 % |
| Reply accuracy, cascade | 93 % | 100 % | 92 % |
| Response time, multimodal vs cascade | 0.66 s vs 0.81 s | 0.71 s vs 0.89 s | 0.67 s vs 0.83 s |
| Why | hears English well | hears French well | can’t hear Mandarin |
(Reply correctness, what the caller actually experiences, is the row that decides this. Transcription WER/CER is the “Why,” not the verdict.)
- English- or French-first agent: the single multimodal model is a genuine win. It is faster, simpler, and on the metric that matters (correct replies) either ahead of the cascade (English) or within one reply of it (French). Just budget for the reasoning-mode latency tax.
- Mandarin (or any language the encoder can’t hear) in the loop: keep the cascade. A dedicated ASR model isn’t legacy baggage — it’s the only reason the Chinese agent answers correctly at all. And note French works fine, so this is about which languages the audio encoder covers, not “English vs the rest.”
- The metric that decides it is reply correctness, not WER — because a phone caller is served by the answer, not the transcript. Test the language(s) you’ll actually deploy in, and grade the reply.
- My bilingual (Chinese/English) phone agent keeps faster-whisper + an LLM. The one-model future is clearly coming — for English and French it’s already here — but it arrives one language at a time.
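For a deployment where each phone line is configured for a known language, the verdict reduces to a small routing rule. The sketch below is hypothetical (the names and the per-line language setting are assumptions, and it is not how my own agent is wired); it sends only the languages that passed the tests above to the multimodal path and defaults everything else, including untested languages, to the cascade.

```python
# Languages where the multimodal path held up in the tests above.
AUDIO_OK = {"en", "fr"}

def choose_pipeline(line_language: str) -> str:
    """Pick a pipeline from the language a phone line is configured for."""
    return "multimodal" if line_language.lower() in AUDIO_OK else "cascade"
```

Defaulting to the cascade is the conservative choice: a language only earns the multimodal path after its replies have been graded.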
The meta-lesson: a spec sheet that lists “audio input” is telling you the plumbing exists, not that the quality does — and a single model that aces a task in one language can score zero in another. Hold the reasoning constant, vary one thing, and measure the thing your users actually feel: did it answer, and how fast.
Reproduce it
e2e/items.json # 27 spoken turns + answer keys (en + zh)
e2e/gen_audio.py # gTTS -> audio/<id>.mp3
e2e/stt.py # cascade stage 1: faster-whisper -> stt.tsv (+ latency)
e2e/run_e2e.py # A (whisper text -> gemma4) vs B (audio -> gemma4); replies + latency + correctness
e2e/regrade.py # Traditional->Simplified-aware scoring -> summary tables
Plus a standalone transcription benchmark (build_corpus.py / run_fw.py / run_gemma4.py / score.py) on Google FLEURS for the WER/CER mechanism numbers. Hardware: one 24 GB GPU, Ollama 0.30.7. If you run it on a larger Gemma 4 or other languages, I’d love to see the numbers.