My voice phone agent uses two models: one to hear the caller, one to think. Gemma 4 can do both at once, so I tried deleting the speech-to-text model entirely. Across English, French, and Mandarin, here’s the head-to-head on response time and the thing that actually matters on a phone line: did it reply correctly. In English, one model beat two: faster and more accurate. In French, it was faster and missed a single reply. In Chinese, it confidently told me a football field is 100 yards. Here’s the data.


The cascade, and the temptation to collapse it

My voice phone agent uses the same architecture as almost every spoken assistant: a cascade of specialist models.

caller audio ──▶ [ faster-whisper ] ──▶ text ──▶ [ Gemma LLM ] ──▶ reply ──▶ [ TTS ] ──▶ audio
                   speech-to-text                  reasoning

Two models loaded, two GPU residents, two sequential hops before the caller hears anything. Then Gemma 4 shipped in Ollama with an audio capability — it can take audio directly as input. Which raises the obvious question:

Why transcribe at all? If the LLM can hear the caller, it can understand and answer in one step. One model, one hop.

So I set up a clean head-to-head, holding the LLM constant so the only thing that changes is the input modality:

  A: cascade (caller audio → faster-whisper → text → gemma4)
  B: multimodal (caller audio → gemma4 directly)

Same model doing the reasoning in both. I measured the two things that actually matter for a phone agent:

  1. Reply correctness — did the agent answer the question correctly?
  2. Response time — how long until the reply is ready.

A note on what to measure, because for a telephony system it changes the conclusion. The instinct is to score the speech-to-text by Word Error Rate. But a caller never hears the transcript — they hear the reply. What matters on a phone line is whether the agent did the right thing: answered the question, booked the table, cancelled the reservation. A transcript that drops a filler word or writes “7:00” for “seven o’clock” is harmless if the reply is still correct. So reply correctness is the primary metric here; transcription WER is only a diagnostic I use later to explain why the replies succeed or fail. Judge the system by the thing the caller actually experiences.
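As a minimal sketch of how those two numbers could be computed per pipeline, assuming each graded turn is stored as a correctness flag plus a latency in seconds. The `Turn` record and `summarize` helper are illustrative names, not the benchmark's actual code, and the demo values are made up:

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Turn:
    correct: bool      # did the reply match the answer key?
    latency_s: float   # seconds until the full reply was ready


def summarize(turns):
    """Return (reply accuracy in percent, median latency in seconds)."""
    if not turns:
        raise ValueError("no turns to summarize")
    accuracy = 100.0 * sum(t.correct for t in turns) / len(turns)
    return accuracy, median(t.latency_s for t in turns)


# Made-up numbers for illustration only, not the measured results.
demo = [Turn(True, 0.61), Turn(True, 0.66), Turn(False, 0.70)]
acc, med = summarize(demo)
print(f"accuracy={acc:.0f} %  median latency={med:.2f} s")
```

The median is used rather than the mean so that a single slow call does not dominate the latency figure.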

Methodology

I wrote 42 caller turns (15 English, 15 French, 12 Mandarin) and synthesized them as natural speech with gTTS: factual questions, arithmetic, and task requests (bookings, cancellations). Each has a checkable answer key, so correctness is objective, not vibes:

Spoken turn                                               Correct iff reply contains
“What is the capital of France?”                          paris
“What is fifteen plus twenty-seven?”                      42 / forty-two
“I’d like to book a table for four at seven tonight.”     (four and seven)
“法国的首都是哪里?”                                        巴黎
“二乘以八等于几?”                                          16 / 十六

Grading is regex on the reply, case-insensitive, with Traditional→Simplified normalization for Chinese (faster-whisper and Gemma both emit Traditional characters; the keys are Simplified). Both pipelines used identical generation settings (temperature=0, reasoning disabled). Hardware: one 24 GB GPU, Ollama 0.30.7, gemma4:12b (Q4_K_M), faster-whisper medium.
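A grader along those lines might look like the sketch below. It is an assumption-laden illustration, not the actual `regrade.py`: the tiny `T2S` table stands in for a full Traditional→Simplified converter (such as OpenCC), the lookahead pattern is one possible way to encode a "four and seven" key, and the `八条腿` key is hypothetical.

```python
import re

# Tiny illustrative Traditional -> Simplified table. A real run would use a
# full converter; only a handful of characters are listed here.
T2S = str.maketrans({"國": "国", "語": "语", "條": "条",
                     "於": "于", "數": "数", "個": "个"})

# One way to require both words, in any order, with a single regex.
BOOKING_KEY = r"(?=.*\bfour\b)(?=.*\bseven\b)"


def is_correct(reply, key_pattern):
    """True if the script-normalized reply matches the answer-key regex."""
    normalized = reply.translate(T2S)
    return re.search(key_pattern, normalized, flags=re.IGNORECASE) is not None


print(is_correct("The capital of France is Paris.", r"paris"))
print(is_correct("2乘以8等於16。", r"16|十六"))
print(is_correct("蜘蛛有八條腿。", r"八条腿"))        # hypothetical Simplified key
print(is_correct("What's the name of the restaurant?", BOOKING_KEY))
```

Substring keys like these are easy to audit but can be fooled by an unrelated reply that happens to contain the key, so it is worth reading the matches by hand.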

One setup gotcha worth saving you an hour: Ollama’s native /api/chat silently ignores the audios field — you get “there’s no audio attached.” Audio only works through the OpenAI-compatible /v1/chat/completions endpoint as an input_audio content part (base64, 16 kHz mono).
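To make that request shape concrete, here is a small sketch that checks a WAV is 16 kHz mono and wraps it as an `input_audio` content part. The helper names (`build_audio_request`, `silent_wav`) are mine, and the silent clip exists only to exercise the builder; the resulting dict is what would be POSTed as JSON to /v1/chat/completions.

```python
import base64
import io
import wave


def build_audio_request(wav_bytes, model="gemma4:12b"):
    """Build a chat-completions body carrying one WAV turn as input_audio."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        if wav.getframerate() != 16000 or wav.getnchannels() != 1:
            raise ValueError("expected 16 kHz mono WAV")
    audio_b64 = base64.b64encode(wav_bytes).decode("ascii")
    return {
        "model": model,
        "temperature": 0.0,
        "messages": [{
            "role": "user",
            "content": [{
                "type": "input_audio",
                "input_audio": {"data": audio_b64, "format": "wav"},
            }],
        }],
    }


def silent_wav(seconds=0.1, rate=16000):
    """Generate a short silent 16-bit mono WAV (test input only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()


body = build_audio_request(silent_wav())
print(body["messages"][0]["content"][0]["type"])
```

Validating the sample rate up front turns a silent quality problem into an immediate error.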


English: the one-model dream, realized

Pipeline                                 Reply accuracy   Median latency
A: cascade (faster-whisper → gemma4)     93 %             0.81 s
B: multimodal (gemma4 audio)             100 %            0.66 s

The multimodal model won on both axes. Every English question got a correct answer, and it did so faster than the cascade — because one model call beats two sequential ones (transcribe-then-reason). Side by side, they’re indistinguishable in quality:

"What is two times eight?"
  A: Two times eight is 16.          (0.79 s)
  B: Two times eight is sixteen.     (0.60 s)

"How many legs does a spider have?"
  A: A spider has eight legs.        (0.76 s)
  B: A spider has eight legs.        (0.61 s)

(The single cascade “miss” wasn’t an error — asked to book a table, it replied “what’s the name of the restaurant?”, a sensible clarifying turn that just didn’t contain the answer-key words. The multimodal model happened to confirm the booking outright.)

For an English voice agent, this is the result you hoped for: drop a whole model, get faster and at-least-as-accurate replies.


French: not an English fluke

To check whether this is an English-only trick, I ran the full 15-item set in French, both pipelines, same as English:

Pipeline                                 Reply accuracy   Median latency
A: cascade (faster-whisper → gemma4)     100 %            0.89 s
B: multimodal (gemma4 audio)             93 %             0.71 s

French behaves like English: the multimodal model understands the audio and answers correctly. Same near-indistinguishable quality:

Quelle est la capitale de la France ?
  A: La capitale de la France est Paris.        B: La capitale de la France est Paris.   ✓
Quel est le contraire de chaud ?
  A: Le contraire de chaud est froid.           B: Le contraire de chaud est froid.       ✓
Combien de pattes a une araignée ?
  A: …elle possède huit pattes articulées.      B: Une araignée a généralement huit pattes. ✓

The single multimodal “miss” is revealing: asked “combien font quinze plus vingt-sept ?”, it answered “quarenta e dois” — the right answer (42), but in Portuguese. The audio was understood; the model just slipped languages. For a French phone line that’s still a failure (the caller hears Portuguese), so it’s scored wrong — but it’s a language-control wobble, not deafness. Nothing like the Chinese collapse you’re about to see. The single-model result clearly generalizes: gemma4’s audio works on major European languages, not just English.


Chinese: the same model, off a cliff

Pipeline                                 Reply accuracy   Median latency
A: cascade (faster-whisper → gemma4)     92 %             0.83 s
B: multimodal (gemma4 audio)             8 % †            0.67 s

† The single “correct” is a fluke — asked water’s boiling point, B answered “a football field is 100 yards”, which matched the “100” key. True accuracy is 0 / 12.

Same LLM. Same questions. The only change is feeding audio instead of a transcript — and the agent goes from 92 % to effectively zero. The failure isn’t subtle mistakes; it’s the model answering completely unrelated questions, because it can’t make out the Mandarin at all and confabulates:

法国的首都是哪里?   (What's the capital of France?)
  A: 法國的首都是巴黎。                    ✓  (The capital of France is Paris.)
  B: 法語的字母數量是26個。               ✗  (The French alphabet has 26 letters.)

一只蜘蛛有几条腿?   (How many legs does a spider have?)
  A: 蜘蛛有八條腿。                        ✓  (A spider has eight legs.)
  B: The nearest subway station is just around the corner.   ✗

二乘以八等于几?     (What is two times eight?)
  A: 2乘以8等於16。                        ✓
  B: 法国一共12个省。                    ✗  (France has 12 provinces.)

The cascade, meanwhile, handles Chinese almost perfectly — because faster-whisper transcribes the Mandarin correctly and then the very same Gemma answers it correctly.


Why: it’s the ears, not the brain

The reasoning is identical across both pipelines, so the collapse must be in the audio understanding. A separate transcription benchmark on Google’s FLEURS read-speech set confirms it precisely:

For the WER/CER run I asked Gemma 4 to transcribe, nothing more: a single user turn (no system prompt), temperature=0, reasoning_effort="none":

  Transcribe the speech in this audio exactly, word for word, in its original language. Output only the transcription text, nothing else.

                          English (WER ↓)   Chinese (CER ↓)
faster-whisper medium     3.6 %             12.6 %
gemma4:12b (audio)        9.2 %             76.6 %

(WER/CER are computed after normalization — Whisper’s English text normalizer so “7:00” equals “seven o’clock”, and Han-character-only comparison for Chinese — applied identically to both systems.)
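For readers who want the mechanics, WER and CER are both an edit distance divided by the reference length, over words and characters respectively. The sketch below shows that calculation with a Han-only filter for Chinese. It is not the benchmark's `score.py`, and it deliberately leaves out Whisper's English normalizer and any Traditional/Simplified folding, both of which a real run would apply first.

```python
import re


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def han_only(text):
    """Keep only CJK Unified Ideographs, dropping punctuation and Latin text."""
    return re.findall(r"[\u4e00-\u9fff]", text)


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance divided by reference length (the WER/CER definition)."""
    if not ref_tokens:
        raise ValueError("empty reference")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


wer = error_rate("what is two times eight".split(),
                 "what is two time eight".split())
cer = error_rate(han_only("法国的首都是哪里?"), han_only("法国的首都在哪里"))
print(f"WER={wer:.2f}  CER={cer:.3f}")
```

Because insertions count as errors, a hallucinating transcriber can exceed 100 %; a 76.6 % CER means most reference characters were wrong or missing.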

Gemma 4’s audio encoder hears English well enough (9.2 % word error, and it even paraphrases sensibly) that the downstream answer comes out right. On Chinese it’s at 76.6 % character error: it isn’t transcribing, it’s hallucinating, so the brain, however capable, is reasoning over noise. Garbage in, confident garbage out. The audio front-end isn’t strictly English-centric (French replies were nearly as accurate); on this evidence it’s major-European-language-centric, with Mandarin coverage simply not there yet at this model size and quantization.


The latency nuance (and a landmine)

The multimodal path’s latency edge (0.66 s vs 0.81 s) is real but modest: collapsing two sequential calls into one saves a hop, and faster-whisper itself is cheap (~0.2–0.3 s). The bigger win is operational — one model to load, serve, and monitor instead of two.

One landmine, though: Gemma 4 is a reasoning model, and on the audio endpoint its thinking trace is hard to suppress reliably (reasoning_effort, think:false, and enable_thinking were all flaky in my tests). Most calls were snappy, but in separate runs a single transcription occasionally ballooned to ~40 seconds of internal monologue, sometimes leaking raw <channel|> control tokens. For a real-time phone line, an unpredictable 40 s stall is its own disqualifier until the control surface stabilizes.
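One way to contain that risk, sketched here under assumptions rather than taken from my agent: run the audio call under a deadline, fall back to another path (for example the cascade) if it stalls, and strip anything that looks like a leaked control token. `reply_with_deadline`, the 2-second default, and the token regex are all illustrative; the regex is guessed from the `<channel|>` shape above and may not cover every token.

```python
import re
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Assumed token shape, based only on the "<channel|>" leak described above.
CONTROL_TOKEN = re.compile(r"<[^<>\s]*\|[^<>\s]*>")


def reply_with_deadline(primary, fallback, audio, deadline_s=2.0):
    """Run primary(audio); if it misses the deadline, use fallback(audio)."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(primary, audio)
    try:
        text = future.result(timeout=deadline_s)
    except FutureTimeout:
        text = fallback(audio)
    finally:
        pool.shutdown(wait=False)   # do not block the call on a stalled worker
    return CONTROL_TOKEN.sub("", text).strip()


def stalled(_audio):
    time.sleep(0.3)                 # stands in for a runaway thinking trace
    return "late reply"


def cascade(_audio):
    return "<channel|>A spider has eight legs."


print(reply_with_deadline(stalled, cascade, b"", deadline_s=0.05))
```

The abandoned worker thread keeps running until its call returns, so a real handler would also put a timeout on the HTTP request itself to free the GPU slot.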


How it plugs into a phone agent

In a SIP/telephony stack the multimodal swap is small. Your phone bridge (e.g. pyVoIP over a VoIP provider) already buffers the caller’s audio and uses voice-activity detection to decide when they’ve stopped talking. Today that buffer goes to a speech-to-text model and the text goes to an LLM. With Gemma 4 you collapse those two steps into one call:

         ┌─ before ─────────────────────────────────────────────┐
caller ▶ VAD ▶ buffer ▶ [faster-whisper] ▶ text ▶ [LLM] ▶ reply ▶ TTS ▶ caller
         └─────────────────────────────────────────────────────┘
         ┌─ after ──────────────────────────────────────────────┐
caller ▶ VAD ▶ buffer ▶──────────▶ [Gemma 4 audio] ▶ reply ▶ TTS ▶ caller
         └─────────────────────────────────────────────────────┘

The turn handler becomes one function — resample the buffered turn to 16 kHz mono, attach it as an input_audio part, and keep the running conversation as ordinary text history:

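import base64
import requests

# AGENT_PROMPT (the system prompt string) and stream_to_tts (consumes the
# streamed response, speaks it, returns the full reply text) live elsewhere.

def handle_turn(turn_wav_16k_mono, history):           # history = prior turns (text)
    """Send one buffered caller turn (WAV bytes, 16 kHz mono) and speak the reply."""
    audio_b64 = base64.b64encode(turn_wav_16k_mono).decode()
    messages = [{"role": "system", "content": AGENT_PROMPT}, *history,
                {"role": "user", "content": [
                    {"type": "input_audio",
                     "input_audio": {"data": audio_b64, "format": "wav"}}]}]
    r = requests.post("http://localhost:11434/v1/chat/completions", json={
        "model": "gemma4:12b", "messages": messages,
        "stream": True, "temperature": 0.0, "reasoning_effort": "none"},
        stream=True)                                    # let requests yield chunks as they arrive
    r.raise_for_status()
    reply = stream_to_tts(r)                            # speak as tokens arrive
    # The audio is not stored in history; a text placeholder marks the caller turn.
    history += [{"role": "user", "content": "(caller audio)"},
                {"role": "assistant", "content": reply}]
    return reply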

Four things that matter in practice:

That’s the whole change — for an English- or French-language line. For Chinese you keep faster-whisper in front, and Gemma 4 stays purely the text brain.
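A per-language routing rule for that split could be as small as the sketch below. It assumes the line's language is known from configuration (not detected from the audio), and `choose_pipeline` and `MULTIMODAL_OK` are illustrative names. Languages I did not measure default to the cascade.

```python
# Languages where the audio path held up in these tests.
MULTIMODAL_OK = {"en", "fr"}


def choose_pipeline(language_code):
    """Return 'multimodal' or 'cascade' for a line's configured language."""
    primary = language_code.lower().split("-")[0]   # "fr-CA" -> "fr"
    return "multimodal" if primary in MULTIMODAL_OK else "cascade"


for code in ("en-US", "fr", "zh-CN", "de"):
    print(code, "->", choose_pipeline(code))
```

Defaulting to the cascade is the conservative choice: the Mandarin result shows that an untested language can fail completely rather than degrade gracefully.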


Verdict: one model, but mind the language

                                                 English              French               Chinese
Replace the cascade with one multimodal model?   Yes                  Yes                  No
Reply accuracy, multimodal                       100 %                93 %                 ~0 %
Reply accuracy, cascade                          93 %                 100 %                92 %
Response time, multimodal vs cascade             0.66 s vs 0.81 s     0.71 s vs 0.89 s     0.67 s vs 0.83 s
Why                                              hears English well   hears French well    can’t hear Mandarin

(Reply correctness — what the caller actually experiences — is the column that decides this. Transcription WER/CER is the “Why,” not the verdict.)

The meta-lesson: a spec sheet that lists “audio input” is telling you the plumbing exists, not that the quality does — and a single model that aces a task in one language can score zero in another. Hold the reasoning constant, vary one thing, and measure the thing your users actually feel: did it answer, and how fast.


Reproduce it

e2e/items.json     # 27 spoken turns + answer keys (en + zh)
e2e/gen_audio.py   # gTTS -> audio/<id>.mp3
e2e/stt.py         # cascade stage 1: faster-whisper -> stt.tsv (+ latency)
e2e/run_e2e.py     # A (whisper text -> gemma4) vs B (audio -> gemma4); replies + latency + correctness
e2e/regrade.py     # Traditional->Simplified-aware scoring -> summary tables

Plus a standalone transcription benchmark (build_corpus.py / run_fw.py / run_gemma4.py / score.py) on Google FLEURS for the WER/CER mechanism numbers. Hardware: one 24 GB GPU, Ollama 0.30.7. If you run it on a larger Gemma 4 or other languages, I’d love to see the numbers.