Gemma 4 for Telephony: I Replaced Two AI Models With One in My Voice Phone Agent -- Papaya Assist

My voice phone agent uses two models: one to hear the caller, one to think. Gemma 4 can do both at once, so I tried deleting the speech-to-text model entirely. Across English, French, and Mandarin, here’s the head-to-head on response time and the thing that actually matters on a phone line: did it reply correctly. In English, one model beat two: faster and more accurate. In French, it was faster but missed one item the cascade got right. In Chinese, it confidently told me a football field is 100 yards. Here’s the data.

The cascade, and the temptation to collapse it

My voice phone agent uses the same architecture as almost every spoken assistant: a cascade of specialist models.

caller audio ──▶ [ faster-whisper ] ──▶ text ──▶ [ Gemma LLM ] ──▶ reply ──▶ [ TTS ] ──▶ audio
                   speech-to-text                  reasoning

Two models loaded, two GPU residents, two sequential hops before the caller hears anything. Then Gemma 4 shipped in Ollama with an audio capability — it can take audio directly as input. Which raises the obvious question:

Why transcribe at all? If the LLM can hear the caller, it can understand and answer in one step. One model, one hop.

So I set up a clean head-to-head, holding the LLM constant so the only thing that changes is the input modality:

Pipeline A — the cascade: audio → faster-whisper → text → gemma4:12b → reply
Pipeline B — multimodal: audio → gemma4:12b (audio in) → reply

Same model doing the reasoning in both. I measured the two things that actually matter for a phone agent:

Reply correctness — did the agent answer the question correctly?
Response time — how long until the reply is ready.

A note on what to measure, because for a telephony system it changes the conclusion. The instinct is to score the speech-to-text by Word Error Rate. But a caller never hears the transcript — they hear the reply. What matters on a phone line is whether the agent did the right thing: answered the question, booked the table, cancelled the reservation. A transcript that drops a filler word or writes “7:00” for “seven o’clock” is harmless if the reply is still correct. So reply correctness is the primary metric here; transcription WER is only a diagnostic I use later to explain why the replies succeed or fail. Judge the system by the thing the caller actually experiences.

Methodology

I wrote 42 caller turns (15 English, 15 French, 12 Mandarin), phrased as natural speech and synthesized with gTTS: factual questions, arithmetic, and task requests (bookings, cancellations). Each has a checkable answer key so correctness is objective, not vibes:

Spoken turn	Correct iff reply contains
“What is the capital of France?”	paris
“What is fifteen plus twenty-seven?”	42 / forty-two
“I’d like to book a table for four at seven tonight.”	(four and seven)
“法国的首都是哪里？”	巴黎
“二乘以八等于几？”	16 / 十六

Grading is regex on the reply, case-insensitive, with Traditional→Simplified normalization for Chinese (faster-whisper and Gemma both emit Traditional characters; the keys are Simplified). Both pipelines used identical generation settings (temperature=0, reasoning disabled). Hardware: one 24 GB GPU, Ollama 0.30.7, gemma4:12b (Q4_K_M), faster-whisper medium.
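A minimal sketch of that grading rule, for illustration only. The function names and the six-character Traditional→Simplified map are assumptions made up for this example; a real run would need a full conversion table (for instance from a library such as OpenCC) rather than this hand-picked subset.

```python
import re

# Tiny illustrative Traditional -> Simplified map (assumption: stands in for
# a complete conversion table, which this sketch does not include).
T2S = str.maketrans({"國": "国", "條": "条", "於": "于",
                     "數": "数", "個": "个", "語": "语"})


def is_correct(reply: str, key_pattern: str) -> bool:
    """True if the script-normalized reply matches the answer-key regex."""
    normalized = reply.translate(T2S)
    return re.search(key_pattern, normalized, flags=re.IGNORECASE) is not None


def all_present(reply: str, key_patterns: list[str]) -> bool:
    """For keys like "(four and seven)": every pattern must appear."""
    return all(is_correct(reply, p) for p in key_patterns)
```

Keyword grading of this kind is objective but not airtight: `is_correct("A football field is 100 yards.", r"100")` returns True even though the reply answers nothing, so surprising passes are worth reading by hand.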

One setup gotcha worth saving you an hour: Ollama’s native /api/chat silently ignores the audios field — you get “there’s no audio attached.” Audio only works through the OpenAI-compatible /v1/chat/completions endpoint as an input_audio content part (base64, 16 kHz mono).
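Since the endpoint expects 16 kHz mono audio, it can help to fail fast before sending anything. The sketch below checks a WAV buffer with the standard-library `wave` module and wraps it as an `input_audio` part. The helper names `pcm_to_wav` and `to_input_audio_part` are hypothetical, and the assumption that the audio is 16-bit PCM WAV is mine, not something the setup above states.

```python
import base64
import io
import wave


def pcm_to_wav(pcm16: bytes, rate: int = 16000, channels: int = 1) -> bytes:
    """Wrap raw 16-bit PCM samples in a WAV container (demo helper)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(2)          # 2 bytes per sample = 16-bit PCM
        w.setframerate(rate)
        w.writeframes(pcm16)
    return buf.getvalue()


def to_input_audio_part(wav_bytes: bytes) -> dict:
    """Validate 16 kHz mono, then build the input_audio content part."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        if w.getnchannels() != 1 or w.getframerate() != 16000:
            raise ValueError(
                f"expected 16 kHz mono, got {w.getframerate()} Hz, "
                f"{w.getnchannels()} channel(s)")
    data = base64.b64encode(wav_bytes).decode("ascii")
    return {"type": "input_audio",
            "input_audio": {"data": data, "format": "wav"}}
```

The returned dict goes inside the `content` list of a user message sent to /v1/chat/completions.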

English: the one-model dream, realized

Pipeline	Reply accuracy	Median latency
A — cascade (faster-whisper → gemma4)	93 %	0.81 s
B — multimodal (gemma4 audio)	100 %	0.66 s

The multimodal model won on both axes. Every English question got a correct answer, and it did so faster than the cascade — because one model call beats two sequential ones (transcribe-then-reason). Side by side, they’re indistinguishable in quality:

"What is two times eight?"
  A: Two times eight is 16.          (0.79 s)
  B: Two times eight is sixteen.     (0.60 s)

"How many legs does a spider have?"
  A: A spider has eight legs.        (0.76 s)
  B: A spider has eight legs.        (0.61 s)

(The single cascade “miss” wasn’t an error — asked to book a table, it replied “what’s the name of the restaurant?”, a sensible clarifying turn that just didn’t contain the answer-key words. The multimodal model happened to confirm the booking outright.)

For an English voice agent, this is the result you hoped for: drop a whole model, get faster and at-least-as-accurate replies.

French: not an English fluke

To check whether this is an English-only trick, I ran the full 15-item set in French, both pipelines, same as English:

Pipeline	Reply accuracy	Median latency
A — cascade (faster-whisper → gemma4)	100 %	0.89 s
B — multimodal (gemma4 audio)	93 %	0.71 s

French behaves much like English: the multimodal model understands the audio and replies faster, though here the cascade edges it on accuracy (15/15 vs 14/15). Same near-indistinguishable quality:

Quelle est la capitale de la France ?
  A: La capitale de la France est Paris.        B: La capitale de la France est Paris.   ✓
Quel est le contraire de chaud ?
  A: Le contraire de chaud est froid.           B: Le contraire de chaud est froid.       ✓
Combien de pattes a une araignée ?
  A: …elle possède huit pattes articulées.      B: Une araignée a généralement huit pattes. ✓

The single multimodal “miss” is revealing: asked “combien font quinze plus vingt-sept ?”, it answered “quarenta e dois”, the right answer (42) but in Portuguese. The audio was understood; the model just slipped languages. For a French phone line that’s still a failure (the caller hears Portuguese), so it’s scored wrong, but it’s a language-control wobble, not deafness. Nothing like the Chinese collapse you’re about to see. The single-model result generalizes beyond English: gemma4’s audio also works on French, though one extra language doesn’t prove it for every major European language.

Chinese: the same model, off a cliff

Pipeline	Reply accuracy	Median latency
A — cascade (faster-whisper → gemma4)	92 %	0.83 s
B — multimodal (gemma4 audio)	8 % †	0.67 s

† The single “correct” is a fluke — asked water’s boiling point, B answered “a football field is 100 yards”, which matched the “100” key. True accuracy is 0 / 12.

Same LLM. Same questions. The only change is feeding audio instead of a transcript — and the agent goes from 92 % to effectively zero. The failure isn’t subtle mistakes; it’s the model answering completely unrelated questions, because it can’t make out the Mandarin at all and confabulates:

法国的首都是哪里？   (What's the capital of France?)
  A: 法國的首都是巴黎。                    ✓  (The capital of France is Paris.)
  B: 法語的字母數量是26個。               ✗  (The French alphabet has 26 letters.)

一只蜘蛛有几条腿？   (How many legs does a spider have?)
  A: 蜘蛛有八條腿。                        ✓  (A spider has eight legs.)
  B: The nearest subway station is just around the corner.   ✗

二乘以八等于几？     (What is two times eight?)
  A: 2乘以8等於16。                        ✓
  B: 法国一共十二个省。                   ✗  (France has 12 provinces.)

The cascade, meanwhile, handles Chinese almost perfectly — because faster-whisper transcribes the Mandarin correctly and then the very same Gemma answers it correctly.

Why: it’s the ears, not the brain

The reasoning is identical across both pipelines, so the collapse must be in the audio understanding. A separate transcription benchmark on Google’s FLEURS read-speech set confirms it precisely:

For the WER/CER run I asked Gemma 4 to transcribe, nothing more — a single user turn (no system prompt), temperature=0, reasoning_effort="none":

Transcribe the speech in this audio exactly, word for word, in its original language. Output only the transcription text, nothing else.

	English (WER ↓)	Chinese (CER ↓)
faster-whisper `medium`	3.6 %	12.6 %
gemma4:12b (audio)	9.2 %	76.6 %

(WER/CER are computed after normalization — Whisper’s English text normalizer so “7:00” equals “seven o’clock”, and Han-character-only comparison for Chinese — applied identically to both systems.)
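For readers who want to see the metric itself, here is a minimal sketch of WER and Han-only CER as edit distance divided by reference length. It is not the benchmark's scoring script: the function names are mine, English normalization is reduced to lowercasing and whitespace splitting (Whisper's normalizer is not reproduced), and the Han filter is assumed to be the basic CJK Unified Ideographs range.

```python
import re


def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    """Word error rate; raises ZeroDivisionError on an empty reference."""
    ref_words = ref.lower().split()
    return edit_distance(ref_words, hyp.lower().split()) / len(ref_words)


HAN = re.compile(r"[\u4e00-\u9fff]")  # assumption: basic CJK ideograph block


def cer_han(ref: str, hyp: str) -> float:
    """Character error rate over Han characters only (punctuation dropped)."""
    ref_chars = HAN.findall(ref)
    return edit_distance(ref_chars, HAN.findall(hyp)) / len(ref_chars)
```

Because insertions count as errors, both rates can exceed 100 % when a hypothesis is much longer than its reference, which is one way a hallucinating transcriber shows up.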

Gemma 4’s audio encoder hears English well enough (9.2 % word error, and it even paraphrases sensibly) that the downstream answer comes out right. On Chinese it’s at 76.6 % character error: it isn’t transcribing, it’s hallucinating, so the brain, however capable, is reasoning over noise. Garbage in, confident garbage out. The audio front-end isn’t English-only (French replies were nearly as accurate), but in these tests its coverage stops short of Mandarin, at least at this model size and quantization.

The latency nuance (and a landmine)

The multimodal path’s latency edge (0.66 s vs 0.81 s) is real but modest: collapsing two sequential calls into one saves a hop, and faster-whisper itself is cheap (~0.2–0.3 s). The bigger win is operational — one model to load, serve, and monitor instead of two.
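A minimal sketch of how a per-pipeline median can be measured, assuming each pipeline is wrapped as a callable that takes one turn and returns when the reply is ready. The name `median_latency` and the injectable `clock` parameter are choices made for this example; timing the full call (rather than time to first token) is also an assumption.

```python
import statistics
import time


def median_latency(pipeline, turns, clock=time.perf_counter) -> float:
    """Run every turn through the pipeline and return the median seconds."""
    latencies = []
    for turn in turns:
        start = clock()
        pipeline(turn)                      # blocks until the reply is ready
        latencies.append(clock() - start)
    return statistics.median(latencies)
```

The median is a sensible headline number here because a single runaway turn would drag a mean far from what a typical caller experiences; the tail deserves its own look.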

One landmine, though: Gemma 4 is a reasoning model, and on the audio endpoint its thinking trace is hard to suppress reliably (reasoning_effort, think:false, and enable_thinking were all flaky in my tests). Most calls were snappy, but in separate runs a single transcription occasionally ballooned to ~40 seconds of internal monologue, sometimes leaking raw <channel|> control tokens. For a real-time phone line, an unpredictable 40 s stall is its own disqualifier until the control surface stabilizes.

How it plugs into a phone agent

In a SIP/telephony stack the multimodal swap is small. Your phone bridge (e.g. pyVoIP over a VoIP provider) already buffers the caller’s audio and uses voice-activity detection to decide when they’ve stopped talking. Today that buffer goes to a speech-to-text model and the text goes to an LLM. With Gemma 4 you collapse those two steps into one call:

         ┌─ before ─────────────────────────────────────────────┐
caller ▶ VAD ▶ buffer ▶ [faster-whisper] ▶ text ▶ [LLM] ▶ reply ▶ TTS ▶ caller
         └─────────────────────────────────────────────────────┘
         ┌─ after ──────────────────────────────────────────────┐
caller ▶ VAD ▶ buffer ▶──────────▶ [Gemma 4 audio] ▶ reply ▶ TTS ▶ caller
         └─────────────────────────────────────────────────────┘

The turn handler becomes one function — resample the buffered turn to 16 kHz mono, attach it as an input_audio part, and keep the running conversation as ordinary text history:

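import base64
import requests

# AGENT_PROMPT (system prompt string) and stream_to_tts (speaks the streamed
# reply and returns its text) are defined elsewhere in the agent.
def handle_turn(turn_wav_16k_mono: bytes, history: list) -> str:   # history = prior turns (text)
    audio_b64 = base64.b64encode(turn_wav_16k_mono).decode()
    messages = [{"role": "system", "content": AGENT_PROMPT}, *history,
                {"role": "user", "content": [
                    {"type": "input_audio",
                     "input_audio": {"data": audio_b64, "format": "wav"}}]}]
    r = requests.post("http://localhost:11434/v1/chat/completions", json={
        "model": "gemma4:12b", "messages": messages,
        "stream": True, "temperature": 0.0, "reasoning_effort": "none"},
        stream=True,   # have requests yield the body incrementally, not buffer it
        timeout=30)    # illustrative value: fail if no bytes arrive for 30 s
    r.raise_for_status()
    reply = stream_to_tts(r)                            # speak as tokens arrive
    history += [{"role": "user", "content": "(caller audio)"},   # extends the caller's list in place
                {"role": "assistant", "content": reply}]
    return reply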

Four things that matter in practice:

Use the OpenAI-compatible endpoint. The audio goes only through /v1/chat/completions as input_audio; the native /api/chat audios field is silently dropped.
History stays text. Only the current turn is audio; past turns ride along as transcribed/assistant text, which keeps the prompt small and the model grounded.
Stream into the TTS. Start synthesizing the reply on the first clause so the caller hears audio in well under a second — the same trick that makes the cascade feel fast.
Guard the reasoning blowup. Cap the response (timeout / max tokens) and fall back to the cascade if a turn stalls, so one runaway thinking trace can’t freeze the call.

That’s the whole change — for an English- or French-language line. For Chinese you keep faster-whisper in front, and Gemma 4 stays purely the text brain.
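On a multi-language deployment, that rule amounts to a small routing table. The sketch below assumes the line's language is known up front as a language tag (for example, configured per phone number), which is not something the setup above specifies; `AUDIO_OK` and `choose_pipeline` are hypothetical names.

```python
# Languages where the audio path held up in the tests above. Anything else,
# including languages that were never tested, stays on the cascade.
AUDIO_OK = {"en", "fr"}


def choose_pipeline(language_tag: str) -> str:
    """Return "multimodal" or "cascade" for a tag like "en", "fr-CA", "zh"."""
    primary = language_tag.lower().split("-")[0]
    return "multimodal" if primary in AUDIO_OK else "cascade"
```

Defaulting unknown languages to the cascade is the conservative choice: the multimodal path failed silently on Mandarin, so a language should be added to the set only after it has been graded on replies.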

Verdict: one model, but mind the language

	English	French	Chinese
Replace the cascade with one multimodal model?	Yes	Yes	No
Reply accuracy, multimodal	100 %	93 %	~0 %
Reply accuracy, cascade	93 %	100 %	92 %
Response time, multimodal vs cascade	0.66 s vs 0.81 s	0.71 s vs 0.89 s	0.67 s vs 0.83 s
Why	hears English well	hears French well	can’t hear Mandarin

(Reply correctness — what the caller actually experiences — is the column that decides this. Transcription WER/CER is the “Why,” not the verdict.)

English- or French-first agent: the single multimodal model is a genuine win. It is faster, simpler, and on the metric that matters (correct replies) more accurate in English and within one item of the cascade in French. Just budget for the reasoning-mode latency tax.
Mandarin (or any language the encoder can’t hear) in the loop: keep the cascade. A dedicated ASR model isn’t legacy baggage — it’s the only reason the Chinese agent answers correctly at all. And note French works fine, so this is about which languages the audio encoder covers, not “English vs the rest.”
The metric that decides it is reply correctness, not WER — because a phone caller is served by the answer, not the transcript. Test the language(s) you’ll actually deploy in, and grade the reply.
My bilingual (Chinese/English) phone agent keeps faster-whisper + an LLM. The one-model future is clearly coming — for English and French it’s already here — but it arrives one language at a time.

The meta-lesson: a spec sheet that lists “audio input” is telling you the plumbing exists, not that the quality does — and a single model that aces a task in one language can score zero in another. Hold the reasoning constant, vary one thing, and measure the thing your users actually feel: did it answer, and how fast.

Reproduce it

e2e/items.json     # 42 spoken turns + answer keys (en + fr + zh)
e2e/gen_audio.py   # gTTS -> audio/<id>.mp3
e2e/stt.py         # cascade stage 1: faster-whisper -> stt.tsv (+ latency)
e2e/run_e2e.py     # A (whisper text -> gemma4) vs B (audio -> gemma4); replies + latency + correctness
e2e/regrade.py     # Traditional->Simplified-aware scoring -> summary tables

Plus a standalone transcription benchmark (build_corpus.py / run_fw.py / run_gemma4.py / score.py) on Google FLEURS for the WER/CER mechanism numbers. Hardware: one 24 GB GPU, Ollama 0.30.7. If you run it on a larger Gemma 4 or other languages, I’d love to see the numbers.