
OpenAI's voice latency write-up: a four-layer read for production deployments

OpenAI published a deep dive on how they keep Realtime API latency low. Here's the four-layer read, plus what we've found running voice agents on a different stack.

May 5, 2026
Tags: voice-agents, openai, latency, production-infrastructure

OpenAI published an infrastructure post this week walking through how their Realtime voice stack stays responsive under load. It's an unusually concrete read: not a marketing piece, but a look at actual edge topology, streaming protocols, and chunk sizes. The headline numbers (median latencies) are in the post itself; rather than restate them here, I'd point you at the original. What follows is the four-layer pattern they describe and what we've learned running voice agents on a different provider.

1. Edge proximity

OpenAI runs inference on edge nodes close to the user. This is standard CDN topology applied to model serving. The implication for anyone building voice: the hosting provider matters more than the model. We migrated Goldie, the voice agent for Golden Plate, between providers without changing the model behind it, and the latency profile changed materially. The model was held constant. The infrastructure underneath wasn't.

2. Streaming the whole pipeline

Audio in, tokens out, audio back — every layer streams. No "wait for the full sentence" blocking. This is the single biggest gap between a voice agent that feels natural and one that doesn't. A non-streaming setup adds enough dead air that users perceive the agent as slow regardless of the model's actual response time.
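To make the difference concrete, here is a minimal sketch of a streaming pipeline next to a blocking one. The function names (`generate_tokens`, `synthesize`) and the three-words-per-chunk setting are hypothetical stand-ins for a streaming LLM and a streaming TTS layer, not any vendor's API. The log only records the order of events, which is the part that matters.

```python
from typing import Iterator, List


def generate_tokens(reply: str, log: List[str]) -> Iterator[str]:
    """Stand-in for a streaming LLM: yields one word at a time."""
    for word in reply.split():
        log.append(f"token:{word}")
        yield word


def synthesize(tokens: Iterator[str], log: List[str],
               words_per_chunk: int = 3) -> Iterator[str]:
    """Stand-in for streaming TTS: emits audio as soon as a few words exist."""
    buffer: List[str] = []
    for token in tokens:
        buffer.append(token)
        if len(buffer) == words_per_chunk:
            chunk = " ".join(buffer)
            log.append(f"audio:{chunk}")
            yield chunk
            buffer = []
    if buffer:  # flush whatever is left at the end of the reply
        chunk = " ".join(buffer)
        log.append(f"audio:{chunk}")
        yield chunk


def run_streaming(reply: str) -> List[str]:
    """Every stage pulls from the previous one lazily."""
    log: List[str] = []
    for _chunk in synthesize(generate_tokens(reply, log), log):
        pass  # a real agent would write each chunk to the audio output here
    return log


def run_blocking(reply: str) -> List[str]:
    """Waits for the full sentence before any audio is produced."""
    log: List[str] = []
    tokens = list(generate_tokens(reply, log))
    for _chunk in synthesize(iter(tokens), log):
        pass
    return log
```

With a seven-word reply, the streaming run logs its first audio chunk after three tokens, while the blocking run logs it only after all seven. Scale that gap up to a real sentence and real synthesis time and you get the dead air described above.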

If you take one thing from OpenAI's post and apply it tomorrow, this is the thing.

3. Audio chunking

OpenAI processes short audio frames rather than full utterances. The model maintains conversation state across chunks. The practical effect: the system can begin formulating a response while the user is still talking, which is what makes interruption handling tractable.

Chunk size is a tuning parameter. Smaller chunks mean lower latency and a higher chance of cutoff artifacts. Larger chunks mean cleaner audio and slower response. We tune by use case. A sales-style agent on a phone line wants tighter chunks. A support agent in a noisy environment can tolerate looser ones.
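The tradeoff is easier to reason about with the arithmetic written out. This sketch assumes 16 kHz mono 16-bit PCM, which is a common telephony-adjacent format but an assumption here, not something taken from the OpenAI post. The helper name `chunk_stats` is made up for illustration.

```python
from typing import Dict


def chunk_stats(chunk_ms: int, sample_rate_hz: int = 16_000,
                bytes_per_sample: int = 2) -> Dict[str, float]:
    """Size and rate of fixed-duration audio chunks (mono PCM assumed)."""
    samples = sample_rate_hz * chunk_ms // 1000
    return {
        "samples_per_chunk": samples,
        "bytes_per_chunk": samples * bytes_per_sample,
        "chunks_per_second": 1000 / chunk_ms,
        # A chunk cannot be sent until it is full, so its duration is a
        # floor on the buffering delay it adds.
        "min_buffering_delay_ms": chunk_ms,
    }


tight = chunk_stats(20)    # 320 samples, 640 bytes, 50 chunks per second
loose = chunk_stats(100)   # 1600 samples, 3200 bytes, 10 chunks per second
```

The tight setting sends five times as many packets and gives any downstream detector less audio per decision. The loose setting adds at least 100 ms before anything can react.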

4. Model co-design

GPT-4o was trained for streaming voice from the start, not adapted from a text model. That choice shows up in the audio output quality and in how the model behaves around interruptions. It also explains why "use any LLM behind a TTS layer" doesn't reach the same naturalness.

What the post doesn't cover

Three things we hit constantly that the OpenAI write-up doesn't address.

The first is variance, not median. A voice agent that responds in 300ms most of the time but stalls to a full second every fifth call performs worse than a consistent 400ms agent. Users tolerate predictable latency. They don't tolerate random pauses. Measure the 95th percentile, not the median, and look at the spread between them.
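Here is that comparison as a small sketch. It uses a nearest-rank percentile (one of several common definitions) and synthetic numbers that mirror the two agents described above. None of this is measured data.

```python
import math
from typing import Dict, Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms: Sequence[float]) -> Dict[str, float]:
    p50 = percentile(latencies_ms, 50)
    p95 = percentile(latencies_ms, 95)
    return {"p50": p50, "p95": p95, "spread": p95 - p50}


# Synthetic samples: 300 ms with a one-second stall every fifth call,
# versus a steady 400 ms.
spiky = [300, 300, 300, 300, 1000] * 20
steady = [400] * 100

print(summarize(spiky))   # {'p50': 300, 'p95': 1000, 'spread': 700}
print(summarize(steady))  # {'p50': 400, 'p95': 400, 'spread': 0}
```

On the median alone the spiky agent looks faster. The 95th percentile and the spread tell the story the caller actually experiences.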

The second is interruption handling. The post mentions it but doesn't dwell on it. In production, this is the hardest part. The threshold for what counts as a user interruption versus background noise has to be tuned per environment. Too sensitive and a passing siren cuts off the agent. Not sensitive enough and the agent talks over the customer for a full second after they've started speaking. We landed on roughly half a second of detected user speech triggering an interrupt, but the right value depends on the call context.
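The half-second rule can be sketched as a counter over consecutive voiced frames. Everything here is illustrative: the `BargeInDetector` name, the 20 ms frame size, and the energy threshold are assumptions, and a real deployment would likely use a trained voice-activity detector rather than raw energy, since energy alone cannot tell a siren from a voice.

```python
from dataclasses import dataclass, field


@dataclass
class BargeInDetector:
    frame_ms: int = 20               # assumed frame duration
    min_speech_ms: int = 500         # sustained speech needed to interrupt
    energy_threshold: float = 0.02   # hypothetical; tune per environment
    _voiced_ms: int = field(default=0, init=False, repr=False)

    def update(self, frame_energy: float) -> bool:
        """Feed one frame's energy; True once speech has been sustained
        for at least min_speech_ms."""
        if frame_energy >= self.energy_threshold:
            self._voiced_ms += self.frame_ms
        else:
            # Any quiet frame resets the run, so short bursts never trigger.
            self._voiced_ms = 0
        return self._voiced_ms >= self.min_speech_ms


detector = BargeInDetector()
decisions = [detector.update(0.1) for _ in range(25)]
# The first 24 loud frames (480 ms) do not interrupt; the 25th does.
```

Lowering `min_speech_ms` makes the agent yield faster and also makes it easier for noise to cut it off, which is the same tradeoff described above in one parameter.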

The third is cost. The Realtime API is priced for first-party convenience, not for high-volume deployment. For a voice agent doing thousands of calls a day, the per-minute math changes which provider makes sense.
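The per-minute math is plain multiplication, but it is worth writing down because volume makes small rate differences large. The rates below are placeholders chosen only to show the shape of the calculation. They are not real prices for the Realtime API or any other provider.

```python
def monthly_cost_usd(calls_per_day: int, avg_call_minutes: float,
                     price_per_minute_usd: float, days: int = 30) -> float:
    """Total monthly spend for a voice agent billed per minute."""
    return calls_per_day * avg_call_minutes * price_per_minute_usd * days


# Placeholder inputs, not quotes from any vendor.
CALLS_PER_DAY = 2000
AVG_CALL_MINUTES = 3.0

option_a = monthly_cost_usd(CALLS_PER_DAY, AVG_CALL_MINUTES, 0.10)
option_b = monthly_cost_usd(CALLS_PER_DAY, AVG_CALL_MINUTES, 0.04)
difference = option_a - option_b
```

Plug in your own call volume and the current published rates before drawing any conclusion from it.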

How we measure it

Every voice agent we deploy logs end-to-end latency from user-speech-end to agent-speech-start, plus a regional breakdown and a count of calls with pauses over a second. The deployment also runs a synthetic test call on a fixed cadence — a handful of canned conversations replayed against the live agent, with latency distributions logged. That synthetic test caught a regional outage on our voice provider weeks ago, before any customer call had actually failed.
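As a rough illustration of what such a log can look like, here is a sketch of a per-turn record and a small report over it. The `Turn` fields, the region labels, and the `report` helper are hypothetical and are not a description of any particular logging stack.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass(frozen=True)
class Turn:
    call_id: str
    region: str
    user_speech_end_ms: int      # timestamp when the user stopped talking
    agent_speech_start_ms: int   # timestamp when agent audio began

    @property
    def latency_ms(self) -> int:
        return self.agent_speech_start_ms - self.user_speech_end_ms


def report(turns: Iterable[Turn], pause_threshold_ms: int = 1000) -> Dict:
    """Group latencies by region and count calls with any long pause."""
    by_region: Dict[str, List[int]] = defaultdict(list)
    slow_calls: Set[str] = set()
    for turn in turns:
        by_region[turn.region].append(turn.latency_ms)
        if turn.latency_ms > pause_threshold_ms:
            slow_calls.add(turn.call_id)  # count each call once
    return {
        "latencies_by_region": dict(by_region),
        "calls_with_long_pause": len(slow_calls),
    }


turns = [
    Turn("call-1", "region-a", 10_000, 10_350),
    Turn("call-1", "region-a", 20_000, 21_200),
    Turn("call-2", "region-b", 5_000, 5_420),
]
summary = report(turns)
```

The same records work for synthetic test calls: replay the canned conversations, emit the same `Turn` rows, and compare the distributions against live traffic.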

If you're running a voice agent and you don't have the latency distribution in front of you, that's the place to start. Vendor dashboards report medians. Medians lie. Get the 95th percentile and the variance and look at them weekly.

The OpenAI post is a useful artifact because it makes the underlying topology legible. Most voice latency problems are not model problems. They're infrastructure problems wearing a model costume, and you fix them with edge placement, chunk tuning, and aggressive streaming, not with a smarter prompt.
